How to determine the memory requirement?

I am running SpeedPPI on GPU nodes, but some jobs run out of memory even with 250 GB of memory allocated. The error is `RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.`, i.e. a single request for about 16.5 GB. Even if I multiply that by 10 to allow for each recycle, that is about 165 GB, which still leaves roughly 85 GB spare. So I don't understand what is happening. The full log:
```
E0000 00:00:1727970189.766616 2640766 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
I0000 00:00:1727970528.816756 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727970529.000277 2640766 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-10-03 18:50:37.111627: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 4.01GiB (4304358978 bytes) by rematerialization; only reduced to 13.96GiB (14990920828 bytes), down from 13.96GiB (14990923372 bytes) originally
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op 
2024-10-03 18:51:12.979733: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ***********************************************_____________________________________________________
E1003 18:51:12.980048 2640766 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
Traceback (most recent call last):
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 258, in <module>
    main(num_ensemble=1,
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 222, in main
    prediction_result = model_runner.predict(processed_feature_dict)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

```
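For reference, the arithmetic above can be reproduced with a short script. This is only a sketch of my own estimate: the factor of 10 per recycle is a guess, not a measured value. It also compares the failed request with the `1491 MB` figure from the `Created device` line, because the failing allocator is named `GPU_0_bfc`. Treating that figure as the relevant limit (rather than the 250 GB of host memory), and reading the log's "MB" as MiB, are both assumptions.

```python
# Sanity check of the numbers in the log above.
# All constants are copied from the post/log; RECYCLE_FACTOR is a guess.

REQUESTED_BYTES = 16_508_718_128  # from the RESOURCE_EXHAUSTED line
DEVICE_MB = 1491                  # from "Created device ... with 1491 MB memory"
HOST_GB = 250                     # host memory allocated to the job
RECYCLE_FACTOR = 10               # assumed multiplier per recycle (not measured)


def bytes_to_gb(n: int) -> float:
    """Convert bytes to decimal gigabytes (10^9 bytes)."""
    return n / 1e9


def bytes_to_gib(n: int) -> float:
    """Convert bytes to binary gibibytes (2^30 bytes), as the XLA log reports."""
    return n / 2**30


requested_gb = bytes_to_gb(REQUESTED_BYTES)
requested_gib = bytes_to_gib(REQUESTED_BYTES)

# Pessimistic host-side estimate: one such buffer per assumed recycle step.
worst_case_gb = requested_gb * RECYCLE_FACTOR
host_headroom_gb = HOST_GB - worst_case_gb

# Assumption: the "MB" in the device line means MiB.
device_gib = DEVICE_MB / 1024
fits_on_device = requested_gib <= device_gib

print(f"single request:      {requested_gb:.2f} GB ({requested_gib:.2f} GiB)")
print(f"x{RECYCLE_FACTOR} estimate:        {worst_case_gb:.1f} GB")
print(f"host headroom:       {host_headroom_gb:.1f} GB of {HOST_GB} GB")
print(f"device memory (log): {device_gib:.2f} GiB")
print(f"request fits on device: {fits_on_device}")
```

If the device figure is the one that matters, the single 15.37 GiB request is already far larger than the memory the log reports for the GPU, regardless of how much host memory the job has.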
