How to determine the memory requirement? #31

@Rohit-Satyam

I am running SpeedPPI on GPU nodes, but some jobs run out of memory even with 250 GB of memory allocated. The error says "RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.", which means it was requesting about 16.5 GB of memory. Even if I multiply that by 10 to account for each recycle, that comes to about 165 GB, which still leaves roughly 85 GB spare. So I don't know what is happening!
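The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the byte count is taken from the error message, the 250 GB figure and the factor of 10 are the assumptions stated above, and the helper name `bytes_to_units` is made up for illustration.

```python
# Byte count copied from the RESOURCE_EXHAUSTED message.
REQUESTED_BYTES = 16_508_718_128


def bytes_to_units(n_bytes: int) -> tuple[float, float]:
    """Return (decimal GB, binary GiB) for a byte count."""
    return n_bytes / 1e9, n_bytes / 2**30


gb, gib = bytes_to_units(REQUESTED_BYTES)
print(f"single allocation: {gb:.2f} GB = {gib:.2f} GiB")

# Assumption from the text above: one such allocation per recycle, 10 recycles,
# on a node with 250 GB of memory.
RECYCLES = 10
NODE_MEMORY_GB = 250
total_gb = gb * RECYCLES
headroom_gb = NODE_MEMORY_GB - total_gb
print(f"x{RECYCLES}: {total_gb:.1f} GB, headroom on {NODE_MEMORY_GB} GB: {headroom_gb:.1f} GB")
```

The binary figure, 15.37 GiB, is the same number the BFC allocator prints in the log below, so both messages appear to describe the same single allocation.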

E0000 00:00:1727970189.766616 2640766 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
I0000 00:00:1727970528.816756 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727970529.000277 2640766 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-10-03 18:50:37.111627: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 4.01GiB (4304358978 bytes) by rematerialization; only reduced to 13.96GiB (14990920828 bytes), down from 13.96GiB (14990923372 bytes) originally
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op 
2024-10-03 18:51:12.979733: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ***********************************************_____________________________________________________
E1003 18:51:12.980048 2640766 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
Traceback (most recent call last):
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 258, in <module>
    main(num_ensemble=1,
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 222, in main
    prediction_result = model_runner.predict(processed_feature_dict)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
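
The log above already contains the numbers needed to compare the failed request with what the device reported. The sketch below pulls them out with regular expressions. The patterns were written against this one log and are not a general parser, and the function name `summarize` is hypothetical. Note that the allocator named in the log is GPU_0_bfc, which suggests the failed allocation was made on the GPU rather than in the node's 250 GB of host memory; that reading is an interpretation of the log and has not been confirmed in this thread.

```python
import re

# Three lines copied from the log above.
LOG = """\
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op
E1003 18:51:12.980048 2640766 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
"""

# Patterns for the three facts of interest; written against this log only.
DEVICE_RE = re.compile(
    r"Created device \S*device:(GPU:\d+) with (\d+) MB memory:.*?name: ([^,]+)"
)
ALLOCATOR_RE = re.compile(r"Allocator \((\w+)\) ran out of memory")
OOM_RE = re.compile(r"Out of memory while trying to allocate (\d+) bytes")


def summarize(log: str) -> dict:
    """Extract the reported device memory, allocator name, and failed request size."""
    device = DEVICE_RE.search(log)
    allocator = ALLOCATOR_RE.search(log)
    oom = OOM_RE.search(log)
    return {
        "gpu_name": device.group(3) if device else None,
        "device_mb": int(device.group(2)) if device else None,
        "allocator": allocator.group(1) if allocator else None,
        "requested_bytes": int(oom.group(1)) if oom else None,
    }


summary = summarize(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")

if summary["requested_bytes"] and summary["device_mb"]:
    # Assumption: the "MB" in the device line is treated as MiB here.
    requested_mib = summary["requested_bytes"] / 2**20
    ratio = requested_mib / summary["device_mb"]
    print(f"request / reported device memory: {ratio:.1f}x")
```

Run against the full log file instead of the three copied lines, the same function returns the first match of each pattern, which is enough for a single failed job.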
