Open
Description
I am running SpeedPPI on GPU nodes, but some of the jobs run out of memory even with 250 GB of memory. The error says RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes, which means it was requesting about 16.5 GB (15.37 GiB). Even if I multiply that by 10, once for each recycle, that comes to about 165 GB, which still leaves roughly 85 GB spare. So I don't know what's happening!
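To make the estimate above easy to check, here is a minimal Python sketch of the same arithmetic. The byte count and the 250 GB figure come from this report; the factor of 10 is only my own rough guess at a per-recycle multiplier, and the variable names are just for illustration.

```python
# Figures quoted in this report.
requested_bytes = 16_508_718_128  # from the RESOURCE_EXHAUSTED message
job_memory_gb = 250               # memory given to the job
recycles = 10                     # assumed multiplier, not a measured value

requested_gb = requested_bytes / 1e9     # decimal gigabytes
requested_gib = requested_bytes / 2**30  # binary gibibytes, the unit XLA prints

worst_case_gb = requested_gb * recycles
headroom_gb = job_memory_gb - worst_case_gb

print(f"single allocation: {requested_gb:.2f} GB ({requested_gib:.2f} GiB)")
print(f"x{recycles}: {worst_case_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The GiB value lines up with the 15.37GiB that the allocator warning prints in the log.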
E0000 00:00:1727970189.766616 2640766 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
I0000 00:00:1727970528.816756 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727970529.000277 2640766 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-10-03 18:50:37.111627: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 4.01GiB (4304358978 bytes) by rematerialization; only reduced to 13.96GiB (14990920828 bytes), down from 13.96GiB (14990923372 bytes) originally
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op
2024-10-03 18:51:12.979733: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ***********************************************_____________________________________________________
E1003 18:51:12.980048 2640766 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
Traceback (most recent call last):
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 258, in <module>
    main(num_ensemble=1,
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 222, in main
    prediction_result = model_runner.predict(processed_feature_dict)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
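The allocator named in the warning is GPU_0_bfc, and the device line reports 1491 MB for GPU:0, so it may be more useful to compare the request against that figure than against the 250 GB. A small sketch that pulls both numbers out of the log text is below; the regular expressions are written only against the two lines shown here and are an assumption about the format, not a general parser for these logs.

```python
import re

# Two lines taken from the log above.
log = """\
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op
"""

# Memory the framework reports for the GPU device it created.
device = re.search(r"Created device \S+ with (\d+) MB memory", log)

# Allocator name and rounded request size from the out-of-memory warning.
alloc = re.search(
    r"Allocator \((\w+)\) ran out of memory trying to allocate "
    r"[\d.]+GiB \(rounded to (\d+)\)",
    log,
)

device_mb = int(device.group(1))
allocator_name = alloc.group(1)
requested_mib = int(alloc.group(2)) / 2**20

print(f"allocator: {allocator_name}")
print(f"device memory reported: {device_mb} MB")
print(f"requested: {requested_mib:.0f} MiB")
print(f"request / device memory: {requested_mib / device_mb:.1f}x")
```

If the ratio printed on the last line is well above 1, the single allocation alone is larger than what the log reports as available on the GPU, whatever the job's total memory is.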