Description
Describe the bug
When trying to run FeTS-Challenge Task 1 on a multi-GPU instance, the run crashes in the `send_model_to_device` function.
To Reproduce
Steps to reproduce the behavior:
- Azure instance with multi-GPU support (`nvidia-smi` output):
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000001:00:00.0 Off | Off |
| N/A 40C P8 13W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000002:00:00.0 Off | Off |
| N/A 46C P8 12W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000003:00:00.0 Off | Off |
| N/A 46C P0 29W / 70W | 2MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000004:00:00.0 Off | Off |
| N/A 47C P0 28W / 70W | 2MiB / 16384MiB | 7% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
- `export CUDA_VISIBLE_DEVICES=0,1,2,3`
- Run FeTS-Challenge from "Migrating TaskRunner based FeTS Task_1 Challenge to Workflow API" (FeTS-AI/Challenge#204).
- It crashes in `send_model_to_device` (traceback below).
` File "/home/azureuser/Work/Challenge/Task_1/fets_challenge/fets_challenge_model.py", line 110, in validate
epoch_valid_loss, epoch_valid_metric = validate_network(
File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/forward_pass.py", line 284, in validate_network
result = step(model, image, label, params, train=True)
File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/step.py", line 78, in step
output = model(image)
File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
```
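For reference, the same RuntimeError can be reproduced outside GANDLF with a few lines of plain PyTorch. This is only a sketch of my understanding of the failure mode (the module's parameters are still on the CPU when `nn.DataParallel.forward` runs); it is not the actual code path inside `send_model_to_device`:

```python
import torch
import torch.nn as nn

# A toy module whose parameters stay on the CPU.
model = nn.Linear(4, 2)

# DataParallel expects the wrapped module to already live on device_ids[0].
dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

x = torch.randn(8, 4, device="cuda:0")
dp(x)  # RuntimeError: module must have its parameters and buffers on device cuda:0 ...
```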
Expected behavior
The workload should run on multiple GPUs without crashing.
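A possible fix, sketched below under the assumption that `send_model_to_device` wraps the model in `nn.DataParallel`: move the model to the first visible GPU before wrapping it, which is what the error message demands. The function name and signature here are illustrative only, not GANDLF's actual API.

```python
import torch
import torch.nn as nn

def send_model_to_device(model, device="cuda", device_ids=None):
    """Illustrative sketch only (assumed signature, not GANDLF's actual code):
    parameters must be on device_ids[0] *before* nn.DataParallel is applied."""
    if device == "cuda" and torch.cuda.is_available():
        ids = device_ids or list(range(torch.cuda.device_count()))
        model = model.to(f"cuda:{ids[0]}")  # satisfy DataParallel's device requirement
        if len(ids) > 1:
            model = nn.DataParallel(model, device_ids=ids)
    else:
        model = model.to("cpu")
    return model
```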
Environment information
GANDLF version: 0.1.0
Git hash: 4d614fe
Platform: Linux-6.11.0-1012-azure-x86_64-with-glibc2.39
Machine: x86_64
Processor: x86_64
Architecture: 64bit ELF
Python environment:
Version: 3.10.1
Implementation: CPython
Compiler: GCC 13.3.0
Build: main Apr 7 2025 07:01:16