[BUG] Crash when using multi GPU #995

@kagrawa2

Describe the bug

Running FeTS-Challenge Task 1 on a multi-GPU instance crashes in the send_model_to_device function.

To Reproduce

Steps to reproduce the behavior:

  1. Use an Azure instance with multiple GPUs (nvidia-smi output below).
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000001:00:00.0 Off |                  Off |
| N/A   40C    P8              13W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000002:00:00.0 Off |                  Off |
| N/A   46C    P8              12W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000003:00:00.0 Off |                  Off |
| N/A   46C    P0              29W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       Off | 00000004:00:00.0 Off |                  Off |
| N/A   47C    P0              28W /  70W |      2MiB / 16384MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. Run export CUDA_VISIBLE_DEVICES=0,1,2,3 to expose all four GPUs.

  3. Run FeTS-Challenge from FeTS-AI/Challenge#204 ("Migrating TaskRunner based FeTS Task_1 Challenge to Workflow API").

  4. The run crashes in send_model_to_device:

`
  File "/home/azureuser/Work/Challenge/Task_1/fets_challenge/fets_challenge_model.py", line 110, in validate
    epoch_valid_loss, epoch_valid_metric = validate_network(
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/forward_pass.py", line 284, in validate_network
    result = step(model, image, label, params, train=True)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/step.py", line 78, in step
    output = model(image)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
`
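
The RuntimeError is raised by the check that torch.nn.DataParallel performs at the start of forward: every parameter and buffer of the wrapped module must already be on device_ids[0]. To see which tensors were left behind, a small diagnostic along these lines may help. This is only a sketch: tensors_off_device is a hypothetical helper name (it is not part of GaNDLF or PyTorch), and it relies only on the named_parameters() and named_buffers() methods that any torch.nn.Module provides.

```python
from itertools import chain


def tensors_off_device(module, expected):
    """List (kind, name, device) for each parameter or buffer not on `expected`.

    `module` is assumed to be a torch.nn.Module (anything exposing
    named_parameters() and named_buffers() works). `expected` should be a
    fully indexed device such as "cuda:0", because devices are compared by
    their string form and "cuda" is not equal to "cuda:0".
    """
    named = chain(
        (("parameter", name, t) for name, t in module.named_parameters()),
        (("buffer", name, t) for name, t in module.named_buffers()),
    )
    return [
        (kind, name, str(t.device))
        for kind, name, t in named
        if str(t.device) != str(expected)
    ]
```

For a model already wrapped in DataParallel, the underlying network is available as model.module, so calling tensors_off_device(model.module, "cuda:0") just before the validate_network call would list the tensors that are still on the CPU. An empty list would mean the placement is correct at that point.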

Expected behavior

FeTS-Challenge Task 1 should run on a multi-GPU instance without crashing.
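
For reference, the usual plain-PyTorch pattern that satisfies the condition named in the RuntimeError is to move the model to the first visible GPU before wrapping it in torch.nn.DataParallel. The sketch below illustrates that pattern only. wrap_for_multi_gpu is a hypothetical name, the nn.Linear model is a stand-in, and nothing here is taken from GaNDLF's send_model_to_device, so it should be read as an assumption about what the expected behavior looks like rather than as the fix.

```python
import torch
import torch.nn as nn


def wrap_for_multi_gpu(model: nn.Module) -> nn.Module:
    """Hypothetical helper: place `model` on cuda:0, then wrap it if possible."""
    n_gpus = torch.cuda.device_count()  # honours CUDA_VISIBLE_DEVICES
    if n_gpus == 0:
        return model  # CPU-only machine: leave the model untouched
    device_ids = list(range(n_gpus))
    # DataParallel expects parameters and buffers on device_ids[0].
    model = model.to(f"cuda:{device_ids[0]}")
    if n_gpus > 1:
        model = nn.DataParallel(model, device_ids=device_ids)
    return model


model = wrap_for_multi_gpu(nn.Linear(4, 2))  # stand-in for the real network
device = next(model.parameters()).device
out = model(torch.randn(8, 4).to(device))
print(tuple(out.shape))
```

With CUDA_VISIBLE_DEVICES=0,1,2,3 as in the reproduction steps, device_count() would report four devices, and the batch of 8 would be split across them by DataParallel.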

Environment information

GANDLF version: 0.1.0
Git hash: 4d614fe
Platform: Linux-6.11.0-1012-azure-x86_64-with-glibc2.39
Machine: x86_64
Processor: x86_64
Architecture: 64bit ELF
Python environment:
Version: 3.10.1
Implementation: CPython
Compiler: GCC 13.3.0
Build: main Apr  7 2025 07:01:16
