Hi,
I'm currently testing out MILo and its parameters. I'm using the RTX 50 Dockerfile from #20 and the RTX 50 repo. I started the following training on the Truck demo data on different GPUs and always get a cudaMalloc exception.
Training command
python /workspace/MILo/milo/train.py -s /data/input/Truck -m /data/output/Truck --imp_metric outdoor --rasterizer radegs --eval --mesh_config default --decoupled_appearance --log_interval 200 --save_iterations 2000 4000 6000 8000 10000 12000 14000 16000 18000 --checkpoint_iterations 2000 4000 6000 8000 10000 12000 14000 16000 18000 --data_device cpu --config_path /workspace/MILo/milo/configs/fast
Tested GPUs
- Nvidia RTX 5060 Ti, 16 GB VRAM
- Nvidia RTX 5070 Ti, 16 GB VRAM
- Nvidia RTX 5090, 32 GB VRAM
- Nvidia H100, 80 GB VRAM
Exception
The crash does not occur at the same iteration on every GPU, so I don't think it's a problem with the demo data.
Training progress: 71%|██████████████████████████████████████████████████████████████████████████████████████████▉ | 12790/18000 [31:50<15:18, 5.67it/s, Loss=0.0630130, DNLoss=0.0062229, MDLoss=0.0019296, MNLoss=0.0061558, OccLoss=0.0000477, OccLabLoss=0.0011094, N_Gauss=319469]
[INFO] Resetting occupancy labels at iteration 12800. [03/02 11:15:44]
Computing occupancy from mesh: 0%| | 0/219 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/MILo/milo/train.py", line 652, in <module>
training(
File "/workspace/MILo/milo/train.py", line 288, in training
mesh_regularization_pkg = compute_mesh_regularization(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/regularization/regularizer/mesh.py", line 535, in compute_mesh_regularization
voronoi_occupancy_labels, _ = evaluate_mesh_occupancy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/regularization/sdf/depth_fusion.py", line 541, in evaluate_mesh_occupancy
render_pkg = mesh_renderer(
^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 396, in forward
fragments, rast_out, pos = self.rasterizer(mesh, cameras, cam_idx, return_rast_out=True, return_positions=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 341, in forward
nvdiff_rast_out = nvdiff_rasterization(
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 127, in nvdiff_rasterization
rast_chunk, _ = dr.rasterize(
^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/nvdiffrast/torch/ops.py", line 135, in rasterize
return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/nvdiffrast/torch/ops.py", line 78, in forward
out, out_db = _nvdiffrast_c.rasterize_fwd_cuda(raster_ctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Cuda error: 2[cudaMalloc(&m_gpuPtr, bytes);]
Training progress: 71%|██████████████████████████████████████████████████████████████████████████████████████████▉ | 12790/18000 [31:52<12:59, 6.69it/s, Loss=0.0630130, DNLoss=0.0062229, MDLoss=0.0019296, MNLoss=0.0061558, OccLoss=0.0000477, OccLabLoss=0.0011094, N_Gauss=319469]
Any ideas what's going wrong? Maybe some misleading parameters?
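CUDA error 2 is cudaErrorMemoryAllocation, so my guess is that the occupancy pass at the label reset temporarily needs more device memory than is left. To verify that, I'm going to log the free memory right before the mesh occupancy evaluation. A minimal sketch of the helper I'd call around that spot in train.py (where exactly to hook it in is just my assumption):

import torch

def log_cuda_memory(tag: str) -> None:
    """Print driver-level free memory and PyTorch allocator stats for the current device."""
    free_b, total_b = torch.cuda.mem_get_info()    # free/total bytes as seen by the CUDA driver
    allocated_b = torch.cuda.memory_allocated()    # bytes currently held by live tensors
    reserved_b = torch.cuda.memory_reserved()      # bytes cached by PyTorch's allocator
    gib = 1024 ** 3
    print(f"[{tag}] free {free_b / gib:.2f} / {total_b / gib:.2f} GiB | "
          f"torch allocated {allocated_b / gib:.2f} GiB, reserved {reserved_b / gib:.2f} GiB")

# e.g. log_cuda_memory("before occupancy reset") right before the call that crashes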
If I use mesh_config=verylowres and set sampling_factor to 0.1 or 0.2, it works.
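That makes me think it is pure memory pressure rather than a problem with the data. One detail from the traceback: the failing cudaMalloc happens inside nvdiffrast itself, i.e. outside PyTorch's caching allocator, so memory that PyTorch keeps cached but unused is invisible to it. Would it make sense to release the cache before (or retry after) the occupancy pass? A hedged sketch of a wrapper I could put around the compute_mesh_regularization call in train.py (the hook point and whether a retry is safe here are assumptions on my part):

import torch

def with_cuda_cache_retry(fn, *args, **kwargs):
    """Run fn; on a CUDA allocation failure, hand PyTorch's cached blocks back to the
    driver with empty_cache() and retry once. A failed cudaMalloc is normally not a
    sticky error, so a single retry seems worth trying."""
    try:
        return fn(*args, **kwargs)
    except RuntimeError as err:
        if "cuda" not in str(err).lower():  # only retry CUDA allocation-style errors
            raise
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
        return fn(*args, **kwargs)

# e.g. mesh_regularization_pkg = with_cuda_cache_retry(compute_mesh_regularization, ...)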