Bug report: training crashes when max_num_targets != -1 #1835

@sbAsma

Description

What happened?

While trying to train in autoencoder mode with CAMS analysis, I set max_num_targets to 5 and got this error:

0: Traceback (most recent call last):
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/src/weathergen/run_train.py", line 198, in train_with_args
0:     trainer.run(cf, devices)
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/src/weathergen/train/trainer.py", line 443, in run
0:     self.train(mini_epoch)
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/src/weathergen/train/trainer.py", line 595, in train
0:     for bidx, batch in enumerate(dataset_iter):
0:                        ^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
0:     data = self._next_data()
0:            ^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
0:     return self._process_data(data)
0:            ^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
0:     data.reraise()
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/_utils.py", line 733, in reraise
0:     raise exception
0: TypeError: Caught TypeError in DataLoader worker process 0.
0: Original Traceback (most recent call last):
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
0:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
0:            ^^^^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 42, in fetch
0:     data = next(self.dataset_iter)
0:            ^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/src/weathergen/datasets/multi_stream_data_sampler.py", line 425, in __iter__
0:     (tt_cells, tc, tt_c, tt_t) = self.tokenizer.batchify_target(
0:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/p/project1/hclimrep/semcheddine1/WeatherGenerator/src/weathergen/datasets/tokenizer_masking.py", line 172, in batchify_target
0:     tt_lin = torch.cat(target_tokens)
0:              ^^^^^^^^^^^^^^^^^^^^^^^^
0: TypeError: expected Tensor as element 0 in argument 0, but got list
0: 
0: [6] > /p/project1/hclimrep/semcheddine1/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/_utils.py(733)reraise()

Please note that this happened on an earlier version of develop, so the bug may already have been fixed; I can't test it on the newest develop right now.
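
For reference, the final TypeError in the traceback is what torch.cat raises when the first element of its input is a Python list instead of a Tensor. The snippet below is only a standalone sketch of that failure mode, not the WeatherGenerator code: the contents of `nested` are made up, `flatten_tokens` is a hypothetical helper, and it is an unverified assumption that target_tokens becomes a list of lists when max_num_targets != -1.

```python
import torch


def flatten_tokens(target_tokens):
    """Hypothetical helper (not part of WeatherGenerator): flatten one level
    of list nesting so that torch.cat only ever sees Tensors."""
    flat = []
    for item in target_tokens:
        if isinstance(item, torch.Tensor):
            flat.append(item)
        else:
            # item is assumed to be a list/tuple of Tensors
            flat.extend(item)
    return flat


# Made-up stand-in for target_tokens: a list of lists of Tensors.
nested = [[torch.zeros(2, 3)], [torch.ones(1, 3)]]

# Passing the nested list straight to torch.cat raises a TypeError.
message = None
try:
    torch.cat(nested)
except TypeError as err:
    message = str(err)
print(message)

# After flattening one level, concatenation along dim 0 works.
tt_lin = torch.cat(flatten_tokens(nested))
print(tt_lin.shape)
```

If the assumption holds, checking the type of target_tokens[0] just before the torch.cat call in batchify_target should confirm whether the list is nested.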

What are the steps to reproduce the bug?

No response

Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.

No response

Labels: bug