Description
We ran into an issue with the calc-x training example: training crashes with the error below. Do you have any suggestions for how to address it?
ERROR Algorithm bundle crashed; signaling stop event client_server.py:155
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/agentlightning/execution/client_server.py",
line 144, in _execute_algorithm
await algorithm(wrapper_store, stop_evt)
File "/usr/local/lib/python3.12/dist-packages/agentlightning/trainer/trainer.py", line
527, in _algorithm_bundle
algorithm.run(
File
"/usr/local/lib/python3.12/dist-packages/agentlightning/algorithm/verl/interface.py", line
184, in run
run_ppo(
File "/usr/local/lib/python3.12/dist-packages/agentlightning/verl/entrypoint.py", line
78, in run_ppo
ray.get(
File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22,
in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line
104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2967, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1015, in
get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=3508839,
ip=10.155.71.188, actor_id=d76b31cf33b73c5bf275b36607000000,
repr=<agentlightning.verl.entrypoint.TaskRunner object at 0x7f805a9d5b50>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/agentlightning/verl/entrypoint.py", line
244, in run
trainer.fit()
File "/usr/local/lib/python3.12/dist-packages/agentlightning/verl/trainer.py", line 507,
in fit
metrics = self._train_step(batch_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/agentlightning/verl/trainer.py", line 369,
in _train_step
metrics.update(compute_data_metrics(batch=batch, use_critic=self.use_critic,
suffix="_before_processing"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/agentlightning/verl/trainer.py", line 125,
in compute_data_metrics
"critic/advantages/max" + suffix: torch.max(valid_adv).detach().item(),
^^^^^^^^^^^^^^^^^^^^
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify
the reduction dim with the 'dim' argument.
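
For context, the final RuntimeError says that `torch.max` was called on a tensor with zero elements, so `valid_adv` was empty when `compute_data_metrics` ran. The sketch below reproduces that condition in isolation and shows one possible guard. It assumes `valid_adv` is produced by masking the advantages with a response mask. The helper `safe_max` and the tensors here are hypothetical illustrations, not part of agentlightning or verl, and a guard like this would only hide the symptom: why no valid response tokens reached the batch would still need investigating.

```python
import math

import torch


def safe_max(values: torch.Tensor, default: float = float("nan")) -> float:
    """Return the max of `values` as a float, or `default` if it is empty.

    Hypothetical helper for illustration; not an agentlightning or verl API.
    """
    if values.numel() == 0:
        # torch.max() without `dim` raises RuntimeError on an empty tensor.
        return default
    return torch.max(values).detach().item()


# Assumed setup: a mask that selects no tokens yields an empty tensor.
advantages = torch.tensor([[0.5, -1.0, 2.0]])
response_mask = torch.zeros_like(advantages, dtype=torch.bool)
valid_adv = torch.masked_select(advantages, response_mask)

try:
    torch.max(valid_adv)
    raised = False
except RuntimeError:
    # Same failure mode as the last frame of the traceback.
    raised = True

empty_result = safe_max(valid_adv)      # falls back to the default (NaN)
nonempty_result = safe_max(advantages)  # ordinary max over all elements
```

If the guard reports NaN for every step, that would suggest the batches contain no valid response tokens at all, which points at the rollout or data side rather than the metrics code.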