Description
When running the QuickStart example `run_ppo_hotpotqa.sh`, I get the error below. Could you help me figure out what went wrong?
```
2025-09-10 16:35:45,628 INFO worker.py:1942 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(raylet) The node with node id: b01282dc336e43db3b88340e4a2ba6076b51b0e0cf71b4a594b446f1 and address: 10.39.2.46 and node name: 10.39.2.46 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.
Error executing job with overrides: ['data.train_files=[data/hotpotqa/train.parquet]', 'data.val_files=[data/hotpotqa/validation.parquet]', 'data.train_batch_size=128', 'data.max_prompt_length=8192', 'data.max_response_length=8192', 'data.max_response_length_single_turn=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.stop_token_ids=[151658]', 'actor_rollout_ref.rollout.stop=[]', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'critic.optim.lr=1e-5', 'critic.model.use_remove_padding=True', 'critic.model.path=Qwen/Qwen2.5-1.5B-Instruct', 'critic.model.enable_gradient_checkpointing=True', 'critic.ppo_micro_batch_size_per_gpu=2', 'critic.model.fsdp_config.param_offload=False', 'critic.model.fsdp_config.optimizer_offload=False', 'algorithm.adv_estimator=gae', 'algorithm.kl_ctrl.kl_coef=0.001', 'algorithm.use_kl_in_reward=True', 'trainer.critic_warmup=5', 'trainer.logger=[console,wandb]', 'trainer.project_name=hotpotqa', 'trainer.experiment_name=ppo-qwen2.5-1.5b-instruct', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=10', 'trainer.total_epochs=1', 'trainer.val_before_train=True', 'trainer.log_val_generations=0', 'tool.max_turns=5', 'tool.tools=[search]', 'tool.max_tool_response_length=2048']
(raylet) [2025-09-10 16:35:54,891 E 57558 57666] (raylet) agent_manager.cc:86: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
Traceback (most recent call last):
File "/public/home/shenninggroup/yjyan/Agent-R1/agent_r1/src/main_agent.py", line 67, in main
run_agent(config)
File "/public/home/shenninggroup/yjyan/Agent-R1/agent_r1/src/main_agent.py", line 79, in run_agent
ray.get(runner.run.remote(config))
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2882, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/public/home/shenninggroup/yjyan/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TaskRunner
actor_id: e8f2b574acb68373e54955f201000000
namespace: bc077af0-6205-4b63-a529-5919c27bcae2
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.39.2.46 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
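For reference, these are the checks the raylet message itself suggests, which I plan to run next; a minimal sketch, assuming a single-node setup with the default Ray session directory and that the script is launched from the repo root:

```bash
# Check the installed grpcio version; the raylet error says an incompatible
# grpcio can make the runtime_env_agent segfault.
pip freeze | grep grpcio

# Inspect the agent logs the error message points to
# (default Ray session directory on the local node).
cat /tmp/ray/session_latest/logs/runtime_env_agent.log
cat /tmp/ray/session_latest/logs/dashboard_agent.log

# Re-run with the full Hydra stack trace, as the last line of the log suggests.
HYDRA_FULL_ERROR=1 bash run_ppo_hotpotqa.sh
```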