I recently ran several RL experiments in this environment and found that my policy model tends to output the wait action repeatedly. In the current setup, a single wait action advances the environment by 10 iterations (steps), and if the agent issues more than 10 consecutive waits, the environment automatically treats the task as completed.
I’m trying to understand how to address this. How can I prevent the model from overusing the wait action, or adjust the environment so that repeated waits don’t end the task prematurely?
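
To make the question concrete, here is a rough sketch of one mitigation I could imagine: a wrapper that subtracts a growing penalty once the agent exceeds a small budget of consecutive waits. Everything in it is an assumption for illustration, not this environment's actual API: the `reset()`/`step(action)` interface returning `(obs, reward, done, info)`, the `"wait"` action identifier, and the names `WaitPenaltyWrapper`, `ToyEnv`, `penalty`, and `max_free_waits` are all hypothetical. `ToyEnv` only mimics the "more than 10 consecutive waits ends the task" rule described above.

```python
class ToyEnv:
    """Hypothetical stand-in that mimics the termination rule described above."""

    def __init__(self):
        self.waits = 0

    def reset(self):
        self.waits = 0
        return 0  # dummy observation

    def step(self, action):
        # Count consecutive waits; any other action resets the streak.
        self.waits = self.waits + 1 if action == "wait" else 0
        done = self.waits > 10  # more than 10 consecutive waits ends the task
        return 0, 0.0, done, {}


class WaitPenaltyWrapper:
    """Hypothetical reward-shaping wrapper that discourages long wait streaks."""

    def __init__(self, env, wait_action="wait", penalty=0.1, max_free_waits=2):
        self.env = env
        self.wait_action = wait_action
        self.penalty = penalty
        self.max_free_waits = max_free_waits
        self.consecutive_waits = 0

    def reset(self, *args, **kwargs):
        self.consecutive_waits = 0
        return self.env.reset(*args, **kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if action == self.wait_action:
            self.consecutive_waits += 1
        else:
            self.consecutive_waits = 0
        # Penalize only the waits beyond the free budget, growing linearly.
        excess = max(0, self.consecutive_waits - self.max_free_waits)
        shaped_reward = reward - self.penalty * excess
        # Expose the streak length so it can be logged during training.
        info = dict(info, consecutive_waits=self.consecutive_waits)
        return obs, shaped_reward, done, info
```

With something like `WaitPenaltyWrapper(ToyEnv(), penalty=0.5)`, the first two waits would cost nothing and each further consecutive wait would cost progressively more. I am not sure whether shaping the reward like this is preferable to changing the termination rule itself, which is part of what I am asking.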