PPO ratio is incorrectly calculated #2

@deketh

Currently, the continuous PPO ratio is calculated as ratios = new_actor_log_probs/(actor_log_probs+EPSILON). This is derived from the previous PPO algorithm's ratio = actor_probs/(old_actor_probs+EPSILON), which made sense there because those values were not log probabilities.

The line should actually be something like ratios = exp(new_actor_log_probs - actor_log_probs), since a ratio of probabilities equals the exponential of the difference of their logs.
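To illustrate the difference, here is a minimal sketch in plain Python. The function names `ppo_ratios` and `naive_ratios` and the sample probabilities are hypothetical and not taken from the repository; the real code presumably operates on tensors, so treat this only as a check of the arithmetic.

```python
import math

EPSILON = 1e-8  # assumed value; the repository's constant may differ


def ppo_ratios(new_log_probs, old_log_probs):
    """Probability ratio pi_new / pi_old computed from log-probabilities."""
    # exp(log a - log b) == a / b, so no epsilon is needed in the denominator.
    return [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]


def naive_ratios(new_log_probs, old_log_probs, epsilon=EPSILON):
    """The calculation described in this issue: dividing the logs directly."""
    return [new / (old + epsilon) for new, old in zip(new_log_probs, old_log_probs)]


# Hypothetical probabilities of the same two actions under the old and new policy.
old_probs = [0.5, 0.2]
new_probs = [0.25, 0.4]
old_log_probs = [math.log(p) for p in old_probs]
new_log_probs = [math.log(p) for p in new_probs]

correct = ppo_ratios(new_log_probs, old_log_probs)
wrong = naive_ratios(new_log_probs, old_log_probs)

print("exp(difference of logs):", correct)
print("quotient of logs:       ", wrong)
```

In this example the first action becomes half as likely under the new policy, so the true ratio is 0.5, while dividing the log probabilities gives a value close to 2.0, which would push the clipped objective in the wrong direction.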
