  1. Take the state as input.
  2. Compute the action probabilities for that state and sample an action according to those probabilities.
  3. Store the probability of the chosen action.
  4. Execute the chosen action.
  5. Store the reward received after each action.
  6. Repeat steps 1-5 until the episode ends.
  7. Calculate the discounted rewards for each step in the trajectory.
  8. Compute the gradients and update the policy (a minimal sketch of this loop follows below).

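Steps 1-8 can be sketched roughly like this, assuming PyTorch, the Gymnasium API, and the CartPole-v1 environment; the actual code in this repository may differ in structure and naming:

```python
import torch
import torch.nn as nn
import gymnasium as gym  # assumed environment library

# Minimal policy network: state -> action probabilities (step 1 and 2).
class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")  # assumed environment, see the note at the end
policy = Policy(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                   # step 2: sample an action
        log_probs.append(dist.log_prob(action))  # step 3: store the (log-)probability
        state, reward, terminated, truncated, _ = env.step(action.item())  # step 4
        rewards.append(reward)                   # step 5: store the reward
        done = terminated or truncated           # step 6: loop until the episode ends

    # Step 7: discounted return G_t for every step of the trajectory.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Step 8: policy-gradient loss -sum_t G_t * log pi(a_t | s_t); backward() gives the gradients.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
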
The objective function is:

img_3.png

The gradients can be derived from:

img.png, where $$G_t$$ is the discounted return collected as a consequence of the action taken at step $$t$$.
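
For reference, a common way of writing this policy-gradient objective and its gradient (presumably what the images above show) is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t G_t \log \pi_\theta(a_t \mid s_t)\right]$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$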

That was the implementation of the vanilla policy gradient; now let's look at A2C, short for Advantage Actor-Critic. In the vanilla implementation we often take good actions and sometimes bad ones, the two cancel each other out, and the agent doesn't learn which actions were actually good or bad.

So in A2C we introduce a Critic, which tells us how good the action taken at this particular step was. The advantage is essentially the difference between what we actually got from the action and what we could have expected to get in this state. It distinguishes between individual steps instead of scoring the whole trajectory at once. This is the "advantage" part.
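
With a state-action value $$Q$$ and a state value $$V$$ (the two networks described under Implementation below), this difference is the standard advantage:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$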

The changed objective function is:

img_2.png

The gradients will be:

img_4.png
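
In the standard A2C formulation these are the same expressions as before with $$G_t$$ replaced by the advantage (again, presumably what the images show):

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t A(s_t, a_t) \log \pi_\theta(a_t \mid s_t)\right]$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t A(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$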

Implementation:

  • There will be two neural networks: one predicting the Q-values of the actions in each state, and one predicting the state value of the current state.
  • The function we want to maximize involves the expected difference between the reward predicted from this state and the reward we actually get (the advantage); a minimal sketch follows below.

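As a sketch of that update, assuming PyTorch: a single module holds an actor head (the policy) and a critic head predicting the state value $$V(s)$$, and the advantage is estimated as the discounted return minus $$V(s)$$. The placeholder states, actions, and returns stand in for data collected with the same episode loop as in the vanilla version; the actual Q-network/V-network setup in this repository may differ.

```python
import torch
import torch.nn as nn

# Actor head: state -> action probabilities. Critic head: state -> V(s).
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)

model = ActorCritic(obs_dim=4, n_actions=2)   # CartPole-sized, purely illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder episode data: visited states, chosen actions, and the
# discounted returns G_t computed exactly as in the vanilla version.
states = torch.randn(5, 4)
actions = torch.randint(0, 2, (5,))
returns = torch.tensor([4.9, 3.9, 2.9, 2.0, 1.0])

probs, values = model(states)
dist = torch.distributions.Categorical(probs)
log_probs = dist.log_prob(actions)

# Advantage: what we actually got minus what the critic expected for the state.
advantage = returns - values.squeeze(-1)

# Actor follows the advantage-weighted policy gradient; the critic is regressed
# towards the observed returns. detach() keeps the actor loss out of the critic update.
actor_loss = -(log_probs * advantage.detach()).sum()
critic_loss = advantage.pow(2).mean()
loss = actor_loss + critic_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
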
Change the name of the environment to try out different environments.
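
For example, assuming the environment is created with a Gym/Gymnasium-style `make` call (the exact variable names in this repository may differ):

```python
import gymnasium as gym  # or `import gym`, depending on what is installed

# Swap the ID string for any other discrete-action environment.
env = gym.make("LunarLander-v2")
```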
