Hello!
I'm implementing Nadam/Radam for my NNTL project. Would you be so kind as to give me some intuition about, or clarify, one thing in the Nadam algorithm as published in the ICLR 2016 paper?
Here is the line for the \hat{n_t} expression in Algorithm 2 of the paper:
\hat{n_t} \leftarrow \nu n_t / (1-\nu ^t)
It contains a \nu coefficient before n_t in the numerator. What is the purpose of this scaling? Neither the original Adam, nor your own implementation of Nadam here, nor the original report paper has that coefficient in the numerator.
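To make the difference concrete, here is a minimal Python sketch of the two variants side by side. It is my own illustration, not code from the paper or from this repository. I am assuming the usual second-moment update n_t = \nu n_{t-1} + (1 - \nu) g_t^2, and the function name `second_moment_estimates` is hypothetical.

```python
def second_moment_estimates(grads, nu=0.999):
    """Return (adam_style, scaled) bias-corrected estimates for each step.

    Assumes the update n_t = nu * n_{t-1} + (1 - nu) * g_t**2 with n_0 = 0.
    """
    n = 0.0
    estimates = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving average of the squared gradient.
        n = nu * n + (1.0 - nu) * g * g
        correction = 1.0 - nu ** t
        # Bias correction as in the original Adam: n_t / (1 - nu^t).
        adam_style = n / correction
        # Line from Algorithm 2 as quoted above: nu * n_t / (1 - nu^t).
        scaled = nu * n / correction
        estimates.append((adam_style, scaled))
    return estimates


if __name__ == "__main__":
    for t, (a, s) in enumerate(second_moment_estimates([2.0] * 5, nu=0.9), start=1):
        print(t, a, s)
```

Under that assumption, with a constant gradient g the Adam-style estimate equals g^2 at every step, while the version with \nu in the numerator equals \nu g^2, so the extra factor appears to shrink the estimate by \nu rather than remove the initialization bias.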
Thanks!