The paper states: "In terms of ImageNet, we use the AdamW optimizer [18] to train the network for 100 epochs with a total batch size of 256. The initial learning rate is 2e-4 reduced by 0.1 at epochs 30, 60, and 90."
However, on ImageNet, most prior papers (e.g., CRD) adopt the SGD optimizer with an initial learning rate of 0.1 for the ResNet34-ResNet18 teacher-student pair.
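For concreteness, here is a rough PyTorch sketch of the two training recipes being compared. The momentum and weight decay values in the SGD recipe are my assumptions based on common ImageNet practice, not values taken from this paper or from CRD:

```python
import torch
import torchvision

# Hypothetical student model; ICKD uses ResNet18 as the student on ImageNet.
student = torchvision.models.resnet18(num_classes=1000)

# (a) Recipe quoted from the paper: AdamW, initial lr 2e-4, 100 epochs,
#     lr multiplied by 0.1 at epochs 30, 60, and 90 (total batch size 256).
optimizer_paper = torch.optim.AdamW(student.parameters(), lr=2e-4)
scheduler_paper = torch.optim.lr_scheduler.MultiStepLR(
    optimizer_paper, milestones=[30, 60, 90], gamma=0.1)

# (b) Recipe used by most prior work such as CRD: SGD with initial lr 0.1.
#     Momentum 0.9 and weight decay 1e-4 are assumed here (typical ImageNet
#     defaults), not quoted from any paper.
optimizer_common = torch.optim.SGD(
    student.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler_common = torch.optim.lr_scheduler.MultiStepLR(
    optimizer_common, milestones=[30, 60, 90], gamma=0.1)
```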
Why did the authors choose the uncommon AdamW optimizer on ImageNet?
Could you provide results for ICKD trained with the same strategy as previous works, for a fair comparison?
Thanks :)