In Appendix B.3.3 (Frozen feature dense prediction), for ADE20K semantic segmentation, you mention using a “linear layer + convolutional layer” head.
Could you please clarify the projection dimension (i.e., output dimension) of the linear layer used in the segmentation head?
Thank you