Hi, thanks for open-sourcing the amazing Perception Encoder! Could you clarify two points about image preprocessing, especially regarding Table 33's description ("trained with dynamic tiling for different image sizes and aspect ratio; up to 4 image tiles of the encoder’s native resolution + a thumbnail"):
- When is the input resized to fixed native sizes (e.g., 336px for L-scale, 448px for G-scale)?
- When is dynamic tiling applied instead?
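For context, here is a minimal sketch of my current understanding of the dynamic tiling step, so you can correct me if it's wrong. All names and the grid-selection heuristic here are my assumptions, not your code: I assume the preprocessor picks a tile grid of up to 4 tiles whose aspect ratio best matches the input, resizes the image to fill that grid at the encoder's native resolution, crops the tiles, and appends a native-resolution thumbnail of the full image.

```python
def tile_grid(width: int, height: int, max_tiles: int = 4) -> tuple[int, int]:
    """Hypothetical grid selection: choose (cols, rows) with
    cols * rows <= max_tiles whose aspect ratio best matches the image,
    preferring more tiles on ties. This is my guess at the heuristic,
    not the actual implementation."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            diff = abs(cols / rows - target)
            if diff < best_diff or (diff == best_diff
                                    and cols * rows > best[0] * best[1]):
                best, best_diff = (cols, rows), diff
    return best


# Presumably the image is then resized to (cols * native, rows * native),
# cropped into cols * rows tiles, and a thumbnail at native resolution is
# appended, giving cols * rows + 1 crops total.
print(tile_grid(1024, 256))   # wide image
print(tile_grid(512, 512))    # square image
print(tile_grid(256, 1024))   # tall image
```

Is this roughly what happens at G-scale, and does the L-scale path skip tiling and just resize to the fixed native size?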