I noticed that the SampleBEV metrics reported in the original paper are much higher than the ones I get when running the code in this repo. Could you clarify what causes this gap (e.g., different settings, data splits, visibility filters, or evaluation protocols)?