Thanks for your code and contribution. I've run some tests on the legacy metrics, and there appears to be an issue with the legacy 'best' score.
When running the eval process with the bert-ls model on the LS07 test data, I obtained a 'best' score of 1.12, and on the LS14 test data the 'best' score was 1.18. These values differ significantly from the previously reported results, so I suspect there is an issue with the metric calculation.
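For reference, here is a minimal sketch of how I understand the legacy 'best' measure is usually defined (following the SemEval-2007 scorer of McCarthy & Navigli); the function and variable names are mine, not from this repo, so please correct me if the implementation here computes it differently:

```python
from collections import Counter

def best_score(system_guesses, gold_annotations):
    """Sketch of the SemEval-2007 'best' measure (precision variant).

    system_guesses:   {item_id: [substitute, ...]}  best guesses per item
    gold_annotations: {item_id: Counter({substitute: n_annotators, ...})}
    """
    total = 0.0
    for item_id, guesses in system_guesses.items():
        gold = gold_annotations.get(item_id)
        if not gold or not guesses:
            continue
        gold_weight = sum(gold.values())  # total number of gold annotations
        # credit for each guess is its annotator frequency, averaged over
        # the number of guesses submitted for this item
        credit = sum(gold.get(g, 0) for g in guesses) / len(guesses)
        total += credit / gold_weight
    # divide by the number of items the system attempted
    return total / len(system_guesses)

# toy example: two annotators chose "quick", one chose "fast"
guesses = {"bright.a 1": ["quick"]}
gold = {"bright.a 1": Counter({"quick": 2, "fast": 1})}
print(best_score(guesses, gold))  # 2/3 ~ 0.667, or 66.7 if reported as a percentage
```

With this definition the per-item credit stays within [0, 1], so whether values like 1.12 and 1.18 are plausible depends on whether the scorer reports raw fractions or percentages.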