I downloaded the official Hugging Face model weights and served the model with vLLM on my local server for evaluation. I then ran the provided main.py script to benchmark performance on the listed datasets.
Although the implementation appears correct, my results differ substantially from those reported in the paper, especially on the ETT datasets. For example, my average MSE on ETTm1 is around 33.9, while the paper reports 13.1.
I have verified that the evaluation script correctly averages results across all windows and variables. Could you please advise whether I might be missing any critical evaluation or preprocessing steps?
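For reference, here is a minimal sketch of the averaging I am describing. It is not the code from main.py: the function name `average_mse` and the nested-list layout `[window][step][variable]` are assumptions made for illustration only.

```python
from statistics import fmean


def average_mse(preds, targets):
    """Average MSE over all windows, time steps, and variables.

    Hypothetical helper: preds and targets are nested lists laid out as
    [window][step][variable], with every window the same size.
    """
    per_window = []
    for p_win, t_win in zip(preds, targets):
        # Squared error for every (step, variable) pair in this window.
        sq_errs = [
            (p - t) ** 2
            for p_step, t_step in zip(p_win, t_win)
            for p, t in zip(p_step, t_step)
        ]
        per_window.append(fmean(sq_errs))
    # Mean of the per-window means; equal to the global mean when all
    # windows contain the same number of values.
    return fmean(per_window)


# Toy data: 2 windows, 2 steps, 2 variables.
preds = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
targets = [[[1.0, 2.0], [3.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(average_mse(preds, targets))  # 1.0
```

If the intended aggregation differs from this (for example, a different reduction order or weighting), that alone might explain part of the gap.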