Significant Performance Gap Between Reproduced Results and Reported Benchmarks

I successfully downloaded the official HuggingFace model weights and deployed the model using vLLM on my local server for evaluation. I then ran the provided main.py script to benchmark performance on the listed datasets.

Although the setup appears to run correctly, my results differ substantially from those reported in the paper, especially on the ETT datasets. For example, my average MSE on ETTm1 is around 33.9, while the paper reports 13.1, which is roughly 2.6 times higher.

I have verified that the evaluation script averages results across all windows and variables. Could you please advise whether I am missing any critical evaluation or preprocessing steps?
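For reference, here is a minimal sketch of the averaging I checked, together with one thing that may be worth ruling out: a scale mismatch between raw and standardized values. Everything in it is an assumption for illustration. The function names `mse_over_windows` and `rescale`, the `[window][step][variable]` layout, and the toy numbers are hypothetical and are not taken from main.py or the paper, and I do not know which scale the reported numbers use.

```python
from statistics import fmean


def mse_over_windows(pred, target):
    """Mean squared error over every window, time step, and variable.

    pred and target are nested lists shaped [window][step][variable]
    (a hypothetical layout, not necessarily what main.py uses).
    """
    if len(pred) != len(target):
        raise ValueError("window counts differ")
    errors = []
    for p_win, t_win in zip(pred, target):
        if len(p_win) != len(t_win):
            raise ValueError("horizon lengths differ")
        for p_step, t_step in zip(p_win, t_win):
            if len(p_step) != len(t_step):
                raise ValueError("variable counts differ")
            errors.extend((p - t) ** 2 for p, t in zip(p_step, t_step))
    return fmean(errors)


def rescale(series, mean, std):
    """Standardize each variable as (x - mean) / std."""
    return [
        [[(x - m) / s for x, m, s in zip(step, mean, std)] for step in win]
        for win in series
    ]


# Toy data: one window, two steps, two variables on very different scales.
target = [[[10.0, 100.0], [12.0, 110.0]]]
pred = [[[11.0, 104.0], [10.0, 107.0]]]

# Made-up per-variable statistics, standing in for training-set statistics.
mean = [0.0, 0.0]
std = [2.0, 10.0]

mse_raw = mse_over_windows(pred, target)
mse_scaled = mse_over_windows(rescale(pred, mean, std), rescale(target, mean, std))
print(mse_raw, mse_scaled)
```

The same forecasts give very different MSE values depending on the scale, so a gap of this size could in principle come from the scale used for scoring rather than from the model. I have not confirmed that this is the cause here.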
