We can create separate test files to cover the functionality of:

- [ ] generate.py
- [ ] data.py
- [ ] benchmark.py
- [ ] eval.py
- [ ] correctness.py
- [ ] sweep.py

I suggest creating a separate PR for each file to make reviewing easier.