Slow YAML parser

Using windIO for loading YAML files, I have been puzzled by the slow load time. As an example, when loading the `IEA-22-280-RWT.yaml` file from the repo, I observe a load time of 1.5 s! For reference, it is a file of just 0.7 MB.

I then remembered that I had to force the YAML parser to use the "pure Python" parser, because the name of the file being read cannot be extracted from the C parser (https://github.com/IEAWindSystems/windIO/blob/7a398db7bdd7c01ca945f08546fbf5d50ff49ef1/windIO/yaml.py#L67, or at least I could not figure out how back then).

The "pure Python" parser is only required when using the `!include` tag, so I could switch to the C parser to see if that was substantially faster. That test prompted a simple investigation of parser speed, and I stumbled over this Stack Overflow question: https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
which includes this plot:
![](https://i.sstatic.net/GhMvZ.png)
It did not look too good for the YAML parser: even with the C parser it is still one of the slowest. I therefore made the simple script below:

```python
import json
import time
from pathlib import Path

import rtoml
import tomllib
import windIO.examples.turbine as wio_turb
from ruamel.yaml import YAML
from windIO import load_yaml, write_yaml


def compare_dict(d1, d2, equal_key_order=True):
    assert len(d1) == len(d2), "Length of dicts should be the same"
    if equal_key_order:
        assert json.dumps(d1) == json.dumps(
            d2
        ), "Data should be the same (comparing their JSON representation)"
    else:
        for name, val in d1.items():
            assert name in d2, f"{name} is not in d2"
            if isinstance(val, dict):
                compare_dict(val, d2[name], equal_key_order)
            elif isinstance(val, list) and val and isinstance(val[0], dict):
                assert len(val) == len(d2[name]), "Length of lists should be the same"
                for iel, el in enumerate(val):
                    compare_dict(el, d2[name][iel], equal_key_order)
            else:
                assert json.dumps(val) == json.dumps(
                    d2[name]
                ), "Data should be the same (comparing their JSON representation)"


# Load IEA 22 and save as json and toml
wio_data_baseline = load_yaml(Path(wio_turb.__file__).parent / "IEA-22-280-RWT.yaml")

with open("IEA-22-280-RWT_default.json", "w") as file:
    json.dump(wio_data_baseline, file)

with open("IEA-22-280-RWT_indent2.json", "w") as file:
    json.dump(wio_data_baseline, file, indent=2)

with open("IEA-22-280-RWT.toml", "w") as file:
    rtoml.dump(wio_data_baseline, file)

write_yaml(wio_data_baseline, "IEA-22-280-RWT.yaml")
print("IEA-22 files was written")

# Load with current default loader (pure-python)
t0 = time.time()
wio_data = load_yaml("IEA-22-280-RWT.yaml")
t1 = time.time()
t_base = t1 - t0
compare_dict(wio_data_baseline, wio_data)
print(f"windIO YAML-(Default) load time: {t_base:1.3f} s")

# Load with C-loader
loader = YAML(typ="safe", pure=False)
t0 = time.time()
wio_data_cload = load_yaml(
    "IEA-22-280-RWT.yaml", loader
)
t1 = time.time()
t_cload = t1 - t0
compare_dict(wio_data_baseline, wio_data_cload)
print(f"YAML C-loader load time: {t_cload:1.3f} s ({t_base/t_cload-1:2.1f} times faster)")

# Load with json-default
t0 = time.time()
with open("IEA-22-280-RWT_default.json", "r") as file:
    wio_json1 = json.load(file)
t1 = time.time()
t_json1 = t1 - t0
compare_dict(wio_data_baseline, wio_json1)
print(f"JSON-default load time: {t_json1:1.3f} s ({t_base/t_json1-1:2.1f} times faster)")

# Load with json
t0 = time.time()
with open("IEA-22-280-RWT_indent2.json", "r") as file:
    wio_json2 = json.load(file)
t1 = time.time()
t_json2 = t1 - t0
compare_dict(wio_data_baseline, wio_json2)
print(f"JSON-indent=2 load time: {t_json2:1.3f} s ({t_base/t_json2-1:2.1f} times faster)")

# Load with tomllib 
t0 = time.time()
with open("IEA-22-280-RWT.toml", "rb") as file:
    wio_toml = tomllib.load(file)
t1 = time.time()
t_toml = t1 - t0
compare_dict(wio_data_baseline, wio_toml, False)
print(f"tomllib load time: {t_toml:1.3f} s ({t_base/t_toml-1:2.1f} times faster)")

# Load with rtoml 
t0 = time.time()
with open("IEA-22-280-RWT.toml", "r") as file:
    wio_toml3 = rtoml.load(file)
t1 = time.time()
t_toml3 = t1 - t0
compare_dict(wio_data_baseline, wio_toml3, False)
print(f"rtoml load time: {t_toml3:1.3f} s ({t_base/t_toml3-1:2.1f} times faster)")
```
On my machine, this resulted in the following output:

```
IEA-22 files was written
windIO YAML-(Default) load time: 1.543 s
YAML C-loader load time: 0.697 s (1.2 times faster)
JSON-default load time: 0.008 s (188.2 times faster)
JSON-indent=2 load time: 0.009 s (172.4 times faster)
tomllib load time: 0.115 s (12.4 times faster)
rtoml load time: 0.018 s (86.8 times faster)
```
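
The figures above come from a single `time.time()` measurement per loader, and the script reports `t_base / t - 1` as "times faster". As a rough sketch of a slightly more repeatable measurement, the helper below takes the best of several runs with `time.perf_counter` and reports the plain ratio of load times. The names `best_of` and `speedup` are hypothetical and not part of windIO, and `json.loads` on a small in-memory document is used only so that the block runs on its own.

```python
import json
import time


def best_of(func, repeats=5):
    """Return the shortest wall-clock time (in seconds) over `repeats` calls of `func`."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        timings.append(time.perf_counter() - t0)
    return min(timings)


def speedup(t_base, t_new):
    """Ratio of load times: 2.0 means the new loader needs half the time."""
    return t_base / t_new


# Stand-in payload so the example is self-contained (not the IEA-22 file)
payload = json.dumps({"values": list(range(10_000))})
t_json = best_of(lambda: json.loads(payload))
print(f"json.loads best of 5: {t_json:.6f} s")
```

With the numbers reported above, `speedup(1.543, 0.697)` is about 2.2 for the C loader.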

- The C parser is faster for YAML, but it only roughly halves the load time.
- The standard JSON parser loads the same data at least 170 times faster than our current default. In my opinion, however, JSON lacks a critical feature, namely comments, and to me that is enough to disqualify it as the data format for our default use case. In the script, I compare loading data written with the default settings (one long string without indentation or line breaks) against data written with `indent=2`, which to me is a human-readable file (although a 1D array of numbers would to me be better on one line). The load time is not significantly different, but the file size is 0.7 MB for the default case and 1.4 MB for the `indent=2` case.
- TOML, for which a parser (`tomllib`) was added to the Python standard library, has the advantage of allowing comments while being relatively fast (compared to YAML). It also allows for round-trip conversion (preserving format, comments, structure, etc.) via `tomlkit`, but that was really slow (~30 s).

I am not suggesting that we force a switch from one file format to another, but rather that we could consider allowing different file formats. I think the biggest issue would be keeping the ability to read other files, which is currently done via the YAML `!include` syntax. We would therefore need to switch to a file-parser-agnostic method, which could be a post-processing step that looks for dedicated keywords (e.g. `$ref`, `$external`) and reads the referenced files.
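
As a rough sketch of what such a parser-agnostic post-processing step could look like: the function below walks the already-parsed data and replaces any mapping of the form `{"$ref": "<path>"}` with the content of the referenced file. The keyword handling, the function name `resolve_refs` and the `LOADERS` table are assumptions for illustration, not an existing windIO API. Only JSON is wired up here so the block runs with the standard library; YAML or TOML loaders could be registered the same way.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry: file suffix -> function that parses a file into Python data
LOADERS = {".json": lambda path: json.loads(Path(path).read_text())}


def resolve_refs(data, base_dir, loaders=LOADERS):
    """Recursively replace {"$ref": "<relative path>"} nodes with the loaded file content."""
    if isinstance(data, dict):
        if set(data) == {"$ref"}:
            target = Path(base_dir) / data["$ref"]
            loaded = loaders[target.suffix](target)
            # Resolve nested references relative to the included file
            return resolve_refs(loaded, target.parent, loaders)
        return {key: resolve_refs(val, base_dir, loaders) for key, val in data.items()}
    if isinstance(data, list):
        return [resolve_refs(el, base_dir, loaders) for el in data]
    return data


# Small demonstration with two temporary files
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "airfoils.json").write_text(json.dumps([{"name": "circular"}]))
    (tmp / "turbine.json").write_text(
        json.dumps({"name": "demo", "airfoils": {"$ref": "airfoils.json"}})
    )
    main = LOADERS[".json"](tmp / "turbine.json")
    resolved = resolve_refs(main, tmp)

print(resolved)
```

The sketch does not guard against circular references, which a real implementation would probably want to detect.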

The issue raised here is mostly a reminder that we need to investigate the possibility of using the C parser for YAML; the remaining observations could be moved to other issues or PRs.
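
One possible direction, sketched under assumptions: scan the file for the `!include` tag and only fall back to the pure Python parser when it is present. The helper names below are hypothetical. `YAML(typ="safe", pure=...)` is the same ruamel.yaml constructor used in the script above; how windIO registers its `!include` handling on that loader is not covered here.

```python
from pathlib import Path


def needs_pure_parser(text):
    """Return True if the YAML text appears to use the `!include` tag.

    A plain substring check can give false positives (e.g. `!include` inside a
    comment), which only costs speed because the slower parser is then used.
    """
    return "!include" in text


def make_loader(path):
    """Hypothetical helper: pick the C-based loader unless `!include` is found."""
    from ruamel.yaml import YAML  # third-party, imported lazily

    text = Path(path).read_text()
    return YAML(typ="safe", pure=needs_pure_parser(text))


print(needs_pure_parser("airfoils: !include airfoils.yaml"))  # True
print(needs_pure_parser("name: IEA-22"))  # False
```

Reading the file once for the check adds some overhead, but a substring scan of a 0.7 MB file should be small compared to the parse times reported above.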
