A Python pipeline for downloading and processing HamSCI HF propagation spot data from the Madrigal database's daily HDF5 files. It loads one or more days, applies filters (time, region, frequency, distance, dataset/source), converts the data to Polars, and can generate 2D histograms (time x distance). Optional Parquet caching speeds up repeated runs.
- Downloads daily HDF5 files from Madrigal (via madrigalWeb / globalDownload.py)
- Loads daily HDF5 files named like rsdYYYY-MM-DD.01.hdf5
- Filters by:
  - Date/time range
  - Geographic bounds (lat/lon midpoint)
  - Frequency range (single or multiple ranges)
  - Distance range (km)
  - Dataset/source (RBN, WSPR, PSK)
- Produces:
  - Filtered Polars dataframe (Parquet/CSV/HDF5)
  - 2D histogram (time vs distance) with metadata (Parquet/NetCDF/HDF5)
- Caches:
  - Dataframes and histograms as Parquet for faster iterative runs
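Conceptually, each day is processed the same way: read the daily HDF5 file, build a Polars dataframe, apply the filters, then bin the spots into a time x distance grid. The rough standalone sketch below illustrates that flow only; the HDF5 layout ("Data/Table Layout") and the column names dist_km, tfreq, and ut1_unix are assumptions, not the project's actual schema (the real implementation lives in scripts/madrigal_loader.py).

```python
# Illustrative per-day flow (not the project's actual loader).
import h5py
import numpy as np
import polars as pl

def load_day(path: str) -> pl.DataFrame:
    """Read one daily Madrigal HDF5 file into a Polars dataframe."""
    with h5py.File(path, "r") as f:
        table = f["Data"]["Table Layout"][:]  # numpy structured array (assumed layout)
    return pl.DataFrame({name: table[name] for name in table.dtype.names})

def filter_spots(df: pl.DataFrame) -> pl.DataFrame:
    """Apply example distance (km) and frequency (MHz) filters; column names are assumptions."""
    return df.filter(
        pl.col("dist_km").is_between(0, 3000)
        & pl.col("tfreq").is_between(7.0, 7.3)
    )

def time_distance_histogram(df: pl.DataFrame) -> np.ndarray:
    """Bin spots into a time x distance grid (10-minute x 50-km bins)."""
    t = df["ut1_unix"].to_numpy()
    d = df["dist_km"].to_numpy()
    t_edges = np.arange(t.min(), t.max() + 600, 600)
    d_edges = np.arange(0, 3050, 50)
    hist, _, _ = np.histogram2d(t, d, bins=[t_edges, d_edges])
    return hist

if __name__ == "__main__":
    spots = filter_spots(load_day("data/madrigal/rsd2019-12-01.01.hdf5"))
    print(time_distance_histogram(spots).shape)
```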
- run_loader.py: CLI entrypoint. Reads a JSON config and runs the pipeline day-by-day.
- scripts/madrigal_loader.py: MadrigalHamSpotLoader implementation.
- scripts/json_loader.py: Config loading utilities.
- scripts/regions.py: Named region bounding boxes.
- scripts/utils_freq.py: Named frequency ranges and labels.
- config/: Example config(s).
- download_madrigal_daily_hdf5.sh: Helper script to download daily Madrigal HDF5 files via globalDownload.py.
- Python 3.10+ recommended
- Dependencies are listed in requirements.txt
pip install -r requirements.txt
This project expects daily Madrigal HDF5 files named like:
rsdYYYY-MM-DD.01.hdf5
We download these using the Madrigal remote Python API package madrigalWeb (included in requirements.txt), which installs command-line tools such as globalDownload.py.
download_madrigal_daily_hdf5.sh loops day-by-day and calls globalDownload.py (installed by madrigalWeb) to fetch daily HDF5 files.
In the script, you typically only need to set:
- startDate / endDate
- outputDir
- user_fullname
- user_email
- user_affiliation
chmod +x download_madrigal_daily_hdf5.sh
./download_madrigal_daily_hdf5.sh
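Because the loader expects one file per day with the rsdYYYY-MM-DD.01.hdf5 naming, it can help to confirm the download is complete before running the pipeline. A standalone sketch (only the naming convention comes from this project; the helper itself is hypothetical):

```python
# Report which expected daily files are missing from the data directory.
from datetime import date, timedelta
from pathlib import Path

def missing_days(data_dir: str, start: date, end: date) -> list[str]:
    expected = []
    day = start
    while day <= end:
        expected.append(f"rsd{day:%Y-%m-%d}.01.hdf5")  # naming convention from this README
        day += timedelta(days=1)
    present = {p.name for p in Path(data_dir).glob("rsd*.hdf5")}
    return [name for name in expected if name not in present]

if __name__ == "__main__":
    gaps = missing_days("data/madrigal", date(2019, 12, 1), date(2019, 12, 3))
    print("missing files:", gaps or "none")
```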
- Put your HDF5 files in a directory, for example:

  data/madrigal/rsd2019-12-01.01.hdf5
  data/madrigal/rsd2019-12-02.01.hdf5
- Create and edit JSON config files in the config folder, for example:

  config/example.json
- Run:

  python3 run_loader.py -p config/example.json
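To process several configurations in one go, the CLI can simply be called once per config file; a small sketch that relies only on the invocation shown above:

```python
# Run the pipeline once per JSON config in config/ (sketch; uses only the
# documented `python3 run_loader.py -p <config>` form).
import subprocess
from pathlib import Path

for cfg in sorted(Path("config").glob("*.json")):
    print(f"=== {cfg} ===")
    subprocess.run(["python3", "run_loader.py", "-p", str(cfg)], check=True)
```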
Example config/example.json:
{
"data_dir": "data/madrigal",
"cache_dir": "cache",
"use_cache": true,
"chunk_size": 100000,
"sDate": "2019-12-01T00:00:00",
"eDate": "2019-12-03T23:59:59",
"filters": {
"region_name": "CONUS",
"freq": [7, 14],
"distance_range": { "min_dist": 0, "max_dist": 3000 },
"datasets": ["RBN", "WSPR"]
},
"output": {
"output_dir": "output",
"dataframe": { "generate": true, "formats": ["csv"] },
"histogram": { "generate": true, "formats": ["csv"] }
}
}
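Config parsing is handled by scripts/json_loader.py. Purely as an illustration of what the pipeline consumes, a minimal standalone reader for the fields used above might look like the sketch below (field names are taken from the example config; the validation is illustrative, not the project's actual logic):

```python
# Minimal standalone config reader (sketch; the real parsing lives in scripts/json_loader.py).
import json
from datetime import datetime
from pathlib import Path

def read_config(path: str) -> dict:
    cfg = json.loads(Path(path).read_text())
    # Basic sanity checks on the fields shown in the example config.
    for key in ("data_dir", "sDate", "eDate", "filters", "output"):
        if key not in cfg:
            raise KeyError(f"missing required config key: {key}")
    # Dates are ISO 8601 strings.
    cfg["sDate"] = datetime.fromisoformat(cfg["sDate"])
    cfg["eDate"] = datetime.fromisoformat(cfg["eDate"])
    return cfg

if __name__ == "__main__":
    print(read_config("config/example.json"))
```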
- filters.freq can be:
  - a single key string (example: "7MHz")
  - a list of key strings (example: ["7MHz","14MHz"])
- filters.datasets is optional:
  - If omitted or null, all datasets are included.
  - Valid values: ["RBN","WSPR","PSK"]
- Region and frequency keys must exist in scripts/regions.py and scripts/utils_freq.py.
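To add a new region or frequency band, extend those two modules. Their exact contents are defined in the repo; a plausible shape, assuming plain dictionary lookups keyed by the names used in the config, might be:

```python
# Illustrative only; check scripts/regions.py and scripts/utils_freq.py for
# the real definitions and key names.

# regions.py: named bounding boxes used to filter by lat/lon midpoint.
REGIONS = {
    "CONUS": {"lat_lim": (24.0, 50.0), "lon_lim": (-125.0, -66.0)},
}

# utils_freq.py: named frequency ranges (MHz) and display labels.
FREQ_RANGES = {
    "7MHz":  {"range": (7.0, 7.3),    "label": "40 m"},
    "14MHz": {"range": (14.0, 14.35), "label": "20 m"},
}
```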
The pipeline runs day-by-day across the requested date range.
Outputs are written under your configured output.output_dir (default: output/), typically into:
- output/dataframes/
- output/histograms/
Histogram Parquet outputs include metadata stored in the Parquet schema under heatmap_meta.
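That metadata can be read back with pyarrow; a sketch, assuming heatmap_meta is a key in the Parquet schema metadata as described above (the output path is illustrative):

```python
# Read a histogram Parquet output and its embedded metadata.
import pyarrow.parquet as pq

table = pq.read_table("output/histograms/example.parquet")
meta = table.schema.metadata or {}          # dict of bytes -> bytes
raw = meta.get(b"heatmap_meta")
print(raw.decode() if raw else "no heatmap_meta found")
```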
When use_cache is true:
- Dataframes are cached in: cache/dataframes/
- Histograms are cached in: cache/heatmaps/
Cache filenames incorporate:
- date range
- region bounds
- frequency range(s)
- distance range
- dataset selection
If you change filters and rerun, the pipeline automatically creates a new cache entry.
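To force a full recompute, set use_cache to false or remove the cache directories, for example:

```python
# Remove the cache directories listed above to force a fresh run.
import shutil
from pathlib import Path

for sub in ("dataframes", "heatmaps"):
    d = Path("cache") / sub
    if d.exists():
        shutil.rmtree(d)
        print(f"removed {d}")
```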
PRs are welcome. Please keep large data and generated outputs out of git (see .gitignore).