A Python-based data conversion library that transforms MODAQ MCAP files into columnar time series data using Apache Parquet. Parquet offers an optimal balance of type safety, storage efficiency, and cross-language support, making it more efficient than CSV and easier to work with than raw MODAQ output.
- Database-like features including partitioning and filtering
- Schema enforcement with strict typing
- 10-20x smaller file sizes compared to CSV
- Wide language support through official and community libraries:
  - Python: pandas
  - MATLAB: Built-in support
  - C++: Arrow
  - Rust: Polars
pip install git+https://github.com/MODAQ2/MODAQ_toolkit.git

For detailed installation options including conda environments, specific versions, and development setup, see the Installation section below.
Basic usage:
modaq -i /path/to/mcap/files -o /path/to/output

Available arguments:
| Argument | Description | Required | Default |
|---|---|---|---|
| -i, --input-dir | Directory containing MCAP files | Yes | - |
| -o, --output-dir | Directory for output | No | ./data/ |
| --async | Enable asynchronous processing of MCAP files (much faster, but uses more cores) | No | False |
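
To make the defaults in the table concrete, here is a minimal sketch of an argument parser that mirrors it. It is an illustration built with Python's standard `argparse` module, not the toolkit's actual implementation; the function name `build_parser` and the attribute name `use_async` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not the toolkit's code)."""
    parser = argparse.ArgumentParser(prog="modaq")
    parser.add_argument(
        "-i", "--input-dir", required=True,
        help="Directory containing MCAP files",
    )
    parser.add_argument(
        "-o", "--output-dir", default="./data/",
        help="Directory for output",
    )
    # `async` is a reserved word in Python, so store the flag under another name.
    parser.add_argument(
        "--async", dest="use_async", action="store_true",
        help="Enable asynchronous processing of MCAP files",
    )
    return parser


# Only the required argument is given, so the defaults apply to the rest.
args = build_parser().parse_args(["-i", "/path/to/mcap/files"])
print(args.input_dir, args.output_dir, args.use_async)
```

With only `-i` supplied, the output directory falls back to `./data/` and asynchronous processing stays off, matching the Default column.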
Example with all options:
modaq -i /path/to/mcap/files -o /path/to/output --async

Usage from Python:

# Convert multiple MCAP files
from modaq_toolkit import process_mcap_files
process_mcap_files("input_directory", "output_directory")
# Or work with a single file
from modaq_toolkit import MCAPParser
parser = MCAPParser("path/to/file.mcap")
parser.read_mcap()
parser.create_output("output_directory")

The toolkit processes data in two stages:
- a1_one_to_one/: Preserves the original data structure
  - Contains Parquet files organized by topic
  - Includes metadata describing schemas and structure
- a2_real_data/: Optimizes data for time series analysis
  - Expands array data for easier analysis
  - Maintains same file organization as stage 1
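
Once a conversion has finished, it can help to see what each stage produced. The sketch below walks the two stage directories named above and counts the Parquet files in each folder. It assumes each topic gets its own folder somewhere under the stage directory, which may not match the toolkit's exact layout; `summarize_output` is a hypothetical helper, not part of modaq_toolkit.

```python
from collections import defaultdict
from pathlib import Path

STAGES = ("a1_one_to_one", "a2_real_data")


def summarize_output(output_dir):
    """Map each stage to {folder name: number of Parquet files found}."""
    summary = {}
    for stage in STAGES:
        counts = defaultdict(int)
        stage_dir = Path(output_dir) / stage
        # A stage that was never written simply reports no files
        if stage_dir.is_dir():
            for parquet_file in stage_dir.rglob("*.parquet"):
                counts[parquet_file.parent.name] += 1
        summary[stage] = dict(counts)
    return summary


for stage, folders in summarize_output("path/to/output").items():
    print(stage, folders)
```

Because stage 2 keeps the same file organization as stage 1, the two dictionaries would be expected to list the same folders after a complete run.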
Read all Parquet files from a directory (Python):
from pathlib import Path
import pandas as pd
# Read all parquet files in a directory
# pathlib accepts forward slashes on Windows, macOS, and Linux
directory = Path("path/to/parquet/files").resolve()
df = pd.read_parquet(directory)
# Ensure that the data is sorted by time
df = df.sort_values("time")
# Print a summary of the df
df.info()
# Print the first 5 rows of df
print(df.head())

Read a single Parquet file (Python):

from pathlib import Path

import pandas as pd
parquet_file = Path("path/to/file.parquet").resolve()
# Read single parquet file and sort by time
df = pd.read_parquet(parquet_file)
df = df.sort_values('time')
# Print a summary of the df
df.info()
# Print the first 5 rows of df
print(df.head())

Read all Parquet files from a directory (MATLAB):
% Create a datastore for the directory
ds = parquetDatastore("path/to/parquet/files", "OutputType", "timetable");
% Preview the dataset
preview(ds)
% Read all data into a single timetable
df = readall(ds);
% Sort by time
df = sortrows(df, 'time');

Read a single Parquet file (MATLAB):
% Read single parquet file
df = parquetread("path/to/file.parquet");

Note: MODAQ data always includes a time column which should be used for sorting and time series analysis.
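
As a sketch of why the time column matters, the example below sorts a small, made-up table by time, confirms the timestamps are in order, and averages the samples in one-second bins. The column name `value` and the sample data are invented for illustration; real MODAQ columns depend on the topic.

```python
import pandas as pd

# Made-up samples, deliberately out of order
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2024-01-01 00:00:01.5",
                "2024-01-01 00:00:00.0",
                "2024-01-01 00:00:00.5",
                "2024-01-01 00:00:01.0",
            ]
        ),
        "value": [4.0, 1.0, 2.0, 3.0],
    }
)

# Sort by time and use it as the index for time series operations
df = df.sort_values("time").set_index("time")
print(df.index.is_monotonic_increasing)

# Average the samples that fall in each one-second bin
per_second = df["value"].resample("1s").mean()
print(per_second)
```

Resampling requires a sorted datetime index, which is why the sort and `set_index` steps come first.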
For NLR Enterprise Users:
- Open "Portal Manager" on your NLR workstation
- Browse to the Anaconda package
- Click "Install" to get the full Anaconda distribution with Python
For Non-NLR Users:
- Download either:
Then create and activate a new environment:
# Create new environment with Python 3.10
conda create -n modaq python=3.10
# Activate the environment
# Activate the environment (same command on Windows, macOS, and Linux with conda 4.4+)
conda activate modaq

For most users who just want to use the package, this is the recommended method:
pip install git+https://github.nrel.gov/Water-Power/modaq_toolkit.git

For developers or those who need to modify the code:
- Clone the repository:

  # HTTPS (Recommended for most users)
  git clone https://github.nrel.gov/Water-Power/modaq_toolkit
  # SSH (For contributors)
  git clone git@github.nrel.gov:Water-Power/modaq_toolkit.git

- Navigate to repository:

  cd modaq_toolkit # Adjust path as needed

- Install in development mode:

  # Standard development install
  pip install -e .
  # Full development install with testing tools
  pip install -e ".[dev]"

Development install benefits:
- Code changes take effect immediately
- Source remains in original location
- Enables contributing back to the project
- Includes development tools with the [dev] option
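
After either install method, a quick check along these lines can confirm the environment is usable. It relies only on the standard library; the module name `modaq_toolkit` is taken from the import statements earlier in this README, and `check_environment` is a hypothetical helper rather than something the toolkit provides.

```python
import importlib.util
import sys


def check_environment(module_name="modaq_toolkit", min_version=(3, 10)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(map(str, sys.version_info[:3]))
        needed = ".".join(map(str, min_version))
        problems.append(f"Python {needed}+ required, found {found}")
    # find_spec returns None when a top-level module cannot be located
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"Module '{module_name}' is not importable in this environment")
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, the interpreter meets the version requirement and the package can be found in the active environment.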
- Missing Dependencies
  - Make sure you're using Python 3.10+
  - Check that all requirements are installed: pip list
  - Try reinstalling in a fresh conda environment
- Import Errors
  - Verify your environment is activated
  - Ensure you're in the correct directory when installing
  - Check for any conflicting packages
- Large Files
  - Process files individually for better memory management
  - Use the command line tool for batch processing
  - Monitor system resources during conversion
  - Avoid processing files directly from OneDrive, network drives, or remote servers
    - These files must be downloaded locally first, which can significantly slow processing
    - Copy files to a local drive before processing for best performance
- Output Structure
  - Check permissions in output directory
  - Verify input MCAP file integrity
  - Review logs for processing errors
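
The large-file advice above can be sketched as a loop that copies each MCAP file to local scratch space and converts it on its own. The conversion step is passed in as a callable so the sketch stays independent of the toolkit; `convert_locally` and `convert_one` are hypothetical names, the system temp directory is assumed to sit on a local drive, and the commented usage assumes the `MCAPParser` calls shown earlier.

```python
import shutil
import tempfile
from pathlib import Path


def convert_locally(input_dir, output_dir, convert_one):
    """Copy each .mcap file to a local temp dir, then convert it by itself.

    convert_one(local_path, output_dir) is supplied by the caller.
    Returns the names of files that failed, so one bad file does not stop the run.
    """
    failed = []
    # Smallest files first, so problems surface early
    files = sorted(Path(input_dir).glob("*.mcap"), key=lambda p: p.stat().st_size)
    for source in files:
        # The scratch copy is deleted as soon as the file has been handled
        with tempfile.TemporaryDirectory() as scratch:
            local_copy = Path(scratch) / source.name
            shutil.copy2(source, local_copy)
            try:
                convert_one(local_copy, output_dir)
            except Exception as exc:  # report and continue with the next file
                print(f"Failed on {source.name}: {exc}")
                failed.append(source.name)
    return failed


# Possible usage with the parser shown earlier (untested sketch):
# def convert_one(path, out):
#     parser = MCAPParser(str(path))
#     parser.read_mcap()
#     parser.create_output(out)
#
# failed = convert_locally("path/to/mcap/files", "path/to/output", convert_one)
```

Handling one file at a time keeps only a single recording in memory and on scratch disk, at the cost of giving up the parallelism that `--async` offers.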
For issues and feature requests, please use the GitHub issue tracker. Include:
- Your operating system and Python version
- Steps to reproduce the problem
- Any relevant error messages
- Example data (if possible)
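
To gather the first item on this list, a short standard-library script such as the following can be run in the environment where the problem occurs. `environment_report` is a hypothetical helper, not part of the toolkit.

```python
import platform
import sys


def environment_report():
    """Return the OS and Python details commonly requested in bug reports."""
    return "\n".join(
        [
            f"OS: {platform.platform()}",
            f"Python: {sys.version.split()[0]}",
            f"Executable: {sys.executable}",
        ]
    )


print(environment_report())
```

The executable path is included because it shows which conda environment was active when the script ran.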
