MODAQ Toolkit

A Python library that converts MODAQ MCAP files into columnar time series data stored as Apache Parquet. Parquet combines typed schemas, compact storage, and broad cross-language support, which makes the converted data smaller than CSV and easier to work with than raw MODAQ output.

Why Parquet?

Database-like features including partitioning and filtering
Schema enforcement with strict typing
Typically 10-20x smaller files than CSV
Wide language support through official and community libraries:
- Python:
  - pandas
  - Polars
- MATLAB: Built-in support
- C++: Arrow
- Rust: Polars

Quick Start

Quick install from GitHub:

pip install git+https://github.com/MODAQ2/MODAQ_toolkit.git

For detailed installation options including conda environments, specific versions, and development setup, see the Installation section below.

Usage

Command Line

Basic usage:

modaq -i /path/to/mcap/files -o /path/to/output

Available arguments:

| Argument | Description | Required | Default |
| --- | --- | --- | --- |
| `-i`, `--input-dir` | Directory containing MCAP files | Yes | - |
| `-o`, `--output-dir` | Directory for output | No | `./data/` |
| `--async` | Enable asynchronous processing of MCAP files (much faster, but uses more cores) | No | `False` |

Example with all options:

modaq -i /path/to/mcap/files -o /path/to/output --async
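
When the converter is launched from a script rather than a shell, the same arguments can be assembled programmatically. The sketch below is illustrative only: `build_modaq_command` and `run_modaq` are hypothetical helpers, not part of the toolkit, and they use only the three flags documented above.

```python
import subprocess


def build_modaq_command(input_dir, output_dir=None, use_async=False):
    """Assemble the argument list for the `modaq` CLI (hypothetical helper)."""
    cmd = ["modaq", "-i", str(input_dir)]
    if output_dir is not None:
        # When omitted, the CLI falls back to its documented default, ./data/
        cmd += ["-o", str(output_dir)]
    if use_async:
        cmd.append("--async")
    return cmd


def run_modaq(input_dir, output_dir=None, use_async=False):
    """Run the conversion; requires the toolkit to be installed and on PATH."""
    cmd = build_modaq_command(input_dir, output_dir, use_async)
    # check=True raises CalledProcessError if the CLI exits with an error code
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    print(" ".join(build_modaq_command("/path/to/mcap/files", "/path/to/output", True)))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.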

Python

# Convert multiple MCAP files
from modaq_toolkit import process_mcap_files
process_mcap_files("input_directory", "output_directory")

# Or work with a single file
from modaq_toolkit import MCAPParser
parser = MCAPParser("path/to/file.mcap")
parser.read_mcap()
parser.create_output("output_directory")
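
For large batches it can help to convert files one at a time so that a single bad file does not stop the run. The following is a minimal sketch that assumes the `MCAPParser` methods shown above; `find_mcap_files` and `convert_each` are hypothetical helper names, and the `.mcap` extension filter is an assumption. How repeated `create_output` calls into the same directory interact is not covered here, so check the toolkit's behavior before relying on it.

```python
from pathlib import Path


def find_mcap_files(input_dir):
    """Return the .mcap files directly inside input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".mcap"
    )


def convert_each(input_dir, output_dir):
    """Convert files one by one, collecting failures instead of stopping."""
    # Imported here so the file-discovery helper works without the toolkit.
    from modaq_toolkit import MCAPParser

    failures = []
    for mcap_path in find_mcap_files(input_dir):
        try:
            parser = MCAPParser(str(mcap_path))
            parser.read_mcap()
            parser.create_output(output_dir)
        except Exception as exc:  # broad on purpose: record and keep going
            failures.append((mcap_path, exc))
    return failures
```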

Output Structure

The toolkit processes data in two stages:

a1_one_to_one/: Preserves the original data structure
- Contains Parquet files organized by topic
- Includes metadata describing schemas and structure
a2_real_data/: Optimizes data for time series analysis
- Expands array data for easier analysis
- Maintains same file organization as stage 1
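
To see what a conversion produced, the stage directories can be walked with the standard library. This is a minimal sketch that assumes each topic's Parquet files sit in a subdirectory named after the topic (the exact layout is not specified here); `list_parquet_by_topic` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def list_parquet_by_topic(output_dir, stage="a1_one_to_one"):
    """Group .parquet files under a stage directory by parent folder name."""
    stage_dir = Path(output_dir) / stage
    by_topic = defaultdict(list)
    for parquet_path in sorted(stage_dir.rglob("*.parquet")):
        # The parent folder name is treated as the topic (an assumption).
        by_topic[parquet_path.parent.name].append(parquet_path)
    return dict(by_topic)


if __name__ == "__main__":
    topics = list_parquet_by_topic("path/to/output", stage="a2_real_data")
    for topic, files in topics.items():
        print(f"{topic}: {len(files)} file(s)")
```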

Using Parquet Data

Python with pandas

Read all Parquet files from a directory:

from pathlib import Path

import pandas as pd

# Read all parquet files in a directory
# pathlib accepts forward slashes on Windows, macOS, and Linux
directory = Path("path/to/parquet/files").resolve()

df = pd.read_parquet(directory)

# Ensure that the data is sorted by time
df = df.sort_index()

# Print a summary of the df (info() prints directly and returns None)
df.info()

# Print the first 5 rows of df
print(df.head())

Read a single Parquet file:

from pathlib import Path

import pandas as pd

parquet_file = Path("path/to/file.parquet").resolve()
# Read single parquet file and sort by time
df = pd.read_parquet(parquet_file)
df = df.sort_values('time')

# Print a summary of the df (info() prints directly and returns None)
df.info()

# Print the first 5 rows of df
print(df.head())
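
Once a frame is sorted by time, a common next step is resampling to a fixed interval. The sketch below uses a small in-memory frame as a stand-in for converted data; the `voltage` column and the assumption that `time` holds datetime values are illustrative only, so adjust them to the actual schema.

```python
import pandas as pd

# Stand-in for a frame loaded with pd.read_parquet(); values are made up.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2024-01-01 00:00:02",
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:01",
                "2024-01-01 00:00:03",
            ]
        ),
        "voltage": [3.0, 1.0, 2.0, 4.0],
    }
)

# Sort by time, then use it as the index so time-based operations work.
df = df.sort_values("time").set_index("time")

# Average the samples that fall in each 2-second window.
resampled = df.resample(pd.Timedelta(seconds=2)).mean()
print(resampled)
```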

MATLAB

Read all Parquet files from a directory:

% Create a datastore for the directory
ds = parquetDatastore("path/to/parquet/files", "OutputType", "timetable");

% Preview the dataset
preview(ds)

% Read all data into a single timetable
df = readall(ds);

% Sort by time
df = sortrows(df, 'time');

Read a single Parquet file:

% Read single parquet file
df = parquetread("path/to/file.parquet");

Note: MODAQ data always includes a time column which should be used for sorting and time series analysis.

Installation

1. Prerequisites: Install Python/Anaconda

For NLR Enterprise Users:

Open "Portal Manager" on your NLR workstation
Browse to the Anaconda package
Click "Install" to get the full Anaconda distribution with Python

For Non-NLR Users:

Download either:
- Anaconda (full distribution with many pre-installed packages)
- Miniconda (minimal distribution, recommended for most users)

Then create and activate a new environment:

# Create new environment with Python 3.10
conda create -n modaq python=3.10

# Activate the environment (same command on Windows, macOS, and Linux with conda 4.4+)
conda activate modaq

2. Easy Install (pip from GitHub)

For most users who just want to use the package, this is the recommended method:

pip install git+https://github.com/MODAQ2/MODAQ_toolkit.git

3. Complete Install (Clone & Development Setup)

For developers or those who need to modify the code:

Clone the repository:

# HTTPS (Recommended for most users)
git clone https://github.com/MODAQ2/MODAQ_toolkit.git

# SSH (For contributors)
git clone git@github.com:MODAQ2/MODAQ_toolkit.git

Navigate to repository:

cd MODAQ_toolkit  # Adjust path as needed

Install in development mode:

# Standard development install
pip install -e .

# Full development install with testing tools
pip install -e ".[dev]"

Development install benefits:

Code changes take effect immediately
Source remains in original location
Enables contributing back to the project
Includes development tools with [dev] option

Common Issues and Solutions

Installation Problems

Missing Dependencies
- Make sure you're using Python 3.10+
- Check that all requirements are installed: pip list
- Try reinstalling in a fresh conda environment
Import Errors
- Verify your environment is activated
- Ensure you're in the correct directory when installing
- Check for any conflicting packages
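
Some of the checks above can be automated. The following sketch uses only the standard library; `check_environment` is a hypothetical helper, and the 3.10 minimum is taken from the list above.

```python
import importlib.util
import sys


def check_environment(package="modaq_toolkit", min_version=(3, 10)):
    """Report whether Python is new enough and the package can be found."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_version,
        "package_found": importlib.util.find_spec(package) is not None,
        # The interpreter path shows which environment is actually active.
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `package_found` is false while `pip list` shows the package, the interpreter path usually reveals that a different environment is active.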

Data Processing

Large Files
- Process files individually for better memory management
- Use the command line tool for batch processing
- Monitor system resources during conversion
- Avoid processing files directly from OneDrive, network drives, or remote servers
  - These files must be downloaded locally first, which can significantly slow processing
  - Copy files to a local drive before processing for best performance
Output Structure
- Check permissions in output directory
- Verify input MCAP file integrity
- Review logs for processing errors
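
The advice to copy files to a local drive before processing can be scripted. This is a minimal sketch using only the standard library; `stage_locally` is a hypothetical helper and the `.mcap` filter is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def stage_locally(remote_dir, local_dir=None):
    """Copy .mcap files from remote_dir to a local directory; return its path."""
    if local_dir is None:
        # Create a fresh temporary directory when no target is given.
        local_dir = tempfile.mkdtemp(prefix="modaq_stage_")
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(remote_dir).glob("*.mcap")):
        # copy2 copies file contents and preserves timestamps where possible.
        shutil.copy2(src, local_dir / src.name)
    return local_dir
```

The returned path can then be passed to `modaq -i` or `process_mcap_files`. The staged copies are not cleaned up automatically, so delete them once the conversion has finished.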

Support

For issues and feature requests, please use the GitHub issue tracker. Include:

Your operating system and Python version
Steps to reproduce the problem
Any relevant error messages
Example data (if possible)
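
The operating system and Python version can be gathered with a few standard-library calls. A small sketch; `environment_report` is a hypothetical helper, not a toolkit function.

```python
import platform
import sys


def environment_report():
    """Return a short text block describing the OS and Python version."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()} ({sys.executable})",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste the printed text into the issue description.
    print(environment_report())
```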
