closure is a flexible, reproducible, and distributed-ready machine learning framework for training on field files produced by the ECsim/iPiC3D Particle-in-Cell codes, with a focus on fluid closure. It provides tools for data loading, preprocessing, model training, evaluation, and experiment management, and supports both single-node and distributed (multi-GPU/multi-node) workflows.
- Experiment Management: Organizes experiments in a clear directory structure. Each experiment (called a "run") is stored in its own subfolder, with all configs, logs, and results saved for reproducibility.
- Flexible Data Handling: Loads and preprocesses large scientific datasets (e.g., from simulations or experiments), supporting normalization, filtering, patch extraction, and custom feature/target selection.
- Model Training: Supports a variety of neural network architectures (MLP, FCNN, ResNet, etc.) and training protocols, including distributed training with PyTorch.
- Evaluation & Analysis: Provides utilities for loss/metric evaluation, plotting, and comparison across runs or experiments.
- Reproducibility: Every run is fully specified by a `config.json` file, making it easy to reproduce or compare results.
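Since the full configuration of a run lives in its `config.json`, inspecting or reproducing a past run starts from that file. A minimal sketch (the run path is illustrative):

```python
import json

# Illustrative run directory; every run stores its full configuration in config.json
run_dir = "work_dir/0"

with open(f"{run_dir}/config.json") as f:
    config = json.load(f)

# Inspect the settings that produced this run, e.g. the model configuration
print(config.get("model_kwargs", {}))
```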
- `closure/src/` — Core source code (trainers, datasets, models, utilities, etc.)
- `closure/examples/` — Example notebooks and scripts (see `tuto_train_haydn.ipynb`)
- `work_dir/` — User-specified experiment directory; each run is a subfolder (e.g., `work_dir/0`, `work_dir/nosubsample`, etc.)
Option 1: Install via conda:

```bash
conda env create -f gputorch.yml
```

Option 2: Install dependencies manually:

- Python 3.x
- PyTorch
- numpy, pandas, joblib, scipy, matplotlib
- Organize your simulation or experimental data as described in the tutorial notebook.
- Prepare CSV files listing your train/val/test samples.

See `examples/tuto_train_haydn.ipynb` for a full walkthrough.
```python
from closure.src.trainers import Trainer

dataset_kwargs = {...}    # dataset configuration; see the notebook for details
load_data_kwargs = {...}  # data-loading configuration
model_kwargs = {...}      # model architecture and hyperparameters

trainer = Trainer(
    work_dir="experiments/my_experiment",
    dataset_kwargs=dataset_kwargs,
    load_data_kwargs=load_data_kwargs,
    model_kwargs=model_kwargs,
    device="cuda"
)

trainer.fit()
```

A tutorial on the usage of the trainer can be found in `Tutorial_trainer.pdf` in the `examples` folder.
```bash
python -m closure.src.trainers --config work_dir=work_dir --config run=run_name --config model_kwargs.model_name=ResNet
```

- Use `--config key=value` to override config values (nested keys are supported with dot notation).
- The Trainer manages a parent directory (`work_dir`).
- Each run is a subfolder (e.g., `work_dir/0`, `work_dir/nosubsample`) containing:
  - `config.json` — full configuration for the run
  - `model.pth` — trained model weights
  - `loss_dict.pkl` — training/validation loss history
  - `run.log` — log file for the run
  - normalization files (`X.pkl`, `y.pkl`)
- To start a new run, change any config value (e.g., hyperparameters, model, batch size) and call `trainer.fit(config=new_config)`; a new subfolder is created (see the sketch below).
- To reload or compare runs, use the `load_run` method or the provided utilities.
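For instance, continuing from the Trainer snippet above, a second run in the same `work_dir` can be started by passing an updated configuration to `fit`. This sketch assumes a partial config is merged with the current one, as the dot-notation CLI overrides suggest; the run name and override values are illustrative:

```python
# Start a new run with a different model; a subfolder named after "run" is
# created inside work_dir, and previous runs are left untouched.
new_config = {
    "run": "resnet_baseline",                  # illustrative run name
    "model_kwargs": {"model_name": "ResNet"},  # override the architecture
}
trainer.fit(config=new_config)
```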
See `examples/tuto_train_haydn.ipynb` for a step-by-step tutorial, including:
- Setting up dataset and model configs
- Training a model
- Running multiple experiments in the same folder
- Evaluating and comparing results
- Reloading previous runs
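For comparing results by hand (the notebook and `src/utilities.py` provide dedicated helpers), one rough sketch is to load each run's saved loss history. This assumes `loss_dict.pkl` pickles a dictionary with a per-epoch validation-loss entry, which may be structured differently in your runs:

```python
import pickle
import matplotlib.pyplot as plt

# Illustrative run names inside the same work_dir
for run in ["0", "nosubsample"]:
    with open(f"work_dir/{run}/loss_dict.pkl", "rb") as f:
        loss_dict = pickle.load(f)
    # Assumes a key such as "val" holds the per-epoch validation loss
    plt.plot(loss_dict.get("val", []), label=f"run {run}")

plt.xlabel("epoch")
plt.ylabel("validation loss")
plt.legend()
plt.show()
```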
- Configuration Consistency: When running multiple experiments in the same `work_dir`, make sure the dataset configuration (`dataset_kwargs`) is consistent across runs. If you change the dataset, use a new `work_dir` to avoid normalization mismatches.
- Run Naming: Use the `run` key in your config to create a new subfolder for each experiment. This keeps results organized and prevents accidental overwrites.
- Distributed Setup: The script automatically handles distributed training if environment variables (like `WORLD_SIZE`) are set. Make sure your cluster/job environment is configured correctly (a generic sketch follows this list).
- Logging: Each run has its own log file, making it easy to debug and track progress. Logs include information about the host, rank, and device.
- Reproducibility: All configs, normalization parameters, and model weights are saved for each run, ensuring you can always reproduce or analyze past experiments.
- Extensibility: The Trainer is designed to be flexible; custom models, datasets, and training protocols can be plugged in with minimal changes.
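As context for the distributed-setup note, this is a generic sketch of how `WORLD_SIZE`-based detection typically works with PyTorch DDP; it is illustrative only and not closure's actual initialization code:

```python
import os
import torch
import torch.distributed as dist

# Generic pattern: if a launcher (e.g. torchrun or a SLURM wrapper) has set
# WORLD_SIZE, initialize a process group; otherwise fall back to serial mode.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
else:
    rank = 0
print(f"running as rank {rank} of {world_size}")
```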
- `src/trainers.py` — The `Trainer` class: manages configs, logging, datasets, models, and training.
- `src/datasets.py` — Data loading utilities: distributed/serial samplers, normalization, filtering, patch extraction.
- `src/models.py` — Model definitions: MLP, FCNN, ResNet, and the `PyNet` training wrapper.
- `src/utilities.py` — Utility functions: config handling, evaluation, plotting, and more.
- `src/read_pic.py` — Functions for reading iPiC3D/ECsim simulation data in HDF5 or pickle format.
closure provides a powerful and flexible system for loading, preprocessing, and sampling scientific datasets, designed for both single-node and distributed (multi-GPU) workflows.
- Distributed and Serial Sampling: Efficiently split and shuffle data across multiple GPUs or nodes using the `DistributedSampler` and `SubSampler` classes. This ensures each process gets a unique subset of the data, enabling scalable training.
- Channel-based Data Loading: The `ChannelDataLoader` extends PyTorch's `DataLoader` to allow selection of specific feature and target channels (e.g., physical quantities such as density or pressure), as well as patch-based cropping for image-like data.
- Flexible Subsampling and Shuffling: Control the fraction of data used for training or validation via the `subsample_rate` and `subsample_seed` parameters. This is useful for quick prototyping, for handling very large datasets, and for oversampling, which is necessary when using random patch extraction.
- Patch Extraction: For image or field data, random spatial patches can be extracted on the fly, which is useful for data selection or for training convolutional models.
- Feature/Target Normalization: The `DataFrameDataset` class supports normalization (mean/std) and pre-scaling (e.g., log transforms) of both features and targets. Normalization parameters are saved and reused for reproducibility.
- Filtering and Transformations: Apply custom filters (e.g., smoothing, denoising) to features or targets using scipy or custom functions. torchvision-style transforms are supported for data augmentation.
- CSV/Metadata Integration: Datasets are typically defined by CSV files listing sample filenames and metadata, making it easy to manage large and complex datasets.
```python
from closure.src.datasets import DataFrameDataset, ChannelDataLoader

# Build a dataset for one split (here: the training data)
dataset = DataFrameDataset(
    data_folder='/path/to/data',
    norm_folder='/path/to/norm',
    samples_file='/path/to/samples.csv',
    feature_dtype='float32',
    target_dtype='float32',
    scaler_features=None,
    scaler_targets=None,
    transform={'RandomCrop': {'size': (16, 16)}, 'apply': ['train']},
    datalabel='train'
)

# Wrap it in a ChannelDataLoader to select channels, subsample, and crop patches
loader = ChannelDataLoader(
    dataset,
    batch_size=32,
    feature_channel_names=['Bx', 'By', 'Bz'],
    target_channel_names=['Pxx_e', 'Pyy_e'],
    subsample_rate=0.5,
    subsample_seed=42,
    patch_dim=[16, 16],
)

for features, targets in loader:
    # Training loop here
    pass
```

- Prepare CSV files listing your train/val/test samples.
- Configure `dataset_kwargs` to specify data location, features/targets, normalization, and any filters or transforms.
- Create a `DataFrameDataset` for each split (train/val/test), as sketched below.
- Wrap each dataset with a `ChannelDataLoader` for batching, shuffling, and distributed support.
- Pass the loaders to the Trainer for model training and evaluation.
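A minimal sketch of the per-split step, assuming separate CSV files for training and validation and that the remaining `DataFrameDataset` arguments keep the values shown above or their defaults (paths and channel names are placeholders):

```python
# Build one dataset and loader per split; paths are placeholders.
splits = {
    "train": "/path/to/train_samples.csv",
    "val": "/path/to/val_samples.csv",
}

loaders = {}
for label, csv_file in splits.items():
    ds = DataFrameDataset(
        data_folder="/path/to/data",
        norm_folder="/path/to/norm",
        samples_file=csv_file,
        datalabel=label,
    )
    loaders[label] = ChannelDataLoader(
        ds,
        batch_size=32,
        feature_channel_names=["Bx", "By", "Bz"],
        target_channel_names=["Pxx_e", "Pyy_e"],
    )
```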
closure provides a modular and extensible system for defining, training, and evaluating neural network models, with a focus on scientific and physical data. The models are implemented using PyTorch and are designed to be easily configurable and compatible with distributed training.
- Unified Training Wrapper (`PyNet`): The `PyNet` class wraps any supported model architecture and manages the full training lifecycle, including optimizer and scheduler setup, distributed training (via PyTorch DDP), checkpointing, early stopping, and logging. It supports custom optimizers, learning rate schedulers, and metrics.
- Model Zoo: Several neural network architectures are provided out of the box:
  - MLP (Multi-Layer Perceptron): For tabular or flattened data, with configurable layer sizes, activations, and dropout.
  - FCNN (Fully Convolutional Neural Network): For image or field data, with configurable channels, kernels, activations, batch normalization, and dropout.
  - ResNet: Flexible residual network with user-defined skip connections, suitable for deep image-based models.
  - CNet: Example convolutional network for image data.
- Custom Initialization: Models support custom weight and bias initialization, as well as flexible activation and dropout configuration.
- Distributed Training: All models can be wrapped in PyTorch's DistributedDataParallel for multi-GPU or multi-node training.
- Checkpointing and Reloading: Models and training histories can be saved and reloaded from disk, supporting experiment reproducibility and resuming.
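As a rough illustration of the checkpointing point (the Trainer/`PyNet` code provides its own reload helpers), the artifacts saved in a run folder can also be inspected directly. This sketch assumes `model.pth` holds a state dict and `loss_dict.pkl` a pickled loss history:

```python
import pickle
import torch

# Paths follow the run-directory layout described earlier (illustrative run name)
run_dir = "work_dir/0"

state_dict = torch.load(f"{run_dir}/model.pth", map_location="cpu")
with open(f"{run_dir}/loss_dict.pkl", "rb") as f:
    loss_dict = pickle.load(f)

print(list(state_dict.keys())[:5], list(loss_dict.keys()))
```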
- Author: George Miloshevich
- License: MIT License
- Projects: STRIDE, HELIOSKILL
- `examples/tuto_train_haydn.ipynb` — Full tutorial notebook
- Source code docstrings for detailed API documentation
closure is designed for flexibility, reproducibility, and ease of use in scientific ML workflows. Contributions and feedback are welcome!