All references to 'toolbox' are subject to change upon appropriate package naming.
The documentation for this package is available here.
Please note that the documentation is still under construction.
Click here for a guide on how to build your own steps.
Python >= 3.10

For a local, editable version of the toolbox:

```bash
git clone https://github.com/NOC-OBG-Autonomy/toolbox.git
cd toolbox
# create/activate a virtual environment
pip install -e .
```

The toolbox pipeline provides a flexible, modular framework for defining, executing, and visualising multi-step data-processing workflows, such as those used for autonomous underwater glider missions.
Each pipeline is composed of a series of steps that are automatically discovered, registered, and executed in sequence. This enables users to build, extend, and visualise complex data workflows using a simple YAML configuration file.
The Pipeline class orchestrates the flow of data through a sequence of modular “steps.”
Each step performs a specific processing task (e.g., data loading, quality control, profile detection, export).
Key characteristics:
| Characteristic | Description |
|---|---|
| Configuration-driven | Users define the workflow in a YAML file describing each step, its parameters, and diagnostics options. |
| Dynamic discovery | New steps are auto-registered via decorators and discovered dynamically when imported. |
| Composable | Steps can be nested to form sub-pipelines (e.g., a calibration block within a larger workflow). |
| Context-aware | Each step passes its results into a shared context, which is used as input for subsequent steps. |
| Visualisable | A Graphviz diagram can be automatically generated to visualise pipeline structure and dependencies. |
- **Initialise the pipeline**

  ```python
  from toolbox.pipeline import Pipeline

  pipeline = Pipeline(config_path="my_pipeline.yaml")
  ```

  This:

  - Loads the pipeline configuration from YAML.
  - Discovers all available step classes using `@register_step`.
  - Validates step dependencies defined in `STEP_DEPENDENCIES`.
- **Define and register steps**

  Each step inherits from `BaseStep` and registers itself via a decorator:

  ```python
  from toolbox.steps.base_step import register_step, BaseStep

  @register_step
  class LoadOG1(BaseStep):
      step_name = "Load OG1"

      def run(self):
          self.log("Loading NetCDF data...")
          # Do processing
          return self.context
  ```

  Registered steps are stored in a global registry (`REGISTERED_STEPS`) and automatically imported at runtime by `toolbox.steps.discover_steps()`.
- **Run the pipeline**

  Running the pipeline executes each step in order, passing context forward:

  ```python
  results = pipeline.run()
  ```

  Internally:

  - Each step is instantiated via `create_step()`.
  - The `run()` method is called.
  - The returned context (e.g., a processed `xarray.Dataset`) is merged and passed to the next step.
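
  Conceptually, that loop behaves like the sketch below. It is only an illustration of the context passing described above, not the toolbox's actual implementation; the function signature and the dictionary default are assumptions.

  ```python
  # Conceptual sketch only: the real Pipeline internals may differ.
  # Assumes create_step() builds a step object from its YAML entry and that
  # each step's run() returns the (possibly updated) shared context.
  def run_steps(step_configs, create_step, context=None):
      context = context or {}
      for config in step_configs:
          step = create_step(config)   # instantiate the step from its config entry
          step.context = context       # hand it the results of previous steps
          context = step.run()         # run() returns the updated context
      return context
  ```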
- **Enable diagnostics and visualisation**

  Steps can optionally include diagnostic plots or summaries by setting:

  ```yaml
  diagnostics: true
  ```

  If the visualisation option is enabled in the pipeline configuration:

  ```yaml
  pipeline:
    visualize: true
  ```

  a Graphviz diagram of the pipeline is generated, showing step dependencies and flow:

  ```
  [Load OG1] → [Derive CTD*] → [Find Profiles] → [Data Export]   *diagnostics enabled
  ```
- **Export the configuration**

  The entire pipeline configuration can be exported to a YAML file for reproducibility:

  ```python
  pipeline.export_config("exported_pipeline.yaml")
  ```
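
  Because the exported YAML fully describes the workflow, it can be used to rebuild the same pipeline later (a minimal usage sketch; the file name is illustrative):

  ```python
  from toolbox.pipeline import Pipeline

  # Rebuild an identical pipeline from the exported configuration and rerun it
  rerun = Pipeline(config_path="exported_pipeline.yaml")
  results = rerun.run()
  ```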
An example YAML configuration for a simple pipeline:

```yaml
pipeline:
  name: "Data Processing Pipeline"
  description: "Process and analyze multi-dimensional glider data"
  visualisation: false

  steps:
    - name: "Load OG1"
      parameters:
        file_path: "../../examples/data/OG1/Doombar_648_R.nc"
        add_meta: false
      diagnostics: false

    - name: "Derive CTD"
      parameters:
        interpolate_latitude_longitude: true
      diagnostics: true

    - name: "Find Profiles"
      parameters:
        gradient_thresholds: [0.02, -0.02]
      diagnostics: false

    - name: "Data Export"
      parameters:
        export_format: "hdf5"
        output_path: "../../examples/data/OG1/exported_Doombar_648_R.nc"
```

A full breakdown can be found in the Developer Guide.
TL;DR:

- Create a file under `toolbox/steps/custom/variables` (e.g., `qc_salinity.py`).
- Define a class that inherits from `BaseStep`.
- Decorate it with `@register_step` and provide a unique `step_name`.
- Implement a `run()` method that processes data and returns an updated context (see the sketch below).
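
A sketch of what such a step might look like, following the registration pattern shown earlier, is below. The class name, flagging logic, and variable names are illustrative assumptions, not part of the toolbox.

```python
# toolbox/steps/custom/variables/qc_salinity.py
# Illustrative sketch: the flagging logic and variable names are assumptions.
from toolbox.steps.base_step import register_step, BaseStep


@register_step
class QCSalinity(BaseStep):
    step_name = "QC Salinity"

    def run(self):
        self.log("Flagging out-of-range salinity values...")
        ds = self.context  # e.g., an xarray.Dataset produced by earlier steps
        if "PSAL" in ds:
            # Flag practical salinity outside a plausible open-ocean range
            ds["PSAL_QC"] = ((ds["PSAL"] < 2) | (ds["PSAL"] > 41)).astype("int8")
        return ds
```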
Key benefits:

- Declarative workflow: Define what happens, not how it’s executed.
- Extensible design: Add new steps without modifying core code.
- Integrated diagnostics: Flag and visualise data quality issues inline.
- Portable & reproducible: YAML configurations make it easy to rerun or share pipelines.