10 changes: 9 additions & 1 deletion .gitignore
@@ -44,7 +44,15 @@ cmake-build-*
files_location*.txt
capio_logs

#Doxygen generated documentation
doxy/html
doxy/latex
doxy/doxygen-awesome-css-*
doxy/theme

# Other
debug
build

.devcontainer
.DS_Store
*.alive_connection
282 changes: 129 additions & 153 deletions README.md
@@ -1,31 +1,52 @@
# CAPIO: Cross-Application Programmable I/O

CAPIO is a middleware aimed at injecting streaming capabilities into workflow steps
without changing the application codebase. It has been proven to work with C/C++ binaries, Fortran, Java, Python, and
Bash.

[![codecov](https://codecov.io/gh/High-Performance-IO/capio/graph/badge.svg?token=6ATRB5VJO3)](https://codecov.io/gh/High-Performance-IO/capio) ![CI-Tests](https://github.com/High-Performance-IO/capio/actions/workflows/ci-tests.yaml/badge.svg) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://raw.githubusercontent.com/High-Performance-IO/capio/master/LICENSE)

> [!TIP]
> CAPIO is now multi-backend and dynamic by nature: you do not need MPI to benefit from the in-memory I/O
> improvements! Just use an MTCL-provided backend if you want in-memory I/O, or fall back to the file system backend
> (the default) if you just want to coordinate I/O operations between workflow steps!

Supported architectures:
- ![Architecture](https://img.shields.io/badge/Architecture-x86__64_/_amd64-50C878.svg)
- ![Architecture](https://img.shields.io/badge/Architecture-RISC--V_(riscv64)-50C878.svg)
- ![Architecture](https://img.shields.io/badge/Architecture-ARM64_coming_soon-red.svg)

---
## 📦 Automatic Install with Spack

CAPIO is on Spack! To install it automatically, just add the High-Performance-IO
repository to Spack and then install CAPIO:
```bash
spack repo add https://github.com/High-Performance-IO/hpio-spack.git
spack install capio
```

> [!WARNING]
> To use this method, you need Spack >= v1.0.0.

## 🔧 Manual Build and Install

### Dependencies

**Required manually:**

- `cmake >= 3.15`
- a `C++20`-compliant compiler
- `pthreads`

**Fetched/compiled during configuration:**

- [syscall_intercept](https://github.com/pmem/syscall_intercept) - Intercepts and handles Linux system calls
- [Taywee/args](https://github.com/Taywee/args) - Parses user input arguments
- [simdjson/simdjson](https://github.com/simdjson/simdjson) - Fast parsing of JSON configuration files
- [MTCL](https://github.com/ParaGroup/MTCL) - Provides abstractions over multiple communication backends

### Compile CAPIO

```bash
git clone https://github.com/High-Performance-IO/capio.git capio && cd capio
mkdir build && cd build
cmake ..
cmake --build . -j$(nproc)
sudo cmake --install .
```

To enable logging support, pass `-DCAPIO_LOG=TRUE` during the CMake configuration phase.
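
For example, assuming the out-of-source `build` directory used above, a minimal logging-enabled configuration could look like this (a sketch, not the only valid invocation):

```bash
# Reconfigure with logging compiled in, then rebuild.
cmake -DCAPIO_LOG=TRUE ..
cmake --build . -j$(nproc)
```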

---

## 🧑‍💻 Using CAPIO in Your Code

Good news! You **don’t need to modify your application code**. Just follow these steps:

### 1. Create a Configuration File *(optional but recommended)*

Write a CAPIO-CL configuration file to inject streaming into your workflow. Refer to
the [CAPIO-CL Docs](https://capio.hpc4ai.it/docs/coord-language/) for details.
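
As a minimal sketch, here is a trimmed version of the example that shipped with earlier revisions of this README (the workflow name `my_workflow`, the `writer`/`reader` steps, and `file0.dat` are purely illustrative):

```json
{
  "name": "my_workflow",
  "IO_Graph": [
    {
      "name": "writer",
      "output_stream": ["file0.dat"],
      "streaming": [
        {
          "name": ["file0.dat"],
          "committed": "on_close"
        }
      ]
    },
    {
      "name": "reader",
      "input_stream": ["file0.dat"]
    }
  ]
}
```

Here `"committed": "on_close"` lets the reader start consuming `file0.dat` as soon as the writer closes it, instead of waiting for the writer process to terminate.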

### 2. Launch the Workflow with CAPIO

To launch your workflow with CAPIO, you can follow two routes:

#### A) Use `capiorun` for simplified operations

You can simplify the execution of workflow steps with CAPIO using the `capiorun` utility. See the
[`capiorun` documentation](capio-run/readme.md) for usage and examples. `capiorun` provides an easier way to manage
daemon startup and environment preparation, so that users do not need to set up the environment manually.

#### B) Manually launch CAPIO

Launch the CAPIO daemons: start one daemon per node. Optionally, set `CAPIO_DIR` to define the CAPIO mount point:

```bash
[CAPIO_DIR=your_capiodir] capio_server -c conf.json
```

> [!CAUTION]
> If `CAPIO_DIR` is not set, it defaults to the current working directory.
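
On a multi-node allocation you need one daemon per node. CAPIO no longer depends on MPI, but any per-node launcher works; for instance, earlier revisions of this README started the daemons like this (the hostfile name is illustrative):

```bash
# One capio_server per node, via an MPI-style launcher.
CAPIO_DIR=your_capiodir mpiexec -N 1 --hostfile your_hostfile capio_server -c conf.json
```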

You can now start your application. Just set the required environment variables, and remember to set `LD_PRELOAD` to
the `libcapio_posix.so` intercepting library:


```bash
CAPIO_DIR=your_capiodir \
CAPIO_WORKFLOW_NAME=wfname \
CAPIO_APP_NAME=appname \
LD_PRELOAD=libcapio_posix.so \
./your_app <args>

# Once the workflow has finished, gracefully shut down the daemon (see tip below).
killall -USR1 capio_server
```

> [!CAUTION]
> If `CAPIO_APP_NAME` and `CAPIO_WORKFLOW_NAME` are not set (or are set but do not match the values present in the
> CAPIO-CL configuration file), CAPIO will not be able to operate correctly!

> [!TIP]
> To gracefully shut down the CAPIO server instance, just send it the SIGUSR1 signal, as in the example above.
> The `capio_server` process will then automatically clean up and terminate itself!

---

## ⚙️ Environment Variables

### 🔄 Global

| Variable | Description |
|-------------------------|----------------------------------------------------|
| `CAPIO_DIR` | Shared mount point for server and application |
| `CAPIO_LOG_LEVEL` | Logging level (requires `-DCAPIO_LOG=TRUE`) |
| `CAPIO_LOG_PREFIX` | Log file name prefix (default: `posix_thread_`) |
| `CAPIO_LOG_DIR` | Directory for log files (default: `capio_logs`) |
| `CAPIO_CACHE_LINE_SIZE` | Size of a single CAPIO cache line (default: 256KB) |
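
For example, a run with logging enabled might export the following (paths and values are illustrative; `CAPIO_LOG_LEVEL` only has an effect when CAPIO was built with `-DCAPIO_LOG=TRUE`):

```bash
# Illustrative values; the defaults from the table are shown for the log settings.
export CAPIO_DIR=/tmp/capio_dir        # shared mount point for server and apps
export CAPIO_LOG_PREFIX=posix_thread_  # default prefix -> posix_thread_*.log
export CAPIO_LOG_DIR=capio_logs        # default log directory
```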

### 🖥️ Server-Only

| Variable | Description |
|----------------------|----------------------------------------------------------------------------|
| `CAPIO_METADATA_DIR` | Directory for metadata files. Defaults to `CAPIO_DIR`. Must be accessible. |

### 📁 POSIX-Only (Mandatory)

> [!WARNING]
> These variables are required by CAPIO-POSIX. Without them, your app will not behave as configured in the JSON file.

| Variable | Description |
|-----------------------|-------------------------------------------------|
| `CAPIO_WORKFLOW_NAME` | Must match `"name"` field in your configuration |
| `CAPIO_APP_NAME` | Name of the step within your workflow |
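
Tying these to the configuration sketch shown earlier (the `my_workflow` and `writer` names are illustrative and must match your own CAPIO-CL file):

```bash
# CAPIO_WORKFLOW_NAME must match the "name" field of the CAPIO-CL file;
# CAPIO_APP_NAME must match one of the step names in its IO_Graph.
export CAPIO_WORKFLOW_NAME=my_workflow
export CAPIO_APP_NAME=writer
```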

---

## 📖 Extended documentation

Documentation and examples are available on the official site:

🌐 [https://capio.hpc4ai.it/docs](https://capio.hpc4ai.it/docs)

---

## 🐞 Report Bugs & Get Help

- [Create an issue](https://github.com/High-Performance-IO/capio/issues/new)
- [Official Documentation](https://capio.hpc4ai.it/docs)

---

## 👥 CAPIO Team

Made with ❤️ by:

- Marco Edoardo Santimaria – <marcoedoardo.santimaria@unito.it> (Designer & Maintainer)
- Iacopo Colonnelli – <iacopo.colonnelli@unito.it> (Workflow Support & Maintainer)
- Massimo Torquati – <massimo.torquati@unipi.it> (Designer)
- Marco Aldinucci – <marco.aldinucci@unito.it> (Designer)

**Former Members:**

- Alberto Riccardo Martinelli – <albertoriccardo.martinelli@unito.it> (Designer & Maintainer)

---

## 📚 Publications

[![CAPIO](https://img.shields.io/badge/CAPIO-10.1109/HiPC58850.2023.00031-red)](https://dx.doi.org/10.1109/HiPC58850.2023.00031)

[![](https://img.shields.io/badge/CAPIO--CL-10.1007%2Fs10766--025--00789--0-green?style=flat&logo=readthedocs)](https://doi.org/10.1007/s10766-025-00789-0)