evolve is currently in early development and regularly undergoes breaking changes to its core API and functionality. Expect a more stable release in the coming weeks.
evolve is an open-source, platform-agnostic Python framework that enables your data teams to efficiently integrate data from a wide variety of structured and unstructured data sources into your database, data warehouse, or data lake(house), blazingly fast and with minimal memory overhead thanks to the Apache Arrow ecosystem.
It is built for developers with a code-first mindset. You will not find any low-code, clickops, or drag-and-drop shenanigans here. evolve gives you full control over how your data is read, parsed, held in memory, transformed, and finally written to any destination you need.
- Composable - Design your own data pipelines to fit your stack, and add any extra (possibly proprietary) sources or targets you need through evolve's intuitive and lightweight framework philosophy (see the sketch after this list).
- Blazing fast - Zero-copy principles, courtesy of Apache Arrow, give you extremely fast in-memory operations well suited to OLAP, along with easy interoperability with DuckDB, Polars, Spark, DataFusion, and many other query engines.
- Customizable - You choose the backend that you want to use. Do you prefer DataFrames? Use Polars! Or perhaps you prefer to work on data using SQL? Then use the DuckDB backend! It is completely up to you.
- Platform agnostic - Run your ETL/ELT with evolve on your own infrastructure: no vendor lock-in, ever.
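To make the composability point concrete, here is a minimal sketch of what a custom source could look like. The `ev.io.Source` base class and its `read()` hook are assumptions for illustration, not confirmed evolve API; the grounded idea is simply that every source maps its input into Arrow, evolve's intermediate representation.

```python
import evolve as ev
import pyarrow as pa

class InMemorySource(ev.io.Source):  # hypothetical base class, assumed for illustration
    """A custom source that serves plain Python dicts as an Arrow table."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def read(self) -> pa.Table:  # hypothetical hook name
        # Map the custom input into Arrow, the shared intermediate representation
        return pa.Table.from_pylist(self.rows)
```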
flowchart TD
    %% Sources (Connectors)
    subgraph Sources
        CSV[Local CSV Source]
        JSON[HDFS JSON Source]
        Parquet[S3 Parquet Source]
        SQL[SQL Source]
        Custom[Custom Source]
    end

    %% Intermediate Representation
    subgraph Backend
        Arrow[Apache Arrow / Polars / DuckDB / Custom]
    end

    %% Targets (Connectors)
    subgraph Targets
        S3[S3 object store]
        Local[Local file system]
        HDFS[Hadoop file system]
        DW[Data Warehouse]
        ML[ML Pipeline]
        Viz[Visualization]
        CustomOut[Custom Format]
    end

    %% Mapping logic
    CSV -->|Map to Arrow| Arrow
    JSON -->|Map to Arrow| Arrow
    SQL -->|Map to Arrow| Arrow
    Custom -->|Conditional Mapping| Arrow
    Parquet -->|Direct Mapping| S3

    Arrow --> S3
    Arrow --> Local
    Arrow --> HDFS
    Arrow --> DW
    Arrow --> ML
    Arrow --> Viz
    Arrow --> CustomOut
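The diagram's key idea is that every source funnels into one Arrow-based in-memory representation, which several engines can then consume without copying. The sketch below shows that interoperability using only the public APIs of pyarrow, Polars, and DuckDB (no evolve involved); the table contents are made up for illustration.

```python
import duckdb
import polars as pl
import pyarrow as pa

# An Arrow table acts as the shared in-memory representation
orders = pa.table({"order_id": [1, 2, 3], "amount": [250.0, 75.5, 120.0]})

# Polars wraps the same Arrow buffers (zero-copy for most dtypes)
df = pl.from_arrow(orders)

# DuckDB queries the Arrow table in place via its replacement scans
total = duckdb.sql("SELECT sum(amount) AS total FROM orders").fetchall()
```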
import evolve as ev

# Pipelines are lazy - only run when told to
pipeline = ev.Pipeline("parquet-ingestion") \
    .with_source(ev.io.FixedWidthFile(...)) \
    .with_target(ev.io.ParquetFile(...)) \
    .with_transform(DropNulls(columns=(...,)))

pipeline.run()  # runs the ETL
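The `with_transform` hook accepts your own transforms as well. Here is a minimal sketch of one, assuming transforms subclass a base class and operate on Arrow tables; `ev.transforms.Transform`, its `apply()` hook, and the column name are hypothetical, not confirmed evolve API.

```python
import evolve as ev
import pyarrow as pa
import pyarrow.compute as pc

class LowercaseColumns(ev.transforms.Transform):  # hypothetical base class
    """Lowercase the string columns named at construction time."""

    def __init__(self, columns: list[str]):
        self.columns = columns

    def apply(self, table: pa.Table) -> pa.Table:  # hypothetical hook name
        for name in self.columns:
            idx = table.schema.get_field_index(name)
            table = table.set_column(idx, name, pc.utf8_lower(table.column(name)))
        return table
```

An instance would then be passed just like `DropNulls` above, e.g. `.with_transform(LowercaseColumns(columns=["customer_name"]))`.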
You can also configure a pipeline declaratively with YAML or JSON:

source:
  type: postgres
  host: localhost
  db: prod
  user: admin
  password: secret
  schema: sales
  tables: orders

transforms:
  - type: drop_nulls
    columns: ["order_id", "amount"]
  - type: rename_columns
    mapping:
      order_id: id
      amount: total
  - type: filter_rows
    condition: "total > 100"

target:
  type: parquet
  path: s3://prod/sales/orders.parquet
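Loading such a file might look like the sketch below; a `from_yaml` constructor is an assumption for illustration, not confirmed evolve API, and the file name is made up.

```python
import evolve as ev

# Hypothetical loader: from_yaml is assumed for illustration
pipeline = ev.Pipeline.from_yaml("orders_pipeline.yaml")
pipeline.run()
```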
evolve is distributed under the terms of both the MIT License and the Apache License (version 2.0). See LICENSE-APACHE and LICENSE-MIT for details.
