evolve logo

A highly efficient, composable, and lightweight ETL and data integration framework.




evolve is currently in early development, and its core API and functionality are still subject to breaking changes. Expect a more stable version to be released in the coming weeks.

evolve is an open-source, platform-agnostic Python framework that enables your data teams to efficiently integrate data from a wide variety of structured or unstructured data sources into your database, data warehouse, or data lake(house) blazingly fast, with minimal memory overhead thanks to the Apache Arrow ecosystem.

It is built for developers with a code-first mindset. You will not find any low-code, clickops, or drag-and-drop shenanigans here. evolve offers you full control of how your data is read, parsed, handled in-memory, transformed, and finally written to any destination you need.

  • Composable - Design data pipelines that fit your own stack, and add any extra (possibly proprietary) sources or targets you need, all through evolve's intuitive and lightweight framework philosophy.
  • Blazing fast - Zero-copy principles built on Apache Arrow give you extremely fast in-memory operations, perfect for OLAP, and easy interoperability with DuckDB, Polars, Spark, DataFusion, and many more query engines (see the interoperability sketch after this list).
  • Customizable - You choose the backend you want to use. Do you prefer DataFrames? Use Polars! Or would you rather work on data using SQL? Then use the DuckDB backend! It is completely up to you.
  • Platform agnostic - Run your ETL/ELT using evolve on your own infrastructure. No vendor lock-in, ever.
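
Because evolve keeps data in Arrow's columnar format, handing a table from one engine to another is cheap. The sketch below uses only the public pyarrow, Polars, and DuckDB APIs (it does not touch evolve itself) to show the kind of zero-copy hand-off the framework builds on:

import duckdb
import polars as pl
import pyarrow as pa

# An Arrow table: the common in-memory currency between engines.
orders = pa.table({"order_id": [1, 2, 3], "amount": [50.0, 120.0, 300.0]})

# A Polars DataFrame over the same Arrow buffers (zero-copy where possible).
df = pl.from_arrow(orders)

# DuckDB can query the Arrow table in place via its replacement scans,
# which resolve the Python variable `orders` directly.
result = duckdb.sql("SELECT * FROM orders WHERE amount > 100").arrow()
print(result)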

Architecture (alpha version)

flowchart TD
    %% Sources (Connectors)
    subgraph Sources
        CSV[Local CSV Source]
        JSON[HDFS JSON Source]
        Parquet[S3 Parquet Source]
        SQL[SQL Source]
        Custom[Custom Source]
    end

    %% Intermediate Representation
    subgraph Backend
        Arrow[Apache Arrow / Polars / DuckDB / Custom]
    end

    %% Targets (Connectors)
    subgraph Targets
        S3[S3 object store]
        Local[Local file system]
        HDFS[Hadoop file system]
        DW[Data Warehouse]
        ML[ML Pipeline]
        Viz[Visualization]
        CustomOut[Custom Format]
    end

    %% Mapping logic
    CSV -->|Map to Arrow| Arrow
    JSON -->|Map to Arrow| Arrow
    SQL -->|Map to Arrow| Arrow
    Custom -->|Conditional Mapping| Arrow
    Parquet -->|Direct Mapping| S3

    Arrow --> S3
    Arrow --> Local
    Arrow --> HDFS
    Arrow --> DW
    Arrow --> ML
    Arrow --> Viz
    Arrow --> CustomOut
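
The Custom Source node in the diagram is where proprietary connectors plug in. evolve's connector base classes are not documented in this README, so the following is only a hypothetical sketch of what such a connector could look like: it reads records from somewhere and maps them into an Arrow table, the common intermediate representation.

import pyarrow as pa

class InMemorySource:
    """Hypothetical custom source; evolve's real connector interface may differ."""

    def __init__(self, records: list[dict]):
        self.records = records

    def read(self) -> pa.Table:
        # Map raw Python records into Arrow, the common in-memory
        # representation shown in the diagram above.
        return pa.Table.from_pylist(self.records)

# Usage: the resulting Arrow table can be handed to any backend or target.
source = InMemorySource([{"id": 1, "total": 150.0}, {"id": 2, "total": 80.0}])
print(source.read())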

Example usage

import evolve as ev

# Pipelines are lazy - only run when told to
pipeline = ev.Pipeline("parquet-ingestion") \
    .with_source(ev.io.FixedWidthFile(...)) \
    .with_target(ev.io.ParquetFile(...)) \
    .with_transform(DropNulls(columns=(..., )))

pipeline.run()  # runs the ETL

You can also configure pipelines with YAML or JSON:

source:
  type: postgres
  host: localhost
  db: prod
  user: admin
  password: secret
  schema: sales
  tables: orders

transforms:
  - type: drop_nulls
    columns: ["order_id", "amount"]
  - type: rename_columns
    mapping:
      order_id: id
      amount: total
  - type: filter_rows
    condition: "total > 100"

target:
  type: parquet
  path: s3://prod/sales/orders.parquet
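
The README does not show how a config file is wired into a pipeline, so the loader below is hypothetical: it only illustrates how the YAML above could map onto the fluent API (ev.Pipeline.from_config is an assumed name, not a documented evolve function).

import yaml  # requires PyYAML
import evolve as ev

with open("pipeline.yml") as f:
    cfg = yaml.safe_load(f)  # parses the YAML above into nested dicts/lists

# Hypothetical: assumes a from_config constructor that builds the same
# Pipeline the fluent builder in "Example usage" would produce.
pipeline = ev.Pipeline.from_config(cfg)
pipeline.run()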

License

evolve is distributed under the terms of both the MIT License and the Apache License (version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.
