GitHub - ariaghora/villard: A pipeline framework for data science projects

A tiny layer to organize your data science project

About

Sometimes you only need to accommodate frequent experiment pipeline changes and track everything. Almost always, your project is not FAANG-scaled and you just need a simple tool to organize your experiment code. Maybe you will like this.

Villard manages your data science pipelines by splitting a big project into smaller, discrete steps. This encourages a maintainable and reproducible workflow. You may not even need to change much of your existing code.

What you can expect from Villard:

An experiment pipeline management framework
An experiment tracker and explorer

Installation

pip install git+https://github.com/ariaghora/villard

Quick start

For starters, use the following command to create a project:

$ villard create example
$ cd example

This will give you a directory named example with the following structure (and example data):

example
├── config.jsonnet
├── data
│   ├── 01_raw
│   │   ├── employee.csv
│   │   └── position.csv
│   ├── 02_intermediate
│   ├── 03_output
│   └── 04_report
└── steps.py

The steps.py file contains the following code:

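from villard import pipeline


# Register merge_data as a pipeline step; its arguments are supplied by config.jsonnet.
@pipeline.step("merge_data")
def merge_data(df_employee, df_position):
    # Join the two dataframes on their shared "id" column.
    merged = df_employee.merge(df_position, on="id")
    return merged


@pipeline.step("sort_data")
def sort_data(df_merged, by, ascending):
    # Sort the merged dataframe, then write it out under the name "merged_and_sorted".
    merged_and_sorted = df_merged.sort_values(by=by, ascending=ascending)
    pipeline.write_data("merged_and_sorted", merged_and_sorted)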

The steps simply merge two dataframes, sort the result, and write it to disk. We can see how they are glued together by inspecting the config.jsonnet file:

...
    pipeline_definition: {
        _default: {
            merge_data: {
                df_employee: "data::employee",
                df_position: "data::position",
            },
            sort_data: {
                df_merged: "ref::merge_data",
                by: "name",
                ascending: true,
            },
        },
    },
...
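
In this definition, the `ref::merge_data` value is what ties `sort_data` to the output of `merge_data`. The following is a minimal sketch of how such references could be turned into a dependency list. It is not Villard's actual implementation: the `find_dependencies` helper and the `REF_PREFIX` constant are hypothetical names introduced for illustration, and the pipeline is copied into a plain Python dict instead of being loaded from config.jsonnet.

```python
# Hypothetical sketch: NOT Villard's actual implementation.
# The `_default` pipeline from config.jsonnet, copied into a plain Python dict.
pipeline_definition = {
    "merge_data": {
        "df_employee": "data::employee",
        "df_position": "data::position",
    },
    "sort_data": {
        "df_merged": "ref::merge_data",
        "by": "name",
        "ascending": True,
    },
}

# Prefix that marks a parameter as "the output of another step".
REF_PREFIX = "ref::"


def find_dependencies(definition):
    """Map each step name to the list of steps it references via `ref::`."""
    deps = {}
    for step_name, params in definition.items():
        deps[step_name] = [
            value[len(REF_PREFIX):]
            for value in params.values()
            if isinstance(value, str) and value.startswith(REF_PREFIX)
        ]
    return deps


dependencies = find_dependencies(pipeline_definition)
for step_name, step_deps in dependencies.items():
    print(step_name, step_deps)
```

Plain values such as `by: "name"` and the `data::` entries produce no dependency in this sketch; only `ref::` values do.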

We can run the default pipeline by invoking the following command:

$ villard run config.jsonnet

You will see the following output:

No pipeline name specified. Using default pipeline name: _default
  Executing `merge_data`...
⦿ Completed `merge_data`
  Executing `sort_data`...
⦿ Completed `sort_data`
Using default experiment directory: /Users/ghora/.villard/experiments

╒════════════╤════════════════╤══════════════════╕
│ Step       │ Dependencies   │ Execution Time   │
╞════════════╪════════════════╪══════════════════╡
│ merge_data │ []             │ 0:00:00.003964   │
├────────────┼────────────────┼──────────────────┤
│ sort_data  │ ['merge_data'] │ 0:00:00.001557   │
╘════════════╧════════════════╧══════════════════╛

and soon we will have a data/02_intermediate/merged_and_sorted.csv file.
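
To see what the two steps compute without running the pipeline, the same pandas calls can be tried on small in-memory dataframes. This is only a sketch: the rows and the `position` column are made up for illustration and are not the contents of employee.csv or position.csv; only the `id` join key and the `name` sort key come from the example above.

```python
import pandas as pd

# Made-up stand-ins for data/01_raw/employee.csv and data/01_raw/position.csv.
df_employee = pd.DataFrame({"id": [1, 2, 3], "name": ["Citra", "Adi", "Budi"]})
df_position = pd.DataFrame(
    {"id": [1, 2, 3], "position": ["Analyst", "Engineer", "Manager"]}
)

# Same call as in the merge_data step: join on the shared "id" column.
merged = df_employee.merge(df_position, on="id")

# Same call as in the sort_data step, using by="name" and ascending=True
# as given in config.jsonnet.
merged_and_sorted = merged.sort_values(by="name", ascending=True)

print(merged_and_sorted)
```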

Please take some time to read the documentation at https://ariaghora.github.io/villard/.
