ML-Flows

🔀 ML-Dataflows using Prefect (e.g.: Structured Hierarchical RAG)

This repo implements a collection of Machine Learning Prefect dataflows.

For example, SHRAG: Structured Hierarchical Retrieval-Augmented Generation, built using LlamaIndex.

💡 Once installed, list all available flows with:

$ python -m flows ls # Or: pdm flows ls

This should output something like:

╒════╤═════════════╤═════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │ Flow Name   │ From                    │ Flow Parameters                                                                              │
╞════╪═════════════╪═════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ playbook-qa │ flows/shrag/__init__.py │ - playbook_json: Path to the playbook JSON file                                              │
│    │             │                         │ - chroma_collection_name: Name of the ChromaDB collection                                    │
│    │             │                         │ - chroma_host: ChromaDB host. Defaults to CHROMA_HOST_DEFAULT.                               │
│    │             │                         │ - chroma_port: ChromaDB port. Defaults to CHROMA_PORT_DEFAULT.                               │
│    │             │                         │ - llm_backend: LLM backend to use. Defaults to LLM_BACKEND_DEFAULT.                          │
│    │             │                         │ - llm_id: LLM model ID to use. Defaults to LLM_MODEL_DEFAULT.                                │
│    │             │                         │ - embedding_model: Embedding model to use. Defaults to EMBEDDING_MODEL_DEFAULT.              │
│    │             │                         │ - reranker_model: Reranker model to use. Defaults to None.                                   │
│    │             │                         │ - similarity_top_k: Number of top results to retrieve. Defaults to SIMILARITY_TOP_K_DEFAULT. │
│    │             │                         │ - similarity_cutoff: Similarity cutoff for retrieval. Defaults to SIMILARITY_CUTOFF_DEFAULT. │
│    │             │                         │ - meta_filters: Metadata filters for retrieval. Defaults to {}.                              │
╘════╧═════════════╧═════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

How to

Setup

  1. Create a virtual environment:

    [!NOTE] You'll need Python 3.12 or greater

    # Install pdm
    curl -sSL https://pdm-project.org/install-pdm.py | python3 -
    
    # Optionally install uv and set it as dependency resolver
    curl -LsSf https://astral.sh/uv/install.sh | sh
    pdm config use_uv true
    pdm config python.install_root $(uv python dir)
  2. Install dependencies:

    pdm sync -G:all
  3. Start the auxiliary services:

    • Ensure you have a ChromaDB instance running: inv local-chroma
    • Ensure you have the Prefect server running: inv local-prefect
    • Ensure you have a running Prefect pool: inv start-worker-pool -p process -n test --overwrite

Run

First, update the .env file as necessary (see .example.env for guidance).
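For illustration, a minimal .env might look like the sketch below. The variable names and values here are assumptions inferred from the flow parameter defaults (CHROMA_HOST_DEFAULT, etc.); treat .example.env as the source of truth:

```shell
# Hypothetical .env sketch -- names and values are illustrative only,
# check .example.env for the real ones
CHROMA_HOST=localhost
CHROMA_PORT=8000
LLM_BACKEND=openai
LLM_ID=gpt-4o-mini
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
```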

Auxiliary Services

Local auxiliary services:

inv local-chroma &               # start chroma
inv local-prefect &              # start prefect
inv local-worker-pool process &  # start a prefect process worker pool

We also make use of Ray Serve deployments for OCR and document parsing, available via a REST API. Check out DoPARSE.

Additionally, you might want to configure Prefect to store results when running locally:

prefect config set PREFECT_LOCAL_STORAGE_PATH="~/.prefect/storage"
prefect config view # check modified settings

There are then two ways to run flows:

Flows - As Python modules

Run the CLI exposed in each submodule's __main__.py directly:

source .env  # for convenience, so we don't need to pass every parameter

# Running a flow directly from its __main__ entrypoint
# E.g.: Run a QA-Playbook filtering by document name
python -m flows.shrag run-playbook-qa \
   data/playbook_sample.json \
   <your-chromaDB-collection-name> \
   -m 'name:<document-name-to-filter-by>'
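The -m flag takes key:value pairs that presumably become the flow's meta_filters dict. A minimal sketch of such parsing (a hypothetical helper, not the repo's actual implementation):

```python
def parse_meta_filters(pairs: list[str]) -> dict[str, str]:
    """Turn CLI-style 'key:value' strings into a metadata-filter dict.

    Hypothetical helper for illustration; the repo's real CLI may differ.
    """
    filters = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        if not value:
            raise ValueError(f"expected 'key:value', got {pair!r}")
        filters[key.strip()] = value.strip()
    return filters

print(parse_meta_filters(["name:quarterly-report"]))
# {'name': 'quarterly-report'}
```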

Flows - As Prefect deployments

  1. Create a deployment:
pdm flows deploy playbook-qa DEUS process local-process-pool -t qa -t playbook
pdm flows deploy index-files DEUS process local-process-pool -t preproc -t markdown
  2. Run it either from the dashboard or programmatically:
prefect deployment run 'playbook-qa/DEUS'

PDM requirements and image build notes

This repo uses pdm to manage dependencies and to generate a requirements.txt for building Docker images and other tooling:

# Note that as-is, this step removes nvidia-* dependencies from the requirements.txt
pdm reqs

# To leave it untouched, use instead:
INCLUDE_GPU=true pdm reqs
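The effect of the default (non-GPU) export can be sketched as a simple filter over the exported pins; the package names below are made up for the example, and the real `pdm reqs` task may implement this differently:

```shell
# Illustrative sketch (in a scratch dir) of stripping nvidia-* pins
# from an exported requirements file
cd "$(mktemp -d)"
printf 'torch==2.3.0\nnvidia-cublas-cu12==12.1.3.1\nnumpy==1.26.4\n' > requirements.full.txt
grep -v '^nvidia-' requirements.full.txt > requirements.txt
cat requirements.txt
```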

Note that the deployment/ecs/1_build...sh script is not aware of which type of image it is building, and may therefore overwrite the existing image in ECS.
