ML-Flows

🔀 ML-Dataflows using Prefect (e.g.: Structured Hierarchical RAG)

This repo implements a collection of Machine Learning Prefect dataflows.

For example, SHRAG: Structured Hierarchical Retrieval-Augmented Generation, built using LlamaIndex.

💡 Once installed, list all available flows with:

$ python -m flows ls # Or: pdm flows ls

This should output something like:

╒════╤═════════════╤═════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │ Flow Name   │ From                    │ Flow Parameters                                                                              │
╞════╪═════════════╪═════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ playbook-qa │ flows/shrag/__init__.py │ - playbook_json: Path to the playbook JSON file                                              │
│    │             │                         │ - chroma_collection_name: Name of the ChromaDB collection                                    │
│    │             │                         │ - chroma_host: ChromaDB host. Defaults to CHROMA_HOST_DEFAULT.                               │
│    │             │                         │ - chroma_port: ChromaDB port. Defaults to CHROMA_PORT_DEFAULT.                               │
│    │             │                         │ - llm_backend: LLM backend to use. Defaults to LLM_BACKEND_DEFAULT.                          │
│    │             │                         │ - llm_id: LLM model ID to use. Defaults to LLM_MODEL_DEFAULT.                                │
│    │             │                         │ - embedding_model: Embedding model to use. Defaults to EMBEDDING_MODEL_DEFAULT.              │
│    │             │                         │ - reranker_model: Reranker model to use. Defaults to None.                                   │
│    │             │                         │ - similarity_top_k: Number of top results to retrieve. Defaults to SIMILARITY_TOP_K_DEFAULT. │
│    │             │                         │ - similarity_cutoff: Similarity cutoff for retrieval. Defaults to SIMILARITY_CUTOFF_DEFAULT. │
│    │             │                         │ - meta_filters: Metadata filters for retrieval. Defaults to {}.                              │
╘════╧═════════════╧═════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

How to

Setup

  1. Create a virtual environment:

    [!NOTE] You'll need Python 3.12 or greater

    # Install pdm
    curl -sSL https://pdm-project.org/install-pdm.py | python3 -
    
    # Optionally install uv and set it as dependency resolver
    curl -LsSf https://astral.sh/uv/install.sh | sh
    pdm config use_uv true
    pdm config python.install_root $(uv python dir)
  2. Install dependencies:

    pdm sync -G:all
  3. Start the auxiliary services:

    • Ensure you have a ChromaDB instance running: inv local-chroma
    • Ensure you have the Prefect server running: inv local-prefect
    • Ensure you have a running Prefect pool: inv start-worker-pool -p process -n test --overwrite

Run

First, update the .env file as necessary (see .example.env for guidance).
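For illustration, a minimal .env might look like the sketch below. The variable names and values here are assumptions inferred from the flow parameter defaults (CHROMA_HOST_DEFAULT, etc.); treat .example.env as the source of truth:

```shell
# Hypothetical .env sketch -- names and values are illustrative only,
# check .example.env for the real ones
CHROMA_HOST=localhost
CHROMA_PORT=8000
LLM_BACKEND=openai
LLM_ID=gpt-4o-mini
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
```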

Auxiliary Services

Local auxiliary services:

inv local-chroma &               # start chroma
inv local-prefect &              # start prefect
inv local-worker-pool process &  # start a prefect process worker pool

We also make use of Ray Serve deployments for OCR and document parsing, available via a REST API. Check out DoPARSE.

Additionally, you might want to configure Prefect to store results when running locally:

prefect config set PREFECT_LOCAL_STORAGE_PATH="~/.prefect/storage"
prefect config view # check modified settings

There are then two ways to run flows:

Flows - As Python modules

Run the CLI exposed in each submodule's __main__.py directly:

source .env  # for convenience, so we don't need to pass every parameter

# Running a flow directly from its __main__ entrypoint
# E.g.: Run a QA-Playbook filtering by document name
python -m flows.shrag run-playbook-qa \
   data/playbook_sample.json \
   <your-chromaDB-collection-name> \
   -m 'name:<document-name-to-filter-by>'
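The -m flag takes key:value pairs that presumably become the flow's meta_filters dict. A minimal sketch of such parsing (a hypothetical helper, not the repo's actual implementation):

```python
def parse_meta_filters(pairs: list[str]) -> dict[str, str]:
    """Turn CLI-style 'key:value' strings into a metadata-filter dict.

    Hypothetical helper for illustration; the repo's real CLI may differ.
    """
    filters = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        if not value:
            raise ValueError(f"expected 'key:value', got {pair!r}")
        filters[key.strip()] = value.strip()
    return filters

print(parse_meta_filters(["name:quarterly-report"]))
# {'name': 'quarterly-report'}
```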

Flows - As Prefect deployments

  1. Create a deployment:
pdm flows deploy playbook-qa DEUS process local-process-pool -t qa -t playbook
pdm flows deploy index-files DEUS process local-process-pool -t preproc -t markdown
  2. Run it either from the dashboard or programmatically:
prefect deployment run 'playbook-qa/DEUS'

PDM requirements and image build notes

This repo uses pdm to manage dependencies and to generate a requirements.txt for building Docker images and other tooling:

# Note that as-is, this step removes nvidia-* dependencies from the requirements.txt
pdm reqs

# To leave it untouched, use instead:
INCLUDE_GPU=true pdm reqs
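The effect of the default (non-GPU) export can be sketched as a simple filter over the exported pins; the package names below are made up for the example, and the real `pdm reqs` task may implement this differently:

```shell
# Illustrative sketch (in a scratch dir) of stripping nvidia-* pins
# from an exported requirements file
cd "$(mktemp -d)"
printf 'torch==2.3.0\nnvidia-cublas-cu12==12.1.3.1\nnumpy==1.26.4\n' > requirements.full.txt
grep -v '^nvidia-' requirements.full.txt > requirements.txt
cat requirements.txt
```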

Note that the deployment/ecs/1_build...sh script is not aware of which type of image it is building, and may therefore overwrite the existing image in ECS.
