This repo implements a collection of Machine Learning Prefect dataflows.
For example, SHRAG: Structured Hierarchical Retrieval Augmented Generation, using LlamaIndex.
💡 Once installed, check all the flows with:

```bash
python -m flows ls  # Or: pdm flows ls
```

Which should output something like:
```
┌───┬─────────────┬──────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────┐
│   │ Flow Name   │ From                     │ Flow Parameters                                                                               │
╞═══╪═════════════╪══════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════╡
│ 0 │ playbook-qa │ flows/shrag/__init__.py  │ - playbook_json: Path to the playbook JSON file                                               │
│   │             │                          │ - chroma_collection_name: Name of the ChromaDB collection                                     │
│   │             │                          │ - chroma_host: ChromaDB host. Defaults to CHROMA_HOST_DEFAULT.                                │
│   │             │                          │ - chroma_port: ChromaDB port. Defaults to CHROMA_PORT_DEFAULT.                                │
│   │             │                          │ - llm_backend: LLM backend to use. Defaults to LLM_BACKEND_DEFAULT.                           │
│   │             │                          │ - llm_id: LLM model ID to use. Defaults to LLM_MODEL_DEFAULT.                                 │
│   │             │                          │ - embedding_model: Embedding model to use. Defaults to EMBEDDING_MODEL_DEFAULT.              │
│   │             │                          │ - reranker_model: Reranker model to use. Defaults to None.                                    │
│   │             │                          │ - similarity_top_k: Number of top results to retrieve. Defaults to SIMILARITY_TOP_K_DEFAULT.  │
│   │             │                          │ - similarity_cutoff: Similarity cutoff for retrieval. Defaults to SIMILARITY_CUTOFF_DEFAULT.  │
│   │             │                          │ - meta_filters: Metadata filters for retrieval. Defaults to {}.                               │
└───┴─────────────┴──────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────┘
```
Create a virtual environment:
> [!NOTE]
> You'll need Python 3.12 or greater.
```bash
# Install pdm
curl -sSL https://pdm-project.org/install-pdm.py | python3 -

# Optionally, install uv and set it as the dependency resolver
curl -LsSf https://astral.sh/uv/install.sh | sh
pdm config use_uv true
pdm config python.install_root $(uv python dir)
```
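With pdm in place, the virtual environment itself can be created and activated along these lines (a sketch using pdm's standard `venv` subcommands):

```bash
# Create a Python 3.12 virtualenv managed by pdm, then activate it
pdm venv create 3.12
eval "$(pdm venv activate)"
```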
Install dependencies:
```bash
pdm sync -G:all
```
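If you only need a subset of the optional dependencies, groups can be selected individually (standard pdm flags; the group name below is hypothetical, check pyproject.toml for the real ones):

```bash
# Install the default dependencies plus a single optional group
pdm sync -G shrag
```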
Auxiliary Services:
- Ensure you have a ChromaDB instance running:

  ```bash
  inv local-chroma
  ```

- Ensure you have the Prefect server running:

  ```bash
  inv local-prefect
  ```

- Ensure you have a running Prefect worker pool (see the verification sketch after this list):

  ```bash
  inv start-worker-pool -p process -n test --overwrite
  ```
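If the pool was created correctly, it should show up in Prefect's pool listing (a quick check using the standard Prefect CLI):

```bash
# List registered work pools; the pool created above ("test") should appear
prefect work-pool ls
```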
First update the .env file as necessary. (See .example.env for guidance)
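For illustration only, a hypothetical .env sketch; the variable names below are illustrative, and .example.env is the source of truth:

```bash
# Hypothetical values; check .example.env for the actual variable names
CHROMA_HOST=localhost
CHROMA_PORT=8000
```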
Local auxiliary services:
```bash
inv local-chroma &               # start chroma
inv local-prefect &              # start prefect
inv local-worker-pool process &  # start a prefect process worker pool
```

We also make use of Ray Serve deployments for OCR and document parsing, available via a REST API. Check out DoPARSE.
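As a rough illustration, calling such a Ray Serve REST endpoint typically looks like the following; the host, port, and route here are hypothetical, so check DoPARSE for the actual API:

```bash
# Hypothetical endpoint and route; see DoPARSE for the real ones
curl -X POST http://localhost:8000/parse -F "file=@document.pdf"
```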
Additionally, you might want to configure Prefect to store results when running locally:
```bash
prefect config set PREFECT_LOCAL_STORAGE_PATH="~/.prefect/storage"
prefect config view  # check modified settings
```

Then there are two ways to run flows:
Directly, as you'd run any module CLI, via each submodule's __main__.py:
```bash
source .env;  # For ease, so we don't need to pass tons of params

# Running a flow directly from its __main__ entrypoint
# E.g.: run a QA-Playbook filtering by document name
python -m flows.shrag run-playbook-qa \
    data/playbook_sample.json \
    <your-chromaDB-collection-name> \
    -m 'name:<document-name-to-filter-by>'
```
Or, as Prefect deployments:

- Create a deployment:

  ```bash
  pdm flows deploy playbook-qa DEUS process local-process-pool -t qa -t playbook
  pdm flows deploy index-files DEUS process local-process-pool -t preproc -t markdown
  ```

- Run either from the dashboard or programmatically:
  ```bash
  prefect deployment run 'playbook-qa/DEUS'
  ```
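Flow parameters can also be passed straight from the CLI; a sketch assuming the parameters listed in the flow table earlier (`--param` and `--watch` are standard `prefect deployment run` options):

```bash
# Pass flow parameters and stream the run's state until completion
prefect deployment run 'playbook-qa/DEUS' \
    --param playbook_json=data/playbook_sample.json \
    --watch
```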
This repo uses pdm to manage dependencies and to generate a requirements.txt for building Docker images or other tooling:
```bash
# Note that, as-is, this step removes nvidia-* dependencies from the requirements.txt
pdm reqs

# To leave it untouched, use instead:
INCLUDE_GPU=true pdm reqs
```

Note that the deployment/ecs/1_build...sh script is not aware of which type of image it is building, and may therefore overwrite the image in ECS.
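One way to avoid that clobbering is to tag CPU and GPU images distinctly before pushing; the tag scheme and $ECR_REPO variable below are purely illustrative:

```bash
# Illustrative only: suffix the tag with the image flavor so pushes don't collide
docker tag my-flows:latest "$ECR_REPO/my-flows:cpu-$(git rev-parse --short HEAD)"
```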