This repository serves as a template for creating Dockerized data refinement tasks that transform raw user data into normalized (and potentially anonymized) SQLite-compatible databases. Once created, a refined database is designed to be stored in Vana's Data Registry and indexed for querying by Vana's Query Engine.
This template provides a structure for building data refinement tasks that:
- Read raw data files from the /input directory
- Transform the data into a normalized SQLite database schema (specifically libSQL, a modern fork of SQLite)
- Optionally mask or remove PII (Personally Identifiable Information)
- Encrypt the refined data with a derivative of the original file encryption key
- Upload the encrypted data to IPFS
- Output the schema and IPFS URL to the /output directory
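The read, transform, mask, and output steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the template's actual `refine.py`: the `users` table, the JSON field names (`name`, `email`), the `mask_email` helper, and the layout of `schema.json` are all hypothetical, and the standard `sqlite3` module stands in for a libSQL client. Encryption and the IPFS upload are left out.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def mask_email(email: str) -> str:
    """Replace the local part of an email with a short SHA-256 digest (hypothetical masking rule)."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:12]
    return f"{digest}@{domain}"


def refine(input_dir: Path, output_dir: Path) -> Path:
    """Load every *.json file in input_dir into a SQLite file in output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    db_path = output_dir / "db.libsql"
    db_path.unlink(missing_ok=True)  # start from a clean database on each run

    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical normalized schema: one row per raw input file.
        conn.execute(
            "CREATE TABLE users ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
        )
        for raw_file in sorted(input_dir.glob("*.json")):
            record = json.loads(raw_file.read_text(encoding="utf-8"))
            conn.execute(
                "INSERT INTO users (name, email) VALUES (?, ?)",
                (record["name"], mask_email(record["email"])),
            )
        conn.commit()
        # Collect the CREATE TABLE statements so they can be written out as the schema.
        tables = conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()

    # Assumed schema.json layout; the real template may use a different structure.
    schema = {"dialect": "sqlite", "tables": {name: sql for name, sql in tables}}
    (output_dir / "schema.json").write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return db_path
```

Calling `refine(Path("/input"), Path("/output"))` inside the container would then produce `db.libsql` and `schema.json`, ready for the encryption and upload steps.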
- refiner/: Contains the main refinement logic
  - refine.py: Core refinement implementation
  - config.py: Environment variables and settings needed to run your refinement
  - __main__.py: Entry point for the refinement execution
  - models/: Pydantic and SQLAlchemy data models (for both unrefined and refined data)
  - transformer/: Data transformation logic
  - utils/: Utility functions for encryption, IPFS upload, etc.
- input/: Contains raw data files to be refined
- output/: Contains refined outputs:
  - schema.json: Database schema definition
  - db.libsql: SQLite database file
  - db.libsql.pgp: Encrypted database file
- Dockerfile: Defines the container image for the refinement task
- requirements.txt: Python package dependencies
- Fork this repository
- Modify the config to match your environment, or add a .env file at the root
- Update the schemas in refiner/models/ to define your raw and normalized data models
- Modify the refinement logic in refiner/transformer/ to match your data structure
- Build and test your refinement container
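As a rough illustration of the configuration step, the sketch below reads settings from environment variables with sensible defaults. It is not the template's `config.py`: the `Settings` class, `load_settings`, and the `INPUT_DIR` / `OUTPUT_DIR` variable names are assumptions made for this example. Only `PINATA_API_KEY` and `PINATA_API_SECRET` appear in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container for a refinement run."""
    input_dir: str
    output_dir: str
    pinata_api_key: Optional[str]
    pinata_api_secret: Optional[str]

    @property
    def can_upload(self) -> bool:
        # Uploading to IPFS via Pinata needs both credentials.
        return bool(self.pinata_api_key and self.pinata_api_secret)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        input_dir=env.get("INPUT_DIR", "/input"),      # assumed variable name
        output_dir=env.get("OUTPUT_DIR", "/output"),   # assumed variable name
        pinata_api_key=env.get("PINATA_API_KEY") or None,
        pinata_api_secret=env.get("PINATA_API_SECRET") or None,
    )
```

Passing the environment as an argument keeps the loader easy to test; for example, `load_settings({}).can_upload` is `False`, so a local run without Pinata credentials could skip the upload step.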
To run the refinement locally for testing:
# With Python
pip install --no-cache-dir -r requirements.txt
python -m refiner
# Or with Docker
docker build -t refiner .
docker run \
--rm \
--volume "$(pwd)/input":/input \
--volume "$(pwd)/output":/output \
--env PINATA_API_KEY=your_key \
--env PINATA_API_SECRET=your_secret \
refiner

If you have suggestions for improving this template, please open an issue or submit a pull request.