🌐 A French version of this README is available here.
📚 Complete documentation is available in the /documentation folder.
DataFlow AI is a complete solution for processing, analyzing, and transforming JSON files and PDF documents to prepare them for AI systems, RAG (Retrieval Augmented Generation), and knowledge bases.
- Intelligent PDF Processing: Extract text and analyze images with GPT-4.1
- JSON Processing: Automatic structure detection, cleaning, and optimization
- Unified Processing: Match and enrich JIRA and Confluence files
- Flexible Access: Use either the web interface or the CLI
- LLM Enrichment: Enhance your data with AI-powered analysis
- Security Built-in: Automatic removal of sensitive data
- Task Orchestration: Resilient task management with 86% faster PDF processing
For a user-friendly experience, DataFlow AI provides a modern web interface:
- Start the API and frontend: `docker-compose up -d`
- Access the interface at http://localhost:80
- Use the intuitive drag-and-drop interface to process your files
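If the interface does not come up, standard Docker Compose commands can confirm that both services started (nothing here is specific to DataFlow AI):

```bash
# List the services started by docker-compose and their current state
docker-compose ps

# Follow the combined logs while troubleshooting
docker-compose logs -f
```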
For power users and automation, use the interactive command-line interface:
```bash
# Launch interactive mode with guided assistant
python -m cli.cli interactive

# Or directly run specific commands
python -m cli.cli extract-images complete file.pdf --max-images 10
```

The interactive CLI provides a guided experience with:
- File and folder selection through an interactive browser
- Step-by-step guidance for all operations
- Clear summaries before each action
- Detailed notifications at the end of each process
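The documented commands can also be scripted. The shell loop below is our own sketch, not a DataFlow AI feature (the CLI advertises its own batch processing); only the `extract-images complete` call comes from the documentation:

```bash
# Run the documented single-file command over every PDF in ./documents
for f in documents/*.pdf; do
  python -m cli.cli extract-images complete "$f" --max-images 10
done
```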
| Task | Web Interface | CLI Command |
|---|---|---|
| Process PDF | Upload on Home page | python -m cli.cli extract-images complete file.pdf |
| Process JSON | JSON Processing tab | python -m cli.cli process file.json --llm |
| Match JIRA & Confluence | Unified Processing tab | python -m cli.cli unified jira.json --confluence conf.json |
| Clean Sensitive Data | JSON Processing tab | python -m cli.cli clean file.json |
| Tool | Description | Web | CLI |
|---|---|---|---|
| PDF Extraction | Extract text and analyze images from PDF files | ✅ | ✅ |
| JSON Processing | Process and structure JSON data | ✅ | ✅ |
| JIRA/Confluence Matching | Match and enrich data between sources | ✅ | ✅ |
| Data Cleaning | Remove sensitive information | ✅ | ✅ |
| Chunking | Split large files into manageable pieces | ✅ | ✅ |
| LLM Enrichment | Enhance data with AI analysis | ✅ | ✅ |
| Compression | Optimize file size | ✅ | ✅ |
| Batch Processing | Process multiple files at once | ✅ | ✅ |
| Interactive Assistant | Guided workflow | ❌ | ✅ |
- Intelligent Structure Detection: Automatically adapts to any JSON structure
- Advanced PDF Analysis: Combines text extraction with AI image analysis
- Data Preservation: Never modifies source files directly
- Robust Processing: Handles errors and inconsistencies automatically
- Detailed Reports: Automatically generates comprehensive summaries
- Flexible Output: Optimized for RAG systems and AI applications
⚠️ IMPORTANT: DataFlow AI requires Python 3.12 specifically. Other versions (including newer ones) may not work correctly with the Outlines library.
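A quick way to confirm the right interpreter is available before installing (plain Python, nothing project-specific):

```bash
# Should print "Python 3.12.x"; if the command is missing, install Python 3.12 first
python3.12 --version
```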
The easiest way to get started with both the API and web interface:
```bash
# Clone the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create environment files
cp .env.example .env
cp frontend/.env.example frontend/.env

# Start services
docker-compose up -d
```

For more control or development purposes:
```bash
# Clone and access the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create a virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Set up environment
cp .env.example .env
# Edit .env file to configure your settings

# Install dependencies
pip install -r requirements.txt

# Start the API
python run_api.py

# In another terminal, start the frontend
cd frontend
npm install
npm run dev
```

📘 Note: See the complete installation guide for detailed instructions.
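Whichever installation path you choose, the settings live in the `.env` files created from the provided examples. The variable names below are illustrative assumptions, not the canonical keys; those are defined in `.env.example`. Given the GPT-4.1 image analysis and the API key authentication mentioned under security, you will typically need something like:

```bash
# Illustrative sketch of a .env file; check .env.example for the real variable names
OPENAI_API_KEY=sk-...     # assumed name: key used for GPT-4.1 image analysis
API_KEY=change-me         # assumed name: key clients must present to the API
```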
Comprehensive documentation is available in the /documentation folder:
- API Documentation: API endpoints and usage
- CLI Documentation: Command-line interface guide
- Frontend Documentation: Web interface manual
- PDF Processing: PDF extraction capabilities
- JSON Processing: JSON handling features
- Security: Data security features
- Task Orchestrator: Advanced task management system
- Migration & Scalability Guide: Architecture, migration to Temporal/Supabase, scalability
DataFlow AI includes features to protect sensitive data:
- Automatic detection and removal of API keys, credentials, and personal information
- Local processing of files, with no permanent storage
- API key authentication for all endpoints
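As an illustration of the authentication model only: the header name, environment variable, and endpoint path below are placeholders we chose, not the documented API contract; take the real routes and header from the API documentation.

```bash
# Hypothetical request: both the header name and ENDPOINT are placeholders
ENDPOINT="replace/with/a/real/route"
curl -H "X-API-Key: $DATAFLOW_API_KEY" \
     -F "file=@file.pdf" \
     "http://localhost/$ENDPOINT"
```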
For more information, see the security documentation.
DataFlow AI is designed to be easily deployed with Docker:
```bash
# Deploy everything
docker-compose up -d

# Run CLI commands in Docker
docker-compose run cli interactive
```

DataFlow AI is a free and ambitious project. If you find it useful and would like to support its development, consider donating via Ko-fi.
It helps us maintain the project, add new features, and respond to your feedback faster.
Thank you for your support, even a symbolic one 🙏
This project is distributed under the Polyform Small Business License 1.0.0.
For full license details, see the LICENSE file.




