🌐 A French version of this README is available here.
📚 Complete documentation is available in the /documentation folder.
DataFlow AI is a complete solution for processing, analyzing, and transforming JSON files and PDF documents to prepare them for AI systems, RAG (Retrieval Augmented Generation), and knowledge bases.
- Intelligent PDF Processing: Extract text and analyze images with GPT-4.1
- JSON Processing: Automatic structure detection, cleaning, and optimization
- Unified Processing: Match and enrich JIRA and Confluence files
- Flexible Access: Use either the web interface or the CLI
- LLM Enrichment: Enhance your data with AI-powered analysis
- Security Built-in: Automatic removal of sensitive data
- Task Orchestration: Resilient task management with 86% faster PDF processing
For a user-friendly experience, DataFlow AI provides a modern web interface:
- Start the API and frontend: `docker-compose up -d`
- Access the interface at http://localhost:80
- Use the intuitive drag-and-drop interface to process your files
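If the interface does not come up, standard Docker Compose commands can confirm that both services started (nothing here is specific to DataFlow AI):

```bash
# List the services started by docker-compose and their current state
docker-compose ps

# Follow the combined logs while troubleshooting
docker-compose logs -f
```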
For power users and automation, use the interactive command-line interface:
```bash
# Launch interactive mode with guided assistant
python -m cli.cli interactive

# Or directly run specific commands
python -m cli.cli extract-images complete file.pdf --max-images 10
```

The interactive CLI provides a guided experience with:
- File and folder selection through an interactive browser
- Step-by-step guidance for all operations
- Clear summaries before each action
- Detailed notifications at the end of each process
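The documented commands can also be scripted. The shell loop below is our own sketch, not a DataFlow AI feature (the CLI advertises its own batch processing); only the `extract-images complete` call comes from the documentation:

```bash
# Run the documented single-file command over every PDF in ./documents
for f in documents/*.pdf; do
  python -m cli.cli extract-images complete "$f" --max-images 10
done
```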
| Task | Web Interface | CLI Command |
|---|---|---|
| Process PDF | Upload on Home page | python -m cli.cli extract-images complete file.pdf |
| Process JSON | JSON Processing tab | python -m cli.cli process file.json --llm |
| Match JIRA & Confluence | Unified Processing tab | python -m cli.cli unified jira.json --confluence conf.json |
| Clean Sensitive Data | JSON Processing tab | python -m cli.cli clean file.json |
| Tool | Description | Web | CLI |
|---|---|---|---|
| PDF Extraction | Extract text and analyze images from PDF files | ✅ | ✅ |
| JSON Processing | Process and structure JSON data | ✅ | ✅ |
| JIRA/Confluence Matching | Match and enrich data between sources | ✅ | ✅ |
| Data Cleaning | Remove sensitive information | ✅ | ✅ |
| Chunking | Split large files into manageable pieces | ✅ | ✅ |
| LLM Enrichment | Enhance data with AI analysis | ✅ | ✅ |
| Compression | Optimize file size | ✅ | ✅ |
| Batch Processing | Process multiple files at once | ✅ | ✅ |
| Interactive Assistant | Guided workflow | ❌ | ✅ |
- Intelligent Structure Detection: Automatically adapts to any JSON structure
- Advanced PDF Analysis: Combines text extraction with AI image analysis
- Data Preservation: Never modifies source files directly
- Robust Processing: Handles errors and inconsistencies automatically
- Detailed Reports: Automatically generates comprehensive summaries
- Flexible Output: Optimized for RAG systems and AI applications
⚠️ IMPORTANT: DataFlow AI requires Python 3.12 specifically. Other versions (including newer ones) may not work correctly with the Outlines library.
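A quick way to confirm the right interpreter is available before installing (plain Python, nothing project-specific):

```bash
# Should print "Python 3.12.x"; if the command is missing, install Python 3.12 first
python3.12 --version
```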
The easiest way to get started with both the API and web interface:
```bash
# Clone the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create environment files
cp .env.example .env
cp frontend/.env.example frontend/.env

# Start services
docker-compose up -d
```

For more control or development purposes:
```bash
# Clone and access the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create a virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Set up environment
cp .env.example .env
# Edit .env file to configure your settings

# Install dependencies
pip install -r requirements.txt

# Start the API
python run_api.py

# In another terminal, start the frontend
cd frontend
npm install
npm run dev
```

📘 Note: See the complete installation guide for detailed instructions.
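Whichever installation path you choose, the settings live in the `.env` files created from the provided examples. The variable names below are illustrative assumptions, not the canonical keys; those are defined in `.env.example`. Given the GPT-4.1 image analysis and the API key authentication mentioned under security, you will typically need something like:

```bash
# Illustrative sketch of a .env file; check .env.example for the real variable names
OPENAI_API_KEY=sk-...     # assumed name: key used for GPT-4.1 image analysis
API_KEY=change-me         # assumed name: key clients must present to the API
```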
Comprehensive documentation is available in the /documentation folder:
- API Documentation: API endpoints and usage
- CLI Documentation: Command-line interface guide
- Frontend Documentation: Web interface manual
- PDF Processing: PDF extraction capabilities
- JSON Processing: JSON handling features
- Security: Data security features
- Task Orchestrator: Advanced task management system
- Migration & Scalability Guide: Architecture, migration to Temporal/Supabase, scalability
DataFlow AI includes features to protect sensitive data:
- Automatic detection and removal of API keys, credentials, and personal information
- Local processing of files, with no permanent storage
- API key authentication for all endpoints
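As an illustration of the authentication model only: the header name, environment variable, and endpoint path below are placeholders we chose, not the documented API contract; take the real routes and header from the API documentation.

```bash
# Hypothetical request: both the header name and ENDPOINT are placeholders
ENDPOINT="replace/with/a/real/route"
curl -H "X-API-Key: $DATAFLOW_API_KEY" \
     -F "file=@file.pdf" \
     "http://localhost/$ENDPOINT"
```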
For more information, see the security documentation.
DataFlow AI is designed to be easily deployed with Docker:
```bash
# Deploy everything
docker-compose up -d

# Run CLI commands in Docker
docker-compose run cli interactive
```

DataFlow AI is a free and ambitious project. If you find it useful and would like to support its development, consider donating via Ko-fi.
It helps us maintain the project, add new features, and respond to your feedback faster.
Thank you for your support, even a symbolic one 🙏
This project is distributed under the Polyform Small Business License 1.0.0.
For full license details, see the LICENSE file.




