Skip to content

stranxik/dataflow-ai

Repository files navigation

DataFlow AI – Intelligent Data Processing for AI and RAG Systems

Version Python License: Polyform-SBL

🌐 Version française disponible ici 📚 Complete documentation available in the /documentation folder

📑 Overview

DataFlow AI is a complete solution for processing, analyzing, and transforming JSON files and PDF documents to prepare them for AI systems, RAG (Retrieval Augmented Generation), and knowledge bases.

DataFlow AI Overview

🚀 Key Features

  • Intelligent PDF Processing: Extract text and analyze images with GPT-4.1
  • JSON Processing: Automatic structure detection, cleaning, and optimization
  • Unified Processing: Match and enrich JIRA and Confluence files
  • Flexible Access: Use either the web interface or the CLI
  • LLM Enrichment: Enhance your data with AI-powered analysis
  • Security Built-in: Automatic removal of sensitive data
  • Task Orchestration: Resilient task management with 86% faster PDF processing

🖥️ Getting Started

Using the Web Interface

For a user-friendly experience, DataFlow AI provides a modern web interface:

  1. Start the API and frontend:
docker-compose up -d
  1. Access the interface at http://localhost:80

  2. Use the intuitive drag-and-drop interface to process your files

Web Interface

Using the Interactive CLI

For power users and automation, use the interactive command-line interface:

# Launch interactive mode with guided assistant
python -m cli.cli interactive

# Or directly run specific commands
python -m cli.cli extract-images complete file.pdf --max-images 10

The interactive CLI provides a guided experience with:

  • File and folder selection through an interactive browser
  • Step-by-step guidance for all operations
  • Clear summaries before each action
  • Detailed notifications at the end of each process

CLI Interactive Mode

📋 Quick Reference

Task Web Interface CLI Command
Process PDF Upload on Home page python -m cli.cli extract-images complete file.pdf
Process JSON JSON Processing tab python -m cli.cli process file.json --llm
Match JIRA & Confluence Unified Processing tab python -m cli.cli unified jira.json --confluence conf.json
Clean Sensitive Data JSON Processing tab python -m cli.cli clean file.json

🧰 Available Tools

Tool Description Web CLI
PDF Extraction Extract text and analyze images from PDF files
JSON Processing Process and structure JSON data
JIRA/Confluence Matching Match and enrich data between sources
Data Cleaning Remove sensitive information
Chunking Split large files into manageable pieces
LLM Enrichment Enhance data with AI analysis
Compression Optimize file size
Batch Processing Process multiple files at once
Interactive Assistant Guided workflow

🔍 Why DataFlow AI?

  • Intelligent Structure Detection: Automatically adapts to any JSON structure
  • Advanced PDF Analysis: Combines text extraction with AI image analysis
  • Data Preservation: Never modifies source files directly
  • Robust Processing: Handles errors and inconsistencies automatically
  • Detailed Reports: Automatically generates comprehensive summaries
  • Flexible Output: Optimized for RAG systems and AI applications

JSON Processing

⚙️ Installation

⚠️ IMPORTANT: DataFlow AI requires Python 3.12 specifically. Other versions (including newer ones) may not work correctly with the Outlines library.

Quick Start with Docker

The easiest way to get started with both the API and web interface:

# Clone the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create environment files
cp .env.example .env
cp frontend/.env.example frontend/.env

# Start services
docker-compose up -d

Manual Installation

For more control or development purposes:

# Clone and access the repository
git clone https://github.com/stranxik/dataflow-ai.git
cd dataflow-ai

# Create a virtual environment with Python 3.12
python3.12 -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Set up environment
cp .env.example .env
# Edit .env file to configure your settings

# Install dependencies
pip install -r requirements.txt

# Start the API
python run_api.py

# In another terminal, start the frontend
cd frontend
npm install
npm run dev

📘 Note: See the complete installation guide for detailed instructions.

📚 Documentation

Comprehensive documentation is available in the /documentation folder:

JIRA and Confluence Integration

🔒 Security

DataFlow AI includes features to protect sensitive data:

  • Automatic detection and removal of API keys, credentials, and personal information
  • Local processing of files, with no permanent storage
  • API key authentication for all endpoints

For more information, see the security documentation.

🐳 Docker Deployment

DataFlow AI is designed to be easily deployed with Docker:

# Deploy everything
docker-compose up -d

# Run CLI commands in Docker
docker-compose run cli interactive

🇬🇧 Support the project

DataFlow-AI is a free and ambitious project. If you find it useful and would like to support its development, consider donating via Ko-fi.
It helps us maintain the project, add new features, and respond to your feedback faster.

Support via Ko-fi

Thank you for your support, even symbolic 🙏

📜 License

This project is distributed under the Polyform Small Business License 1.0.0.

For full license details, see the LICENSE file.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published