DataSage — LLM Data Quality Profiler

Understand messy datasets fast, with privacy-first analysis, clear visualisations, and optional AI-generated summaries.


Why DataSage?

  • Privacy-first: your data stays on your machine by default (local model).
  • Actionable insights: concise, explainable summaries of quality issues.
  • Two ways to use: a friendly web UI and a simple CLI.
  • Batteries included: distributions, correlations, missingness patterns, and outlier detection with professional-looking charts.
  • Optional AI: plug in an API key to upgrade summaries; otherwise DataSage still produces useful, deterministic reports.

What it does

Given a CSV (or a pandas DataFrame), DataSage:

  1. Profiles the dataset (shape, memory, types, uniques, missingness, outliers, correlations); see the sketch after this list.
  2. Builds a short, plain-English report (local LLM by default; optionally OpenAI).
  3. Lets you explore interactively (histograms, heatmaps, missingness, outliers) or script it from the CLI.
  4. Exports a clean Markdown report you can paste into a notebook, issue, or slide.
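
Step 1 boils down to standard pandas operations. The snippet below is only a rough sketch of that profiling pass, not DataSage's actual API; the function name and the shape of the returned dictionary are assumptions for illustration.

import pandas as pd

def sketch_profile(df: pd.DataFrame) -> dict:
    """Illustrative only: the kind of profile DataSage builds in step 1."""
    return {
        "shape": df.shape,                                            # (rows, columns)
        "memory_mb": df.memory_usage(deep=True).sum() / 1e6,          # approximate footprint
        "dtypes": df.dtypes.astype(str).to_dict(),                    # per-column types
        "unique_counts": df.nunique().to_dict(),                      # distinct values per column
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),   # % missing per column
        "correlations": df.corr(numeric_only=True).round(2).to_dict() # numeric correlations
    }

profile = sketch_profile(pd.read_csv("path/to/data.csv"))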

Quick start

1) Install

git clone https://github.com/Marini97/datasage.git
cd datasage
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

2) Run the web app (recommended)

python -m datasage.cli web
# Then open http://localhost:8501

3) Or run from the CLI

# Basic profiling (local, no API keys needed)
python -m datasage.cli profile path/to/data.csv

# Save a Markdown report
python -m datasage.cli profile path/to/data.csv -o report.md

Optional: enhanced AI summaries

If you want higher-quality narrative summaries (on top of the local model / statistical fallback), export your key:

export OPENAI_API_KEY="sk-..."
# (Optional) custom base: export OPENAI_API_BASE="https://api.openai.com/v1"

Then re-run the same commands; DataSage will automatically use the remote model when available.

Prefer local only? Do nothing — DataSage runs with a local model or a purely statistical fallback.


Configuration

Environment variables (all optional):

# Local model name (Hugging Face)
export DATASAGE_MODEL="google/flan-t5-base"

# Generation controls (keep low for determinism)
export DATASAGE_MAX_TOKENS=300
export DATASAGE_TEMPERATURE=0.0
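
How these variables are consumed inside the package is an implementation detail; the snippet below just sketches the usual os.environ pattern, with defaults borrowed from the example values above (the Python variable names and default values here are assumptions, not DataSage's internals).

import os

# Illustrative resolution of the overrides above (defaults are assumptions).
MODEL_NAME = os.environ.get("DATASAGE_MODEL", "google/flan-t5-base")
MAX_TOKENS = int(os.environ.get("DATASAGE_MAX_TOKENS", "300"))
TEMPERATURE = float(os.environ.get("DATASAGE_TEMPERATURE", "0.0"))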

CLI flags (selected):

python -m datasage.cli profile data.csv \
  -o report.md \
  --summary-only \
  --debug

Features at a glance

  • Dataset overview: rows/columns, memory footprint, per-type breakdown.
  • Per-column stats: type, non-null %, uniques, sample values.
  • Numerical: min/max/mean/std, quartiles & IQR, zero/negative counts, outliers.
  • Categorical: top-k values with frequencies, rare categories.
  • Datetime: min/max, coverage period.
  • Text: average length, empty-string rate.
  • Quality flags: high missingness, likely ID columns, skew, duplicate rows, many rare categories (see the sketch after this list).
  • Visuals: histograms/KDE, correlation heatmaps, missingness patterns, boxplots.
  • Reports: Markdown with Overview, Quality Summary, Column Profiles.
  • Chat (web app): ask questions about your data; responses are grounded in the profile.
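
The exact thresholds behind the quality flags are internal to profiler.py; the sketch below only illustrates the kind of heuristics involved (high missingness, likely-ID detection, duplicate rows, IQR outliers). The cutoff values and the function name are assumptions.

import pandas as pd

def sketch_quality_flags(df: pd.DataFrame) -> dict:
    """Illustrative heuristics; thresholds are assumptions, not DataSage's."""
    flags = {}
    missing_pct = df.isna().mean()
    flags["high_missingness"] = missing_pct[missing_pct > 0.5].index.tolist()
    # Columns whose values are (almost) all unique are probably identifiers.
    uniqueness = df.nunique() / max(len(df), 1)
    flags["likely_ids"] = uniqueness[uniqueness > 0.95].index.tolist()
    flags["duplicate_rows"] = int(df.duplicated().sum())
    # Classic 1.5×IQR rule for numeric outliers.
    outliers = {}
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    flags["outlier_counts"] = outliers
    return flags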

Example

# Data Quality Report

## Dataset Overview
- 244 rows × 7 columns (≈0.13 MB)
- Mixed numeric and categorical types
- 6 outliers across 2 columns
- High correlation (0.89) between total_bill and tip

## Key Insights
- Tipping patterns scale with bill amount
- No missing values detected
- 3 potential data entry errors in “tip”

(Numbers above are illustrative; your dataset will differ.)


Project structure

.
├─ datasage/                # app + CLI + profiling + prompts + reporting
├─ examples/                # sample CSVs / demo assets
├─ tests/                   # unit tests
├─ .github/workflows/       # CI configuration
├─ README.md
├─ requirements.txt
├─ pyproject.toml
└─ LICENSE                  # MIT

Core modules (in datasage/):

  • cli.py — entry point for CLI & web app
  • profiler.py — dataframe → stats & quality signals
  • prompt_builder.py — compact prompts from stats
  • model.py — local LLM loader + generation helpers
  • enhanced_generator.py — optional OpenAI + fallbacks
  • report_formatter.py — stitch chunks into Markdown
  • web_app.py — Streamlit interface

Architecture

CSV → pandas → profiler ─┐
                         ├─→ prompt_builder → { local LLM | OpenAI | statistical fallback } → report_formatter → Markdown
Profiles & visuals ──────┤
                         └─→ Streamlit UI (explore + chat)

  • Deterministic by default: temperature 0.0, bounded tokens.
  • Graceful degradation: if remote LLMs are absent or unavailable, DataSage still produces a clear, useful report (see the sketch after this list).
  • No data leaves the machine unless you opt into a remote API.
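
The choice between remote model, local model, and statistical fallback lives in enhanced_generator.py; its exact code isn't reproduced here, but the degradation order described above amounts to something like the sketch below (function name, parameters, and the fallback text are placeholders, not DataSage's API).

from typing import Callable, Optional

def sketch_generate_summary(
    prompt: str,
    stats: dict,
    remote: Optional[Callable[[str], str]] = None,
    local: Optional[Callable[[str], str]] = None,
) -> str:
    """Illustrative fallback chain: remote LLM → local LLM → statistical report."""
    for generator in (remote, local):
        if generator is None:
            continue
        try:
            return generator(prompt)
        except Exception:
            continue  # degrade gracefully to the next option
    # Deterministic, purely statistical summary built from the profile.
    rows, cols = stats.get("shape", ("?", "?"))
    missing = stats.get("missing_pct", {})
    lines = [f"Dataset has {rows} rows and {cols} columns."]
    if missing:
        worst = max(missing, key=missing.get)
        lines.append(f"Highest missingness: {worst} ({missing[worst]:.1f}%).")
    return "\n".join(lines)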

Development

# Dev install
pip install -r requirements.txt
pip install -e .

# Tests
pytest tests/

# Lint/format (if configured in pyproject)
ruff check datasage/
black datasage/

Licence

MIT — see LICENSE.


Built with 🧙‍♂️ magic and ☕ coffee. Your data stays private, your insights go far!
