The Text2SQL Analytics System allows users to interact with a normalized Northwind PostgreSQL database using natural language queries. Queries are converted into SQL via the Gemini API, validated for safety, executed on the database, and returned as structured outputs (JSON or pandas DataFrame).
Architecture Diagram:
[User Input (Natural Language)]
|
v
[Text2SQL Engine - Gemini API]
|
v
[Query Sanitizer & Validator]
|
v
[PostgreSQL DB]
|
v
[Results: pandas DataFrame / JSON]- Create a Python 3.10+ virtual environment and activate it.
python3 -m venv .venv source .venv/bin/activate # macOS / Linux .venv\Scripts\activate # Windows (PowerShell)
- Install dependencies:
pip install -r requirements.txt
- Fill
.envbased on.env.exampleand start a local Postgres (or Docker Compose). - Use
scripts/setup_database.pyto prepare schema and optionally load CSVs. - Use
run_query.pyto try sample queries (currently uses mocked LLM responses).
text2sql-analytics/
├── README.md
├── requirements.txt
├── docker-compose
├── .env.example
├── .gitignore
├── setup.py
├── data/
│ ├── normalized/*
│ ├── raw/
│ │ └── northwind.xlsx
│ └── schema/
│ └── schema.sql
├── src/
│ ├── __init__.py
│ ├── config.py
│ ├── data_loader.py
│ ├── text2sql_engine.py
│ ├── query_validator.py
│ └── utils.py
├── tests/
│ ├── __init__.py
│ ├── conftest.py
│ ├── test_data_loader.py
| |── test_db_connection.py
| |── test_mock_utils_db.py
│ ├── test_query_validator.py
│ ├── test_text2sql_engine.py
│ ├── test_utils.py
│ └── mocks/
│ ├── mock_gemini_client.py
├── notebooks/
│ └── analysis.ipynb
└── scripts/
├── setup_database.py
-
Ensure Docker is installed and running
-
Start PostgreSQL via Docker Compose
From the project root directory:
docker-compose up -d
This will:
- Start a PostgreSQL 17 container named
text2sql-db-1 - Expose port
5432 - Create a database
northwind_dbwith usernorthwind_adminand passwordnorthwind123 - Persist data inside
data/postgres/
- Start a PostgreSQL 17 container named
-
Verify the container is running
docker ps
-
Connect to the database
docker exec -it text2sql-db-1 psql -U northwind_admin -d northwind_db
-
Create a
.envfile in the project root (or copy.env.example):cp .env.example .env
-
Update it with your local database credentials:
# Example environment variables DB_HOST=localhost DB_PORT=5432 DB_NAME=northwind_db DB_USER=northwind_admin DB_PASSWORD=northwind123 GEMINI_API_KEY=your_gemini_api_key_here
To confirm the connection works:
python3 tests/test_db_connection.pyIf you see:
Connected to: northwind_db
You’re good to go!
ER Diagram
Load the data
python3 src/data_loader.py --excel data/raw/northwind.xlsxRun
python3 src/text2sql_engine.pyoutput:
python3 src/text2sql_engine.py
Generated SQL Query: SELECT * FROM cities;
/home/fahad/text2sql/src/../src/utils.py:141: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
df = pd.read_sql_query(query, conn)
Query: Find all unique city names
Query Results (JSON):
[
{
"city_id":1,
"city_name":"Berlin",
"region_id":1,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":2,
"city_name":"M\u00e9xico D.F.",
"region_id":2,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":3,
"city_name":"London",
"region_id":3,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":4,
"city_name":"Lule\u00e5",
"region_id":4,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":5,
"city_name":"Mannheim",
"region_id":1,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":6,
"city_name":"Strasbourg",
"region_id":5,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":7,
"city_name":"Madrid",
"region_id":6,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":8,
"city_name":"Marseille",
"region_id":5,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":9,
"city_name":"Tsawassen",
"region_id":7,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":10,
"city_name":"Buenos Aires",
"region_id":8,
"created_at":1759732635038,
"updated_at":1759732635038
},
{
"city_id":11,
"city_name":"Bern",
"region_id":9,
"created_at":1759732635038,
"updated_at":1759732635038
},
....
{
"city_id":95,
"city_name":"Annecy",
"region_id":5,
"created_at":1759732635709,
"updated_at":1759732635709
},
{
"city_id":96,
"city_name":"Ste-Hyacinthe",
"region_id":25,
"created_at":1759732635709,
"updated_at":1759732635709
},
{
"city_id":97,
"city_name":"Colchester",
"region_id":45,
"created_at":1759732830414,
"updated_at":1759732830414
}
]The Text2SQL Analytics System exposes a RESTful API built with FastAPI that converts natural language queries into SQL and executes them against a PostgreSQL database. The API includes built-in security features, rate limiting, and comprehensive error handling.
http://localhost:8000
- Rate Limiting: 5 requests per 10 seconds per IP address
- Request Timeout: Monitored via
X-Process-Timeheader - SQL Injection Protection: Built-in query validation and sanitization
- Error Handling: Structured error responses with appropriate HTTP status codes
GET /
Returns the API health status.
Response:
{
"status": "ok",
"message": "Text2SQL API running."
}POST /generate-sql
Converts natural language to SQL, validates the query, and executes it against the database.
Request Body:
{
"question": "Show all orders shipped in 1997"
}Response Schema:
{
"sql_query": "string", // Raw SQL generated by Gemini
"sanitized_query": "string", // SQL after sanitization
"validate_query": "string", // Final validated SQL
"result_json": "string" // Query results as JSON string
}Example Request:
curl -X POST "http://localhost:8000/generate-sql" \
-H "Content-Type: application/json" \
-d '{"question": "Find all customers from Germany"}'Example Response:
{
"sql_query": "SELECT * FROM customers WHERE country = 'Germany';",
"sanitized_query": "SELECT * FROM customers WHERE country = 'Germany'",
"validate_query": "SELECT * FROM customers WHERE country = 'Germany'",
"result_json": "[{\"customer_id\":1,\"company_name\":\"Alfreds Futterkiste\",\"country\":\"Germany\"}]"
}{
"detail": "Validation error: Invalid SQL syntax"
}{
"message": "Too many requests"
}{
"detail": "Internal error: Database connection failed"
}Text2SQLRequest:
question(string, required): Natural language query- Default: "Show all orders shipped in 1997"
SQLResponseModel:
sql_query(string): Original SQL generated by Gemini APIsanitized_query(string): SQL after sanitization processvalidate_query(string): Final validated SQL ready for executionresult_json(string): Query results serialized as JSON
- Process Time Tracking: Adds
X-Process-Timeheader to all responses - Rate Limiting: IP-based request limiting (5 req/10sec)
- CORS: Cross-origin resource sharing support
- Natural Language Input: User provides English question
- Prompt Engineering: Question converted to optimized prompt
- SQL Generation: Gemini API generates SQL query
- Sanitization: Remove dangerous SQL constructs
- Validation: Ensure SQL syntax and safety
- Execution: Run validated query against PostgreSQL
- Serialization: Convert results to JSON format
- SELECT queries: Data retrieval operations
- Aggregations: COUNT, SUM, AVG, MIN, MAX
- Joins: Inner, left, right joins across tables
- Filtering: WHERE clauses with various conditions
- Grouping: GROUP BY with HAVING clauses
- Sorting: ORDER BY operations
- INSERT, UPDATE, DELETE statements
- DROP, ALTER, CREATE statements
- System function calls
- Subqueries with potential security risks
FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
To start the development server with hot-reload, run:
uvicorn src.main:app --reloadThe API will be available at http://localhost:8000.
For production deployment, use:
# With specific host and port
uvicorn src.main:app --host 0.0.0.0 --port 8000
# With multiple workers
uvicorn src.main:app --host 0.0.0.0 --port 8000 --workers 4From root use:
pytest -vGenerate text coverage html
pytest --cov=src --cov-report=htmlto see HTML coverage open htmlcov/index.html in browser:
http://localhost:5500/htmlcov/
