diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..6c50672 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,31 @@ +# converSQL Copilot Guide + +## System map +- `app.py` is the Streamlit shell: it hydrates cached data via `initialize_app_data()`, gates the UI through `simple_auth_wrapper`, and delegates all heavy work to `src/core.py` and `src/ai_service.py`. +- `src/core.py` owns DuckDB execution and orchestrates data prep. `scan_parquet_files()` will run `scripts/sync_data.py` if `data/processed/*.parquet` are missing, so keep a local Parquet copy handy during tests to avoid network pulls. +- `src/ai_service.py` routes natural-language prompts into adapter implementations in `src/ai_engines/`. The prompt embeds the mortgage risk heuristics baked into `src/data_dictionary.py`; reuse `AIService._build_sql_prompt()` instead of crafting ad-hoc prompts. + +## Data + ontology expectations +- Loan metadata lives in `data/processed/data.parquet`; schema text comes from `generate_enhanced_schema_context()` which stitches DuckDB types with ontology metadata from `src/data_dictionary.py` and `docs/DATA_DICTIONARY.md`. +- When adding derived features, update both the Parquet schema and the ontology entry so AI output and the Ontology Explorer tab stay in sync. +- The Streamlit Ontology tab imports `LOAN_ONTOLOGY` and `PORTFOLIO_CONTEXT`; breaking their shape (dict → FieldMetadata) will crash the UI. + +## AI engine adapters +- Adapters must subclass `AIEngineAdapter` in `src/ai_engines/base.py`, expose `provider_id`, `name`, `is_available()`, and `generate_sql()`, then be exported via `src/ai_engines/__init__.py` and registered inside `AIService.adapters`. +- Use `clean_sql_response()` to strip markdown fences, and return `(sql, "")` on success; downstream callers treat any non-empty error string as failure. +- Keep `AI_PROVIDER` fallbacks working—tests rely on `AIService` surviving with zero credentials, so default to "unavailable" rather than raising. + +## Developer workflows +- Install deps with `pip install -r requirements.txt`; prefer `make setup` for a clean environment (installs + cleanup). +- Fast test cycle: `make test-unit` skips integration markers; `make test` mirrors CI (pytest + coverage). Integration adapters are ignored by default via `pytest.ini`; remove the `--ignore` flags there if you really need live API coverage. +- Lint/format stack is Black 120 cols + isort + flake8 + mypy. `make ci` runs the whole suite and matches the GitHub Actions workflow. + +## Environment & secrets +- Copy `.env.example` to `.env`, then set one provider block (`CLAUDE_API_KEY`, `AWS_*`, or `GEMINI_API_KEY`). Without credentials the UI drops to “AI unavailable” but manual SQL still works. +- Data sync needs Cloudflare R2 keys (`R2_ACCESS_KEY_ID`, `R2_SECRET_ACCESS_KEY`, `R2_ENDPOINT_URL`). In offline dev, set `FORCE_DATA_REFRESH=false` and place Parquet files under `data/processed/`. +- Authentication defaults to Google OAuth (`ENABLE_AUTH=true`); set it to `false` for local hacking or provide `GOOGLE_CLIENT_ID/SECRET` plus HTTPS when deploying. + +## Practical tips +- Clear Streamlit caches with `streamlit cache clear` if schema or ontology changes; otherwise stale `@st.cache_data` results linger. +- When writing new ingest code, mirror the type-casting helpers in `notebooks/pipeline_csv_to_parquet*.ipynb` so DuckDB types stay compatible. 
+- Logging to Cloudflare D1 is optional—`src/d1_logger.py` silently no-ops without `CLOUDFLARE_*` secrets, so you can call it safely even in tests. diff --git a/.github/workflows/format-code.yml b/.github/workflows/format-code.yml deleted file mode 100644 index 9aba171..0000000 --- a/.github/workflows/format-code.yml +++ /dev/null @@ -1,61 +0,0 @@ -name: Auto-format Code - -on: - push: - branches: [ enhance-pipeline ] - paths: - - 'src/**/*.py' - - 'tests/**/*.py' - -permissions: - contents: write - -jobs: - format: - name: Format Python Code - runs-on: ubuntu-latest - if: github.event_name == 'push' && !contains(github.event.head_commit.message, '[skip ci]') - - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - token: ${{ secrets.GITHUB_TOKEN }} - fetch-depth: 0 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.11' - cache: 'pip' - - - name: Install formatting tools - run: | - python -m pip install --upgrade pip - pip install black isort - - - name: Run Black formatter - run: | - black --line-length 120 src/ tests/ || true - - - name: Run isort - run: | - isort --profile black src/ tests/ || true - - - name: Check for changes - id: changes - run: | - if git diff --quiet; then - echo "changed=false" >> $GITHUB_OUTPUT - else - echo "changed=true" >> $GITHUB_OUTPUT - fi - - - name: Commit formatted code - if: steps.changes.outputs.changed == 'true' - run: | - git config --local user.email "github-actions[bot]@users.noreply.github.com" - git config --local user.name "github-actions[bot]" - git add -A - git commit -m "style: auto-format code with black and isort [skip ci]" - git push \ No newline at end of file diff --git a/.streamlit/config.toml b/.streamlit/config.toml index eb125bf..2e5248e 100644 --- a/.streamlit/config.toml +++ b/.streamlit/config.toml @@ -4,7 +4,7 @@ enableCORS = true port = 5000 [theme] -primaryColor = "#1f77b4" -backgroundColor = "#ffffff" -secondaryBackgroundColor = "#f0f2f6" -textColor = "#262730" \ No newline at end of file +primaryColor = "#B45F4D" +backgroundColor = "#FAF6F0" +secondaryBackgroundColor = "#FDFDFD" +textColor = "#3A3A3A" \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 88f6d03..9ff536d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -22,11 +22,11 @@ There are many ways to contribute to converSQL: ```bash # Fork the repository on GitHub, then clone your fork -git clone https://github.com/YOUR_USERNAME/conversql.git -cd conversql +git clone https://github.com/YOUR_USERNAME/converSQL.git +cd converSQL # Add upstream remote -git remote add upstream https://github.com/ravishan16/conversql.git +git remote add upstream https://github.com/ravishan16/converSQL.git ``` ### 2. Set Up Development Environment @@ -83,6 +83,14 @@ black src/ app.py flake8 src/ app.py --max-line-length=100 ``` +### Front-end Styling + +When updating Streamlit UI components: + +- Re-use the CSS custom properties defined in `app.py` (`--color-background`, `--color-accent-primary`, etc.) instead of hard-coded hex values. +- Mirror changes in `.streamlit/config.toml` when altering primary/secondary colors so the Streamlit theme and custom CSS stay aligned. +- Include before/after screenshots in your pull request whenever you adjust layout, typography, or palette usage. 
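As a concrete illustration, here is a minimal sketch of a component that consumes those custom properties. It assumes the `--color-*` variables are declared on `:root` by the CSS block `app.py` injects; the markup and inline styles are illustrative, not an existing component.

```python
import streamlit as st

# Hypothetical card: the inline styles below only reference the shared custom
# properties defined in app.py, never hard-coded hex values.
st.markdown(
    """
    <div style="
        background: var(--color-background-alt);
        border: 1px solid var(--color-border-light);
        color: var(--color-text-primary);
        border-radius: 8px;
        padding: 0.75rem 1rem;
    ">
        Accent actions should use <code>var(--color-accent-primary)</code> and
        hover states <code>var(--color-accent-primary-darker)</code>.
    </div>
    """,
    unsafe_allow_html=True,
)
```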
+ **Example:** ```python def execute_sql_query(sql_query: str, parquet_files: List[str]) -> pd.DataFrame: diff --git a/README.md b/README.md index 9164a40..d82cf75 100644 --- a/README.md +++ b/README.md @@ -1,86 +1,60 @@ -# Conver--- +

+ converSQL logo +

-## 📖 The Story Behind converSQL - -### The Problem - -Data is everywhere, but accessing it remains a technical barrier. Analysts spend hours writing SQL queries. Business users wait for reports. Data scientists translate questions into complex joins and aggregations. Meanwhile, the insights trapped in your data remain just out of reach for those who need them most. - -Traditional BI tools offer pre-built dashboards, but they're rigid. They can't answer the questions you didn't anticipate. And when you need a custom query, you're back to writing SQL or waiting in the queue for engineering support. - -### The Open Data Opportunity - -What if we could turn this around? What if anyone could ask questions in plain English and get instant, accurate SQL queries tailored to their specific data domain? - -That's where converSQL comes in. Built on the principle that **data should be conversational**, converSQL combines: -- **Ontological modeling**: Structured knowledge about your data domains, relationships, and business rules -- **AI-powered generation**: Multiple AI engines (Bedrock, Claude, Gemini, Ollama) that understand context and generate accurate SQL -- **Open data focus**: Showcasing what's possible with publicly available datasets like Fannie Mae's Single Family Loan Performance Data - -### Our Mission - -We believe data analysis should be: -- **Accessible**: Ask questions in natural language, get answers in seconds -- **Intelligent**: Understand business context, not just column names -- **Extensible**: Easy to adapt to any domain with any data structure -- **Open**: Built on open-source principles, welcoming community contributions +# converSQL ---- +![CI](https://github.com/ravishan16/converSQL/actions/workflows/ci.yml/badge.svg) +![Format & Lint](https://github.com/ravishan16/converSQL/actions/workflows/format-code.yml/badge.svg) +![License: MIT](https://img.shields.io/badge/License-MIT-CA9C72.svg) +![Built with Streamlit](https://img.shields.io/badge/Built%20with-Streamlit-FF4B4B.svg?logo=streamlit&logoColor=white) -## 🏡 Flagship Implementation: Single Family Loan Analytics +> Transform natural language questions into production-ready SQL with ontological context and warm, human-centered design. -To demonstrate converSQL's capabilities, we've built a production-ready application analyzing **9+ million mortgage loan records** from Fannie Mae's public dataset. +## Why converSQL -### Why This Matters +### The challenge +- Business teams wait on backlogs of custom SQL while analysts juggle endless report tweaks. +- Complex domains like mortgage analytics demand institutional knowledge that traditional BI tools cannot encode. +- Open data is abundant, but combining it with AI safely and accurately remains tedious. -The Single Family Loan Performance Data represents one of the most comprehensive public datasets on U.S. mortgage markets. It contains granular loan-level data spanning originations, performance, modifications, and defaults. But with 110+ columns and complex domain knowledge required, it's challenging to analyze effectively. +### Our approach +- **Ontology-first modeling** captures relationships, risk logic, and business vocabulary once and reuses it everywhere. +- **Adapter-based AI orchestration** lets you swap Claude, Bedrock, Gemini, or local engines without touching the UI. +- **Streamlit experience design** bridges analysts and executives with curated prompts, cached schemas, and explainable results. 
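To make the adapter bullet concrete, here is a minimal sketch of a new engine following the `AIEngineAdapter` contract described in the copilot guide earlier in this diff. The method names and the `(sql, error)` return convention come from that guide; the exact base-class signatures, the `self.clean_sql_response()` helper location, and the provider client call are assumptions drawn from patterns elsewhere in this diff, not verified APIs.

```python
from typing import Any, Dict, Optional, Tuple

from .base import AIEngineAdapter


class MyProviderAdapter(AIEngineAdapter):
    """Sketch of a provider adapter; replace the client wiring with a real SDK."""

    def __init__(self, config: Dict[str, Any] = None):
        self.client: Optional[Any] = None
        super().__init__(config)

    def _initialize(self) -> None:
        # Read credentials from config or env here; leave client as None when they
        # are missing so the service degrades to "unavailable" instead of raising.
        ...

    @property
    def provider_id(self) -> str:
        return "my_provider"

    @property
    def name(self) -> str:
        return "My Provider"

    def is_available(self) -> bool:
        return self.client is not None

    def generate_sql(self, prompt: str) -> Tuple[str, str]:
        if self.client is None:
            return "", "My Provider not available. Check credentials."
        raw = self.client.complete(prompt)  # provider-specific call (illustrative)
        return self.clean_sql_response(raw), ""  # empty error string signals success
```

A new adapter is then exported from `src/ai_engines/__init__.py` and registered in `AIService.adapters`, as the copilot guide notes.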
-**converSQL makes it conversational:** +## Flagship implementation: Single Family Loan Analytics +The reference app ships with 9M+ rows of Fannie Mae loan performance data. Ask the AI for “high-risk California loans under 620 credit score” and get DuckDB-ready SQL plus rich metrics at a glance. -🔍 **Natural Language Query:** -*"Show me high-risk loans in California with credit scores below 620"* +### Spotlight features +- 🧠 110+ fields grouped into 15 ontology domains with risk heuristics baked into prompts. +- ⚡ CSV ➜ Parquet pipeline with enforced types, 10× compression, and predicate pushdown via DuckDB. +- 🔐 Google OAuth guardrails with optional Cloudflare D1 logging. +- 🤖 Multi-provider AI adapters (Bedrock, Claude, Gemini) with graceful fallbacks and prompt caching. -✨ **Generated SQL:** ```sql SELECT LOAN_ID, STATE, CSCORE_B, OLTV, DTI, DLQ_STATUS, CURRENT_UPB FROM data -WHERE STATE = 'CA' +WHERE STATE = 'CA' AND CSCORE_B < 620 AND CSCORE_B IS NOT NULL ORDER BY CSCORE_B ASC, OLTV DESC -LIMIT 20 +LIMIT 20; ``` -📊 **Instant Results** — with context-aware risk metrics and portfolio insights. - -# converSQL - -> **Transform Natural Language into SQL — Intelligently** - -**converSQL** is an open-source framework that bridges the gap between human questions and database queries. Using ontological data modeling and AI-powered query generation, converSQL makes complex data analysis accessible to everyone — from analysts to executives — without requiring SQL expertise. - -## 🚀 Why Conversational SQL? - -Stop writing complex SQL by hand! With Conversational SQL, you can: -- Ask questions in plain English and get optimized SQL instantly -- Integrate with multiple AI providers (Anthropic Claude, AWS Bedrock, local models) -- Extend to any domain with ontological data modeling -- Build interactive dashboards, query builders, and analytics apps - -## 🏆 Flagship Use Case: Single Family Loan Analytics - -This repo features a production-grade implementation for mortgage loan portfolio analysis. It’s a showcase of how Conversational SQL can power real-world, domain-specific analytics. - -### Key Features - -- **🧠 Ontological Intelligence**: 110+ fields organized into 15 business domains (Credit Risk, Geographic, Temporal, Performance, etc.) 
-- **🎯 Domain-Aware Context**: AI understands mortgage terminology — "high-risk" automatically considers credit scores, LTV ratios, and DTI -- **⚡ High-Performance Pipeline**: Pipe-separated CSVs → Parquet with schema enforcement, achieving 10x compression and instant query performance -- **🔐 Enterprise Security**: Google OAuth integration with Cloudflare D1 query logging -- **🚀 Multiple AI Engines**: Out-of-the-box support for AWS Bedrock, Claude API, and extensible to Gemini, Ollama, and more - ---- +## Architecture at a glance +``` +Streamlit UI (app.py) + └─ Core orchestration (src/core.py) + ├─ DuckDB execution + ├─ Cached schema + ontology context + └─ Data sync checks (scripts/sync_data.py) + └─ AI service (src/ai_service.py) + ├─ Adapter registry (src/ai_engines/*) + ├─ Prompt construction with risk framework + └─ Clean SQL post-processing +``` ## 🏗️ Architecture @@ -120,195 +94,75 @@ Our showcase implementation demonstrates a complete data engineering workflow: 📄 **[Learn more about the data pipeline →](docs/DATA_PIPELINE.md)** ---- - -## �️ Quick Start - -### Prerequisites -- Python 3.11+ -- Google OAuth credentials -- AI Provider (Claude API or AWS Bedrock) -- Cloudflare R2 or local data storage - -### Installation -```bash -git clone -cd converSQL -pip install -r requirements.txt -``` - -### Configuration -```bash -# Copy environment template -cp .env.example .env - -# Configure your settings -# See setup guides for detailed instructions -``` - -### Launch -```bash -streamlit run app.py -``` - - -## 📖 Developer Setup Guides - -All setup and deployment guides are located in the `docs/` directory: - -- **[Google OAuth Setup](docs/GOOGLE_OAUTH_SETUP.md)** — Authentication configuration -- **[Cloud Storage Setup](docs/R2_SETUP.md)** — Cloudflare R2 data storage configuration -- **[Cloudflare D1 Setup](docs/D1_SETUP.md)** — Logging user activity with Cloudflare D1 -- **[Environment Setup](docs/ENVIRONMENT_SETUP.md)** — Environment variables and dependencies -- **[Deployment Guide](docs/DEPLOYMENT.md)** — Deploy to Streamlit Cloud or locally - - - -## � Documentation - -### Setup Guides -- **[Environment Setup](docs/ENVIRONMENT_SETUP.md)** — Configure environment variables and dependencies -- **[Data Pipeline Setup](docs/DATA_PIPELINE.md)** — Understand and customize the data pipeline -- **[Google OAuth Setup](docs/GOOGLE_OAUTH_SETUP.md)** — Enable authentication -- **[Cloud Storage Setup](docs/R2_SETUP.md)** — Configure Cloudflare R2 -- **[Deployment Guide](docs/DEPLOYMENT.md)** — Deploy to production - -### Developer Guides -- **[Contributing Guide](CONTRIBUTING.md)** — How to contribute to converSQL -- **[AI Engine Development](docs/AI_ENGINES.md)** — Add support for new AI providers -- **[Architecture Overview](docs/ARCHITECTURE.md)** — Deep dive into system design - ---- - -## 🤝 Contributing - -We welcome contributions from the community! Whether you're: -- 🐛 Reporting bugs -- 💡 Suggesting features -- 🔧 Adding new AI engine adapters -- 📖 Improving documentation -- 🎨 Enhancing the UI - -**Your contributions make converSQL better for everyone.** - -### How to Contribute - -1. **Fork the repository** -2. **Create a feature branch**: `git checkout -b feature/your-feature-name` -3. **Make your changes** with clear commit messages -4. **Test thoroughly** — ensure existing functionality still works -5. 
**Submit a pull request** with a detailed description - -📄 **[Read the full contributing guide →](CONTRIBUTING.md)** - -### Adding New AI Engines - -converSQL uses an adapter pattern for AI engines. Adding a new provider is straightforward: - -1. Implement the `AIEngineAdapter` interface -2. Add configuration options -3. Register in the AI service -4. Test and document - -📄 **[AI Engine Development Guide →](docs/AI_ENGINES.md)** - ---- - -## 🎯 Use Cases Beyond Loan Analytics - -While our flagship implementation focuses on mortgage data, converSQL is designed for **any domain** with tabular data: - -### Financial Services -- Credit card transaction analysis -- Investment portfolio performance -- Fraud detection patterns -- Regulatory reporting - -### Healthcare -- Patient outcomes analysis -- Clinical trial data exploration -- Hospital performance metrics -- Insurance claims analytics - -### E-commerce -- Customer behavior patterns -- Inventory optimization -- Sales performance tracking -- Supply chain analytics - -### Your Domain -**Bring your own data** — converSQL adapts through ontological modeling. Define your domains, specify relationships, and let AI handle the query generation. - ---- - -## 🌟 Why converSQL? - -### For Analysts -- **Stop writing SQL by hand** — describe what you want, get optimized queries -- **Explore data faster** — try different angles without syntax barriers -- **Focus on insights** — spend time analyzing, not coding - -### For Data Engineers -- **Modular architecture** — swap AI providers, storage backends, or UI components -- **Production-ready** — authentication, logging, caching, error handling built-in -- **Extensible ontology** — encode business logic once, reuse everywhere - -### For Organizations -- **Democratize data access** — empower non-technical users to explore data -- **Reduce bottlenecks** — less waiting for custom reports and queries -- **Open source** — no vendor lock-in, full transparency, community-driven development - ---- - -## 🛣️ Roadmap - -### Current Focus (v1.0) -- ✅ Multi-AI engine support (Bedrock, Claude, Gemini) -- ✅ Bedrock Guardrails integration for content filtering -- ✅ Ontological data modeling -- ✅ Single Family Loan Analytics showcase -- 🔄 Ollama adapter implementation -- 🔄 Enhanced query validation and optimization - -### Future Enhancements (v2.0+) -- Multi-table query generation with JOIN intelligence -- Query explanation and visualization -- Historical query learning and optimization -- More domain-specific implementations (healthcare, e-commerce, etc.) -- API server mode for programmatic access -- Web-based ontology editor - -**Have ideas?** [Open an issue](https://github.com/ravishan16/conversql/issues) or join the discussion! - ---- - -## 📄 License - -**MIT License** — Free to use, modify, and distribute. - -See the [LICENSE](LICENSE) file for details. 
- ---- - -## 🙏 Acknowledgments - -- **Fannie Mae** for making Single Family Loan Performance Data publicly available -- **DuckDB** team for an incredible analytical database engine -- **Anthropic** and **AWS** for powerful AI models -- **Streamlit** for making data apps beautiful and easy -- **Open source community** for inspiration and contributions - ---- - -## 📬 Stay Connected - -- **⭐ Star this repo** to follow development -- **🐦 Share your use cases** — we'd love to hear how you're using converSQL -- **💬 Join discussions** — ask questions, share ideas, help others -- **🐛 Report issues** — help us improve - ---- - -**Built with ❤️ by the converSQL community** - -*Making data conversational, one query at a time.* \ No newline at end of file +## Brand palette +| Token | Hex | Description | +| --- | --- | --- | +| `--color-background` | `#FAF6F0` | Ivory linen canvas across the app | +| `--color-background-alt` | `#FDFDFD` | Porcelain surfaces for cards and modals | +| `--color-text-primary` | `#3A3A3A` | Charcoal Plum headings | +| `--color-text-secondary` | `#7C6F64` | Warm Taupe body copy | +| `--color-accent-primary` | `#DDBEA9` | Soft Clay primary accent | +| `--color-accent-primary-darker` | `#B45F4D` | Terracotta hover and emphasis | +| `--color-border-light` | `#E4C590` | Gold Sand borders, dividers, and tags | + +## Quick start +1. **Install prerequisites** + ```bash + git clone https://github.com/ravishan16/converSQL.git + cd converSQL + pip install -r requirements.txt + ``` +2. **Configure environment** + ```bash + cp .env.example .env + # Enable one AI block (CLAUDE_API_KEY, AWS_* for Bedrock, or GEMINI_API_KEY) + # Provide Google OAuth or set ENABLE_AUTH=false for local dev + ``` +3. **Launch the app** + ```bash + streamlit run app.py + ``` + +## Key documentation +- [Architecture](docs/ARCHITECTURE.md) – layered design and component interactions. +- [Data pipeline](docs/DATA_PIPELINE.md) – ingest, transformation, and Parquet strategy. +- [AI engines](docs/AI_ENGINES.md) – adapter contracts and extension guides. +- [Environment setup](docs/ENVIRONMENT_SETUP.md) – required variables for auth, data, and providers. + +## Developer workflow +- `make setup` – clean install + cache purge. +- `make test-unit` / `make test` – pytest with coverage that mirrors CI. +- `make format` and `make lint` – Black (120 cols), isort, flake8, mypy. +- Cached helpers such as `scan_parquet_files()` trigger `scripts/sync_data.py` when Parquet is missing—keep `data/processed/` warm during tests. + +## Contributing +1. Fork and branch: `git checkout -b feature/my-update`. +2. Run formatting + tests before committing. +3. Open a PR describing the change, provider credentials (if applicable), and test strategy. + +See [CONTRIBUTING.md](CONTRIBUTING.md) for templates, AI adapter expectations, and review checklists. + +## Broader use cases +- **Financial services** – credit risk, portfolio concentrations, regulatory stress tests. +- **Healthcare** – patient outcomes, clinical trial cohorts, claims analytics. +- **E-commerce** – customer segments, inventory velocity, supply chain exceptions. +- **Any ontology-driven domain** – define your schema metadata and let converSQL converse. + +## Roadmap snapshot +- ✅ Multi-AI adapter support with prompt caching and fallbacks. +- ✅ Mortgage analytics reference implementation. +- 🔄 Ollama adapter and enhanced SQL validation. +- 🔮 Upcoming: multi-table joins, query explanations, historical learning, self-serve ontology editor. 
+ +## License +Released under the [MIT License](LICENSE). + +## Acknowledgments +- Fannie Mae for the Single Family Loan Performance dataset. +- The DuckDB, Streamlit, and Anthropic/AWS/Google teams for exceptional tooling. +- The converSQL community for ideas, issues, and adapters. + +## Stay connected +- ⭐ Star the repo to follow releases. +- 💬 Join discussions or open issues at [github.com/ravishan16/converSQL/issues](https://github.com/ravishan16/converSQL/issues). +- 📨 Share what you build—data should feel conversational. \ No newline at end of file diff --git a/app.py b/app.py index 434bb5c..5827478 100644 --- a/app.py +++ b/app.py @@ -14,6 +14,7 @@ # Import AI service with new adapter pattern from src.ai_service import generate_sql_with_ai, get_ai_service +from src.branding import get_favicon_path, get_logo_data_uri # Import core functionality from src.core import ( @@ -29,9 +30,11 @@ from src.simple_auth_components import simple_auth_wrapper # Configure page with professional styling +favicon_path = get_favicon_path() + st.set_page_config( page_title="converSQL - Natural Language to SQL", - page_icon="💬", + page_icon=str(favicon_path) if favicon_path else "💬", layout="wide", initial_sidebar_state="expanded", ) @@ -40,56 +43,266 @@ st.markdown( """ """, @@ -116,6 +329,7 @@ def format_file_size(size_bytes: int) -> str: def display_results(result_df: pd.DataFrame, title: str, execution_time: float = None): """Display query results with download option and performance metrics.""" if not result_df.empty: + st.markdown("
", unsafe_allow_html=True) # Compact performance header performance_info = f"✅ {title}: {len(result_df):,} rows" if execution_time: @@ -145,7 +359,9 @@ def display_results(result_df: pd.DataFrame, title: str, execution_time: float = # Use full width for the dataframe with responsive height height = min(600, max(200, len(result_df) * 35 + 50)) # Dynamic height based on rows - st.dataframe(result_df, width="stretch", height=height) + st.dataframe(result_df, use_container_width=True, height=height) + + st.markdown("
", unsafe_allow_html=True) else: st.warning("⚠️ No results found") @@ -207,31 +423,27 @@ def main(): st.error("❌ No data files found. Please ensure Parquet files are in the data/processed/ directory.") return + logo_data_uri = get_logo_data_uri() + # Professional sidebar with enhanced styling with st.sidebar: - # Header with better styling - st.markdown( - """ -
-

📊 System Status

+ if logo_data_uri: + st.markdown( + f""" + """, - unsafe_allow_html=True, - ) + unsafe_allow_html=True, + ) - # Data overview with metrics styling - parquet_files = st.session_state.get("parquet_files", []) st.markdown( """ -
- Data Files: - {} + - """.format( - len(parquet_files) - ), + """, unsafe_allow_html=True, ) @@ -244,9 +456,9 @@ def main(): provider_name = ai_status["active_provider"].title() st.markdown( """ -
-
+
🤖 AI Assistant: {}
@@ -303,12 +515,12 @@ def main(): else: st.markdown( """ -
-
+
🤖 AI Assistant: Unavailable
-
+
Configure Claude API or Bedrock access
@@ -320,9 +532,9 @@ def main(): if DEMO_MODE: st.markdown( """ -
-
+
🧪 Demo Mode Active
@@ -367,7 +579,7 @@ def main(): st.markdown(f"- **Enable Auth**: {os.getenv('ENABLE_AUTH', 'true')}") st.markdown( - "
", + "
", unsafe_allow_html=True, ) @@ -378,12 +590,12 @@ def main(): for file_path in parquet_files: table_name = os.path.splitext(os.path.basename(file_path))[0] st.markdown( - f"
{table_name}
", + f"
{table_name}
", unsafe_allow_html=True, ) else: st.markdown( - "
No tables loaded
", + "
No tables loaded
", unsafe_allow_html=True, ) @@ -391,7 +603,7 @@ def main(): with st.expander("📈 Portfolio Overview", expanded=True): if st.session_state.parquet_files: try: - import duckdb + import duckdb # type: ignore[import-not-found] # Use in-memory connection for stats only with duckdb.connect() as conn: @@ -426,33 +638,10 @@ def main(): ) else: st.markdown( - "
No data loaded
", + "
No data loaded
", unsafe_allow_html=True, ) - # Professional header with subtle styling - st.markdown( - """ -
-

- 💬 converSQL -

-

- Natural Language to SQL Query Generation -

-

- Multi-Provider AI Intelligence -

-
- Dataset: - 🏠 Single Family Loan Analytics -
-
- """, - unsafe_allow_html=True, - ) - # Enhanced tab layout with ontology exploration tab1, tab2, tab3 = st.tabs(["🔍 Query Builder", "🗺️ Data Ontology", "🔧 Advanced"]) @@ -461,14 +650,11 @@ def main(): with tab1: st.markdown( """ -
-

- Ask Questions About Your Loan Data -

-

- Use natural language to query your loan portfolio data -

-
+
+
+

Ask Questions About Your Loan Data

+

Use natural language to query your loan portfolio data.

+
""", unsafe_allow_html=True, ) @@ -476,26 +662,23 @@ def main(): # More compact analyst question dropdown analyst_questions = get_analyst_questions() - col1, col2 = st.columns([4, 1]) - with col1: + query_col1, query_col2 = st.columns([4, 1], gap="medium") + with query_col1: selected_question = st.selectbox( "💡 **Common Questions:**", [""] + list(analyst_questions.keys()), help="Select a pre-defined question", ) - with col2: - st.write("") # Add spacing to align button - if st.button("🎯 Use", disabled=not selected_question): + with query_col2: + st.write("") + if st.button("🎯 Use", disabled=not selected_question, use_container_width=True): if selected_question in analyst_questions: st.session_state.user_question = analyst_questions[selected_question] st.rerun() # Professional question input with better styling - st.markdown( - "
", - unsafe_allow_html=True, - ) + st.markdown("", unsafe_allow_html=True) user_question = st.text_area( "Your Question", value=st.session_state.get("user_question", ""), @@ -522,7 +705,7 @@ def main(): generate_button = st.button( f"🤖 Generate SQL with {provider_name}", type="primary", - width="stretch", + use_container_width=True, disabled=not is_ai_ready, help="Enter a question above to generate SQL" if not is_ai_ready else None, ) @@ -569,7 +752,7 @@ def main(): execute_button = st.button( "✅ Execute Query", type="primary", - width="stretch", + use_container_width=True, disabled=not has_sql, help="Generate SQL first to execute" if not has_sql else None, ) @@ -590,7 +773,7 @@ def main(): with col2: edit_button = st.button( "✏️ Edit", - width="stretch", + use_container_width=True, disabled=not has_sql, help="Generate SQL first to edit" if not has_sql else None, ) @@ -609,7 +792,7 @@ def main(): col1, col2 = st.columns([3, 1]) with col1: - if st.button("🚀 Run Edited Query", type="primary", width="stretch"): + if st.button("🚀 Run Edited Query", type="primary", use_container_width=True): with st.spinner("⚡ Running edited query..."): try: start_time = time.time() @@ -623,18 +806,20 @@ def main(): st.error(f"❌ Query execution failed: {str(e)}") st.info("💡 Check your SQL syntax and try again") with col2: - if st.button("❌ Cancel", width="stretch"): + if st.button("❌ Cancel", use_container_width=True): st.session_state.show_edit_sql = False st.rerun() + st.markdown("
", unsafe_allow_html=True) + with tab2: st.markdown( """
-

+

🗺️ Data Ontology Explorer

-

+

Explore the structured organization of all 110+ data fields across 15 business domains

@@ -657,8 +842,10 @@ def main(): # st.metric( # label="📅 Data Vintage", # value=PORTFOLIO_CONTEXT['overview']['vintage_range'] + + st.markdown("
", unsafe_allow_html=True) # ) - # with col3: + st.markdown("---") # st.metric( # label="🎯 Loss Rate", # value=PORTFOLIO_CONTEXT['performance_summary']['lifetime_loss_rate'] @@ -683,7 +870,7 @@ def main(): # Domain header st.markdown( f""" -

{selected_domain.replace('_', ' ').title()} @@ -745,8 +932,8 @@ def main(): # Field details card st.markdown( f""" -
-
{selected_field}
+
+
{selected_field}

Domain: {field_meta.domain}

Data Type: {field_meta.data_type}

Description: {field_meta.description}

@@ -774,7 +961,7 @@ def main(): st.markdown("### ⚖️ Risk Assessment Framework") st.markdown( f""" -
+

Credit Triangle: {PORTFOLIO_CONTEXT['risk_framework']['credit_triangle']}

  • Super Prime: {PORTFOLIO_CONTEXT['risk_framework']['risk_tiers']['super_prime']}
  • @@ -790,10 +977,10 @@ def main(): st.markdown( """
    -

    +

    🔧 Advanced Options

    -

    +

    Manual SQL queries and database schema exploration

    @@ -871,20 +1058,20 @@ def main(): # Create colored cards for each domain colors = [ - "#3498db", - "#e74c3c", - "#f39c12", - "#2ecc71", - "#9b59b6", + "#F3E5D9", + "#E7C8B2", + "#F6EDE2", + "#E4C590", + "#ECD9C7", ] color = colors[i // 3 % len(colors)] st.markdown( f""" -
    +
    {domain_name.replace('_', ' ').title()}
    -

    +

    {field_count} fields

    @@ -977,18 +1164,23 @@ def main(): st.markdown( f""" -
    -
    +
    💬 converSQL - Natural Language to SQL Query Generation Platform
    -
    +
    Powered by StreamlitDuckDB{ai_provider_text}Ontological Data Intelligence
    Implementation Showcase: Single Family Loan Analytics
    +
    """, unsafe_allow_html=True, diff --git a/assets/banner.png b/assets/banner.png new file mode 100644 index 0000000..90a29f9 Binary files /dev/null and b/assets/banner.png differ diff --git a/assets/banner_960x640.png b/assets/banner_960x640.png new file mode 100644 index 0000000..4901967 Binary files /dev/null and b/assets/banner_960x640.png differ diff --git a/assets/conversql_logo.svg b/assets/conversql_logo.svg new file mode 100644 index 0000000..8c34de2 --- /dev/null +++ b/assets/conversql_logo.svg @@ -0,0 +1,51 @@ + + + + + + + + + + + + + + + + + + + ConverSQL Logo - Corrected Layout + Logo for ConverSQL with corrected spacing and no overlap, using the brand palette. + + + + + + + + + + + + + + + + + + + + + + conver + SQL + + + + Talk to your data. Query → Execute → Visualize. + + \ No newline at end of file diff --git a/assets/favicon.ico b/assets/favicon.ico new file mode 100644 index 0000000..189dace Binary files /dev/null and b/assets/favicon.ico differ diff --git a/assets/favicon.png b/assets/favicon.png new file mode 100644 index 0000000..f19f8fe Binary files /dev/null and b/assets/favicon.png differ diff --git a/assets/google_oauth_logo.png b/assets/google_oauth_logo.png new file mode 100644 index 0000000..13ffdcf Binary files /dev/null and b/assets/google_oauth_logo.png differ diff --git a/docs/AI_ENGINES.md b/docs/AI_ENGINES.md index 65f26eb..e74263a 100644 --- a/docs/AI_ENGINES.md +++ b/docs/AI_ENGINES.md @@ -1,5 +1,9 @@ # AI Engine Development Guide +

    + converSQL logo +

    + ## Overview converSQL uses a modular **adapter pattern** for AI engine integration, making it easy to add support for new AI providers. This guide walks you through creating a new AI engine adapter from scratch. diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 8bc5c25..73e7ba4 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -1,5 +1,9 @@ # converSQL Architecture +

    + converSQL logo +

    + ## Overview converSQL follows a **clean, layered architecture** designed for modularity, extensibility, and maintainability. This document provides a deep dive into the system design, component interactions, and architectural decisions. diff --git a/requirements.txt b/requirements.txt index 95bf4b6..ea0fa6f 100644 --- a/requirements.txt +++ b/requirements.txt @@ -25,6 +25,7 @@ black>=23.0.0 flake8>=6.1.0 isort>=5.12.0 mypy>=1.7.0 +types-requests>=2.31.0.10 coverage>=7.3.0 # Optional: Enhanced performance and monitoring diff --git a/src/ai_engines/bedrock_adapter.py b/src/ai_engines/bedrock_adapter.py index 0492c1a..6c63ed3 100644 --- a/src/ai_engines/bedrock_adapter.py +++ b/src/ai_engines/bedrock_adapter.py @@ -5,7 +5,7 @@ import json import os -from typing import Any, Dict, Tuple +from typing import Any, Dict, Optional, Tuple from .base import AIEngineAdapter @@ -30,11 +30,11 @@ def __init__(self, config: Dict[str, Any] = None): - guardrail_id: Optional Bedrock Guardrail ID - guardrail_version: Optional Bedrock Guardrail version """ - self.client = None - self.model_id = None - self.region = None - self.guardrail_id = None - self.guardrail_version = None + self.client: Optional[Any] = None + self.model_id: Optional[str] = None + self.region: Optional[str] = None + self.guardrail_id: Optional[str] = None + self.guardrail_version: Optional[str] = None super().__init__(config) def _initialize(self) -> None: @@ -135,20 +135,31 @@ def generate_sql(self, prompt: str) -> Tuple[str, str]: } # Prepare invoke_model parameters + model_id = self.model_id + if model_id is None: + return "", "Bedrock client not available. Check AWS configuration." + invoke_params = { - "modelId": self.model_id, + "modelId": model_id, "body": json.dumps(request_body), "contentType": "application/json", "accept": "application/json", } # Add guardrails if configured - if self.guardrail_id: - invoke_params["guardrailIdentifier"] = self.guardrail_id - invoke_params["guardrailVersion"] = self.guardrail_version + guardrail_id = self.guardrail_id + guardrail_version = self.guardrail_version + if guardrail_id is not None: + invoke_params["guardrailIdentifier"] = guardrail_id + if guardrail_version is not None: + invoke_params["guardrailVersion"] = guardrail_version # Call Bedrock API - response = self.client.invoke_model(**invoke_params) + client = self.client + if client is None: + return "", "Bedrock client not available. Check AWS credentials and configuration." 
+ + response = client.invoke_model(**invoke_params) # Parse response response_body = json.loads(response["body"].read()) @@ -193,7 +204,7 @@ def provider_id(self) -> str: def get_model_info(self) -> Dict[str, Any]: """Get Bedrock model configuration details.""" - info = { + info: Dict[str, Any] = { "provider": "Amazon Bedrock", "model_id": self.model_id, "region": self.region, diff --git a/src/ai_engines/claude_adapter.py b/src/ai_engines/claude_adapter.py index d3f4985..358c373 100644 --- a/src/ai_engines/claude_adapter.py +++ b/src/ai_engines/claude_adapter.py @@ -4,7 +4,7 @@ """ import os -from typing import Any, Dict, Tuple +from typing import Any, Dict, Optional, Tuple from .base import AIEngineAdapter @@ -27,10 +27,10 @@ def __init__(self, config: Dict[str, Any] = None): - model: Model name (default from env) - max_tokens: Maximum response tokens """ - self.client = None - self.api_key = None - self.model = None - self.max_tokens = None + self.client: Optional[Any] = None + self.api_key: Optional[str] = None + self.model: Optional[str] = None + self.max_tokens: Optional[int] = None super().__init__(config) def _initialize(self) -> None: @@ -69,7 +69,7 @@ def _initialize(self) -> None: def is_available(self) -> bool: """Check if Claude API client is initialized and ready.""" - return self.client is not None and self.api_key is not None + return self.client is not None and self.api_key is not None and self.model is not None def generate_sql(self, prompt: str) -> Tuple[str, str]: """ @@ -86,9 +86,15 @@ def generate_sql(self, prompt: str) -> Tuple[str, str]: try: # Call Claude API - response = self.client.messages.create( - model=self.model, - max_tokens=self.max_tokens, + client = self.client + model = self.model + max_tokens = self.max_tokens + if client is None or model is None or max_tokens is None: + return "", "Claude API not available. Check CLAUDE_API_KEY configuration." 
+ + response = client.messages.create( + model=model, + max_tokens=max_tokens, temperature=0.0, # Deterministic for SQL generation messages=[{"role": "user", "content": prompt}], ) @@ -117,7 +123,7 @@ def generate_sql(self, prompt: str) -> Tuple[str, str]: elif "rate" in error_lower or "quota" in error_lower: error_msg += "\nAPI rate limit or quota exceeded" elif "model" in error_lower: - error_msg += f"\nModel {self.model} may not be available or accessible" + error_msg += f"\nModel {model} may not be available or accessible" return "", error_msg diff --git a/src/ai_engines/gemini_adapter.py b/src/ai_engines/gemini_adapter.py index ff486d3..4649795 100644 --- a/src/ai_engines/gemini_adapter.py +++ b/src/ai_engines/gemini_adapter.py @@ -4,7 +4,7 @@ """ import os -from typing import Any, Dict, Tuple +from typing import Any, Dict, Optional, Tuple, cast from .base import AIEngineAdapter @@ -28,11 +28,11 @@ def __init__(self, config: Dict[str, Any] = None): - max_tokens: Maximum response tokens - temperature: Temperature for generation (0.0-1.0) """ - self.client = None - self.model = None - self.api_key = None - self.max_output_tokens = None - self.temperature = None + self.client: Optional[Any] = None + self.model: Optional[Any] = None + self.api_key: Optional[str] = None + self.max_output_tokens: int = 4000 + self.temperature: float = 0.0 super().__init__(config) def _initialize(self) -> None: @@ -40,31 +40,49 @@ def _initialize(self) -> None: # Get configuration self.api_key = self.config.get("api_key", os.getenv("GOOGLE_API_KEY") or os.getenv("GEMINI_API_KEY")) model_name = self.config.get("model", os.getenv("GEMINI_MODEL", "gemini-1.5-pro")) - self.max_output_tokens = self.config.get("max_output_tokens", 4000) - self.temperature = self.config.get("temperature", 0.0) + if not isinstance(model_name, str) or not model_name.strip(): + model_name = "gemini-1.5-pro" + max_output_tokens = self.config.get("max_output_tokens", 4000) + try: + parsed_max_tokens = int(max_output_tokens) + self.max_output_tokens = parsed_max_tokens if parsed_max_tokens > 0 else 4000 + except (TypeError, ValueError): + self.max_output_tokens = 4000 + + temperature_config = self.config.get("temperature", 0.0) + try: + parsed_temperature = float(temperature_config) + self.temperature = parsed_temperature if 0.0 <= parsed_temperature <= 1.0 else 0.0 + except (TypeError, ValueError): + self.temperature = 0.0 if not self.api_key: return try: - import google.generativeai as genai + import importlib + + genai = cast(Any, importlib.import_module("google.generativeai")) # Configure API key genai.configure(api_key=self.api_key) # Initialize model with generation config - generation_config = { + generation_config: Dict[str, Any] = { "temperature": self.temperature, "max_output_tokens": self.max_output_tokens, "top_p": 0.95, "top_k": 40, } - self.model = genai.GenerativeModel(model_name=model_name, generation_config=generation_config) + model_instance = cast( + Any, genai.GenerativeModel(model_name=str(model_name), generation_config=cast(Any, generation_config)) + ) + self.model = model_instance # Optional: Test with minimal request to verify model initialization try: - self.model.generate_content("test") + model_instance.generate_content("test") # If we get here, the model is working except Exception: # Keep model - might work for actual requests @@ -98,7 +116,11 @@ def generate_sql(self, prompt: str) -> Tuple[str, str]: try: # Generate content - response = self.model.generate_content(prompt) + model = self.model + if model is 
None: + return "", "Gemini not available. Check GOOGLE_API_KEY configuration." + + response = model.generate_content(prompt) # Extract text from response if response.text: @@ -178,25 +200,33 @@ def set_safety_settings(self, safety_settings: Dict[str, Any]) -> None: 'HARM_CATEGORY_HATE_SPEECH': 'BLOCK_MEDIUM_AND_ABOVE', } """ - if not self.model: + model = self.model + if model is None: print("⚠️ Cannot set safety settings: Model not initialized") return try: - import google.generativeai as genai + import importlib + + genai = cast(Any, importlib.import_module("google.generativeai")) # Reconstruct model with new safety settings - generation_config = { + generation_config: Dict[str, Any] = { "temperature": self.temperature, "max_output_tokens": self.max_output_tokens, "top_p": 0.95, "top_k": 40, } - model_name = self.model.model_name if hasattr(self.model, "model_name") else "gemini-1.5-pro" + model_name = model.model_name if hasattr(model, "model_name") else "gemini-1.5-pro" - self.model = genai.GenerativeModel( - model_name=model_name, generation_config=generation_config, safety_settings=safety_settings + self.model = cast( + Any, + genai.GenerativeModel( + model_name=str(model_name), + generation_config=cast(Any, generation_config), + safety_settings=safety_settings, + ), ) except Exception as e: diff --git a/src/ai_service.py b/src/ai_service.py index 70df333..7913776 100644 --- a/src/ai_service.py +++ b/src/ai_service.py @@ -13,6 +13,7 @@ # Import new adapters from src.ai_engines import BedrockAdapter, ClaudeAdapter, GeminiAdapter +from src.prompts import build_sql_generation_prompt # Load environment variables load_dotenv() @@ -115,83 +116,7 @@ def _create_prompt_hash(self, user_question: str, schema_context: str) -> str: def _build_sql_prompt(self, user_question: str, schema_context: str) -> str: """Build the SQL generation prompt.""" - return f"""You are an expert Single Family Loan loan performance data analyst. Write a single, clean DuckDB-compatible SQL query. - -Database Schema Context: -{schema_context} - -User Question: {user_question} - -LOAN PERFORMANCE DOMAIN EXPERTISE: -You are an expert in single-family mortgage loan analytics with deep understanding of loan performance, risk assessment, and portfolio management. -This represents a comprehensive single-family loan portfolio with deep performance history. 
- -CRITICAL FIELD CONTEXT & RISK FRAMEWORKS: - -**Geographic Fields:** -- STATE: 2-letter codes (TX, CA, FL lead volumes) - California 13.6% of portfolio -- ZIP: First 3 digits (902=CA, 100=NY metro, 750=TX) - use for regional concentration -- MSA: Metropolitan Statistical Area codes - major markets drive performance trends - -**Credit Risk Tiers (Use these exact breakpoints):** -- CSCORE_B Credit Scores: - * 740+ = Super Prime (premium pricing, <1% default risk) - * 680-739 = Prime (standard pricing, moderate risk) - * 620-679 = Near Prime (risk-based pricing, elevated risk) - * <620 = Subprime (highest risk, limited origination post-2008) -- OLTV/CLTV Loan-to-Value: - * ≤80% = Low Risk (traditional lending sweet spot) - * 80-90% = Moderate Risk (standard conforming) - * 90-95% = Elevated Risk (PMI required) - * >95% = High Risk (limited recent origination) -- DTI Debt-to-Income: - * ≤28% = Conservative (prime borrower profile) - * 28-36% = Standard (typical mortgage underwriting) - * 36-45% = Elevated (requires compensating factors) - * >45% = High Risk (rare in dataset) - -**Performance & Vintage Intelligence:** -- DLQ_STATUS: '00'=Current (96%+), '01'=30-59 days (2-3%), '02'=60-89 days (<1%), higher = serious distress -- LOAN_AGE: Critical for vintage analysis - 2008 crisis (age 180+ months), 2020-2021 refi boom (age 24-48 months) -- ORIG_DATE: Key vintages = 2008-2012 (post-crisis), 2020-2021 (refi boom), 2022+ (rising rates) - -**Product & Channel Analysis:** -- PURPOSE: P=Purchase (portfolio growth), R=Refinance (rate optimization), C=Cash-out (credit event) -- PROP: SF=Single Family (85%+), PU=Planned Development (10%), CO=Condo (5%), others minimal -- CHANNEL: R=Retail (direct), C=Correspondent (volume), B=Broker (specialized) -- SELLER: Top institutions drive volume - use for counterparty analysis - -**Financial Intelligence:** -- ORIG_UPB vs CURR_UPB: Track paydown behavior (faster = prime, slower = risk) -- ORIG_RATE: Historical context - 2020-2021 ultra-low (2-3%), 2022+ rising (5-7%+) -- MI_PCT: Mortgage insurance percentage - indicators of LTV >80% - -ADVANCED QUERY PATTERNS: - -**Risk Segmentation Queries:** -- Always use credit score tiers: CASE WHEN CSCORE_B >= 740 THEN 'Super Prime' WHEN CSCORE_B >= 680 THEN 'Prime'... -- LTV risk analysis: CASE WHEN OLTV <= 80 THEN 'Low Risk' WHEN OLTV <= 90 THEN 'Moderate'... 
-- Combine risk factors for comprehensive view - -**Vintage & Performance Analysis:** -- Use LOAN_AGE for cohort analysis: WHERE LOAN_AGE BETWEEN 24 AND 36 (2020-2021 vintage) -- Performance trends: GROUP BY EXTRACT(YEAR FROM ORIG_DATE) for annual cohorts -- Current performance: WHERE DLQ_STATUS = '00' for current loans - -**Geographic & Market Intelligence:** -- High-volume states: WHERE STATE IN ('CA','TX','FL','NY','PA') for 50%+ of portfolio -- Market concentration: Use ZIP first 3 digits for metro area analysis -- State-level risk: Compare delinquency rates by STATE for market risk assessment - -**Business Intelligence Defaults:** -- Loan counts: Use COUNT(*) for portfolio metrics, COUNT(DISTINCT LOAN_SEQ_NO) for unique loans -- Dollar amounts: ROUND(SUM(CURR_UPB)/1000000,1) for millions, /1000000000 for billions -- Rates: ROUND(AVG(ORIG_RATE),3) for precision, compare to market benchmarks -- Performance: Calculate current/delinquent ratios, use weighted averages for UPB -- Always filter NULL values: WHERE field IS NOT NULL for meaningful analysis -- Use LIMIT 20 for top analyses unless specified otherwise - -Write ONLY the SQL query - no explanations:""" + return build_sql_generation_prompt(user_question, schema_context) @st.cache_data(ttl=PROMPT_CACHE_TTL) def _cached_generate_sql(_self, user_question: str, schema_context: str, provider: str) -> Tuple[str, str]: @@ -277,7 +202,8 @@ def initialize_ai_client() -> Tuple[Optional[AIService], str]: """Initialize AI client - backward compatibility.""" service = get_ai_service() if service.is_available(): - return service, service.get_active_provider() + provider = service.get_active_provider() or "none" + return service, provider return None, "none" diff --git a/src/branding.py b/src/branding.py new file mode 100644 index 0000000..4ec8110 --- /dev/null +++ b/src/branding.py @@ -0,0 +1,52 @@ +"""Branding helpers for converSQL assets.""" + +from __future__ import annotations + +import base64 +import re +from functools import lru_cache +from pathlib import Path +from typing import Optional + +_PROJECT_ROOT = Path(__file__).resolve().parent.parent +_ASSETS_DIR = _PROJECT_ROOT / "assets" +_LOGO_PATH = _ASSETS_DIR / "conversql_logo.svg" +_FAVICON_PATH = _ASSETS_DIR / "favicon.png" + + +@lru_cache(maxsize=1) +def get_logo_svg() -> Optional[str]: + """Return the SVG markup for the converSQL logo if available.""" + try: + raw_svg = _LOGO_PATH.read_text(encoding="utf-8") + + # Remove XML declaration which prevents inline rendering in HTML contexts + cleaned_svg = re.sub(r"^\s*<\?xml[^>]*\?>", "", raw_svg, count=1, flags=re.IGNORECASE | re.MULTILINE) + + return cleaned_svg.strip() + except FileNotFoundError: + return None + + +def get_logo_path() -> Path: + """Return the filesystem path to the converSQL logo asset.""" + return _LOGO_PATH + + +@lru_cache(maxsize=1) +def get_logo_data_uri() -> Optional[str]: + """Return a data URI suitable for embedding the SVG logo in HTML ``img`` tags.""" + svg = get_logo_svg() + if not svg: + return None + + encoded = base64.b64encode(svg.encode("utf-8")).decode("ascii") + return f"data:image/svg+xml;base64,{encoded}" + + +@lru_cache(maxsize=1) +def get_favicon_path() -> Optional[Path]: + """Return the favicon path if it exists.""" + if _FAVICON_PATH.exists(): + return _FAVICON_PATH + return None diff --git a/src/core.py b/src/core.py index 44b1d48..dabf3bb 100644 --- a/src/core.py +++ b/src/core.py @@ -6,7 +6,7 @@ import glob import os -from typing import Dict, List, Optional, Tuple +from typing import Any, Dict, 
List, Optional, Tuple import duckdb import pandas as pd @@ -61,10 +61,10 @@ def sync_data_if_needed(force: bool = False) -> bool: conn = duckdb.connect() # Quick validation - try to read first file test_query = f"SELECT COUNT(*) FROM '{parquet_files[0]}'" - result = conn.execute(test_query).fetchone() + row = conn.execute(test_query).fetchone() conn.close() - if result and result[0] > 0: + if row and row[0] > 0: print(f"✅ Found {len(parquet_files)} valid parquet file(s) with data") return True else: @@ -84,15 +84,15 @@ def sync_data_if_needed(force: bool = False) -> bool: if force: sync_args.append("--force") - result = subprocess.run(sync_args, capture_output=True, text=True) + sync_result = subprocess.run(sync_args, capture_output=True, text=True) - if result.returncode == 0: + if sync_result.returncode == 0: print("✅ R2 sync completed successfully") return True else: - print(f"⚠️ R2 sync failed: {result.stderr}") - if result.stdout: - print(f"📋 Sync output: {result.stdout}") + print(f"⚠️ R2 sync failed: {sync_result.stderr}") + if sync_result.stdout: + print(f"📋 Sync output: {sync_result.stdout}") return False except Exception as e: @@ -148,7 +148,8 @@ def initialize_ai_client() -> Tuple[Optional[object], str]: """Initialize AI client - uses new AI service.""" service = get_ai_service() if service.is_available(): - return service, service.get_active_provider() + provider = service.get_active_provider() or "none" + return service, provider return None, "none" @@ -193,7 +194,7 @@ def get_analyst_questions() -> Dict[str, str]: } -def get_ai_service_status() -> Dict[str, any]: +def get_ai_service_status() -> Dict[str, Any]: """Get AI service status for UI display.""" service = get_ai_service() return { diff --git a/src/d1_logger.py b/src/d1_logger.py index 00784e5..d091796 100644 --- a/src/d1_logger.py +++ b/src/d1_logger.py @@ -5,9 +5,9 @@ """ import os -from typing import Dict, Optional +from typing import Any, Dict, List, Optional -import requests +import requests # type: ignore[import-untyped] from dotenv import load_dotenv load_dotenv() @@ -26,7 +26,7 @@ def is_enabled(self) -> bool: """Check if D1 logging is enabled.""" return self.enabled - def _execute_query(self, sql: str, params: list = None) -> Optional[Dict]: + def _execute_query(self, sql: str, params: Optional[List[Any]] = None) -> Optional[Dict[str, Any]]: """Execute a D1 query via REST API.""" if not self.enabled: return None @@ -35,7 +35,7 @@ def _execute_query(self, sql: str, params: list = None) -> Optional[Dict]: headers = {"Authorization": f"Bearer {self.api_token}", "Content-Type": "application/json"} - payload = {"sql": sql} + payload: Dict[str, Any] = {"sql": sql} if params: payload["params"] = params @@ -77,7 +77,7 @@ def log_user_query( self._execute_query(sql, [user_id, email, question, sql_query, ai_provider, execution_time]) - def get_user_stats(self, user_id: str) -> Dict: + def get_user_stats(self, user_id: str) -> Dict[str, Any]: """Get basic user statistics.""" if not self.enabled: return {} @@ -98,7 +98,7 @@ def get_user_stats(self, user_id: str) -> Dict: # Global logger instance -_d1_logger = None +_d1_logger: Optional[D1Logger] = None def get_d1_logger() -> D1Logger: diff --git a/src/prompts/__init__.py b/src/prompts/__init__.py new file mode 100644 index 0000000..36d170b --- /dev/null +++ b/src/prompts/__init__.py @@ -0,0 +1,5 @@ +"""Prompt library for converSQL AI interactions.""" + +from .sql_generation import build_sql_generation_prompt + +__all__ = ["build_sql_generation_prompt"] diff --git 
a/src/prompts/sql_generation.py b/src/prompts/sql_generation.py new file mode 100644 index 0000000..49797ac --- /dev/null +++ b/src/prompts/sql_generation.py @@ -0,0 +1,91 @@ +"""SQL generation prompts used by converSQL AI adapters.""" + +from __future__ import annotations + +from typing import Final + +_PROMPT_PREAMBLE: Final[str] = ( + "You are an expert Single Family Loan loan performance data analyst. " + "Write a single, clean DuckDB-compatible SQL query." +) + + +def build_sql_generation_prompt(user_question: str, schema_context: str) -> str: + """Construct the full SQL generation prompt for the AI adapters.""" + return f"""{_PROMPT_PREAMBLE} + +Database Schema Context: +{schema_context} + +User Question: {user_question} + +LOAN PERFORMANCE DOMAIN EXPERTISE: +You are an expert in single-family mortgage loan analytics with deep understanding of loan performance, risk assessment, and portfolio management. +This represents a comprehensive single-family loan portfolio with deep performance history. + +CRITICAL FIELD CONTEXT & RISK FRAMEWORKS: + +**Geographic Fields:** +- STATE: 2-letter codes (TX, CA, FL lead volumes) - California 13.6% of portfolio +- ZIP: First 3 digits (902=CA, 100=NY metro, 750=TX) - use for regional concentration +- MSA: Metropolitan Statistical Area codes - major markets drive performance trends + +**Credit Risk Tiers (Use these exact breakpoints):** +- CSCORE_B Credit Scores: + * 740+ = Super Prime (premium pricing, <1% default risk) + * 680-739 = Prime (standard pricing, moderate risk) + * 620-679 = Near Prime (risk-based pricing, elevated risk) + * <620 = Subprime (highest risk, limited origination post-2008) +- OLTV/CLTV Loan-to-Value: + * ≤80% = Low Risk (traditional lending sweet spot) + * 80-90% = Moderate Risk (standard conforming) + * 90-95% = Elevated Risk (PMI required) + * >95% = High Risk (limited recent origination) +- DTI Debt-to-Income: + * ≤28% = Conservative (prime borrower profile) + * 28-36% = Standard (typical mortgage underwriting) + * 36-45% = Elevated (requires compensating factors) + * >45% = High Risk (rare in dataset) + +**Performance & Vintage Intelligence:** +- DLQ_STATUS: '00'=Current (96%+), '01'=30-59 days (2-3%), '02'=60-89 days (<1%), higher = serious distress +- LOAN_AGE: Critical for vintage analysis - 2008 crisis (age 180+ months), 2020-2021 refi boom (age 24-48 months) +- ORIG_DATE: Key vintages = 2008-2012 (post-crisis), 2020-2021 (refi boom), 2022+ (rising rates) + +**Product & Channel Analysis:** +- PURPOSE: P=Purchase (portfolio growth), R=Refinance (rate optimization), C=Cash-out (credit event) +- PROP: SF=Single Family (85%+), PU=Planned Development (10%), CO=Condo (5%), others minimal +- CHANNEL: R=Retail (direct), C=Correspondent (volume), B=Broker (specialized) +- SELLER: Top institutions drive volume - use for counterparty analysis + +**Financial Intelligence:** +- ORIG_UPB vs CURR_UPB: Track paydown behavior (faster = prime, slower = risk) +- ORIG_RATE: Historical context - 2020-2021 ultra-low (2-3%), 2022+ rising (5-7%+) +- MI_PCT: Mortgage insurance percentage - indicators of LTV >80% + +ADVANCED QUERY PATTERNS: + +**Risk Segmentation Queries:** +- Always use credit score tiers: CASE WHEN CSCORE_B >= 740 THEN 'Super Prime' WHEN CSCORE_B >= 680 THEN 'Prime'... +- LTV risk analysis: CASE WHEN OLTV <= 80 THEN 'Low Risk' WHEN OLTV <= 90 THEN 'Moderate'... 
+- Combine risk factors for comprehensive view + +**Vintage & Performance Analysis:** +- Use LOAN_AGE for cohort analysis: WHERE LOAN_AGE BETWEEN 24 AND 36 (2020-2021 vintage) +- Performance trends: GROUP BY EXTRACT(YEAR FROM ORIG_DATE) for annual cohorts +- Current performance: WHERE DLQ_STATUS = '00' for current loans + +**Geographic & Market Intelligence:** +- High-volume states: WHERE STATE IN ('CA','TX','FL','NY','PA') for 50%+ of portfolio +- Market concentration: Use ZIP first 3 digits for metro area analysis +- State-level risk: Compare delinquency rates by STATE for market risk assessment + +**Business Intelligence Defaults:** +- Loan counts: Use COUNT(*) for portfolio metrics, COUNT(DISTINCT LOAN_SEQ_NO) for unique loans +- Dollar amounts: ROUND(SUM(CURR_UPB)/1000000,1) for millions, /1000000000 for billions +- Rates: ROUND(AVG(ORIG_RATE),3) for precision, compare to market benchmarks +- Performance: Calculate current/delinquent ratios, use weighted averages for UPB +- Always filter NULL values: WHERE field IS NOT NULL for meaningful analysis +- Use LIMIT 20 for top analyses unless specified otherwise + +Write ONLY the SQL query - no explanations:""" diff --git a/src/simple_auth_components.py b/src/simple_auth_components.py index ae03d1d..487adb7 100644 --- a/src/simple_auth_components.py +++ b/src/simple_auth_components.py @@ -5,10 +5,11 @@ """ import os -import time import streamlit as st +from .branding import get_logo_data_uri +from .core import get_ai_service_status from .simple_auth import get_auth_service, handle_oauth_callback @@ -16,114 +17,259 @@ def render_login_page(): """Render the login page with Google OAuth.""" auth = get_auth_service() + # Inject converSQL-specific styling for the login experience + st.markdown( + """ + + """, + unsafe_allow_html=True, + ) + # Center the login content col1, col2, col3 = st.columns([1, 2, 1]) with col2: - # Professional login header + logo_data_uri = get_logo_data_uri() + if logo_data_uri: + logo_block = "" + else: + logo_block = "" + + auth_url = auth.get_auth_url() + demo_mode = os.getenv("DEMO_MODE", "false").lower() == "true" + + if auth_url: + login_cta = ( + f"" + ) + else: + login_cta = "" + st.markdown( - """ -
    -

    - 🏠 Single Family Loan Analytics -

    -

    - AI-Powered Loan Portfolio Intelligence -

    + f""" + + """, unsafe_allow_html=True, ) - # Professional login card - with st.container(): - st.markdown( - """ -
    -

    - 🔐 Secure Access Required -

    -

    - Sign in with your Google account to access the loan analytics platform. -

    -
    - """, - unsafe_allow_html=True, - ) - - st.markdown("
    ", unsafe_allow_html=True) - - # Google Sign-In Button - if st.button("🔐 Sign in with Google", type="primary", width="stretch"): - auth_url = auth.get_auth_url() - - if auth_url: - DEMO_MODE = os.getenv("DEMO_MODE", "false").lower() == "true" - if DEMO_MODE: - st.info("🔗 Redirecting to Google OAuth...") - st.code(auth_url) - st.markdown(f"[Click here if not redirected automatically]({auth_url})") - - # Multiple redirect methods for better compatibility - st.markdown( - f""" - - - """, - unsafe_allow_html=True, - ) - - # Also provide a direct link as backup - st.markdown( - f""" - - """, - unsafe_allow_html=True, - ) - - with st.spinner("Redirecting to Google..."): - time.sleep(2) - else: - st.error("❌ Authentication service unavailable. Please check your Google OAuth configuration.") - - st.markdown("---") + if demo_mode and auth_url: + st.info("🔗 Demo mode: use the OAuth link below if you are not redirected automatically.") + st.code(auth_url) + st.markdown(f"[Open Google OAuth in this tab]({auth_url})") + elif not auth_url: + st.error("❌ Authentication service unavailable. Please check your Google OAuth configuration.") # Info section with st.expander("ℹ️ About This Application", expanded=False): st.markdown( """ - **Single Family Loan Analytics Platform** provides comprehensive loan data analysis: - - **Core Features:** - - 🤖 **AI-Powered Queries**: Natural language to SQL conversion - - 📊 **Interactive Analytics**: Dynamic data exploration and visualization - - ⚡ **High Performance**: Optimized query engine with DuckDB - - 🔒 **Secure Access**: Protected with Google OAuth - - 📈 **Portfolio Insights**: Risk metrics and performance tracking - - **Data Infrastructure:** - - Single Family Loan performance data (56.8M+ loans) - - Real-time data sync from Cloudflare R2 storage - - Comprehensive data dictionary with domain expertise + **converSQL** pairs ontological intelligence with natural language interfaces to deliver: + + - 🤖 **AI-Guided SQL** – Structured prompts that bake in mortgage risk heuristics. + - 🧠 **Ontology-Aware Context** – 15 business domains and 110+ field definitions on tap. + - ⚡ **Streamlined Execution** – DuckDB acceleration, cached schema, and curated prompts. + - 🔒 **Enterprise Guardrails** – OAuth sign-in, optional audit logging, and provider failovers. """ ) - # Footer + # Footer matching main app styling + ai_status = get_ai_service_status() + if ai_status.get("available"): + provider = ai_status.get("active_provider", "ai_assistant") + if provider == "claude": + ai_provider_text = "Claude API (Anthropic)" + elif provider == "bedrock": + ai_provider_text = "Amazon Bedrock" + else: + ai_provider_text = provider.replace("_", " ").title() + else: + ai_provider_text = "Manual Analysis Mode" + st.markdown("---") st.markdown( - "
    " - "Powered by Streamlit and Google OAuth" - "
    ", + f""" +
    +
    + 💬 converSQL - Natural Language to SQL Query Generation Platform +
    +
    + Powered by StreamlitDuckDB{ai_provider_text}Ontological Data Intelligence
    + + Implementation Showcase: Single Family Loan Analytics + +
    + +
    + """, unsafe_allow_html=True, ) @@ -139,16 +285,6 @@ def render_user_menu(): with st.sidebar: st.markdown("---") - # User profile section - st.markdown( - """ -
    -

    👤 Profile

    -
    - """, - unsafe_allow_html=True, - ) - # User info st.markdown( f"
    👤 {user.get('name', 'User')}
    ", @@ -194,7 +330,8 @@ def wrapper(): return # User is authenticated - show the main app + result = main_app_function() render_user_menu() - return main_app_function() + return result return wrapper