26 changes: 24 additions & 2 deletions .github/copilot-instructions.md
@@ -4,28 +4,50 @@
- `app.py` is the Streamlit shell: it hydrates cached data via `initialize_app_data()`, gates the UI through `simple_auth_wrapper`, and delegates all heavy work to `src/core.py` and `src/ai_service.py`.
- `src/core.py` owns DuckDB execution and orchestrates data prep. `scan_parquet_files()` will run `scripts/sync_data.py` if `data/processed/*.parquet` are missing, so keep a local Parquet copy handy during tests to avoid network pulls.
- `src/ai_service.py` routes natural-language prompts into adapter implementations in `src/ai_engines/`. The prompt embeds the mortgage risk heuristics baked into `src/data_dictionary.py`; reuse `AIService._build_sql_prompt()` instead of crafting ad-hoc prompts (see the sketch after this list).
- `src/visualization.py` handles Altair chart generation with support for Bar, Line, Scatter, Histogram, and Heatmap charts. Use `make_chart()` for consistent visualization output.
- `src/ui/` contains modular UI components: `tabs.py` for main interface tabs, `components.py` for reusable widgets, `sidebar.py` for navigation, and `style.py` for theming.
- `src/services/` contains service layer abstractions: `ai_service.py` and `data_service.py` for business logic separation.
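
For example, a minimal sketch of reusing the shared prompt builder. This assumes `AIService()` constructs without credentials (the adapter notes below say it should) and that `_build_sql_prompt()` takes the user question as its first argument; verify the signature in `src/ai_service.py` before relying on it:

```python
from src.ai_service import AIService

service = AIService()  # expected to survive with zero credentials
# Reuse the shared prompt so the data_dictionary risk heuristics stay embedded
prompt = service._build_sql_prompt("Which loan vintages show the highest delinquency?")
print(prompt)
```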

## Data + ontology expectations
- Loan metadata lives in `data/processed/data.parquet`; schema text comes from `generate_enhanced_schema_context()` which stitches DuckDB types with ontology metadata from `src/data_dictionary.py` and `docs/DATA_DICTIONARY.md`.
- When adding derived features, update both the Parquet schema and the ontology entry so AI output and the Ontology Explorer tab stay in sync.
- The Streamlit Ontology tab imports `LOAN_ONTOLOGY` and `PORTFOLIO_CONTEXT`; breaking their expected shape (a dict mapping field names to `FieldMetadata` entries) will crash the UI. A sketch of adding an entry follows this list.
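
Illustrative sketch only: the `FieldMetadata` constructor arguments below are assumptions; check `src/data_dictionary.py` for the real fields before mirroring this shape.

```python
from src.data_dictionary import LOAN_ONTOLOGY, FieldMetadata

# Hypothetical derived feature; keep the dict-of-FieldMetadata shape intact
# and add the matching column to data/processed/data.parquet.
LOAN_ONTOLOGY["dti_bucket"] = FieldMetadata(
    description="Debt-to-income ratio bucketed into risk bands",  # assumed field
    dtype="VARCHAR",  # assumed field; must agree with the DuckDB/Parquet type
)
```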

## Visualization layer
- `src/visualization.py` provides chart generation using Altair with automatic type detection and error handling.
- Supported chart types: Bar, Line, Scatter, Histogram, Heatmap (defined in `ALLOWED_CHART_TYPES`).
- Charts auto-resolve data sources in priority order: explicit DataFrame → manual query results → AI query results.
- Use `_validate_chart_params()` for parameter validation before calling `make_chart()`; a usage sketch follows this list.
- The visualization system handles type coercion, sorting, and provides fallbacks for missing data.
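
A minimal usage sketch. The keyword arguments (`chart_type`, `x`, `y`) are assumptions about `make_chart()`'s signature, not confirmed API; adjust to the actual parameters in `src/visualization.py`:

```python
import pandas as pd

from src.visualization import make_chart

df = pd.DataFrame({"grade": ["A", "B", "C"], "default_rate": [0.012, 0.034, 0.071]})

# chart_type must be one of ALLOWED_CHART_TYPES (Bar, Line, Scatter, Histogram, Heatmap)
chart = make_chart(df, chart_type="Bar", x="grade", y="default_rate")
```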

## AI engine adapters
- Adapters must subclass `AIEngineAdapter` in `src/ai_engines/base.py`, expose `provider_id`, `name`, `is_available()`, and `generate_sql()`, then be exported via `src/ai_engines/__init__.py` and registered inside `AIService.adapters` (a minimal adapter sketch follows this list).
- Use `clean_sql_response()` to strip markdown fences, and return `(sql, "")` on success; downstream callers treat any non-empty error string as failure.
- Keep `AI_PROVIDER` fallbacks working—tests rely on `AIService` surviving with zero credentials, so default to "unavailable" rather than raising.
- Rate limiting is handled automatically via the base adapter class with configurable `AI_MAX_REQUESTS_PER_MINUTE`.
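
A minimal offline adapter under the contract above. The `generate_sql()` parameters and the import location of `clean_sql_response()` are assumptions; verify both against `src/ai_engines/base.py`:

```python
from src.ai_engines.base import AIEngineAdapter, clean_sql_response  # import path assumed


class EchoAdapter(AIEngineAdapter):
    """Toy adapter that needs no credentials, so AIService keeps working offline."""

    provider_id = "echo"
    name = "Echo (offline)"

    def is_available(self) -> bool:
        # Real adapters should return False (never raise) when credentials are missing.
        return True  # the toy adapter needs nothing

    def generate_sql(self, question: str, schema_context: str = "") -> tuple[str, str]:
        raw = "```sql\nSELECT 1;\n```"  # providers often wrap SQL in markdown fences
        sql = clean_sql_response(raw)
        return sql, ""  # a non-empty second element signals failure downstream
```

Export the class from `src/ai_engines/__init__.py` and add it to `AIService.adapters` so it participates in provider fallback.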

## Developer workflows
- Install deps with `pip install -r requirements.txt`; prefer `make setup` for a clean environment (installs + cleanup).
- Fast test cycle: `make test-unit` skips integration markers; `make test` mirrors CI (pytest + coverage). Integration adapters are ignored by default via `pytest.ini`; remove the `--ignore` flags there if you really need live API coverage.
- Lint/format stack is Black 120 cols + isort + flake8 + mypy. `make ci` runs the whole suite and matches the GitHub Actions workflow.
- Local environment: use `conda activate gcp-pipeline`, then run `make format`, `make ci`, and `make dev` for local testing.

## Environment & secrets
- Copy `.env.example` to `.env`, then set one provider block (`CLAUDE_API_KEY`, `AWS_*`, or `GEMINI_API_KEY`). Without credentials the UI drops to "AI unavailable" but manual SQL still works; an example `.env` follows this list.
- Data sync needs Cloudflare R2 keys (`R2_ACCESS_KEY_ID`, `R2_SECRET_ACCESS_KEY`, `R2_ENDPOINT_URL`). In offline dev, set `FORCE_DATA_REFRESH=false` and place Parquet files under `data/processed/`.
- Authentication defaults to Google OAuth (`ENABLE_AUTH=true`); set it to `false` for local hacking or provide `GOOGLE_CLIENT_ID/SECRET` plus HTTPS when deploying.
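
A hypothetical minimal `.env` for offline development, using only keys named above (all values are placeholders):

```bash
ENABLE_AUTH=false          # skip Google OAuth for local hacking
FORCE_DATA_REFRESH=false   # rely on Parquet files already under data/processed/
# Set exactly one provider block; leaving all unset just disables AI features:
CLAUDE_API_KEY=sk-placeholder
```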

## Practical tips
- Clear Streamlit caches with `streamlit cache clear` if schema or ontology changes; otherwise stale `@st.cache_data` results linger.
- When writing new ingest code, mirror the type-casting helpers in `notebooks/pipeline_csv_to_parquet*.ipynb` so DuckDB types stay compatible.
- Logging to Cloudflare D1 is optional—`src/d1_logger.py` silently no-ops without `CLOUDFLARE_*` secrets, so you can call it safely even in tests.
- For visualization work, test with different data types and edge cases—the chart system includes extensive error handling and type coercion.

## Linting & code quality
- Follow PEP 8: always add spaces after commas in function calls, lists, and tuples
- Remove unused imports to avoid F401 errors; use `isort` and check with `make format`
- For type hints, ensure all arguments match expected types; cast with `str()` or provide defaults when needed
- Module-level imports must be at the top of files before any other code to avoid E402 errors
- Use `make lint` frequently during development to catch issues early
- Target Python 3.11+ features: use built-in types (`list[T]`) instead of `typing.List[T]`
- Altair charts should handle large datasets; `alt.data_transformers.disable_max_rows()` is called in the visualization module (a snippet illustrating several of these rules follows)
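
A small before/after illustrating the comma-spacing, unused-import, and built-in-generics rules above:

```python
# Before: F401 unused import, missing comma spaces, typing.List on Python 3.11+
# from typing import List
# def top_loans(ids:List[int],limit=10): ...

# After: built-in generics, spaced arguments, explicit annotations
def top_loans(ids: list[int], limit: int = 10) -> list[int]:
    return sorted(ids)[:limit]
```
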
9 changes: 9 additions & 0 deletions CONTRIBUTING.md
@@ -83,6 +83,15 @@ black src/ app.py
flake8 src/ app.py --max-line-length=100
```

### Package layout (v2)

The repository is introducing a modular core under `conversql/` while keeping `src/` as legacy during migration.

- `conversql/`: AI, data catalog, ontology, exec engines
- `src/`: existing modules used by the app and tests

For new code, prefer `conversql.*` imports. See `docs/ARCHITECTURE_V2.md` and `docs/MIGRATION.md`.
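
For instance (the `conversql` submodule path below is hypothetical; see `docs/ARCHITECTURE_V2.md` for the real layout):

```python
# Preferred for new code, once the target module exists under conversql/
from conversql.ontology import LoanOntology  # hypothetical name

# Legacy import, still supported during the migration
from src.data_dictionary import LOAN_ONTOLOGY
```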

### Front-end Styling

When updating Streamlit UI components:
8 changes: 7 additions & 1 deletion Makefile
@@ -30,6 +30,7 @@ help:
@echo " check-deps Check for dependency updates"
@echo " setup Complete setup for new development environment"
@echo " ci Run full CI checks (format, lint, test)"
@echo " clean-unused Remove deprecated/unused files safely (dry run; add APPLY=1 to delete)"
@echo ""
@echo "Usage: make <command>"

@@ -169,4 +170,9 @@ setup: install-dev clean
ci: clean format-check lint test-cov
	@echo "✅ All CI checks passed!"
	@echo ""
@echo "Ready to commit and push!"
@echo "Ready to commit and push!"

# Remove deprecated/unused files safely
clean-unused:
	@echo "🧹 Scanning for unused legacy files..."
	@bash scripts/cleanup_unused_files.sh $(if $(APPLY),--apply,)
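
Usage, per the dry-run default noted in the help text:

```bash
make clean-unused          # dry run: report what would be removed
make clean-unused APPLY=1  # passes --apply to the script and actually deletes
```
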
7 changes: 7 additions & 0 deletions README.md
@@ -1,3 +1,10 @@
## New Documentation

We have introduced new documentation to help you understand the modular architecture and migration path.

Please refer to the following documents:
- [Architecture v2: Modular Layout](docs/ARCHITECTURE_V2.md)
- [Migration Guide](docs/MIGRATION.md)

<p align="center">
<img src="assets/conversql_logo.svg" alt="converSQL logo" width="360" />