Vibhav dev #10
base: develop
Conversation
Pull request overview
This pull request introduces major enhancements to TableScanner, transitioning it from a basic SQLite query service to a production-ready data access platform with advanced query capabilities, robust configuration management, and improved architecture.
Changes:
- Adds a comprehensive query service with type-aware filtering, aggregations, FTS5 search, and connection pooling (see the illustrative request sketch after this list)
- Implements new services for schema analysis, statistics, validation, and type inference
- Enhances API models to support advanced filtering and aggregation operations
- Improves security with input validation, path sanitization, and custom exception handling
- Removes deprecated code and dependencies (pandas, PyYAML, itables, ipywidgets)
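To make the filtering and aggregation capabilities above concrete, here is a rough client-side sketch. The endpoint path, field names, and values are assumptions for illustration only and are not taken from the PR.

```python
import requests

# Hypothetical advanced-query payload: field names and the route are illustrative only.
payload = {
    "filters": [
        {"column": "score", "op": ">=", "value": 0.9},            # type-aware numeric filter
        {"column": "genome", "op": "contains", "value": "E. coli"},
    ],
    "aggregations": [{"column": "score", "func": "avg"}],
    "group_by": ["genome"],
    "limit": 50,
}

response = requests.post(
    "http://localhost:8000/object/local:example.db/tables/features/query",  # hypothetical route
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```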
Reviewed changes
Copilot reviewed 42 out of 44 changed files in this pull request and generated 25 comments.
| File | Description |
|---|---|
| tests/unit/test_query_service.py | Unit tests for QueryService covering filtering, sorting, aggregation, and SQL injection prevention |
| tests/integration/test_security_fixes.py | Integration tests for security fixes: variable limit enforcement, numeric validation, FTS5 safety |
| tests/integration/test_security.py | Security tests for path traversal prevention and CORS configuration |
| tests/integration/test_routes_advanced.py | Integration tests for advanced filtering, aggregations, sorting, pagination, and legacy compatibility |
| tests/integration/test_routes.py | Basic integration tests for health check, API docs, and deprecated endpoint removal |
| tests/integration/test_concurrency.py | Concurrency tests for connection pool under load |
| app/services/data/query_service.py | Core query service with type-aware filtering, aggregations, FTS5 search, and caching |
| app/services/data/connection_pool.py | Thread-safe SQLite connection pooling with lifecycle management |
| app/services/data/validation.py | JSON schema validation for DataTables Viewer configurations |
| app/services/data/type_inference.py | Rule-based type inference engine for column data types |
| app/services/data/statistics_service.py | Column statistics computation service |
| app/services/data/schema_service.py | Schema information retrieval service |
| app/services/data/schema_analyzer.py | Database schema introspection and profiling |
| app/services/data/fingerprint.py | Database fingerprinting for cache invalidation |
| app/utils/workspace.py | Enhanced workspace client with type retrieval and improved fallback logic |
| app/utils/sqlite.py | Simplified SQLite utilities (removed deprecated functions) |
| app/utils/cache.py | Improved ID sanitization with strict allow-list approach |
| app/utils/async_utils.py | Async utilities for running synchronous code in threads |
| app/utils/request_utils.py | Request processing utilities for routes |
| app/services/db_helper.py | Database helper service consolidating retrieval and validation logic |
| app/models.py | Enhanced models supporting advanced queries, filters, aggregations, and metadata |
| app/main.py | Updated application with custom exception handlers and improved CORS configuration |
| app/exceptions.py | Custom exceptions for better error handling |
| app/config_constants.py | Centralized configuration constants |
| static/viewer.html | Complete UI redesign with scientific light theme and enhanced features |
| pyproject.toml | Removed unnecessary dependencies (pandas, PyYAML, itables, ipywidgets) |
| docs/* | Comprehensive documentation updates |
app/services/data/connection_pool.py
Outdated
```python
except:
    pass
```
Copilot
AI
Jan 20, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
```diff
-except:
-    pass
+except Exception as e:
+    logger.warning("Failed to close connection from pool: %s", e)
```
app/services/data/schema_analyzer.py
Outdated
```python
        profile.min_value = result[0]
        profile.max_value = result[1]
        profile.avg_length = result[2] or 0.0
    except sqlite3.Error:
```
Copilot
AI
Jan 20, 2026
'except' clause does nothing but pass and there is no explanatory comment.
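Unlike the connection-pool comment above, this one ships without a suggested change. Below is a minimal sketch of one way to address it, assuming a module-level logger and a small profile dataclass; both are hypothetical stand-ins for the real objects in schema_analyzer.py.

```python
import logging
import sqlite3
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ColumnProfile:
    # Hypothetical stand-in for the profile object used in schema_analyzer.py.
    min_value: object = None
    max_value: object = None
    avg_length: float = 0.0

def fill_value_stats(cursor: sqlite3.Cursor, profile: ColumnProfile) -> None:
    """Populate min/max/avg-length stats, logging SQLite errors instead of swallowing them."""
    try:
        result = cursor.fetchone()
        profile.min_value = result[0]
        profile.max_value = result[1]
        profile.avg_length = result[2] or 0.0
    except sqlite3.Error as exc:
        # Best-effort profiling: record why the stats were skipped rather than passing silently.
        logger.debug("Skipping value statistics for column profile: %s", exc)
```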
Pull request overview
Copilot reviewed 42 out of 44 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.
```python
# We'll try to use a handle that looks like traversal
bad_handle = "local:../../../../etc/passwd"

response = client.get(f"/object/{bad_handle}/tables", headers=headers)
```
Copilot
AI
Jan 21, 2026
This assignment to 'response' is unnecessary as it is redefined before this value is used.
```python
    raise ValueError(f"Workspace API error: [{error_code}] {error_msg}")
except requests.exceptions.HTTPError as e:
    # Capture response body for better error messages
    error_detail = f"HTTP {e.response.status_code}"
```
Copilot
AI
Jan 21, 2026
This assignment to 'error_detail' is unnecessary as it is redefined before this value is used.
Suggested change:
```diff
-error_detail = f"HTTP {e.response.status_code}"
```
app/main.py
Outdated
```python
from fastapi.responses import JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import HTTPBearer, APIKeyCookie
```
Copilot
AI
Jan 21, 2026
Import of 'HTTPBearer' is not used.
Import of 'APIKeyCookie' is not used.
Suggested change:
```diff
-from fastapi.security import HTTPBearer, APIKeyCookie
```
```diff
@@ -0,0 +1,154 @@
+import os
```
Copilot
AI
Jan 21, 2026
Import of 'os' is not used.
app/utils/workspace.py
Outdated
```python
except:
    error_detail = e.response.text[:200] if e.response.text else str(e)
```
Copilot
AI
Jan 21, 2026
Except block directly handles BaseException.
Suggested change:
```diff
-except:
-    error_detail = e.response.text[:200] if e.response.text else str(e)
+except (ValueError, TypeError, AttributeError):
+    error_detail = e.response.text[:200] if e.response and e.response.text else str(e)
```
app/utils/sqlite.py
Outdated
```python
        if str_stats[0] is not None:
            col_stats["min"] = str_stats[0]
            col_stats["max"] = str_stats[1]
    except Exception:
```
Copilot
AI
Jan 21, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
```diff
 except Exception:
+    # Ignore errors in non-numeric aggregate stats (best-effort only)
```
```dockerfile
RUN mkdir -p lib && \
    cd lib && \
    git clone https://github.com/cshenry/KBUtilLib.git && \
```
Copilot
AI
Jan 21, 2026
The Docker build clones KBUtilLib directly from GitHub using git clone https://github.com/cshenry/KBUtilLib.git without pinning to a specific commit, tag, or verifying integrity. This means every build will execute whatever code is currently on the default branch of an external repository, so a compromise of that repo or the GitHub supply chain could inject arbitrary code into your image and run it with access to any secrets or data inside the container. To mitigate this supply-chain risk, fetch KBUtilLib via a pinned, immutable reference (e.g., specific commit or release) and/or vendor it into the repo so builds cannot be silently altered upstream.
Pull request overview
Copilot reviewed 44 out of 48 changed files in this pull request and generated 5 comments.
app/utils/sqlite.py
Outdated
```python
# Use validated table name from sqlite_master to prevent SQL injection
validated_table_name = result[0]
cursor.execute(f"PRAGMA table_info(\"{validated_table_name}\")")
```
Copilot
AI
Jan 26, 2026
The SQL injection prevention here has a critical flaw. The code retrieves a validated table name from sqlite_master, but then constructs the PRAGMA query using f-string formatting with double quotes. While the validated_table_name comes from the database, this pattern is inconsistent with the parameterized query approach used just above it.
For consistency and to prevent any potential issues, the PRAGMA query should use the same validated_table_name without f-string formatting, or use SQLite's quote() function. The safer approach is to avoid f-strings entirely for SQL construction even with validated names.
Suggested change:
```diff
-# Use validated table name from sqlite_master to prevent SQL injection
-validated_table_name = result[0]
-cursor.execute(f"PRAGMA table_info(\"{validated_table_name}\")")
+# Use validated table name from sqlite_master and SQLite's quote() to prevent SQL injection
+validated_table_name = result[0]
+cursor.execute("SELECT quote(?)", (validated_table_name,))
+quoted_table_name = cursor.fetchone()[0]
+cursor.execute("PRAGMA table_info(" + quoted_table_name + ")")
```
app/services/data/connection_pool.py
Outdated
```python
except queue.Empty:
    # Pool is empty, if we haven't reached max capacity (logic hard to track with Queue size only),
    # ideally we pre-fill or dynamic fill.
    # With standard Queue, we put connections IN.
    # Strategy: Initialize Queue with N "tokens" or create on demand?
    # Alternative: On Queue.get, if empty, we wait.
    # BUT, initially queue is empty.
    # So we need a mechanism to create new connections if < MAX and queue empty.
    # Let's simplify:
    # The queue holds *idle* connections.
    # We need a semaphore for *total* connections?
    #
    # Let's use a standard sizing approach:
    # When getting, if queue empty and we can create more, create one.
    # This requires tracking count. Sizing is tricky with just a Queue.
    #
    # SIMPLIFIED APPROACH for SQLite:
    # Just use the Queue as a resource pool. Populate it on demand?
    # No, standard pattern:
    # Queue initialized empty.
    # If queue.empty():
    #     if current connections < max: create new
    #     else: wait on queue
    #
    # This requires tracking active count.
    # Given strict timeline, let's just FILL the queue on first access up to MAX?
    # Or lazily create.

    # Let's do lazy creation with a separate semaphore-like logic if needed,
    # Or just rely on Python's robust GC and just use a pool of created connections.

    # Refined Strategy:
    # Queue contains available connections.
    # If we get Empty, we check if we can create better?
    # Actually, simpler: Pre-populate or lazily populate?
    # Lazy: If invalid/closed, we discard.
    #
    # For this fix, let's use a "LifoQueue" or standard Queue.
    # But to manage the *limit*, we need to know how many are out there.
    raise TimeoutError(f"Timeout waiting for database connection: {db_path}")
```
Copilot
AI
Jan 26, 2026
The connection pool implementation has a critical design flaw. When queue.Empty is raised (line 100), the code raises a TimeoutError without creating a new connection. However, the queue starts empty and connections are only created during pool initialization (lines 177-180).
This creates a race condition: if multiple threads request connections before the pool is initialized, or if all MAX_CONNECTIONS are in use, subsequent requests will timeout immediately instead of either waiting or creating new connections up to the limit.
The comment on lines 103-139 acknowledges this issue but doesn't implement a solution. A proper implementation should track active connection count and create new connections on-demand up to MAX_CONNECTIONS, rather than pre-filling the queue.
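The fix the comment describes — track how many connections exist and create new ones on demand up to the cap — can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual implementation; the names MAX_CONNECTIONS, CONNECTION_TIMEOUT, and SQLiteConnectionPool are assumptions.

```python
import queue
import sqlite3
import threading

MAX_CONNECTIONS = 5        # assumed cap; the PR's actual constant may differ
CONNECTION_TIMEOUT = 30.0  # assumed seconds to wait for an idle connection

class SQLiteConnectionPool:
    """Minimal sketch: idle connections sit in a queue; a counter tracks the total created."""

    def __init__(self, db_path: str, max_connections: int = MAX_CONNECTIONS) -> None:
        self._db_path = db_path
        self._max = max_connections
        self._idle: "queue.Queue[sqlite3.Connection]" = queue.Queue()
        self._created = 0
        self._lock = threading.Lock()

    def acquire(self, timeout: float = CONNECTION_TIMEOUT) -> sqlite3.Connection:
        # Fast path: reuse an idle connection if one is waiting in the queue.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Create a new connection on demand, but only up to the configured cap.
        with self._lock:
            if self._created < self._max:
                self._created += 1
                return sqlite3.connect(self._db_path, check_same_thread=False)
        # At the cap: block until another thread releases a connection, then time out.
        try:
            return self._idle.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(f"Timeout waiting for database connection: {self._db_path}")

    def release(self, conn: sqlite3.Connection) -> None:
        # Return the connection to the idle queue so other threads can reuse it.
        self._idle.put(conn)
```

A real implementation would also validate connections before reuse and close them on shutdown; this sketch only shows the count-tracking, create-on-demand behavior the comment asks for.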
app/main.py
Outdated
```python
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from fastapi.middleware.gzip import GZipMiddleware
from fastapi.security import HTTPBearer, APIKeyCookie
```
Copilot
AI
Jan 26, 2026
Import of 'HTTPBearer' is not used.
Import of 'APIKeyCookie' is not used.
Suggested change:
```diff
-from fastapi.security import HTTPBearer, APIKeyCookie
```
app/utils/workspace.py
Outdated
```python
        error_body = e.response.json()
        if "error" in error_body:
            error_detail = error_body["error"].get("message", str(error_body))
    except:
```
Copilot
AI
Jan 26, 2026
Except block directly handles BaseException.
Suggested change:
```diff
-except:
+except Exception:
```
```dockerfile
# Clone KBUtilLib (required external dependency)
# This creates /app/lib/KBUtilLib/ which is referenced by app/utils/workspace.py
RUN mkdir -p lib && \
    cd lib && \
    git clone https://github.com/cshenry/KBUtilLib.git && \
    cd ..

# Add KBUtilLib to PYTHONPATH so it can be imported
ENV PYTHONPATH=/app/lib/KBUtilLib/src:${PYTHONPATH}
```
Copilot
AI
Jan 26, 2026
The Dockerfile clones KBUtilLib directly from https://github.com/cshenry/KBUtilLib.git at build time without pinning to a specific immutable revision or verifying integrity. This means any compromise of that GitHub repo (or unintended changes on its default branch) will inject arbitrary code into the running service with access to KBase auth tokens and other secrets, creating a serious supply-chain risk. To mitigate this, pin the dependency to a specific commit or release and, where possible, vendor or verify the code (e.g., by using a locked dependency artifact or checksum) instead of cloning an unpinned mutable branch.
- Update CORS middleware with allow_credentials, expose_headers, and max_age
- Add MAX_UPLOAD_SIZE_MB configuration setting (default 500 MB)
- Implement async streaming upload in 1 MB chunks
- Add file size validation with clear 413 error messages
- Improve error handling and cleanup for upload failures
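A minimal sketch of how the chunked upload with a size cap might look, assuming FastAPI. The setting name MAX_UPLOAD_SIZE_MB, the 1 MB chunk size, and the 413 response come from the summary above; the /upload route and everything else here are illustrative assumptions, not the PR's actual code.

```python
import os
import tempfile

from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

MAX_UPLOAD_SIZE_MB = 500              # default from the summary above
CHUNK_SIZE = 1024 * 1024              # stream in 1 MB chunks
MAX_BYTES = MAX_UPLOAD_SIZE_MB * 1024 * 1024

@app.post("/upload")  # hypothetical route for illustration
async def upload(file: UploadFile):
    total = 0
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        while chunk := await file.read(CHUNK_SIZE):
            total += len(chunk)
            if total > MAX_BYTES:
                # Reject oversized uploads with a clear 413 error.
                raise HTTPException(
                    status_code=413,
                    detail=f"File exceeds {MAX_UPLOAD_SIZE_MB} MB upload limit",
                )
            tmp.write(chunk)
        tmp.close()
        return {"filename": file.filename, "bytes": total}
    except Exception:
        # Clean up the partial file on any failure.
        tmp.close()
        os.unlink(tmp.name)
        raise
```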
This pull request introduces major enhancements and new features to TableScanner, including a production-ready configuration control plane, improved API documentation, advanced query capabilities, and better error handling. The changes also add new configuration and performance settings, expand the request/response models, and provide detailed documentation and setup instructions.
Key highlights:
- New configuration and performance settings in .env and app/config.py.

1. Configuration Control Plane & Database Schema
- Adds a new database schema (app/db/schema.sql) supporting config records with full lifecycle states, audit logging, user overrides, version history, and test results, enabling robust configuration management and traceability.
- Reworks the database package initialization (app/db/__init__.py) for clarity and maintainability.

2. API & Documentation Enhancements
- Updates README.md with a comprehensive feature list, setup instructions, API usage examples (including advanced queries and aggregations), project structure, configuration, performance, and testing guidelines.
- Updates the API description in app/main.py to reflect advanced query capabilities, type-aware filtering, connection pooling, and more.

3. Advanced Query Capabilities & Models
- Expands app/models.py to support advanced filters, aggregations, and group-by queries, and allows columns to be specified as a list or comma-separated string. Also adds object_type to the table list response for richer metadata.
- Centralizes configuration constants in app/config_constants.py for easier tuning and consistency.

4. Error Handling & Exceptions
- Adds custom exceptions in app/exceptions.py for table/column not found, invalid filters, and database access errors, improving error clarity.
- Adds exception handlers in app/main.py to return appropriate HTTP status codes and messages for common errors.

5. Configuration & Performance Settings
- Adds new settings in .env.example and app/config.py for AI provider selection, cache directories, timeouts, and CORS origins, enabling flexible deployment and tuning.

These changes significantly improve TableScanner's reliability, flexibility, and usability for production environments.
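As an illustration of the "columns as a list or comma-separated string" behavior described in section 3 above, here is a minimal sketch assuming Pydantic v2; the class and field names are hypothetical and not taken from app/models.py.

```python
from typing import List, Optional, Union

from pydantic import BaseModel, field_validator

class QueryRequest(BaseModel):
    """Hypothetical request model illustrating flexible column input."""

    columns: Optional[Union[List[str], str]] = None  # accepts a list or a comma-separated string
    limit: int = 100

    @field_validator("columns")
    @classmethod
    def normalize_columns(cls, value):
        # Accept "a,b,c" or ["a", "b", "c"] and normalize to a clean list of column names.
        if isinstance(value, str):
            return [c.strip() for c in value.split(",") if c.strip()]
        return value

# Both forms normalize to the same column list:
print(QueryRequest(columns="id, name ,score").columns)  # ['id', 'name', 'score']
print(QueryRequest(columns=["id", "name"]).columns)     # ['id', 'name']
```

Normalizing at the model boundary keeps the query service itself free of input-format branching, which matches the general direction of the changes described in this PR.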