
Conversation

@VibhavSetlur
Collaborator

This pull request introduces major enhancements and new features to TableScanner, including a production-ready configuration control plane, improved API documentation, advanced query capabilities, and better error handling. The changes also add new configuration and performance settings, expand the request/response models, and provide detailed documentation and setup instructions.

Key highlights:

  • Adds a robust configuration control plane with lifecycle management, audit logging, user overrides, and version history.
  • Expands API and documentation for advanced querying (aggregations, filters, group-by).
  • Introduces custom exceptions and error handling for more informative API responses.
  • Improves configuration flexibility and performance tuning via .env and app/config.py.
  • Refactors and documents project structure for maintainability.

1. Configuration Control Plane & Database Schema

  • Introduces a new SQLite schema (app/db/schema.sql) supporting config records with full lifecycle states, audit logging, user overrides, version history, and test results, enabling robust configuration management and traceability.
  • Documents the database module (app/db/__init__.py) for clarity and maintainability.
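A schema along these lines could be sketched as follows. The table and column names below are illustrative guesses for discussion, not the actual contents of app/db/schema.sql:

```python
import sqlite3

# Hypothetical sketch of a config control-plane schema with lifecycle
# states, version history, and audit logging. Names are illustrative;
# the authoritative schema is app/db/schema.sql.
SCHEMA = """
CREATE TABLE config_records (
    id INTEGER PRIMARY KEY,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 1,
    state TEXT NOT NULL DEFAULT 'draft'
        CHECK (state IN ('draft', 'active', 'deprecated', 'archived')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE config_audit_log (
    id INTEGER PRIMARY KEY,
    config_id INTEGER NOT NULL REFERENCES config_records(id),
    action TEXT NOT NULL,   -- e.g. 'create', 'update', 'promote'
    actor TEXT NOT NULL,    -- user who made the change
    occurred_at TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO config_records (key, value) VALUES (?, ?)",
    ("cache.ttl_seconds", "300"),
)
conn.execute(
    "INSERT INTO config_audit_log (config_id, action, actor) VALUES (?, ?, ?)",
    (1, "create", "some_user"),
)
state = conn.execute(
    "SELECT state FROM config_records WHERE key = ?",
    ("cache.ttl_seconds",),
).fetchone()[0]
print(state)  # → draft
```

The CHECK constraint on `state` is one lightweight way to enforce the lifecycle states at the database level; a separate transitions table would be needed if state changes themselves must be validated.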

2. API & Documentation Enhancements

  • Overhauls README.md with a comprehensive feature list, setup instructions, API usage examples (including advanced queries and aggregations), project structure, configuration, performance, and testing guidelines.
  • Updates the FastAPI app description and features in app/main.py to reflect advanced query capabilities, type-aware filtering, connection pooling, and more.

3. Advanced Query Capabilities & Models

  • Expands request models in app/models.py to support advanced filters, aggregations, and group-by queries, and allows columns to be specified as a list or comma-separated string. Also adds object_type to the table list response for richer metadata. [1] [2] [3] [4]
  • Centralizes configuration constants (limits, cache, timeouts, API version) in app/config_constants.py for easier tuning and consistency.
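The columns-as-list-or-comma-separated-string behavior can be illustrated without Pydantic; the normalization logic is roughly as follows (field names here are a hypothetical mirror of app/models.py, not the real definitions):

```python
from dataclasses import dataclass, field
from typing import List, Union

def normalize_columns(columns: Union[str, List[str], None]) -> List[str]:
    """Accept a list of column names or a comma-separated string."""
    if columns is None:
        return []
    if isinstance(columns, str):
        return [c.strip() for c in columns.split(",") if c.strip()]
    return list(columns)

@dataclass
class QueryRequest:
    # Hypothetical sketch of the expanded request model.
    table: str
    columns: List[str] = field(default_factory=list)
    filters: List[dict] = field(default_factory=list)       # e.g. {"column": "age", "op": ">=", "value": 18}
    aggregations: List[dict] = field(default_factory=list)  # e.g. {"column": "price", "func": "avg"}
    group_by: List[str] = field(default_factory=list)

req = QueryRequest(table="users", columns=normalize_columns("id, name ,email"))
print(req.columns)  # → ['id', 'name', 'email']
```

In the real models the same normalization would presumably live in a Pydantic validator so both input shapes deserialize to a clean list.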

4. Error Handling & Exceptions

  • Implements custom exceptions in app/exceptions.py for table/column not found, invalid filters, and database access errors, improving error clarity.
  • Registers exception handlers in app/main.py to return appropriate HTTP status codes and messages for common errors.
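The exception-to-status-code mapping can be sketched without FastAPI itself. Class names below mirror the description above but are assumptions; the actual definitions in app/exceptions.py may differ:

```python
# Sketch of a custom exception hierarchy and its HTTP status mapping.
class TableScannerError(Exception):
    """Base class for service errors."""

class TableNotFoundError(TableScannerError): ...
class ColumnNotFoundError(TableScannerError): ...
class InvalidFilterError(TableScannerError): ...
class DatabaseAccessError(TableScannerError): ...

# In app/main.py, handlers registered via @app.exception_handler(...)
# would translate these into responses; the mapping itself is just:
STATUS_CODES = {
    TableNotFoundError: 404,
    ColumnNotFoundError: 404,
    InvalidFilterError: 400,
    DatabaseAccessError: 500,
}

def status_for(exc: Exception) -> int:
    # Unknown errors fall through to a generic 500.
    return STATUS_CODES.get(type(exc), 500)

print(status_for(TableNotFoundError("users")))  # → 404
```

Keeping the mapping in one dict means each FastAPI handler can stay a thin wrapper that formats the JSON body.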

5. Configuration & Performance Settings

  • Adds new environment variables and settings in .env.example and app/config.py for AI provider selection, cache directories, timeouts, and CORS origins, enabling flexible deployment and tuning. [1] [2]
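Parsing these settings might look roughly like the sketch below; the variable names are illustrative, and the authoritative list is in .env.example and app/config.py:

```python
from typing import Dict

def load_settings(env: Dict[str, str]) -> dict:
    """Parse environment variables into typed settings.

    Names are hypothetical stand-ins for the real app/config.py keys.
    """
    return {
        "ai_provider": env.get("AI_PROVIDER", "openai"),
        "cache_dir": env.get("CACHE_DIR", "/tmp/tablescanner-cache"),
        "query_timeout": int(env.get("QUERY_TIMEOUT_SECONDS", "30")),
        # CORS origins arrive as one comma-separated string in .env
        "cors_origins": [
            o.strip()
            for o in env.get("CORS_ORIGINS", "").split(",")
            if o.strip()
        ],
    }

settings = load_settings({
    "CORS_ORIGINS": "http://localhost:3000, https://example.org",
    "QUERY_TIMEOUT_SECONDS": "60",
})
print(settings["cors_origins"])  # → ['http://localhost:3000', 'https://example.org']
```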

These changes significantly improve TableScanner's reliability, flexibility, and usability for production environments.


Copilot AI left a comment


Pull request overview

This pull request introduces major enhancements to TableScanner, transitioning it from a basic SQLite query service to a production-ready data access platform with advanced query capabilities, robust configuration management, and improved architecture.

Changes:

  • Adds comprehensive query service with type-aware filtering, aggregations, FTS5 search, and connection pooling
  • Implements new services for schema analysis, statistics, validation, and type inference
  • Enhances API models to support advanced filtering and aggregation operations
  • Improves security with input validation, path sanitization, and custom exception handling
  • Removes deprecated code and dependencies (pandas, PyYAML, itables, ipywidgets)

Reviewed changes

Copilot reviewed 42 out of 44 changed files in this pull request and generated 25 comments.

File Description
tests/unit/test_query_service.py Unit tests for QueryService covering filtering, sorting, aggregation, and SQL injection prevention
tests/integration/test_security_fixes.py Integration tests for security fixes: variable limit enforcement, numeric validation, FTS5 safety
tests/integration/test_security.py Security tests for path traversal prevention and CORS configuration
tests/integration/test_routes_advanced.py Integration tests for advanced filtering, aggregations, sorting, pagination, and legacy compatibility
tests/integration/test_routes.py Basic integration tests for health check, API docs, and deprecated endpoint removal
tests/integration/test_concurrency.py Concurrency tests for connection pool under load
app/services/data/query_service.py Core query service with type-aware filtering, aggregations, FTS5 search, and caching
app/services/data/connection_pool.py Thread-safe SQLite connection pooling with lifecycle management
app/services/data/validation.py JSON schema validation for DataTables Viewer configurations
app/services/data/type_inference.py Rule-based type inference engine for column data types
app/services/data/statistics_service.py Column statistics computation service
app/services/data/schema_service.py Schema information retrieval service
app/services/data/schema_analyzer.py Database schema introspection and profiling
app/services/data/fingerprint.py Database fingerprinting for cache invalidation
app/utils/workspace.py Enhanced workspace client with type retrieval and improved fallback logic
app/utils/sqlite.py Simplified SQLite utilities (removed deprecated functions)
app/utils/cache.py Improved ID sanitization with strict allow-list approach
app/utils/async_utils.py Async utilities for running synchronous code in threads
app/utils/request_utils.py Request processing utilities for routes
app/services/db_helper.py Database helper service consolidating retrieval and validation logic
app/models.py Enhanced models supporting advanced queries, filters, aggregations, and metadata
app/main.py Updated application with custom exception handlers and improved CORS configuration
app/exceptions.py Custom exceptions for better error handling
app/config_constants.py Centralized configuration constants
static/viewer.html Complete UI redesign with scientific light theme and enhanced features
pyproject.toml Removed unnecessary dependencies (pandas, PyYAML, itables, ipywidgets)
docs/* Comprehensive documentation updates


Comment on lines 250 to 251
except:
pass

Copilot AI Jan 20, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
pass
except Exception as e:
logger.warning("Failed to close connection from pool: %s", e)

profile.min_value = result[0]
profile.max_value = result[1]
profile.avg_length = result[2] or 0.0
except sqlite3.Error:

Copilot AI Jan 20, 2026


'except' clause does nothing but pass and there is no explanatory comment.


Copilot AI left a comment


Pull request overview

Copilot reviewed 42 out of 44 changed files in this pull request and generated no new comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.



# We'll try to use a handle that looks like traversal
bad_handle = "local:../../../../etc/passwd"

response = client.get(f"/object/{bad_handle}/tables", headers=headers)

Copilot AI Jan 21, 2026


This assignment to 'response' is unnecessary as it is redefined before this value is used.

raise ValueError(f"Workspace API error: [{error_code}] {error_msg}")
except requests.exceptions.HTTPError as e:
# Capture response body for better error messages
error_detail = f"HTTP {e.response.status_code}"

Copilot AI Jan 21, 2026


This assignment to 'error_detail' is unnecessary as it is redefined before this value is used.

Suggested change
error_detail = f"HTTP {e.response.status_code}"

app/main.py Outdated
from fastapi.responses import JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import HTTPBearer, APIKeyCookie

Copilot AI Jan 21, 2026


Import of 'HTTPBearer' is not used.
Import of 'APIKeyCookie' is not used.

Suggested change
from fastapi.security import HTTPBearer, APIKeyCookie

@@ -0,0 +1,154 @@
import os

Copilot AI Jan 21, 2026


Import of 'os' is not used.

Comment on lines 304 to 305
except:
error_detail = e.response.text[:200] if e.response.text else str(e)

Copilot AI Jan 21, 2026


Except block directly handles BaseException.

Suggested change
except:
error_detail = e.response.text[:200] if e.response.text else str(e)
except (ValueError, TypeError, AttributeError):
error_detail = e.response.text[:200] if e.response and e.response.text else str(e)

if str_stats[0] is not None:
col_stats["min"] = str_stats[0]
col_stats["max"] = str_stats[1]
except Exception:

Copilot AI Jan 21, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
except Exception:
# Ignore errors in non-numeric aggregate stats (best-effort only)

Comment on lines +7 to +9
RUN mkdir -p lib && \
cd lib && \
git clone https://github.com/cshenry/KBUtilLib.git && \

Copilot AI Jan 21, 2026


The Docker build clones KBUtilLib directly from GitHub using git clone https://github.com/cshenry/KBUtilLib.git without pinning to a specific commit, tag, or verifying integrity. This means every build will execute whatever code is currently on the default branch of an external repository, so a compromise of that repo or the GitHub supply chain could inject arbitrary code into your image and run it with access to any secrets or data inside the container. To mitigate this supply-chain risk, fetch KBUtilLib via a pinned, immutable reference (e.g., specific commit or release) and/or vendor it into the repo so builds cannot be silently altered upstream.

@VibhavSetlur VibhavSetlur requested a review from Copilot January 26, 2026 21:16

Copilot AI left a comment


Pull request overview

Copilot reviewed 44 out of 48 changed files in this pull request and generated 5 comments.



Comment on lines 61 to 63
# Use validated table name from sqlite_master to prevent SQL injection
validated_table_name = result[0]
cursor.execute(f"PRAGMA table_info(\"{validated_table_name}\")")

Copilot AI Jan 26, 2026


The SQL injection prevention here has a critical flaw. The code retrieves a validated table name from sqlite_master, but then constructs the PRAGMA query using f-string formatting with double quotes. While the validated_table_name comes from the database, this pattern is inconsistent with the parameterized query approach used just above it.

For consistency and to prevent any potential issues, the PRAGMA query should use the same validated_table_name without f-string formatting, or use SQLite's quote() function. The safer approach is to avoid f-strings entirely for SQL construction even with validated names.

Suggested change
# Use validated table name from sqlite_master to prevent SQL injection
validated_table_name = result[0]
cursor.execute(f"PRAGMA table_info(\"{validated_table_name}\")")
# Use validated table name from sqlite_master and SQLite's quote() to prevent SQL injection
validated_table_name = result[0]
cursor.execute("SELECT quote(?)", (validated_table_name,))
quoted_table_name = cursor.fetchone()[0]
cursor.execute("PRAGMA table_info(" + quoted_table_name + ")")

Comment on lines 100 to 139
except queue.Empty:
# Pool is empty, if we haven't reached max capacity (logic hard to track with Queue size only),
# ideally we pre-fill or dynamic fill.
# With standard Queue, we put connections IN.
# Strategy: Initialize Queue with N "tokens" or create on demand?
# Alternative: On Queue.get, if empty, we wait.
# BUT, initially queue is empty.
# So we need a mechanism to create new connections if < MAX and queue empty.
# Let's simplify:
# The queue holds *idle* connections.
# We need a semaphore for *total* connections?
#
# Let's use a standard sizing approach:
# When getting, if queue empty and we can create more, create one.
# This requires tracking count. Sizing is tricky with just a Queue.
#
# SIMPLIFIED APPROACH for SQLite:
# Just use the Queue as a resource pool. Populate it on demand?
# No, standard pattern:
# Queue initialized empty.
# If queue.empty():
# if current connections < max: create new
# else: wait on queue
#
# This requires tracking active count.
# Given strict timeline, let's just FILL the queue on first access up to MAX?
# Or lazily create.

# Let's do lazy creation with a separate semaphore-like logic if needed,
# Or just rely on Python's robust GC and just use a pool of created connections.

# Refined Strategy:
# Queue contains available connections.
# If we get Empty, we check if we can create better?
# Actually, simpler: Pre-populate or lazily populate?
# Lazy: If invalid/closed, we discard.
#
# For this fix, let's use a "LifoQueue" or standard Queue.
# But to manage the *limit*, we need to know how many are out there.
raise TimeoutError(f"Timeout waiting for database connection: {db_path}")

Copilot AI Jan 26, 2026


The connection pool implementation has a critical design flaw. When queue.Empty is raised (line 100), the code raises a TimeoutError without creating a new connection. However, the queue starts empty and connections are only created during pool initialization (lines 177-180).

This creates a race condition: if multiple threads request connections before the pool is initialized, or if all MAX_CONNECTIONS are in use, subsequent requests will timeout immediately instead of either waiting or creating new connections up to the limit.

The comment on lines 103-139 acknowledges this issue but doesn't implement a solution. A proper implementation should track active connection count and create new connections on-demand up to MAX_CONNECTIONS, rather than pre-filling the queue.
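One way to implement the on-demand creation this comment asks for (a sketch under the stated assumptions, not the code under review) is to track the total connection count under a lock and create new connections only while below the cap:

```python
import queue
import sqlite3
import threading

class ConnectionPool:
    """Sketch of a bounded SQLite pool with lazy connection creation."""

    def __init__(self, db_path: str, max_connections: int = 5, timeout: float = 5.0):
        self._db_path = db_path
        self._max = max_connections
        self._timeout = timeout
        self._idle: queue.Queue = queue.Queue()  # holds only idle connections
        self._lock = threading.Lock()
        self._total = 0  # connections created so far (idle + in use)

    def acquire(self) -> sqlite3.Connection:
        # Fast path: reuse an idle connection if one is available.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Queue is empty: create a new connection if under the cap.
        with self._lock:
            if self._total < self._max:
                self._total += 1
                return sqlite3.connect(self._db_path, check_same_thread=False)
        # At capacity: block until another thread releases a connection.
        try:
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError(f"Timeout waiting for database connection: {self._db_path}")

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.put(conn)

pool = ConnectionPool(":memory:", max_connections=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()  # reuses `a` rather than exceeding the cap
```

Because the count is incremented under the lock before `connect()` returns, concurrent callers cannot overshoot `max_connections`, and an empty queue no longer means an immediate timeout.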

app/main.py Outdated
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from fastapi.middleware.gzip import GZipMiddleware
from fastapi.security import HTTPBearer, APIKeyCookie

Copilot AI Jan 26, 2026


Import of 'HTTPBearer' is not used.
Import of 'APIKeyCookie' is not used.

Suggested change
from fastapi.security import HTTPBearer, APIKeyCookie

error_body = e.response.json()
if "error" in error_body:
error_detail = error_body["error"].get("message", str(error_body))
except:

Copilot AI Jan 26, 2026


Except block directly handles BaseException.

Suggested change
except:
except Exception:

Comment on lines +5 to +13
# Clone KBUtilLib (required external dependency)
# This creates /app/lib/KBUtilLib/ which is referenced by app/utils/workspace.py
RUN mkdir -p lib && \
cd lib && \
git clone https://github.com/cshenry/KBUtilLib.git && \
cd ..

# Add KBUtilLib to PYTHONPATH so it can be imported
ENV PYTHONPATH=/app/lib/KBUtilLib/src:${PYTHONPATH}

Copilot AI Jan 26, 2026


The Dockerfile clones KBUtilLib directly from https://github.com/cshenry/KBUtilLib.git at build time without pinning to a specific immutable revision or verifying integrity. This means any compromise of that GitHub repo (or unintended changes on its default branch) will inject arbitrary code into the running service with access to KBase auth tokens and other secrets, creating a serious supply-chain risk. To mitigate this, pin the dependency to a specific commit or release and, where possible, vendor or verify the code (e.g., by using a locked dependency artifact or checksum) instead of cloning an unpinned mutable branch.
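A pinned variant of the clone might look like the fragment below. The `KBUTILLIB_REF` value shown is a placeholder, not a real KBUtilLib revision; substitute an actual commit SHA or release tag:

```dockerfile
# Pin KBUtilLib to an immutable revision so builds are reproducible and
# upstream changes cannot be silently injected. KBUTILLIB_REF is a
# placeholder; replace it with a real commit SHA or tag.
ARG KBUTILLIB_REF=0000000000000000000000000000000000000000
RUN mkdir -p lib && \
    cd lib && \
    git clone https://github.com/cshenry/KBUtilLib.git && \
    cd KBUtilLib && \
    git checkout --detach "${KBUTILLIB_REF}"

# Add KBUtilLib to PYTHONPATH so it can be imported
ENV PYTHONPATH=/app/lib/KBUtilLib/src:${PYTHONPATH}
```

Using a build `ARG` keeps the pinned revision visible and overridable at build time while still defaulting to a fixed reference.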

- Update CORS middleware with allow_credentials, expose_headers, and max_age
- Add MAX_UPLOAD_SIZE_MB configuration setting (default 500MB)
- Implement async streaming upload in 1MB chunks
- Add file size validation with clear 413 error messages
- Improve error handling and cleanup for upload failures