@cchwala cchwala commented Jan 22, 2026

Closes #13

Summary: Add Parser Service with Composite Key Schema and Automatic Recovery

Core Features

Parser Service (~1200 LOC): Auto-processes CML CSV files uploaded via SFTP, writes to TimescaleDB with robust error handling

  • Watches /app/data/incoming/ for cml_data_*.csv and cml_metadata_*.csv files
  • Archives successful files to archived/YYYY-MM-DD/, quarantines failures with .error.txt notes
  • Ingests raw data even when metadata is missing (logs warnings for orphaned records)
  • Automatic database reconnection: Recovers from connection loss with retry logic, preventing cascade failures
  • Plugin-style parser registry for extensibility
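The archive and quarantine behaviour listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function names `archive_file` and `quarantine_file` and the exact naming of the `.error.txt` note are assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_file(src: Path, archived_root: Path, today: Optional[date] = None) -> Path:
    """Move a processed file into <archived_root>/YYYY-MM-DD/ and return its new path."""
    day = (today or date.today()).isoformat()  # ISO date, e.g. "2026-01-22"
    target_dir = archived_root / day
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(target_dir / src.name)))


def quarantine_file(src: Path, quarantine_root: Path, error: str) -> Path:
    """Move a failed file to quarantine and write an .error.txt note next to it."""
    quarantine_root.mkdir(parents=True, exist_ok=True)
    target = Path(shutil.move(str(src), str(quarantine_root / src.name)))
    # Hypothetical naming: "<original file name>.error.txt"
    note = target.with_name(target.name + ".error.txt")
    note.write_text(error + "\n", encoding="utf-8")
    return target
```

Passing `today` explicitly keeps the date folder deterministic in tests.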

Database Schema

Composite Primary Key: cml_metadata(cml_id, sublink_id) to support sublink-specific metadata

  • Added length column to cml_metadata for link length tracking
  • Updated all queries (visualization, webserver, health checks) to handle composite keys
  • cml_data uses TimescaleDB hypertable on time column
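To illustrate what the composite key buys, here is a small sketch that uses an in-memory SQLite database as a stand-in for TimescaleDB/PostgreSQL. The column types and sample values are assumptions; only the table name, the `length` column, and the `(cml_id, sublink_id)` key come from this PR.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cml_metadata (
        cml_id     TEXT NOT NULL,
        sublink_id TEXT NOT NULL,
        length     REAL,
        PRIMARY KEY (cml_id, sublink_id)
    )
    """
)

# Two sublinks of the same CML can coexist under the composite key.
rows = [("10001", "sublink_1", 4500.0), ("10001", "sublink_2", 4500.0)]
conn.executemany("INSERT INTO cml_metadata VALUES (?, ?, ?)", rows)

# Repeating an existing (cml_id, sublink_id) pair is rejected.
try:
    conn.execute("INSERT INTO cml_metadata VALUES ('10001', 'sublink_1', 1.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM cml_metadata").fetchone()[0]
```

With a primary key on `cml_id` alone, the second sublink row would have been rejected instead.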

Architecture

Modular design with separation of concerns:

  • db_writer.py (262 LOC): Batch inserts with connection retry (exponential backoff), automatic reconnection on OperationalError/InterfaceError
  • file_manager.py (97 LOC): Safe file operations with cross-device fallback (move → copy)
  • file_watcher.py (86 LOC): Filesystem monitoring via watchdog
  • service_logic.py (68 LOC): Core CML processing logic
  • parsers/demo_csv_data/: CSV parsers for raw data and metadata
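The connection retry with exponential backoff mentioned for db_writer.py can be sketched as below. The function name `connect_with_backoff`, its parameters, and the default delays are hypothetical; the real module may structure this differently, and `ConnectionError` stands in for the database driver's own exceptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` lets unit tests run without real delays.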

Testing & CI

  • 27 parser unit tests with 100% coverage of core functionality
  • 7 E2E integration tests covering full MNO → SFTP → Parser → Database pipeline
  • CI workflows: parser unit tests + E2E integration tests
  • Local test script: scripts/run_e2e_test_locally.sh

Key Technical Improvements

  1. Database resilience: _with_connection_retry() wrapper catches connection errors, reconnects, and retries once; _execute_batch_insert() consolidates duplicate write logic
  2. Configuration: Environment-driven (PARSER_INCOMING_DIR, PARSER_ARCHIVED_DIR, DATABASE_URL, etc.)
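  3. Cross-device compatibility: File manager moves files with shutil (an atomic rename when source and destination share a filesystem) and falls back to copy when they are on different devices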
  4. MNO simulator updates: Generates sublink-specific metadata with length field; configurable via config.yml
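The "reconnect and retry once" behaviour from item 1 can be sketched as follows. The method names `_ensure_connected()` and `_with_connection_retry()` are taken from this PR, but the class name `ReconnectingWriter` and the method bodies are assumptions, and `ConnectionError` is a placeholder for the driver's OperationalError/InterfaceError.

```python
class ReconnectingWriter:
    """Illustrative writer that reconnects once when an operation fails."""

    def __init__(self, connect, retry_on=(ConnectionError,)):
        self._connect = connect    # zero-argument factory returning a connection
        self._retry_on = retry_on  # in practice: the driver's connection error types
        self._conn = None

    def _ensure_connected(self):
        # Open a connection lazily, or reopen it after it was dropped.
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def _with_connection_retry(self, operation):
        try:
            return operation(self._ensure_connected())
        except self._retry_on:
            # Discard the broken connection, reconnect, and retry exactly once.
            self._conn = None
            return operation(self._ensure_connected())
```

A second failure propagates to the caller, so a single bad file cannot trigger an endless retry loop.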

Documentation

  • Comprehensive parser README with architecture, configuration, and usage examples
  • Updated integration test documentation with troubleshooting guide
  • Consolidated composite key schema documentation across all components

Files changed: 37 files, +2103/-380 lines

- Extract _attempt_connect() helper for cleaner DB retry logic
- Replace manual DataFrame iteration with to_numpy() in write methods
- Extract _safe_move() helper to eliminate duplicated move-fallback logic
- Remove unused is_valid_file() method from FileManager
- Inline Config._env_bool for better readability
- Simplify exception handling in process_file()
- Remove unnecessary column reordering in CSV parser
- Move time import to module top for correctness
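For readers unfamiliar with the move-with-copy-fallback pattern behind the `_safe_move()` helper named in the list above, one way to write it is shown here. This is a sketch under assumptions, not the helper from this PR: it tries an atomic rename first and copies only when the operating system reports a cross-device move (`EXDEV`).

```python
import errno
import os
import shutil
from pathlib import Path


def safe_move(src: Path, dst: Path) -> Path:
    """Rename src to dst; fall back to copy + delete across filesystems."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.rename(src, dst)  # atomic when both paths share a filesystem
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # unrelated failure (permissions, missing file, ...)
        shutil.copy2(src, dst)  # cross-device: copy data and metadata
        src.unlink()            # then remove the original
    return dst
```

The copy path is not atomic, so a crash between the copy and the unlink can leave the file in both places.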
codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 78.50877% with 147 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.64%. Comparing base (76a28b7) to head (46524be).
⚠️ Report is 1 commit behind head on main.

Files with missing lines             Patch %   Lines
parser/main.py                       0.00%     59 Missing ⚠️
parser/service_logic.py              0.00%     41 Missing ⚠️
parser/db_writer.py                  83.76%    19 Missing ⚠️
parser/file_watcher.py               85.71%    9 Missing ⚠️
parser/validate_dataframe.py         74.07%    7 Missing ⚠️
mno_data_source_simulator/main.py    76.19%    5 Missing ⚠️
parser/file_manager.py               93.75%    4 Missing ⚠️
parser/tests/test_file_watcher.py    93.93%    2 Missing ⚠️
webserver/main.py                    0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #14       +/-   ##
===========================================
+ Coverage   54.23%   64.64%   +10.40%     
===========================================
  Files           5       19       +14     
  Lines         874     1547      +673     
===========================================
+ Hits          474     1000      +526     
- Misses        400      547      +147     
Flag             Coverage Δ
mno_simulator    85.82% <82.14%> (-0.69%) ⬇️
parser           78.35% <78.47%> (?)
webserver        29.63% <0.00%> (ø)

…d fix tests

- Moved demo_csv_data to new parsers/ directory for clarity and modularity
- Updated example_metadata.csv to match new MNO data generator format (with sublink_id, frequency, etc.)
- Fixed test imports and test data to match new structure and requirements
- Remove ABC-based parser classes (BaseParser, CSVRawDataParser,
  CSVMetadataParser) and ParserRegistry in favor of simpler
  function-based parsers
- Consolidate demo_csv_data parsers into parser/parsers/demo_csv_data/
  with parse_raw.py and parse_metadata.py
- Remove obsolete tests for class-based parsers and registry
- Update test_demo_csv_data.py imports to match new structure
- Maintain 100% test coverage for demo_csv_data parsers
- Change cml_metadata primary key from cml_id to (cml_id, sublink_id)
- Add frequency and polarization columns to preserve sublink-specific data
- Update MNO simulator to generate 728 metadata rows (2 per CML) without deduplication
- Update parser db_writer to handle composite key validation and inserts
- Fix parser Dockerfile to use proper Python package structure with relative imports
- Update all tests to validate composite key schema
- Add comprehensive integration test documentation

This change ensures sublink-specific metadata (frequency, polarization) is preserved
instead of being lost during deduplication, as each CML has two sublinks with different
transmission characteristics.
The MNO simulator runs on a 30-second cycle to generate and upload data.
Tests were failing in CI because they immediately checked the database
without waiting for data to be generated, uploaded, and processed by the parser.

Added 90-second wait loops to:
- test_mno_simulator_uploading_files
- test_sftp_to_parser_pipeline

These tests now poll the database every 5 seconds for up to 90 seconds,
giving the full pipeline time to complete each step:
1. The MNO simulator generates data (first cycle at ~30s)
2. Files are uploaded via SFTP
3. The parser processes the files and writes to the database
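The polling approach described above (check every 5 seconds, give up after 90 seconds) can be sketched with a small helper. The name `wait_for` and the injectable `clock` and `sleep` parameters are assumptions for illustration; the actual tests may inline the loop.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    timeout: float = 90.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll condition() every `interval` seconds until it is true or `timeout` passes."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False  # timed out without the condition becoming true
        sleep(interval)
```

A test would pass a condition such as "the row count in cml_data is greater than zero" and assert on the returned boolean, which avoids a fixed sleep that is either too short in CI or needlessly long locally.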
…idate parser docs

- Update parser/README.md with complete CSV format examples and database schema
  - Show all 9 metadata columns including length
  - Document composite primary key (cml_id, sublink_id)
  - Add practical file format examples
- Update tests/integration/README.md
  - Clarify composite key validation in all tests
  - Update SQL examples to check (cml_id, sublink_id) pairs
  - Note expected 728 metadata rows (2 sublinks per CML)
- Update parser/service_logic.py warning message for composite keys
- Remove parser/IMPLEMENTATION_PLAN.md (consolidated into README.md)
- Wait 40s for MNO simulator first generation cycle before running tests
- Check SFTP and parser directories before tests
- Add detailed directory listings and service status to failure logs
- Show archived and quarantine directories to debug parser behavior
Parser was using relative paths (data/incoming) instead of absolute paths
(/app/data/incoming), which could cause it to watch the wrong directory.

Added explicit environment variables:
- PARSER_INCOMING_DIR=/app/data/incoming
- PARSER_ARCHIVED_DIR=/app/data/archived
- PARSER_QUARANTINE_DIR=/app/data/quarantine
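A hedged sketch of how these variables might be read, with absolute defaults and a guard against relative paths. The `ParserPaths` dataclass and `load_paths` function are hypothetical names; only the environment variable names and the `/app/data/...` paths come from this PR.

```python
import os
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ParserPaths:
    incoming: PurePosixPath
    archived: PurePosixPath
    quarantine: PurePosixPath


def load_paths(env=None) -> ParserPaths:
    """Read the parser directories from the environment, defaulting to /app/data/*."""
    env = os.environ if env is None else env
    paths = ParserPaths(
        incoming=PurePosixPath(env.get("PARSER_INCOMING_DIR", "/app/data/incoming")),
        archived=PurePosixPath(env.get("PARSER_ARCHIVED_DIR", "/app/data/archived")),
        quarantine=PurePosixPath(env.get("PARSER_QUARANTINE_DIR", "/app/data/quarantine")),
    )
    # Reject relative paths, which resolve against the working directory
    # and can silently point the watcher at the wrong folder.
    for name, path in vars(paths).items():
        if not path.is_absolute():
            raise ValueError(f"{name} directory must be absolute, got {path}")
    return paths
```

`PurePosixPath` is used because these are container paths; it performs no filesystem access.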
- Increase test_parser_writes_to_database wait from 45s to 90s to match other tests
- Add progress logging every 15s to help diagnose CI issues
- Add run_e2e_test_locally.sh script to reproduce CI environment locally
  - Replicates exact CI workflow steps
  - Shows service logs and diagnostics
  - Uses macOS-compatible wait loops (no timeout command)

The 45s timeout was too short: the MNO simulator uploads metadata immediately,
then data files at 30s intervals. Tests need 90s to reliably catch data.
- Use DISTINCT/DISTINCT ON in webserver and visualization queries to avoid duplicate CMLs
- Update database health checks to use correct username (myuser instead of postgres)
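The reason DISTINCT is needed after the composite key change can be shown with a small example. SQLite is used here only as a stand-in; it supports `SELECT DISTINCT` but not PostgreSQL's `DISTINCT ON`, and the sample IDs are made up.

```python
import sqlite3

# In-memory stand-in for the real database; values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cml_metadata ("
    "cml_id TEXT, sublink_id TEXT, PRIMARY KEY (cml_id, sublink_id))"
)
conn.executemany(
    "INSERT INTO cml_metadata VALUES (?, ?)",
    [
        ("10001", "sublink_1"),
        ("10001", "sublink_2"),
        ("10002", "sublink_1"),
        ("10002", "sublink_2"),
    ],
)

# With two rows per CML, a plain SELECT lists every cml_id twice.
plain = [r[0] for r in conn.execute("SELECT cml_id FROM cml_metadata ORDER BY cml_id")]

# DISTINCT collapses the sublink rows back to one entry per CML.
distinct = [
    r[0] for r in conn.execute("SELECT DISTINCT cml_id FROM cml_metadata ORDER BY cml_id")
]
```

Any query that previously assumed one metadata row per CML needs the same treatment once each CML carries two sublink rows.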
…nsert logic

- Add _ensure_connected() to check connection before operations
- Add _with_connection_retry() wrapper to handle OperationalError/InterfaceError
- Extract common batch insert pattern into _execute_batch_insert() method
- Eliminates ~60 lines of duplicate code between write_metadata() and write_rawdata()
- Prevents cascade failures when database connection is lost
- Update test to provide complete DataFrame for connection testing
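The consolidation described above, where both write methods delegate to one batch-insert helper, might look like this. The method names follow the PR, but the class name `BatchWriter`, the SQL, the `rsl`/`tsl` column names, and the use of `sqlite3` instead of the real PostgreSQL driver are all assumptions made to keep the sketch runnable.

```python
import sqlite3
from typing import Iterable, Sequence


class BatchWriter:
    """Illustrative writer whose public methods share one insert routine."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def _execute_batch_insert(self, sql: str, rows: Iterable[Sequence]) -> int:
        rows = list(rows)
        if not rows:
            return 0  # nothing to write
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(sql, rows)
        return len(rows)

    def write_metadata(self, rows: Iterable[Sequence]) -> int:
        return self._execute_batch_insert(
            "INSERT INTO cml_metadata (cml_id, sublink_id, length) VALUES (?, ?, ?)",
            rows,
        )

    def write_rawdata(self, rows: Iterable[Sequence]) -> int:
        # Column names rsl/tsl are hypothetical placeholders.
        return self._execute_batch_insert(
            "INSERT INTO cml_data (time, cml_id, sublink_id, rsl, tsl) VALUES (?, ?, ?, ?, ?)",
            rows,
        )
```

Keeping the transaction handling in one place means a fix to retry or rollback behaviour applies to both tables at once.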
…ed files

The test was incorrectly checking for CSV files in webserver's archived directory,
but parser and webserver use different volume mounts. The webserver reads from the
database, not directly from archived CSV files. Updated test to verify webserver
can access the database, which is the actual data source for the webserver.
@cchwala cchwala merged commit 2ec65bf into main Jan 23, 2026
7 checks passed
@cchwala cchwala deleted the add_data_parser_service branch January 23, 2026 11:13

Development

Successfully merging this pull request may close these issues.

Parse the data provided by the data source simulation into the DB

2 participants