Add data parser service #14
Merged
+2,103
−380
…volumes (phase 8)
…nnect retry/backoff
- Extract _attempt_connect() helper for cleaner DB retry logic
- Replace manual DataFrame iteration with to_numpy() in write methods
- Extract _safe_move() helper to eliminate duplicated move-fallback logic
- Remove unused is_valid_file() method from FileManager
- Inline Config._env_bool for better readability
- Simplify exception handling in process_file()
- Remove unnecessary column reordering in CSV parser
- Move time import to module top for correctness
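The move-fallback logic mentioned above can be sketched as follows. This is a minimal illustration, not the PR's actual `_safe_move()`: the function name `safe_move`, its signature, and the choice to detect the cross-device case via `errno.EXDEV` are assumptions.

```python
import errno
import os
import shutil
from pathlib import Path


def safe_move(src: Path, dst: Path) -> Path:
    """Move src to dst, falling back to copy + delete across filesystems.

    Hypothetical stand-in for the PR's _safe_move(); the real helper may differ.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Fast path: a rename works when src and dst share a filesystem.
        os.replace(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Cross-device rename is not possible: copy the file, then remove the source.
        shutil.copy2(src, dst)
        src.unlink()
    return dst
```

Python's `shutil.move` already applies a similar copy fallback when a rename fails across filesystems, so a helper like this is mainly useful when the fallback needs its own logging or error handling.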
…configuration details
…uration instructions
…-timeout not installed
…ling of missing cml_id values
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #14 +/- ##
===========================================
+ Coverage 54.23% 64.64% +10.40%
===========================================
Files 5 19 +14
Lines 874 1547 +673
===========================================
+ Hits 474 1000 +526
- Misses 400 547 +147
…d fix tests
- Moved demo_csv_data to new parsers/ directory for clarity and modularity
- Updated example_metadata.csv to match new MNO data generator format (with sublink_id, frequency, etc.)
- Fixed test imports and test data to match new structure and requirements
- Remove ABC-based parser classes (BaseParser, CSVRawDataParser, CSVMetadataParser) and ParserRegistry in favor of simpler function-based parsers
- Consolidate demo_csv_data parsers into parser/parsers/demo_csv_data/ with parse_raw.py and parse_metadata.py
- Remove obsolete tests for class-based parsers and registry
- Update test_demo_csv_data.py imports to match new structure
- Maintain 100% test coverage for demo_csv_data parsers
…into service_logic module
- Change cml_metadata primary key from cml_id to (cml_id, sublink_id)
- Add frequency and polarization columns to preserve sublink-specific data
- Update MNO simulator to generate 728 metadata rows (2 per CML) without deduplication
- Update parser db_writer to handle composite key validation and inserts
- Fix parser Dockerfile to use proper Python package structure with relative imports
- Update all tests to validate composite key schema
- Add comprehensive integration test documentation

This change ensures sublink-specific metadata (frequency, polarization) is preserved instead of being lost during deduplication, as each CML has two sublinks with different transmission characteristics.
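To illustrate what the composite key buys, here is a small sketch using an in-memory SQLite database as a stand-in for TimescaleDB. The column types, the sample IDs, and the frequency values are made up for the example; only the table name, the column names, and the (cml_id, sublink_id) key come from the commit message.

```python
import sqlite3

# In-memory SQLite stands in for TimescaleDB; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cml_metadata (
        cml_id       TEXT NOT NULL,
        sublink_id   TEXT NOT NULL,
        frequency    REAL,
        polarization TEXT,
        PRIMARY KEY (cml_id, sublink_id)
    )
    """
)

# Two sublinks of the same CML (illustrative values) coexist under the composite key.
rows = [
    ("10001", "sublink_1", 38000.0, "H"),
    ("10001", "sublink_2", 38500.0, "V"),
]
conn.executemany("INSERT INTO cml_metadata VALUES (?, ?, ?, ?)", rows)


def insert_is_rejected(row):
    """Return True if inserting row violates the primary key."""
    try:
        conn.execute("INSERT INTO cml_metadata VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


count = conn.execute("SELECT COUNT(*) FROM cml_metadata").fetchone()[0]
# Repeating an existing (cml_id, sublink_id) pair is still a conflict.
duplicate_rejected = insert_is_rejected(("10001", "sublink_1", 39000.0, "H"))
```

With cml_id alone as the key, the second row would have been rejected or deduplicated away, which is the data loss the commit describes.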
The MNO simulator runs on a 30-second cycle to generate and upload data. Tests were failing in CI because they checked the database immediately, without waiting for data to be generated, uploaded, and processed by the parser.

Added 90-second wait loops to:
- test_mno_simulator_uploading_files
- test_sftp_to_parser_pipeline

These tests now poll the database every 5 seconds for up to 90 seconds, giving the full pipeline time to complete:
1. MNO simulator generates data (first cycle at ~30s)
2. Files are uploaded via SFTP
3. Parser processes files and writes to the database
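The wait loop described above could look roughly like this. The helper name `wait_for` and its parameters are hypothetical; the 90-second timeout and 5-second interval are the values quoted in the commit message.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool],
             timeout: float = 90.0,
             interval: float = 5.0) -> bool:
    """Poll condition() until it returns True or timeout seconds elapse.

    Hypothetical helper; the PR's tests may inline this loop instead.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would pass in a callable that queries the database, for example one that returns True once the row count is greater than zero, and assert on the result.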
…idate parser docs
- Update parser/README.md with complete CSV format examples and database schema
  - Show all 9 metadata columns including length
  - Document composite primary key (cml_id, sublink_id)
  - Add practical file format examples
- Update tests/integration/README.md
  - Clarify composite key validation in all tests
  - Update SQL examples to check (cml_id, sublink_id) pairs
  - Note expected 728 metadata rows (2 sublinks per CML)
- Update parser/service_logic.py warning message for composite keys
- Remove parser/IMPLEMENTATION_PLAN.md (consolidated into README.md)
- Wait 40s for MNO simulator first generation cycle before running tests
- Check SFTP and parser directories before tests
- Add detailed directory listings and service status to failure logs
- Show archived and quarantine directories to debug parser behavior
Parser was using relative paths (data/incoming) instead of absolute paths (/app/data/incoming), which could cause it to watch the wrong directory.

Added explicit environment variables:
- PARSER_INCOMING_DIR=/app/data/incoming
- PARSER_ARCHIVED_DIR=/app/data/archived
- PARSER_QUARANTINE_DIR=/app/data/quarantine
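One way to guard against the relative-path problem is to validate the directories at startup. The sketch below is an assumption about how such a check could be written, not the parser's actual Config code; `resolve_dirs` and `DEFAULTS` are hypothetical names, while the variable names and paths come from the commit message.

```python
import os
from pathlib import Path
from typing import Mapping

# Defaults mirror the values listed in the commit message.
DEFAULTS = {
    "PARSER_INCOMING_DIR": "/app/data/incoming",
    "PARSER_ARCHIVED_DIR": "/app/data/archived",
    "PARSER_QUARANTINE_DIR": "/app/data/quarantine",
}


def resolve_dirs(env: Mapping[str, str] = os.environ) -> dict[str, Path]:
    """Read the parser directories from env and reject relative paths."""
    dirs = {}
    for name, default in DEFAULTS.items():
        path = Path(env.get(name, default))
        if not path.is_absolute():
            # A relative path would be resolved against the working directory,
            # which is how the parser could end up watching the wrong folder.
            raise ValueError(f"{name} must be an absolute path, got {path}")
        dirs[name] = path
    return dirs
```

Failing fast here turns a silent misconfiguration into an error at container start.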
- Increase test_parser_writes_to_database wait from 45s to 90s to match other tests
- Add progress logging every 15s to help diagnose CI issues
- Add run_e2e_test_locally.sh script to reproduce CI environment locally
  - Replicates exact CI workflow steps
  - Shows service logs and diagnostics
  - Uses macOS-compatible wait loops (no timeout command)

The 45s timeout was too short - MNO simulator uploads metadata immediately, then data files at 30s intervals. Tests need 90s to reliably catch data.
- Use DISTINCT/DISTINCT ON in webserver and visualization queries to avoid duplicate CMLs
- Update database health checks to use correct username (myuser instead of postgres)
…nsert logic
- Add _ensure_connected() to check connection before operations
- Add _with_connection_retry() wrapper to handle OperationalError/InterfaceError
- Extract common batch insert pattern into _execute_batch_insert() method
- Eliminates ~60 lines of duplicate code between write_metadata() and write_rawdata()
- Prevents cascade failures when database connection is lost
- Update test to provide complete DataFrame for connection testing
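A retry wrapper of the kind described above might be shaped like this. It is a sketch under assumptions: the exception classes are local stand-ins for the database driver's OperationalError and InterfaceError, and `Writer`, `reconnect()`, and `write()` are hypothetical names used only to exercise the decorator.

```python
import functools


class OperationalError(Exception):
    """Stand-in for the database driver's OperationalError."""


class InterfaceError(Exception):
    """Stand-in for the database driver's InterfaceError."""


def with_connection_retry(method):
    """Retry a method once after reconnecting if the connection was lost."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except (OperationalError, InterfaceError):
            # Re-establish the connection, then retry exactly once.
            # A second failure propagates to the caller.
            self.reconnect()
            return method(self, *args, **kwargs)
    return wrapper


class Writer:
    """Toy writer whose first write fails, to exercise the wrapper."""

    def __init__(self):
        self.reconnects = 0
        self.fail_next = True

    def reconnect(self):
        self.reconnects += 1

    @with_connection_retry
    def write(self, rows):
        if self.fail_next:
            self.fail_next = False
            raise OperationalError("connection lost")
        return len(rows)
```

Retrying only once keeps a persistent outage from looping forever; the caller still sees the error if the reconnect did not help.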
…ed files
The test was incorrectly checking for CSV files in webserver's archived directory, but parser and webserver use different volume mounts. The webserver reads from the database, not directly from archived CSV files. Updated test to verify webserver can access the database, which is the actual data source for the webserver.
Closes #13
Summary: Add Parser Service with Composite Key Schema and Automatic Recovery
Core Features
- Parser Service (~1200 LOC): Auto-processes CML CSV files uploaded via SFTP, writes to TimescaleDB with robust error handling
- … /app/data/incoming/ for cml_data_*.csv and cml_metadata_*.csv files
- … archived/YYYY-MM-DD/, quarantines failures with .error.txt notes
Database Schema
- Composite Primary Key: cml_metadata(cml_id, sublink_id) to support sublink-specific metadata
- … length column to cml_metadata for link length tracking
- cml_data uses TimescaleDB hypertable on time column
Architecture
Modular design with separation of concerns:
- db_writer.py (262 LOC): Batch inserts with connection retry (exponential backoff), automatic reconnection on OperationalError/InterfaceError
- file_manager.py (97 LOC): Safe file operations with cross-device fallback (move → copy)
- file_watcher.py (86 LOC): Filesystem monitoring via watchdog
- service_logic.py (68 LOC): Core CML processing logic
- parsers/demo_csv_data/: CSV parsers for raw data and metadata
Testing & CI
- … scripts/run_e2e_test_locally.sh
Key Technical Improvements
- _with_connection_retry() wrapper catches connection errors, reconnects, and retries once; _execute_batch_insert() consolidates duplicate write logic
- … PARSER_INCOMING_DIR, PARSER_ARCHIVED_DIR, DATABASE_URL, etc.)
- … length field; configurable via config.yml
Documentation
Files changed: 37 files, +2103/-380 lines
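For readers unfamiliar with the exponential backoff noted for db_writer.py, a minimal sketch follows. The function name, the default attempt count, the delays, and the use of the built-in ConnectionError are all assumptions; the PR does not state the actual values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(attempt_connect: Callable[[], T],
                         attempts: int = 5,
                         base: float = 1.0,
                         factor: float = 2.0,
                         cap: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call attempt_connect() until it succeeds, doubling the wait each time.

    Hypothetical helper; assumes attempts >= 1 and that a failed attempt
    raises ConnectionError.
    """
    for attempt in range(attempts):
        try:
            return attempt_connect()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of attempts: let the last error propagate.
                raise
            # Wait base, base*factor, base*factor^2, ... capped at cap seconds.
            sleep(min(cap, base * factor ** attempt))
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter lets tests substitute a recorder and check the delay sequence without actually waiting.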