MAPS (Medical Annotation Processing System) is a comprehensive Python-based application for parsing, analyzing, and exporting medical imaging annotation data from a variety of imaging systems and file formats. It is built specifically to handle complex imaging session data with multiple observer readings, nodule annotations, coordinate mappings, and research literature.
MAPS combines a FastAPI backend with a modern React web interface. It supports XML, JSON, PDF, and ZIP files and provides real-time updates, advanced analytics, keyword extraction, and seamless Supabase integration for scalable data management.
This system was developed to address the challenges of processing heterogeneous medical annotation data formats, providing researchers and medical professionals with tools to:
- XML: Parse LIDC-IDRI and other medical imaging annotations
- JSON: Process structured annotation data
- PDF: Extract keywords from research papers and documentation
- ZIP: Batch process entire datasets with automatic extraction
- Folders: Recursive directory processing with multi-file support
- Extract observer readings, confidence scores, and nodule characteristics
- Handle multi-session observer reviews and unblinded readings
- Export data to standardized Excel templates and SQLite databases
- Import PYLIDC data directly to Supabase PostgreSQL
- Schema-agnostic parsing with automatic parse case detection
- Automatic keyword extraction from medical documents and PDFs
- Perform advanced analytics on radiologist agreement and data quality
- Process up to 1000 files per batch with real-time progress tracking
Import radiology data from PYLIDC to Supabase PostgreSQL with automatic parse case detection and keyword extraction.
- Set up Supabase: Create a project at supabase.com
- Configure: Copy `.env.example` to `.env` and add your Supabase credentials
- Migrate: Apply the database schema: `psql "$SUPABASE_DB_URL" -f migrations/*.sql`
- Import: Run `python scripts/pylidc_to_supabase.py --limit 10`
**Full guide**: docs/QUICKSTART_SUPABASE.md
- Schema-Agnostic Design: Automatically detects XML structure patterns
- PYLIDC Integration: Direct import from the LIDC-IDRI dataset
- Parse Case Tracking: Know which XML schema was used for each document
- Keyword Extraction: Automatic medical term extraction with categories
- JSONB Storage: Flexible PostgreSQL storage with GIN indexes
- Full-Text Search: Fast document search by keywords and content
- Analytics Ready: Materialized views and helper functions included
from maps.database.enhanced_document_repository import EnhancedDocumentRepository
from maps.adapters.pylidc_adapter import PyLIDCAdapter
import pylidc as pl
# Initialize repository with parse case and keyword tracking
repo = EnhancedDocumentRepository(
enable_parse_case_tracking=True,
enable_keyword_extraction=True
)
# Import PYLIDC scan
adapter = PyLIDCAdapter()
scan = pl.query(pl.Scan).first()
canonical_doc = adapter.scan_to_canonical(scan)
# Insert with automatic detection
doc, content, parse_case, keywords = repo.insert_canonical_document_enhanced(
canonical_doc,
source_file=f"pylidc://{scan.patient_id}",
detect_parse_case=True,
extract_keywords=True
)
print(f"Imported: {scan.patient_id}")
print(f"Parse case: {parse_case}")
print(f"Keywords extracted: {keywords}")** Documentation**:
- Quick Start Guide - Get started in 5 minutes
- Schema-Agnostic Guide - Complete architecture documentation
- Examples - Usage examples
Fully automatic keyword extraction, case detection, and analytics on EVERY import using database triggers!
The system automatically processes ALL imports (XML, PDF, LIDC, JSON) through a complete pipeline:
ANY IMPORT → Automatic Triggers → Keywords Extracted → Case Detected → Views Updated → Ready for Analysis
- Triggers on INSERT: Automatic keyword extraction from all segment types
- Hybrid Case Detection: Filename regex (1.0 confidence) + keyword signature (0.0-1.0)
- Confidence Thresholding: Auto-assign ≥0.8, manual review <0.8
- Cross-Type Validation: Keywords appearing in both qualitative and quantitative segments
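The thresholding itself runs inside the database triggers (migration 007); the following is a minimal Python sketch of the hybrid rule described above, with illustrative helper and pattern names rather than the actual trigger functions:

```python
# Hypothetical sketch of the hybrid detection rule described above; the real logic
# runs inside PostgreSQL triggers (migration 007), and these names are illustrative.
import re

FILENAME_PATTERNS = {
    r"LIDC-IDRI-\d{4}": "LIDC",   # filename regex hit -> confidence 1.0
}

def detect_case(filename, keyword_signature_score):
    """Return (case_label, confidence, action) for one imported file."""
    for pattern, case_label in FILENAME_PATTERNS.items():
        if re.search(pattern, filename):
            return case_label, 1.0, "auto-assign"
    # Fall back to the keyword-signature score in [0.0, 1.0]
    if keyword_signature_score >= 0.8:
        return "keyword-match", keyword_signature_score, "auto-assign"
    return None, keyword_signature_score, "manual review"

print(detect_case("LIDC-IDRI-0001.xml", 0.3))   # ('LIDC', 1.0, 'auto-assign')
print(detect_case("scan_042.xml", 0.65))        # (None, 0.65, 'manual review')
```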
- `file_summary` - Per-file aggregated statistics
- `segment_statistics` - Per-segment metrics (word count, numeric density, keywords)
- `numeric_data_flat` - Auto-extracted numeric fields from JSONB
- `cases_with_evidence` - Established cases with linked data
- `unresolved_segments` - Orphaned data needing assignment
- `case_identifier_validation` - Completeness metrics with actionable recommendations
- `lidc_patient_summary` - Patient-level consensus (9 characteristics: subtlety, malignancy, etc.)
- `lidc_nodule_analysis` - Per-nodule with per-radiologist columns
- `lidc_patient_cases` - Case-level rollup with TCIA links
- `lidc_3d_contours` - Spatial coordinates for 3D visualization
- `lidc_contour_slices` - Per-slice polygon data
- `lidc_nodule_spatial_stats` - Derived spatial statistics
- `export_universal_wide` - All data types, flattened
- `export_lidc_analysis_ready` - SPSS/R/Stata format (one row per radiologist rating)
- `export_lidc_with_links` - Patient summary with TCIA download links
- `export_radiologist_data` - Inter-rater analysis format
- `export_top_keywords` - Top 1000 keywords by relevance
- All export views accessible to anonymous users
- LIDC medical views (de-identified data)
- Universal analysis views
- Internal processing tables restricted to authenticated users
- Curated Medical Concepts: Lung-RADS®, RadLex, LIDC-IDRI, TCIA, Radiomics, cTAKES, NER
- Categories: Standardization Systems, Diagnostic Concepts, Imaging Biomarkers, Performance Metrics
- AMA Citations: Full references to source papers and documentation
- Topic Tags: Filtering by "LIDC", "Radiomics", "NLP", "Reporting", "Biomarkers", etc.
- Bidirectional Navigation: Keyword → Files/Segments/Cases AND File/Case → Keywords
- `keyword_directory` - Complete catalog with usage stats and citations
- `keyword_occurrence_map` - Where-used at segment level
- `file_keyword_summary` - Keywords per file
- `case_keyword_summary` - Keywords per case
- `keyword_subject_category_summary` - Rollup by category
- `keyword_topic_tag_summary` - Rollup by tag
The system includes complete 3D contour processing utilities:
from maps.lidc_3d_utils import (
extract_nodule_mesh,
calculate_consensus_contour,
compute_inter_rater_reliability,
generate_3d_visualization,
get_tcia_download_script
)
# Extract 3D mesh for 3D printing
mesh_path = extract_nodule_mesh("LIDC-IDRI-0001", "1", contour_data, "stl")
# Calculate consensus from multiple radiologists
consensus = calculate_consensus_contour([rad1, rad2, rad3, rad4], method='average')
# Compute inter-rater reliability
ratings = {
"malignancy": [4, 5, 4, 4],
"subtlety": [3, 3, 4, 3]
}
metrics = compute_inter_rater_reliability(ratings)
print(f"ICC: {metrics['malignancy_icc']:.3f}")
# Generate interactive 3D visualization
html_path = generate_3d_visualization("LIDC-IDRI-0001", "1", contour_data)

The complete system is deployed via 14 SQL migrations:
- 001_initial_schema - Core tables (already existed)
- 002_unified_case_identifier - Schema-agnostic foundation (already existed)
- 003-005 - Various enhancements (already existed)
- 006_automatic_triggers - Keyword extraction triggers NEW
- 007_case_detection_system - Hybrid case detection NEW
- 008_universal_views - Cross-format views NEW
- 009_lidc_specific_views - Medical analysis views NEW
- 010_lidc_3d_contour_views - Spatial visualization NEW
- 011_export_views - CSV-ready materialized views NEW
- 012_public_access_policies - RLS for anonymous read NEW
- 013_keyword_semantics - Canonical keywords + citations NEW
- 014_keyword_navigation_views - Keyword discovery NEW
# Apply all migrations (run in order)
for i in {1..14}; do
psql "$SUPABASE_DB_URL" -f migrations/$(printf "%03d" $i)*.sql
done
# Refresh all export views
psql "$SUPABASE_DB_URL" -c "SELECT * FROM refresh_all_export_views();"
# Backfill canonical keyword links
psql "$SUPABASE_DB_URL" -c "SELECT * FROM backfill_canonical_keyword_ids();"
# Check database statistics
psql "$SUPABASE_DB_URL" -c "SELECT * FROM public_database_statistics;"# Query keyword directory
import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.getenv("SUPABASE_DB_URL"))
# Get all keywords in a category
query = """
SELECT * FROM keyword_directory
WHERE subject_category = 'Radiologist Perceptive and Diagnostic Concepts'
ORDER BY total_occurrences DESC
"""
keywords = pd.read_sql(query, engine)
# Get canonical keywords for a specific file
query = "SELECT * FROM get_file_canonical_keywords(%s)"
file_keywords = pd.read_sql(query, engine, params=[file_id])
# Search by topic tag
query = "SELECT * FROM search_keywords_by_tag('LIDC')"
lidc_keywords = pd.read_sql(query, engine)
# Get where a keyword is used
query = "SELECT * FROM get_canonical_keyword_occurrences('malignancy')"
occurrences = pd.read_sql(query, engine)

- Keywords Tab: Browse canonical keywords, filter by category/tag
- Keyword Detail Modal: Click any keyword → see all files/segments/cases
- Clickable Keyword Chips: Throughout the dashboard for easy navigation
- TCIA Integration: Direct links to study pages and DICOM downloads
- 3D Visualization: In-browser nodule rendering with Plotly
- Case Assignment Interface: Manual review queue for confidence <0.8
**Complete Documentation**: Analysis and Export System Guide
MAPS/
main.py # Application entry point
XMLPARSE.py # Core GUI application and parsing engine
radiology_database.py # SQLite database operations and analytics
config.py # Configuration management
enhanced_logging.py # Advanced logging system
performance_config.py # Performance optimization settings
XML Files → Parser Engine → Data Validation → Export Engine → Output Files
↓ ↓ ↓ ↓ ↓
Multi-format Structure Quality Checks Template Excel/SQLite
Detection Analysis Missing Values Formatting + Analytics
- Language: Python 3.8+
- GUI Framework: Tkinter (custom-styled)
- Data Processing: Pandas, NumPy
- Excel Operations: OpenPyXL
- Database: SQLite3
- XML Processing: ElementTree
- File Operations: Cross-platform file handling
- Entry point for the GUI application
- Window configuration and initialization
- Import error handling and system compatibility checks
The heart of the application containing:
- NYTXMLGuiApp: Main application class
- File/folder selection interfaces
- Progress tracking with live updates
- Export format selection dialogs
- Real-time processing feedback
- parse_radiology_sample(): Main XML parsing function
- detect_parse_case(): Intelligent XML structure detection
- parse_multiple(): Batch processing with memory optimization
- Multi-format support (NYT, LIDC, custom formats)
- Template transformation: Radiologist 1-4 column format
- Nodule-centric organization: Grouping by file and nodule
- Quality validation: Missing value detection and reporting
- Memory optimization: Batch processing for large datasets
- Excel Export: Multiple format options with rich formatting
- SQLite Export: Relational database with analytics capabilities
- Template Format: User-defined column structure
- Multi-sheet organization: Separate sheets per folder/parse case
- RadiologyDatabase class: SQLite wrapper with medical data focus
- Batch operations: Efficient data insertion and querying
- Analytics engine: Radiologist agreement analysis
- Quality reporting: Data completeness and consistency checks
- Excel integration: Database-to-Excel export with formatting
- NYT Format: Standard radiology XML with ResponseHeader structure
- LIDC Format: Lung Image Database Consortium XML structure
- Custom Formats: Extensible parsing for new XML schemas
- Automatic Detection: Intelligent format recognition
Complete_Attributes - Full radiologist data (confidence, subtlety, obscuration, reason)
With_Reason_Partial - Includes reason field with partial attributes
Core_Attributes_Only - Essential attributes without reason
Minimal_Attributes - Limited attribute set
No_Characteristics - Structure without characteristic data
LIDC_Single_Session - Single LIDC reading session
LIDC_Multi_Session_X - Multiple LIDC sessions (2-4 radiologists)
No_Sessions_Found - XML without readable sessions
XML_Parse_Error - Malformed or unparseable XML
Detection_Error - Structure analysis failure
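A hedged usage sketch of parse case detection; the exact signatures of `detect_parse_case()` and `parse_radiology_sample()` in `XMLPARSE.py` may differ from what is assumed here:

```python
# Illustrative only: assumes detect_parse_case() accepts a file path and returns one
# of the labels above, and parse_radiology_sample() returns parsed annotation rows.
# Check XMLPARSE.py for the real signatures before relying on this.
from XMLPARSE import detect_parse_case, parse_radiology_sample

xml_path = "samples/example_annotation.xml"
parse_case = detect_parse_case(xml_path)
print(f"Detected parse case: {parse_case}")     # e.g. "LIDC_Single_Session"

if parse_case not in ("XML_Parse_Error", "Detection_Error"):
    rows = parse_radiology_sample(xml_path)
    print(f"Parsed {len(rows)} annotation rows")
```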
- Radiologist Information: ID, session type, reading timestamps
- Nodule Characteristics: Confidence, subtlety, obscuration, diagnostic reason
- Coordinate Data: X, Y, Z coordinates with edge mapping
- Medical Metadata: StudyInstanceUID, SeriesInstanceUID, SOP_UID, modality
- Session Classification: Standard vs. Detailed coordinate sessions
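For orientation, a hypothetical dataclass summarizing the extracted fields listed above (not the parser's actual internal representation):

```python
# Hypothetical shape of one extracted annotation record, based on the fields listed
# above; the parser's actual internal structures may differ.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class NoduleAnnotation:
    # Radiologist information
    radiologist_id: str
    session_type: str                                  # "Standard" or "Detailed"
    # Nodule characteristics
    confidence: Optional[int] = None
    subtlety: Optional[int] = None
    obscuration: Optional[int] = None
    reason: Optional[str] = None
    # Coordinate data: (x, y, z) edge-map points
    coordinates: List[Tuple[float, float, float]] = field(default_factory=list)
    # Medical metadata
    study_instance_uid: str = ""
    series_instance_uid: str = ""
    sop_uid: str = ""
    modality: str = ""
```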
- Individual XML file parsing
- Immediate feedback on parse results
- Error handling and reporting
- Data preview capabilities
- Recursive XML file discovery
- Batch processing with progress tracking
- Per-folder statistics and reporting
- Error isolation (continue on failure)
- Combined Output: Single Excel file with multiple sheets
- Folder Organization: Separate sheet per source folder
- Template Format: Radiologist 1-4 repeating column structure
- Single Database: Combined SQLite database for all folders
- Progress Tracking: Real-time processing updates with live logging
- Parse case sheets: Separate sheets by XML structure type
- Session separation: Detailed vs. Standard coordinate sessions
- Color coding: Parse case-based row highlighting
- Missing value highlighting: Orange highlighting for MISSING values
- Auto-formatting: Column width adjustment and alignment
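A minimal openpyxl sketch of the missing-value highlighting described above; the file name, exact fill color, and column logic are assumptions:

```python
# Illustrative only: flag cells containing "MISSING" with an orange fill, mirroring
# the exporter's highlighting. The real code in XMLPARSE.py may use different styles.
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

ORANGE = PatternFill(start_color="FFC000", end_color="FFC000", fill_type="solid")

wb = load_workbook("radiology_export.xlsx")      # any MAPS Excel export
for ws in wb.worksheets:
    for row in ws.iter_rows(min_row=2):          # skip the header row
        for cell in row:
            if cell.value == "MISSING":
                cell.fill = ORANGE
wb.save("radiology_export_highlighted.xlsx")
```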
- Radiologist Columns: Repeating "Radiologist 1", "Radiologist 2", "Radiologist 3", "Radiologist 4"
- Compact Ratings: Format like "Conf:5 | Sub:3 | Obs:2 | Reason:1"
- Color Coordination: Each radiologist column gets unique color scheme
- Comprehensive Headers: FileID, NoduleID, ParseCase, SessionType, coordinates, metadata
- Combined Sheet: "All Combined" with data from all folders
- Individual Sheets: One sheet per source folder
- Consistent Formatting: Template format across all sheets
- Navigation: Easy switching between folder views
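A hedged sketch of how the compact rating strings in the Radiologist 1-4 template columns could be assembled; the helper name and input keys are hypothetical:

```python
# Hypothetical helper that builds the compact rating string used in the
# Radiologist 1-4 template columns; the input keys are illustrative.
def compact_rating(rating):
    return " | ".join([
        f"Conf:{rating.get('confidence', 'MISSING')}",
        f"Sub:{rating.get('subtlety', 'MISSING')}",
        f"Obs:{rating.get('obscuration', 'MISSING')}",
        f"Reason:{rating.get('reason', 'MISSING')}",
    ])

print(compact_rating({"confidence": 5, "subtlety": 3, "obscuration": 2, "reason": 1}))
# -> Conf:5 | Sub:3 | Obs:2 | Reason:1
```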
-- Core tables for relational data organization
sessions - Individual radiologist reading sessions
nodules - Unique nodule instances with metadata
radiologists - Radiologist information and statistics
files - Source file tracking and metadata
batches - Processing batch management
quality_issues - Data quality problem tracking

- Radiologist Agreement: Inter-rater reliability calculations
- Data Quality Metrics: Completeness, consistency analysis
- Performance Statistics: Processing time and success rates
- Batch Tracking: Historical processing information
- SQL query interface for custom analysis
- Predefined analytical views
- Export capabilities to Excel with formatting
- Integration with external analysis tools
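A minimal sketch of querying the exported SQLite database directly; the database file name is a placeholder, and column-level queries should be adapted to the actual schema:

```python
# Connect to an exported MAPS SQLite database and inspect it directly.
# The table names come from the schema above; the file name and any column-level
# queries are assumptions to adapt to your actual export.
import sqlite3

conn = sqlite3.connect("radiology_data.db")
cur = conn.cursor()

# Confirm the core tables exist
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
print([row[0] for row in cur.fetchall()])
# expected to include: batches, files, nodules, quality_issues, radiologists, sessions

# Simple row counts per table
for table in ("sessions", "nodules", "radiologists", "files"):
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    print(table, cur.fetchone()[0])

conn.close()
```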
- Missing Value Detection: Identification of MISSING vs #N/A vs empty values
- Data Completeness Analysis: Per-column and overall completeness statistics
- Type Validation: Ensuring numeric fields contain valid numbers
- Structure Validation: XML schema compliance checking
- Quality Warnings: User prompts for data quality issues
- Continue/Cancel Options: User choice on problematic data
- Detailed Reporting: Comprehensive quality statistics
- Column Hiding: Auto-hide columns with >85% missing values
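A minimal pandas sketch of the per-column completeness check and the >85%-missing auto-hide rule described above; the input file is illustrative:

```python
# Illustrative completeness check: per-column completeness plus the >85%-missing
# rule used for auto-hiding columns in the Excel export. The input file is a placeholder.
import pandas as pd

df = pd.read_excel("radiology_export.xlsx")
missing = df.isin(["MISSING", "#N/A"]) | df.isna()   # sentinels and empty cells

missing_rate = missing.mean()                        # fraction missing per column
report = pd.DataFrame({
    "completeness_%": ((1 - missing_rate) * 100).round(1),
    "auto_hide": missing_rate > 0.85,                # matches the >85% threshold above
})
print(report.sort_values("completeness_%"))
```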
- Graceful Degradation: Continue processing on individual file failures
- Error Logging: Detailed error tracking with timestamps
- User Feedback: Clear error messages and resolution suggestions
- Recovery Options: Partial processing results preservation
- Clean Design: Aptos font, consistent color scheme (#d7e3fc)
- Intuitive Layout: Logical workflow progression
- File Management: Easy file/folder selection and management
- Export Options: Clear choice between Excel and SQLite formats
- Live Progress Bars: Visual progress indication
- Real-time Logging: Timestamped activity log with color coding
- File-by-file Updates: Individual file processing status
- Statistics Display: Success/failure counts, processing rates
- Auto-close Options: Configurable completion behavior
- Color-coded Messages: Info (blue), success (green), warning (orange), error (red)
- Creator Signature: Animated signature popup on startup
- Status Updates: Contextual status information
- Error Popups: Temporary error notifications
- Batch Processing: Process files in configurable batches
- Garbage Collection: Explicit memory cleanup
- Data Streaming: Minimize memory footprint for large datasets
- Efficient Data Structures: Optimized data organization
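A minimal sketch of the batch-plus-explicit-garbage-collection pattern described above; the batch size and parser callable are placeholders:

```python
# Illustrative batching pattern: parse files in fixed-size batches and release memory
# explicitly between batches. parse_file is a placeholder for the real parser callable.
import gc

def process_in_batches(paths, parse_file, batch_size=100):
    results = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        results.extend(parse_file(path) for path in batch)
        gc.collect()   # explicit cleanup between batches
        print(f"Processed {min(start + batch_size, len(paths))}/{len(paths)} files")
    return results
```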
- Smart Sampling: Intelligent sampling for column width calculation
- Vectorized Operations: Pandas optimization for data manipulation
- Batch Database Operations: Efficient SQLite bulk insertions
- Parallel Processing Ready: Architecture supports future parallelization
- Responsive UI: Non-blocking progress updates
- Background Processing: Long operations don't freeze interface
- Cancellation Options: User can interrupt long operations
- Resource Monitoring: Memory and performance tracking
- Core XML Parsing Engine - Multi-format XML processing
- GUI Application - Complete Tkinter interface
- Excel Export System - Multiple export formats with rich formatting
- SQLite Database Integration - Relational database with analytics
- Multi-Folder Processing - Combined output generation
- Template Format Export - Radiologist 1-4 column structure
- Quality Validation System - Comprehensive data quality checks
- Progress Tracking - Real-time processing feedback
- Error Handling - Robust error management and recovery
A separate application for database analysis and visualization:
- Database Browser: Navigate and explore SQLite databases
- Query Interface: Visual SQL query builder
- Analytics Dashboard: Radiologist agreement analysis
- Data Visualization: Charts and graphs for data insights
- Export Tools: Advanced export options from database
- Comparison Tools: Compare multiple databases
- Statistical Analysis: Advanced inter-rater reliability metrics
- Machine Learning Integration: Anomaly detection in radiologist readings
- Predictive Modeling: Quality prediction based on XML structure
- Batch Comparison: Compare processing results across batches
- Parallel Processing: Multi-core processing for large batches
- Cloud Integration: AWS/Azure processing capabilities
- API Development: REST API for automated processing
- Docker Containerization: Deployment and scaling support
- DICOM Integration: Support for DICOM file processing
- Web Interface: Browser-based processing interface
- Real-time Monitoring: Live processing dashboards
- Integration APIs: Connect with hospital information systems
- Natural Language Processing: Extract insights from reason text
- Computer Vision: Image coordinate validation
- Automated Quality Assessment: AI-powered data quality scoring
- Predictive Analytics: Forecast processing outcomes
- Python: 3.8 or higher
- RAM: Minimum 4GB, Recommended 8GB+
- Storage: 1GB+ free space for databases and exports
- OS: Windows 10+, macOS 10.14+, Linux (Ubuntu 18.04+)
- Small Dataset (<1,000 files): ~2-5 minutes
- Medium Dataset (1,000-10,000 files): ~10-30 minutes
- Large Dataset (10,000+ files): ~30+ minutes
- Memory Usage: ~100-500MB typical, scales with dataset size
# Core Dependencies
pandas>=1.3.0
openpyxl>=3.0.9
numpy>=1.21.0
# GUI and System
tkinter (built-in)
platform (built-in)
subprocess (built-in)
# Database
sqlite3 (built-in)
# XML Processing
xml.etree.ElementTree (built-in)

# Clone the repository
git clone <repository-url>
cd "XML PARSE"
# Install dependencies
pip install pandas openpyxl numpy
# Run the application
python main.py

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -r requirements.txt
# Run tests (if available)
python -m pytest tests/
# Start development server
python main.py

- Launch application: `python main.py`
- Click "Select XML Files" or "Select Folders"
- Choose export format (Excel/SQLite/Both)
- Click "Export to Excel" or "Export to SQLite"
- Monitor progress and review results
- Click "Select Folders" → "Multiple Folders"
- Use Cmd+Click (macOS) to select multiple folders
- Choose combined export format
- Process creates single Excel with multiple sheets
- Single SQLite database contains all folder data
- Select files/folders for processing
- Choose "Export to Excel"
- System automatically applies template format
- Results show Radiologist 1-4 columns with compact ratings
- Export data to SQLite format
- Use generated analysis Excel for quick insights
- Query database directly using SQL tools
- Future: Use Database GUI for advanced analysis
- Single-threaded Processing: No parallel processing yet
- Memory Usage: Large datasets can consume significant RAM
- XML Format Support: Limited to known formats (extensible)
- Error Recovery: Some XML errors cannot be automatically resolved
- Very Large Files: Files >100MB may process slowly
- Special Characters: Some Unicode characters in XML may cause issues
- Network Drives: Processing from network locations may be slower
- macOS Permissions: May require permissions for file access
- Large Datasets: Process in smaller batches
- Memory Issues: Close other applications during processing
- File Errors: Check XML validity before processing
- Performance: Use local storage for better performance
- Follow PEP 8 Python style guidelines
- Add docstrings to all functions and classes
- Include type hints where appropriate
- Write tests for new features
- Update documentation for changes
- main.py: Entry point only, minimal logic
- XMLPARSE.py: Core functionality, well-documented
- radiology_database.py: Database operations
- New features: Consider separate modules for large features
- Test with various XML formats
- Verify export formats work correctly
- Check error handling with malformed data
- Performance test with large datasets
**Testing Documentation:**
- Testing Guide - Comprehensive testing documentation
- Quick Reference - Quick commands and tips
Run Tests:
# Web tests
cd web/ && npm test
# Python tests
pytest -v
# Coverage reports
npm run test:coverage # web
pytest --cov=src --cov-report=html  # python

CI/CD: Tests run automatically on push/PR via GitHub Actions (`.github/workflows/test.yml`)
MAPS is proprietary software with dual licensing:
- Free for academic research and education
- Must cite in publications
- No commercial use permitted
- See LICENSE for full terms
- Required for any for-profit use
- Includes support and updates
- Custom pricing based on use case
- Contact for commercial licensing
Copyright (c) 2025 Isa Lucia Schlichting. All Rights Reserved.
If you use MAPS in academic research, please cite:
@software{schlichting2025maps,
author = {Schlichting, Isa Lucia},
title = {MAPS: Medical Annotation Processing System},
year = {2025},
publisher = {GitHub},
url = {https://github.com/luvisaisa/MAPS}
}

For commercial licensing, enterprise support, or questions:
- 📧 Email: isa.lucia.sch@outlook.com
- 📄 Details: COMMERCIAL_LICENSE.md
- 💻 Repository: https://github.com/luvisaisa/MAPS
- Repository: NYTXMLPARSE (GitHub)
- Author: luvisaisa
- Created: 2025
- Language: Python
For issues, questions, or contributions:
- Create an issue in the GitHub repository
- Review existing documentation
- Check known issues section
- Contact development team
Last Updated: August 12, 2025 | Version: 2.0 | Status: Active Development