ccc-omop-analyzer

OHDSI tool execution for OMOP CDM data on BigQuery. Deployed as Cloud Run service + jobs.

Overview

Executes OHDSI data quality and characterization tools on BigQuery OMOP datasets:

  • DQD (DataQualityDashboard) - Data quality validation with 4,000+ checks
  • Achilles - Database characterization and descriptive statistics for ATLAS
  • PASS (Profile of Analytic Suitability Score) - Evaluates data fitness-for-purpose across six dimensions (accessibility, provenance, standards, concept diversity, source diversity, temporal)

Architecture

  • Cloud Run Service - REST API for health checks, Atlas table creation, and report generation
  • Cloud Run Jobs - Long-running tool executions (up to 24h):
    • ccc-omop-analyzer-dqd-job
    • ccc-omop-analyzer-achilles-job
    • ccc-omop-analyzer-pass-job
  • Airflow Integration - Jobs triggered via CloudRunExecuteJobOperator

BigQuery Schema Naming

Project IDs with hyphens require special handling:

  • Achilles: Pass the dataset name only (e.g., "ehr_synthea"); the JDBC connection's DefaultDataset parameter handles qualification.
  • DQD: Use double-quoted, fully qualified names (e.g., '"project-id".dataset'); SqlRender translates the double quotes into BigQuery backticks.
  • PASS: Uses the project.dataset format directly (e.g., "project-id.ehr_synthea").
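
For illustration, a minimal R sketch of the three conventions; the project and dataset names are hypothetical:

```r
project_id <- "my-gcp-project"   # hypothetical hyphenated project ID
dataset_id <- "ehr_synthea"

# Achilles: dataset name only; the JDBC URL's DefaultDataset parameter
# supplies the project qualification.
achilles_schema <- dataset_id                             # "ehr_synthea"

# DQD: double-quoted fully qualified name; SqlRender translates the
# double quotes into BigQuery backticks.
dqd_schema <- sprintf('"%s".%s', project_id, dataset_id)  # '"my-gcp-project".ehr_synthea'

# PASS: plain project.dataset.
pass_schema <- paste0(project_id, ".", dataset_id)        # "my-gcp-project.ehr_synthea"
```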

PASS Configuration

The PASS job evaluates data quality across six evidence-based dimensions:

  1. Accessibility - Whether clinical data is present and accessible for analysis
  2. Provenance - Information preservation and traceability through mapping
  3. Standards - Use of standardized vocabularies for interoperable research
  4. Concept Diversity - Variety of distinct clinical concepts in the data
  5. Source Diversity - Variety of data source types (EHR, claims, registries)
  6. Temporal - Data distribution over time (span, density, consistency)

Default Settings (configured in constants.R):

  • METRICS = "all" - All six metrics are calculated
  • CALCULATE_COMPOSITE = TRUE - Weighted composite score is generated
  • VERBOSE_MODE = TRUE - Detailed logging enabled
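
A hedged sketch of how these defaults might look in R/constants.R (the actual file layout may differ):

```r
# PASS defaults, as documented above.
METRICS             <- "all"  # calculate all six metrics
CALCULATE_COMPOSITE <- TRUE   # generate the weighted composite score
VERBOSE_MODE        <- TRUE   # enable detailed logging
```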

Outputs: Five CSV files uploaded to GCS containing field-level, table-level, and overall scores with 95% confidence intervals.

Components

Core Files

| File | Purpose |
| --- | --- |
| Dockerfile | Container image with R, DQD, Achilles, PASS, and dependencies |
| cloudbuild.yaml | Cloud Build config for service + jobs (DQD, Achilles, PASS) |
| plumber_api.R | REST API with health check, Atlas table creation, and report generation |
| entrypoint.sh | Container entrypoint for service account authentication |

R Modules (R/ directory)

| File | Purpose |
| --- | --- |
| constants.R | Configuration constants (DQD, Achilles, PASS) |
| utils.R | Helper functions (GCS, JDBC, authentication, SQL) |
| run_dqd.R | DQD execution logic |
| run_dqd_job.R | DQD job entrypoint |
| run_achilles.R | Achilles execution logic |
| run_achilles_job.R | Achilles job entrypoint |
| run_pass.R | PASS execution logic |
| run_pass_job.R | PASS job entrypoint |
| run_create_atlas_results_tables.R | Atlas results table creation |
| run_generate_delivery_report.R | OMOP delivery report generation |

Deployment

```sh
gcloud builds submit --config cloudbuild.yaml
```

Deploys:

  • Service: ccc-omop-analyzer (4 CPU, 8GB RAM, 1h timeout)
  • Jobs: ccc-omop-analyzer-{dqd,achilles,pass}-job (4 CPU, 8GB RAM, 2h timeout, configurable to 24h)

Environment Variables

All jobs require PROJECT_ID, CDM_DATASET_ID, and GCS_ARTIFACT_PATH. Additional variables:

DQD/Achilles jobs:

  • ANALYTICS_DATASET_ID - Target dataset for results
  • CDM_VERSION - OMOP CDM version (e.g., "5.4")
  • CDM_SOURCE_NAME - Source identifier
  • SERVICE_ACCOUNT_EMAIL - BigQuery JDBC auth

PASS job: No additional variables required.
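
For illustration, a minimal sketch of how a job entrypoint might validate these variables at startup (variable names are from this README; the error handling is illustrative):

```r
# Fail fast if any required variable is unset.
required <- c("PROJECT_ID", "CDM_DATASET_ID", "GCS_ARTIFACT_PATH")
values   <- Sys.getenv(required, unset = NA)
if (anyNA(values)) {
  stop("Missing environment variables: ",
       paste(required[is.na(values)], collapse = ", "))
}

# DQD/Achilles-only variables, with an illustrative CDM version fallback.
analytics_dataset_id  <- Sys.getenv("ANALYTICS_DATASET_ID")
cdm_version           <- Sys.getenv("CDM_VERSION", unset = "5.4")
cdm_source_name       <- Sys.getenv("CDM_SOURCE_NAME")
service_account_email <- Sys.getenv("SERVICE_ACCOUNT_EMAIL")
```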

Job Execution

Jobs are triggered from Airflow via CloudRunExecuteJobOperator. Each job:

  1. Authenticates with BigQuery/GCS
  2. Executes tool logic (run_dqd(), run_achilles(), or run_pass())
  3. Uploads results to GCS and/or BigQuery
  4. Exits with status code (0=success, 1=failure)
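
A minimal sketch of the exit-code convention, assuming each run_*() function signals failure by raising an R error:

```r
# Simplified control flow for a job entrypoint (e.g., run_dqd_job.R).
status <- tryCatch({
  run_dqd()  # or run_achilles() / run_pass()
  0L         # success
}, error = function(e) {
  message("Job failed: ", conditionMessage(e))
  1L         # failure
})
quit(save = "no", status = status)
```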

Output Artifacts

| Tool | GCS artifacts | BigQuery artifacts |
| --- | --- | --- |
| DQD | dqdashboard_results.{json,csv} | {analytics_dataset_id}.dqd_results |
| Achilles | achilles_results.csv, results/*.json (ARES) | {dataset_id}.achilles_*, {dataset_id}.*_concept_counts |
| PASS | pass_overall.csv (scores by metric with CIs), pass_table_level.csv (scores by table), pass_field_level.csv (field-level detail), pass_composite_overall.csv (composite score), pass_composite_components.csv (metric contributions) | (none) |
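
As a sketch of the GCS upload step, one artifact might be pushed with googleCloudStorageR; the bucket and object names here are hypothetical, and the actual helpers live in utils.R:

```r
library(googleCloudStorageR)

# Hypothetical upload of one PASS artifact to the configured artifact path.
gcs_upload(file   = "pass_overall.csv",
           bucket = "my-artifact-bucket",
           name   = "artifacts/pass/pass_overall.csv")
```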

API Endpoints

| Endpoint | Method | Parameters | Purpose |
| --- | --- | --- | --- |
| /heartbeat | GET | (none) | Health check |
| /create_atlas_results_tables | POST | project_id, cdm_dataset_id, analytics_dataset_id | Create Atlas results tables in BigQuery |
| /generate_delivery_report | POST | delivery_report_path, dqd_results_path, output_gcs_path | Generate HTML report from delivery metrics and DQD results |
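
For flavor, a minimal plumber sketch of the health-check endpoint (the actual plumber_api.R may differ):

```r
library(plumber)

#* Health check
#* @get /heartbeat
function() {
  list(status = "alive",
       time   = format(Sys.time(), tz = "UTC", usetz = TRUE))
}
```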

Authentication

Service account credentials are pulled from Secret Manager (ccc-omop-analyzer-secret, SERVICE_ACCOUNT_JSON).

Required permissions:

  • BigQuery read/write
  • GCS write
  • Secret Manager read
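
A hedged sketch of wiring the retrieved key into Google's standard credential discovery (the mount path is illustrative; the real flow lives in entrypoint.sh):

```r
# Point Google client libraries at the service-account key fetched from
# Secret Manager; the file path is hypothetical.
Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "/secrets/service_account.json")
```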

About

Bundles HADES packages (e.g., DQD, Achilles) in a plumber API deployed as a GCP Cloud Run service for the Connect EHR pipeline.
