This repository contains:
- An example of how to run the Insilicom data harmonization API.
- An example of how to calculate the MAS score for the input, the API output, and the manually curated metadata.
The definition of the MAS score:
The MAS score quantifies how well input metadata aligns with a given data model, such as the GDC standard. It is calculated as the ratio of harmonized data cells to the total number of expected cells across all required variables (16 for GDC). The score ranges from 0 to 1, with higher values indicating better compliance and metadata quality.
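The ratio described above can be sketched in a few lines of Python. This is only an illustration, not the implementation in the notebook: the function name `mas_score`, the list-of-dicts record layout, and the rule that a cell counts as harmonized when its value is one of the permitted values for that variable are all assumptions made for this example.

```python
from typing import Any, Collection, Dict, List


def mas_score(
    records: List[Dict[str, Any]],
    allowed_values: Dict[str, Collection[str]],
) -> float:
    """Return harmonized cells / expected cells (hypothetical helper).

    records: one dict per sample, mapping variable name to value (assumed layout).
    allowed_values: required variable name -> permitted values (assumed layout).
    """
    # Expected cells = number of samples x number of required variables.
    total_cells = len(records) * len(allowed_values)
    if total_cells == 0:
        return 0.0

    harmonized = 0
    for record in records:
        for variable, permitted in allowed_values.items():
            value = record.get(variable)
            # Assumption: a cell is harmonized only if it holds a permitted value.
            if isinstance(value, str) and value in permitted:
                harmonized += 1
    return harmonized / total_cells


# Toy data model with two required variables (not the real GDC list of 16).
data_model = {
    "gender": {"male", "female"},
    "vital_status": {"Alive", "Dead"},
}
samples = [
    {"gender": "male", "vital_status": "Alive"},  # 2 of 2 cells harmonized
    {"gender": "M", "vital_status": "Dead"},      # 1 of 2 cells harmonized
]
print(mas_score(samples, data_model))  # 3 harmonized / 4 expected = 0.75
```

With the real data, the same idea would apply to the 16 required GDC variables and the 20 samples, giving 320 expected cells per file. The actual matching rules may differ from the assumption used here.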
The dataset includes:
- geo_input_20.json: The metadata of 20 lung cancer-related GEO samples, provided in JSON format.
- gold_standard_20.json: The same 20 GEO metadata records, manually curated by us.
- gdc_meta.json: The Genomic Data Commons (GDC) standard.
- Benchmark Name: Cancer Harmonization Benchmark
- Version: 1.0
- File name: gold_standard_20.json
- Data Type: metadata
- Data Model Source: GDC
- Data Model Target: GDC
- Includes Variable Mappings: true
- Includes Value Mappings: true
- Has Hidden Data: true
- Hidden Data Proportion: 20%
- Annotation Method: Manual by 2 experts
- Number of Variables: 23
- Number of Values: 460
- API Submission Supported: true
- Container Submission Supported: false
- License: CC BY 4.0
- Number of Cases: 20
To run the data harmonization API, follow the 01_metadata_API_calling.ipynb notebook.
To calculate the MAS score, follow the 02_MAS_score_calculation.ipynb notebook.
This benchmark dataset was developed by Insilicom. It is subject to limited release.
For questions, please contact:
- Jinfeng Zhang – jinfeng@insilicom.com
- Sheldon Pang – xpang@insilicom.com
© 2025 Insilicom LLC. All Rights Reserved. Limited external release.