From c1206abadd5a031c26ffcd45af8501e992b5f0bd Mon Sep 17 00:00:00 2001
From: msguixe <56801139+msguixe@users.noreply.github.com>
Date: Thu, 22 Jan 2026 15:25:26 +0100
Subject: [PATCH 1/3] Enhance GENIE dataset documentation

Added detailed description and notes about the GENIE dataset, including data sources, version, and important notes regarding synonymous mutations.
---
 docs/Datasets/General_datasets/GENIE.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/Datasets/General_datasets/GENIE.md b/docs/Datasets/General_datasets/GENIE.md
index 4ee30b3fb..7121fe13c 100644
--- a/docs/Datasets/General_datasets/GENIE.md
+++ b/docs/Datasets/General_datasets/GENIE.md
@@ -1,5 +1,28 @@
 # GENIE

 ## Description
+A dataset of many panel sequencing samples (>170 000) from many institutions in many countries.
+[Dataset website](https://genie.synapse.org/Explore/GENIE).
+Data downloaded from [sypanspe.org](synapse.org).
+[Here](https://docs.google.com/presentation/d/18MMDAOfSa3rRfApb8obhicyIU2n60aGcM9QA_2lMbao/edit?slide=id.p#slide=id.p)
+you can find a Google Slides document with some details about the GENIE dataset.
+GENIE contains panels from MSKCC IMPACT (>80%), DFCI, UCSC, VHIO, etc.
+Many of them are without a normal-matched control.
+Some panels are on hotspots instead of the whole gene.
+Overall, the MSKCC IMPACT panel data is the most complete one,
+with a normal-match control and covering ~500 cancer-related genes.
+
+Downloaded data is in this path: ```/data/bbg/datasets/intogen/input/data/cancer/panels/GENIE/```
+The latest version downloaded is GENIE v.15 (syn53210170).
+Take into account that some IMPACT panel data could be duplicated in the cbioportal panel data download:
+```/data/bbg/datasets/intogen/input/data/cancer/panels/cbioportal/```.
+The IMPACT panels from cbioportal have more clinical data.
+
+**Important note**
+> This dataset has synonymous mutations filtered. It cannot be used for intOGen.
+

 ## Reference
+Mònica Sánchez Guixé
+Santi Demajo
+Abel González Perez

From 8e6d50360b507f8aa18f40ee3a85d0526d405c15 Mon Sep 17 00:00:00 2001
From: msguixe <56801139+msguixe@users.noreply.github.com>
Date: Thu, 22 Jan 2026 15:49:49 +0100
Subject: [PATCH 2/3] Change important note to a heading

Updated heading for important note section.
---
 docs/Datasets/General_datasets/GENIE.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Datasets/General_datasets/GENIE.md b/docs/Datasets/General_datasets/GENIE.md
index 7121fe13c..b754e8e47 100644
--- a/docs/Datasets/General_datasets/GENIE.md
+++ b/docs/Datasets/General_datasets/GENIE.md
@@ -18,7 +18,7 @@ Take into account that some IMPACT panel data could be duplicated in the cbiopor
 ```/data/bbg/datasets/intogen/input/data/cancer/panels/cbioportal/```.
 The IMPACT panels from cbioportal have more clinical data.

-**Important note**
+### Important note
 > This dataset has synonymous mutations filtered. It cannot be used for intOGen.



From ca02f1316a7d3814804ee95a448ed71eff203751 Mon Sep 17 00:00:00 2001
From: msguixe <56801139+msguixe@users.noreply.github.com>
Date: Thu, 22 Jan 2026 14:50:03 +0000
Subject: [PATCH 3/3] chore: fix linting issues

---
 docs/Datasets/General_datasets/GENIE.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/docs/Datasets/General_datasets/GENIE.md b/docs/Datasets/General_datasets/GENIE.md
index b754e8e47..353c7ac87 100644
--- a/docs/Datasets/General_datasets/GENIE.md
+++ b/docs/Datasets/General_datasets/GENIE.md
@@ -1,28 +1,30 @@
 # GENIE

 ## Description
+
 A dataset of many panel sequencing samples (>170 000) from many institutions in many countries.
 [Dataset website](https://genie.synapse.org/Explore/GENIE).
 Data downloaded from [sypanspe.org](synapse.org).
-[Here](https://docs.google.com/presentation/d/18MMDAOfSa3rRfApb8obhicyIU2n60aGcM9QA_2lMbao/edit?slide=id.p#slide=id.p)
+[Here](https://docs.google.com/presentation/d/18MMDAOfSa3rRfApb8obhicyIU2n60aGcM9QA_2lMbao/edit?slide=id.p#slide=id.p)
 you can find a Google Slides document with some details about the GENIE dataset.
 GENIE contains panels from MSKCC IMPACT (>80%), DFCI, UCSC, VHIO, etc.
 Many of them are without a normal-matched control.
 Some panels are on hotspots instead of the whole gene.
-Overall, the MSKCC IMPACT panel data is the most complete one,
+Overall, the MSKCC IMPACT panel data is the most complete one,
 with a normal-match control and covering ~500 cancer-related genes.

 Downloaded data is in this path: ```/data/bbg/datasets/intogen/input/data/cancer/panels/GENIE/```
 The latest version downloaded is GENIE v.15 (syn53210170).
-Take into account that some IMPACT panel data could be duplicated in the cbioportal panel data download:
+Take into account that some IMPACT panel data could be duplicated in the cbioportal panel data download:
 ```/data/bbg/datasets/intogen/input/data/cancer/panels/cbioportal/```.
-The IMPACT panels from cbioportal have more clinical data.
+The IMPACT panels from cbioportal have more clinical data.

 ### Important note
+>
 > This dataset has synonymous mutations filtered. It cannot be used for intOGen.
-

 ## Reference
+
 Mònica Sánchez Guixé
 Santi Demajo
 Abel González Perez