Unexpected number of samples in summary output file cohorts.tsv #76

@koszulordie

When running the IntOGen pipeline as a workflow in the Seqera Launchpad (bbglabirb/ALP > intogen_cll_richter), the output file cohorts.tsv reports an unexpected number of samples in the SAMPLES column.

I have checked that the parsing step BBGTOOLS:INTOGENPLUS:PARSE:GROUPBY (OpenVariant groupby) works as intended by inspecting the output file of this process: /workspace/nobackup/work/intogen/intogen-richter/work/7e/c0c7be82455ae7407208bd61ceaa82/MASSONI.parsed.tsv.gz. Counting the unique samples in that file with pandas gives the expected count of 1069.
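The layout of the parsed file can also be inspected from the shell, which helps when a later step selects fields by position. The following is a minimal sketch, not part of the pipeline: `list_columns` is a hypothetical helper name, and it only assumes that the file is gzip-compressed, tab-separated, and has a single header line.

```shell
# Hypothetical helper: print each header column with its 1-based field number.
# Assumes a gzip-compressed, tab-separated file whose first line is the header.
list_columns() {
    gzip -dc "$1" | awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) print i "\t" $i
            exit
        }
    '
}

# Example call (path shortened here; use the full work directory path):
# list_columns MASSONI.parsed.tsv.gz
```

The field number printed next to the sample column is the value a positional tool such as `cut -f` would need.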

However, the step BBGTOOLS:INTOGENPLUS:PREPROCESS:VARIANTS_COUNT yields the wrong sample count, as can be seen by executing the corresponding bash script:

variants=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | wc -l)
samples=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | cut -f1 | sort -u | wc -l)
echo "MASSONI	CLL	WGS	$variants	$samples" > MASSONI.counts

which yields the wrong count of 24 samples.
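One hypothesis, not confirmed in this report, is that the first field selected by `cut -f1` is not the sample column in the parsed file; 24 happens to equal the number of distinct human chromosomes (1 to 22, X, Y), which would fit a chromosome column coming first. A sketch of a count that looks the column up by header name instead of by position is shown below. The function name `count_unique_by_header` and the column name `SAMPLE` are assumptions; the real header should be checked first.

```shell
# Hypothetical helper: count distinct values in a column selected by header name.
# Assumes a gzip-compressed, tab-separated file whose first line is the header.
count_unique_by_header() {
    gzip -dc "$1" | awk -F'\t' -v col="$2" '
        NR == 1 {
            # Find the 1-based index of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Count each value only the first time it is seen.
        !seen[$idx]++ { n++ }
        END { if (idx) print n + 0 }
    '
}

# Example call; SAMPLE is an assumed column name:
# samples=$(count_unique_by_header MASSONI.parsed.tsv.gz SAMPLE)
```

If this returns 1069 for the sample column and 24 for the first column, the discrepancy would come from the positional `cut -f1` rather than from the parsed data.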

Metadata

Labels

bug (Something isn't working), invalid (This doesn't seem right)
