Description
When running the Intogen pipeline as a workflow in the Seqera launchpad (bbglabirb/ALP > intogen_cll_richter), the output file cohorts.tsv reports an inconsistent number of samples in the SAMPLES column.
I have checked that the parsing step BBGTOOLS:INTOGENPLUS:PARSE:GROUPBY (OpenVariant groupby) works as intended by inspecting the output file of this process: /workspace/nobackup/work/intogen/intogen-richter/work/7e/c0c7be82455ae7407208bd61ceaa82/MASSONI.parsed.tsv.gz. Counting the unique samples with pandas gives the expected count of 1069 samples.
However, I have also checked that the step BBGTOOLS:INTOGENPLUS:PREPROCESS:VARIANTS_COUNT yields the wrong sample count, as can be readily seen by executing the corresponding bash script:
variants=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | wc -l)
samples=$(zcat MASSONI.parsed.tsv.gz | tail -n+2 | cut -f1 | sort -u | wc -l)
echo "MASSONI CLL WGS $variants $samples" > MASSONI.counts
which yields the wrong count of 24 samples.
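One thing that may be worth ruling out: the script selects the sample field by position (`cut -f1`), whereas a pandas count is usually done by column name, so the two would disagree if the first column of the parsed file is not the sample identifier. This is only a hypothesis, not something confirmed above. A minimal sketch for checking it, using only the Python standard library, is to print the number of distinct values in every column and see which one gives 1069 and which gives 24. The function name `distinct_counts_per_column` is made up for illustration, and the sketch assumes the file is a gzipped TSV with a single header row.

```python
import csv
import gzip


def distinct_counts_per_column(path):
    """Return {column_name: number_of_distinct_values} for a gzipped TSV.

    Assumes the first line is a header row, which matches the
    `tail -n+2` used in the counting script.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # One set of observed values per column, in header order.
        seen = [set() for _ in header]
        for row in reader:
            for values, field in zip(seen, row):
                values.add(field)
    return {name: len(values) for name, values in zip(header, seen)}


# Example call (path as given in this report):
# for name, n in distinct_counts_per_column("MASSONI.parsed.tsv.gz").items():
#     print(name, n, sep="\t")
```

Because the returned dictionary keeps the header order, its first entry corresponds to the column that `cut -f1` reads. If that entry is 24 and another column is 1069, the counting step would need to pick the sample column by name or by its actual position.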