
Conversation

@berndbohmeier
Collaborator

This is the first iteration of the summarize command. It is by no means done, but it covers the most important first steps.

UI:

  • Sample Statistics
    Answers which samples are done and which are failing; gives an idea of how much of the sequencing is already done.
  • QC Statistics
    Gives an idea of how well the sequencing is working: which experiments have good data, which have high coverage, which are contaminated, etc.
  • Prevalence
    Two ways to show prevalence, grouped by columns in the metadata file.

JasonAHendry and others added 29 commits September 16, 2025 09:24
The -e flag is not needed.
Get ready for better text
Ensure we mark duplicates and only take the best sample
Ensure we exclude samples not in master metadata
Make sample_type mandatory. We need it for summary analysis and it is
best if people just include it when creating the sample sheet. It should
not be much extra work.

We maybe want to make it optional if we can derive it from the sample
name, to be discussed.
For now just don't group them by alt alleles. We might have a different
set of alt alleles when we call the same mutation in different
experiments, as we might have a triallelic site. As we ultimately group
by amino acid change, this leads to problems. We need to discuss further
how to handle csq calling correctly for multiple changes.
This helps to only look at the sequenced samples
Green and red should be more intuitive about what they mean
It makes sense that we don't report contamination if we have low
coverage, as apparent contamination could actually just be caused by low
coverage. But if we are over the abs threshold, we can be sure it's contamination
Exclude columns with too many entries and columns which are numeric. Maybe we
want to filter more in the future, or have a way to provide a list
They are big and we don't need them for the analysis
berndbohmeier and others added 2 commits December 4, 2025 12:48
Co-authored-by: danieljbridges <51692943+danieljbridges@users.noreply.github.com>
Owner

@JasonAHendry left a comment

OK -- I had a first look over the updated code.

I think we have a lot of great stuff here but still plenty of work to do to finalise. Some of the bigger things in my view are:

  • thinking a bit more about what the 'core' summary command should provide, while allowing some flexibility for bespoke analyses for specific panels (e.g. hrp2/3 deletions)
  • reviewing how we handle mapping, and ideally providing a more generic solution that doesn't require the user to provide GIS files themselves
  • considering again how we handle behaviour across versions; e.g. what happens if we have some data processed with delve, and some with bcftools? Similarly, we should use the unfiltered VCF file if it is available
  • allowing just the dashboard to be launched if the data has been processed already

See some comments throughout, can also talk more in person!

self.df["barcode"] = [correct_barcode_format(b) for b in self.df["barcode"]]


class ExtendedMetadataTableParser(MetadataTableParser):
Owner

Previously I had sample_type optional for running realtime, and only required when wanting to run nomadic summarize, using this subclass of MetadataTableParser.

Now that we are making sample_type mandatory, I think this is unnecessary, and this code can be removed?

However, using the subclass I also checked whether there was at least one positive and negative control in the experiment. This is important for some of the QC checks that are done in summarize (e.g. checking for contamination, and in the future it will be required for hrp2/3 deletion detection).

Is this done somewhere else now? Or how will summarize behave if the user has an experiment without a positive and/or negative control?
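
For reference, a minimal sketch of the kind of control-presence check I mean is below; the sample_type labels and error messages are placeholders, since the real values depend on how the sample sheet encodes controls.

import pandas as pd

# Placeholder labels -- the real sample sheet may encode controls differently.
POSITIVE_LABELS = {"positive_control", "pos"}
NEGATIVE_LABELS = {"negative_control", "neg"}


def check_controls_present(df: pd.DataFrame) -> None:
    """Raise if the experiment has no positive or no negative control.

    Assumes the (now mandatory) sample_type column is present.
    """
    sample_types = set(df["sample_type"].str.lower())
    if not sample_types & POSITIVE_LABELS:
        raise ValueError("No positive control found; some summarize QC checks need one.")
    if not sample_types & NEGATIVE_LABELS:
        raise ValueError("No negative control found; the contamination check needs one.")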

return Settings(**data)


def get_master_columns_mapping(settings: Settings) -> dict[str, str]:
Owner

Let's discuss the best way to introduce metadata for mapping. I think if we keep this approach we should add some more docstring to this module. I am also not 100% sure util/summary.py is the best place for this mapping-related code.
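
As a rough illustration of the kind of module docstring I mean (the example mapping and wording are placeholders, not the actual settings format):

"""
Mapping-related settings and helpers for nomadic summarize.

get_master_columns_mapping(settings) returns a dict that maps column
names in the user's master metadata file to the canonical names used
internally, e.g. {"Sample ID": "sample_id"}; the keys come from the
settings file the user provides.
"""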

# for now we use the master metadata file
inventory_metadata = pd.concat([df[fixed_columns] for df in dfs])
if meta_data_path is not None and not no_master_metadata:
master_metadata = pd.read_csv(meta_data_path).rename(
Owner

I would suggest we write a specific parser or somehow encapsulate and test the loading of the master metadata file.

In general I think user input validation / testing is very important for our use case, so whenever the user provides a file to the program, it is good to have some tests around what happens if it is incorrect, etc. We do that for the nomadic realtime metadata file at present.
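
To make the suggestion concrete, here is a minimal sketch of what an encapsulated, validating loader could look like; the class name, required columns and error messages are placeholders rather than a proposal for the final API.

import pandas as pd

# Placeholder list -- the real required columns would come from the settings/mapping.
REQUIRED_COLUMNS = ["sample_id", "collection_date"]


class MasterMetadataParser:
    """Load and validate the master metadata CSV."""

    def __init__(self, csv_path: str, column_mapping: dict[str, str]):
        try:
            df = pd.read_csv(csv_path)
        except FileNotFoundError:
            raise ValueError(f"Master metadata file not found: {csv_path}")
        df = df.rename(columns=column_mapping)
        missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
        if missing:
            raise ValueError(f"Master metadata is missing required columns: {missing}")
        self.df = df

This would also give us a single place to hang tests for malformed files.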

f"{output_dir}/summary.gene-deletions.prevalence.csv", index=False
)

for col in prevalence_by:
Owner

I think we should check that the prevalence_by columns are all in the metadata further up, and provide some graceful error handling if they are not.
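
Something along these lines near the top would do; the dataframe name and error wording are just illustrative.

missing = [col for col in prevalence_by if col not in metadata.columns]
if missing:
    raise ValueError(
        f"Columns requested for prevalence grouping are not in the metadata: {missing}"
    )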

from nomadic.util.summary import Settings, get_map_settings


class SummaryDashboardBuilder(ABC):
Owner

This is completely my fault, but not sure I 100% love the amount of code duplication we get by having two separate dashboard class hierarchies. Might be quite hard to refactor though.
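
One possible direction, sketched very roughly: compose dashboards from shared panel components instead of keeping two parallel class hierarchies. All names here are made up for illustration.

from dataclasses import dataclass
from typing import Protocol


class DashboardComponent(Protocol):
    """A panel that can render its own layout and register its callbacks."""

    def layout(self): ...
    def register_callbacks(self, app) -> None: ...


@dataclass
class DashboardBuilder:
    """Assemble a dashboard from a list of components.

    The realtime and summarize dashboards would then differ only in the
    component lists they are constructed with.
    """

    components: list[DashboardComponent]

    def build(self, app):
        for component in self.components:
            component.register_callbacks(app)
        return [component.layout() for component in self.components]

Composition would keep each panel's code in one place, at the cost of some refactoring churn now.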

berndbohmeier and others added 17 commits December 5, 2025 13:41
So we don't show 123.0 instead of 123
Make the parameters used to define likely false-positive
mutations function arguments; consolidate two versions
of prevalence calculation functions.
This commit does three things:
1. Define a new class, called ExperimentOutputChecker,
that holds information about experiment outputs for
nomadic summarize.
2. Consolidate some functions that were duplicated (e.g.
find_metadata and get_metadata_csv).
3. Reorganise the util/experiment.py module.
Replace with a function and dataclass
We don't want to do this kind of fixing in nomadic; it should be done
beforehand.
We try it like this first, but imo it would be better to separate
viewing the summary into a different command, or to combine it with the
dashboard. It is a bit awkward to combine an output command (producing
the summary files) with an input command (showing files from a folder).
Put all mvp-specific settings into an object and make the dashboard work
for any panel with amplicons.