Summarize command #64
Conversation
The -e flag is not needed.
Get ready for better text
Ensure we mark duplicates and only take the best sample
Ensure we exclude samples not in master metadata
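As a rough illustration of the intent here, assuming a pandas DataFrame with hypothetical `sample_id` and `n_reads` columns (the real column names may differ):

```python
import pandas as pd

def take_best_samples(samples: pd.DataFrame, master: pd.DataFrame) -> pd.DataFrame:
    """Keep the best-covered duplicate per sample and drop samples that are
    not present in the master metadata. Column names are illustrative."""
    # Keep the highest-coverage row per sample_id
    best = (
        samples.sort_values("n_reads", ascending=False)
               .drop_duplicates(subset="sample_id", keep="first")
    )
    # Exclude samples that are not listed in the master metadata
    return best[best["sample_id"].isin(master["sample_id"])]
```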
Make sample_type mandatory. We need it for the summary analysis, and it is best if people just include it when creating the sample sheet; it should not be much extra work. We may want to make it optional if we can derive it from the sample name, to be discussed.
For now, just don't group them by alt alleles. We might have a different set of alt alleles when we call the same mutation in different experiments, as we might have a triallelic site. Since we ultimately group by amino acid change, this leads to problems. We need to discuss further how to handle csq calling correctly for multiple changes.
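A minimal sketch of the grouping this implies, assuming a calls DataFrame with hypothetical `gene`, `aa_change` and `sample_id` columns:

```python
import pandas as pd

def summarise_by_aa_change(calls: pd.DataFrame) -> pd.DataFrame:
    """Group annotated calls by gene and amino acid change, ignoring the
    alt-allele column, so the same mutation called against different
    alt-allele sets (e.g. at a triallelic site) collapses into one row."""
    return (
        calls.groupby(["gene", "aa_change"], as_index=False)
             .agg(n_samples=("sample_id", "nunique"))
    )
```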
This helps us look only at the sequenced samples
Green and red should make it more intuitive what they mean
It makes sense that we don't report contamination if we have low coverage, as the signal could actually just be caused by low coverage. But if we are over the absolute threshold, we can be sure it's contamination.
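A sketch of that decision rule, with purely illustrative parameter names and thresholds (not the values nomadic actually uses):

```python
def is_contaminated(
    contam_reads: int,
    total_reads: int,
    min_coverage: int = 50,        # below this, relative evidence is unreliable
    rel_threshold: float = 0.01,   # fraction of reads attributed to contamination
    abs_threshold: int = 100,      # absolute read count convincing on its own
) -> bool:
    """Return True if a sample should be flagged as contaminated.

    Thresholds and names are placeholders for illustration only.
    """
    if contam_reads >= abs_threshold:
        # Enough contaminating reads to be confident regardless of coverage
        return True
    if total_reads < min_coverage:
        # Too little coverage to distinguish contamination from noise
        return False
    return contam_reads / total_reads >= rel_threshold
```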
Exclude columns with too many entries and columns which are numeric. Maybe we want to filter more in the future, or have a way to provide a list.
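Something along these lines, assuming pandas and an illustrative `max_unique` cutoff:

```python
import pandas as pd

def candidate_grouping_columns(df: pd.DataFrame, max_unique: int = 20) -> list[str]:
    """Columns suitable for grouping: non-numeric and with a limited number
    of distinct values. The max_unique default is illustrative only."""
    keep = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # drop numeric columns
        if df[col].nunique(dropna=True) > max_unique:
            continue  # drop columns with too many distinct entries
        keep.append(col)
    return keep
```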
They are big and we don't need them for the analysis
Co-authored-by: danieljbridges <51692943+danieljbridges@users.noreply.github.com>
JasonAHendry left a comment
OK -- I had a first look over the updated code.
I think we have a lot of great stuff here but still plenty of work to do to finalise. Some of the bigger things in my view are:
- thinking a bit more about what the 'core' summary command should provide, while allowing some flexibility for bespoke analyses for specific panels (e.g. hrp2/3 deletions)
- reviewing how we handle mapping, and ideally providing a more generic solution that doesn't require the user to provide GIS files themselves
- considering again how we handle behaviour across versions, e.g. what happens if we have some data processed with delve and some with bcftools? Similarly, we should use the unfiltered VCF file if it is available
- allowing just the dashboard to be launched if the data has been processed already
See some comments throughout; we can also talk more in person!
| self.df["barcode"] = [correct_barcode_format(b) for b in self.df["barcode"]] | ||
|
|
||
|
|
||
| class ExtendedMetadataTableParser(MetadataTableParser): |
Previously I had sample_type optional for running realtime, and only required when wanting to do nomadic summarize, using this subclass of MetadataTableParser.
Now that we are making sample_type mandatory, I think this subclass is unnecessary and this code can be removed?
However, using the subclass I also checked whether there was at least one positive and one negative control in the experiment. This is important for some of the QC checks done in summarize (e.g. checking for contamination), and in the future it will be required for hrp2/3 deletion detection.
Is this done somewhere else now? Or how will summarize behave if the user has an experiment without a positive and/or negative control?
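For reference, one way such a check could look; a sketch only, and the 'positive'/'negative' labels for sample_type are assumptions rather than necessarily what nomadic uses:

```python
import pandas as pd

class MissingControlError(ValueError):
    """Raised when an experiment lacks a positive or negative control."""

def check_controls(metadata: pd.DataFrame) -> None:
    """Require at least one positive and one negative control per experiment.

    Assumes a 'sample_type' column with values including 'positive' and
    'negative'; the actual labels may differ.
    """
    types = set(metadata["sample_type"].str.lower())
    missing = {"positive", "negative"} - types
    if missing:
        raise MissingControlError(
            f"Experiment is missing control sample(s): {', '.join(sorted(missing))}"
        )
```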
return Settings(**data)
...
def get_master_columns_mapping(settings: Settings) -> dict[str, str]:
Let's discuss the best way to introduce metadata for mapping. If we keep this approach, I think we should add some more docstrings to this module. I am also not 100% sure util/summary.py is the best place for this mapping-related code.
src/nomadic/summarize/main.py (outdated)
# for now we use the master metadata file
inventory_metadata = pd.concat([df[fixed_columns] for df in dfs])
if meta_data_path is not None and not no_master_metadata:
    master_metadata = pd.read_csv(meta_data_path).rename(
I would suggest we write a specific parser or somehow encapsulate and test the loading of the master metadata file.
In general I think user input validation / testing is very important for our use case, so whenever the user is providing a file to the program, it is good to have some tests around what happens if it is incorrect &c. We do that for the nomadic realtime metadata file at present.
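Roughly what I have in mind, as a sketch only; the required columns and error messages are placeholders, and the real parser should mirror the checks we already do for the realtime metadata file:

```python
import pandas as pd

class MasterMetadataError(ValueError):
    """Raised when the master metadata file cannot be used."""

def load_master_metadata(path: str, required_columns: list[str]) -> pd.DataFrame:
    """Load and validate the master metadata CSV (illustrative checks only)."""
    try:
        df = pd.read_csv(path)
    except (FileNotFoundError, pd.errors.ParserError) as exc:
        raise MasterMetadataError(f"Could not read master metadata at {path}: {exc}")

    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise MasterMetadataError(f"Master metadata is missing columns: {missing}")
    if "sample_id" in df.columns and df["sample_id"].duplicated().any():
        raise MasterMetadataError("Master metadata contains duplicated sample_id values.")
    return df
```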
| f"{output_dir}/summary.gene-deletions.prevalence.csv", index=False | ||
| ) | ||
|
|
||
| for col in prevalence_by: |
I think we should check further up that the prevalence_by columns are all present in the metadata, and provide some graceful error handling if they are not.
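For example, something like this (names are illustrative):

```python
import pandas as pd

def validate_prevalence_by(metadata: pd.DataFrame, prevalence_by: list[str]) -> None:
    """Fail early, with a clear message, if any column requested for
    prevalence grouping is absent from the metadata."""
    missing = [col for col in prevalence_by if col not in metadata.columns]
    if missing:
        available = ", ".join(metadata.columns)
        raise ValueError(
            f"Column(s) {missing} requested for prevalence grouping were not "
            f"found in the metadata. Available columns: {available}"
        )
```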
from nomadic.util.summary import Settings, get_map_settings
...
class SummaryDashboardBuilder(ABC):
This is completely my fault, but I'm not sure I 100% love the amount of code duplication we get from having two separate dashboard class hierarchies. It might be quite hard to refactor, though.
This is because the dashboard needs some time to start
Force-pushed from 563277f to b389612
So we don't show 123.0 instead of 123
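For context, a common cause is missing values upcasting integer columns to float; a sketch of how pandas' nullable integer dtype avoids that (this may not be the exact fix applied here):

```python
import pandas as pd

df = pd.DataFrame({"n_reads": [123, None, 456]})
# The missing value upcasts the column to float, which would display 123.0;
# the nullable integer dtype keeps the values shown as integers.
df["n_reads"] = df["n_reads"].astype("Int64")
print(df["n_reads"].tolist())  # [123, <NA>, 456]
```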
Make the parameters used to define likely false-positive mutations function arguments; consolidate two versions of prevalence calculation functions.
This commit does three things:
1. Define a new class, called ExperimentOutputChecker, that holds information about experiment outputs for nomadic summarize.
2. Consolidate some functions that were duplicated (e.g. find_metadata and get_metadata_csv).
3. Reorganise the util/experiment.py module.
Replace with a function and dataclass
We don't want to do this kind of fixing in nomadic; it should be done beforehand.
We'll try it like this first, but IMO it would be better to separate viewing the summary into a different command, or to combine it with dashboard. It is a bit awkward to combine an output command (producing the summary files) with an input command (showing files from a folder).
Put all mvp-specific settings into an object and make the dashboard work for any panel with amplicons.
This is the first iteration of the summarize command. It is by no means done, but it covers the most important first steps.
UI:
Answers which samples are done and which are failing; gives an idea of how far along the sequencing is.
Gives an idea of how well the sequencing is working: which experiments have good data, which have high coverage, which are contaminated, etc.
Two ways to show prevalence, grouped by columns in the metadata file.
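A minimal sketch of what grouped prevalence could look like, assuming a per-sample table with a boolean flag column; all column names are placeholders:

```python
import pandas as pd

def prevalence_by_group(df: pd.DataFrame, group_col: str, flag_col: str) -> pd.DataFrame:
    """Prevalence of a boolean flag (e.g. a mutation being present) within
    each group of a metadata column."""
    return (
        df.groupby(group_col)[flag_col]
          .agg(n="size", n_positive="sum")
          .assign(prevalence=lambda t: t["n_positive"] / t["n"])
          .reset_index()
    )
```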