Summarize command #64
Conversation
The -e flag is not needed.
Get ready for better text
Ensure we mark duplicates and only take the best sample
Ensure we exclude samples not in master metadata
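As a rough illustration of the intent here, assuming a pandas DataFrame with hypothetical `sample_id` and `n_reads` columns (the real column names may differ):

```python
import pandas as pd

def take_best_samples(samples: pd.DataFrame, master: pd.DataFrame) -> pd.DataFrame:
    """Keep the best-covered duplicate per sample and drop samples that are
    not present in the master metadata. Column names are illustrative."""
    # Keep the highest-coverage row per sample_id
    best = (
        samples.sort_values("n_reads", ascending=False)
               .drop_duplicates(subset="sample_id", keep="first")
    )
    # Exclude samples that are not listed in the master metadata
    return best[best["sample_id"].isin(master["sample_id"])]
```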
Make sample_type mandatory. We need it for the summary analysis, and it is best if people just include it when creating the sample sheet; it should not be much extra work. We may want to make it optional if we can derive it from the sample name, to be discussed.
For now, just don't group them by alt alleles. We might have a different set of alt alleles when we call the same mutation in different experiments, as we might have a triallelic site. Since we ultimately group by amino acid change, this leads to problems. We need to discuss further how to handle csq calling correctly for multiple changes.
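A minimal sketch of the grouping this implies, assuming a calls DataFrame with hypothetical `gene`, `aa_change` and `sample_id` columns:

```python
import pandas as pd

def summarise_by_aa_change(calls: pd.DataFrame) -> pd.DataFrame:
    """Group annotated calls by gene and amino acid change, ignoring the
    alt-allele column, so the same mutation called against different
    alt-allele sets (e.g. at a triallelic site) collapses into one row."""
    return (
        calls.groupby(["gene", "aa_change"], as_index=False)
             .agg(n_samples=("sample_id", "nunique"))
    )
```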
This helps us look only at the sequenced samples
Green and red should make it more intuitive what they mean
It makes sense that we don't report contamination if we have low coverage, as the signal could actually just be caused by low coverage. But if we are over the absolute threshold, we can be sure it's contamination.
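A sketch of that decision rule, with purely illustrative parameter names and thresholds (not the values nomadic actually uses):

```python
def is_contaminated(
    contam_reads: int,
    total_reads: int,
    min_coverage: int = 50,        # below this, relative evidence is unreliable
    rel_threshold: float = 0.01,   # fraction of reads attributed to contamination
    abs_threshold: int = 100,      # absolute read count convincing on its own
) -> bool:
    """Return True if a sample should be flagged as contaminated.

    Thresholds and names are placeholders for illustration only.
    """
    if contam_reads >= abs_threshold:
        # Enough contaminating reads to be confident regardless of coverage
        return True
    if total_reads < min_coverage:
        # Too little coverage to distinguish contamination from noise
        return False
    return contam_reads / total_reads >= rel_threshold
```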
Exclude columns with too many entries and columns which are numeric. Maybe we want to filter more in the future, or have a way to provide a list.
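Something along these lines, assuming pandas and an illustrative `max_unique` cutoff:

```python
import pandas as pd

def candidate_grouping_columns(df: pd.DataFrame, max_unique: int = 20) -> list[str]:
    """Columns suitable for grouping: non-numeric and with a limited number
    of distinct values. The max_unique default is illustrative only."""
    keep = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # drop numeric columns
        if df[col].nunique(dropna=True) > max_unique:
            continue  # drop columns with too many distinct entries
        keep.append(col)
    return keep
```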
They are big and we don't need them for the analysis
Co-authored-by: danieljbridges <51692943+danieljbridges@users.noreply.github.com>
JasonAHendry left a comment
OK -- I had a first look over the updated code.
I think we have a lot of great stuff here but still plenty of work to do to finalise. Some of the bigger things in my view are:
- thinking a bit more about what the 'core' summary command should provide, while allowing some flexibility for bespoke analyses for specific panels (e.g. hrp2/3 deletions)
- reviewing how we handle mapping, and ideally providing a more generic solution that doesn't require the user to provide GIS files themselves
- considering again how we handle behaviour across versions, e.g. what happens if we have some data processed with delve and some with bcftools? Similarly, we should use the unfiltered VCF file if it is available
- allowing just the dashboard to be launched if the data has been processed already
See some comments throughout; we can also talk more in person!
| self.df["barcode"] = [correct_barcode_format(b) for b in self.df["barcode"]] | ||
|
|
||
|
|
||
| class ExtendedMetadataTableParser(MetadataTableParser): |
Previously I had sample_type optional for running realtime, and only required when wanting to do nomadic summarize, using this subclass of MetadataTableParser.
Now that we are making sample_type mandatory, I think this subclass is unnecessary and this code can be removed?
However, using the subclass I also checked whether there was at least one positive and one negative control in the experiment. This is important for some of the QC checks done in summarize (e.g. checking for contamination), and in the future it will be required for hrp2/3 deletion detection.
Is this done somewhere else now? Or how will summarize behave if the user has an experiment without a positive and/or negative control?
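For reference, one way such a check could look; a sketch only, and the 'positive'/'negative' labels for sample_type are assumptions rather than necessarily what nomadic uses:

```python
import pandas as pd

class MissingControlError(ValueError):
    """Raised when an experiment lacks a positive or negative control."""

def check_controls(metadata: pd.DataFrame) -> None:
    """Require at least one positive and one negative control per experiment.

    Assumes a 'sample_type' column with values including 'positive' and
    'negative'; the actual labels may differ.
    """
    types = set(metadata["sample_type"].str.lower())
    missing = {"positive", "negative"} - types
    if missing:
        raise MissingControlError(
            f"Experiment is missing control sample(s): {', '.join(sorted(missing))}"
        )
```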
return Settings(**data)
...
def get_master_columns_mapping(settings: Settings) -> dict[str, str]:
Let's discuss the best way to introduce metadata for mapping. If we keep this approach, I think we should add some more docstrings to this module. I am also not 100% sure util/summary.py is the best place for this mapping-related code.
src/nomadic/summarize/main.py (outdated)
# for now we use the master metadata file
inventory_metadata = pd.concat([df[fixed_columns] for df in dfs])
if meta_data_path is not None and not no_master_metadata:
    master_metadata = pd.read_csv(meta_data_path).rename(
I would suggest we write a specific parser or somehow encapsulate and test the loading of the master metadata file.
In general I think user input validation / testing is very important for our use case, so whenever the user is providing a file to the program, it is good to have some tests around what happens if it is incorrect &c. We do that for the nomadic realtime metadata file at present.
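Roughly what I have in mind, as a sketch only; the required columns and error messages are placeholders, and the real parser should mirror the checks we already do for the realtime metadata file:

```python
import pandas as pd

class MasterMetadataError(ValueError):
    """Raised when the master metadata file cannot be used."""

def load_master_metadata(path: str, required_columns: list[str]) -> pd.DataFrame:
    """Load and validate the master metadata CSV (illustrative checks only)."""
    try:
        df = pd.read_csv(path)
    except (FileNotFoundError, pd.errors.ParserError) as exc:
        raise MasterMetadataError(f"Could not read master metadata at {path}: {exc}")

    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise MasterMetadataError(f"Master metadata is missing columns: {missing}")
    if "sample_id" in df.columns and df["sample_id"].duplicated().any():
        raise MasterMetadataError("Master metadata contains duplicated sample_id values.")
    return df
```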
| f"{output_dir}/summary.gene-deletions.prevalence.csv", index=False | ||
| ) | ||
|
|
||
| for col in prevalence_by: |
I think we should check further up that the prevalence_by columns are all present in the metadata, and provide some graceful error handling if they are not.
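For example, something like this (names are illustrative):

```python
import pandas as pd

def validate_prevalence_by(metadata: pd.DataFrame, prevalence_by: list[str]) -> None:
    """Fail early, with a clear message, if any column requested for
    prevalence grouping is absent from the metadata."""
    missing = [col for col in prevalence_by if col not in metadata.columns]
    if missing:
        available = ", ".join(metadata.columns)
        raise ValueError(
            f"Column(s) {missing} requested for prevalence grouping were not "
            f"found in the metadata. Available columns: {available}"
        )
```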
from nomadic.util.summary import Settings, get_map_settings
...
class SummaryDashboardBuilder(ABC):
This is completely my fault, but I'm not sure I 100% love the amount of code duplication we get from having two separate dashboard class hierarchies. It might be quite hard to refactor, though.
This is because the dashboard needs some time to start
Force-pushed from 563277f to b389612
So we don't show 123.0 instead of 123
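For context, a common cause is missing values upcasting integer columns to float; a sketch of how pandas' nullable integer dtype avoids that (this may not be the exact fix applied here):

```python
import pandas as pd

df = pd.DataFrame({"n_reads": [123, None, 456]})
# The missing value upcasts the column to float, which would display 123.0;
# the nullable integer dtype keeps the values shown as integers.
df["n_reads"] = df["n_reads"].astype("Int64")
print(df["n_reads"].tolist())  # [123, <NA>, 456]
```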
Make the parameters used to define likely false-positive mutations function arguments; consolidate two versions of prevalence calculation functions.
This commit does three things:
1. Define a new class, called ExperimentOutputChecker, that holds information about experiment outputs for nomadic summarize.
2. Consolidate some functions that were duplicated (e.g. find_metadata and get_metadata_csv).
3. Reorganise the util/experiment.py module.
Replace with a function and dataclass
We don't want to do this kind of fixing in nomadic; it should be done beforehand.
We'll try it like this first, but IMO it would be better to separate viewing the summary into a different command, or to combine it with dashboard. It is a bit awkward to combine an output command (producing the summary files) with an input command (showing files from a folder).
Put all mvp-specific settings into an object and make the dashboard work for any panel with amplicons.
This is the first iteration of the summarize command. It is by no means done, but it covers the most important first steps.
UI:
Answers which samples are done and which are failing; gives an idea of how far along the sequencing is.
Gives an idea of how well the sequencing is working: which experiments have good data, which have high coverage, which are contaminated, etc.
Two ways to show prevalence, grouped by columns in the metadata file.
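A minimal sketch of what grouped prevalence could look like, assuming a per-sample table with a boolean flag column; all column names are placeholders:

```python
import pandas as pd

def prevalence_by_group(df: pd.DataFrame, group_col: str, flag_col: str) -> pd.DataFrame:
    """Prevalence of a boolean flag (e.g. a mutation being present) within
    each group of a metadata column."""
    return (
        df.groupby(group_col)[flag_col]
          .agg(n="size", n_positive="sum")
          .assign(prevalence=lambda t: t["n_positive"] / t["n"])
          .reset_index()
    )
```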