Skip to content

Conversation

@klterwelp
Copy link

Hello!

I've completed my review of code that was changed by Raymond and downstream Madin processing using messier datasets (ex: anything touched by patric data).

Summary of Changes

Only prepare datasets that have updates, otherwise use Madin2020 data.

Pasteur Data

New Pasteur data included "Strictly anaerobic", which was not handled by Madin code. My changes keep these five species rather than filtering this data out.

Patric Data

I use the newest patric data. To do this, I updated the BV-BRC link.
Warning: 576 rows are corrupted with the newest patric data. This is a BV-BRC issue and I ignored it for now.
Removed genome data from non-complete genomes instead of just plasmids.
More robust cell shape and temperature clean-up.

More robust removal of SAG/MAG data but not full removal (ie: require genome to be complete). This is worth discussion. Primarily, I think MAGs are fine for non-metabolism/genome length traits. I could see skipping the SAG/MAG removal and rather set NA for traits we think should be NA for MAGs/SAGs.

rrndb

Added pre-filter for double-NA rows (NA for rRNA and tRNA counts)

faprotax

Enable the capture of data from species with sub-species information.

Other notes:

Genobank and metanogen datasets and scripts are not different from Madin. I did not review these even though Raymond made some changes to the readmes.

Copy link
Member

@jananiravi jananiravi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- thanks for the quick updates, Kat!


#Only include organisms with both genus and species name
store3 <- store2[lengths(strsplit(store2$species, " ")) == 2,]
store3 <- store2[lengths(strsplit(store2$species, " ")) >= 2,]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you may have strain names, too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just to include strain names as well! The data I saw that were >2 had data with strain names. It's possible this isn't the most robust way to do this.

For example, it may be more robust to connect these names to NCBI taxids and then check if those taxids are at the species level or below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlesiyon has the full taxdump file --> full set of unique species (at that phyletic level) and strains -- with their corresponding taxIDs (parent taxIDs). Worth incorporating?


#Remove all genome data where sequencing depth < recommended
#Clean up column from text
pat2$sequencing_depth <- gsub("approximately|approx.|fold|ND|n.d|about|Unknown|unknown|missing|Not Applicable|not applicable|not specified|unspecified|at least|>|x|X|-|","",pat2$sequencing_depth)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional changes you suggested will be added later -- not for this project?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently none of the output of these changes are used in my downstream code reviews in other repos.

These are things I recommend to fix the glaring issues with the pipeline considering patric --> bv-brc updates. It is possible these issues existed in 2023 (since that's still 5 years after the paper).

If we wanted to be thorough, I could ask Raymond for his local copy of the BV-BRC data he downloaded and the day he downloaded it. Then I could test the pipeline using that specific data and check if these issues persist even in earlier BV-BRC data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yes, any additional changes that are only comments are not implemented for this updated data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth chatting with @rlesiyon about this. @ninawale, thoughts?

@jananiravi jananiravi requested a review from rlesiyon January 7, 2026 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants