Review of Raymond's Code #2

klterwelp · 2025-12-18T22:16:07Z

Hello!

I've completed my review of the code that Raymond changed and of the downstream Madin processing that uses the messier datasets (e.g., anything touched by PATRIC data).

Summary of Changes

Only prepare datasets that have updates; otherwise, use the Madin2020 data.

Pasteur Data

The new Pasteur data include a "Strictly anaerobic" category, which the Madin code did not handle. My changes keep the five affected species rather than filtering them out.
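A minimal sketch of keeping the new category, assuming a hypothetical data frame `pasteur` with an `oxygen_tolerance` column and a hypothetical list of accepted values. The real column names and vocabulary in pasteur.R may differ.

```r
# Hypothetical Pasteur-style records; names and values are illustrative only.
pasteur <- data.frame(
  species = c("Species A", "Species B", "Species C", "Species D"),
  oxygen_tolerance = c("Aerobic", "Strictly anaerobic", "Anaerobic", "unknown"),
  stringsAsFactors = FALSE
)

# Accepted categories (assumed). "strictly anaerobic" is listed explicitly so
# those rows are kept instead of being dropped by the filter.
accepted <- c("aerobic", "anaerobic", "strictly anaerobic")

# Normalise case and whitespace before matching.
pasteur$oxygen_tolerance <- tolower(trimws(pasteur$oxygen_tolerance))

# Keep only rows whose category is in the accepted list.
kept <- pasteur[pasteur$oxygen_tolerance %in% accepted, ]
```

Listing the category explicitly, rather than rewriting it to "anaerobic", preserves the distinction made in the source data.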

Patric Data

I use the newest patric data. To do this, I updated the BV-BRC link.
Warning: 576 rows are corrupted in the newest PATRIC data. This is a BV-BRC issue, and I have ignored it for now.
Removed genome data from non-complete genomes instead of just plasmids.
More robust cell shape and temperature clean-up.

More robust removal of SAG/MAG data, but not full removal (i.e., the genome is required to be complete). This is worth discussing. Primarily, I think MAGs are fine for traits other than metabolism and genome length. I could see skipping the SAG/MAG removal and instead setting NA for the traits we think should be NA for MAGs/SAGs.
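The two options discussed above can be sketched as follows. This is an illustration only: the column names (`genome_name`, `genome_status`, `genome_length`, `optimum_temperature`), the name-based SAG/MAG pattern, and the toy values are all assumptions, not the logic in patric.R.

```r
# Hypothetical BV-BRC-style genome metadata; all values are made up.
pat <- data.frame(
  genome_name = c("Genome A", "Genome B SAG", "Genome C MAG", "Genome D"),
  genome_status = c("Complete", "WGS", "Complete", NA),
  genome_length = c(4600000, 1200000, 2500000, 4200000),
  optimum_temperature = c(37, NA, 30, 30),
  stringsAsFactors = FALSE
)

# Option 1: only trust genome_length for complete genomes.
# `%in%` returns FALSE for NA, so rows with a missing status are masked too.
is_complete <- pat$genome_status %in% "Complete"
pat$genome_length[!is_complete] <- NA

# Option 2: flag SAGs/MAGs (here by a hypothetical name pattern) and blank
# only the traits considered unreliable for them, keeping the rows.
is_sag_mag <- grepl("\\b(SAG|MAG)\\b", pat$genome_name,
                    ignore.case = TRUE, perl = TRUE)
unreliable_traits <- c("genome_length")
pat[is_sag_mag, unreliable_traits] <- NA
```

With this approach no rows are dropped, so other traits (such as temperature in this toy table) remain available for MAGs and SAGs.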

rrndb

Added a pre-filter that removes double-NA rows (NA for both rRNA and tRNA counts).
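A minimal sketch of such a pre-filter, assuming hypothetical column names `rrna_count` and `trna_count`. The actual rrndb columns may be named differently.

```r
# Hypothetical rrndb-style records; values are illustrative only.
rrn <- data.frame(
  species = c("Species A", "Species B", "Species C", "Species D"),
  rrna_count = c(7, NA, NA, 2),
  trna_count = c(86, 40, NA, NA),
  stringsAsFactors = FALSE
)

# A row is dropped only when BOTH counts are missing; rows with at least
# one usable count are kept.
both_missing <- is.na(rrn$rrna_count) & is.na(rrn$trna_count)
rrn_filtered <- rrn[!both_missing, ]
```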

faprotax

Enable the capture of data from species with sub-species information.
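To illustrate the effect, here is a small sketch of a word-count filter on names, assuming names are split on single spaces. The example names are chosen for illustration and are not taken from the faprotax data.

```r
# Example names with one, two, and four space-separated words.
species <- c("Escherichia",
             "Escherichia coli",
             "Bacillus subtilis subsp. spizizenii")

# Number of space-separated words in each name.
n_words <- lengths(strsplit(species, " "))

# Requiring exactly two words drops the subspecies entry.
strict <- species[n_words == 2]

# Requiring at least two words keeps it, and still drops genus-only names.
relaxed <- species[n_words >= 2]
```

Note that a word count cannot tell a subspecies or strain apart from any other longer name, so it is only a rough proxy for taxonomic rank.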

Other notes:

The GenBank and methanogen datasets and scripts do not differ from Madin's. I did not review these, even though Raymond made some changes to the READMEs.


jananiravi

Looks good -- thanks for the quick updates, Kat!

jananiravi · 2026-01-05T22:52:29Z

R/preparation/faprotax.R


 #Only include organisms with both genus and species name 
-store3 <- store2[lengths(strsplit(store2$species, " ")) == 2,]
+store3 <- store2[lengths(strsplit(store2$species, " ")) >= 2,]


Since you may have strain names, too?

klterwelp · Jan 6, 2026

Yes, just to include strain names as well! The names I saw with more than two words included strain names. It's possible this isn't the most robust way to do this.

For example, it may be more robust to connect these names to NCBI taxids and then check whether those taxids are at the species level or below.

jananiravi · Jan 7, 2026

@rlesiyon has the full taxdump file, which gives the full set of unique species (at that phyletic level) and strains with their corresponding taxIDs (parent taxIDs). Worth incorporating?
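A rough sketch of the rank-based check suggested in this thread, assuming a hypothetical lookup table `tax_lookup` that has already been parsed from a taxdump into name, taxid, and rank columns. The names and taxids below are made up, and which ranks to accept is a judgment call.

```r
# Hypothetical lookup standing in for a parsed NCBI taxdump (made-up entries).
tax_lookup <- data.frame(
  name = c("Examplea", "Examplea prima",
           "Examplea prima subsp. alpha", "Examplea prima str. X1"),
  taxid = c(900001, 900002, 900003, 900004),
  rank = c("genus", "species", "subspecies", "strain"),
  stringsAsFactors = FALSE
)

# Names as they might appear in the trait table (illustrative).
trait_names <- data.frame(
  name = c("Examplea prima", "Examplea",
           "Examplea prima str. X1", "Unmatched name here"),
  stringsAsFactors = FALSE
)

# Left join so unmatched names are retained with NA taxid and rank.
matched <- merge(trait_names, tax_lookup, by = "name", all.x = TRUE)

# Keep entries at the species level or below; NA ranks fail `%in%` and are
# dropped, which makes unmatched names easy to inspect separately.
keep_ranks <- c("species", "subspecies", "strain")
kept <- matched[matched$rank %in% keep_ranks, ]
unmatched <- matched[is.na(matched$taxid), ]
```

Compared with counting words in a name, this keeps or drops entries based on their recorded rank and also surfaces names that fail to match any taxid.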

R/preparation/pasteur.R

jananiravi · 2026-01-05T22:53:29Z

R/preparation/patric.R


 #Remove all genome data where sequencing depth < recommended
 #Clean up column from text
 pat2$sequencing_depth <- gsub("approximately|approx.|fold|ND|n.d|about|Unknown|unknown|missing|Not Applicable|not applicable|not specified|unspecified|at least|>|x|X|-|","",pat2$sequencing_depth)


The additional changes you suggested will be added later -- not for this project?

klterwelp · Jan 6, 2026

Currently, none of the output of these changes is used in my downstream code reviews in other repos.

These are changes I recommend to fix the glaring issues with the pipeline, given the PATRIC --> BV-BRC updates. It is possible these issues already existed in 2023 (since that's still 5 years after the paper).

If we wanted to be thorough, I could ask Raymond for his local copy of the BV-BRC data and the day he downloaded it. Then I could test the pipeline using that specific data and check whether these issues persist even in earlier BV-BRC data.

But yes, any additional changes that are only comments are not implemented for this updated data.

jananiravi · Jan 7, 2026

Worth chatting with @rlesiyon about this. @ninawale, thoughts?

workflow.R

klterwelp added 20 commits December 16, 2025 16:09

Add fix for BV-BRC metadata link

c70fb16

Replace strictly anaerobic with anaerobic

07a2daf

There's no reason to remove these species for being "Strictly Anaerobic". I just removed the "Strictly." This adds 5 species. I assume the new data added this category which is why Madin did not implement this change.

Update genome length to only use "Complete" genomes

ba518a1

Also filter out genomes that are NA for sequencing status for genome …

a147b9f

…length

Remove sequencing status filter for genome length

596d1ce

More robustly remove SAG and MAG

8b6c9c9

Also adds comments related to Complete genomes removed

Add comment on how many genomes this removes

187d4ef

Make cell shape filter more robust to ARRAY nonsense

20fed22

Remove o from temperatures, adjust downstream filtering

d4765bf

Also removes ( from the array filter for cell_shape

Only prepare raymond updated scripts

9954242

Make oC removal more robust for optimal_temp

17bfede

Add updated patric dataset

8fc9aef

Add step to remove NA for rRNA and tRNA gene rows

781fee2

Remove double NA rows

4b5a247

Change to at least 2 names to avoid filtering out species with sub-sp…

85a5a4f

…ecies info

Include species with sub species names

11f7180

Removed genbank, raymond did not update.

dd99cff

Remove preparing metanogen, same as Madin

62839c4

Keep strictly anaerobic rather than changing to anaerobic

ccb1ffd

Add updated CSVs

bd30774

jananiravi approved these changes Jan 5, 2026


jananiravi requested a review from rlesiyon January 7, 2026 01:00

Comment timeline: klterwelp commented on Dec 18, 2025. In each of the two inline threads, jananiravi commented on Jan 5, 2026, klterwelp replied on Jan 6, 2026 (twice in the second thread), and jananiravi followed up on Jan 7, 2026.

3 participants
