@tristan-f-r
Contributor

@tristan-f-r commented Jul 30, 2025

Adds documentation back to the HIV dataset, and:

  • Does UniProt mapping offline, which removes a significant part of the code.
  • Drops KEGG gold standard generation, since it wasn't sufficient. Note that we now include a Prior work section in datasets/README.md, so it is never actually lost.
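
For readers unfamiliar with offline ID mapping, here is a minimal sketch of the general idea. It is not the code from this PR. It assumes a local, tab-separated mapping file in the three-column layout that UniProt uses for its downloadable `idmapping.dat` (accession, ID type, ID). The function name `load_id_mapping` and the sample rows are hypothetical and only for illustration.

```python
import csv
import io
from collections import defaultdict


def load_id_mapping(handle, id_type="Gene_Name"):
    """Build {UniProt accession: [mapped IDs]} from three-column, tab-separated rows.

    Hypothetical helper: assumes each row is (accession, ID type, ID).
    """
    mapping = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        accession, row_type, value = row
        if row_type == id_type:
            mapping[accession].append(value)
    return dict(mapping)


# Tiny in-memory stand-in for a downloaded mapping file (illustrative rows only).
SAMPLE = "P04637\tGene_Name\tTP53\nP04637\tGeneID\t7157\nP38398\tGene_Name\tBRCA1\n"
gene_names = load_id_mapping(io.StringIO(SAMPLE))
print(gene_names)
```

Reading from a local file like this avoids network calls to a mapping service, which is one plausible reason an offline approach needs less code.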

Collaborator

@agitter left a comment

I left some initial comments that mostly pertain to my different expectations for what can go in the dataset readmes. I am finding it hard to review the new script-based pipeline with respect to the original notebooks. I'm not sure if it is a pure port or a complete rewrite.

@tristan-f-r
Contributor Author

kegg_orthology.py is a port. Every other file has been substantially rewritten, especially name_mapping.py, though for the better: the entire pipeline is now just 90 lines of Python.

@tristan-f-r changed the title from "docs: hiv" to "refactor: hiv" on Jan 23, 2026
@tristan-f-r added the "dataset" label (Mutating datasets in any way.) on Jan 23, 2026
@tristan-f-r
Contributor Author

I'm going to split out the miscellaneous cache changes to make this diff easier to read.

Labels

dataset Mutating datasets in any way.

4 participants