@alinakbase
Collaborator

This refactor introduces a cleaner, modular, and more maintainable architecture for UniProt data parsing within the cdm_data_loader_utils package. The new design separates concerns across multiple parser components, centralizes shared identifier extraction, enhances XML utilities, and adds a comprehensive test suite to ensure long-term stability.

Key improvements include:
• Modular parser structure under cdm_data_loader_utils/parsers/
• Unified shared identifier extraction (shared_identifiers.py)
• Robust XML parsing utilities (xml_utils.py)
• Refactored UniProt parser (uniprot.py) with clearer logic paths
• Complete tests for UniProt refactor, including:
  • shared identifiers
  • XML utilities
  • UniProt entry parsing
• Cleaner directory layout aligned with CDM conventions

This refactor provides a foundation for future expansion (features, evidence, associations, and publications) while improving maintainability and reducing duplicated logic.
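
To make the separation of concerns described above concrete, here is a minimal sketch of how a small XML helper, a shared identifier extractor, and an entry parser might fit together. It is not the code in this PR: the function names (`find_all_text`, `extract_identifiers`, `extract_cross_refs`, `parse_entries`), the `UniProt:` identifier prefix, the output dictionary layout, and the sample entry are all hypothetical and chosen only for illustration. It uses only the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Default namespace used by UniProt entry XML.
NS = {"u": "http://uniprot.org/uniprot"}

# Tiny made-up entry for illustration only; not real UniProt data.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P12345</accession>
    <accession>Q99999</accession>
    <name>EXAMPLE_HUMAN</name>
    <dbReference type="PDB" id="1ABC"/>
  </entry>
</uniprot>"""


def find_all_text(elem, path):
    """XML helper (hypothetical): stripped text of every matching child."""
    return [
        child.text.strip()
        for child in elem.findall(path, NS)
        if child.text and child.text.strip()
    ]


def extract_identifiers(entry):
    """Shared identifier extraction (hypothetical output layout).

    The first accession is treated as the primary identifier.
    """
    accessions = find_all_text(entry, "u:accession")
    return [
        {"identifier": f"UniProt:{acc}", "source": "UniProt", "primary": i == 0}
        for i, acc in enumerate(accessions)
    ]


def extract_cross_refs(entry):
    """Collect (database, id) pairs from the entry's dbReference children."""
    return [
        (ref.get("type"), ref.get("id"))
        for ref in entry.findall("u:dbReference", NS)
    ]


def parse_entries(xml_text):
    """Entry parser: yields one dict per <entry>, delegating to the helpers."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("u:entry", NS):
        yield {
            "identifiers": extract_identifiers(entry),
            "cross_refs": extract_cross_refs(entry),
        }


if __name__ == "__main__":
    for parsed in parse_entries(SAMPLE):
        print(parsed)
```

Keeping the XML helper and the identifier extraction as separate functions means each can be unit-tested on a small element without building a full entry, which is the kind of test layout the list above describes.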

@ialarmedalien ialarmedalien changed the base branch from main to develop December 4, 2025 16:27
@ialarmedalien ialarmedalien changed the base branch from develop to main December 4, 2025 16:30
@ialarmedalien ialarmedalien force-pushed the uniprot-refactor-v2 branch 2 times, most recently from 3b89f65 to 2e45b47 Compare December 4, 2025 17:03
@ialarmedalien ialarmedalien changed the base branch from main to develop December 4, 2025 17:03
@ialarmedalien ialarmedalien force-pushed the uniprot-refactor-v2 branch 3 times, most recently from 2a781b3 to bba5e5a Compare December 10, 2025 22:17
@alinakbase alinakbase force-pushed the uniprot-refactor-v2 branch 3 times, most recently from a49eea1 to 8653e2c Compare December 24, 2025 01:28
if os.path.exists(tmp_path):
    try:
        os.remove(tmp_path)
    except Exception:
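
The excerpt above is cut off after the `except` clause, so its handler is not shown here. As a hedged alternative, assuming the goal is best-effort cleanup of a temporary file, the existence check and broad `except Exception` could be replaced with `Path.unlink(missing_ok=True)` (Python 3.8+) and a narrower `OSError` handler. The helper name `remove_quietly` is hypothetical and not part of this PR.

```python
import tempfile
from pathlib import Path


def remove_quietly(path) -> bool:
    """Best-effort removal of a file (hypothetical helper).

    Returns True if the file was removed or was already absent,
    False if the operating system refused the removal.
    """
    try:
        # missing_ok=True avoids the separate os.path.exists() check and the
        # race between checking for the file and removing it.
        Path(path).unlink(missing_ok=True)
    except OSError:
        # Narrower than `except Exception`: only filesystem errors are swallowed.
        return False
    return True


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        tmp_path = handle.name
    print(remove_quietly(tmp_path))  # file existed and is removed
    print(remove_quietly(tmp_path))  # file already gone, still succeeds
```

Catching only `OSError` keeps unrelated bugs (for example a `TypeError` from a bad argument) visible instead of silently discarding them.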
@codecov

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 56.22407% with 211 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.75%. Comparing base (2db2ff5) to head (a84db46).
⚠️ Report is 1 commit behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cdm_data_loader_utils/parsers/uniprot.py | 48.75% | 144 Missing ⚠️ |
| src/cdm_data_loader_utils/parsers/uniref.py | 60.00% | 54 Missing ⚠️ |
| src/cdm_data_loader_utils/parsers/xml_utils.py | 80.35% | 11 Missing ⚠️ |
| ...data_loader_utils/parsers/gene_association_file.py | 0.00% | 1 Missing ⚠️ |
| ...dm_data_loader_utils/parsers/shared_identifiers.py | 88.88% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #45      +/-   ##
===========================================
+ Coverage    51.18%   51.75%   +0.57%     
===========================================
  Files           59       61       +2     
  Lines         3048     3223     +175     
===========================================
+ Hits          1560     1668     +108     
- Misses        1488     1555      +67     

| Files with missing lines | Coverage Δ |
|---|---|
| ...data_loader_utils/parsers/gene_association_file.py | 62.06% <0.00%> (ø) |
| ...dm_data_loader_utils/parsers/shared_identifiers.py | 88.88% <88.88%> (ø) |
| src/cdm_data_loader_utils/parsers/xml_utils.py | 80.35% <80.35%> (ø) |
| src/cdm_data_loader_utils/parsers/uniref.py | 55.18% <60.00%> (+13.47%) ⬆️ |
| src/cdm_data_loader_utils/parsers/uniprot.py | 49.46% <48.75%> (-7.73%) ⬇️ |
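
The percentages in the report follow from the hit and line counts in the coverage diff. The sketch below recomputes them as a sanity check. Two things are assumptions rather than facts from the report: the patch size of 482 lines is not stated anywhere and is inferred here from 211 missing lines at 56.22407% patch coverage, and ordinary rounding is used, which may not match Codecov's own rounding rules in every case.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit; 0.0 for an empty file set."""
    if lines == 0:
        return 0.0
    return 100.0 * hits / lines


# Project totals taken from the coverage diff (base = develop, head = this PR).
base = coverage_pct(1560, 3048)
head = coverage_pct(1668, 3223)

# Per-file missing lines from the "Files with missing lines" table.
patch_missing = 144 + 54 + 11 + 1 + 1

# ASSUMPTION: total patch size inferred from the reported patch percentage;
# the report itself only gives the percentage and the missing-line count.
patch_lines = 482
patch = coverage_pct(patch_lines - patch_missing, patch_lines)

if __name__ == "__main__":
    print(f"base    {base:.2f}%")
    print(f"head    {head:.2f}%")
    print(f"delta   {head - base:+.2f}%")
    print(f"missing {patch_missing} lines")
    print(f"patch   {patch:.5f}%")
```

The small project-level gain alongside a 56% patch coverage reflects that the newly added lines are covered only slightly better than the existing code base.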

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 24cc1aa...a84db46.


alinakbase and others added 2 commits January 28, 2026 09:21
Remove .DS_Store from tests directory

Fixing path problems that were preventing module import

Removing `tests/` directory from under `tests`

UniRef updates

lint

First steps towards using external IDs for CDM entities

organize uniprot.py

revise uniprot.py and test_uniprot.py

format: apply ruff formatting to uniprot parser

change the format

format

Regenerate uv.lock

Uniprot branch file movements

Fix formatting / tests / docs

update uniref and test uniref

reformat uniprot and uniref

update uniprot.py and test

any data files used by the test should go in test/data

formatting uniprot.py

Fix UniRef parsing tests and stabilize timestamp handling

Remove large uniprot archaea test data directory

style: apply Ruff formatting to uniprot-refactor-v2

3 participants