@afg1 afg1 commented Jan 8, 2026

This PR adds Parquet as a new export format. Parquet is a more AI-ready data format, and this is part of the work I'm doing to enable direct creation of a Hugging Face dataset from RNAcentral.

Adds a dependency on pyarrow to handle parquet writing.

I also ran into some odd behaviour with database schema reflection. It may not happen in the production environment, but the code here should get the schema directly from the database where it can, and fall back to an explicitly specified one where it can't.
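
For readers following along, the reflect-or-fall-back pattern described above might look roughly like the sketch below. This is not the code in this PR: `resolve_columns`, `write_parquet`, `FALLBACK_COLUMNS`, the column names, and the type tags are all hypothetical names chosen for illustration, and the `reflect` callable stands in for whatever schema reflection the exporter actually performs. Only the pyarrow calls (`pa.schema`, `pa.Table.from_pylist`, `pq.write_table`) are real library APIs.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical column spec: (column name, simple type tag).
ColumnSpec = List[Tuple[str, str]]

# Hypothetical explicit definition, used when reflection is not possible.
FALLBACK_COLUMNS: ColumnSpec = [
    ("id", "string"),
    ("description", "string"),
    ("length", "int64"),
]


def resolve_columns(reflect: Callable[[], ColumnSpec],
                    fallback: ColumnSpec = FALLBACK_COLUMNS) -> ColumnSpec:
    """Return the reflected columns, or the explicit fallback if reflection fails."""
    try:
        columns = reflect()
    except Exception:
        # For example, a read-only public database may refuse the
        # permissions that reflection needs.
        return list(fallback)
    # An empty reflection result is treated as a failure as well.
    return columns if columns else list(fallback)


def write_parquet(rows: Iterable[Dict], columns: ColumnSpec, path: str) -> None:
    """Write dict rows to a Parquet file using the resolved column spec."""
    # Imported lazily so schema resolution can be used without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    type_map = {"string": pa.string(), "int64": pa.int64()}
    schema = pa.schema([(name, type_map[tag]) for name, tag in columns])
    table = pa.Table.from_pylist(list(rows), schema=schema)
    pq.write_table(table, path)
```

Under these assumptions, a caller would pass its own reflection function to `resolve_columns` and hand the result to `write_parquet`. Keeping the fallback explicit means the output schema stays predictable even when the database refuses reflection.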

I tested locally with the changes from the recent webcode PR, and the export produces valid Parquet files containing the fields we want!

afg1 added 3 commits January 6, 2026 16:34
…it definition if it fails

The public db, for example, does not grant the necessary permissions, so the export will fall back to the explicit definition.