Create function to build ARTIS .duckdb directly from KNB  #28

@theamarks

Running To-do List

  • add artis-duckdb-metadata/blob/develop/process_knb_to_duckdb.R to the exploreARTIS/develop-build-knb-duckdb branch
    • function .R scripts need to live in the ./R/ directory when developing an R package
  • Move all roxygen2 documentation decorators to the very top of the function script. It may pass checks and build when the documentation is scattered across the script, but let's stick to expected formatting conventions.
  • explicitly #' @export process_knb_to_duckdb in the roxygen2 header to mark it as the main function to export to the package NAMESPACE.
  • Reorganize the script so helper functions are listed first. Think about maintaining an "executable timeline" from top to bottom
  • declare package dependencies in the roxygen2 header instead of calling them in the script
  • add a directory argument to process_knb_to_duckdb() to replace the hard-coded download_dir <- "~/Downloads/artis_downloads"
  • We need the ability to generate a duckdb of "custom ARTIS timeseries" HS_version/year pairings. Very few end users will need all HS_versions and years. Perhaps have two functions? Or an argument along the lines of "artis_custom_timeseries = TRUE"
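
One way to picture the "custom timeseries" item is as a small helper that filters the available HS_version/year pairings before anything is loaded into duckdb. The sketch below is only an illustration: the helper name `select_artis_timeseries()`, the column names `hs_version` and `year`, and the example values are all assumptions, not part of the existing script.

```r
# Hypothetical helper (not part of the existing script): keep only the
# requested HS version / year pairings from the table of available ones.
select_artis_timeseries <- function(available, pairings = NULL) {
  # NULL means "no custom selection": return the full ARTIS time series.
  if (is.null(pairings)) {
    return(available)
  }
  keys <- c("hs_version", "year")
  stopifnot(all(keys %in% names(available)), all(keys %in% names(pairings)))
  # An inner join on both keys drops every pairing the user did not ask for.
  merge(available, unique(pairings[keys]), by = keys)
}

# Illustrative values only; the real HS versions and years come from KNB.
available <- expand.grid(
  hs_version = c("HS96", "HS02"),
  year = 2002:2004,
  stringsAsFactors = FALSE
)
wanted <- data.frame(
  hs_version = c("HS96", "HS02"),
  year = c(2002L, 2004L),
  stringsAsFactors = FALSE
)
selected <- select_artis_timeseries(available, wanted)
```

With a helper like this, a single function with an optional `pairings` argument could cover both the full and the custom build, which may be simpler than maintaining two exported functions.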

Problem

Storing the ARTIS database on KNB requires many .csv files to ensure the data is preserved in an accessible and usable format. For our end users, a duckdb will provide many benefits. However, we want to make it as easy as possible for them to set up a duckdb without introducing even more technologies. They will already face a learning curve for querying and running analyses with duckdb (trying to minimize!). We want to make the uptake of our new duckdb distribution as smooth and reproducible as possible.

Solution

Call a single function and BAM, you have the power of the new ARTIS duckdb on your local computer!

Add an exploreARTIS function that standardizes and streamlines building the ARTIS duckdb by pulling data directly from the ARTIS KNB record. Use @Anurag19101996's script as the foundation for inserting data into duckdb in a standardized way. Pulling data directly from KNB would also register every ARTIS data download through the built-in KNB metrics service!

Function arguments:

  • version: "latest" or "DOI" or "1.0.0"
  • model_run?: "FAO" or "SAU" (not sure how we will separate these in KNB. Might not be needed if a new DOI is assigned to each data version)
  • user_orcid: "https://orcid.org/0000-0002-9370-9128" (for signing into KNB)
  • other KNB credentials needed?
  • path: file path for duckdb file
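
To make the proposed arguments concrete, here is a minimal sketch of how they might be validated before any download starts. Everything in it is an assumption for discussion: the helper name `validate_artis_args()`, the default `path`, and the regular expressions used to recognise a release number, a DOI, or an ORCID URL.

```r
# Hypothetical argument check for the proposed build function.
validate_artis_args <- function(version = "latest",
                                model_run = c("FAO", "SAU"),
                                user_orcid = NULL,
                                path = "artis.duckdb") {
  # match.arg() rejects anything other than the two listed model runs.
  model_run <- match.arg(model_run)
  stopifnot(is.character(version), length(version) == 1L)

  # Classify `version` as "latest", a release number, or a DOI.
  version_type <- if (identical(version, "latest")) {
    "latest"
  } else if (grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", version)) {
    "release"
  } else if (grepl("^(doi:)?10\\.[0-9]{4,9}/[^[:space:]]+$", version)) {
    "doi"
  } else {
    stop("`version` must be \"latest\", a DOI, or a release such as \"1.0.0\".")
  }

  # ORCID iDs are four blocks of four characters; the last may be "X".
  orcid_pattern <-
    "^https://orcid\\.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$"
  if (!is.null(user_orcid) && !grepl(orcid_pattern, user_orcid)) {
    stop("`user_orcid` must look like \"https://orcid.org/0000-0000-0000-0000\".")
  }

  list(version = version, version_type = version_type,
       model_run = model_run, user_orcid = user_orcid, path = path)
}

args <- validate_artis_args(
  version = "1.0.0",
  model_run = "SAU",
  user_orcid = "https://orcid.org/0000-0002-9370-9128"
)
```

Failing early on a malformed `version` or `user_orcid` keeps the error close to the user's call instead of surfacing later as a failed KNB request.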

Questions

  • Use the DataONE API to request data OR the rdataone package?
  • Is rdataone just a R client for the DataONE API?
  • How do we specify the KNB repository through the DataONE API and/or rdataone package?
  • Is it possible to download/pull KNB data directly into duckdb without saving it locally first? Insert R function into SQL query passed to duckdb?
  • Find the persistent DOI for the ARTIS data and the versioned DOIs
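
As a starting point for these questions, the sketch below shows one possible route for a single file. It assumes the `dataone` R package (the one developed in the rdataone repository) is used to address the KNB member node as "urn:node:KNB", and that duckdb's `read_csv_auto()` loads the file. The function name `knb_csv_to_duckdb()` and its arguments are hypothetical, `pid` must be a real object identifier from the ARTIS record, and the bytes still pass through a temporary file, so this does not yet answer whether a fully in-memory load is possible.

```r
# Hypothetical sketch: fetch one CSV object from KNB and load it into duckdb.
# Requires the dataone, DBI, and duckdb packages plus network access.
knb_csv_to_duckdb <- function(pid, table_name, path) {
  # Address the KNB member node through the DataONE production environment.
  cn <- dataone::CNode("PROD")
  mn <- dataone::getMNode(cn, "urn:node:KNB")

  # getObject() returns the object's bytes; stage them in a temporary file.
  raw_csv <- dataone::getObject(mn, pid)
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_csv, tmp)

  # Open (or create) the duckdb file and let duckdb parse the CSV itself.
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  sql <- paste0(
    "CREATE OR REPLACE TABLE ", DBI::dbQuoteIdentifier(con, table_name),
    " AS SELECT * FROM read_csv_auto(", DBI::dbQuoteString(con, tmp), ")"
  )
  DBI::dbExecute(con, sql)
  invisible(path)
}

# Not run here (placeholders, not real identifiers):
# knb_csv_to_duckdb(pid = "<object pid>", table_name = "trade", path = "artis.duckdb")
```

Private or embargoed objects would additionally need an authentication token, which is not covered in this sketch.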

Relevant info and resources

The Knowledge Network for Biocomplexity (KNB) data repository is a member of the DataONE network of data repositories.

Ideas for function name

  • exploreARTIS::build_artis_with_ducks() 🦆
  • build_artis_duckdb()
  • make_artis_duckdb()
  • setup_artis_duckdb()

Metadata

Labels

  • 🏛️ Organize: file, folder, directory, architecture
  • ✋ help wanted: Extra attention is needed
  • 🪄 enhancement: New functionality or feature request

Projects

Status

🔍 Needs Review
