
Data Management

Eric Marty edited this page Sep 12, 2023 · 26 revisions

Page Editor: @allopole
To request edits to this page, open an issue, and tag @allopole


Introduction to data management

In our lab, we collect data for our projects from many different sources and experiments, but the way we manage and store those data should be reasonably uniform. This page is meant to help us compare our individual data management practices and record the best ones.

DataCamp's data management page is helpful and in line with suggested Drake Lab practices: https://www.statmethods.net/management/index.html

Raw data

Data may be copied from hand-written notes into Excel, scraped from online resources, or generated by simulation. Whatever the source, these are considered 'raw data' and should be backed up and saved in their rawest form. It's tempting to overwrite them with a cleaner version, but resist: keep the raw file unchanged and save cleaned data as a separate file.

If your data are from simulations, the files might be too large to manage this way. In that case, save exactly how the data were generated, including the version numbers of the R packages used (see the packrat package), and back up your source code on GitHub.
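As a rough illustration of recording how simulated data were generated, the sketch below gathers the interpreter version, platform, random seed, and package versions into one record that can be saved next to the source code. It is written in Python using only the standard library; the function name `session_record`, the seed value, and the package list are hypothetical. In R, `sessionInfo()` reports the same kind of information.

```python
import json
import platform
import sys
from importlib import metadata


def session_record(packages, seed):
    """Collect the details needed to regenerate a simulation run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the log is still written.
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }


# "numpy" is only an example; list whatever the simulation actually imports.
record = session_record(["numpy"], seed=20160501)
print(json.dumps(record, indent=2))
```

Saving this output as a small text file alongside the code, and committing both, keeps the generating conditions on record even when the simulated output itself is too large to store.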

When naming data files, follow the Google style guide. Other notes:

  • If data are originally in Excel, save them as CSV: it is easier to read into R and doesn't require proprietary software
  • If variable names are not intuitive, figure out what they represent and rename them to something you easily understand
  • If missing values are coded as something unusual, take note and consider recoding them as NA so R recognizes them
  • Give your data file a name people can easily understand, e.g., 2016-05-alaska-b.csv, and save the metadata under the same name
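The renaming and NA-recoding notes above can be sketched as follows. This is a minimal Python example using only the standard library, not a lab-mandated procedure. The column names (`sp`, `ct`), the missing-value codes in `NA_CODES`, and the helper `clean_rows` are all hypothetical. The raw rows are never modified; the cleaned rows are written out separately.

```python
import csv
import io

# Hypothetical codes; check the data's own documentation for the real ones.
NA_CODES = {"", "-999", "N/A", "."}


def clean_rows(rows, rename, na_codes=NA_CODES):
    """Return cleaned copies of rows; the input rows are left untouched."""
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            value = (value or "").strip()
            new_name = rename.get(column, column)
            new_row[new_name] = "NA" if value in na_codes else value
        cleaned.append(new_row)
    return cleaned


# In-memory stand-in for a raw CSV file exported from Excel.
raw_text = "sp,ct\nAedes,12\nCulex,-999\n"
raw_rows = list(csv.DictReader(io.StringIO(raw_text)))
tidy = clean_rows(raw_rows, rename={"sp": "species", "ct": "count"})

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(tidy[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(tidy)
print(out.getvalue())
```

When reading such a file directly in R, the `na.strings` argument of `read.csv` serves the same purpose as `NA_CODES` here.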

Sharing your dissertation research

Once a project is published or you leave the lab, save it within the Drake Lab GitHub. Before research is published, the GitHub repo should be kept private. To keep names standardized, name the project repo on the Drake Lab GitHub following these tips:

  • where relevant use surname or grant name/nickname
  • separate words by hyphens
  • use descriptive, meaningful words
  • e.g., "evans-mosq-field-study"
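The naming tips above could be applied with a small helper like the one below. This is a sketch in Python; the function `repo_name` is hypothetical, and lowercasing the words is an assumption based on the example name, not a stated rule.

```python
import re


def repo_name(*parts):
    """Join parts into a lowercase, hyphen-separated repository name."""
    words = []
    for part in parts:
        # Split on anything that is not a letter or digit.
        words.extend(w for w in re.split(r"[^A-Za-z0-9]+", part.lower()) if w)
    if not words:
        raise ValueError("a repository name needs at least one word")
    return "-".join(words)


print(repo_name("Evans", "mosq field_study"))  # prints evans-mosq-field-study
```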

Published works should be archived in a public repository with a DOI. The associated GitHub repo can then be kept private or made public, depending on the aims of the research. Rather than just making a GitHub repo public, submit the definitive, public archive of lab research and data to Figshare, Dryad, or Zenodo for permanent deposition. Doing so ensures that what you publish is cleaner and doesn’t include material you may not want everyone to see (such as rejected journal submissions or draft protocols).

Resources:

  1. Good Enough Practices in Scientific Computing
  2. Tidy Data

