Data Manipulation
Page Editor: @allopole
To request edits to this page, open an issue, and tag @allopole
Data Manipulation is related to ‘Data Exploration’ and prepares us for data analysis. It involves ‘manipulating’ data using the available sets of variables. This is done to improve our understanding of the data in different ways.
It is often said that 80% of the time spent on data analysis goes to cleaning and preparing the data (Dasu and Johnson, 2003).
What should we be keeping in mind with respect to data manipulation? Hopefully this section will point us to links for specific data manipulation tasks that we encounter from time to time.
Hadley Wickham, developer of R packages like reshape, ggplot2, stringr, and probably others, has advocated for scientists to strive for "tidy" data. He says, "Tidy datasets are all alike but every messy dataset is messy in its own way". Here's a link to his paper on Tidy Data and here is a tutorial on data tidying. Tidy data make summarizing, data visualization, and analysis nice and friendly.
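As a rough illustration of what "tidy" means in practice, the sketch below reshapes a small, made-up table so that each row is one observation and each column is one variable. The data frame `messy` and its column names are hypothetical, and the example assumes the tidyr package (version 1.0.0 or later, for `pivot_longer()`) is installed.

```r
library(tidyr)

# Hypothetical "messy" data: the year variable is spread across column names
messy <- data.frame(
  site = c("A", "B"),
  `2019` = c(10, 7),
  `2020` = c(12, 9),
  check.names = FALSE  # keep the numeric-looking column names as-is
)

# Tidy form: one row per site-year combination, one column per variable
tidy <- pivot_longer(
  messy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "count"
)

print(tidy)
```

In the tidy version, `year` is a regular column, so it can be used directly for grouping, filtering, or as a plotting aesthetic.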
Which R packages help you achieve Tidy Data? See this analyticsvidhya article. Here's some basic info:
- magrittr -- piping functions that simplify code and make it easier to read. We highly recommend learning the pipe early in your coding career even if you don't always use it. For example, `my.data %>% my.function` is the same as `my.function(my.data)`. When you string a few functions together to summarize and then plot using ggplot, you do not necessarily have to save the changes to the data. If you do want to save the changes to the data you're manipulating, you can use `my.data %<>% my.function`, which is the same as `my.data <- my.function(my.data)`.
- dplyr -- much of what dplyr does can also be done in base R, but dplyr provides a small number of verb-like functions that you can remember and return to often without having to look up what they do each time. My favorite cheat sheet is found here, and it shows you how to use the functions on an example data set.
- tidyr -- the cheat sheet above also helps with tidyr functions but here are some more resources. The last link brings you to an R community discussion about using Base R vs the tidyverse.
- reshape, and more!
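The piping and dplyr ideas in the list above can be sketched together as follows. This is only an illustration, assuming the magrittr and dplyr packages are installed. It uses the built-in `mtcars` data set, and the names `by_cyl`, `cars`, and `kpl` are made up for the example.

```r
library(magrittr)  # provides %>% and %<>%
library(dplyr)     # provides filter(), group_by(), summarise(), mutate()

# Chain several dplyr verbs with %>% without saving intermediate objects
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep cars with more than 100 hp
  group_by(cyl) %>%                          # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n())   # one summary row per group

print(by_cyl)

# %<>% pipes the object through the function and assigns the result back,
# so the next line is equivalent to: cars <- mutate(cars, kpl = mpg * 0.425)
cars <- mtcars
cars %<>% mutate(kpl = mpg * 0.425)  # approximate miles/gallon to km/litre
```

The first chain leaves `mtcars` untouched, which is usually what you want when summarizing for a plot. The compound assignment pipe is more compact, but it overwrites `cars` in place.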
Summary statistics are the first figures used to represent nearly every dataset. They also form the foundation for more complicated computations and analyses. Thus, it's important that we can compute these statistics in a clean, reproducible, and efficient way.
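One possible way to keep summary statistics reproducible is to wrap them in a small function and reuse it, as in the base R sketch below. The helper `describe()` is hypothetical (it is not from any package), and `mtcars` is just a convenient built-in data set.

```r
# Hypothetical helper: a few common descriptive statistics, ignoring NAs
describe <- function(x) {
  c(
    n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE)
  )
}

# Statistics for a single variable
print(describe(mtcars$mpg))

# Built-in five-number summary plus the mean
print(summary(mtcars$mpg))

# Group-wise means using base R's formula interface
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)
```

Keeping the calculation in one function means the same definition is applied to every variable, and any change to it is made in one place.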
- Very basic R-bloggers post on the mean, standard deviation, and the summary function: https://www.r-bloggers.com/r-tutorial-series-summary-and-descriptive-statistics/
- Descriptive statistics for different types of data: http://rcompanion.org/handbook/C_02.html
- Packages for making nice little tables with Rmd (kable) and for outputting tables to pdf
- Data camp's article on basic statistics: https://www.statmethods.net/stats/index.html
- And on advanced statistics: https://www.statmethods.net/advstats/index.html
- Aaron King's guide to ODEs in R: https://kingaa.github.io/thid/odes/ODEs_in_R.pdf