Data Manipulation

Page Editor: @allopole
To request edits to this page, open an issue, and tag @allopole

Introduction to data manipulation

Data manipulation is closely related to ‘data exploration’ and prepares us for data analysis. It involves ‘manipulating’ the available variables (for example, reshaping, subsetting, or recombining them) to improve our understanding of the data.

An often-cited estimate is that 80% of the time spent on data analysis goes to cleaning and preparing the data (Dasu and Johnson, 2003).

What should we keep in mind when manipulating data? This section collects general principles and links for specific data manipulation tasks that we encounter from time to time.

Strive for "Tidy Data"

Hadley Wickham, developer of R packages including reshape, ggplot2, and stringr, has advocated that scientists strive for "tidy" data. He says, "Tidy datasets are all alike but every messy dataset is messy in its own way". In a tidy dataset, each variable is a column, each observation is a row, and each type of observational unit is a table. Here's a link to his paper on Tidy Data and here is a tutorial on data tidying. Tidy data make summarizing, visualizing, and analyzing data much easier.
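As a rough illustration of what tidying can look like, here is a minimal sketch. The data frame `messy` and its values are made up for this example, and it assumes tidyr version 1.0.0 or later (for `pivot_longer()`); older code may use `gather()` or the reshape package instead.

```r
library(tidyr)

# Hypothetical "messy" table: one column per year, so the variable
# "year" is hidden in the column names instead of having its own column.
messy <- data.frame(
  site   = c("A", "B"),
  `2019` = c(10, 7),
  `2020` = c(12, 9),
  check.names = FALSE  # keep the year column names as written
)

# Tidy form: one row per site-year observation, one column per variable.
tidy <- pivot_longer(
  messy,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "count"
)

print(tidy)
```

The tidy version has one row for each site and year, which is the layout that ggplot2 and most summary functions expect.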

Which R packages help you achieve Tidy Data? See this analyticsvidhya article. Here's some basic info:

magrittr -- pipe operators that simplify code and make it easier to read. We highly recommend learning the pipe early in your coding career even if you don't always use it. For example, my.data %>% my.function is the same as my.function(my.data). Piping is handy when you want to string a few functions together, for example to summarize and then plot with ggplot, without saving the intermediate changes to the data. If you do want to save the changes to the data you're manipulating, my.data %<>% my.function is the same as my.data <- my.function(my.data).
dplyr -- much of what it does can also be done in base R, but dplyr provides a small number of verbs (functions) that you can remember and return to often without having to look up what they do each time. My favorite cheat sheet is found here; it shows how to use the functions on an example data set.
tidyr -- the cheat sheet above also covers tidyr functions, but here are some more resources. The last link brings you to an R community discussion about using base R vs the tidyverse.
reshape, and more!
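The pipe and a few dplyr verbs can be sketched as follows. This is only an illustration: the data frame `my.data`, its columns, and the object names are hypothetical, and the code assumes the magrittr and dplyr packages are installed.

```r
library(magrittr)
library(dplyr)

# Hypothetical example data: body mass measurements for two species.
my.data <- data.frame(
  species = c("a", "a", "b", "b", "b"),
  mass_g  = c(10, 14, 20, 22, 27)
)

# These two calls are equivalent: the pipe passes its left-hand side
# as the first argument of the function on its right.
n1 <- nrow(my.data)
n2 <- my.data %>% nrow()

# Chain dplyr verbs without saving intermediate objects.
mass.summary <- my.data %>%
  filter(mass_g > 10) %>%
  group_by(species) %>%
  summarise(mean_mass_g = mean(mass_g), n = n())

# The assignment pipe overwrites my.data with the result; it is shorthand for
# my.data <- my.data %>% mutate(mass_kg = mass_g / 1000)
my.data %<>% mutate(mass_kg = mass_g / 1000)
```

Because `%<>%` replaces the original object, it is worth using only when you are sure you no longer need the unmodified data.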

Summarize data with descriptive statistics

Summary statistics are the first figures used to represent nearly every dataset. They also form the foundation for complicated computations and analyses. Thus, it's important that we can do these statistics in a nice, reproducible, efficient way.

A very basic R-bloggers post on the mean, standard deviation, and summary function: https://www.r-bloggers.com/r-tutorial-series-summary-and-descriptive-statistics/
Descriptive statistics for different types of data: http://rcompanion.org/handbook/C_02.html
Packages for making nice little tables in R Markdown (kable, from the knitr package) and for outputting tables to PDF
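One possible way to compute reproducible summary statistics and print them as a table is sketched below. It uses the `iris` dataset that ships with R purely as a stand-in for your own data, and assumes the knitr package is installed; the object names are made up for this example.

```r
# Built-in example dataset shipped with R.
data(iris)

# Base R: quick overview of a single variable.
summary(iris$Sepal.Length)
mean(iris$Sepal.Length)
sd(iris$Sepal.Length)

# Mean and standard deviation of sepal length for each species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
sds   <- aggregate(Sepal.Length ~ Species, data = iris, FUN = sd)

stats.by.species <- data.frame(
  Species = means$Species,
  mean    = means$Sepal.Length,
  sd      = sds$Sepal.Length
)

# Render the result as a simple table (useful inside an R Markdown document).
knitr::kable(stats.by.species, digits = 2)
```

Keeping the computation in a script or Rmd file, rather than in a spreadsheet, means the table can be regenerated whenever the data change.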

Preparation for and notes on analysis

DataCamp's article on basic statistics: https://www.statmethods.net/stats/index.html
And on advanced statistics: https://www.statmethods.net/advstats/index.html
Aaron King's guide to ODEs in R: https://kingaa.github.io/thid/odes/ODEs_in_R.pdf

Lab Links

journal-club doc
google-sites lab manual
index of all Drake-lab google sites
lab-meeting--minutes doc
Contact John if you are having trouble accessing google docs or websites.
repository of public domain images
