
Data Manipulation

Eric Marty edited this page Sep 12, 2023 · 16 revisions

Page Editor: @allopole
To request edits to this page, open an issue, and tag @allopole


Introduction to data manipulation

Data manipulation is related to ‘Data Exploration’ and prepares us for data analysis. It involves ‘manipulating’ data using the available sets of variables, which enhances our understanding of the data in different ways.

About 80% of the time spent on data analysis goes to cleaning and preparing the data (Dasu and Johnson, 2003).

What should we be keeping in mind with respect to data manipulation? Hopefully this section will point us to links for specific data manipulation tasks that we encounter from time to time.

Strive for "Tidy Data"

Hadley Wickham, developer of R packages such as reshape, ggplot2, and stringr, has advocated for scientists to strive for "tidy" data. He says, "Tidy datasets are all alike but every messy dataset is messy in its own way." Here's a link to his paper on Tidy Data and here is a tutorial on data tidying. Tidy data make summarizing, data visualization, and analysis nice and friendly.
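As a rough illustration of what "tidying" can look like, here is a minimal sketch using tidyr's pivot_longer. The data frame and its column names (site, year, count) are made up for this example; the idea is only that each variable gets its own column and each observation its own row.

```r
library(tidyr)

# Hypothetical "messy" data: one row per site, one column per year.
messy <- data.frame(
  site   = c("A", "B"),
  `2021` = c(10, 7),
  `2022` = c(12, 9),
  check.names = FALSE  # keep the year column names as written
)

# Tidy form: one row per observation (site-year), one column per variable.
tidy <- pivot_longer(
  messy,
  cols      = c(`2021`, `2022`),
  names_to  = "year",
  values_to = "count"
)

print(tidy)
```

In the tidy version, year becomes a variable you can filter, group, or plot by, instead of being hidden in the column names.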

Which R packages help you achieve Tidy Data? See this analyticsvidhya article. Here's some basic info:

  • magrittr -- pipe operators that simplify code and make it easier to read. We highly recommend learning the pipe early in your coding career, even if you don't always use it. For example, my.data %>% my.function is the same as my.function(my.data). When you string a few functions together to summarize and then plot using ggplot, you don't necessarily have to save the intermediate changes to the data. If you do want to save the changes to the data you're manipulating, you can use the assignment pipe: my.data %<>% my.function is the same as my.data <- my.function(my.data).
  • dplyr -- much of what it does can also be done in base R, but dplyr provides a small number of verbs (functions) that you can remember and return to often without having to look up what they do each time. My favorite cheat sheet is found here, and it shows how to use the functions on an example data set.
  • tidyr -- the cheat sheet above also helps with tidyr functions but here are some more resources. The last link brings you to an R community discussion about using Base R vs the tidyverse.
  • reshape, and more!
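The pipe and a few dplyr verbs from the list above can be sketched as follows. The data frame my.data and its columns (group, value) are hypothetical and exist only for this illustration.

```r
library(magrittr)
library(dplyr)

# Hypothetical example data
my.data <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# The pipe passes the left-hand side as the first argument of the
# right-hand side, so these two calls give the same result.
piped  <- my.data %>% nrow()
nested <- nrow(my.data)

# Chain dplyr verbs without saving intermediate objects.
group.means <- my.data %>%
  filter(value > 1) %>%
  group_by(group) %>%
  summarise(mean.value = mean(value))

# The assignment pipe overwrites my.data with the result,
# i.e. my.data <- mutate(my.data, double.value = value * 2)
my.data %<>% mutate(double.value = value * 2)
```

Note that %<>% comes from magrittr itself and is not loaded by dplyr alone, which is why both packages are attached here.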

Summarize data with descriptive statistics

Summary statistics are the first figures used to represent nearly every dataset. They also form the foundation for complicated computations and analyses. Thus, it's important that we can do these statistics in a nice, reproducible, efficient way.
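One possible way to compute grouped summary statistics reproducibly is with dplyr's group_by and summarise. This is only a sketch: the measurements data frame and its columns (treatment, response) are invented for the example, and base R functions such as aggregate would work as well.

```r
library(dplyr)

# Hypothetical measurements for two treatments
measurements <- data.frame(
  treatment = c("control", "control", "control", "drug", "drug", "drug"),
  response  = c(4, 5, 6, 7, 9, 11)
)

# One row of summary statistics per treatment group
summary.table <- measurements %>%
  group_by(treatment) %>%
  summarise(
    n.obs           = n(),
    mean.response   = mean(response),
    sd.response     = sd(response),
    median.response = median(response)
  )

print(summary.table)
```

Keeping the summary in a script like this means the table can be regenerated whenever the data change, rather than being recomputed by hand.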

Preparation for and notes on analysis

Lab Links
