Welcome! This is the second “data wrangling” lesson, which is about two topics: “reshaping” and “summarizing” data.
Some functions in R, whether for making graphs, running statistics, or
even just running simple calculations, need your data to be in
particular format, such as the “tidy” format that has come up a few
times. Getting data into different formats is easy with dplyr
functions.
We will go over separate() and unite(), which separate and unite values in columns based on consistent separator characters (like “_” or a space) and the pivot_wider() and pivot_longer() functions, which deal with pivoting data from columns into rows and vice versa.
Often we have groups in our data that we are interested in learning
something about. For instance, in the last activity: what was the
average weight of male vs female bison? Every observation was a bison,
and we can ask R to group those rows by a column (animal_sex) and
calculate a mean. We do this using the group_by() and summarize()
functions.
You will submit two outputs for this activity:
-
A PDF of a rendered Quarto document with all of your R code. Please create a new Quarto document (e.g. don’t use this
README.qmd), include all of the code that appears in this document, in addition to adding your own code and answers to all of the questions in the “Q#” sections. Submit this through Gradescope. -
A plot and your answers from the final “Energy production” section. Submit that content to the google slide below: https://docs.google.com/presentation/d/1AJVu0eq5-OoRGi4UlBervrqKbFHOINvAB-8u0JaFzYM/edit?usp=sharing
If you have trouble submitting as a PDF, please ask Calvin or Malin for help. If we still can’t solve it, you can submit the .qmd file instead.
A reminder: Please label the code in your final submission in two
ways: 1) denote your answers to each question using headers that
correspond to the question you’re answering and 2) thoroughly “comment”
your code: remember, this means annotating your code directly by typing
descriptions of what each line does after a #. This will help future
you!
Let’s start by reading in the relevant packages
library(tidyverse)
library(here)This first section illustrates reshaping and summarizing data using a small, fake dataset that I created to more clearly illustrate what the functions are doing (sometimes it’s hard to wrap our heads around things when the datasets are massive and largely out of view).
Let’s read in the data. This dataframe has 5 columns: a site/year combination column, an annual water temperature (Celsius) for that site/year, and population estimates for three species of whale: Gray, Humpback, and Blue. Note that the Blue whale survey didn’t begin until 2022, so there is no data in the 2021 cell.
# Read in the data
whale_popn_wide <- read_csv(here("data/whale_popn_wide.csv"))
# Take a look at it
whale_popn_wide# A tibble: 4 × 5
site_year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Monterey_2021 11 0 25 NA
2 Monterey_2022 12.5 16 26 55
3 Monterey_2023 10 29 23 63
4 Monterey_2024 9 42 21 66
Notice those slightly different quotations around the whale column
names? R doesn’t typically like anything but letters, numbers, “_”s,
and “.”s in column names, meaning certain data reading functions will
get a little angry if there are special characters, like spaces or
parentheses, in column names. The tidyverse read_csv() function gets
around this by putting the backticks (aka, grave accents, the ones
probably in the top left of your keyboard under the tilde sign) around
column names where it detects special characters in order to preserve
those special characters. The whale column names have spaces, so
whenever you reference them as a column, put those quotes around the
name:
## Weird `` column examples:
# Ex. Fetch the data within one column
whale_popn_wide$`Gray whale`[1] 0 16 29 42
# Ex. Plot Gray whale abundance vs Blue whale abundance
whale_popn_wide %>%
ggplot(aes(x = `Gray whale`,
y = `Blue whale`)) +
geom_point()The separate() function takes a handful of arguments:
-
data= the dataframe you add -
col= the column name you want to separate -
into= a vectors of the names of the new columns that you want to create -
sep= the separator between columns (can be a particular character, such as “_” or “.”, or can be a number referencing where to split)
The first thing you may notice is that the site_year column is not
very useful for us in this form; we should separate that into two
columns, one with site and one with year. Notice that in that column,
the separator that separates the site (Monterey) and the year (202X) is
an underscore (_). Let’s save this new dataframe as
whale_popn_wide.sep.
whale_popn_wide.sep <- whale_popn_wide %>%
# Separate the site_year column
separate(col = site_year,
# Separate into two columns in this case
into = c("site", "year"),
# Separate by the "_"
sep = "_"
)
# Now look at the data
whale_popn_wide.sep# A tibble: 4 × 6
site year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Monterey 2021 11 0 25 NA
2 Monterey 2022 12.5 16 26 55
3 Monterey 2023 10 29 23 63
4 Monterey 2024 9 42 21 66
Now we have two new columns, where the underscore is removed!
Note that the data reads in year as a chr, which means character -
this is because we separated a character column (site_year) into two, so
it stays as a character. Let’s add on a mutate function to change year
to a numeric column using mutate()
whale_popn_wide.sep <- whale_popn_wide %>%
# Separate the site_year column
separate(col = site_year,
# Separate into two columns in this case
into = c("site", "year"),
# Separate by the "_"
sep = "_"
) %>%
# Run the year column through the as.numeric() function and re-store that as year
mutate(year = as.numeric(year))
# Now look at the data
whale_popn_wide.sep# A tibble: 4 × 6
site year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Monterey 2021 11 0 25 NA
2 Monterey 2022 12.5 16 26 55
3 Monterey 2023 10 29 23 63
4 Monterey 2024 9 42 21 66
If we want to split simply by a position within the character string
(e.g. 4 characters from the left) and not by a specific character (e.g.
“_”), we can put a number into the sep = argument. Positive values
start at 1 at the far-left of the string; negative value start at -1 at
the far-right of the string. This is useful if we have a column with a
known, constant number of characters, each of which stands for certain
information (as you’ll see down in part 2…).
whale_popn_wide %>%
# Separate the site_year column
separate(col = site_year,
# Separate into two columns in this case
into = c("site", "year"),
# Separate by the "_"
sep = 4
)# A tibble: 4 × 6
site year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Mont erey_2021 11 0 25 NA
2 Mont erey_2022 12.5 16 26 55
3 Mont erey_2023 10 29 23 63
4 Mont erey_2024 9 42 21 66
The unite() function is simply the inverse of the separate column
The separate() function takes a handful of arguments:
-
data= the dataframe you add -
col= the name of the new column you want to unite columns into -
...= the names of the columns that you want to unite together -
sep= the separator to put between values
Let’s unite the site and year columns back together, but this time separate them by a comma and a space
whale_popn_wide.sep %>%
# Unite the site and year columns back into site_year
unite(col = site_year, # New col name is going to be site_year again
# Add in the existing columns that we want to unite together
site, year,
# Separate by a comma and space this time ", "
sep = ", "
)# A tibble: 4 × 5
site_year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Monterey, 2021 11 0 25 NA
2 Monterey, 2022 12.5 16 26 55
3 Monterey, 2023 10 29 23 63
4 Monterey, 2024 9 42 21 66
If we want to make a graph of the three whale species as colored lines or points, we need this data to be in “long” format. We need every row to be its own observation. Currently there are three observations per row: one for each species.
whale_popn_wide.sep# A tibble: 4 × 6
site year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Monterey 2021 11 0 25 NA
2 Monterey 2022 12.5 16 26 55
3 Monterey 2023 10 29 23 63
4 Monterey 2024 9 42 21 66
To get this dataset into long format, we want it to be “pivoted”, where there’s a new column where the column names go to and a second column for the data values within the cells.
The pivot_longer() function takes the arguments:
-
data= Data -
cols= Columns that we want to pivot -
names_to= What new column we want the column names to go to -
values_to= What new column we want the cell values to go to
Lets put the column names into a column called species and the values
into a column called population_size, and name the new dataframe
whale_long.
# Pivot to long format
whale_long <- whale_popn_wide.sep %>%
# Take the columns we want to pivot
pivot_longer(cols = c(`Gray whale`, `Humpback whale`, `Blue whale`),
# Put the column names into a column called "species"
names_to = "species",
# Put the cell values into a column called "population_size"
values_to = "population_size")
# Check out the data
whale_long# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Gray whale 0
2 Monterey 2021 11 Humpback whale 25
3 Monterey 2021 11 Blue whale NA
4 Monterey 2022 12.5 Gray whale 16
5 Monterey 2022 12.5 Humpback whale 26
6 Monterey 2022 12.5 Blue whale 55
7 Monterey 2023 10 Gray whale 29
8 Monterey 2023 10 Humpback whale 23
9 Monterey 2023 10 Blue whale 63
10 Monterey 2024 9 Gray whale 42
11 Monterey 2024 9 Humpback whale 21
12 Monterey 2024 9 Blue whale 66
Pay attention to how the data changes - note how when the data was wide, the 2021 values for Gray, Humpback, and Blue (0, 25, and NA) were horizontal, and when we pivot them, they are now vertical. The associated data (year, annual temp) gets duplicated to accommodate this pivot, and the column names have also been pivoted similarly so that every data point still has the associated data that it had before the pivot.
You can also specify the columns you want to pivot by using the column number rather than the column name
whale_popn_wide.sep %>%
pivot_longer(cols = c(4:6), # c(4:6) creates a vector of the numbers 4 through 6, aka 4, 5, 6
names_to = "species",
values_to = "population_size")# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Gray whale 0
2 Monterey 2021 11 Humpback whale 25
3 Monterey 2021 11 Blue whale NA
4 Monterey 2022 12.5 Gray whale 16
5 Monterey 2022 12.5 Humpback whale 26
6 Monterey 2022 12.5 Blue whale 55
7 Monterey 2023 10 Gray whale 29
8 Monterey 2023 10 Humpback whale 23
9 Monterey 2023 10 Blue whale 63
10 Monterey 2024 9 Gray whale 42
11 Monterey 2024 9 Humpback whale 21
12 Monterey 2024 9 Blue whale 66
You can also specify the columns you DO NOT want to pivot, and it will pivot all other columns. Do this by adding a minus sign in front of the c() that holds your column names:
# Pivot to long format
whale_popn_wide.sep %>%
# Specify the columns we DON'T want to pivot
pivot_longer(cols = -c(site, year, annual_temp),
# Put the column names into a column called "species"
names_to = "species",
# Put the cell values into a column called "population_size"
values_to = "population_size")# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Gray whale 0
2 Monterey 2021 11 Humpback whale 25
3 Monterey 2021 11 Blue whale NA
4 Monterey 2022 12.5 Gray whale 16
5 Monterey 2022 12.5 Humpback whale 26
6 Monterey 2022 12.5 Blue whale 55
7 Monterey 2023 10 Gray whale 29
8 Monterey 2023 10 Humpback whale 23
9 Monterey 2023 10 Blue whale 63
10 Monterey 2024 9 Gray whale 42
11 Monterey 2024 9 Humpback whale 21
12 Monterey 2024 9 Blue whale 66
# Or:
whale_popn_wide.sep %>%
# Specify we DON'T want to pivot columns 1 through 3
pivot_longer(cols = -c(1:3), # c(1:3) creates a vector of the numbers 1 through 3, aka 1,2,3
names_to = "species",
values_to = "population_size")# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Gray whale 0
2 Monterey 2021 11 Humpback whale 25
3 Monterey 2021 11 Blue whale NA
4 Monterey 2022 12.5 Gray whale 16
5 Monterey 2022 12.5 Humpback whale 26
6 Monterey 2022 12.5 Blue whale 55
7 Monterey 2023 10 Gray whale 29
8 Monterey 2023 10 Humpback whale 23
9 Monterey 2023 10 Blue whale 63
10 Monterey 2024 9 Gray whale 42
11 Monterey 2024 9 Humpback whale 21
12 Monterey 2024 9 Blue whale 66
Great! Now we can graph this data using ggplot!
whale_long %>%
ggplot(aes(x = year, y = population_size, color = species)) +
geom_point() +
geom_line()pivot_wider() is simply the inverse of pivot_longer(). We tell it:
names_from= which column to get the the values that will become column namesvalues_from= which column to get the the values that will become cell values
whale_long %>%
pivot_wider(names_from = species,
values_from = population_size)# A tibble: 4 × 6
site year annual_temp `Gray whale` `Humpback whale` `Blue whale`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Monterey 2021 11 0 25 NA
2 Monterey 2022 12.5 16 26 55
3 Monterey 2023 10 29 23 63
4 Monterey 2024 9 42 21 66
Now let’s say we want to calculate an average population size for each whale species across all survey years. Let’s arrange the dataframe by species and then year to get a look at the data for each species.
whale_long %>% arrange(species, year)# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Blue whale NA
2 Monterey 2022 12.5 Blue whale 55
3 Monterey 2023 10 Blue whale 63
4 Monterey 2024 9 Blue whale 66
5 Monterey 2021 11 Gray whale 0
6 Monterey 2022 12.5 Gray whale 16
7 Monterey 2023 10 Gray whale 29
8 Monterey 2024 9 Gray whale 42
9 Monterey 2021 11 Humpback whale 25
10 Monterey 2022 12.5 Humpback whale 26
11 Monterey 2023 10 Humpback whale 23
12 Monterey 2024 9 Humpback whale 21
We can think of this dataframe as three separate dataframes stacked on top of one another: one for Blue whales, one for Gray whales, and one for Humpback whales.
Conceptually, to get an average for each species, what we want is to
take the average of all values in the population_size column
separately for each of the three species. We can accomplish this using
the very handy functions group_by() and summarize()
group_by() tells R that you want to assign “groups” within your data.
R then knows that all observations (rows) that have a shared value in
the grouping column belong to that value’s group. Let’s try it out by
using species as a grouping variable, making each unique species a
group:
whale_long %>%
group_by(species)# A tibble: 12 × 5
# Groups: species [3]
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Gray whale 0
2 Monterey 2021 11 Humpback whale 25
3 Monterey 2021 11 Blue whale NA
4 Monterey 2022 12.5 Gray whale 16
5 Monterey 2022 12.5 Humpback whale 26
6 Monterey 2022 12.5 Blue whale 55
7 Monterey 2023 10 Gray whale 29
8 Monterey 2023 10 Humpback whale 23
9 Monterey 2023 10 Blue whale 63
10 Monterey 2024 9 Gray whale 42
11 Monterey 2024 9 Humpback whale 21
12 Monterey 2024 9 Blue whale 66
What changed? Well, in the data, nothing! But notice at the top of the
data frame, it tells you what column is being used as the group, as well
as how many unique values there are (it should say Groups: species[3],
meaning there are 3 unique values in the species column that are
grouping the data).
If you would like to group by multiple columns, just add more column
names into group_by(), separated by commas (remember this for part 2…)
group_by(col1, col2, etc)
The real functionality of group_by() comes once we use another
function to modify this grouped data. Let’s try it by using the
summarize() function.
summarize() takes a data table, typically a grouped data table, and
summarizes an existing column to create a new column based on a function
that you provide. It has a similar syntax as mutate(), where you have
a new column equal the calculation you provide:
summarize(new_column = function(existing_column))
Let’s now calculate a mean() of the population_size column, grouped
by species, and call the new column mean_pop:
whale_long %>%
group_by(species) %>%
summarize(mean_pop = mean(population_size))# A tibble: 3 × 2
species mean_pop
<chr> <dbl>
1 Blue whale NA
2 Gray whale 21.8
3 Humpback whale 23.8
The output is a dataframe with 3 rows - one for each grouping variable.
summarize() took a mean of all 4 data points in the population_size
column separately for each of the 3 species. Why is the Blue whale mean
NA, though??
Let’s look at the original data again, sorted by species:
whale_long %>% arrange(species, year)# A tibble: 12 × 5
site year annual_temp species population_size
<chr> <dbl> <dbl> <chr> <dbl>
1 Monterey 2021 11 Blue whale NA
2 Monterey 2022 12.5 Blue whale 55
3 Monterey 2023 10 Blue whale 63
4 Monterey 2024 9 Blue whale 66
5 Monterey 2021 11 Gray whale 0
6 Monterey 2022 12.5 Gray whale 16
7 Monterey 2023 10 Gray whale 29
8 Monterey 2024 9 Gray whale 42
9 Monterey 2021 11 Humpback whale 25
10 Monterey 2022 12.5 Humpback whale 26
11 Monterey 2023 10 Humpback whale 23
12 Monterey 2024 9 Humpback whale 21
If you remember back at the beginning, I said that this fictitious set
of surveys started in 2022 for the Blue whales, so there is no data in
2021. R records missing data, or data cells that don’t have values in
them, as NA. Important note: missing data is not the same as a
zero; there were 0 gray whales in 2021, but the value for Blue whales
could have been 50, it could have been 0, or it could have been 50,000:
we have no way of knowing because that data does not exist. It is very
important to understand where you have missing data in your datasets and
to deal with it appropriately. Here’s how we will deal with the missing
data:
Why did R’s mean() function calculate the mean of those 4 values,
including NA, to be equal to NA? The mean of those three values that are
present (55, 63, 66) should be ~61. R can’t take a mean of a series of
values where a value isn’t a number. NAs are common in data though; we
can get around the presence of the NA by changing an argument to the
mean() function and saying na.rm = TRUE, which tells the function to
remove any NAs before trying to do the calculation. The default is
FALSE, meaning that mean() will not remove the NAs unless you tell
it to. The same argument exists in other similar functions
(e.g. sum(), sd()). Let’s do that here:
whale_long %>%
group_by(species) %>%
summarize(mean_pop = mean(population_size, na.rm = TRUE))# A tibble: 3 × 2
species mean_pop
<chr> <dbl>
1 Blue whale 61.3
2 Gray whale 21.8
3 Humpback whale 23.8
Great! Now we have the true Blue whale population mean across the three
years. We can add additional calculated summary columns by adding a new
line separated by a comma, just like in mutate(). Let’s also calculate
the standard deviation, which will give us a rough estimate of how
variable the data is, using the sd() function:
whale_long %>%
group_by(species) %>%
summarize(mean_pop = mean(population_size, na.rm = TRUE),
sd_pop = sd(population_size, na.rm = TRUE))# A tibble: 3 × 3
species mean_pop sd_pop
<chr> <dbl> <dbl>
1 Blue whale 61.3 5.69
2 Gray whale 21.8 18.0
3 Humpback whale 23.8 2.22
Using the summary statistics and the graph from earlier, which which whale population is largest? Which is most variable?
In the era of climate change, the production of energy through the burning of fossil fuels is on everyone’s mind (well, not everyone, unfortunately…). How much of the energy that the U.S. produces is from fossil fuel sources? What about from renewable energy sources? The data for this worksheet contains information about energy production in the United States from as early as 1949 until present, and is divided up by energy source. The data was retrieved from the U.S. Energy Information Administration back in 2019 (https://www.eia.gov/totalenergy/data/annual/).
I slightly modified the dataset from its original form, but purposefully left it somewhat messy - the cleaning of data is 90% of data science anyways!
Your goal in this section is to take a dataframe from “somewhat messy” to “clean and tidy” and produce two graphs of the US’s energy production over time, separated by source of energy. This is a longer and trickier section 2 than in previous activities. Along the way you will use many of the data wrangling and visualization tools that you’ve learned in the past few lessons. The questions in the section will ask you to complete specific tasks - make sure you complete them before moving on to the next question, and ask us if you get stuck!
Read in the dataframe US_energy.wide.csv and store it as
US_energy.wide
Run code to find the unique values in the YYYYMM column. What values do you think these represent?
According to the column name, the first four digits of each value are
the year, while the last two digits are the month. Using the
separate() function, create two new columns from the YYYYMM column:
one with the year values, and one with the month values. Store the
new dataframe to use later as US_energy.sep.
Hint: Remember the two different ways you can separate a column using the “sep” argument. See the separate() help page for more info.
Are the values in your two new columns numeric or character? If
character, change them to numeric within a mutate() function before
storing the new dataframe as US_energy.sep.
Check out the unique values for “month” in this dataframe. What do you
think month 13 represents? Let’s remove it for now: Filter your
dataframe to exclude rows where the month column is 13. Store the new
dataframe as US_energy.monthly (Hint: !=)
It appears that there are two main types of production columns
represented here: There are variables for specific energy production
sources (e.g. Coal Production, Wind Energy) as well as variables for
the overall broad energy categories of
Total Renewable Energy Production, Total Fossil Fuels Production,
and Total Primary Energy Production (which is the sum of the previous
two). Since these two types of data might be confusing to display on the
same graph, let’s split the US_energy.monthly dataframe into the two
types:
Create two new dataframes based off of the US_energy.monthly
data: 1) one dataframe named specific_prod which has the data for the
specific production sources such as Coal, Crude Oil, Wind, etc, and 2) a
second dataframe named broad_prod which has the data for the three
broad production categories of: Total Renewable Energy Production,
Total Fossil Fuels Production, and Total Primary Energy Production.
Both dataframes should have the year, month, and unit columns as
well.
(Hint: select())
(remember that if there are spaces or other special characters in the column name, then you need those funky single quotations called backticks)
Now that we have made good progress cleaning up our dataframes, let’s
pivot them into Tidy format to prepare for visualizations! We will have
to repeat the pivoting for each of our new dataframes (both
specific_prod and broad_prod).
Pivot both dataframes so that the energy variable column names become
values in a new column named energy_source. Name the new value column
that values go to simply production. Name the new dataframes that you
create specific_prod.long and broad_prod.long.
Hint: Remember how to pivot by excluding
Now we have two “long” dataframes where every observation (row) is the
production of a given energy production type within a month of a given
year. This means that every group of 1) energy source (e.g. Wind) and 2)
year (e.g. 1973) has 12 values: one for each month. Let’s sum() those
values together so we can simply look at the amount of energy that each
source produces across each year.
Separately for each dataframe broad_prod.long and
specific_prod.long, group the data by both year and
energy source and sum up the values in the production column. Name
the new dataframes broad_prod_yearly and specific_prod_yearly.
Time to create some graphs: customize the graphs in 2.8 and 2.9 as you see fit in order to create a clear and concise story.
Let’s start our data visualization party by answering: how has the
United States’ broad energy production changed over time? using the
broad_prod_yearly dataframe we created. As a reminder, the energy
sources in this dataframe are Total Renewable Energy Production,
Total Fossil Fuels Production, and Total Primary Energy Production.
Create code to produce the graph and include a couple sentences interpreting what you see from the graph.
Next, answer: how have the various kinds of energy sources that the
United States produces changed over time? using the
specific_prod_yearly dataframe.
Create code to produce the graph and include a small paragraph (3-5 sentences) interpreting what you see from the graph.
Export both graphs and both interpretations to this slide deck: https://docs.google.com/presentation/d/1AJVu0eq5-OoRGi4UlBervrqKbFHOINvAB-8u0JaFzYM/edit?usp=sharing
And submit the well-labeled PDF on Gradescope. Thanks!


