Data Carpentry Workshop Remix Jan 15, 2020

Setup

Instructors talk about the Data Carpentry Intro before continuing the lesson below.

Note that this lesson is adapted from the Data Carpentry Ecology Lesson. You can find a lot more information there. Text that has been copied from this lesson is highlighted in blue.

#install.packages("tidyverse")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Get the Data

The data used for this less is all about Australia, including its climate over time and recent fires. To learn more about Australia’s devastating fires check out this NY Times article. The data and information comes from Tidy Tuesday, a weekly data project aimed at the R ecosystem. You can find more datasets at https://github.com/rfordatascience/tidytuesday .

The data are available online so the first thing we’ll do is load them into R. To load data you’ll need to use a function.

Functions are “canned scripts” that automate more complicated sets of commands including operations assignments, etc. Many functions are predefined, or can be made available by importing R packages. A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function.

Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it. You should already have installed the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr, dplyr, ggplot2, tibble, etc.

We’ll read in our data using the read_csv() function, from the tidyverse package readr. We assign each dataset to a variable so we can reuse it. The name of the variable is on the left side of the arrow. The action we’re doing with the function is on the right side.

rainfall <- read_csv('https://tinyurl.com/Oz-fire-rain')

## Parsed with column specification:
## cols(
##   station_code = col_character(),
##   city_name = col_character(),
##   year = col_double(),
##   month = col_character(),
##   day = col_character(),
##   rainfall = col_double(),
##   period = col_double(),
##   quality = col_character(),
##   lat = col_double(),
##   long = col_double(),
##   station_name = col_character()
## )

temperature <- read_csv('https://tinyurl.com/Oz-fire-temp')

## Parsed with column specification:
## cols(
##   city_name = col_character(),
##   date = col_date(format = ""),
##   temperature = col_double(),
##   temp_type = col_character(),
##   site_name = col_character()
## )

You can see some information about the data we have just loaded. The name of each column is shown along with the type of data in that column. The data are stored in a format we call a data frame.

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by the functions such as read_csv(); in other words, when importing spreadsheets from your hard drive (or the web). A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).

You will see the message Parsed with column specification, followed by each column name and its data type. When you execute read_csv on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R. For example, in this dataset, read_csv reads columns as col_double (a numeric data type), and as col_character. You have the option to specify the data type for a column manually by using the col_types argument in read_csv.

If you want to inspect your data frame there are several functions to do so. head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

Size:
- dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(surveys) - returns the number of rows
- ncol(surveys) - returns the number of columns
Content:
- head(surveys) - shows the first 6 rows
- tail(surveys) - shows the last 6 rows
Names:
- names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
- rownames(surveys) - returns the row names
Summary:
- str(surveys) - structure of the object and information about the class, length and content of each column
- summary(surveys) - summary statistics for each column

For more details on this dataset see the Tidy Tuesday site.

This statement doesn’t produce any output because assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name:rainfall.

rainfall

## # A tibble: 179,273 x 11
##    station_code city_name  year month day   rainfall period quality   lat  long
##    <chr>        <chr>     <dbl> <chr> <chr>    <dbl>  <dbl> <chr>   <dbl> <dbl>
##  1 009151       Perth      1967 01    01          NA     NA <NA>    -32.0  116.
##  2 009151       Perth      1967 01    02          NA     NA <NA>    -32.0  116.
##  3 009151       Perth      1967 01    03          NA     NA <NA>    -32.0  116.
##  4 009151       Perth      1967 01    04          NA     NA <NA>    -32.0  116.
##  5 009151       Perth      1967 01    05          NA     NA <NA>    -32.0  116.
##  6 009151       Perth      1967 01    06          NA     NA <NA>    -32.0  116.
##  7 009151       Perth      1967 01    07          NA     NA <NA>    -32.0  116.
##  8 009151       Perth      1967 01    08          NA     NA <NA>    -32.0  116.
##  9 009151       Perth      1967 01    09          NA     NA <NA>    -32.0  116.
## 10 009151       Perth      1967 01    10          NA     NA <NA>    -32.0  116.
## # … with 179,263 more rows, and 1 more variable: station_name <chr>

You can also view the data in a separate window by clicking on its name in the Global Environment window. Here you can also see some information about the data. Note how much data we have! Click the arrow next to rainfall to view the columns and type of data in each column. For more information about this dataset (i.e. the metadata) see https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-07/readme.md

Note that we’ve read our data from a website, but you can read local files as well. These files are in csv format, which means plain text where columns are separated by commas. This is a very simple format that avoids all the complexity of Excel. If you need to read an Excel file you can either export as a csv or use a different function to read the data.

For more information on data frames see the Starting with Data section of the Data Carpentry lesson.

Basic data exploration

One handy way to look at your data is to get a summary table. Let’s do this for the temperature dataset.

summary(temperature)

##   city_name              date             temperature     temp_type        
##  Length:528278      Min.   :1910-01-01   Min.   :-11.5   Length:528278     
##  Class :character   1st Qu.:1940-09-06   1st Qu.: 11.1   Class :character  
##  Mode  :character   Median :1967-10-04   Median : 16.3   Mode  :character  
##                     Mean   :1966-11-09   Mean   : 16.6                     
##                     3rd Qu.:1993-08-02   3rd Qu.: 21.6                     
##                     Max.   :2019-05-31   Max.   : 48.3                     
##                                          NA's   :3465                      
##   site_name        
##  Length:528278     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

For a really basic exploratory analysis let’s look at how temperature is changing in Australian cities over time. I’ve summarized our original dataset so you can make your first plot more easily. First: load the summary data, which can be found at https://tinyurl.com/Oz-mean-temp.

#Challenge: load data
yearly_temp <- read_csv('https://tinyurl.com/Oz-mean-temp')

## Parsed with column specification:
## cols(
##   year = col_double(),
##   city_name = col_character(),
##   temperature = col_double()
## )

Now we’ll plot the temperature as a function of time.

Plotting with `ggplot2`

ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

use the ggplot() function and bind the plot to a specific data frame using the data argument

ggplot(data = yearly_temp)

define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

ggplot(yearly_temp, aes(year,temperature))

- add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:

geom_point() for scatter plots, dot plots, etc.
geom_boxplot() for, well, boxplots!
geom_line() for trend lines, time series, etc.

To add a geom to the plot use the + operator. Because we have two continuous variables, let’s use geom_point() first:

ggplot(yearly_temp, aes(year,temperature)) +
  geom_point()

Notice that you have a lot of data points for each year. This is because you have a data point on the mean temperature for each city. Let’s change our plot to show each city in a different color.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_point()

That’s a bit clearer. Let’s show the pattern over time by adding lines between the yearly points.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_point()+geom_line()

And let’s tidy this into a publication-quality plot.

ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
  geom_line() +
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw()

In a few quick commands we can already plot temperature and observe how it’s been increasing.

Notes

- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
- You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.
- The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

Manipulating data

In the prior section I gave you a summary table of temperature data. Let’s consider how you could generate this summary table and do other data manipulation given our original datasets.

Data Manipulation using `dplyr` and `tidyr`

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

The results from a base R function sometimes depend on the type of data.
Using R expressions in a non standard way, which can be confusing for new learners.
Hidden arguments, having default operations that new learners are not aware of.

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is non-trivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

We’re going to learn some of the most common dplyr functions:

select(): subset columns
filter(): subset rows on conditions
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statistics on grouped data
arrange(): sort results
count(): count discrete values

Selecting columns and filtering rows

To choose rows based on a specific criterion, use filter().

For our data let’s speculate that we may be seeing an increase in fires due to changes in maximum daily temperatures. After all, it could be spiking temperatures that allow fires to occur, and that wouldn’t be reflected well in the mean temperature for a day. In this case we need to filter our data to include only data where the column temp_type is listed as max.

temperatures_maxs <- filter(temperature,temp_type == "max")

Now consider what summary information you want. You probably want the average maximum temperature for each city for each year to look at changes over time. Let’s look at how we would calculate the average maximum temperature for one city for one year. Then we’ll be able to extend this to all cities and all years.

First filter your data to include only data from 2019 at “PERTH AIRPORT”. This is a challenge for you.

temperatures_maxs_PerthAir <- filter(temperatures_maxs,site_name == "PERTH AIRPORT")

You should have gotten stuck on how we know whether data came from 2019. That information is in the date column but you have to extract it. This is a great lesson. You first need to think about what you want your data to look like. Once you know what you want you can figure out how to communicate that to the computer.

In this case we’ll use the lubridate package to process date information.

Challenge: load the lubridate library.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

You can use the year function to extract the year from the date column. Assign that to a new column in your data frame.

temperatures_maxs_PerthAir$year <- year(temperatures_maxs_PerthAir$date)

Now you can filter for data from 2019. This is a challenge.

temperatures_maxs_PerthAir_2019 <- filter(temperatures_maxs_PerthAir,year == 2019)

Now you can calculate the mean max temperature for this site in this year using the mean function. This is a challenge.

mean(temperatures_maxs_PerthAir_2019$temperature)

## [1] 28.5649

Pipes

What if you want to do multiple filter steps at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

You probably took the intermediate step approach in your prior work.

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

temperatures_maxs$year <- year(temperatures_maxs$date)
temperatures_maxs_PerthAir_2019 <- filter(filter(temperatures_maxs,site_name == "PERTH AIRPORT"),year == 2019)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.

temperatures_maxs_PerthAir_2019 <- temperatures_maxs %>%
  filter(site_name == "PERTH AIRPORT") %>%
  filter(year == 2019)

Note that the data are sent from one function to the next so in subsequent functions you do not provide the name of the data. Make sure to put the pipe at the end of the line when using multiple lines so the processing doesn’t end prematurely.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame temperatures_max, then we filtered for rows with site_name == "PERTH AIRPORT", then we filtered for rows with year == 2019. Make sure to use the double equals sign when checking for equality. A single equals will assign data to a variable.

Split-apply-combine data analysis and the `summarize()` function

Now we know how to calculate one example, but we would like the mean max temperature for all cities for all years. To get this information we’ll generate a summary table.

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The `summarize()` function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

In this case we are interested in one result for each city for each year. That means we’ll group by these variables. Then we’ll generate a summary table. In this table we want to calculate the mean of the temperature for each group.

mean_max_temp <- temperatures_maxs %>%
  filter(!is.na(temperature)) %>%
  group_by(city_name,year) %>%
  summarize(mean_max_temp = mean(temperature))

Now you can plot. This is a challenge. Don’t forget to fix your axis labels.

ggplot(mean_max_temp, aes(year, mean_max_temp,color=city_name)) +
         geom_line()+
  labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
  theme_bw()

Note that we have lots of data but it’s a bit hard to compare in the table. That’s because when we work with data in R we typically use “long form”. That means for everyone combination of variables (year and city), we have one observation (mean temp). Humans typically like to see data in “wide form”. We could have year as rows and city as columns and the values in the middle as temperature. To convert from long form to wide form we can use the spread function.

mean_max_temp_wide <- mean_max_temp %>%
  spread(key = city_name,value = mean_max_temp)

Click on the data in the Environment to see it.

You can save your new human-readable data.

write_csv(mean_max_temp_wide, path = "data/mean_max_temp_wide.csv")

If you get data in wide form and need to convert it to long form to use in R use the gather function. This is a bit more complicated as you need to specify the columns you want to gather, the name of the column where those column names go, and the name of the column where all the values go.

mean_max_temp_widetolong <- mean_max_temp_wide %>%
  gather(-year, key = city_name,value = mean_max_temp)

mean_max_temp_widetolong <- mean_max_temp_wide %>%
  gather(2:8, key = city_name,value = mean_max_temp)

mean_max_temp_widetolong <- mean_max_temp_wide %>%
  gather(BRISBANE:SYDNEY, key = city_name,value = mean_max_temp)

More data manipulation

Another analysis we might be interested in is whether temperatures for individual months are increasing.

First let’s extract the month from the date information. This time we’ll do it a bit differently

temperature <- temperature %>%
  mutate(year = year(date), month = month(date))

Challenge: Examine how the temperature has increased over time in Canberra. Filter for data from Canberra. Get the average max temperature in each month for each year. Spread year to be viewable.

monthly_temps_CAN <- temperature %>%
  filter(city_name == "CANBERRA") %>%
  filter(temp_type == "max") %>%
  group_by(month, year) %>%
  summarize(avg_temp = mean(temperature, na.rm = T)) %>%
  spread(year, avg_temp)
head(monthly_temps_CAN)

## # A tibble: 6 x 108
## # Groups:   month [6]
##   month `1913` `1914` `1915` `1916` `1917` `1918` `1919` `1920` `1921` `1922`
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1   NA     29.1   27.1   28.8   27.0   23.9   30.1   23.3   25.4    NaN
## 2     2   NA     29.7   30.0   26.9   22.8   24.6   29.4   27.3   25.8    NaN
## 3     3   NA     25.6   26.2   23.7   23.3   23.4   24.2   22.4   21.8    NaN
## 4     4   18.2   19.7   19.9   18.2   18.1   18.7   22.2   18.9   19.1    NaN
## 5     5   13.4   15.3   13.9   15.8   12.9   15.7   16.1   15.5   15.4    NaN
## 6     6   10.4   13.0   11.5   10.2   11.5   11.2   13.0   11.2   12.8    NaN
## # … with 97 more variables: `1923` <dbl>, `1924` <dbl>, `1925` <dbl>,
## #   `1926` <dbl>, `1927` <dbl>, `1928` <dbl>, `1929` <dbl>, `1930` <dbl>,
## #   `1931` <dbl>, `1932` <dbl>, `1933` <dbl>, `1934` <dbl>, `1935` <dbl>,
## #   `1936` <dbl>, `1937` <dbl>, `1938` <dbl>, `1939` <dbl>, `1940` <dbl>,
## #   `1941` <dbl>, `1942` <dbl>, `1943` <dbl>, `1944` <dbl>, `1945` <dbl>,
## #   `1946` <dbl>, `1947` <dbl>, `1948` <dbl>, `1949` <dbl>, `1950` <dbl>,
## #   `1951` <dbl>, `1952` <dbl>, `1953` <dbl>, `1954` <dbl>, `1955` <dbl>,
## #   `1956` <dbl>, `1957` <dbl>, `1958` <dbl>, `1959` <dbl>, `1960` <dbl>,
## #   `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
## #   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
## #   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
## #   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
## #   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
## #   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
## #   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
## #   `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
## #   `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, `2018` <dbl>, `2019` <dbl>

Now we can summarize by month and plot.

monthly_temps <- temperature %>%
  group_by(month, year) %>%
  summarize(avg_temp = mean(temperature, na.rm = T)) 

ggplot(monthly_temps, aes(month, avg_temp, group=year, color=year)) +
  geom_line() + geom_point()

Did you notice that the color is along a scale? For R, year is a continuous variable so the scale is continuous. This probably isn’t what you want. You might view year as a factor to separate the years distincly.

ggplot(monthly_temps, aes(month, avg_temp, group=year, color=factor(year))) +
  geom_line() + geom_point()

On the other hand, although you can see lighter colors mostly at the top, all those data are really messy anyway. There are really too many categories for a discrete variable. Let’s try grouping the information by before and after 1964 to make things simpler. We have already seen mutate to add a column. This time we want to add a column that designates whether the date is before or after 1964 using 0 or 1. We use the ifelse function to do this. Then we take this updated dataframe and group by both month and time period and summarize the average temperature for each group. Note na.rm=T.

monthly_temps_by_period <- temperature %>%
  mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
  group_by(month, time_period) %>%
  summarize(avg_temp = mean(temperature, na.rm = T))

Now we can plot the result. The x-axis is month (notice we’ve added the month function so this looks like a date). The y-axis is the temperature. And we have separate lines for our two time periods.

ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp, 
                                    group=time_period, col=factor(time_period))) +
  geom_line() + geom_point() +
  scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
  labs(y = "Average Temperature (°C)",
       x = "Month")

What do you observe about how monthly temperatures differ?

Save your plot!

ggsave('avg_monthly_temps.png')

We can also look at cities separately. Notice that I’ve redone my grouping for my summary table to include city. Notice the facet_wrap function here to automatically create separate plots.

monthly_temps_by_period <- temperature %>%
  mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
  group_by(month, time_period, city_name) %>%
  summarize(avg_temp = mean(temperature, na.rm = T)) 

ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp, 
                                    group=time_period, col=factor(time_period))) +
  geom_line() + geom_point() + facet_wrap(~city_name) +
  scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
  labs(y = "Average Temperature (°C)",
       x = "Month")