Use Data in R

Quick and intuitive code to turn your data inside out.

Otho Mantegazza, Dataviz for Scientists, Part 1.2

Intro to the Tidyverse

The Tidyverse is an ecosystem of packages for data science.

All the packages share a common design:

  • One function does one thing, well.
  • Designed for pipes.
  • Extensive user-friendly documentation.
  • Non-standard evaluation, to write code quickly and easily.

All packages can be loaded with library(tidyverse), but you can also load single packages one by one.
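As a minimal sketch of the two loading styles, assuming the packages are already installed:

```r
# Option 1: attach the core tidyverse packages in a single call.
library(tidyverse)

# Option 2: attach only the package you need,
# for example dplyr for data manipulation.
library(dplyr)

# Check that both are attached in the current session.
c("tidyverse", "dplyr") %in% .packages()
```

Loading single packages keeps the session lighter and makes it explicit which package each function comes from.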

Data: Palmer Penguins

Photo credits: Arturo de Frias Marques

Package and Drawings:
Allison Horst

A great dataset for teaching

The Penguins dataset stores real data about penguins observed in the Palmer Archipelago. This R data package was developed and is maintained by Allison Horst, Alison Hill and Kristen Gorman for teaching purposes.

Let’s install the package…

install.packages('palmerpenguins')

…and load it in R.

library(palmerpenguins)

Penguins

The Palmer Penguins package exports two datasets:

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Penguins

The Palmer Penguins package exports two datasets:

penguins_raw
# A tibble: 344 × 17
   studyName `Sample Number` Species         Region Island Stage `Individual ID`
   <chr>               <dbl> <chr>           <chr>  <chr>  <chr> <chr>          
 1 PAL0708                 1 Adelie Penguin… Anvers Torge… Adul… N1A1           
 2 PAL0708                 2 Adelie Penguin… Anvers Torge… Adul… N1A2           
 3 PAL0708                 3 Adelie Penguin… Anvers Torge… Adul… N2A1           
 4 PAL0708                 4 Adelie Penguin… Anvers Torge… Adul… N2A2           
 5 PAL0708                 5 Adelie Penguin… Anvers Torge… Adul… N3A1           
 6 PAL0708                 6 Adelie Penguin… Anvers Torge… Adul… N3A2           
 7 PAL0708                 7 Adelie Penguin… Anvers Torge… Adul… N4A1           
 8 PAL0708                 8 Adelie Penguin… Anvers Torge… Adul… N4A2           
 9 PAL0708                 9 Adelie Penguin… Anvers Torge… Adul… N5A1           
10 PAL0708                10 Adelie Penguin… Anvers Torge… Adul… N5A2           
# ℹ 334 more rows
# ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
#   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
#   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
#   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>

Penguins

We will use the first one: penguins, which has already been cleaned.
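A quick way to see what "cleaned" means here is to compare the shape and the column names of the two datasets. This is only a sketch. The counts in the comments are taken from the tibble headers printed above.

```r
library(palmerpenguins)

# Both datasets describe the same 344 penguins...
nrow(penguins)      # 344
nrow(penguins_raw)  # 344

# ...but the cleaned version has 8 columns with short, syntactic names,
# versus 17 columns in the raw version.
ncol(penguins)      # 8
ncol(penguins_raw)  # 17

# Compare the naming styles side by side.
names(penguins)
names(penguins_raw)
```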

Exercise

The print method for a tibble gives you a reasonable overview of the data stored in it.

Can you get more details with the package skimr?

Check its documentation, install it, and try it out on the penguins dataset. Comment on the output: is it useful, and how?

Tools: dplyr

dplyr provides a grammar for manipulating data, with many useful verbs:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • group_by() performs operations by group.

Use dplyr

penguins %>%
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_meters <dbl>

Use dplyr

penguins %>%
  select(bill_length_mm) %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 2
   bill_length_mm bill_length_meters
            <dbl>              <dbl>
 1           39.1             0.0391
 2           39.5             0.0395
 3           40.3             0.0403
 4           NA              NA     
 5           36.7             0.0367
 6           39.3             0.0393
 7           38.9             0.0389
 8           39.2             0.0392
 9           34.1             0.0341
10           42               0.042 
# ℹ 334 more rows

Use dplyr

penguins %>%
  select(species, island, bill_length_mm) %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 4
   species island    bill_length_mm bill_length_meters
   <fct>   <fct>              <dbl>              <dbl>
 1 Adelie  Torgersen           39.1             0.0391
 2 Adelie  Torgersen           39.5             0.0395
 3 Adelie  Torgersen           40.3             0.0403
 4 Adelie  Torgersen           NA              NA     
 5 Adelie  Torgersen           36.7             0.0367
 6 Adelie  Torgersen           39.3             0.0393
 7 Adelie  Torgersen           38.9             0.0389
 8 Adelie  Torgersen           39.2             0.0392
 9 Adelie  Torgersen           34.1             0.0341
10 Adelie  Torgersen           42               0.042 
# ℹ 334 more rows

Use dplyr

penguins %>%
  count(island)
# A tibble: 3 × 2
  island        n
  <fct>     <int>
1 Biscoe      168
2 Dream       124
3 Torgersen    52
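The verb count() is not in the list of verbs above. As a rough sketch, it can be read as shorthand for group_by() followed by summarise() with n(). The names by_count and by_group are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# count() groups by the given column and counts the rows in each group.
by_count <- penguins %>%
  count(island)

# The same result, spelled out with group_by() and summarise().
by_group <- penguins %>%
  group_by(island) %>%
  summarise(n = n())

# Both approaches give the same counts per island.
identical(by_count$n, by_group$n)
```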

Use dplyr

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 124 × 4
   species island bill_length_mm bill_length_meters
   <fct>   <fct>           <dbl>              <dbl>
 1 Adelie  Dream            39.5             0.0395
 2 Adelie  Dream            37.2             0.0372
 3 Adelie  Dream            39.5             0.0395
 4 Adelie  Dream            40.9             0.0409
 5 Adelie  Dream            36.4             0.0364
 6 Adelie  Dream            39.2             0.0392
 7 Adelie  Dream            38.8             0.0388
 8 Adelie  Dream            42.2             0.0422
 9 Adelie  Dream            37.6             0.0376
10 Adelie  Dream            39.8             0.0398
# ℹ 114 more rows

Use dplyr

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species)
# A tibble: 124 × 4
# Groups:   species [2]
   species island bill_length_mm bill_length_meters
   <fct>   <fct>           <dbl>              <dbl>
 1 Adelie  Dream            39.5             0.0395
 2 Adelie  Dream            37.2             0.0372
 3 Adelie  Dream            39.5             0.0395
 4 Adelie  Dream            40.9             0.0409
 5 Adelie  Dream            36.4             0.0364
 6 Adelie  Dream            39.2             0.0392
 7 Adelie  Dream            38.8             0.0388
 8 Adelie  Dream            42.2             0.0422
 9 Adelie  Dream            37.6             0.0376
10 Adelie  Dream            39.8             0.0398
# ℹ 114 more rows

Use dplyr

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species) %>% 
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))
# A tibble: 2 × 3
  species   mean_bill_length_mm sd_bill_length_mm
  <fct>                   <dbl>             <dbl>
1 Adelie                   38.5              2.47
2 Chinstrap                48.8              3.34

Use dplyr

dream_summary <- 
  penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species) %>% 
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))

Let’s assign the output to a new variable dream_summary.
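One caveat with this pattern, sketched below under the assumption that you summarise all islands rather than only Dream: mean() and sd() return NA as soon as a group contains a missing value, and row 4 of penguins (Torgersen) has NA in bill_length_mm. Passing na.rm = TRUE drops missing values before computing. The names all_islands_summary and mean_with_na are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

all_islands_summary <-
  penguins %>%
  group_by(island) %>%
  summarise(
    # Without na.rm = TRUE, one missing value makes the whole mean NA.
    mean_with_na = mean(bill_length_mm),
    # With na.rm = TRUE, missing values are dropped before averaging.
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    sd_bill_length_mm = sd(bill_length_mm, na.rm = TRUE)
  )

all_islands_summary
```

The Dream summary above did not need na.rm = TRUE because no bill length is missing for that island.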

In the previous code we have also seen two aspects that feature heavily in the Tidyverse:

  • The Pipe %>%.
  • Non-Standard Evaluation.

The Pipe %>%

The pipe is provided by the package magrittr; it is a forwarding operator:

The pipe operator takes the output of what comes before (LHS) and sends it to the first argument of the function that comes after (RHS).

LHS %>% RHS
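A minimal sketch of this rule on a plain numeric vector, before applying it to data frames. The vector x is made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9)

# These two calls are equivalent: the pipe passes x as the first argument.
sqrt(x)
x %>% sqrt()

# Any other argument stays in the call on the right-hand side.
round(3.14159, digits = 2)
3.14159 %>% round(digits = 2)
```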

The Pipe %>%

For example, you could write:

select(penguins,
       species, body_mass_g)
# A tibble: 344 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# ℹ 334 more rows

The Pipe %>%

…but if you use the pipe, your code is easier to read…

penguins %>%
  select(species, body_mass_g)
# A tibble: 344 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# ℹ 334 more rows

The Pipe %>%

…especially if you have to perform many operations one after the other…

penguins %>%
  select(species, body_mass_g) %>% 
  filter(species == 'Adelie')
# A tibble: 152 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# ℹ 142 more rows

The Pipe %>%

…that otherwise would force you to nest your code horribly.

filter(
  select(
    penguins,
    species, body_mass_g
  ),
  species == 'Adelie'
)
# A tibble: 152 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# ℹ 142 more rows

Non-Standard Evaluation

This one is difficult…

Non-Standard Evaluation

Which arguments does the function select() take?

Let’s open its help page by typing ?select at the R console, or by opening its online help page at https://dplyr.tidyverse.org/reference/select.html.

?select

Under the Usage section, the help page says:

select(.data, ...)

But what can we write in the ... argument?

Non-Standard Evaluation

In the Arguments section the help page explains:

.data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...: <tidy-select> One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.
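A few hedged examples of what <tidy-select> expressions can look like, using the penguins columns. The helper starts_with() is part of the tidy-select vocabulary that dplyr makes available.

```r
library(dplyr)
library(palmerpenguins)

# A range of adjacent columns, using names as if they were positions.
range_cols <- penguins %>%
  select(bill_length_mm:body_mass_g)

# A helper that matches columns by the start of their name.
bill_cols <- penguins %>%
  select(starts_with("bill"))

# A minus sign drops a column instead of keeping it.
without_year <- penguins %>%
  select(-year)

names(range_cols)
names(bill_cols)
names(without_year)
```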

Non-Standard Evaluation

So, what do we mean if we write:

penguins %>% 
  select(species, island)

The penguins tibble fills the .data parameter through the pipe %>%.

The unquoted names species, island fill the ... argument; they represent the names of the columns to be selected.

Non-Standard Evaluation

But the column names of a tibble are stored as a character vector.

colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Non-Standard Evaluation

Through non-standard evaluation, we can refer to elements of a character vector as if they were variables (without quoting them).

penguins %>% 
  select(species, island)

This works even though the variables species and island don’t exist outside of the dplyr function select().

species
Error in eval(expr, envir, enclos): object 'species' not found
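When the column names are already stored as strings, for example in a vector built from colnames(), the tidy-select helper all_of() lets select() use them. The sketch below assumes a vector called wanted_columns, a name chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Column names stored as ordinary strings in a character vector.
wanted_columns <- c("species", "island")

# all_of() tells select() to treat the vector's contents as column names.
from_strings <- penguins %>%
  select(all_of(wanted_columns))

# The unquoted version, relying on non-standard evaluation.
from_names <- penguins %>%
  select(species, island)

# Both approaches select the same columns.
identical(names(from_strings), names(from_names))
```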

Non-Standard Evaluation

With non-standard evaluation we can write names without quoting them. This makes writing code for iterative data exploration faster.

If you come from a stricter programming language, it could be hard to get used to this behavior.

Most functions of the Tidyverse use non-standard evaluation.

Exercise

With the penguins dataset:

  • Select all numeric variables (columns).

  • Convert all variables that are expressed in millimeters into meters, and rename them accordingly.

Get help from:

Exercise

How many penguins have bill_length_mm above average?

Repeat the analysis for each species.