> Quick and intuitive code to turn your data inside out.
Otho Mantegazza, Dataviz for Scientists, Part 1.2
The Tidyverse is an ecosystem of packages for Data Science.
All the packages share a common design:
The core packages can be loaded all at once with library(tidyverse), but you can also load single packages one by one.
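A minimal sketch of the two loading options, assuming the packages are already installed. The choice of dplyr and ggplot2 for the second option is only an illustration:

```r
# Option 1: attach the core tidyverse packages in a single call
library(tidyverse)

# Option 2: attach only the packages you need, one by one
library(dplyr)
library(ggplot2)
```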
Data: Palmer Penguins
Photo credits: Arturo de Frias Marques
Package and drawings: Allison Horst
The Palmer Penguins dataset stores real data about penguins of the Palmer Archipelago (Antarctica). This R data package was developed and is maintained by Allison Horst, Alison Hill and Kristen Gorman for teaching purposes.
Let’s install the package…
…and load it in R.
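The two steps above could look like this. The CRAN package name is palmerpenguins; the requireNamespace() guard is an addition of this sketch, so that the package is not reinstalled on every run:

```r
# Install from CRAN only if the package is not available yet
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins")
}

# Attach the package so that its datasets can be used by name
library(palmerpenguins)
```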
The Palmer Penguins package exports two datasets, penguins and penguins_raw. The first one is penguins:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The second dataset is penguins_raw:
# A tibble: 344 × 17
studyName `Sample Number` Species Region Island Stage `Individual ID`
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 PAL0708 1 Adelie Penguin… Anvers Torge… Adul… N1A1
2 PAL0708 2 Adelie Penguin… Anvers Torge… Adul… N1A2
3 PAL0708 3 Adelie Penguin… Anvers Torge… Adul… N2A1
4 PAL0708 4 Adelie Penguin… Anvers Torge… Adul… N2A2
5 PAL0708 5 Adelie Penguin… Anvers Torge… Adul… N3A1
6 PAL0708 6 Adelie Penguin… Anvers Torge… Adul… N3A2
7 PAL0708 7 Adelie Penguin… Anvers Torge… Adul… N4A1
8 PAL0708 8 Adelie Penguin… Anvers Torge… Adul… N4A2
9 PAL0708 9 Adelie Penguin… Anvers Torge… Adul… N5A1
10 PAL0708 10 Adelie Penguin… Anvers Torge… Adul… N5A2
# ℹ 334 more rows
# ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
# `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
# `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
# `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
We will use the first one: penguins, which has already been cleaned.
The print method for a tibble gives you a reasonable overview of the data stored in it.
Can you get more details with the package skimr?
Check its documentation, install it, try it out on the penguins dataset. Comment on the output: is it useful, how?
Dplyr provides a grammar for manipulating data, with many useful verbs:
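The output that follows carries an extra bill_length_meters column, which is what a mutate() call would produce. A sketch of such a call, reusing the expression from the pipelines later in this section; the exact code behind the printed output, and the name with_meters, are assumptions:

```r
library(dplyr)
library(palmerpenguins)

# mutate() adds a new column computed from existing ones
with_meters <- penguins %>%
  mutate(bill_length_meters = bill_length_mm / 1000)

with_meters
```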
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_meters <dbl>
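The two-column output that follows could be obtained by keeping only the original and the converted column after mutate(). This is a guess at the original call, and the name bill_lengths is hypothetical:

```r
library(dplyr)
library(palmerpenguins)

# Add the converted column, then keep only the two columns to compare
bill_lengths <- penguins %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  select(bill_length_mm, bill_length_meters)

bill_lengths
```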
# A tibble: 344 × 2
bill_length_mm bill_length_meters
<dbl> <dbl>
1 39.1 0.0391
2 39.5 0.0395
3 40.3 0.0403
4 NA NA
5 36.7 0.0367
6 39.3 0.0393
7 38.9 0.0389
8 39.2 0.0392
9 34.1 0.0341
10 42 0.042
# ℹ 334 more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  mutate(bill_length_meters = bill_length_mm / 1000)
# A tibble: 344 × 4
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Torgersen 39.1 0.0391
2 Adelie Torgersen 39.5 0.0395
3 Adelie Torgersen 40.3 0.0403
4 Adelie Torgersen NA NA
5 Adelie Torgersen 36.7 0.0367
6 Adelie Torgersen 39.3 0.0393
7 Adelie Torgersen 38.9 0.0389
8 Adelie Torgersen 39.2 0.0392
9 Adelie Torgersen 34.1 0.0341
10 Adelie Torgersen 42 0.042
# ℹ 334 more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000)
# A tibble: 124 × 4
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Dream 39.5 0.0395
2 Adelie Dream 37.2 0.0372
3 Adelie Dream 39.5 0.0395
4 Adelie Dream 40.9 0.0409
5 Adelie Dream 36.4 0.0364
6 Adelie Dream 39.2 0.0392
7 Adelie Dream 38.8 0.0388
8 Adelie Dream 42.2 0.0422
9 Adelie Dream 37.6 0.0376
10 Adelie Dream 39.8 0.0398
# ℹ 114 more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species)
# A tibble: 124 × 4
# Groups: species [2]
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Dream 39.5 0.0395
2 Adelie Dream 37.2 0.0372
3 Adelie Dream 39.5 0.0395
4 Adelie Dream 40.9 0.0409
5 Adelie Dream 36.4 0.0364
6 Adelie Dream 39.2 0.0392
7 Adelie Dream 38.8 0.0388
8 Adelie Dream 42.2 0.0422
9 Adelie Dream 37.6 0.0376
10 Adelie Dream 39.8 0.0398
# ℹ 114 more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species) %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))
# A tibble: 2 × 3
species mean_bill_length_mm sd_bill_length_mm
<fct> <dbl> <dbl>
1 Adelie 38.5 2.47
2 Chinstrap 48.8 3.34
Let’s assign the output to a new variable dream_summary.
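A sketch of the assignment, reusing the same pipeline as above and storing its result with the <- operator:

```r
library(dplyr)
library(palmerpenguins)

# Same pipeline as before, now stored in dream_summary
dream_summary <- penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == "Dream") %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species) %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))

# Print the stored summary
dream_summary
```

Nothing is printed by the assignment itself; typing the variable name afterwards shows the summary table.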
In the previous code we have also seen two additional aspects that feature heavily in the tidyverse: the pipe operator %>% and non-standard evaluation.
The pipe is provided by the package magrittr; it is a forwarding operator:
The pipe operator takes the output of what comes before (LHS) and sends it to the first argument of the function that comes after (RHS).
LHS %>% RHS
For example, you could write:
…but if you use the pipe, your code is easier to read…
…especially if you have to perform many operations one after the other…
…that otherwise would force you to nest your code horribly.
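The code for these steps is not part of this text, so here is a sketch built on the penguins pipeline used earlier. The variable names (nested_one, piped_one, nested, piped) are hypothetical:

```r
library(dplyr)
library(palmerpenguins)

# A single operation, written as a plain function call...
nested_one <- select(penguins, species, island, bill_length_mm)

# ...and the same operation written with the pipe
piped_one <- penguins %>% select(species, island, bill_length_mm)

# Many operations without the pipe: calls are nested and read inside out
nested <- summarise(
  group_by(
    filter(
      select(penguins, species, island, bill_length_mm),
      island == "Dream"
    ),
    species
  ),
  mean_bill_length_mm = mean(bill_length_mm)
)

# The same operations with the pipe: read top to bottom
piped <- penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == "Dream") %>%
  group_by(species) %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm))

# Both versions compute the same means
identical(nested$mean_bill_length_mm, piped$mean_bill_length_mm)
```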
This one is difficult…
Which arguments does the function select() take?
Let’s open its help page by typing ?select at the R console, or by opening its online help page at https://dplyr.tidyverse.org/reference/select.html.
Under the Usage section, the help page says:
select(.data, ...)
But what can we write in the ... argument?
In the Arguments section the help page explains:
.data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
...: <tidy-select> One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.
Through non-standard evaluation, we can refer to the elements of a character vector (here, the column names) as if they were variables, without quoting them.
This works even though the variables species and island don’t exist outside of the dplyr function select().
With non-standard evaluation we can write names without quoting them. This makes writing code for iterative data exploration faster.
If you come from a stricter programming language, it could be hard to get used to this behavior.
Most functions of the Tidyverse use non-standard evaluation.
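A small sketch of what this means in practice. The quoted variant with all_of() is shown only for comparison and is not part of the original slides; the names bare and quoted are hypothetical:

```r
library(dplyr)
library(palmerpenguins)

# Inside select(), bare names are looked up among the columns of the data
bare <- penguins %>% select(species, island)

# Outside of a dplyr verb, no object called `species` needs to exist;
# exists() reports whether R can find one in the current session
exists("species")

# For comparison: the same selection with the names passed as strings
quoted <- penguins %>% select(all_of(c("species", "island")))

# Both selections return the same columns
identical(names(bare), names(quoted))
```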
With the penguins dataset:
Select all numeric variables (columns).
Convert all variables that are expressed in millimeters into meters, and rename them accordingly.
Get help from:
How many penguins have bill_length_mm above average?
Repeat the analysis for each species.
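One possible approach to the last exercise, offered as a sketch and not as the reference solution. It assumes that "above average" is measured against the overall mean in the first step and against each species' own mean in the second, and that missing values are ignored. The names above_overall and above_by_species are hypothetical:

```r
library(dplyr)
library(palmerpenguins)

# Overall: count the penguins whose bill is longer than the overall mean
above_overall <- penguins %>%
  summarise(
    n_above = sum(bill_length_mm > mean(bill_length_mm, na.rm = TRUE),
                  na.rm = TRUE)
  )

# Per species: after group_by(), mean() is computed within each species
above_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    n_penguins = n(),
    n_above = sum(bill_length_mm > mean(bill_length_mm, na.rm = TRUE),
                  na.rm = TRUE)
  )

above_overall
above_by_species
```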