Get Data into R

> Get any type of data into R, so that you can visualize and analyze them.

Otho Mantegazza _ Dataviz for Scientists _ Part 1.4

Rectangular Data

More often than not, when we mention data sets or data frames, we speak about rectangular data, i.e. data in a two-dimensional table, made of values organized in rows and columns.

  • Each cell stores a value.
  • Each value belongs to one column and one row.

Rectangular data are the easiest to use. When we get data that are not rectangular, we try to turn them into a rectangular shape.
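For example, here is a tiny rectangular dataset built by hand as a tibble (the values are toy numbers, just for illustration):

library(tibble)

# three rows and two columns:
# each value belongs to exactly one row and one column
tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, 3700)
)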

Tools: Readr

readr is a package that loads (reads) rectangular text data into R.

It’s fast, it guesses column types and reports them explicitly, and it’s pipe friendly.

You can use it to read both local data and online data from a URL.

For example, we can use it to read data in CSV, TSV, and many other delimited formats.
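For instance, here is a minimal sketch of the main readr functions; instead of file paths we pass literal strings wrapped in I(), which readr accepts for quick examples (the values are toy data):

library(readr)

# comma separated values
read_csv(I("species,body_mass_g
Adelie,3750
Gentoo,5000"))

# tab separated values
read_tsv(I("species\tbody_mass_g\nAdelie\t3750"))

# any other delimiter, e.g. semicolons
read_delim(I("species;body_mass_g\nAdelie;3750"), delim = ";")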

Read Data From URL

We can use the Palmer Penguins dataset again.

The source code of this package is on GitHub; we can find the tidy CSV data in the inst/extdata folder.

Read Data From URL

penguin_csv_url <- 
  'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
 
read_csv(penguin_csv_url)
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Parsing

The tibble that we have loaded from the CSV file is not identical to the one that comes preloaded with the palmerpenguins package:

penguins_from_csv <- 
  penguin_csv_url %>% 
  read_csv()

identical(
  penguins_from_csv,
  palmerpenguins::penguins
)
[1] FALSE

Parsing

Let’s compare them side by side:

palmerpenguins::penguins %>% 
  glimpse(width = 40)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adel…
$ island            <fct> Torgersen, T…
$ bill_length_mm    <dbl> 39.1, 39.5, …
$ bill_depth_mm     <dbl> 18.7, 17.4, …
$ flipper_length_mm <int> 181, 186, 19…
$ body_mass_g       <int> 3750, 3800, …
$ sex               <fct> male, female…
$ year              <int> 2007, 2007, …
penguins_from_csv %>% 
  glimpse(width = 40)
Rows: 344
Columns: 8
$ species           <chr> "Adelie", "A…
$ island            <chr> "Torgersen",…
$ bill_length_mm    <dbl> 39.1, 39.5, …
$ bill_depth_mm     <dbl> 18.7, 17.4, …
$ flipper_length_mm <dbl> 181, 186, 19…
$ body_mass_g       <dbl> 3750, 3800, …
$ sex               <chr> "male", "fem…
$ year              <dbl> 2007, 2007, …


Can you spot that the column types are different?

Parsing

When we read data from text-encoded “delimited” files, such as CSV, we use functions that parse the file.

When we parse something, we formalize its structure by applying a set of grammatical rules.

No parsing rule is perfect, thus we should often review the results and fix potential parsing mistakes.

Parsing

# specify column types manually
penguins_from_csv <-
  penguin_csv_url %>% 
  read_csv(
    col_types = cols(
      species = col_factor(),
      island = col_factor(),
      flipper_length_mm = col_integer(),
      body_mass_g = col_integer(),
      sex = col_factor(),
      year = col_integer()
    )
  )

Parsing

penguins_from_csv %>% glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Exercise

Find the source code of the readr package.

In the inst/extdata folder you can find 10 datasets that display different challenges that you might encounter when you have to load data from an external file.

Load in R at least 3 of those datasets using functions from readr.

Get help from readr’s documentation and the data import chapter of r4ds.

Which function did you use? Did you encounter any parsing failure? How did you fix it?
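As a hint, those same example files are also installed together with readr, so you can reach them without downloading the source code; a minimal sketch:

library(readr)

# list the example files that ship with readr
readr_example()

# get the full path to one of them and read it
read_csv(readr_example("mtcars.csv"))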

Read From Local File

PANGAEA is a data repository for the environmental sciences.

For our exercise, we will use this dataset from Wu et al.:

“Effect of barite-bound Sr on detrital Sr isotope systematics in marine sediments with pertinent Mediterranean examples”.

Read From Local File

If you want to read data from a local file, first you must find the path to that file.

pangaea_filename <- 'Dataset_S2_HCl-leaching.tab'

# here() (from the here package) builds the path relative to the project root
pangaea_path <- here('data/Wu-etal_2022/datasets', pangaea_filename)

pangaea_path
[1] "/home/runner/work/dataviz-for-scientists-slides/dataviz-for-scientists-slides/data/Wu-etal_2022/datasets/Dataset_S2_HCl-leaching.tab"

You can learn how to navigate the shell in the first chapter of this open access intro to computer science.

Read From Local File

Let’s try to read the data file Dataset_S2_HCl-leaching.tab.

It’s a .tab file.

pangaea_data <- 
  pangaea_path %>% 
  read_delim()
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 177 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): /*, DATA, DESCRIPTION:

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Read From Local File

If we call problems(), readr tells us what went wrong; printing the tibble also shows that the data were not parsed correctly:

pangaea_data
# A tibble: 177 × 3
   `/*`                   DATA           `DESCRIPTION:`                         
   <chr>                  <chr>          <chr>                                  
 1 "Citation:\tWu,"       "Jiawang;"     "Liu, Zhifei; Michard, Annie; Tachikaw…
 2 "\tIn:"                "Wu,"          "J et al. (2022): Effect of barite-bou…
 3 "Keyword(s):\tBarite;" "Barium"       "(Ba); marine sediments; Mediterranean…
 4 "Related"              "to:\tWu,"     "Jiawang; Liu, Zhifei; Michard, Annie;…
 5 "Funding:\tChina"      "Postdoctoral" "Science Foundation (URI: http://res.c…
 6 "\tChina"              "Postdoctoral" "Science Foundation (URI: http://res.c…
 7 "\tEuropean"           "Commission"   "(EC), grant/award no. MAS2-CT93-0051:…
 8 "\tEuropean"           "Commission"   "(EC), grant/award no. MAS3 CT97-0137:…
 9 "\tEuropean"           "Commission"   "(EC), grant/award no. MAST1-90022C: E…
10 "\tNational"           "Natural"      "Science Foundation of China (NSFC) (U…
# ℹ 167 more rows
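As the warning message suggests, we can ask for the details with problems(); it returns a tibble with one row per parsing issue (roughly: row, column, expected and actual value):

# inspect the parsing issues recorded for this tibble
problems(pangaea_data)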

Read From Local File

Let’s provide more arguments to the parsing function:

  • delim = '\t' to tell read_delim() that we are reading a file delimited by tabs (\t).
  • skip = 49 to tell it that the first 49 rows must be skipped.
pangaea_data <- 
  pangaea_path %>% 
  read_delim(delim = '\t',
             skip = 49)

Now the data that we’ve imported into R looks fine.

pangaea_data %>% glimpse()
Rows: 128
Columns: 19
$ Event                      <chr> "SL73BC", "SL73BC", "SL73BC", "SL73BC", "SL…
$ Latitude                   <dbl> 39.6617, 39.6617, 39.6617, 39.6617, 39.6617…
$ Longitude                  <dbl> 24.5117, 24.5117, 24.5117, 24.5117, 24.5117…
$ `Elevation [m]`            <dbl> -339, -339, -339, -339, -339, -339, -339, -…
$ `Sample label (barite-Sr)` <chr> "SL73-1", "SL73-2", "SL73-3", "SL73-4", "SL…
$ `Samp type`                <chr> "non-S1", "non-S1", "non-S1", "S1b", "S1b",…
$ `Depth [m]`                <dbl> 0.0045, 0.0465, 0.0665, 0.1215, 0.1765, 0.1…
$ `Age [ka BP]`              <dbl> 1.66, 3.13, 4.13, 5.75, 7.30, 7.78, 8.65, 9…
$ `CaCO3 [%]`                <dbl> 61.5, 55.1, 53.0, 43.4, 41.8, 42.3, 42.6, 3…
$ `Ba [µg/g] (Leachate)`     <dbl> 72.6, 64.3, 37.4, 63.5, 101.0, 141.0, 75.2,…
$ `Sr [µg/g] (Leachate)`     <dbl> 767, 681, 690, 552, 527, 528, 551, 482, 391…
$ `Ca [µg/g] (Leachate)`     <dbl> 188951.9, 163260.4, 162188.7, 125937.6, 124…
$ `Al [µg/g] (Leachate)`     <dbl> 10612.7, 11428.4, 5463.0, 3261.5, 2121.7, 1…
$ `Fe [µg/g] (Leachate)`     <dbl> 5935.0, 6814.3, 2465.7, 3936.8, 189.6, 7711…
$ `Ba [µg/g] (Residue)`      <dbl> 171.0, 198.0, 251.0, 290.0, 315.0, 259.0, 3…
$ `Sr [µg/g] (Residue)`      <dbl> 66.6, 71.3, 90.3, 99.8, 106.0, 90.7, 108.0,…
$ `Ca [µg/g] (Residue)`      <dbl> 2315.5, 2369.6, 3007.7, 3447.9, 3713.2, 331…
$ `Al [µg/g] (Residue)`      <dbl> 29262.5, 35561.4, 45862.4, 52485.6, 55083.0…
$ `Fe [µg/g] (Residue)`      <dbl> 14834.8, 18301.4, 24534.5, 30745.3, 28532.2…

Now the data that we’ve imported into R looks fine.

pangaea_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 128
Number of columns 19
_______________________
Column type frequency:
character 3
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Event 0 1 5 7 0 11 0
Sample label (barite-Sr) 0 1 5 7 0 128 0
Samp type 0 1 3 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Latitude 0 1 35.32 2.24 33.38 33.67 34.55 35.29 39.66 ▇▂▁▁▃
Longitude 0 1 22.79 8.17 4.51 17.86 24.51 28.61 32.67 ▂▁▆▆▇
Elevation [m] 0 1 -1807.65 829.55 -3390.00 -2373.00 -1587.00 -1375.00 -339.00 ▂▇▁▅▅
Depth [m] 0 1 0.30 0.25 0.00 0.17 0.26 0.34 1.40 ▇▅▁▁▁
Age [ka BP] 0 1 7.88 3.49 0.30 5.83 7.86 9.82 19.61 ▃▇▇▂▁
CaCO3 [%] 0 1 57.94 8.29 33.70 52.58 57.80 64.45 74.60 ▁▃▇▇▅
Ba [µg/g] (Leachate) 0 1 216.49 223.26 0.00 37.15 122.00 398.75 954.00 ▇▂▂▁▁
Sr [µg/g] (Leachate) 0 1 784.98 207.08 341.00 635.15 767.50 886.25 1419.00 ▃▇▇▂▁
Ca [µg/g] (Leachate) 0 1 179277.91 34034.64 82460.20 158421.22 182153.85 204214.33 240253.00 ▁▃▆▇▅
Al [µg/g] (Leachate) 0 1 9075.10 3739.72 122.20 6036.95 9034.50 11691.80 21011.90 ▂▇▇▅▁
Fe [µg/g] (Leachate) 0 1 7673.07 4235.50 189.60 4937.08 6918.25 9513.88 21428.70 ▃▇▃▁▁
Ba [µg/g] (Residue) 0 1 184.58 109.97 54.80 98.30 142.00 251.00 507.00 ▇▃▂▁▁
Sr [µg/g] (Residue) 0 1 46.15 17.69 20.50 33.53 41.10 51.85 120.00 ▇▇▂▁▁
Ca [µg/g] (Residue) 0 1 1316.60 930.42 475.50 825.50 1057.45 1446.40 8343.50 ▇▁▁▁▁
Al [µg/g] (Residue) 0 1 36668.75 8937.30 19227.20 29424.80 37297.75 43806.80 60033.10 ▆▇▇▆▁
Fe [µg/g] (Residue) 0 1 21446.97 6866.21 10823.30 16197.70 20393.90 24408.27 48057.60 ▇▇▃▁▁

Check for Missing Values

skimr::skim() shows you how many values are missing in your dataset:

  • How many missing values are there?
  • Where do they occur?

Here is a more formal way to check for missing values:

pangaea_data %>% 
  summarise(
    # count the missing values in every column
    across(
      .cols = everything(),
      .fns = ~ sum(is.na(.x))
    )
  ) %>% 
  glimpse()
Rows: 1
Columns: 19
$ Event                      <int> 0
$ Latitude                   <int> 0
$ Longitude                  <int> 0
$ `Elevation [m]`            <int> 0
$ `Sample label (barite-Sr)` <int> 0
$ `Samp type`                <int> 0
$ `Depth [m]`                <int> 0
$ `Age [ka BP]`              <int> 0
$ `CaCO3 [%]`                <int> 0
$ `Ba [µg/g] (Leachate)`     <int> 0
$ `Sr [µg/g] (Leachate)`     <int> 0
$ `Ca [µg/g] (Leachate)`     <int> 0
$ `Al [µg/g] (Leachate)`     <int> 0
$ `Fe [µg/g] (Leachate)`     <int> 0
$ `Ba [µg/g] (Residue)`      <int> 0
$ `Sr [µg/g] (Residue)`      <int> 0
$ `Ca [µg/g] (Residue)`      <int> 0
$ `Al [µg/g] (Residue)`      <int> 0
$ `Fe [µg/g] (Residue)`      <int> 0

Checklist

When you read new data into R, always:

  • Check for missing values.
  • Check the column types: are they what you expect?
  • Check the number of rows and the column names.
  • Optionally, check the head() and the tail() of the data.
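For reference, here is one way that checklist could look in code, using the pangaea_data tibble from the previous slides:

# missing values per column
pangaea_data %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))

# column types, via glimpse()
pangaea_data %>% glimpse()

# number of rows and column names
nrow(pangaea_data)
colnames(pangaea_data)

# first and last rows
head(pangaea_data)
tail(pangaea_data)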

You Can Read Data From Excel

readxl is a package that loads (reads) rectangular data into R from Excel files (.xls, .xlsx).

It’s fast, it also detects column types, and it’s pipe friendly.

You can point it to specific sheets and cell ranges.
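A minimal sketch, assuming a hypothetical Excel file named 'body_measures.xlsx' that contains a sheet called 'penguins':

library(readxl)

read_excel(
  'body_measures.xlsx',  # hypothetical file name
  sheet = 'penguins',    # read this sheet only (hypothetical sheet name)
  range = 'A1:F345'      # optionally, restrict the import to a cell range
)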

Exercise

TidyTuesday is a weekly data project aimed at learning, collaborating, and networking within the R ecosystem.

Find this week’s dataset and read it into R. Run the checklist from the previous slides on the data that you’ve read.

If you are done early, proceed to read data from a previous week, or find a colleague to help.

Check TidyTuesday submissions on Twitter with the hashtag #TidyTuesday