Exploratory Data Analysis

> It is important to understand what you can do before you learn to measure how well you seem to have done it. (John Tukey)

Otho Mantegazza, Dataviz for Scientists, Part 2.3

BOXPLOT

Year: 1977

Author: John Tukey

Book: Exploratory Data Analysis

The boxplot is one of the main visual models used to explore data. It shows summary quantile statistics and outliers for a stratified set of data.

Exploratory Data Analysis

Investigative Work

Exploratory data analysis, as stated by Tukey, is the investigative work on data.

When you explore data, you leave no stone unturned. Relying on graphical methods, robust summary statistics and dimension reduction, you quickly gain insight into the patterns, correlations and possible cause-effect relationships that are in the data. In technical terms, you generate hypotheses.

Confirmation

After you have done your investigative work, you should switch from inspector to judge and test your hypotheses with statistical inference.

But there’s no point in testing the statistical significance of poorly formulated hypotheses. Exploratory Data Analysis is fundamental in modern statistics, because it allows you to formulate the best possible hypotheses.

Visual Immediacy

When you explore data, you have to turn them into insightful formats. Most of the time this involves turning data into graphical and visual shapes.

STEM AND LEAF

Year: 1977

Author: John Tukey

Book: Exploratory Data Analysis

An intuitive graphical representation of car prices.

A big part of Tukey’s book “Exploratory Data Analysis” relies on graphical representations of data that you can draw yourself with pen and paper. Luckily, today you can use powerful software designed for data exploration, such as the Tidyverse.
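You can also get a pen-and-paper style stem-and-leaf display directly in R. The sketch below uses base R's `stem()` on penguin bill depths. The choice of variable is an assumption made for illustration; it is not the car-price data from Tukey's example.

```r
library(palmerpenguins)

# Illustrative variable (an assumption): penguin bill depth in mm
depth <- penguins$bill_depth_mm
depth <- depth[!is.na(depth)]

# Each row is a "stem" (the leading digits); each digit to the right
# of the bar is the "leaf" of one observation
stem(depth)
```

Unlike a histogram, the display keeps the individual values readable while still showing the shape of the distribution.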

Semantics

If you want to represent data intuitively, first you have to learn terms that allow you to describe the structure of a dataset semantically.

The Tidy Data Theory lets us do just that.

TIDY DATA THEORY

Year: 2014

Author: Hadley Wickham

Paper: Tidy Data

A common framework to organize data semantically: if you organize data based on their structure, it’s easier for you to make sense of them, to realize what data you have and what’s missing.

If you organize data with a common framework, it’s also easier to share them with others.

Each column is a variable.
Each row is an observation.
Each cell is a value.
Each table holds one type of observational unit.
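The rules above can be sketched on a small example. The table below is hypothetical (the `site`, `year` and `count` names are invented for illustration): it stores one column per year, so the variable "year" is hidden in the column names. `pivot_longer()` from tidyr reshapes it into tidy form.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical untidy table: the years are spread across column names
untidy <- tribble(
  ~site, ~`2020`, ~`2021`,
  "A",        12,      15,
  "B",         7,       9
)

# Tidy version: each column is a variable (site, year, count),
# each row is one observation (one site in one year)
tidy <- untidy %>%
  pivot_longer(
    cols = -site,
    names_to = "year",
    values_to = "count"
  )

tidy
```

Note that `year` comes out as a character column, because it was read from column names; convert it if you need a number.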

Exploratory Data Analysis

Visualize
Summarize
Stratify
Transform

One-Variable Summaries

Histogram

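# Load the plotting tools; penguins is assumed to come
# from the palmerpenguins package
library(tidyverse)
library(palmerpenguins)
penguins %>% 
  ggplot() +
  aes(x = bill_depth_mm) +
  geom_histogram()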

Histogram

# The number of bins is arbitrary.
# Different bins can highlight
# different patterns
penguins %>% 
  ggplot() +
  aes(x = bill_depth_mm) +
  geom_histogram(bins = 10)

Histogram

# You can add a rug under the plot
# to show the data
penguins %>% 
  ggplot() +
  aes(x = bill_depth_mm) +
  geom_histogram(bins = 10) +
  geom_rug(alpha = .1)
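What a histogram draws can also be written down as a table: cut the variable into intervals and count the observations in each one. This is a minimal sketch that uses ggplot2's `cut_interval()` helper as a stand-in for the binning step; its bin edges are not guaranteed to match the ones that `geom_histogram(bins = 10)` picks.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Cut bill depth into 10 equal-width intervals and count
# how many penguins fall in each interval
bin_counts <- penguins %>%
  filter(!is.na(bill_depth_mm)) %>%
  mutate(bin = cut_interval(bill_depth_mm, n = 10)) %>%
  count(bin)

bin_counts
```

Changing `n` changes the table in the same way that changing `bins` changes the plot.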

Exercise

Describe a histogram in terms of the grammar of graphics.

Which steps do you have to define explicitly? Which steps are defined implicitly by ggplot2?

Density

penguins %>% 
  ggplot() +
  aes(x = bill_depth_mm) +
  geom_density() +
  geom_rug(alpha = .1)

Density

# You can adjust how closely the line
# follows the data: adjust scales the
# smoothing bandwidth, and values
# below 1 give a wigglier line
penguins %>% 
  ggplot() +
  aes(x = bill_depth_mm) +
  geom_density(adjust = 1/5) +
  geom_rug(alpha = .1)
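To see what `adjust` does numerically, you can call base R's kernel density estimator `density()` directly. This sketch assumes the default Gaussian kernel and default bandwidth rule; the ggplot2 documentation describes `adjust` as a multiplicative bandwidth adjustment, which is the behaviour shown here.

```r
library(palmerpenguins)

depth <- penguins$bill_depth_mm
depth <- depth[!is.na(depth)]

# Same data, two bandwidths: adjust multiplies the default bandwidth
d_default <- density(depth)
d_narrow  <- density(depth, adjust = 1/5)

# The bandwidth actually used by each estimate
d_default$bw
d_narrow$bw
```

A smaller bandwidth means each observation influences a narrower range of the curve, so the line reacts more strongly to local clusters of points.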

Boxplot

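Boxplots are widely used in exploratory data analysis to show a five-number summary of a continuous variable, stratified by one or more categorical variables.

A boxplot shows:

Maximum
Upper Quartile
Median
Lower Quartile
Minimum

(and outliers). In ggplot2 the whiskers reach the most extreme values within 1.5 times the interquartile range from the box; points beyond that are drawn individually as outliers.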
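The five numbers listed above can be computed as a table before drawing anything. A minimal sketch with dplyr, stratified by species; the column names of the summary are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Five-number summary of bill depth, one row per species
five_numbers <- penguins %>%
  filter(!is.na(bill_depth_mm)) %>%
  group_by(species) %>%
  summarise(
    minimum        = min(bill_depth_mm),
    lower_quartile = quantile(bill_depth_mm, 0.25),
    median         = median(bill_depth_mm),
    upper_quartile = quantile(bill_depth_mm, 0.75),
    maximum        = max(bill_depth_mm)
  )

five_numbers
```

The minimum and maximum in this table are the raw extremes, so in a plot they may appear as outlier points rather than as whisker ends.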

Boxplot

penguins %>% 
  ggplot() +
  aes(x = species,
      y = bill_depth_mm) +
  geom_boxplot()

Boxplot

# You can use colours to 
# further stratify the data
penguins %>% 
  ggplot() +
  aes(x = species,
      y = bill_depth_mm,
      fill = island) +
  geom_boxplot()

Two-Variable Summaries

Scatterplot

diamonds %>%
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point()

Scatterplot

# remove points with impossible
# values: a diamond cannot have
# a length (x) of 0 mm
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point()
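Before dropping observations, it is worth looking at what you are about to remove. This sketch lists the diamonds that the `x > 0` filter excludes; selecting these particular columns is just a choice made for readability.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# The rows that filter(x > 0) would discard
zero_length <- diamonds %>%
  filter(x <= 0) %>%
  select(carat, price, x, y, z)

zero_length
nrow(zero_length)
```

If the excluded rows turn out to be a large share of the data, or to share a pattern, that is itself a finding to investigate.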

Scatterplot

# avoid overplotting
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point(alpha = .1)

Scatterplot

# add a smoothed summary
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point(alpha = .1) +
  geom_smooth()

Scatterplot

# add a smoothed summary
# with a specific model
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point(alpha = .1) +
  geom_smooth(
    method = 'lm',
    formula = y ~ poly(x, 2)
    ) +
  scale_y_continuous(
    limits = c(NA, 19000)
    )

Scatterplot

# map another variable only in
# one layer
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point(
    aes(colour = clarity),
    alpha = .1
    ) +
  geom_smooth(
    method = 'lm',
    formula = y ~ poly(x, 2)
    ) +
  scale_y_continuous(
    limits = c(NA, 19000)
    ) +
  guides(
    colour = guide_legend(
      override.aes = list(
        alpha = 1)))

Scatterplot

# Use facets to further
# stratify the data
diamonds %>%
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  geom_point(
    aes(colour = clarity),
    alpha = .1
    ) +
  geom_smooth(
    method = 'lm',
    formula = y ~ poly(x, 2)
    ) +
  facet_wrap(facets = 'cut',
             ncol = 2) +
  scale_y_continuous(
    limits = c(NA, 19000)
    ) +
  guides(
    colour = guide_legend(
      override.aes = list(
        alpha = 1)))

Exercise

Describe the faceted scatterplot (which you can find on the previous page) in terms of the grammar of graphics.

How many layers are there?

Which steps do you have to define explicitly? Which steps are defined implicitly by ggplot2?

Residual Plot

We can use statistical models to transform data and make the patterns they contain visually more evident.

We can also visually explore the output of a statistical model, to see if the model fits the data properly.

library(tidymodels)

Residual Plot

penguins %>%
  ggplot() +
  aes(x = bill_length_mm, 
      y = bill_depth_mm) +
  geom_point(aes(
      colour = sex
  )) +
  geom_smooth(
    method = 'lm'
    ) +
  facet_wrap(
    facets = 'species',
    ncol = 2
  )

Residual Plot

# We can define a model
# before plotting
d_fit <- 
  linear_reg() %>% 
  fit(bill_depth_mm ~ 
        bill_length_mm + 
        species,
      data = penguins)

# add predictions (.pred) and
# residuals (.resid) to the data
(d_fit <- d_fit %>% 
  augment(new_data = penguins))

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, .pred <dbl>, .resid <dbl>
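A residual is simply the observed value minus the value the model predicts. The sketch below checks this with base R's `lm()`, which is the default engine behind `linear_reg()`; it is a swapped-in equivalent used to keep the example small, not the tidymodels workflow itself.

```r
library(palmerpenguins)

model <- lm(
  bill_depth_mm ~ bill_length_mm + species,
  data = penguins
)

# lm() drops rows with missing values, so compare against
# the rows the model actually used
used <- model.frame(model)
by_hand <- used$bill_depth_mm - fitted(model)

# Largest disagreement between the two versions of the residuals
max(abs(by_hand - residuals(model)))
```

If the model describes the data well, the residuals should scatter around zero without any visible structure.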

Residual Plot

# Let's plot the residuals of
# the linear model
d_fit %>% 
  ggplot() +
  aes(x = bill_length_mm,
      y = .resid,
      colour = sex) +
  geom_point() +
  geom_hline(
    yintercept = 0,
    colour = 'black'
  )

Residual Plot

d_fit %>% 
  ggplot() +
  aes(x = bill_length_mm,
      y = .resid,
      colour = sex) +
  geom_point() +
  geom_hline(
    yintercept = 0,
    colour = 'black'
  ) +
  facet_wrap(
    facets = 'species',
    ncol = 2
  )

Heatmap

Heatmaps are useful for exploring big datasets, where many observations are similar to one another. To avoid overplotting on those datasets, you can turn scatterplots into heatmaps.

In a heatmap we map a quantitative value to a color. Heatmaps can be used either with categorical x and y axes or with binned continuous axes.

In this case we are binning the data on both the x and y axes, and mapping the count of observations in each bin to a continuous color palette.

Heatmap

diamonds %>% 
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  stat_bin2d(
    aes(
      fill = after_stat(count)
    ),
    geom = 'tile'
  ) +
  scale_fill_continuous(
    guide = guide_colorbar(
      barwidth = 15
    )
  )
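The counts behind a binned heatmap can be approximated as a table: bin both variables, then count the observations in each pair of bins. This is a sketch using ggplot2's `cut_width()` helper with bin widths chosen arbitrarily for illustration (1 mm and 2000 price units); `stat_bin2d()` picks its own bins, so the numbers will not match the plot exactly.

```r
library(dplyr)
library(ggplot2)

# Bin x and price, then count the diamonds in each 2D bin
tile_counts <- diamonds %>%
  filter(x > 0) %>%
  mutate(
    x_bin     = cut_width(x, width = 1),
    price_bin = cut_width(price, width = 2000)
  ) %>%
  count(x_bin, price_bin, name = "count")

tile_counts
```

Each row of this table corresponds to one tile of the heatmap, and `count` is the value mapped to the fill color.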

Heatmap

# If you work on a white background,
# use a colour palette that maps low 
# values to light colours and high values
# to dark colours
diamonds %>% 
  filter(x > 0) %>% 
  ggplot() +
  aes(x = x, 
      y = price) +
  stat_bin2d(
    aes(
      fill = after_stat(count)
    ),
    geom = 'tile'
  ) +
  scale_fill_viridis_c(
    direction = -1,
    option = 'G',
    guide = guide_colorbar(
      barwidth = 15
    ))

Pairs

You can use pair plots to plot every pairwise relationship in a dataset.

Pair plots are a very quick way to explore relationships in reasonably big datasets.

library(GGally)

Pairs

diamonds %>%
  ggpairs(
    lower = list(
      continuous = wrap(
        "points", alpha = 0.05
      )
    )
  )
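A pair plot has a quick numeric companion: the correlation matrix of the numeric columns. This sketch uses base R's `cor()` with its default Pearson correlation, which only captures linear relationships, so treat it as a complement to the plots and not as a replacement.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Pairwise Pearson correlations between the numeric variables
cor_matrix <- diamonds %>%
  select(where(is.numeric)) %>%
  cor()

round(cor_matrix, 2)
```

The categorical variables (`cut`, `color`, `clarity`) are left out of the matrix, which is one reason the pair plot still shows more than the table does.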

Exercise - Detective Hat

A very bad criminal organization has hidden a message for one of its hitmen in this file.

You have intercepted the file, but you must decode the message. You don’t have much time to stop a catastrophe. Work fast!

More details on the next page →

Exercise - Detective Hat

The aim of this exercise is to let you practice making many exploratory graphs, quickly.

Visualize the content of this dataset in different ways, until you find the secret message. Be fast, you have a lot of data to explore.

Be essential. You, right here, right now, are the only person who needs to understand these graphs. Do not waste your time making the graphs nicer; change only what you need to change to understand them better.

Show the data. The message is often well hidden; if you summarize the data too much, it might get lost.