> It is important to understand what you can do before you learn to measure how well you seem to have done it. (John Tukey)
Otho Mantegazza, Dataviz for Scientists, Part 2.3
Exploratory data analysis, as described by Tukey, is investigative work on data.
When you explore data, you leave no stone unturned. Relying on graphical methods, robust summary statistics and dimensionality reduction, you quickly gain insight into the patterns, correlations and possible cause-effect relationships in the data. In technical terms, you generate hypotheses.
After you have done your investigative work, you should switch from inspector to judge and test your hypotheses with statistical inference.
But there’s no point in testing the statistical significance of poorly formulated hypotheses. Exploratory data analysis is fundamental in modern statistics because it lets you formulate the best hypotheses possible.
When you explore data, you have to turn them into insightful formats. Most of the time this involves turning data into graphical and visual shapes.
If you want to represent data intuitively, first you have to learn terms that allow you to describe the structure of a dataset semantically.
The Tidy Data Theory lets us do just that.
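As a small illustration of what a tidy structure means in practice, the sketch below reshapes a made-up table so that each variable gets its own column and each observation its own row. The table `wide` and its column names are hypothetical and exist only for this example; `pivot_longer()` is the tidyr function used for the reshaping.

```r
library(dplyr)
library(tidyr)

# Hypothetical measurements, invented for this example.
# Not tidy: the variable "treatment" is spread over two column names.
wide <- tibble(
  sample = c("s1", "s2"),
  control = c(1.2, 0.9),
  treated = c(2.3, 2.1)
)

# Tidy: one column per variable (sample, treatment, value),
# one row per observation
tidy <- wide %>%
  pivot_longer(
    cols = c(control, treated),
    names_to = "treatment",
    values_to = "value"
  )

print(tidy)
```

In the tidy version, `treatment` can be mapped directly to an aesthetic such as colour, which is not possible while it is hidden in the column names.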
Visualize
Summarize
Stratify
Transform
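The four verbs above can be sketched on the `diamonds` dataset that ships with ggplot2. This is only one possible reading of each step; the choice of `price` and `cut`, the median and IQR as robust summaries, and the log transformation are assumptions made for the example.

```r
library(dplyr)
library(ggplot2)

# Summarize, stratified by cut, with robust statistics
price_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    median_price = median(price),
    iqr_price = IQR(price),
    .groups = "drop"
  )

print(price_summary)

# Visualize the distribution of price,
# transform the x axis to a log scale,
# and stratify by cut with facets
p_strata <- diamonds %>%
  ggplot() +
  aes(x = price) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  facet_wrap(facets = "cut", ncol = 2)

print(p_strata)
```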
Describe a histogram in terms of the grammar of graphics.
Which steps do you have to define explicitly? Which steps are defined implicitly by ggplot2?
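One way to explore this question is to write the same histogram twice: once with the usual shorthand and once with the components spelled out through `layer()`. This is a sketch that assumes the `diamonds` dataset and 30 bins; the explicit scales, coordinate system and facets in the second plot are the defaults that ggplot2 would otherwise add for you.

```r
library(ggplot2)

# Shorthand: you define the data, the mapping and the geom;
# geom_histogram() picks the stat and the position
p_short <- ggplot(diamonds) +
  aes(x = price) +
  geom_histogram(bins = 30)

# The same plot with the implicit components written out
p_long <- ggplot(diamonds) +
  layer(
    mapping = aes(x = price),
    geom = "bar",          # bars ...
    stat = "bin",          # ... drawn from binned counts ...
    position = "stack",    # ... stacked when groups overlap
    params = list(bins = 30)
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian() +
  facet_null()

print(p_short)
print(p_long)
```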
Boxplots are widely used in exploratory data analysis to show a five-number summary of continuous variables stratified by one or more categorical variables.
A boxplot shows the median, the first and third quartiles, and the two whiskers (and outliers).
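A minimal sketch of such a plot, assuming the `diamonds` dataset with `price` as the continuous variable and `cut` as the stratifying one. The call to `layer_data()` extracts the five numbers that ggplot2 computes for each box, so you can read them next to the picture.

```r
library(dplyr)
library(ggplot2)

# One box per cut
p_box <- diamonds %>%
  ggplot() +
  aes(x = cut, y = price) +
  geom_boxplot()

print(p_box)

# The numbers behind each box: whisker ends (ymin, ymax),
# quartiles (lower, upper) and median (middle)
box_stats <- layer_data(p_box) %>%
  select(ymin, lower, middle, upper, ymax)

print(box_stats)
```

Note that in ggplot2 the whiskers stop at the most extreme observation within 1.5 times the interquartile range of the box; points beyond that are drawn individually as outliers.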
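# Load ggplot2, dplyr and the pipe
library(tidyverse)

# Map another variable (clarity) to colour
# in one layer only: the points
diamonds %>%
  # drop diamonds with a recorded length of 0
  filter(x > 0) %>%
  ggplot() +
  aes(x = x,
      y = price) +
  geom_point(
    aes(colour = clarity),
    alpha = .1) +
  # quadratic fit, computed on all the points
  geom_smooth(
    method = lm,
    formula = y ~ poly(x, 2)
  ) +
  scale_y_continuous(
    limits = c(NA, 19000)
  ) +
  # draw opaque colours in the legend
  guides(
    colour = guide_legend(
      override.aes = list(
        alpha = 1)))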
# Use facets to further
# stratify the data
diamonds %>%
  # drop diamonds with a recorded length of 0
  filter(x > 0) %>%
  ggplot() +
  aes(x = x,
      y = price) +
  geom_point(
    aes(colour = clarity),
    alpha = .1) +
  # quadratic fit, computed within each facet
  geom_smooth(
    method = lm,
    formula = y ~ poly(x, 2)
  ) +
  # one panel per cut, in two columns
  facet_wrap(facets = 'cut',
             ncol = 2) +
  scale_y_continuous(
    limits = c(NA, 19000)
  ) +
  # draw opaque colours in the legend
  guides(
    colour = guide_legend(
      override.aes = list(
        alpha = 1)))
Describe the faceted scatterplot (which you can find on the previous page) in terms of the grammar of graphics.
How many layers are there?
Which steps do you have to define explicitly? Which steps are defined implicitly by ggplot2?
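After you have written down your own answer, one possible way to check it is to store the plot in an object and inspect its components. The sketch below rebuilds the faceted scatterplot without the scale and legend tweaks; the object name `p_facet` is an assumption made for this example.

```r
library(dplyr)
library(ggplot2)

p_facet <- diamonds %>%
  filter(x > 0) %>%
  ggplot() +
  aes(x = x, y = price) +
  geom_point(aes(colour = clarity), alpha = .1) +
  geom_smooth(method = lm, formula = y ~ poly(x, 2)) +
  facet_wrap(facets = "cut", ncol = 2)

# Number of layers in the plot
n_layers <- length(p_facet$layers)
print(n_layers)

# Geom and stat used by each layer
for (l in p_facet$layers) {
  print(c(geom = class(l$geom)[1], stat = class(l$stat)[1]))
}
```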
We can use statistical models to transform data and make the patterns they contain visually more evident.
We can also visually explore the output of a statistical model to check whether the model fits the data properly.
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, .pred <dbl>, .resid <dbl>
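The model that produced the table above is not shown here. As a rough sketch of the same idea on the `diamonds` data used earlier, the code below fits the quadratic model from the scatterplots with `lm()`, stores predictions and residuals in columns named `.pred` and `.resid` to mirror the table, and plots one against the other. Using `lm()` and computing the two columns by hand are assumptions of this example, not necessarily how the table was produced.

```r
library(dplyr)
library(ggplot2)

diamonds_clean <- diamonds %>%
  filter(x > 0)

# Quadratic model of price as a function of length
fit <- lm(price ~ poly(x, 2), data = diamonds_clean)

# Add predictions and residuals to the data
diamonds_fit <- diamonds_clean %>%
  mutate(
    .pred = unname(predict(fit)),
    .resid = price - .pred
  )

# If the model fits well, the residuals scatter
# evenly around the horizontal line at zero
p_resid <- diamonds_fit %>%
  ggplot() +
  aes(x = .pred, y = .resid) +
  geom_point(alpha = .1) +
  geom_hline(yintercept = 0)

print(p_resid)
```

Any visible structure in this plot, such as a funnel shape or a curve, suggests that the model is missing part of the pattern in the data.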
Heatmaps are useful for exploring big datasets in which many observations are similar to one another. To avoid overplotting on those datasets, you can turn scatterplots into heatmaps.
In a heatmap we map a quantitative value to a color. Heatmaps can be used either with categorical x and y axes or with binned continuous axes.
In this case we are binning the data on both the x and y axes and mapping the counts to a continuous color palette.
# If you work on a white background,
# use a colour palette that maps low
# values to light colours and high values
# to dark colours
diamonds %>%
  filter(x > 0) %>%
  ggplot() +
  aes(x = x,
      y = price) +
  stat_bin_2d(
    aes(
      fill = after_stat(count)
    ),
    geom = 'tile'
  ) +
  scale_fill_viridis_c(
    direction = -1,
    option = 'G',
    guide = guide_colorbar(
      barwidth = 15
    ))
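The example above covers the binned continuous case. For the categorical case, a minimal sketch could count the diamonds in each combination of `cut` and `clarity` and map the count to the fill colour of a tile. The choice of these two variables is an assumption made for illustration.

```r
library(dplyr)
library(ggplot2)

# One row per combination of cut and clarity,
# with the number of diamonds in column n
cut_clarity_counts <- diamonds %>%
  count(cut, clarity)

# Categorical x and y axes, count mapped to fill
p_tiles <- cut_clarity_counts %>%
  ggplot() +
  aes(x = cut,
      y = clarity,
      fill = n) +
  geom_tile() +
  scale_fill_viridis_c(
    direction = -1,
    option = "G"
  )

print(p_tiles)
```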
You can use pair plots to plot every pairwise relationship in a dataset.
Pair plots are a very quick way to explore relationships in reasonably big datasets.
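A quick pair plot can be sketched with the base R function `pairs()`, which needs no extra package. The choice of columns and the random sample of 1000 diamonds are assumptions made to keep the plot fast to draw; the `ggpairs()` function from the GGally package is a ggplot2-based alternative, if you have it installed.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the diamonds dataset

set.seed(1)

# A random sample of a few numeric variables
diamonds_sample <- diamonds %>%
  filter(x > 0) %>%
  select(carat, price, x, depth) %>%
  slice_sample(n = 1000)

# One scatterplot for each pair of variables
pairs(
  diamonds_sample,
  pch = 16,
  col = rgb(0, 0, 0, alpha = .1)
)
```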
A very bad criminal organization has hidden a message for one of its hitmen in this file.
You have intercepted the file, but you must decode the message. You don’t have much time to stop a catastrophe. Work fast!
More details on the next page →
The aim of this exercise is to let you practice making many exploratory graphs, quickly.
Visualize the content of this dataset in different ways, until you find the secret message. Be fast, you have a lot of data to explore.
Be essential. You, right here, right now, are the only person who needs to understand these graphs. Do not waste your time making the graphs nicer; change only what you need to change to understand them better.
Show the data. The message is well hidden; if you summarize the data too much, it might get lost.