>Better Graphs II

> Show the data. Declutter. Foster intuition. Know your audience.

\ Otho Mantegazza _ Dataviz for Scientists _ Part 3.2

Communicating to Others

When you explore your data, your goal should be to produce, quickly, as many graphs as possible, to gain insights.

When you want to communicate results to other, you goal should be to make as few graphs as possible that convey a message to your audience in a clear and informative way.

As a scientist, the most likely scenario would be that you are communicating very complex results to an highly educated and informed audience.

Keep in Mind

  1. Show the data (as much as you can).
  2. Declutter.
  3. Use intuition.
  4. Know your audience’s expectations.

Show the Data

You should use graphs to deliver a message to your audience.

When you do it, show as many details of the data as you can. In this way you’ll provide precious information to your audience.

If you have to hide your data behind a statistical transformation, use one which is as simple as possible.

Show the Data

# Do we even need a chart
# to show a point estimate?
penguins %>%
  drop_na(sex) %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g) +
  stat_summary(
    geom = 'bar',
    colour = 'black',
    fill = 'grey80',
    size = 1
    ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  )

Show the Data

# showing mean and 
# confidence interval
# is generally accepted,
# but it still hides most
# of the information
penguins %>%
  drop_na(sex) %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g) +
  stat_summary(
    fun = mean,
    geom = 'bar',
    colour = 'black',
    fill = 'grey80',
    size = 1
    ) +
  stat_summary(
    fun.data = mean_cl_normal,
    geom = 'errorbar',
    colour = 'black',
    size = 1,
    width = .5
    ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  )

Show the Data

# A boxplot shows intuitive and 
# robust statistics
# When we switch to this visual
# model, ggplot automatically cuts
# the axis to highlight relative
# comparisons
penguins %>% 
  drop_na() %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g) +
  geom_boxplot(
    colour = 'black',
    fill = 'grey80',
    size = 1
    ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  )

Show the Data

# Only when if we show all
# data points we can see
# that the penguins body weight 
# follows a bimodal distribution
penguins %>% 
  drop_na() %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g) +
  geom_jitter(
    shape = 1,
    size = 3,
    stroke = 1
    ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  )

Show the Data

# The species variable seems
# to explain the bimodal distribution
penguins %>% 
  drop_na() %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g,
      colour = species,
      shape = species) +
  geom_jitter(
    shape = 1,
    size = 3,
    stroke = 1
    ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  )

Show the Data

penguins %>% 
  drop_na() %>% 
  ggplot() +
  aes(x = sex,
      y = body_mass_g,
      colour = species,
      shape = species) +
  geom_jitter(
    shape = 1,
    size = 3,
    stroke = 1
    ) +
  stat_summary(
    fun = median,
    geom = 'point',
    shape = '_',
    size = 20,
    colour = 'black'
  ) +
  facet_grid(
    cols = vars(species)
  ) +
  scale_y_continuous(
    expand = expansion(
      mult = c(0, .05)
    )
  ) 

Show as Much Data as You Can

Show the data is a good mindset to make informative plots.

Though, for the sake of clarity and simplicity, you might decide to hide some of the raw data behind statistical transformation.

It’s up to you to find the balance between detail and simplicity that suits your audience.

Show as Much Data as You Can

# In this case overplotting 
# saturates the chart
diamonds %>% 
  filter(carat < 3) %>% 
  ggplot() +
  aes(x = carat,
      y = price) +
  geom_point(shape = 1)

Show as Much Data as You Can

# We can use transparency 
# to overcome overplotting a bit
diamonds %>% 
  filter(carat < 3) %>% 
  ggplot() +
  aes(x = carat,
      y = price) +
  geom_point(shape = 1,
             alpha = .05)

Show as Much Data as You Can

# And add a red point that shows a robust
# summary of the data: the median
# of price over binned carat
diamonds %>% 
  filter(carat < 3) %>% 
  ggplot() +
  aes(x = carat,
      y = price) +
  geom_point(shape = 1,
             alpha = .05) +
  stat_summary(
    aes(
      x = carat %>% round(1)
    ),
    fun = median,
    colour = '#f44702',
    geom = 'point',
    size = 3
  )

Show the Data

Your graph should always deliver a message.

While delivering this message, you should provide as much information as possible, showing the data.

If your graph gets too crowded, feel free to hide some of the data behind a statistical transformation.

Graphs are processed intuitively by your audience, be sure not to mislead them.

Declutter

Keep your message in front.

One of the main mindset for decluttering a graph is the “data to ink ratio” concept, developed by E. Tufte.

According to it, most ink in your graph should be used to represent data.

DATA TO INK RATIO

Year: 1983

Author: Edward Tufte.

Edward Tufte shows that removing ink from axes and other graphical elements that are not directly mapped to the data, the message of the graph might become more evident.

Checking the data to ink ratio of your graphs is a good way to help your message stand out.

User Research

User Research

User research is a design practice about researching how user interact with a tools or products like websites. Asking what would the user expects from a tool and how they interact with it.

User research proceeds through methods from behavioral psychology and ethnography, such as interviews, surveys, tests.

When you communicate your data driven results, graphs and charts are your products, and your audience is their user.

Know Expectations

Probably, you’ll collaborate with people from many different fields and disciplines.

Each field has its own way to represent specific types of data.

People in each field expect to see data represented in a way that they are used too.

In a few words, people don’t like change.

MA PLOT

Source: ggpubr

Field: Omics Disciplines

A highly specialized diagnostic visual model used in the omics disciplines.

The MA plots shows the average expression level against the log fold change in multiple parallel measurements between two conditions.

When you compare the expression levels of several transcripts, such as with a transcriptomic experiment, you expect few genes to be differentially expressed.

With the MA plot you can check if this assumption is supported by the data, or you can detect normalization issues.

CANDLESTICK CHART

Source: The Data Visualization Catalogue

Field: Stock Trading

A quick way to check if the stocks values increased or decreased between the opening and the closing of given day, and to check also the highest and the lowest value that they reached

For each day, the box represents the opening and the closing value, and the whiskers represent the highest and the lowest value of the stock. The color shows if the stock value increased or decreased during the day.

This graph is similar to a boxplot, but represents different values.

BIPLOT

Source: Exapmles on the Palmer Penguins Dataset

Field: Multivariate Analysis

A way to display, together, the loadings and projections of the first two components while performing a principal component analysis.

This graphs overlay two information: the projected data are shown as a scatterplot, and the loadings are shown as arrows starting from the center.

MANHATTAN PLOT

Source: Wikimedia Commons - M. Kamran Ikram et al

Field: Genomics / Genome Wide Association Studies

Across all chromosomes, which genetic marker is statistically associated with a phenotypic trait or an illness?

The x axis represents the loci, the position across all chromosomes, which are displayed by color. The y axis represents the minus log p-value of the association test between the loci and the phenotypic trait which is being studied.

CIRCULAR CHORD PLOT - GENOMICS

Source: Gupta et al., 2015, PNAS

Field: Genomics / Evolution

Circular Plots are often used in genomics to represent translocation and rearrangements of chromosome fragments across the chromosome of an organism. In general, circular visual models are very common in genomics.

The outer ring represents the chromosomes, the inner ring represents genomic rearrangements. The main axis, x, mapped to the position on the chromosome, and is represented by an angle, generally named theta, on polar coordinates.

CIRCULAR CHORD PLOT - MIGRATIONS

Source: Guy J. Abel - migest R package

Field: Humanitarian Studies / Migration

The same specialized visual model, can be used to represent analogous phenomenon across multiple disciplines, in this case the circular chord plot is used to represent migrations within and across continents.

The outer ring represents the scale of the migration by continent. The size of the migration is represented on the x axis and it is mapped to the theta angle in polar coordinates. The inner ring represent the direction and size of the migration.

HURRICANE PROBABILITY CONE

Source: NOAA - Hurricane Center

Field: Weather Forecast / Hurricane Warning

The uncertainty of the hurricane’s path is mapped to the cone. There’s evidence that, instead, we intuitively associate the size of the cone to the forecasted size of the hurricane, which we perceive as getting bigger with time.

The cone shows the foretasted path of the hurricane, from its current position. The size of the cone grows with the uncertainty in the path’s forecast

Changing Expectations

We’ve seen examples that for each data driven discipline there’s a set of specific and elaborated visual models that are used often and repeatedly.

You might think that they are great or that they are bad, but they are the plots being used.

If you try to change them you’ll encounter opposition, because you’ll require your readers to go through a mental burden, when they are exposed to a change to the visual model that they are used to process intuitively.

Changing Expectations

Data visualization is often processed intuitively. This is its power, that let’s us process complex information with very little cognitive burden.

If you change to much what people expect, you will place a burden on them. They might have a hard time understand the message you are trying to convey, or they might apply the wrong mental model to it, and be misled to wrong conclusions.

Change things gradually and with caution.

Test Your new Ideas

When changing the status quo, when you want to apply a new visual model to a known problem, test your changes.

They might be an improvements for you, but be misleading to others.

Your resources will probably fall short of what is needed for a full cognitive test. Nevertheless you can ask your colleagues, coworkers and friends to go through your graphs and give you feedback about what they understand and what they don’t.

Basic Perception

Nevertheless, there are fundamental perception experiments that tell us how accurately we tent to read different shapes of visual information.

If you want to study this topic, a great place to start is Kennedy Elliot’s blog post: 39 studies about human perception in 30 minutes.

Exercise

Recover your graphs from the exercises at the end of section 5 - Better Graphs Part 1.

If you feel is needed, take 30 minutes to improve them, based on what you have learned in this lesson.

Afterwards, go to your colleagues and ask them to read your graphs. →

Exercise

→ Your colleagues should read your graph and:

  • Describe how the graphical variables are mapped to the data variables (i.e. what’s on the x axis, what’s on the y axis, is some variable encoded by colour, etc.)
  • Select a point in the graph and explain it’s value.
  • Describe the overall trend of the data.

Let your colleagues ask you questions and note down the difficulties that they are facing. These are precious information on how to improve your graphs.

Remember, you are NOT testing your colleagues’ skills, you are testing the readability of your graphs.