Missing Values

What is a missing value?

A missing value is a data point whose value was never recorded or is unknown. Because no one knows what the value was, operations on it cannot give a meaningful result.

In R, missing values are written as NA.

Operations on missing values

Most operations on missing values return NA.

Ask yourself what is one plus “a number that I don’t know” (NA). The answer is “I don’t know” (NA).

NA + 1

[1] NA

NA / 4

[1] NA

NA == 1

[1] NA

NA == NA

[1] NA
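
The word “most” matters: a few operations return a definite answer even when one input is NA, because the result does not depend on the unknown value. A small sketch in base R (the variable names are only illustrative):

```r
# The result is the same whatever the missing value is
or_result  <- NA | TRUE    # TRUE: "anything OR TRUE" is TRUE
and_result <- NA & FALSE   # FALSE: "anything AND FALSE" is FALSE
pow_result <- NA^0         # 1: any number raised to the power 0 is 1

# The result depends on the missing value, so it stays NA
or_unknown <- NA | FALSE   # NA
```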

Testing if a value is missing

To test whether an element is a missing value, we must use the function is.na().

1 %>% is.na()

[1] FALSE

'hello' %>% is.na()

[1] FALSE

TRUE %>% is.na()

[1] FALSE

NA %>% is.na()

[1] TRUE
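
Two edge cases are worth knowing. The string "NA" is ordinary text, not a missing value, and is.na() also returns TRUE for NaN (“not a number”). The sketch uses plain function calls instead of the pipe so that it runs without loading any package; the variable names are illustrative.

```r
text_na  <- is.na("NA")           # FALSE: a character string that happens to spell NA
real_na  <- is.na(NA_character_)  # TRUE: a missing value of type character
nan_test <- is.na(0 / 0)          # TRUE: 0 / 0 is NaN, which is.na() also flags
```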

Missing values in a variable

We can assign missing values to a variable and place them in a vector.

missing_value <- NA

missing_value %>% is.na()

[1] TRUE

Missing values in a vector

When we use is.na() on a vector, it returns a logical vector of the same length, with TRUE in the positions where values are missing.

vector_with_na <- c(1, 5, NA, 10)

vector_with_na %>% is.na()

[1] FALSE FALSE  TRUE FALSE

Remember, we can’t use the == operator to test whether a vector stores NAs: every comparison with NA returns NA.

vector_with_na == NA

[1] NA NA NA NA
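
Since == cannot detect NAs, checks on a vector are built on is.na(). As a sketch in base R, anyNA() answers whether a vector has any missing value at all, and which() turns the logical vector into positions. The vector is redefined here so the snippet is self-contained.

```r
vector_with_na <- c(1, 5, NA, 10)

has_na       <- anyNA(vector_with_na)         # TRUE: at least one value is missing
na_positions <- which(is.na(vector_with_na))  # 3: the third element is the NA
```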

Missing values in a vector

We can count missing values in a vector with is.na() %>% sum().

vector_with_na %>% is.na() %>% sum()

[1] 1
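
Counting works because sum() treats TRUE as 1 and FALSE as 0. For the same reason, mean() of is.na() gives the proportion of missing values. A minimal sketch without the pipe:

```r
vector_with_na <- c(1, 5, NA, 10)

na_count <- sum(is.na(vector_with_na))   # 1
na_share <- mean(is.na(vector_with_na))  # 0.25, one value out of four
```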

Operating on missing values

Many summary functions, such as mean(), median(), and sd(), let you drop missing values before computing by setting the argument na.rm = TRUE.

vector_with_na %>% mean()

[1] NA

vector_with_na %>% mean(na.rm = TRUE)

[1] 5.333333
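
Setting na.rm = TRUE amounts to dropping the NAs before computing. The sketch below checks this by subsetting with !is.na() by hand, and shows the argument on median() and sum() as well. The variable names are illustrative.

```r
vector_with_na <- c(1, 5, NA, 10)

# Keep only the values that are not missing
observed <- vector_with_na[!is.na(vector_with_na)]  # 1 5 10

mean_by_hand <- mean(observed)                        # mean of the observed values
mean_na_rm   <- mean(vector_with_na, na.rm = TRUE)    # same result

median_na_rm <- median(vector_with_na, na.rm = TRUE)  # 5
sum_na_rm    <- sum(vector_with_na, na.rm = TRUE)     # 16
```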

Also, ggplot2’s functions remove NAs automatically for us, usually with a warning, if they would hinder computations.

Data with missing values

Data often have NAs in them; the penguins dataset, for example, does.

To count the missing values in each column, use these lines of code:

penguins %>% 
  summarise(
    across(
      everything(),
      ~is.na(.) %>% sum()
    )
  )

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       0      0              2             2                 2           2
# ℹ 2 more variables: sex <int>, year <int>
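
The same per-column count can be obtained in base R with colSums(is.na(...)). The sketch below uses a small made-up data frame (toy_data and its values are invented for illustration, not taken from penguins) so that it runs without any package.

```r
toy_data <- data.frame(
  island         = c("Biscoe", "Dream", "Torgersen", "Biscoe"),
  bill_length_mm = c(39.1, NA, 40.3, NA),
  body_mass_g    = c(3750, 3800, NA, 4100)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUEs in each column
na_per_column <- colSums(is.na(toy_data))
# island: 0, bill_length_mm: 2, body_mass_g: 1
```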

Data with missing values

We can also count them within the groups of a categorical variable, for example by island:

penguins %>% 
  group_by(island) %>% 
  summarise(
    across(
      everything(),
      ~is.na(.) %>% sum()
    )
  )

# A tibble: 3 × 8
  island    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>       <int>          <int>         <int>             <int>       <int>
1 Biscoe          0              1             1                 1           1
2 Dream           0              0             0                 0           0
3 Torgersen       0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
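
For a single column, a grouped count can also be sketched in base R with tapply(), which applies a function within each level of a grouping variable. The data frame is made up for illustration and does not reproduce the penguins values.

```r
toy_data <- data.frame(
  island         = c("Biscoe", "Biscoe", "Dream", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, NA, 37.2, NA, NA)
)

# Count the missing bill lengths within each island
na_by_island <- tapply(is.na(toy_data$bill_length_mm), toy_data$island, sum)
# Biscoe: 1, Dream: 0, Torgersen: 2
```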

Strategy for missing values

Often data have missing values.

The most important thing when you get new data is to figure out how many missing values it contains and where they are.

Afterward, you can decide whether to remove them or to impute them.
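
As a rough sketch of the two options on a small made-up vector (the names and values are illustrative, and the median is only one of many possible imputation choices):

```r
measurements <- c(2, NA, 4, 9, NA)

# Option 1: remove the missing values
removed <- measurements[!is.na(measurements)]  # 2 4 9

# Option 2: impute, here with the median of the observed values
fill_value <- median(measurements, na.rm = TRUE)  # 4
imputed <- ifelse(is.na(measurements), fill_value, measurements)  # 2 4 4 9 4
```

Removal shortens the vector, which matters when the values must stay aligned with other columns. Imputation keeps the length but changes the distribution of the data.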

Learn more about missing values in R for Data Science (R4DS).

Exercise

Identify the columns of the penguins dataset that contain NAs.

Substitute the missing values:

With 0s.
With the mean of the column.
