[1] NA
> How to operate on values that are not known.
Otho Mantegazza, Dataviz for Scientists, Part 1.3
A missing value is a data point whose value nobody knows, so the result of any operation on it is unknown too.
In R, missing values are written NA.
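As a minimal sketch of what "unknown" means in practice, the base R snippet below shows NA propagating through arithmetic and summaries. The vector `x` is a made-up example, not data from the course.

```r
# Hypothetical example vector with one unknown value
x <- c(1, NA, 3)

# Arithmetic with an unknown value gives an unknown result
NA + 1   # NA
x * 2    # 2 NA 6

# Summaries are unknown too, unless NAs are dropped explicitly
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2
```

Many base R summary functions (`sum()`, `mean()`, `max()`, ...) accept `na.rm = TRUE` to skip missing values.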
Calling is.na() on a vector returns a logical vector of the same length, with TRUE at every position where the value is missing.
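A short illustration of is.na(), using a made-up vector `x`. Because TRUE counts as 1, summing the result gives the number of missing values.

```r
# Hypothetical example vector with two missing values
x <- c(4, NA, 7, NA)

is.na(x)          # FALSE TRUE FALSE TRUE
sum(is.na(x))     # 2, the number of missing values
which(is.na(x))   # 2 4, the positions of the missing values
```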
Remember, we can’t use the == operator to test whether a vector contains NAs, because comparing anything with NA returns NA.
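The difference can be seen in a small sketch (again with a made-up vector `x`): == returns NA for every element, while is.na() and anyNA() give a usable answer.

```r
# Hypothetical example vector
x <- c(4, NA, 7)

# Comparing with NA asks "is this equal to something unknown?"
x == NA    # NA NA NA
NA == NA   # NA

# Use is.na() or anyNA() instead
is.na(x)   # FALSE TRUE FALSE
anyNA(x)   # TRUE
```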
Real data often contain NAs; the penguins dataset, for example, does.
To count the missing values in each column, apply is.na() to every column and sum the result.
We can also count them stratified by one of the fixed variables, for example island:
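One way to write this count is sketched below. It assumes that the penguins dataset comes from the palmerpenguins package and that dplyr is installed; neither package is named in the text above, so treat both as assumptions.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins dataset

# For every column, count how many values are NA
na_per_column <- penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_per_column
```

In base R, `colSums(is.na(penguins))` returns the same counts as a named vector.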
# A tibble: 3 × 8
  island    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>       <int>          <int>         <int>             <int>       <int>
1 Biscoe          0              1             1                 1           1
2 Dream           0              0             0                 0           0
3 Torgersen       0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
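A table like the one above can be produced by grouping before summarising. This is a sketch under the same assumptions as before (dplyr installed, penguins taken from palmerpenguins), not necessarily the exact code used for the slide.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins dataset

# Count NAs in every column, separately for each island
na_per_island <- penguins %>%
  group_by(island) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_per_island
```

Replacing `island` with another grouping variable, such as `species`, stratifies the count differently.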
Often data have missing values.
The most important thing when you get new data is to figure out how many missing values it contains and where they are.
Afterward, you can decide whether to remove them or impute them.
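The two options can be sketched in base R on a toy data frame. The data frame `measurements` and the choice of mean imputation are illustrative assumptions; which strategy is appropriate depends on the data and the analysis.

```r
# Hypothetical toy data, not taken from the penguins dataset
measurements <- data.frame(
  id     = 1:5,
  weight = c(10, NA, 14, 12, NA)
)

# Option 1: remove every row that contains at least one NA
complete <- na.omit(measurements)
nrow(complete)   # 3

# Option 2: impute, here by replacing NAs with the mean of the observed values
imputed <- measurements
imputed$weight[is.na(imputed$weight)] <- mean(measurements$weight, na.rm = TRUE)
imputed$weight   # 10 12 14 12 12
```

Removing rows shrinks the dataset, while imputing keeps every row but introduces values that were never measured.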
Learn more about missing values at R4DS.
Identify the columns of the penguins dataset that contain NAs.
Substitute the missing values: