Summary statistics

Making sense of biological variation

BIOL33031/BIOL65161

Why summarise data?

Biological data are often large and messy.

Summary statistics reduce many data points down to a few values that capture the overall picture. Consider our penguins from last time.

This first plot contains the raw data of all recorded body masses for all 344 penguins.

The second contains a summary commonly known as a box plot.

Which one do you find easier for picking out meaningful differences?

The key tools

In the tidyverse, we use two main functions for summarisation:

  • summarise() – creates new summary values, like a mean or standard deviation.
  • group_by() – tells R to calculate summaries separately for each group (e.g. for each species).

There are also helper functions that make our life easier:

  • mean(), median(), sd() – common ways to summarise numbers used in biology
  • n() – the number of rows in each group
  • count() – a shortcut for counting rows in each group

Calculating means

Let’s start simple. Suppose we want the overall mean bill length across all penguins.

library(tidyverse)
library(palmerpenguins)

penguins |>
  summarise(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
  mean_bill_length
             <dbl>
1             43.9

The average bill length of all penguins that were sampled is 43.9 mm.
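The na.rm = TRUE argument matters because some rows in penguins have missing measurements, and mean() returns NA if any of its inputs is missing. A minimal sketch of that behaviour, using a made-up toy vector rather than the penguin data:

```r
# Toy values for illustration only (not taken from the penguins data)
bill_lengths <- c(39.1, NA, 40.3)

# With a missing value present, the mean is itself missing
mean_with_na <- mean(bill_lengths)

# na.rm = TRUE drops missing values before averaging
mean_without_na <- mean(bill_lengths, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

The same argument is accepted by median() and sd().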

Often we don’t just want one overall number, but a number for each group in the data. For example, what if we want the average bill length per species?

penguins |>
  group_by(species) |>
  summarise(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   mean_bill_length
  <fct>                <dbl>
1 Adelie                38.8
2 Chinstrap             48.8
3 Gentoo                47.5

Here group_by() separates the data into groups by species, and summarise() calculates the mean within each of those groups.

Tallying group sizes

Another simple question: how many penguins of each species do we have?

penguins |>
  count(species)

The function count() is shorthand for:

penguins |>
  group_by(species) |>
  summarise(n = n())
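To check that the two forms really agree, both can be run on a small table and compared. This is a sketch using a hypothetical toy data set (the birds table and its species labels are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: one row per observed bird
birds <- tibble(species = c("A", "B", "A", "C", "A", "B"))

# Shortcut form
via_count <- birds |>
  count(species)

# Long form: group, then count rows per group
via_summarise <- birds |>
  group_by(species) |>
  summarise(n = n())

# Compare as plain data frames so that only the contents matter
same_result <- isTRUE(all.equal(as.data.frame(via_count),
                                as.data.frame(via_summarise)))
print(same_result)
```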

How many of each species were recorded on each island?

penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Only Adélie penguins were found on all three islands; Chinstrap and Gentoo were only recorded on Dream and Biscoe, respectively.

Notice count() hasn’t given us a zero-count for Chinstraps and Gentoos on those other islands. We can add the zeros in using complete():

penguins |>
  count(species, island) |>
  complete(species, island, fill = list(n = 0))
# A tibble: 9 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe        0
5 Chinstrap Dream        68
6 Chinstrap Torgersen     0
7 Gentoo    Biscoe      124
8 Gentoo    Dream         0
9 Gentoo    Torgersen     0

Multiple summaries at once

We are not limited to one summary per group. For example, we might want the mean, the standard deviation, and the sample size all together.

penguins |>
  group_by(species, sex) |>
  summarise(
    mean_length = mean(bill_length_mm, na.rm = TRUE),
    sd_length = sd(bill_length_mm, na.rm = TRUE),
    n = n()
  )
# A tibble: 8 × 5
# Groups:   species [3]
  species   sex    mean_length sd_length     n
  <fct>     <fct>        <dbl>     <dbl> <int>
1 Adelie    female        37.3      2.03    73
2 Adelie    male          40.4      2.28    73
3 Adelie    <NA>          37.8      2.80     6
4 Chinstrap female        46.6      3.11    34
5 Chinstrap male          51.1      1.56    34
6 Gentoo    female        45.6      2.05    58
7 Gentoo    male          49.5      2.72    61
8 Gentoo    <NA>          45.6      1.37     5

The standard deviation, sd(), measures how spread out the data are around the mean: here, how variable bill length is among individual birds of the same species and sex. Reporting it alongside the mean gives a richer description of bill length than the mean alone.
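The "# Groups: species [3]" line in the output above shows that the result is still grouped by species: by default, summarise() removes only the last grouping variable (here sex). If later steps should treat the result as an ordinary ungrouped table, one option is the .groups argument of summarise(). A sketch with a hypothetical toy table (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy measurements, for illustration only
toy <- tibble(
  species = c("A", "A", "A", "B", "B", "B"),
  sex     = c("f", "m", "m", "f", "f", "m"),
  length  = c(37, 40, 41, 46, 47, 51)
)

# Default behaviour: the result stays grouped by species
still_grouped <- toy |>
  group_by(species, sex) |>
  summarise(mean_length = mean(length), n = n())

# .groups = "drop" returns an ungrouped table
ungrouped <- toy |>
  group_by(species, sex) |>
  summarise(mean_length = mean(length), n = n(), .groups = "drop")

group_vars(still_grouped)  # grouping variables that remain
group_vars(ungrouped)      # none left
```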

Measures of central tendency

Different summaries describe the centre of the data in different ways.

  • Mean: arithmetic average; sensitive to extreme values.
  • Median: middle value; robust to outliers.
  • Mode: most common value; rarely used for continuous measures.

Example (mean and median flipper length by species):

# A tibble: 3 × 3
  species   mean_flipper median_flipper
  <fct>            <dbl>          <dbl>
1 Adelie            190.            190
2 Chinstrap         196.            196
3 Gentoo            217.            216
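
The code behind this table is not shown on the slide. A table of this shape could be produced along the following lines; this is a sketch that assumes the penguins data from palmerpenguins and borrows the column names from the printed output:

```r
library(dplyr)
library(palmerpenguins)

# Mean and median flipper length for each species, ignoring missing values
flipper_centre <- penguins |>
  group_by(species) |>
  summarise(
    mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
    median_flipper = median(flipper_length_mm, na.rm = TRUE)
  )

flipper_centre
```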

Measures of spread

Two groups can have the same average but very different variability.
Measures of spread help us compare how consistent or scattered values are.

  • Range: max - min (influenced by outliers).
  • Interquartile range (IQR): the width of the middle 50% of the data (third quartile minus first quartile).
  • Standard deviation (SD): roughly the typical distance of values from the mean.

Example (variation in flipper length by species):

# A tibble: 3 × 5
  species   min_flipper max_flipper iqr_flipper sd_flipper
  <fct>           <int>       <int>       <dbl>      <dbl>
1 Adelie            172         210           9       6.54
2 Chinstrap         178         212          10       7.13
3 Gentoo            203         231           9       6.48
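
The code behind this table is not shown on the slide either. One way to compute these measures of spread, sketched under the assumption that the penguins data from palmerpenguins is used and with column names borrowed from the printed output:

```r
library(dplyr)
library(palmerpenguins)

# Minimum, maximum, IQR and SD of flipper length for each species
flipper_spread <- penguins |>
  group_by(species) |>
  summarise(
    min_flipper = min(flipper_length_mm, na.rm = TRUE),
    max_flipper = max(flipper_length_mm, na.rm = TRUE),
    iqr_flipper = IQR(flipper_length_mm, na.rm = TRUE),
    sd_flipper = sd(flipper_length_mm, na.rm = TRUE)
  )

flipper_spread
```

The range itself is then max_flipper - min_flipper for each species.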

When to use which?

Different combinations of centre and spread are useful for describing the data in different situations:

  • Mean + SD: Best when data are roughly symmetric (bell-shaped) and not affected by extreme values.
  • Median + IQR: Best when data are skewed or contain outliers, because they are more robust.
  • Range: Good for reporting extremes (minimum and maximum), but can be misleading if one unusual value dominates.

Always think about the biological question: are you interested in the “typical” value, the variability, or the extremes?
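
To see why the median and IQR are called robust, compare how each summary responds when one extreme value is added. The sketch below uses hypothetical body masses invented for illustration, and summarise_vector() is a helper defined here, not a tidyverse function:

```r
# Hypothetical body masses in grams, for illustration only
masses <- c(3400, 3500, 3600, 3700, 3800)

# The same values plus one extreme measurement
masses_outlier <- c(masses, 9000)

# Helper defined for this example: four summaries of one numeric vector
summarise_vector <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

without_outlier <- summarise_vector(masses)
with_outlier <- summarise_vector(masses_outlier)

# One row per data set, one column per summary
round(rbind(without_outlier, with_outlier), 1)
```

Here a single outlier moves the mean from 3600 g to 4500 g, while the median only moves from 3600 g to 3650 g; the SD grows far more than the IQR does.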

Standard deviation (SD) vs standard error (SE)

SD and SE are often confused, but they mean different things.

  • Standard deviation (SD): Describes how spread out the data are around the mean.
    Example: how variable are flipper lengths among individual Gentoo penguins?

  • Standard error (SE): Describes how precisely the mean has been estimated. It depends on both the spread of the data and the sample size.
    Example: how close is our sample mean flipper length to the true population mean?

SE is related to SD through the following formula:

\[ SE = \frac{SD}{\sqrt{n}} \]

You may see means reported with either SD or SE (e.g. mean ± SD, mean ± SE). They are not interchangeable: use SD to show biological variation, and SE to show precision of the mean. We will revisit SE when we look at the t-test.
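
Base R has no built-in function for the standard error of the mean, so it is usually computed from sd() and the sample size. A minimal sketch, where standard_error() is a hypothetical helper and the flipper lengths are toy values invented for illustration:

```r
# Hypothetical helper: standard error of the mean, ignoring missing values
standard_error <- function(x) {
  x <- x[!is.na(x)]        # drop missing values first
  sd(x) / sqrt(length(x))  # SE = SD / sqrt(n), with n = non-missing values
}

# Toy flipper lengths in mm, for illustration only
flippers <- c(210, 215, NA, 220, 225)

sd_flippers <- sd(flippers, na.rm = TRUE)
se_flippers <- standard_error(flippers)

print(sd_flippers)
print(se_flippers)
```

Inside summarise(), note that n() counts every row in the group, including rows where the measurement is missing, so sum(!is.na(x)) is the safer choice of n when computing SE.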

Recap & next steps

  • Summarisation helps us see patterns in messy biological data.
  • summarise() and group_by() are the core tidyverse tools.
  • Helper functions like n(), mean(), and count() make common tasks quick.
  • Always watch out for missing values!

In the next session, we’ll build on these tidyverse skills to tackle one of the most important aspects of data analysis and scientific communication: data visualisation.

Penguin jumping out of a hole in icy water, its wing raised as if waving goodbye.

See you next time… Image by Christopher Michel CC BY-NC-SA 2.0