Generalised linear modelling

Author

BIOL33031 / BIOL65161

Understanding response distributions

Link to slides: PowerPoint and PDF.

So far, we’ve assumed that our response variable is continuous and roughly normally distributed — for example, when using lm() or lmer().
But many biological datasets don’t fit that assumption: we often measure counts, proportions, or binary outcomes.
To deal with these cases, we use generalised linear models (GLMs), which let us specify a different distribution for the response.

When normality doesn’t make sense

Let’s imagine two examples:

  1. Counting pigeons in a garden
    The data are counts (0, 1, 2, …) that can’t be negative.
    Counts like these are commonly modelled with a Poisson distribution, which has a single parameter, λ (lambda), equal to both the mean and the variance.

  2. Counting growth on plates
    You might have four plates and record “growth” or “no growth” for each, a binary outcome.
    These data are binomially distributed, with two parameters:

    • size: number of trials (e.g. number of plates)
    • prob: probability of success (e.g. probability of growth)

Both datasets are numeric, but neither is likely to have normally distributed residuals — so a normal‐error model is inappropriate.
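
To get a feel for the two distributions above, here is a minimal sketch using simulated data. The sample sizes, the mean of 3 pigeons per visit, and the growth probability of 0.3 are assumptions chosen only for illustration.

```r
# Simulated data for illustration only; all parameter values are assumptions.
set.seed(1)

# Pigeon counts: 1000 garden visits with an assumed mean of 3 birds per visit
pigeons <- rpois(n = 1000, lambda = 3)
mean(pigeons)  # sample mean, expected to be close to lambda
var(pigeons)   # sample variance, also expected to be close to lambda

# Plates: 1000 experiments of 4 plates each, assumed growth probability 0.3
plates_grown <- rbinom(n = 1000, size = 4, prob = 0.3)
range(plates_grown)  # always between 0 and size
mean(plates_grown)   # expected to be close to size * prob
```

Neither simulated variable can go below zero, and the binomial one can never exceed the number of plates, which is part of why a normal distribution is a poor description of data like these.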

Rather than transforming data to fit a normal model, GLMs let us directly specify an appropriate response family using the family argument in R functions such as glm() or glmer(). Examples include family = poisson for counts or family = binomial for binary outcomes. The video also briefly mentions other types of models — multinomial, negative binomial, and survival models — which follow similar logic but suit different data types. Later videos will show how to fit and interpret these models in practice.
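
As a rough sketch of how the family argument is used, the example below fits a binomial GLM to simulated growth data. The data frame `plates`, its columns, and the growth probabilities are hypothetical and are not taken from the course materials.

```r
# Hypothetical plate data: growth (1) or no growth (0) under two treatments.
set.seed(2)
plates <- data.frame(
  treatment = rep(c("control", "antibiotic"), each = 50)
)

# Assumed growth probabilities, chosen only for illustration
p_growth <- ifelse(plates$treatment == "control", 0.8, 0.3)
plates$growth <- rbinom(n = nrow(plates), size = 1, prob = p_growth)

# Same formula interface as lm(), plus a family argument
glm_binom <- glm(growth ~ treatment, family = binomial, data = plates)
summary(glm_binom)
```

Swapping `family = binomial` for `family = poisson` is all that changes when the response is a count rather than a binary outcome.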

Fitting and checking models

Link to slides: PowerPoint and PDF.

Now that we understand that GLMs allow us to choose different distributions for the response, this video shows how to fit these models and check whether they make sense.

Example: insect counts and spray treatments

We’ll use a classic dataset where insects were counted on agricultural plots treated with different sprays. These are count data, so we use a Poisson model.
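
The code that produced the summary below is not shown on this page. A minimal sketch that should reproduce it, assuming the fitted model is stored as `glm_pois` (the name used in the dispersion check further down), is:

```r
# InsectSprays ships with base R (datasets package), so no download is needed
data(InsectSprays)

# Poisson GLM: insect count as a function of spray type
glm_pois <- glm(count ~ spray, family = poisson, data = InsectSprays)
summary(glm_pois)
```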


Call:
glm(formula = count ~ spray, family = poisson, data = InsectSprays)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67415    0.07581  35.274  < 2e-16 ***
sprayB       0.05588    0.10574   0.528    0.597    
sprayC      -1.94018    0.21389  -9.071  < 2e-16 ***
sprayD      -1.08152    0.15065  -7.179 7.03e-13 ***
sprayE      -1.42139    0.17192  -8.268  < 2e-16 ***
sprayF       0.13926    0.10367   1.343    0.179    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 409.041  on 71  degrees of freedom
Residual deviance:  98.329  on 66  degrees of freedom
AIC: 376.59

Number of Fisher Scoring iterations: 5

This model asks whether insect counts depend on the type of spray — conceptually similar to an ANOVA, but using a Poisson error distribution.
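
To make the ANOVA analogy concrete, one option is an analysis of deviance table, which tests whether adding spray reduces the deviance more than expected by chance. This is a sketch rather than the approach shown in the video, and the chi-squared test assumes the Poisson model fits reasonably well.

```r
# Refit the model so this block runs on its own
glm_pois <- glm(count ~ spray, family = poisson, data = InsectSprays)

# Analysis of deviance: compares the null model with the model including spray
dev_table <- anova(glm_pois, test = "Chisq")
dev_table
```

The deviance attributed to spray is the difference between the null and residual deviance reported in the summary, on 5 degrees of freedom (six sprays minus one).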

Understanding the output

The model summary looks familiar, but there are a few new elements:

  • Residual deviance and null deviance replace the residual standard error reported for lm() models.
  • For Poisson models, the mean and variance are expected to be equal.
  • We can check this by dividing the residual deviance by its degrees of freedom.
  • If that ratio is around 1, great. If it’s much higher, we may have overdispersion: more variability than the model expects.

# Ratio of residual deviance to residual degrees of freedom (glm_pois is the model fitted above)
dispersion <- glm_pois$deviance / glm_pois$df.residual
dispersion

If this ratio is well above 1, the data are more variable than a Poisson model expects. For the insect spray model it is 98.329 / 66 ≈ 1.49, which suggests some overdispersion.
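
To see how this ratio behaves, the sketch below compares simulated counts that really are Poisson with counts that have the same mean but extra variance. The helper name `dispersion_ratio`, the negative binomial simulation, and all parameter values are assumptions made for illustration.

```r
# Simulated counts for illustration only; all parameter values are assumptions.
set.seed(3)
n <- 500

# Counts that really are Poisson (variance equals the mean)
y_pois <- rpois(n, lambda = 10)

# Counts with the same mean but extra variance (negative binomial)
y_over <- rnbinom(n, size = 1, mu = 10)

# Hypothetical helper: residual deviance / residual df
# for an intercept-only Poisson GLM
dispersion_ratio <- function(y) {
  fit <- glm(y ~ 1, family = poisson)
  fit$deviance / fit$df.residual
}

ratio_pois <- dispersion_ratio(y_pois)
ratio_over <- dispersion_ratio(y_over)
ratio_pois  # expected to be close to 1
ratio_over  # expected to be well above 1
```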