12  Mixed-Effects Models

12.1 Prerequisites

Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 12.20.

  1. What is the primary limitation of repeated-measures ANOVA for analysing data with multiple sources of random variation (e.g., participants and items)?
  2. What happens to the standard errors when you fit an ordinary linear regression to clustered (non-independent) observations?
  3. In a mixed-effects model, what do random effects represent, and when should you declare a factor as random rather than fixed?

12.2 Learning objectives

By the end of this chapter you should be able to:

  • Distinguish fixed, random, nested, and crossed effects.
  • Fit linear and generalised linear mixed models with lme4::lmer() and lme4::glmer() and read their output.
  • Specify random intercepts ((1 | group)), random slopes ((x | group)), and uncorrelated random effects ((1 + x || group)).
  • Explain why p-values and degrees of freedom are contested in mixed models, and use lmerTest, pbkrtest, or parametric bootstrap to obtain approximate tests.
  • Diagnose convergence warnings and take appropriate action: rescaling, simplifying random effects, switching optimiser.
  • Compute and interpret the intraclass correlation coefficient (ICC).
  • Recognise crossed random effects (e.g., subjects and items) and write the corresponding lmer() formula.

12.3 Orientation

Clustered and repeated-measures data are the norm in biomedicine: patients nested in hospitals, observations nested within subjects, multiple pathology slides per patient, multiple time points per individual. Ignoring the clustering leads to incorrect standard errors, usually too small, because the analysis treats the nominal sample size as independent information when the effective sample size is far smaller. Fitting a model that treats every cluster level as a separate parameter (one intercept per subject) wastes degrees of freedom and does not generalise to new subjects.

Mixed-effects models are the principled middle path: they treat between-cluster variation as a random component drawn from a distribution, with the variance of that distribution estimated from the data. This produces correctly-calibrated standard errors, generalisable inference, and ‘partial pooling’ of cluster-specific estimates toward the overall mean. The lme4 package by Douglas Bates is the standard implementation in R.

12.4 The statistician’s contribution

Mixed models are not harder to fit than ordinary linear models: lmer(y ~ x + (1 | subject), data = d) is trivial syntax. They are harder to specify, interpret, and defend. The judgement calls are where the analysis lives.

Fixed or random? A factor is fixed when its levels are the only ones of interest (treatment vs. control: nobody cares about a hypothetical ‘treatment B’ that was not in the trial). It is random when the observed levels are a sample from a population, and the inferential target is the population (subjects, hospitals, sites, items in a psycholinguistic experiment). The same variable can be either depending on the question. With many levels and sparse per-level data, random tends to be more honest; with few levels and ample data per level, fixed may be cleaner. The decision is not purely statistical; it depends on the scientific question.

Which random effects to include. Barr et al. (2013) made the case for ‘maximal’ random effects: random slopes for every fixed effect that varies within cluster. The cost is convergence; the benefit is honest standard errors. Matuschek et al. (2017) argue for parsimony when maximal models do not converge. The right answer depends on the data: if you have 8 subjects and 1000 trials each, a random slope per subject is feasible; if you have 50 subjects and 4 trials each, it is probably not.

ICC interpretation. The intraclass correlation \(\rho = \sigma^2_\text{between} / (\sigma^2_\text{between} + \sigma^2_\text{within})\) measures the proportion of total variance attributable to the cluster level. ICC = 0.05 means clusters explain 5% of the variance; ICC = 0.5 means half. The implications for analysis differ: with a low ICC, ignoring clustering may not disastrously bias inference (though it should still be modelled); with a high ICC, ignoring it certainly does. Knowing the ICC of your design before peer review is much better than learning it during.

Convergence warnings are not optional. A non-converging mixed model has not been fit. Reporting its coefficients is reporting numerical noise. The right responses (rescaling predictors, simplifying random effects, switching optimiser) are documented and reproducible. Ignoring the warning and reporting anyway is a serious methodological failure.

These decisions distinguish mixed-model analyses that survive peer review from ones that do not.

12.5 Why ordinary regression fails on clustered data

Suppose you have repeated measurements on each of \(n\) subjects: \(K\) trials per subject, \(nK\) observations total. A naive lm(y ~ x, data = d) treats all \(nK\) observations as independent.

Two problems result:

  1. Standard errors too small. The effective sample size is roughly \(nK / (1 + (K-1) \rho)\), not \(nK\), where \(\rho\) is the ICC. For \(K = 10\), \(\rho = 0.5\), the effective sample size is \(n \cdot 10 / 5.5 \approx 1.8 n\), not \(10 n\). Ignoring this inflates Type-I error from the nominal 5% to 20–30% in many designs.

  2. Misallocated variance. A subject-specific intercept that is not modelled as a random effect leaks into the residual error term, inflating \(\hat\sigma\) and reducing the apparent precision of within-subject effects.

The ‘aggregating to subject means’ alternative throws away within-subject variability, often halving statistical power. The ‘separate analysis per subject’ alternative does not generalise to new subjects.

Mixed models fix both problems simultaneously by modelling between-cluster variation explicitly.
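
The inflation in point 1 is easy to check by simulation. The sketch below uses assumed parameters (20 subjects, 10 observations each, ICC = 0.5) and a cluster-level predictor with no true effect; the empirical Type-I error of the naive lm() fit typically lands far above the nominal 5%.

set.seed(1)
n <- 20; K <- 10
pvals <- replicate(1000, {
  subj <- rep(seq_len(n), each = K)
  x <- rep(rnorm(n), each = K)             # cluster-level predictor, true effect = 0
  u <- rnorm(n, sd = 1)                    # subject intercepts; ICC = 1 / (1 + 1) = 0.5
  y <- u[subj] + rnorm(n * K, sd = 1)      # residual sd = 1
  summary(lm(y ~ x))$coefficients["x", 4]  # naive p-value for x
})
mean(pvals < 0.05)                         # empirical Type-I error, well above 0.05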

Question. A clinical trial has 100 patients, 50 in each arm, with 10 follow-up visits per patient. You fit lm(blood_pressure ~ treatment, data = visits) ignoring the patient-level clustering. The reported standard error on the treatment coefficient is 0.5. Roughly what should the standard error be from a properly specified mixed model, assuming an ICC of 0.6?

Answer.

The naive \(n\) is 1000; the effective \(n\) for the treatment effect is roughly \(1000 / (1 + 9 \cdot 0.6) = 1000 / 6.4 \approx 156\). The naive SE understates the true SE by a factor of about \(\sqrt{1000/156} \approx 2.5\). So the mixed-model SE should be roughly \(0.5 \cdot 2.5 = 1.25\). The factor of 2.5 between the naive and correct SE means a treatment effect that looked highly significant (\(p = 0.001\)) under the naive analysis may be only marginal (\(p = 0.05\)) under the correct one. This is the cost of ignoring clustering.

12.6 Fixed vs. random effects

A fixed effect estimates a parameter at the levels you observed. The treatment in a clinical trial is fixed: only those treatments are of interest, and replication of the trial would use the same arms.

A random effect treats the observed levels as a sample from a distribution, with that distribution’s parameters estimated from the data. Subjects in a longitudinal study are random: this study’s 100 subjects are not the inferential target; the population they were sampled from is.

Practical heuristic: if you would replicate the experiment with new units (subjects, items, sites), and the new units would not be the same physical entities as the old, that factor is random.

Mathematically, for a random intercepts model:

\[ y_{ij} = \alpha + \mathbf{x}_{ij}^T \boldsymbol\beta + u_j + \varepsilon_{ij}, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2). \]

The fixed effects (\(\alpha\), \(\boldsymbol\beta\)) and the two variance components (\(\sigma_u^2\), \(\sigma^2\)) are estimated. The cluster-specific deviations \(u_j\) are predicted (BLUPs, Best Linear Unbiased Predictors) rather than estimated, and are pulled toward zero by an amount that depends on how much data each cluster has. This is partial pooling: clusters with little data borrow strength from clusters with more data.

12.7 lmer() syntax

library(lme4)

# random intercept by subject
lmer(y ~ x + (1 | subject), data = d)

# random intercept and random slope on x
lmer(y ~ x + (1 + x | subject), data = d)
lmer(y ~ x + (x | subject), data = d)              # equivalent

# uncorrelated random intercept and slope
lmer(y ~ x + (1 + x || subject), data = d)
lmer(y ~ x + (1 | subject) + (0 + x | subject), data = d)  # equivalent

# nested random effects: clinics within regions
lmer(y ~ x + (1 | region/clinic), data = d)
lmer(y ~ x + (1 | region) + (1 | region:clinic), data = d)  # equivalent

# crossed random effects: subjects and items
lmer(y ~ x + (1 | subject) + (1 | item), data = d)

The | separator: random-effects formula on the left, grouping factor on the right. The double bar || removes the correlation between random intercept and slope (note that || expands correctly only for numeric covariates, not factors).

The choice of random-effects structure is consequential. Random slopes for fixed effects that vary within cluster are sometimes essential for honest inference (Barr et al., 2013). They are also computationally hard: a model with many random slopes may not converge.

12.8 REML vs. ML

lmer() defaults to REML (restricted maximum likelihood) estimation, which reduces the downward bias of maximum-likelihood variance-component estimates. For comparing nested models that differ in fixed effects, refit with ML (REML = FALSE):

m1 <- lmer(y ~ x1 + (1 | g), data = d, REML = FALSE)
m2 <- lmer(y ~ x1 + x2 + (1 | g), data = d, REML = FALSE)
anova(m1, m2)

anova() automatically refits with ML when comparing fixed-effect structures. For comparing models that differ in random effects, REML is correct (and is what anova() uses if both models are REML-fitted).

12.9 p-values and degrees of freedom

summary(lmer_fit) does not show p-values for fixed effects, because the appropriate reference distribution is not exactly known. The denominator degrees of freedom for the t-statistic depend on the random-effects structure, the data, and the test in subtle ways.

Three approaches:

  1. lmerTest::lmer() wraps lmer() and adds Satterthwaite-approximated degrees of freedom and p-values:

    library(lmerTest)
    fit <- lmer(y ~ x + (1 | subject), data = d)
    summary(fit)        # now includes p-values

    Reasonable default for moderate sample sizes.

  2. pbkrtest::KRmodcomp() uses the Kenward-Roger approximation, generally more accurate for small samples but more expensive.

  3. Likelihood ratio tests via parametric bootstrap. pbkrtest::PBmodcomp() is the gold standard for small-sample inference but slow.

For variance components, likelihood-ratio tests have a boundary issue: the null hypothesis \(\sigma^2_u = 0\) places the parameter on the boundary of the parameter space, and the usual \(\chi^2_1\) reference distribution is wrong (it should be a 50:50 mixture of \(\chi^2_0\), a point mass at zero, and \(\chi^2_1\)). The standard fix is to halve the reported p-value.
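
A minimal sketch of the halved p-value, assuming a hypothetical data frame d with response y, fixed effect x, and grouping factor g:

library(lme4)
m0 <- lm(y ~ x, data = d)                            # null: no random intercept
m1 <- lmer(y ~ x + (1 | g), data = d, REML = FALSE)  # alternative
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))
0.5 * pchisq(lrt, df = 1, lower.tail = FALSE)        # halved for the boundary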

12.10 Predicted random effects (BLUPs)

ranef(fit) returns the cluster-specific deviations from the fixed-effect estimates. These are useful for:

  • Plotting cluster-specific fits.
  • Identifying outlier clusters (extreme BLUPs).
  • Reporting between-cluster variation.
library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject),
            data = sleepstudy)

re <- ranef(fit)$Subject       # data frame of intercept and slope BLUPs
head(re)

BLUPs are shrunk toward zero relative to the cluster-specific OLS estimates. The shrinkage is greater for clusters with less data. This shrinkage is a feature, not a bug: it stabilises estimates for sparsely-sampled clusters by borrowing information from the population.
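
The shrinkage is easy to see in the sleepstudy fit above by comparing per-subject OLS slopes with the model’s subject-specific slopes (fixed effect plus BLUP):

ols <- sapply(split(sleepstudy, sleepstudy$Subject),
              function(s) coef(lm(Reaction ~ Days, data = s))["Days"])
blup <- coef(fit)$Subject$Days   # fixed slope + BLUP deviation, per subject
range(ols)                       # wider spread across subjects
range(blup)                      # narrower: pulled toward the population slope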

12.11 ICC: how much clustering is there?

For a random-intercept model, the intraclass correlation:

\[ \rho = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2}. \]

# variance components; assumes a random-intercept fit
vc <- as.data.frame(VarCorr(fit))
icc <- vc$vcov[1] / sum(vc$vcov)   # between-cluster / total
icc

Or use performance::icc(fit) for a tidier interface. The ICC quantifies how much of the residual variance is between-cluster vs. within-cluster:

  • ICC = 0: clusters do not differ; ordinary regression would be fine (rare).
  • ICC = 0.05–0.2: typical of randomised trials with modest cluster effects.
  • ICC = 0.3–0.6: typical of biological measurements with strong subject differences.
  • ICC = 0.9: nearly all variance is between clusters; you may have very few independent observations.

Reporting the ICC alongside fixed-effect estimates makes the strength of clustering transparent.

Question. Your data have ICC = 0.4 with 20 patients in each cluster. By what factor is the effective sample size reduced compared to the nominal sample size?

Answer.

Design effect \(= 1 + (K - 1) \rho = 1 + 19 \cdot 0.4 = 8.6\). Effective sample size is the nominal divided by the design effect: if you have 100 nominal observations (5 clusters of 20), the effective sample size is roughly 12. This is why ignoring clustering produces catastrophic over-confidence: the analysis behaves as though it had nearly nine times more independent information than it actually does.
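
A small helper (a hypothetical convenience function, not part of any package) makes the design-effect arithmetic reusable:

eff_n <- function(n_total, K, icc) n_total / (1 + (K - 1) * icc)
eff_n(100, K = 20, icc = 0.4)    # ~11.6, matching the answer above
eff_n(1000, K = 10, icc = 0.6)   # ~156, the Section 12.5 example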

12.12 Convergence warnings

lmer warnings are common with realistic data and complex random-effects structures. Common messages:

  • ‘Model failed to converge with max|grad| …’
  • ‘Model is nearly unidentifiable: large eigenvalue ratio’
  • ‘singular fit: see ?isSingular’

Each indicates the optimiser stopped before finding a clearly-defined optimum. Reporting the model output without addressing the warning is unsafe.

Standard remedies, roughly in order:

  1. Rescale predictors. Continuous predictors with wildly different units inflate the condition number. Centre and scale (e.g., with scale()).
  2. Simplify random effects. Drop random slopes for predictors that vary little within cluster, or use || to remove correlations between random effects.
  3. Switch optimiser. control = lmerControl(optimizer = "bobyqa") often succeeds where the default fails.
  4. Increase iteration limit. optCtrl = list(maxfun = 1e5).
  5. Use the singular-fit check. isSingular(fit) indicates that one or more variance components have collapsed to (near) zero. The corresponding random effect should probably be removed.

If none of these work, the data may not be rich enough to support the model you specified. Simplify the random-effects structure, starting with the slopes that have the smallest estimated variance, until you have a converged, interpretable model. Document what you tried; a sketch of the remedy sequence follows.
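
A hedged sketch of that sequence, assuming a hypothetical data frame d with response y, a poorly scaled predictor x, and grouping factor g:

library(lme4)
d$x_z <- as.numeric(scale(d$x))                  # 1. centre and scale
fit <- lmer(y ~ x_z + (1 + x_z || g), data = d)  # 2. uncorrelated intercept and slope
fit <- update(fit, control = lmerControl(
  optimizer = "bobyqa",                          # 3. alternative optimiser
  optCtrl = list(maxfun = 1e5)))                 # 4. more iterations
isSingular(fit)                                  # 5. TRUE: a variance component is ~0; drop it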

12.13 GLMMs with glmer()

glmer() extends mixed models to non-normal outcomes:

glmer(y ~ x + (1 | subject),
      family = binomial, data = d)            # logistic GLMM

glmer(count ~ x + (1 | subject),
      family = poisson, data = d)              # Poisson GLMM

Computational cost is substantially higher than lmer: the random-effects integration has no closed form for non-normal outcomes. The default Laplace approximation works well for moderate ICC and reasonable cluster sizes; for small clusters or high ICC, increase nAGQ (adaptive Gauss-Hermite quadrature points): nAGQ = 7 is often a good compromise. Note that glmer() accepts nAGQ > 1 only for models with a single scalar random effect.
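
For example (hypothetical data frame d with a binary outcome y):

glmer(y ~ x + (1 | subject),
      family = binomial, data = d,
      nAGQ = 7)                    # 7 adaptive quadrature points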

Convergence problems are even more common in GLMMs than LMMs. The same remedies apply, plus the consideration that binary GLMMs with sparse data often have boundary issues where one or more random effects have estimated variance near zero, a flag for re-thinking the model.

For overdispersed counts, MASS::glmmPQL() or glmmTMB::glmmTMB() provide more flexible alternatives; the latter is faster and supports a richer family of distributions (negative binomial, beta-binomial, zero-inflated, hurdle).
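
A sketch of the glmmTMB route for overdispersed counts (assuming the package is installed; nbinom2 is its negative-binomial family):

library(glmmTMB)
glmmTMB(count ~ x + (1 | subject),
        family = nbinom2, data = d)   # negative-binomial GLMM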

12.14 Worked example: sleep deprivation study

library(lme4)
library(lmerTest)

data(sleepstudy)

# subjects measured over 10 days; reaction time as response
fit <- lmer(Reaction ~ Days + (Days | Subject),
            data = sleepstudy)
summary(fit)

# fixed effect: average reaction time increases by ~10ms/day
# random effects: subjects vary in baseline RT and in slope on Days

# ICC of the intercept
performance::icc(fit)

# subject-specific predicted slopes
ranef(fit)$Subject

The fixed effects table shows the population-average effect of sleep deprivation. The random-effects variance components show how much subjects vary in baseline RT and in their day-by-day slopes. The correlation between the two random effects (often substantial in such data) indicates whether subjects with high baselines also have steep slopes.

12.15 Collaborating with an LLM on mixed models

LLMs handle the syntax of lmer reasonably well; they handle the conceptual judgements much less reliably.

Prompt 1: writing the formula. Describe the data structure (subjects nested within sites; items crossed with subjects; etc.) and ask: ‘write the lmer formula for the correct random-effects structure.’

What to watch for. Confusion between nested and crossed random effects is the most common LLM error. For ‘subjects within sites’, the formula is (1 | site/subject) if subject IDs are reused across sites, or (1 | site) + (1 | subject) if subject IDs are unique. For ‘subjects answering items’, subjects and items are crossed, so (1 | subject) + (1 | item).

Verification. Ask the LLM to expand its proposed formula into the explicit form (e.g., (1 | site/subject) expands to (1 | site) + (1 | site:subject)). If the expansion is correct, the formula is correct.
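
The expansion can also be checked mechanically with lme4’s formula parser; a quick sketch (the formula here is illustrative, not fitted to data):

lme4::findbars(y ~ x + (1 | site/subject))
# should return two terms: 1 | site and 1 | site:subject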

Prompt 2: handling convergence warnings. Paste the warning message and ask: ‘what is happening, and what should I try?’

What to watch for. The response should cover the standard suite of remedies from Section 12.12. If the LLM suggests ‘use lmerTest’, that is not a fix for convergence; it is just a way to get p-values once the model converges. If it suggests ‘just use the model anyway’, push back: a non-converging model is not a fit.

Verification. Apply each suggested remedy in turn. Track which one resolves the warning. Often the answer is to rescale predictors and to drop a random slope; rarely is it any single one of the more exotic options.

Prompt 3: interpreting random-effect variance. Paste the summary(fit) output and ask: ‘how should I interpret the random-effects variance components?’

What to watch for. The variance components are reported as standard deviations and variances. The interpretation is on the response scale (or, for GLMMs, on the link scale). Make sure the LLM does not confuse the random-effect variance (\(\sigma^2_u\)) with the residual variance (\(\sigma^2\)) or the fixed-effect coefficients.

Verification. The ICC is a sanity check: compute it from the variance components and confirm it falls in a plausible range for your design.

12.16 Principle in use

Three habits define defensible mixed-model practice:

  1. Specify random effects deliberately. Identify which factors are sampled (random) and which are fixed. Include random slopes for within-cluster fixed effects when the data support them; drop them when convergence demands it.
  2. Report the ICC. The strength of clustering is part of the result, not a nuisance. Readers need to know whether you are reporting a tightly-clustered design with little independent information or a loosely-clustered one.
  3. Take convergence warnings seriously. A non-converging mixed model has not been fit. Document what you did to address the warning; ignoring it is not a defensible choice.

12.17 Exercises

  1. Using lme4::sleepstudy, fit lmer(Reaction ~ Days + (Days | Subject)). Extract the fixed effects, the variance components, and the BLUPs. Plot the subject-specific fitted lines on top of the data.
  2. Compare the fixed-effect slope from exercise 1 to the slope from lm(Reaction ~ Days) ignoring subject. Explain why they differ.
  3. Simulate data from a random-intercept logistic GLMM with a known ICC. Fit the model with glmer() and check whether the fitted ICC is close to the truth.
  4. Refit the sleepstudy model with uncorrelated random intercepts and slopes (||). Compare AIC. Does the correlation matter?
  5. Take the sleepstudy data and induce a non-convergence warning by including a random slope for a constructed covariate that has no within-subject variation. Apply the standard remedies in order and document which one resolves the warning.

12.18 Further reading

  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
  • Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
  • Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.

12.19 Practice test

The following multiple-choice questions exercise the chapter’s content. Attempt each question before expanding the answer.

12.19.1 Question 1

What is the primary limitation of repeated-measures ANOVA when analysing data where participants respond to multiple trials?

    A. They require balanced data designs
    B. They cannot model both participant-level and item-level variability simultaneously
    C. They are computationally too expensive
    D. They require normally distributed data

B. Repeated-measures ANOVA handles a single random-factor structure. Mixed-effects models generalise to multiple crossed or nested random factors.

12.19.2 Question 2

What happens when observations are not independent in standard regression analysis?

    A. The model will automatically correct for this violation
    B. The results will be exactly the same as with independent observations
    C. The independence assumption is violated, leading to incorrect standard errors
    D. The model will fail to converge

C. Non-independence means the nominal sample size overstates the amount of independent information, producing incorrectly narrow standard errors.

12.19.3 Question 3

In the context of mixed-effects models, what are ‘random effects’?

    A. Effects that are randomly selected by the researcher
    B. Statistical noise that should be eliminated
    C. Effects that model random variation in grouping factors (like participants)
    D. Unpredictable variables that cannot be modeled

C. Random effects allow inference to generalise to the population of groups from which the observed levels were drawn.

12.19.4 Question 4

lmer returns a ‘singular fit’ warning. The most likely interpretation is:

    A. A clear bug in lme4; report it.
    B. One or more variance components have collapsed to near zero, indicating a random effect that the data cannot support.
    C. The fixed effects are not identifiable.
    D. Ignore it; singular fits are still valid.

B. Drop the corresponding random effect (or simplify the structure) and refit. A singular fit is not a model that can be reported as-is.

12.19.5 Question 5

Your design has 20 clusters of size 30, with ICC = 0.3. Approximately how does the effective sample size compare to the nominal sample size of 600?

    A. The same, 600.
    B. About 62 (effective).
    C. About 300 (effective).
    D. About 30 (effective).

B. Design effect = \(1 + 29 \cdot 0.3 = 9.7\). Effective \(n = 600 / 9.7 \approx 62\). This dramatic reduction is the standard motivation for mixed models: the naive sample size enormously overstates the information content of clustered data.

12.20 Prerequisites answers

  1. Repeated-measures ANOVA cannot simultaneously model multiple crossed or nested sources of random variation (e.g., participants and items). It assumes a single random-factor structure and balanced data, both of which are commonly violated in modern designs.
  2. Non-independence leaves the point estimates approximately unbiased but produces incorrect standard errors, typically too small, so confidence intervals are too narrow and p-values too optimistic. The inflation factor is roughly \(\sqrt{1 + (K - 1) \rho}\) for clusters of size \(K\) with ICC \(\rho\). For a typical longitudinal study with ICC = 0.5 and 10 visits per subject, the naive standard errors are about \(\sqrt{5.5} \approx 2.3\) times too small.
  3. Random effects model variation in grouping factors whose levels are a sample from a larger population of interest (e.g., subjects, clinics, items). Declare a factor random when you wish to generalise inference to the population of levels, not just the specific levels observed; declare it fixed when the observed levels are themselves the inferential target. The same variable can be either, depending on the question.