Preface

Why this book when an LLM can already ‘do it’?

In April 2026, a graduate biostatistics student can open a browser, type ‘bootstrap a 95% confidence interval for the median of this data’, and receive working R code in under ten seconds. The code will usually run. It will often be correct. And it will sometimes be subtly, dangerously wrong in ways the student is not equipped to detect.
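That opening task can be sketched in a few lines of base R, together with the kind of check this book insists on pairing with every answer. The seed, sample size, and exponential data below are illustrative, not part of any real analysis:

```r
# The opening task: bootstrap a 95% percentile CI for the sample median.
set.seed(1)
x <- rexp(200)  # illustrative skewed sample; true median is log(2) ~ 0.693

B <- 10000
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))
ci <- quantile(boot_medians, c(0.025, 0.975))

# Verification step (the habit this book trains): a percentile interval
# for the median should bracket the observed sample median, and with
# n = 200 it should sit in the neighbourhood of log(2). A gross
# disagreement signals a coding error, not a statistical surprise.
stopifnot(ci[1] < median(x), median(x) < ci[2])
ci
```

The `stopifnot()` line is a necessary, not sufficient, check; the point is that even a ten-second answer deserves one.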

This is the premise of the book.

The premise is not that large language models are bad. They are remarkable tools, and ignoring them in a statistics curriculum would be a pedagogical failure. The premise is not that statisticians should refuse to use them. The book assumes, and encourages, daily LLM use in statistical computing work. The premise is that an LLM’s usefulness to a biostatistician is limited by the statistician’s own ability to verify, critique, and extend what the LLM produces, and that ability must be built through training that does not treat the LLM as a substitute for learning.

Put differently: the student who learns the material in this book will use LLMs more effectively than the student who does not. Not because this book teaches prompt engineering (it teaches a little of that, at the end of each chapter), but because this book teaches the statistical and computational judgment that LLM output requires from a human reviewer.

What this book is not

This book is not:

  • A textbook that pretends LLMs do not exist. A curriculum designed as if 2019 were still the current year trains students for a workplace that has ceased to exist.
  • A textbook about LLMs as such. Prompt engineering, fine-tuning, and the engineering internals of language models are other subjects, covered better elsewhere.
  • An argument that students should skip the hard parts of statistical computing because the LLM will handle them. The opposite is true: the arrival of LLMs raised the bar for what human statisticians must know, because deciding whether the LLM is right is now the statistician’s core contribution.

The book is the middle path: teach statistical computing as it has always been taught, but with a consciously integrated treatment of where the human statistician contributes and what can be delegated to the LLM.

How this book earns its subtitle

Three structural features distinguish the book from its predecessors and deliver on the ‘in the Age of AI’ framing:

  1. Front-loaded consciousness-raising. Each content chapter opens, after the orientation, with a section titled The statistician’s contribution. That section names the two to five decisions about the chapter’s topic that cannot be delegated to an LLM without human input: Is this method appropriate? What is the data structure? Which variant of the procedure applies? These sections are short (roughly one printed page each) but are the intellectual scaffolding of the whole book.

  2. Adversarial LLM practice. Each chapter closes with a section titled Collaborating with an LLM on [topic], containing three prompts deliberately constructed to expose common failure modes. Each prompt is paired with a Verification step that tells the reader what to check and how. The prompts are not there to show off what LLMs can do; they are there to train the reader to catch what LLMs cannot.

  3. Verification-first pedagogy throughout. Every non-trivial code example is paired with a concrete way to check the answer — a closed-form result, an alternative implementation, a simulation, or a unit test. Code that cannot be verified is called out as such.

In addition, the book inherits the structural conventions of the Posit book family (Wickham, 2019; Wickham et al., 2023; Wickham & Bryan, 2023):

  • A three-question Prerequisites quiz at the start of each chapter (in the style of Advanced R), with answers at the foot of the chapter.
  • Check your understanding collapsible callouts distributed through the content, serving as paced comprehension prompts.
  • A Further reading section curating the best next sources.

Positioning against other books

A reader entering graduate biostatistics has many good statistical computing texts to choose from. This book does not replace any of them; it fills a gap that has opened in the past three years.

  • Advanced R (Wickham, 2019) and R for Data Science (Wickham et al., 2023) teach R itself superbly, but predate widespread LLM use and do not address the human-LLM division of labour. Read them alongside this one.
  • Statistical Computing with R (Rizzo, 2019) covers the algorithms in greater depth than this book, but likewise predates LLM integration and is longer and more mathematical. Consult it when you want theory.
  • The Posit book family on specific tools (ggplot2, Mastering Shiny, R Packages) remains the reference for those tools. This book cites them rather than replacing them.
  • Statistical Rethinking (McElreath, 2020) is the best current introduction to Bayesian computation; this book’s Bayesian chapter (Chapter 14, Bayesian Computation) is a gateway, not a substitute.

What this book adds is a unifying pedagogical framework for statistical computing in the presence of competent LLMs. That framework is what the subtitle commits to, and it is what Parts I–VI of the book deliver.

Prerequisites

Readers should have:

  • Completed a one-quarter course in mathematical statistics at the level of Casella and Berger.
  • Basic familiarity with linear algebra (vectors, matrices, eigenvalues).
  • Access to R 4.4+ and RStudio (or another IDE of their choosing).

No prior experience with R or Git is assumed. Some prior exposure to an LLM (ChatGPT, Claude, Gemini, or similar) is helpful but not required.

Conventions

See the Conventions page for the visual cues used throughout the book.

How this book relates to its siblings

This is the introductory volume in a four-volume graduate sequence.

  • Biostatistics Practicum covers the workflow that surrounds the methods: reproducibility, Git, Docker, renv, Quarto reporting, tidyverse data wrangling, CDISC, SAS, AI-assisted coding, and clinical-trial case studies.
  • Statistical Computing in the Age of AI (this volume) covers the methods: programming in R, numerical algorithms, and the core inferential techniques (linear models, GLMs, mixed models, survival, Bayesian, bootstrap, simulation).
  • Advanced Statistical Computing in the Age of AI is the second methods volume, covering numerical stability, numerical linear algebra in depth, advanced optimisation, EM, Monte Carlo, MCMC, modern Bayesian computation, high-performance and distributed computing, high-dimensional methods, machine learning for biostatistics, software engineering, and advanced interactive visualisation.
  • Applied Generative AI for Health Sciences Research treats generative AI as the orthogonal axis: capability classes, reasoning models, biomedical RAG, multimodal medical AI, agents and the Model Context Protocol, evaluation, safety, regulation, and deployment.

Together the four books form a two-year graduate biostatistics curriculum. Each can be read independently.

Acknowledgements

The structural conventions of this book (chapter-opening diagnostic quiz, end-of-chapter answers, section-level exercises) follow the model used in the Posit open-source textbook series (Wickham, 2019; Wickham et al., 2023; Wickham & Bryan, 2023). The tidyverse R packages maintained by the Posit engineering team are used throughout the code examples, and their consistent interface is reflected in the style of the prose.

The content rests on the foundational work of numerous authors in statistical computing, including Efron and Tibshirani (Efron & Tibshirani, 1993) on the bootstrap; Bates and colleagues (Bates et al., 2015) on lme4; Golub and Van Loan (Golub & Van Loan, 2013) on matrix computations; and Nocedal and Wright (Nocedal & Wright, 2006) on optimisation. Remaining errors are the author’s.

The graduate students who engaged with early drafts of this material contributed materially through their questions and their willingness to debate the appropriate role of large language models in a statistics curriculum.

Ronald “Ryy” G. Thomas
La Jolla, California
Spring 2026