1  Introduction to R

1.1 Prerequisites

Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 1.14.

  1. What is the difference between c(1, 'a') and list(1, 'a')?
  2. In the expression df[df$x > 0, ], what kind of R object is df$x > 0, and what rule does R use to combine it with the comma to select rows?
  3. Predict the output of x <- 1:5; x[-1], x[c(TRUE, FALSE)], and x[[1]]. Which of these produces something that is not a numeric vector of length less than 5?

1.2 Learning objectives

By the end of this chapter you should be able to:

  • Install R and RStudio and configure a sensible working environment.
  • Read the R documentation system (?, ??, help()).
  • Distinguish between atomic vectors, lists, and the major data structures built on them.
  • Subset and mutate data with base R and with dplyr.
  • Write vectorised expressions and explain why they outperform equivalent for loops.

1.3 Orientation

R is the lingua franca of statistical computing in academic biostatistics. It is not the fastest language, nor the most elegant, but it has by far the richest library of statistical methods and the largest community of statisticians reading, writing, and reviewing code. This chapter gets you fluent enough in R to read the rest of the book.

1.4 The statistician’s contribution

Large language models can produce passable R code for almost any introductory task. Three decisions still require the statistician’s judgment.

1. Type discipline. R coerces silently between types in ways that can hide errors. c(1, '2') becomes c('1', '2') (character); TRUE + 1 becomes 2 (numeric), and TRUE + 1L becomes 2L (integer). An LLM will happily accept whatever types flow through its generated code. You must decide, before any computation runs, which types the data should carry, and assert them with stopifnot(is.numeric(x)) where it matters.
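
The assertion pattern above can be sketched in a few lines (check_numeric is an illustrative helper, not a base R function):

```r
x <- c(1, "2")        # silent coercion: both elements become character
class(x)              # "character"

class(TRUE + 1)       # "numeric": TRUE coerces to 1

# Assert the expected type before computing; fail fast otherwise
check_numeric <- function(x) {
  stopifnot(is.numeric(x))
  x
}
mean(check_numeric(c(1, 2, 3)))   # 2
# check_numeric(c("1", "2"))      # stops with an error instead of coercing
```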

2. Base R vs tidyverse framing. The same transformation can be expressed in base R and in dplyr. An LLM will lean toward whichever style dominated its training data, usually dplyr for interactive work. The performance, dependency, and readability tradeoffs belong to you. Pick the style that fits the project’s needs, not the model’s default.

3. Edge cases in statistical data. R’s built-in functions handle NA, NaN, Inf, zero-length vectors, and length-mismatched vectors in different ways. The defaults are usually right for general-purpose data; they are often wrong for biomedical data where missingness is informative. An LLM will not check whether your data has the edge cases that would break its generated code. You will.

1.5 Installing R and RStudio

R is a free and open-source implementation of the S programming language. Download the base R installer from the Comprehensive R Archive Network (CRAN); Posit’s RStudio Desktop, a free integrated development environment, is available at https://posit.co/download/rstudio-desktop/. Install R first, then RStudio.

On macOS, the rig command-line tool (https://github.com/r-lib/rig) lets you install and switch between R versions cleanly:

brew install --cask rig
rig add release        # installs the current stable R
rig default release

Verify the installation:

R --version
# R version 4.5.3 (2026-02-28) -- 'Single Candle'

Launch RStudio. The default four-pane layout gives you a script editor, a console, an environment inspector, and a files/plots/help browser.

1.6 The R console, scripts, and projects

An R session is anchored to a working directory and a set of loaded packages. Two conventions make sessions reproducible.

Use RStudio projects. A .Rproj file tells RStudio that the enclosing directory is a project; opening the file sets the working directory automatically and isolates the session from the rest of your filesystem. Always work inside a project, never from a stray session rooted in ~/.

Prefer scripts over the console. The console is for quick exploration; scripts are for work that must be reproducible. A script is a .R or .qmd file you can rerun from top to bottom and obtain the same result. Analyses that live only in the console history evaporate when the session ends.

The .Rprofile file in a project runs automatically when you open a session. Use it to set session-wide options like the CRAN mirror or the preferred number of significant digits. renv extends this pattern to pin package versions; package and environment management is covered in detail in the practicum companion volume.
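
A sketch of a minimal project .Rprofile in that spirit (the specific option values are illustrative choices, not recommendations):

```r
# .Rprofile -- sourced automatically at session start in this project
options(
  repos  = c(CRAN = "https://cloud.r-project.org"),  # pin the CRAN mirror
  digits = 4                                         # significant digits to print
)
```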

1.7 Atomic vectors and coercion

Everything in R is built from vectors. Even a single number like 5 is a vector of length 1.

R has five primary atomic types: logical, integer, double, character, and complex. All elements of an atomic vector share a type. If you try to combine mixed types, R silently coerces them to the most general type in the chain logical → integer → double → character.

c(TRUE, 1L)         # -> integer: 1, 1
c(TRUE, 1L, 2.5)    # -> double:  1.0, 1.0, 2.5
c(TRUE, 1L, 'x')    # -> character: 'TRUE', '1', 'x'

Silent coercion is one common source of bugs; silent NA propagation is another. sum(c(1, 2, NA)) returns NA, not 3, because missingness propagates through arithmetic. mean(c(TRUE, FALSE, TRUE)) returns 0.667 because the logicals coerce to c(1, 0, 1), which is numerically correct and often useful but can surprise readers.
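
Both behaviours have explicit escape hatches; a short illustration:

```r
x <- c(1, 2, NA)
sum(x)                      # NA: missingness propagates by default
sum(x, na.rm = TRUE)        # 3: drop the NAs explicitly

mean(c(TRUE, FALSE, TRUE))  # 0.667: logicals coerce to c(1, 0, 1)
```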

Lists differ from atomic vectors in that their elements can each carry a different type:

list(1, 'a', TRUE)    # length-3 list, three distinct types

Operations that work on atomic vectors often do not work on lists; reach for lapply(), sapply(), or purrr::map_*() when iterating over list elements (Chapter 4).

Special values:

  • NA is missing data. Most operations involving NA return NA.
  • NULL is the absence of an object. Assigning NULL to a list element removes it.
  • NaN is ‘not a number’ (for example, 0/0).
  • Inf and -Inf are the signed infinities (for example, 1/0).
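
The distinctions among these values can be checked directly (a minimal sketch):

```r
v <- c(1, NA, NaN, Inf)
is.na(v)       # FALSE TRUE TRUE FALSE: NaN also counts as missing
is.nan(v)      # FALSE FALSE TRUE FALSE: only NaN
is.finite(v)   # TRUE FALSE FALSE FALSE: NA, NaN, and Inf are all non-finite

lst <- list(a = 1, b = 2)
lst$a <- NULL  # assigning NULL removes the element
names(lst)     # "b"
```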

Q: What does c(1, 'a') return, and why?

A: The character vector c('1', 'a'). R cannot store 1 and 'a' in the same atomic vector, so it coerces both to the most general type (character). To keep the original types, use list(1, 'a').

1.8 Subsetting: [, [[, and $

R offers three subsetting operators and four kinds of indices. Confusion between them is the most common source of errors in a first R course.

The three operators.

  • [ returns a subset of the same type as the original. For an atomic vector it returns a vector; for a list it returns a list; for a data frame it returns a data frame.
  • [[ returns a single element, stripping the container. From a list, [[ gives you the element itself (which might be any type); from an atomic vector, [[ returns a length-1 atomic of the same type.
  • $name is shorthand for [['name']] on a list or data frame.
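
The container-versus-element distinction is easiest to see on a small list (a minimal sketch):

```r
lst <- list(nums = c(10, 20), label = "trial A")

lst["nums"]      # [  keeps the container: a length-1 list
lst[["nums"]]    # [[ strips it: the numeric vector c(10, 20)
lst$nums         # shorthand for lst[["nums"]]

is.list(lst["nums"])       # TRUE
is.numeric(lst[["nums"]])  # TRUE
```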

The four index types.

x <- c(10, 20, 30, 40, 50)

x[2]              # positive integer: 20
x[c(1, 3, 5)]     # vector of positive integers: 10, 30, 50
x[-1]             # negative integer: drop element 1 -> 20, 30, 40, 50
x[x > 25]         # logical: 30, 40, 50
x[c('a', 'b')]    # character: works only if x has names

For data frames, [row, col] applies the subsetting rules along both dimensions. The expression df[df$x > 0, ] uses a logical row index (df$x > 0) to keep rows where x is positive and a missing column index to keep all columns.

df <- data.frame(
  id  = 1:5,
  x   = c(3, -1, 5, 0, 8),
  grp = c('a', 'b', 'a', 'b', 'a')
)

df[df$x > 0, ]        # rows where x is positive
df[df$grp == 'a', ]   # rows in group 'a'
df[, c('id', 'x')]    # two columns, all rows

A logical index built from a column that contains NA produces NA rows, not dropped rows. This surprises people. Use which() or subset() if you want NA rows dropped.
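
A short sketch of the NA-row behaviour and both workarounds:

```r
df <- data.frame(id = 1:3, x = c(5, NA, -2))

nrow(df[df$x > 0, ])         # 2: the id = 1 row plus a row of all NAs
nrow(df[which(df$x > 0), ])  # 1: which() drops the NA index silently
nrow(subset(df, x > 0))      # 1: subset() treats NA as FALSE
```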

1.9 Vectorisation

Vectorisation is applying an operation to every element of a vector at once, rather than writing an explicit loop. It is the central design principle of R: nearly every built-in function is vectorised.

x <- c(1, 2, 3, 4, 5)
x + 10                # 11 12 13 14 15, elementwise
x^2                   # 1 4 9 16 25
sqrt(x)               # 1.00 1.41 1.73 2.00 2.24

When two vectors of different lengths combine, R recycles the shorter vector to match the longer:

c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24  (recycled)

Recycling without warning is convenient for scalar + vector arithmetic but dangerous when lengths mismatch by accident. R warns if the longer length is not a multiple of the shorter, but not otherwise.
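
The warning rule is easy to demonstrate: lengths 4 and 2 recycle silently, while 3 and 2 warn.

```r
c(1, 2, 3, 4) + c(10, 20)  # 11 22 13 24: silent, 4 is a multiple of 2
c(1, 2, 3)    + c(10, 20)  # 11 22 13, plus a warning: 3 is not a multiple of 2
```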

Loop vs vectorised. A concrete benchmark makes the stakes visible:

library(bench)

vectorised_zscore <- function(x) (x - mean(x)) / sd(x)

loop_zscore <- function(x) {
  z <- numeric(length(x))
  m <- mean(x)
  s <- sd(x)
  for (i in seq_along(x)) z[i] <- (x[i] - m) / s
  z
}

bench::mark(
  vectorised = vectorised_zscore(iris$Sepal.Length),
  loop       = loop_zscore(iris$Sepal.Length),
  iterations = 100
)

The vectorised version is typically 10–100× faster because the elementwise arithmetic happens in compiled C code, not in R’s interpreter. For a vector of length 1 million the gap grows to 1000× or more.

Q: Why is vectorised R code faster than an equivalent for loop that does the same arithmetic?

A: The vectorised operation hands the whole vector to a compiled C routine, which runs the loop internally at native speed. An R-level for loop invokes the R interpreter once per element, and the interpreter overhead dominates the actual arithmetic for simple operations.

Not every algorithm vectorises. Sequential updates (MCMC chains, sequential hypothesis tests, change-point detection) require each step to depend on the previous one and cannot be written as single elementwise expressions. For these cases, Rcpp (Chapter 3) lets you write a C++ loop that R calls as a single function.
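
To make the constraint concrete, here is a sketch of a simple sequential recursion, an AR(1)-style path where each value depends on its predecessor (an illustrative example, not one from the chapter):

```r
# x[i] = phi * x[i-1] + e[i]: no single elementwise expression computes this
ar1_path <- function(e, phi = 0.5) {
  x <- numeric(length(e))
  x[1] <- e[1]
  for (i in seq_along(e)[-1]) x[i] <- phi * x[i - 1] + e[i]
  x
}

ar1_path(c(1, 0, 0, 0))   # 1.000 0.500 0.250 0.125
```

(Base R’s stats::filter() happens to handle this particular linear recursion, but general sequential updates such as MCMC steps do not reduce to it.)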

1.10 Tidyverse essentials

The tidyverse is a coordinated set of packages (dplyr, tidyr, ggplot2, purrr, and others) that share a common grammar for data manipulation and visualisation. dplyr provides six core verbs and a pipe operator that chains them.

The native R pipe |> (introduced in R 4.1, 2021) passes its left-hand side as the first argument to the right-hand side:

x |> mean()          # same as mean(x)
x |> round(digits = 2)  # same as round(x, digits = 2)

Earlier tidyverse code uses the magrittr pipe %>%. The two are similar; this book uses the native pipe exclusively.

The six dplyr verbs apply to data frames:

Verb                    Action
filter(df, condition)   Keep rows where condition is TRUE
select(df, cols)        Keep or drop columns
mutate(df, new = ...)   Add or modify columns
summarise(df, ...)      Reduce to a single row (or one per group)
arrange(df, col)        Sort rows
group_by(df, col)       Group for later verbs

A typical pipeline:

library(dplyr)

ChickWeight |>
  filter(Time <= 14) |>
  group_by(Diet, Time) |>
  summarise(
    mean_weight = mean(weight),
    n           = n(),
    .groups     = 'drop'
  ) |>
  arrange(Diet, Time)

The pipe reads top-to-bottom like an analysis plan: start with ChickWeight, keep the first 14 days, group by diet and day, summarise, sort. The same computation in base R would fragment into intermediate variables and nested function calls; the pipe keeps the data flow visible.
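
For comparison, one plausible base R rendering of the same computation (a sketch; other phrasings exist):

```r
cw  <- ChickWeight[ChickWeight$Time <= 14, ]
agg <- aggregate(weight ~ Diet + Time, data = cw, FUN = mean)
agg$n <- aggregate(weight ~ Diet + Time, data = cw, FUN = length)$weight
agg <- agg[order(agg$Diet, agg$Time), ]
head(agg)
```

The intermediate variables cw and agg are exactly the fragmentation the pipe avoids.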

Q: What does df |> group_by(x) |> summarise(m = mean(y)) return that summarise(df, m = mean(y)) does not?

A: A data frame with one row per distinct level of x, giving the mean of y within each group. Without group_by(), the second form collapses the whole data frame to a single row.

1.11 Collaborating with an LLM on R essentials

Section 1.4 named three decisions that require the statistician’s judgment. The prompts below exercise each one. Treat the LLM’s response as a hypothesis to verify, not a result to trust.

1.11.1 Type coercion

Prompt. ‘Read this CSV and compute the mean of the dose column’, then paste a CSV whose dose column mixes numeric strings and the literal string 'NA'.

What to watch for. read.csv() treats the string 'NA' as the missing sentinel by default, and readr::read_csv() treats both '' and 'NA' as missing unless its na argument says otherwise. A dose column with mixed content can silently come in as character, and mean() on a character vector returns NA with a warning rather than a number. An LLM may wrap the read in as.numeric() without warning you that values that fail to parse become NA.

Verification. Inspect str(df$dose) after the read. Confirm the type is numeric and the count of NA values matches the count of non-parseable source strings.
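
That verification can be sketched as follows (the vector raw stands in for the column as read; the values are illustrative):

```r
raw  <- c("1.5", "2.0", "oops", "NA")      # illustrative source strings
dose <- suppressWarnings(as.numeric(raw))  # "oops" fails to parse -> NA

str(dose)             # num [1:4] 1.5 2 NA NA
sum(is.na(dose))      # 2

# Trace every NA back to its source string before accepting the parse
raw[is.na(dose) & !(raw %in% c("NA", ""))]   # "oops": investigate this value
```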

1.11.2 Base R vs tidyverse

Prompt. ‘Write R code to compute, for each level of a grouping variable, the mean of another variable, dropping missing values.’

What to watch for. An LLM will usually produce the dplyr form (group_by + summarise(mean(x, na.rm = TRUE))). It may not ask whether the project uses tidyverse at all. Production R code in some settings (pharma, legacy packages, FDA submissions via pharmaverse internals) runs on base R or data.table and imports no tidyverse.

Verification. Confirm the project’s dependency policy before accepting the suggestion. If tidyverse is permitted, keep the LLM’s code. If not, ask for the base R equivalent (aggregate(y ~ group, data = df, FUN = mean)) or the data.table equivalent.

1.11.3 Edge cases

Prompt. ‘Write a function that computes the ratio of two columns in a data frame.’

What to watch for. An LLM will typically write df$x / df$y without handling the cases that make the ratio undefined: zero denominators (produces Inf), negative denominators (changes sign), NA inputs (propagate), or zero-row inputs (returns a zero-length numeric). In a biomedical analysis each case signals a real data phenomenon: zero divisor as a structural zero, missing as a censored measurement, and so on.

Verification. Ask the LLM to list the inputs on which the function would produce unexpected output. Extend the function to raise an error (with an explanatory message) rather than silently returning Inf or NaN for cases that indicate broken data.
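
A sketch of such a defensive version (the name safe_ratio and the error policy are illustrative choices, not the only reasonable ones):

```r
safe_ratio <- function(num, den) {
  stopifnot(is.numeric(num), is.numeric(den), length(num) == length(den))
  if (any(den == 0, na.rm = TRUE)) {
    stop("zero denominator at position(s): ",
         paste(which(den == 0), collapse = ", "))
  }
  num / den   # NA inputs still propagate, by design
}

safe_ratio(c(10, 20), c(2, 4))   # 5 5
# safe_ratio(c(1, 2), c(0, 4))   # error: zero denominator at position(s): 1
```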

1.12 Exercises

  1. Write a function summarise_numeric(df) that takes a data frame and returns a data frame with one row per numeric column giving its mean, median, SD, and number of missing values. Do not use any package outside base R.
  2. Repeat exercise 1 using dplyr::summarise() across all numeric columns. Compare the two implementations for readability and speed.
  3. Explain, in at most three sentences, why sum(c(1, 2, NA)) == NA returns NA rather than TRUE.

1.13 Further reading

  • (Wickham et al., 2023), the canonical first book for R data analysis; the 2nd edition uses the native pipe throughout.
  • (Wickham, 2019), Chapters 1–3, essential for reading other people’s R code.
  • (Grolemund & Wickham, 2017); the 1st edition is still a readable, slower-paced introduction to the tidyverse.

1.14 Prerequisites answers

  1. c(1, 'a') coerces both elements to the most permissive atomic type, producing c('1', 'a') (character). list(1, 'a') preserves the types of each element because lists are not atomic.
  2. df$x > 0 is a logical vector of length nrow(df). R applies logical subsetting along the first dimension of the data frame so that df[df$x > 0, ] keeps only the rows where the logical vector is TRUE. Rows where df$x > 0 is NA are returned as rows of all NA.
  3. x[-1] drops the first element and returns 2:5. x[c(TRUE, FALSE)] recycles the logical vector to length 5 and returns c(1, 3, 5). x[[1]] returns 1, which is still a length-1 numeric vector; [[ is important conceptually (it strips list structure) but on an atomic vector with a single index it behaves like [. So none of the three produces anything other than a numeric vector of length less than 5.