Getting Started with Bayesian Statistics

using Stan and Python

Author

Bob Carpenter

Published

June 17, 2023

Preface

Welcome to this introduction to Bayesian statistics using Stan in Python. The preface explains what we expect you to know before starting, how to install Stan, and provides the Python boilerplate we will use throughout.

Prerequisites

We assume the reader is able to follow text that includes basic notions from

  • differential and integral calculus in multiple dimensions,
  • matrix arithmetic (but not linear algebra),
  • probability theory, including probability density and mass functions, cumulative distribution functions, expectations, events, and the basic rules of probability theory, and
  • Python numerical programming with NumPy.

By basics, we really do mean basics. You won’t need to do any calculus; we will just use it to express what Stan computes. Similarly, we will use matrix notation to express models, and we will avoid advanced topics in linear algebra that are at the heart of some of Stan’s internals.

We include several appendices as both mathematical background and summary of notation, with rigorous definitions of the concepts used in this introduction. Those who are more mathematically inclined may wish to start with the appendices.

Source code and license

All of the source markdown, YAML, LaTeX, and BibTeX files are available in the Git repository described below.

Everything is open source, with licenses:

  • Code: BSD-3-Clause license
  • Text: CC-BY 4.0

Building this document from source

This presupposes that you are working in a terminal on the command line and have working versions of Quarto, Git, and make installed on your path. You can also build this document through RStudio.

You can download the text markdown source, Stan programs, and data used to create this document by cloning the Git repository where it is hosted.

> git clone https://github.com/bob-carpenter/stan-getting-started.git

You can then build the case study by changing into the cloned repository and making the document.

> cd stan-getting-started
> make

The final HTML file should be rendered into quarto/stan-getting-started.html.

Python, CmdStanPy, NumPy, pandas, and plotnine

For our scripting language, we use Python 3.

To access Stan, we use the Python package CmdStanPy.

For numerical and statistical computation in Python, we use NumPy.

For plotting, we use the Python package plotnine. plotnine is a Python reimplementation of ggplot2, which is itself an implementation of the grammar of graphics (Wilkinson 2005).

We use pandas for representing wide-form data frames, primarily because it is the required input for plotnine.

Python boilerplate

We include the following Python boilerplate to import and configure packages we will use throughout this tutorial.

# PROJECT SETUP
# set DRAFT = False for final output; DRAFT = True is faster
DRAFT = False

import itertools
import logging
import warnings

# suppress irrelevant deprecation warnings from dependencies
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", module=r"plotnine\..*")

# report only errors from CmdStanPy, suppressing progress and info messages
import cmdstanpy as csp
csp.utils.get_logger().setLevel(logging.ERROR)

import numpy as np
import statistics as stat
import pandas as pd
import plotnine as pn
import patchworklib as pw
import time

# simple wall-clock timer used for the performance comparisons below
class StopWatch:
    def __init__(self):
        self.start()
    def start(self):
        self.start_time = time.time()
    def elapsed(self):
        return time.time() - self.start_time
timer = StopWatch()

Acknowledgements

I’d like to thank Manny Mokel, Mitzi Morris, Barry Smith, Jeff Soules, and Brian Ward for extremely helpful feedback on earlier drafts. A lot of the current form of this document is due to their constructive suggestions.

Thanks to Ero Carrera for typo fixes.

Thanks to ChatGPT Plus 4.0 (May 5, 12, and 24, 2023 versions) for extensive help with plotting and data frame construction, for rewriting one sentence I couldn’t manage to write clearly, and for converting measure-theoretic notation to Lebesgue measures so that I could translate abstract theoretical presentations.

I’d also like to thank the entire Stan Development Team and all of our users. Stan is an open project and a large team effort and we’re always happy to help onboard new developers and new users.

1 Introduction

These notes are intended to introduce several technical topics to practitioners: Bayesian statistics and probabilistic modeling, Markov chain Monte Carlo methods for Bayesian inference, and the Stan probabilistic programming language.

1.1 Bayesian statistics

The general problem addressed by statistical inference is that of reasoning from a limited number of noisy observations. For example, we might want to perform inference about a population after measuring a subset of its members, or we might want to predict future events after observing past events.

There are many approaches to applied statistics. These notes focus on Bayesian statistics, a form of statistical modeling and inference that is grounded in probability theory. In the Bayesian approach to statistics, we characterize our knowledge of the world in terms of probabilities (e.g., there is a 24.3% chance of rain after lunch today; the probability that the next baby born in the United States is male is 51%).

Bayesian inference is always carried out with respect to a mathematical model of a stochastic data generating process. If the model is well-specified in the sense of matching the true data generating process, then Bayesian statistical inference can be shown to have several desirable properties, such as calibration and resistance to overfitting.

Appendix D. Bayesian statistics provides a short, but precise introduction to Bayesian inference, following Appendix A. Set theory and Appendix B. Probability theory, which provide background. The appendices establish a rigorous basis for the notation and provide more formal definitions of exactly what Stan is computing.

If you’re looking for a gentle introduction to Bayesian statistics, I highly recommend Statistical Rethinking (McElreath 2023). For a more advanced introduction, try Bayesian Data Analysis (Gelman et al. 2013), which is available from the authors as a free pdf.

1.2 Markov chain Monte Carlo methods

Bayesian inference for parameter estimation, prediction, or event probability estimation is based on posterior expectations. A posterior expectation is a high-dimensional integral over the space of parameters. Stan adopts the standard approach to solving general high-dimensional integrals, which is the Monte Carlo method. Monte Carlo methods use random sampling (hence the name) to estimate high-dimensional integrals (whose values are not themselves random quantities).
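
As a minimal sketch of the idea in NumPy (the standard normal target and the function \(f(x) = x^2\) are chosen here purely for illustration), we can estimate \(\mathbb{E}[X^2] = 1\) for \(X \sim \textrm{normal}(0, 1)\) by averaging over independent draws; the estimate sharpens as the number of draws grows.

import numpy as np

# Monte Carlo: estimate the integral E[X^2] = 1 for X ~ normal(0, 1)
# by averaging f(x) = x^2 over M independent pseudorandom draws
rng = np.random.default_rng(123)
for M in [100, 10_000, 1_000_000]:
    x = rng.normal(size=M)
    print(f"M = {M:>9};  estimate of E[X^2] = {np.mean(x**2):.4f}")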

We cannot use standard Monte Carlo methods for most problems of interest in Bayesian statistics because we cannot generate a sample of independent draws from the posterior density of interest.1 The exception is simple models in the exponential family with conjugate priors (Diaconis and Ylvisaker 1979). So instead, we have to resort to Markov chain Monte Carlo (MCMC) methods (Brooks et al. 2011), which create samples with correlation structure among the draws making up the sample.

Alternatives to MCMC include rejection sampling (Gilks and Wild 1992), sequential Monte Carlo (Doucet, De Freitas, and Gordon 2001), approximate Bayesian computation (ABC) (Marin et al. 2012), variational inference (Blei, Kucukelbir, and McAuliffe 2017), and integrated nested Laplace approximation (Rue, Martino, and Chopin 2009), among others.

Among MCMC methods, Stan adopts Hamiltonian Monte Carlo (HMC) (Neal 2011), which is currently the most efficient and scalable MCMC method for smooth target densities. Popular alternatives include random-walk Metropolis-Hastings (Chib and Greenberg 1995) and Gibbs sampling (Casella and George 1992), both of which are simpler, but much less efficient than HMC for all but the easiest of problems.

1.3 Stan and probabilistic programming

Stan is what is known as a domain specific language (DSL), meaning it was written for a particular application. Specifically, Stan is a probabilistic programming language (PPL) designed for coding statistical models. Although Stan is used most widely for Bayesian inference, it can also perform standard frequentist inference (e.g., maximum likelihood, bootstrap, etc.), though we do not touch on those capabilities in this introduction.

A Stan program declares data variables and parameters, along with a differentiable posterior log density (of the parameters given the data) up to a constant. Although this is the only requirement, most Stan programs define the joint log density of the parameters and data, which can be shown to equal the log posterior plus a constant.

Stan programs are probabilistic programs in the sense that their data and parameters can represent random variables. A random variable is something that takes on different values with some probability, although mathematically it is a deterministic function and gets its randomness from an underlying probability measure that determines how probable the values of a random variable are. An example of a random variable is the outcome of a coin flip. The variable takes on the value heads or tails, but we don’t know which. Somewhat confusingly, statistics often operates counterfactually, where we have actually observed the outcome of a coin flip but persist in treating it as if it were random and could have resulted in a different value.

In practice, Stan parameters, transformed parameters, and posterior predictive quantities are all unobserved random variables. Such variables are inferred from the model and the values of other variables. Stan typically produces different values for unobserved random variables each time it is run. Stan programs can be made deterministic, in order to provide reproducible results, by fixing a random seed.

Stan programs are translated to C++, and we estimate quantities of interest using either MCMC or approximate methods like variational inference or Laplace approximation. Users provide data to Stan, defining constants (like the number of observations) and providing the values of observed random variables. Stan produces output consisting of a sample, each draw of which provides a possible value for all of the unobserved random variables in the model. Stan is available through the open-source analytics languages Python, R, and Julia, and is compatible with these languages’ built-in Bayesian analysis tools as well as Stan’s own tools, some of which we will cover in this introduction. Stan is also available in Mathematica, Stata, and MATLAB, but those interfaces are very basic compared to our open-source interfaces.

2 Pragmatic Bayesian statistics

There have been several schools of Bayesian statisticians; Lin (2022) provides an excellent overview with primary references, and Little (2006) provides a more in-depth summary comparing Bayesian and frequentist philosophy. The two most prominent schools are the subjective Bayesians and the objective Bayesians. As suggested by the names, these two paradigms have diametrically opposed philosophical approaches. While both use proper priors in the sense of being probability distributions, the “subjective” approach tries to capture actual prior “beliefs,” whereas the “objective” approach tries to minimize the use of prior information. Both groups trust their posterior inferences based on their chosen philosophical approach to priors.

We are going to follow a more pragmatic approach to Bayesian statistics that views model building as more of an engineering discipline than a philosophical exercise. This perspective is laid out in detail in Gelman et al. (2013) and refined in Gelman et al. (2020). The pragmatic approach feels more like modern machine learning than statistics, with its emphasis on predictive calibration (Gneiting, Balabdaoui, and Raftery 2007). Roughly speaking, a probabilistic inference is calibrated if it has the right coverage for future data. For example, I might predict a 70% chance of rain on 100 different days. I would like to see roughly 70 of those days be rainy for my predictions to be well calibrated. Calibration is itself a frequentist notion, but we do not follow standard frequentist practice in that we are willing to modify our modeling assumptions once we have investigated their behavior on data (Gelman et al. 2013).

The fundamental distinguishing feature of frequentist statistics is that probabilities are understood as long-run frequencies of repeatable processes. This prohibits placing probability distributions over parameters, because there is no long-term repeatable process in the world generating new parameters. For example, the gravitational constant has a single value and is not the value of a potentially repeatable trial like a coin flip (other than in a philosophical, possible worlds sense).

In our pragmatic approach to Bayesian statistics, we treat probability as fundamentally epistemic rather than doxastic, meaning it is about human knowledge, not about human belief. This is a subtle, but important, distinction. Although frequentists sometimes worry that Bayesians are “putting their thumb on the scale” by including their prior knowledge in a model rather than “letting the data speak for itself,” this is an instance of the pot calling the kettle black. The biggest “subjective” decision in model building is shared between Bayesian and frequentist approaches, namely the likelihood assumed to model the data-generating process. In practice, we often sidestep concerns about subjectivity by using weakly informative priors that indicate the scale, but not the particular value, of a parameter. And we furthermore run sensitivity analyses to test the effect of our prior assumptions.

Pierre-Simon Laplace (1814) begins his book on probability by stating the general epistemic position on probability in terms of an entity that knows all (aka Laplace’s demon).

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.

John Stuart Mill (1882) is more explicit in laying out the epistemic view of probability as follows.

We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. \(\ldots\) Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

3 Stan examples: forward simulation and Monte Carlo

By forward simulation, we mean running a simulation of a scientific process forward from the parameter values to simulated data. For example, consider a simple clinical trial with \(N\) subjects and a probability \(\theta \in (0, 1)\) of a positive outcome. Given \(\theta\) and \(N\), we can simulate the number of patients \(y \in 0{:}N\) with a successful outcome according to a binomial distribution (which we define below).

The inverse problem is that of estimating the probability of success \(\theta,\) given an observation of \(y\) successes out of \(N\) subjects. For example, we might have \(N = 100\) subjects in the trial, \(y = 32\) of whom had a positive outcome from the trial. A simple estimate of \(\theta\) in this case would be 0.32. We return to estimation and uncertainty quantification in later sections.

Let’s say we have \(N = 100\) subjects in our clinical trial and the success rate is \(\theta = 0.3\). We can simulate a result \(y\) from the clinical trial by randomly generating the number of subjects with a successful outcome. Although this could be done by simulating the binary outcome for each patient, it wouldn’t be an efficient way to sample from a binomial distribution.

In statistical sampling notation, we write \[ Y \sim \textrm{binomial}(N, \theta) \] to indicate that there are \(N \in \mathbb{N}\) patients with probability \(\theta \in (0, 1)\) of a successful outcome, with \(Y \in 0{:}N\) representing the number of successful outcomes out of \(N\) patients.

The probability mass function for \(Y\), written \(p_Y\), is defined for \(N \in \mathbb{N}\), \(\theta \in (0, 1)\), and \(y \in 0{:}N\) by \[\begin{align} p_Y(y \mid N, \theta) &= \textrm{binomial}(y \mid N, \theta) \\[6pt] &= \binom{N}{y} \cdot \theta^y \cdot (1 - \theta)^{N - y}. \end{align}\] Unless necessary for disambiguation, we will drop the random variable subscripts on probability density or mass functions like \(p_Y\) going forward, writing simply \(p(y \mid N, \theta)\) and allowing context to disambiguate.
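
To make the definition concrete, here is a direct transcription into Python (a sketch for illustration; in practice, computation is done on the log scale for numerical stability, as Stan does internally).

# binomial pmf: choose(N, y) * theta^y * (1 - theta)^(N - y)
from math import comb

def binomial_pmf(y: int, N: int, theta: float) -> float:
    return comb(N, y) * theta**y * (1 - theta)**(N - y)

print(binomial_pmf(3, 10, 0.3))  # 0.2668..., the modal outcome for N = 10, theta = 0.3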

3.1 A first Stan program

Let’s say we wanted to generate random instantiations of \(Y\) for given values of \(N\) and \(\theta\). For example, we can set \(\theta = 0.3\) for a 30% chance of a successful outcome (“success” is the generic name in statistics for a “positive” outcome). We can then set \(N = 10\) in order to simulate results for 10 patients. Then given \(\theta = 0.3\) and \(N = 10,\) we can generate a value of \(Y\) between 0 and 10 for the number of patients out of 10 with a successful outcome. We can do this using the following Stan program, which we will unpack line by line after its listing.

stan/binomial-rng.stan
data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;
}
generated quantities {
  int<lower=0, upper=N> y = binomial_rng(N, theta);
}

The first thing to notice is that a Stan program is organized into blocks. Here we have two blocks, a data block containing declarations of variables that must be input as data, and a generated quantities block, which not only declares variables, but assigns a value to them. In the case of this Stan program, the generated quantity variable y is assigned the result of taking a single draw from a \(\textrm{binomial}(N, \theta)\) distribution, which Stan provides through the binomial_rng function.

The second thing to notice about a Stan program is that the variables are all declared with types. Stan uses static typing, which means that unlike Python or R, a variable’s type is declared in the program before it is used rather than determined at run time based on what is assigned to it. Once declared, a variable’s type never changes. Stan also uses strong typing, meaning that unlike C or C++, there is no way to get around the type restrictions and access memory directly.

The program declares three variables, N and y of type int (integer values in \(\mathbb{Z}\)), and theta of type real (real values in \(\mathbb{R}\)). On actual computers, our integers will have fixed upper and lower bounds and our real numbers are subject to all the vagaries of numerical floating point calculations. Stan uses double-precision (64-bit) floating point and follows the IEEE 754 standard other than in a few highly-optimized calculations that lose a few bits of precision.

A type may also have constraints. Because N is a count, it must be greater than or equal to zero, which we indicate with the bound lower=0. Similarly, the variable y is the number of successful outcomes out of N patients, so it must take on a value between 0 and N (inclusive); that is represented with the constraint lower=0, upper=N. Finally, the variable theta is real and declared to fall in the interval \([0, 1]\) with the constraint lower=0, upper=1. Technically, our bounds are open for real values, but in practice, we might wind up with 0 or 1 values due to underflow or rounding errors in floating point arithmetic.

At run time, the compiled Stan program must be given values for N and theta, at which point it will sample a value of y each iteration using its built-in pseudorandom number generator. In code, we first define a dictionary for our data (variables N and theta), then construct an instance of CmdStanModel for our model from the path to its program, and finally sample from the model using the sample method of CmdStanModel.

N = 100
theta = 0.3
data = {'N': N, 'theta': theta}
model = csp.CmdStanModel(stan_file = '../stan/binomial-rng.stan')
sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_sampling = 10, iter_warmup = 0,
                      show_progress = False, show_console = False)

The Python boilerplate above set the log level to ERROR for the cmdstanpy package in order to get rid of the warnings and informational messages that would otherwise provide updates on a running Stan program. Changing the log level to WARNING will retain warnings, changing to INFO retains ongoing updates as a program runs, and changing to DEBUG provides low-level info on the algorithm as it runs.
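
For example, to restore warnings after the boilerplate above has suppressed everything below the ERROR level:

# keep warnings, but drop routine progress and info messages
csp.utils.get_logger().setLevel(logging.WARNING)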

The constructor for CmdStanModel is used to construct a model from a Stan program found in the specified file. We highly recommend using a standalone file for Stan programs to make them easy to share, to allow both quotes for printing and apostrophes for transposition, and to make it easy to find the lines referenced by number in error messages. Under the hood, this first runs Stan’s transpiler to convert the Stan program to a C++ class. Then it compiles the C++ program, which will take on the order of twenty seconds due to the heavy use of optimization and template metaprograms.

In the Python interface, we create an object of the class CmdStanModel from the Stan program found in the file ../stan/binomial-rng.stan and assign it to the Python variable model. The constructor for CmdStanModel translates the specified Stan program to a C++ class and compiles the C++ class. We then call the sample() method on this model object in Python to generate a sample consisting of the specified number of draws. The sample() method takes the arguments

  • data: the data read in the data block of the Stan program,
  • seed: seed for the pseudorandom number generator, for reproducibility,
  • chains: the number of simulation runs (parallel_chains indicates how many to run in parallel),
  • iter_sampling: number of draws (i.e., sample size) to return,
  • iter_warmup: number of warmup iterations to tune parameters of the sampling algorithm (not needed here, so set to 0),
  • show_progress: if True, print progress updates, and
  • show_console: if True, stream the underlying CmdStan console output to the Python console.

The result of calling sample() on the model instance is assigned to the Python variable sample. It will contain the 10 draws we requested with argument iter_sampling = 10.

When model.sample(...) is called, CmdStan runs Stan as a standalone C++ program in a background process. This program starts by copying the data given in the Python argument data to a file, then reads in that data file to construct a C++ object representing the statistical model. Since our Stan program only has a generated quantities block, the C++ class’s only remaining task is to generate the requested number of draws. For each of the iter_sampling draws, Stan runs a pseudorandom number generator to generate a value from the specified binomial distribution.

Random number generation is determined by the seed value specified in the call. For more details on how pseudorandom number generation is performed, see the (free online) book by Devroye (1986). We describe the operational semantics of Stan in more detail in the section on Stan’s execution model below.

Once sampling has completed, we can extract the sample consisting of 10 draws for the scalar variable y as an array and then print their values along with the values of the data variables.

y = sample.stan_variable('y')
print("N = ", N, ";  theta = ", theta, ";  y(0:10) =", *y.astype(int))
N =  100 ;  theta =  0.3 ;  y(0:10) = 33 36 30 28 37 26 25 19 29 33

Let’s put that in a loop and see what it looks like by taking the number of patients N equal to 10, 100, 1000, and 10,000 in turn.

for N in [10, 100, 1_000, 10_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False,
                          show_console = False)
    y = sample.stan_variable('y')
    print("N =", N)
    print("  y: ", *y.astype(int))
    print("  est. theta: ", *(y / N))
N = 10
  y:  4 4 3 4 3 3 3 5 2 4
  est. theta:  0.4 0.4 0.3 0.4 0.3 0.3 0.3 0.5 0.2 0.4
N = 100
  y:  33 36 30 28 37 26 25 19 29 33
  est. theta:  0.33 0.36 0.3 0.28 0.37 0.26 0.25 0.19 0.29 0.33
N = 1000
  y:  322 324 306 333 311 318 294 323 282 311
  est. theta:  0.322 0.324 0.306 0.333 0.311 0.318 0.294 0.323 0.282 0.311
N = 10000
  y:  3049 3052 3012 3025 3042 3087 3051 2922 2943 3025
  est. theta:  0.3049 0.3052 0.3012 0.3025 0.3042 0.3087 0.3051 0.2922 0.2943 0.3025

On the first line for \(N = 10\) trials, our simple frequency-based estimates range from 0.2 to 0.5. By the time we have 10,000 trials, the frequency-based estimates only vary between 0.292 and 0.309. We know from the central limit theorem that the spread of estimates is expected to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\) for \(N\) draws (this result is only asymptotic in \(N\), but is very close for large-ish \(N\) in practice).
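
We can check this against the exact finite-\(N\) standard error: the frequency estimate \(y / N\) has standard deviation \(\sqrt{\theta \cdot (1 - \theta) / N}\), so each tenfold increase in \(N\) shrinks the spread by a factor of \(\sqrt{10} \approx 3.16\).

from math import sqrt

# exact standard deviation of the frequency estimate y / N for binomial(N, theta)
theta = 0.3
for N in [10, 100, 1_000, 10_000]:
    print(f"N = {N:>6};  sd of y/N = {sqrt(theta * (1 - theta) / N):.4f}")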

It is hard to get an impression of the true uncertainty from a small set of results like this. To get a better handle on uncertainty, we will simulate 100,000 \(y\) values (numbers of successful outcomes) for each value of the number of patients \(N\) and plot histograms. The following plot shows histograms of the frequency-based estimates based on 10, 100, and 1000 patients, each run for 100,000 simulations.

np.random.seed(123)
ts = []
ps = []
theta = 0.3
M = 100 if DRAFT else 100_000
for N in [10, 100, 1_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
       iter_sampling = M, iter_warmup = 0, show_progress = False,
       show_console = False)
    y = sample.stan_variable('y')
    theta_hat = y / N
    ps.extend(theta_hat)
    ts.extend(itertools.repeat(N, M))
xlabel = 'estimated Pr[success]'    
df = pd.DataFrame({xlabel: ps, 'trials': ts})
print(pn.ggplot(df, pn.aes(x = xlabel))
  + pn.geom_histogram(binwidth=0.01, color='white')
  + pn.facet_grid('. ~ trials')
  + pn.scales.scale_x_continuous(limits = [0, 1], breaks = [0, 1/4, 1/2, 3/4, 1],
      labels = ["0", "1/4", "1/2", "3/4", "1"], expand=[0, 0])
  + pn.scales.scale_y_continuous(expand=[0, 0, 0.05, 0])
  + pn.theme(aspect_ratio = 1, panel_spacing = 0.15,
             strip_text = pn.element_text(size = 6),
             strip_background = pn.element_rect(height=0.08,
                                fill = "lightgray")))

Although the histograms have different heights and the first one is spiky, the key consideration here is that they all have the same area, representing the 100,000 simulated values of \(y\). The trial size of 10 only has 11 possible estimates, 0.0, 0.1, …, 1.0, so the histogram (technically a bar chart here) just shows the counts of those outcomes. Here, \(y = 3\) is the most prevalent result, with corresponding estimate for \(\theta\) of \(y / 10 = 0.3\). The trial size of 100 looks roughly normal, as it should for a binomial with \(N = 100\) trials. By the time we get to \(N = 1,000\) trials, the draws for \(y\) concentrate near 300, corresponding to estimates near the value of \(0.3\) for \(\theta\). As \(N\) grows, the central limit theorem tells us to expect the width of these histograms to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\).

3.2 Pseudorandom numbers

As a probabilistic programming language, Stan relies on random number generation. Because Stan runs on traditional digital computers (i.e., von Neumann machines), it cannot generate truly random numbers. Instead, it does the next best thing and uses a pseudorandom number generator (PRNG) to generate a sequence of numbers deterministically. Specifically, we set a random number generation seed, which, when combined with a PRNG, deterministically generates a sequence of numbers that has many of the properties of a truly random sequence. The (free online) book by Devroye (1986) is the definitive reference for pseudorandom number generation for statistical distributions and contains a general introduction to PRNGs.

We can see how random number generators work in Stan by running our sampling method with seeds 123, 19876, and 123.

for seed in [123, 19876, 123]:
    sample = model.sample(data = data, seed = seed, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False, show_console = False)
    print(f"{seed = };  sample = {sample.stan_variable('y').astype(int)}")
seed = 123;  sample = [322 324 306 333 311 318 294 323 282 311]
seed = 19876;  sample = [286 303 286 309 289 276 317 298 328 294]
seed = 123;  sample = [322 324 306 333 311 318 294 323 282 311]

The two runs with seed 123 produce the same results. The code to extract values of y is clunky because the stan_variable() method always returns floating point values, so we convert the result to an integer-valued NumPy array.

Generally, we want our Stan programs to produce similar results with different random seeds. Substantially different results from different seeds are a red flag that there is something wrong with the combination of model and data.

3.3 Monte Carlo integration

Bayesian computation relies on averaging over our uncertainty in estimating parameters. In general, it involves computing expectations, which are weighted averages with weights given by densities. In this section, we will introduce Monte Carlo methods for calculating a simple integral corresponding to the expectation of a discrete indicator variable. We’ll use the textbook example of throwing darts at a board randomly and using the random locations to estimate the mathematical constant \(\pi\).

We start with a two-unit square centered at the origin. Then we will generate points uniformly at random in this square. For each point \((x, y)\), we will calculate whether it falls inside the unit circle inscribed in the square, which is true if the distance to the origin is less than 1, \[ \sqrt{x^2 + y^2} < 1, \] which simplifies by squaring both sides to \[ x^2 + y^2 < 1. \] The proportion of such points gives the proportion of the square’s area taken up by the circle. Because the square is \(2 \times 2\), it has an area of 4, so the circle has an area of 4 times the proportion of points falling inside the circle (i.e., in the open unit disc).

Here’s the Stan code.

stan/monte-carlo-pi.stan
generated quantities {
  real<lower=-1, upper=1> x = uniform_rng(-1, 1);
  real<lower=-1, upper=1> y = uniform_rng(-1, 1);
  int<lower=0, upper=1> inside = x^2 + y^2 < 1;
  real<lower=0, upper=4> pi = 4 * inside;
}

The program declares variables x and y, constrains them to fall in the interval \((-1, 1)\) (rounding in floating point arithmetic may produce the boundary values -1 and 1), and assigns them uniform random values. The indicator variable inside is set to 1 if the Euclidean length of the vector \(\begin{bmatrix}x & y\end{bmatrix}\) is less than 1 (i.e., it falls within the inscribed unit circle) and is set to 0 otherwise.
The variable pi is then set to four times the indicator value. As we see below, it is the sample mean of these values that is of interest.

First, we compile and then sample from the model, taking a sample size of M = 10_000 draws. Then we plot the draws.

M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/monte-carlo-pi.stan')
sample = model.sample(chains = 1, iter_warmup = 0, iter_sampling = M,
                      show_progress = False, show_console = False,
                      seed = 123)
x_draws = sample.stan_variable('x')
y_draws = sample.stan_variable('y')
inside_draws = [int(i) for i in sample.stan_variable('inside')]
pi_draws = sample.stan_variable('pi')
inside_named_draws = np.array(["out", "in"])[inside_draws]
df = pd.DataFrame({'x': x_draws, 'y': y_draws,
                   'inside': inside_named_draws})
print(
  pn.ggplot(df, pn.aes(x = 'x', y = 'y',
                group='inside', color='inside'))
  + pn.geom_point(size = 0.1)
  + pn.labs(x = 'x', y = 'y')
  + pn.coord_fixed(ratio = 1)
)

Next, we take the sample mean of the inside-the-circle indicator, which produces an estimate of the probability of a point being inside the circle. This corresponds directly to the expectation \[\begin{align} \mathbb{E}\left[4 \cdot \textrm{I}(X^2 + Y^2 < 1)\right] &= \int_{-1}^1 \int_{-1}^1 4 \cdot \textrm{I}(x^2 + y^2 < 1) \cdot p(x, y) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 4 \cdot \textrm{I}(x^2 + y^2 < 1) \cdot \textrm{uniform}(x \mid -1, 1) \cdot \textrm{uniform}(y \mid -1, 1) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 \textrm{I}(x^2 + y^2 < 1) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \pi, \end{align}\] where \(\textrm{I}()\) is the indicator, which returns 1 if its argument is true and 0 otherwise. The third line follows because each uniform density is constant at \(\frac{1}{2}\), so their product of \(\frac{1}{4}\) cancels the factor of 4; the remaining integral is the area of the open unit disc. The posterior mean of the variable inside is the probability that a random point in the 2-unit square is inside the inscribed unit circle. The posterior mean for pi is thus our estimate for \(\pi\).

Pr_is_inside = np.mean(inside_draws)
pi_hat = np.mean(pi_draws)
print(f"Pr[Y is inside circle] = {Pr_is_inside:.3f};")
print(f"estimate for pi = {pi_hat:.3f}")
Pr[Y is inside circle] = 0.786;
estimate for pi = 3.144

The true value of \(\pi\) to 3 digits of accuracy is \(3.142\), so we are close, but not exact, as is the nature of Monte Carlo methods. If we increase the number of draws, our error will go down. Theoretically, with enough draws, we can get any desired precision; in practice, we don’t have that long to wait and have to make do with only a few digits of accuracy in our Monte Carlo estimates. This is usually not a problem because statistical uncertainty still dominates our numerical imprecision in most applications; we discuss this important point later when contrasting estimation uncertainty and sampling uncertainty in the section on posterior predictive inference and when considering practical guidance on how long to run Stan.
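
For comparison, here is the same dartboard estimate written directly in NumPy rather than Stan (a sketch; the individual draws differ from Stan’s because the underlying generators differ).

# pure-NumPy dartboard estimate of pi: draw (x, y) uniformly on [-1, 1]^2
# and average 4 * I(x^2 + y^2 < 1) over the draws
import numpy as np

rng = np.random.default_rng(123)
M = 10_000
x = rng.uniform(-1, 1, size=M)
y = rng.uniform(-1, 1, size=M)
print(f"estimate for pi = {4 * np.mean(x**2 + y**2 < 1):.3f}")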

Random points are far away in high dimensions

Suppose we wanted to sample points uniformly in the unit disc. One thing we could do is sample points in the square until we draw one that is in the unit disc. In two dimensions, this is fairly efficient, with about 79% of the points falling in the circle. But what will happen in higher dimensions? Let’s write some Stan code and see.

stan/unit-hypersphere.stan
data {
  int<lower=0> D;
}
transformed data {
  array[D] real one_D = rep_array(1, D);
}
generated quantities {
  array[D] real y = uniform_rng(-1, one_D);
  int<lower=0, upper=1> inside = norm2(y) < 1;
}

In this case, we take a natural number D as input for the dimensionality of the hypercube. We have introduced a block for transformed data and declared a variable one_D to be a size-D array of 1 values. The transformed data block is executed once, as the data is read in, and its values remain constant afterward. In the generated quantities block, we use the array of 1 values to assign y to an array of values, each of which is independently generated from a \(\textrm{uniform}(-1, 1)\) distribution. This means y will be uniformly distributed within the hypercube \([-1, 1]^D\). The integer inside will be set to 1 if y falls within the unit hypersphere centered at the origin (i.e., inscribed in the hypercube).

By construction, the distance from the origin to the side of the hypercube and hence the radius of the inscribed hypersphere remain constant at 1. In contrast, the distance from the origin to a corner is \(\sqrt{D}\) in \(D\) dimensions (i.e., \(\sqrt{1^2 + 1^2 + \cdots 1^2}\) for \(D\) terms). Now let’s see what happens to the proportion of volume in the inscribed hypersphere as the dimensionality grows.

M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/unit-hypersphere.stan')
in_probs = np.repeat(1.0, 13)
for D in range(1, 13):
    sample = model.sample(chains=1, iter_warmup = 0, iter_sampling = M,
                          data = {'D' : D}, show_progress = False,
                          show_console = False, seed = 123)
    inside_draws = sample.stan_variable('inside')
    in_probs[D] = np.sum(inside_draws) / M

print(pn.ggplot(pd.DataFrame({'D':np.arange(1, 13), 'prob in hypersphere':in_probs[1:]}),
                 pn.aes(x = 'D', y='prob in hypersphere'))
       + pn.geom_line()
       + pn.geom_point(size=1)
       + pn.scale_x_continuous(breaks = [0, 2, 4, 6, 8, 10, 12]))
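
We can check these Monte Carlo estimates against a closed form: a unit \(D\)-ball has volume \(\pi^{D/2} / \Gamma(D/2 + 1)\), and the hypercube \([-1, 1]^D\) has volume \(2^D\), so the probability that a uniform draw lands in the inscribed ball is the ratio of the two. Here is a sketch using only the Python standard library.

# closed-form probability that a uniform draw from [-1, 1]^D falls in
# the inscribed unit D-ball: (pi^(D/2) / Gamma(D/2 + 1)) / 2^D
from math import pi, gamma

for D in range(1, 13):
    p_in = pi**(D / 2) / gamma(D / 2 + 1) / 2**D
    print(f"D = {D:>2};  Pr[inside] = {p_in:.6f}")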

3.4 Markov chain Monte Carlo methods

In the previous sections, we generated a sample of draws by taking a sequence of independent draws. We then just averaged results to get plug-in estimates for expectations.

With modern applied Bayesian models, it is almost never possible to generate independent draws from distributions of interest, so we cannot apply simple Monte Carlo methods. There are some restricted cases involving special model forms for which we can take independent draws, but this only works for very simple models (Diaconis and Ylvisaker 1979). Before the MCMC revolution of the 1990s, Bayesian inference was largely restricted to these simple models. Even well into the MCMC revolution in the 1990s and 2000s, researchers still used simpler model forms to improve computation in the probabilistic programming language BUGS (Lunn et al. 2012), and even Stan programs are often influenced by computational concerns (Stan Development Team 2023c).

The introduction of automatic differentiation opened up the possibility of coding the more efficient and scalable Hamiltonian Monte Carlo method in Stan, which is a form of Markov chain Monte Carlo. This has greatly expanded the class of models that can be fit in reasonable time in practice.

In Markov chain Monte Carlo methods, we base each draw on the previous draw. A sequence of random variables, each of which depends only on the previous variable generated, is called a Markov chain. That is, a sequence of random variables \(Y_1, Y_2, \ldots\) makes up a Markov chain if \[ p_{Y_{n+1} \mid Y_{1}, \ldots, Y_{n}}(y_{n + 1} \mid y_1, \ldots, y_n) = p_{Y_{n+1} \mid Y_n}(y_{n+1} \mid y_n). \] This says that \(Y_{n + 1}\) is conditionally independent of \(Y_1, \ldots, Y_{n - 1}\) given \(Y_n.\)

We can illustrate with a simple example of three Markov chains, all of which have a stationary distribution of \(\textrm{bernoulli}(0.5).\) Technically, this means that as \(n \rightarrow \infty,\) \(p_{Y_n}\) approaches the density of a \(\textrm{bernoulli}(0.5)\) distribution. As a result, the long-term average of all chains will also approach 0.5, because that’s the expected value of a \(\textrm{bernoulli}(0.5)\) variable. We will introduce a parameter \(\theta \in (0, 1)\) and take the probabilities of element \(Y_{n+1}\) depending on the previous element \(Y_{n}\) to be \[\begin{align} \Pr[Y_{n + 1} = 1 \mid Y_n = 1] &= \theta \\ \Pr[Y_{n + 1} = 1 \mid Y_n = 0] &= 1 - \theta. \end{align}\] The first line says that if the last number we generated is 1, the probability of the next element being 1 is \(\theta\). The second line says that if the last number we generated is 0, the probability of the next element being 1 is \(1 - \theta\), and thus the probability of the next element being 0 is \(\theta\). That is, there’s a probability of \(\theta\) of generating the same element as the last element.
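
We can verify numerically that \(\textrm{bernoulli}(0.5)\) is stationary. Coding the transition probabilities as a matrix in NumPy (a sketch; rows index the current state and columns the next state), the distribution \([0.5, 0.5]\) is unchanged by a transition, and repeated transitions from any starting state converge to it.

# transition matrix: rows index the current state (0 or 1), columns the
# next state; theta is the probability of repeating the previous state
import numpy as np

theta = 0.95
P = np.array([[theta, 1 - theta],
              [1 - theta, theta]])
print(np.array([0.5, 0.5]) @ P)                               # [0.5 0.5]: unchanged
print(np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, 100))  # close to [0.5 0.5]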

Here is a Stan program that generates the first \(M\) entries of a Markov chain over outputs 0 and 1, with probability \(\rho \in (0, 1)\) of generating the same output again.

stan/markov-autocorrelation.stan
data {
  int<lower=0> M;
  real<lower=0, upper=1> rho;  // prob of staying in same state
}
generated quantities {
  array[M] int<lower=0, upper=1> y;  // Markov chain
  y[1] = bernoulli_rng(0.5);
  for (m in 2:M) {
    y[m] = bernoulli_rng(y[m - 1] ? rho : 1 - rho);
  } 
}

The assignment to y[m] in this program is equivalent to this longer form.

    if (y[m - 1] == 1) {
      y[m] = bernoulli_rng(rho);
    } else {
      y[m] = bernoulli_rng(1 - rho);
    }

The more concise form exploits two features of Stan. First, boolean expressions are coded as 1 (true) or 0 (false) in Stan, like in C++ and Python. Because y[m - 1] is constrained to take on values 0 or 1, we know that y[m - 1] == 1 is equivalent to y[m - 1]. Second, it uses the ternary operator, also like in C++. The expression cond ? e1 : e2 involves three arguments, separated by a question mark (?) and a colon (:); its value is the value of e1 if cond is true and the value of e2 otherwise. Unlike an ordinary function, the ternary operator only evaluates e1 if the condition is true and only evaluates e2 if the condition is false.
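
For comparison, Python’s conditional expression plays the same role as Stan’s ternary operator and likewise evaluates only the selected branch (the variables here are hypothetical, standing in for the Stan program’s y[m - 1] and rho).

# Python analogue of Stan's `cond ? e1 : e2`: `e1 if cond else e2`
y_prev, rho = 1, 0.95
p = rho if y_prev else 1 - rho  # only the selected branch is evaluated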

We can simulate these models in Python and print the first 100 values simulated for the Markov chain \(y\).

model = csp.CmdStanModel(
               stan_file = '../stan/markov-autocorrelation.stan')
M = 100 if DRAFT else 1000
rhos = []
iterations = []
draws = []
estimates = []
for rho in [0.05, 0.5, 0.95]:
    data = {'M': M, 'rho': rho}
    sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_warmup = 0, iter_sampling = 1,
                      show_progress = False, show_console = False)
    y_sim = sample.stan_variable('y')
    cum_sum = np.cumsum(y_sim)
    its = np.arange(1, M + 1)
    ests = cum_sum / its
    draws.extend(y_sim[0])
    iterations.extend(its)
    estimates.extend(ests)
    rhos.extend(itertools.repeat(str(rho), M))
df = pd.DataFrame({'draw': draws, 'iteration': iterations,
                   'estimate': estimates, 'rho': rhos})
rho05 = np.array(df.query('rho == "0.05"').head(100)['draw'], dtype = 'int')
rho50 = np.array(df.query('rho == "0.5"').head(100)['draw'], dtype = 'int')
rho95 = np.array(df.query('rho == "0.95"').head(100)['draw'], dtype = 'int')
print("Markov chain draw with probability rho of repeating last value:\n")
print("rho = 0.05:", rho05, "\n")
print("rho = 0.50:", rho50, "\n")
print("rho = 0.95:", rho95, "\n")
Markov chain draw with probability rho of repeating last value:

rho = 0.05: [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1] 

rho = 0.50: [0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0
 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0] 

rho = 0.95: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0] 

With a 0.05 probability of staying in the same state, the Markov chain exhibits strong anti-correlation in its draws, which tend to bounce back and forth between 0 and 1 almost every iteration. In contrast, the 0.95 probability of staying in the same state means the draws have long sequences of 0s and 1s. The 0.5 probability produces independent draws from the Markov chain and we see short runs of 0s and 1s.
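
We can quantify these descriptions, using the data frame df constructed above, by computing the lag-1 sample correlation of each chain. For this transition rule, the true lag-1 autocorrelation works out to \(2\rho - 1\), that is, \(-0.9\), \(0\), and \(0.9\) for the three values of \(\rho\).

# lag-1 sample autocorrelation of each simulated chain; the true value
# is 2 * rho - 1, i.e., -0.9, 0.0, and 0.9 for rho = 0.05, 0.5, 0.95
for label in ['0.05', '0.5', '0.95']:
    y_chain = np.array(df.query(f'rho == "{label}"')['draw'])
    corr = np.corrcoef(y_chain[:-1], y_chain[1:])[0, 1]
    print(f"rho = {label:>4};  lag-1 sample correlation = {corr:.2f}")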

Next, we will show a running average of the 0 and 1 draws for 1000 iterations for the three chains.

print(
    pn.ggplot(df, pn.aes(x='iteration', y='estimate',
                  group='rho', color='rho'))
    + pn.geom_hline(yintercept = 0.5, color = 'black')        
    + pn.geom_line()
    + pn.labs(x = "iteration", y = "estimate")
)

The black horizontal line at 0.5 shows the true answer. It is clear that the anti-correlated chain (red) converges much more quickly to the true answer of 0.5 and more stably than the independent chain (green), which in turn converges much more quickly than the correlated chain (blue).

4 Hints for Python programmers

Python follows general programming language idioms like in C++ and Java, whereas Stan follows linear algebra idioms like in R and MATLAB.

4.1 Indexing from 1

Unlike Python, Stan uses the standard mathematical indexing for matrices, which starts from 1. If I declare a vector[3], then the valid indexes are 1, 2, and 3. If v is a vector variable, then v[0] is an indexing error; it throws an exception, which is caught and logged as a warning, and the resulting MCMC proposal is rejected.

4.2 Inclusive ranges

Unlike Python, Stan uses inclusive ranges, so that 1:3 represents the sequence 1, 2, 3. The main disadvantage of inclusive notation is that the length of L:U is U - L + 1.

Putting these together, a Python loop over a container with N elements looks as follows.

v = np.random.rand(N)
for n in range(0, N):
    ... process v[n] ...

The Python version visits elements v[0], v[1], …, v[N - 1].

Loops in Stan are as follows.

vector[N] v;
for (n in 1:N) {
   ... process v[n] ...
}

The Stan version visits elements v[1], v[2], …, v[N].

4.3 Strong typing

Stan variables are strongly typed. This means every variable is assigned a type when it is declared and that type never changes. Only expressions compatible with that type may be assigned, which means they must have the same type or a type that can be promoted to the variable’s type. For example, we can assign an integer to a real value, but not vice-versa.

int n = 5;
real y = n;  // OK: promote integer to real
int m = y;   // ILLEGAL! cannot demote real to integer

Similarly, we can assign a real or integer value to a complex variable. We also have what is known as covariant typing, which means if we can assign a value of type U to a variable of type V, then we can assign an array of U to an array of V.
Similarly, we can assign real vectors or matrices to their complex counterparts.

Top-level block variables and local variables must also declare their sizes and thereafter only allow assignment of properly sized values. Function argument types do not declare sizes so that functions can be defined generically to operate over arguments of any size.

Block variables can also be supplied with constraints. This does not affect assignment, but the constraints will be validated at the end of the block (or enforced implicitly in the case of the parameters block, where there is no assignment).

There are three one-dimensional (1D) real-valued container types (1D array, column vector, and row vector) and four two-dimensional (2D) containers (2D array, 1D array of column vectors, 1D array of row vectors, and matrix). These cannot be assigned to each other directly, but there are built-in functions to convert back and forth. Loops that just do reassignment are fast in Stan as there is no overhead from derivative calculations; see the sections of this document on automatic differentiation and containers for more details.

4.4 Block scope

Unlike Python or R, Stan follows the C/C++ convention whereby variables declared within a block are local to that block. For example, in this Stan program, logit_theta is only a valid variable within the conditional block; once control has left the conditional block, logit_theta falls out of scope.

if (z == 1) {
  real logit_theta = logit(theta);
} else {
  real logit_theta = logit(-theta);
}  
y ~ bernoulli_logit(logit_theta); // ILLEGAL: out of scope

If the user needs a reference to logit_theta outside of the conditional block, it can be declared outside the conditional block.

real logit_theta;
if (z == 1) {
  logit_theta = logit(theta);
} else {
  logit_theta = logit(-theta);
}  
y ~ bernoulli_logit(logit_theta);  // OK: in scope

In simple cases like this, Stan provides the ternary operator as an alternative,

real logit_theta = z == 1 ? logit(theta) : logit(-theta);

or even

real logit_theta = logit(z == 1 ? theta : -theta);

4.5 Loops are not slow

Unlike in R and Python, loops are fast in Stan because Stan is a compiled language. For operations that only involve indexing and assignment, loops can be faster in Stan than their vectorized counterparts, because they avoid intermediate allocations.

However, this is only half of the story. Whenever functions are applied to parameters, the operation, its result, and its operands are recorded in an expression graph. For operations that involve functions other than indexing or reshaping operations applied to parameters, vectorized versions of functions will almost always reduce peak memory usage and increase speed.

4.6 Whitespace and semicolons

Stan follows C/C++ conventions on whitespace, in which any sequence of whitespace characters (space, tab, newline) is semantically interchangeable. Python and R are both space-sensitive in different ways: Python uses indentation to delimit blocks, and R has an eager line-based parser. In Stan, we use semicolons (;) to mark the end of an atomic statement. This means it is OK to continue expressions on the following line without any special syntactic marker, e.g.,

lp[k] = bernoulli_lpmf(k | theta)
        + normal_lpdf(y[n] | mu[k], sigma[k]);

Both R and Python here would try to terminate the assignment to lp after the first line and leave a dangling expression for the second line. We recommend following mathematical conventions and breaking a line before an operator, ideally right before a term in a chain of additions or a factor in a chain of multiplications.

4.7 Matrices and vectors are real or complex-valued

The Stan types vector, row_vector, and matrix only allow real-valued entries. Integer values may be assigned to entries, but they are promoted to real values. There are separate types complex_vector, complex_row_vector, and complex_matrix to hold complex values. Real-valued vectors, etc., may be assigned to their complex counterparts, but not vice-versa.

There are no integer vectors, row vectors or matrices. The only containers for holding integers are arrays. Stan provides functions for converting integer arrays to vectors and matrices.

4.8 Storage order

One of the most important bottlenecks to mitigate in modern computing is how slow memory is compared to CPU. In order to take advantage of fast CPU operations, especially vectorized ones that operate in parallel, memory must be streamed in so that the CPU is never starved for memory. The problem is that when a memory value is requested, if it is not in the first level cache (L1, which holds about 64KB of data per core), it takes about 4 times as long to look in the second level cache (L2, which holds on the order of 512KB). If that misses, it goes to L3 cache, which can be quite large (up to 32MB), but is on the order of 10 times slower than fetching from L2 cache. Then if that misses, it goes to system RAM, which is about twice as slow as fetching from L3 cache. So end to end, if you get a complete cache miss and have to fetch data out of RAM, it will be on the order of 100 times slower. The CPU then just waits for data.

To get around this problem, code must be written in a way that accesses data in the order it is stored in memory. Modern CPUs and memory are very good at streaming data from RAM to the CPU very fast as long as it is accessed in order.

Matrices in Python vs. Stan

Python uses row-major layout for matrices. The following program evaluates the speed of Python traversing matrices of differing sizes in row major vs. column major order.

n_base = 128 if DRAFT else 512
ns = [n_base, n_base * 4, n_base * 16]
print("Python:")
for n in ns:
    A = np.random.rand(n, n)
    timer.start()
    for i in range(n):
        for j in range(n):
            _ = A[i, j]
    print(f"   row major: n = {n:>5};  {timer.elapsed():6.2f} seconds")
    timer.start()
    for j in range(n):
        for i in range(n):
            _ = A[i, j]
    print(f"column major: n = {n:>5};  {timer.elapsed():6.2f} seconds")
Python:
   row major: n =   512;    0.05 seconds
column major: n =   512;    0.05 seconds
   row major: n =  2048;    0.81 seconds
column major: n =  2048;    1.05 seconds
   row major: n =  8192;   13.06 seconds
column major: n =  8192;   18.16 seconds

As is evident from the results, this is much more important as matrices get bigger, because they will not fit into cache. It is also more important when there is parallelism, because that leads to much more memory contention.

Unlike Python, Stan uses a column-major memory layout for matrices. For instance, the memory layout of elements in a \(2 \times 3\) matrix \[ A = \begin{bmatrix} x_1 & x_3 & x_5 \\ x_2 & x_4 & x_6 \end{bmatrix} \] is \(x_1, x_2, x_3, x_4, x_5, x_6.\) That is, the order goes down the columns and then starts again at the top of the next column.
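
To see the difference concretely from Python, NumPy can flatten a matrix in either storage order; Fortran ('F') order matches Stan’s column-major convention, while C ('C') order matches NumPy’s own row-major default.

# column-major layout of the 2 x 3 example matrix above
import numpy as np

A = np.array([[1, 3, 5],
              [2, 4, 6]])
print(A.ravel(order='F'))  # [1 2 3 4 5 6]: column-major storage order
print(A.ravel(order='C'))  # [1 3 5 2 4 6]: row-major storage order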

This means it is more efficient to iterate down the columns of a matrix in Stan than across its rows. Consider the following Stan program, which can iterate either way.

stan/row-vs-col-major.stan
data {
  int<lower=0> N;
  int<lower=0, upper=1> row_major;
}
transformed data {
  matrix[N, N] A;
  for (j in 1:N)
    for (i in 1:N)
      A[i, j] = i;
}  
generated quantities {
  real sum = 0;
  if (row_major)
    for (i in 1:N)
      for (j in 1:N)
        sum += A[i, j];
  else
    for (j in 1:N)
      for (i in 1:N)
        sum += A[i, j];
}

It includes a transformed data block to generate a matrix and fill it cheaply. We coded the example this way because I/O is much more expensive than arithmetic in Stan (as it is in most situations). The generated quantities block then defines a scalar sum to be the sum of the matrix entries, traversing the matrix in row-major order if row_major is 1 and column-major order otherwise. We run the Stan calculation multiple times to ensure the traversal dominates the computation.

model = csp.CmdStanModel(stan_file = "../stan/row-vs-col-major.stan")
M = 1 if DRAFT else 10
print(f"Stan ({M} repetitions):")
for n in ns:
    start_time = time.time()
    fit = model.sample({'N': n, 'row_major': 1},
                       iter_warmup=0, iter_sampling=M, seed=123,
                       show_progress = False, show_console = False)
    print(f"   row major: n = {n:>5};  {(time.time() - start_time):6.2f} seconds")
    start_time = time.time()
    fit = model.sample({'N': n, 'row_major': 0},
                       iter_warmup=0, iter_sampling=M, seed=123,
                       show_progress = False, show_console = False)
    print(f"   col major: n = {n:>5};  {(time.time() - start_time):6.2f} seconds")
Stan (10 repetitions):
   row major: n =   512;    0.41 seconds
   col major: n =   512;    0.39 seconds
   row major: n =  2048;    6.36 seconds
   col major: n =  2048;    5.85 seconds
   row major: n =  8192;  106.70 seconds
   col major: n =  8192;   95.79 seconds

Arrays in Python vs. Stan

Arrays in Stan are stored in row-major order, but they are not stored in a single contiguous memory block like NumPy’s ndarray type. Arrays of primitives store the primitive values in memory-local order. Arrays of containers store pointers to the containers. Thus accessing array members can be non-local in memory, but will not require copying. In contrast, accessing a row or column of a matrix usually requires allocating memory and copying.

5 Stan example: Laplace’s live birth inverse problem

Given a forward model that generates data from parameters, the inverse problem is that of inferring parameter values from observed data.

5.1 Inverse problems

Suppose, for example, that you are Galileo Galilei. You have the hypothesis that as a ball rolls down an inclined plane, its speed will be proportional to how far it has rolled and the distance covered will be quadratic in time. But there is an unknown factor in the model, which determines how fast the ball will roll. Given this unknown factor, you could make precise predictions about the ball’s location at any given time. A model that maps from parameters like the gravitational constant to predictions is called a forward model. Such models are typically deterministic, but they do not have to be.

Galilei (1638) describes his ramp as follows.

A piece of wooden moulding or scantling, about 12 cubits [about 7 m] long, half a cubit [about 30 cm] wide and three finger-breadths [about 5 cm] thick, was taken; on its edge was cut a channel a little more than one finger in breadth; having made this groove very straight, smooth, and polished, and having lined it with parchment, also as smooth and polished as possible, we rolled along it a hard, smooth, and very round bronze ball.

To time the ball, he devised a clever and indirect measurement process.

For the measurement of time, we employed a large vessel of water placed in an elevated position; to the bottom of this vessel was soldered a pipe of small diameter giving a thin jet of water, which we collected in a small glass during the time of each descent \(\ldots\) the water thus collected was weighed, after each descent, on a very accurate balance; the difference and ratios of these weights gave us the differences and ratios of the times.

A process such as this is fraught with measurement error of weights and distances. Even though the forward model provides precise predictions, in the end, Galileo is left with less than perfect measurements of the time it took a ball to get to a prescribed position. In statistics, we typically include a probabilistic measurement model to account for how observations are made given a forward model. Sometimes, in cases like Galileo’s and complex survey or clinical trial designs, the measurement model can be more complicated than the forward model.

The inverse problem is that of reasoning backwards from observations through the measurement model and forward model to estimates of parameters like the gravitational constant. Solving general inverse problems is where Bayesian statistics shines.

5.2 Bayesian solutions to inverse problems

Bayes (1763) introduced the paradigm of statistical inference that has come to be known as Bayesian statistics.

Parametric statistics and variable types

Before introducing Bayes’s paradigm, we will try to settle on some terminology around variables. In Stan, we classify each variable as being a data variable or a parameter.

When talking about Bayesian statistics abstractly (i.e., without regard to a specific model form), we will use the variable \(y\) to represent all of the model’s data, which are values that are known or observed. Data may include

  • constants: sizes, bounds,
  • unmodeled data: covariates (aka features), constant parameters of priors, and
  • modeled data: outcome measurements, individual or group-level covariates in generative models.

Similarly, we use the variable \(\theta\) for all of our model’s parameters, which are values that are not known and must be inferred. Parameters may include

  • process parameters: parameters of the forward scientific process,
  • measurement parameters: parameters of the measurement process,
  • hyperparameters: parameters of prior distributions in hierarchical models,
  • missing data: unknown data items, latent but potentially observable values, and
  • predictive quantities: event probability indicators, predictions, forecasts, backcasts.

Constants are things like the sizes of matrices, the number of components in a mixture model, the bounds on a constrained variable, etc. Covariates are typically used as inputs (aka features, predictors, independent variables) to a regression. For instance, I might want to predict someone’s presidential vote based on their income and state of residence. In this case, the income and state of residence are unmodeled covariates and the vote can be modeled data if it is known and missing data if it is not.

In Stan, the number of parameters may depend on the data, but once the data \(y\) is fixed, the number of parameters is also fixed. This means we can implement some non-parametric models (e.g., Gaussian processes), but have to approximate others (e.g., Dirichlet processes).

The Bayesian process

Gelman et al. (2013) summarize the process of developing a Bayesian model for an applied problem as follows,

  1. Define a joint probability model over all of the observables and non-observables that is consistent with what we know about the underlying scientific process and underlying measurement process (which together form the data-generating process) and our prior knowledge.

  2. Condition on observed data to infer the posterior distribution of any quantities of interest such as predictions or event probabilities.

  3. Evaluate the model in terms of fit to data and resulting predictions, how sensitive the predictions are to model assumptions, and if necessary, go back to step (1) and refine the model.

By joint probability here, we mean a multivariate distribution over all of the observed variables \(y\) and all of the unobserved variables \(\theta,\) for example, as given by a density \(p(y, \theta).\) Usually, the joint probability density \(p(y, \theta)\) is factored as \(p(y \mid \theta) \cdot p(\theta)\) into the product of

  • the sampling distribution combining a forward model and measurement model with density \(p(y \mid \theta),\) and

  • the prior distribution, capturing what we know about the parameters ahead of time, with density \(p(\theta).\)

In model formulation (step 1), we choose the verb “know” rather than “believe” because the choices of prior and likelihood are both meant to be scientific processes, not tied up with personal convictions. See Section 1: Pragmatic Bayesian statistics for background.

Stan helps with all three steps in the Bayesian workflow: (1) expressing models, (2) performing posterior inference, and (3) model analysis and comparison.

Bayes’s formulation

Bayes (1763) formulated the inverse problem of determining a posterior distribution with density \(p(\theta \mid y)\) from a prior with density \(p(\theta)\) and a sampling distribution with density \(p(y \mid \theta)\). Bayes provided the following derivation, which has come to be known as Bayes’s rule. \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule for densities]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta) \, \textrm{d}\theta} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. & \textrm{[chain rule for densities]} \end{align}\]

The first steps reveal the key densities involved in Bayesian inference for data \(y\) and parameters \(\theta\), \[ \underbrace{p(\theta \mid y)}_{\textrm{posterior}} = \underbrace{p(y \mid \theta)}_{\textrm{likelihood}} \cdot \underbrace{p(\theta)}_{\textrm{prior}} \ / \underbrace{p(y)}_{\textrm{evidence}}. \] When considered as a function of the parameter \(\theta\) for fixed data \(y\), as it is in Bayes’s rule, the density \(p(y \mid \theta)\) is known as the likelihood.

The next step in the derivation expresses the evidence in terms of the likelihood and prior. \[ p(y) = \int_{\Theta} p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta. \] In applications, this integral is usually intractable in the sense of there being no closed-form analytic solution in terms of elementary functions. For example, even a simple logistic regression presents an intractable integral for the evidence, with \[\begin{align} p(\theta) &= \textrm{normal}(\theta \mid 0, 1), \\[2pt] p(y \mid \theta, x) &= \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot \theta)). \end{align}\]

The good news is that most applications of Bayesian statistics do not require us to solve the integral presented in the denominator of Bayes’s rule, even numerically. Rather than computing the evidence, we work up to a proportion (i.e., a multiplicative constant), where we have
\[ p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), \] because \(p(y)\) is a constant in the sense that it does not depend on \(\theta\). This is what lets us express the log posterior as the sum of the log likelihood and the log prior in Stan, \[ \log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) + \textrm{const}, \] where the constant is the negative log evidence, \(- \log p(y)\).
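
To make this concrete, here is a minimal sketch in Python of an unnormalized log posterior for a binomial model with a uniform prior; the function name is ours, and constant terms that do not depend on theta are dropped, just as they may be dropped when working up to a proportion.

def log_posterior_unnorm(theta, y, N):
    # log likelihood + log prior, dropping terms constant in theta
    log_likelihood = y * np.log(theta) + (N - y) * np.log1p(-theta)
    log_prior = 0.0    # uniform beta(1, 1) prior on (0, 1)
    return log_likelihood + log_prior

print(f"{log_posterior_unnorm(0.5, 6, 10) = }")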

One case where we do need the evidence, or normalizing constant, is in evaluating Bayes factors for model comparison. Following Gelman et al. (2013), we prefer other means of evaluating models, such as posterior predictive checks or cross-validation, both of which are beyond the current scope of this getting-started tutorial.

5.3 Live birth data

A decade after Bayes published his rule, Pierre Simon Laplace (1774) used Euler’s beta function to derive a closed form to solve his inverse problem. In this section, we will work through Laplace’s analysis using Stan.

Laplace gathered data on the sexes of babies in live births in Paris between 1745 and 1770.

Live births in Paris between 1745 and 1770.

  sex       live births
  female        105,287
  male          110,312

It sure looks like the probability of boy births is higher than that of girl births.

5.4 Laplace’s model

With \(y\) male births out of \(N\) total births, Laplace adopted the sampling distribution \[ y \sim \textrm{binomial}(N, \theta), \] which requires a number of trials \(N \in \mathbb{N}\), chance of male birth \(\theta \in (0, 1)\), and number of male births \(y \in \{ 0, 1, \ldots, N \}.\)

Laplace used the prior \[ \theta \sim \textrm{beta}(1, 1), \] where \[ \textrm{beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1} \] for \(\alpha, \beta \in (0, \infty)\) and \(\theta \in (0, 1)\). The distribution \(\textrm{beta}(1, 1)\) is uniform over probabilities \(\theta \in (0, 1)\) because the density is proportional to a constant, \[\begin{align} \textrm{beta}(\theta \mid 1, 1) &\propto \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} \\[4pt] &= 1. \end{align}\]

Analytic posterior

As an aside, Laplace’s model here is simple enough to solve for the posterior analytically as we show in Appendix D.3: Conjugate priors.

\[\begin{align} p(\theta \mid y, N) &\propto p(y \mid N, \theta) \cdot p(\theta) \\[4pt] &= \textrm{binomial}(y \mid N, \theta) \cdot \textrm{beta}(\theta \mid 1, 1) \\[4pt] &\propto \theta^y \cdot (1 - \theta)^{N - y} \cdot \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} \\[4pt] &= \theta^{y} \cdot (1 - \theta)^{N - y} \\[4pt] &\propto \textrm{beta}(\theta \mid y + 1, N - y + 1). \end{align}\] Despite the intermediate proportional-to steps, the normalizing constant is unique, so we can conclude that \[ p(\theta \mid y, N) = \textrm{beta}(\theta \mid y + 1, N - y + 1). \] This entire derivation went through without ever worrying about the normalizing constant for the beta distribution or the binomial distribution.
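
As a quick numerical check of the analytic result (a sketch using NumPy’s beta sampler and Laplace’s data from above), the mean of \(\textrm{beta}(y + 1, N - y + 1)\) matches a direct Monte Carlo estimate.

y = 110_312             # male births
N = 110_312 + 105_287   # total births
a, b = y + 1, N - y + 1 # analytic posterior is beta(a, b)
print(f"analytic posterior mean = {a / (a + b):.3f}")
draws = np.random.default_rng(123).beta(a, b, size=100_000)
print(f"Monte Carlo estimate    = {draws.mean():.3f}")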

5.5 Stan program for birth data

Unlike the first Stan model we saw, which only generates data, the following Stan program requires data to be provided, specifically the number of male births (\(y\)) and the total number of births (\(N\)). The model will then allow us to estimate the probability of a male birth (\(\theta\)) as well as the probability that boys are more likely to be born than girls (\(\theta > 0.5\)). The Stan program for Laplace’s model is as follows.

stan/sex-ratio.stan
data {
  int<lower = 0> N;
  int<lower = 0, upper = N> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);
  y ~ binomial(N, theta);
}
generated quantities {
  int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
}

5.6 Parameter and model blocks

In this Stan program, we see that both the number of total births \(N\) and the number of male births \(y\) are given as data. Then there are two additional blocks we did not see in our earlier program: a parameters block, which is used to declare unknown values (here, just the male birth rate \(\theta\)), and a model block, which is where we define our target log density up to an additive constant. The target log density is typically a Bayesian posterior expressed as a log prior and log sampling distribution. The parameters block declares the type of theta, which is a real value constrained to fall in the interval \([0, 1]\). The model block defines the prior, which we take to be uniform over the possible values for theta. The model block also defines the sampling distribution, which codes the assumption that the observed data y was generated from a binomial distribution with N trials and probability theta of a male birth. Finally, we have a generated quantities block that defines a single binary indicator variable, boys_gt_girls, which takes the value 1 if the probability of a boy birth is greater than that of a girl birth and 0 otherwise.

5.7 Sampling from the posterior

When we run a Stan program, Stan generates a sequence of \(M\) random draws, which are approximately identically distributed according to the posterior, \[ \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid y). \] If we were to take \(M \rightarrow \infty\), the draws would converge to being identically drawn from the posterior. In simple or well-behaved models, coupling arguments show that they converge to true draws from the posterior in hundreds or thousands (or sometimes more) draws (Jacob, O’Leary, and Atchadé 2020); they become numerically indistinguishable from true draws well before that time.

Stan’s sampler

Stan uses a Markov chain Monte Carlo (MCMC) algorithm, which can lead to autocorrelation in the random draws from the posterior. That is, the draws are not typically independent, with each draw being correlated (or anti-correlated) with the previous draw.

Autocorrelation does not introduce bias into the Monte Carlo estimates. Positively autocorrelated draws, which we see in more complex models, increase estimation variance compared to independent draws. Increased variance increases expected square error, which combines error due to bias (zero here) and error due to variance. Negatively autocorrelated draws, which we see in very simple models, reduce variance relative to independent draws and hence reduce expected square error.

In order to scale to high dimensional problems, we need to exploit gradients and/or Hessians. Duane et al. (1987) introduced the gradient-based Hamiltonian Monte Carlo (HMC) algorithm; the overview by Neal (2011) discusses theoretical and practical scaling results and Livingstone et al. (2019) provide geometric ergodicity results.

For Stan, Hoffman and Gelman (2014) developed an adaptive form of Hamiltonian Monte Carlo (HMC) known as the no-U-turn sampler (NUTS), and Betancourt (2017a) improved it with an adaptive preconditioner and multinomial sampling over trajectories. NUTS can be hyper-efficient in the sense of generating anti-correlated draws that can lead to more efficient Monte Carlo estimates than independent draws.

Fitting in Stan

We fit Laplace’s model by compiling the model, constructing a dictionary for the data, and then calling the sample method on the compiled model with the dictionary. We call the sample method with 1,000 warmup iterations and 10,000 sampling iterations; we are taking so many draws in order to show smooth plots later. The Stan programs considered in this introduction are all quite simple and inexpensive to run for many iterations. We consider how many draws are required for statistical inference in the section on practical guidelines.

model = csp.CmdStanModel(stan_file = '../stan/sex-ratio.stan')
boys = 110312
girls = 105287
data = {'N': boys + girls, 'y': boys}
M = 100 if DRAFT else 10_000
sample = model.sample(data = data, seed = 123,
                      iter_sampling = M, iter_warmup = 1000,
                      show_progress = False, show_console = False)

As before, we proceed by first extracting the draws for the variables theta and boys_gt_girls.

theta_draws = sample.stan_variable('theta')
boys_gt_girls_draws = sample.stan_variable('boys_gt_girls')

We can plot a histogram of approximate draws \(\theta^{(m)} \sim p(\theta \mid N, y)\) from the posterior to give us a sense of the value of \(\theta\) and its uncertainty given our observed data \(y\).

print(
  pn.ggplot(pd.DataFrame({'theta': theta_draws}),
            pn.aes(x = 'theta')) +
  pn.geom_histogram(color='white') +
  pn.labs(x = 'θ') +
  pn.theme(axis_text_y = pn.element_blank(),
           axis_title_y = pn.element_blank(),  
           axis_ticks_major_y = pn.element_blank())
)

All of the draws have a value for \(\theta\) between 0.50 and 0.52. In the next sections, we will see how to use these draws to estimate a single value for \(\theta\) as well as to compute probabilities, such as the probability that \(\theta > 0.5\) or \(\theta > 0.51\).

5.8 Bayesian point estimates

In Bayesian terms, a point estimate for a parameter \(\Theta\) conditioned on some observed data \(Y = y\) is a single value \(\hat{\theta} \in \mathbb{R}^D\) that in some way summarizes the posterior \(p(\theta \mid y)\). The notation \(\hat{\theta}\) is conventional statistics notation for an estimate of a parameter \(\theta.\) In this section we define three estimators and discuss how the two Bayesian estimators minimize a loss function between the true value and estimate. We come back to the loss function and the properties of the estimators after defining them.

Posterior mean estimator

The most common Bayesian point estimate for a parameter is the posterior mean,

\[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid Y = y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta \\[6pt] &= \lim_{M \rightarrow \infty} \, \frac{1}{M} \sum_{m=1}^M \theta^{(m)} \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \theta^{(m)}, \end{align}\] where in the last two lines, each draw is distributed approximately according to the posterior, \[ \theta^{(m)} \sim p(\theta \mid y). \]

We have snuck in conditional expectation notation in the first line of this definition. Expectations are just weighted averages, with weights given by a probability density. Bayesian inference involves expectations over the posterior, the concise notation for which is conditional expectation notation, \[ \mathbb{E}\! \left[ f(\Theta) \mid Y = y \right] \ = \ \int_{\mathbb{R}^N} f(\theta) \cdot p_{\Theta \mid Y}(\theta \mid y) \, \textrm{d}\theta, \] where \(\Theta\) and \(Y\) are random variables, whereas \(\theta\) and \(y\) are ordinary bound variables.

For Laplace’s model, the estimate for the male birth rate \(\theta\) conditioned on the birth data \(y\) is calculated as the sample mean for the extracted draws for theta.

theta_hat = np.mean(theta_draws)
print(f"estimated theta = {theta_hat:.3f}")
estimated theta = 0.512

Posterior median estimator, quantiles, and intervals

A popular alternative Bayesian point estimate is the posterior median, \(\theta^+\). The median is such that for each dimension \(d \in 1{:}D\), \[ \Pr[\Theta_d \leq \theta^+_d] = \frac{1}{2}. \]

The posterior median can be calculated by taking the posterior median of the draws, as follows.

theta_plus = np.median(theta_draws)
print(f"estimated (median) theta = {theta_plus:.3f}")
estimated (median) theta = 0.512

Because our posterior distribution is nearly symmetric with Laplace’s data, the posterior mean and posterior median are very close.

Quantiles

Other posterior quantiles are estimated the same way. For example, if we want the posterior 95% quantile, we just take the empirical 95% point in the sorted draws. Here are the 5% and 95% quantiles for Laplace’s posterior, calculated with empirical quantiles.

quantile_05 = np.quantile(theta_draws, 0.05)
quantile_95 = np.quantile(theta_draws, 0.95)
print(f"""0.05 quantile = {quantile_05:.3f};
0.95 quantile = {quantile_95:.3f}""")
0.05 quantile = 0.510;
0.95 quantile = 0.513

Posterior intervals

Together, the 5% quantile and 95% quantile give us the bounds of our 90% central probability interval, which is defined as the interval containing 90% of the posterior probability mass such that half of the remaining mass (5%) is below the interval and half (5%) is above.

print(f"central 90% posterior interval for theta")
print(f"    = ({quantile_05:.3f}, {quantile_95:.3f})")
central 90% posterior interval for theta
    = (0.510, 0.513)

Other intervals are computed in the exact same way.

Estimation error and bias

The error of an estimate is its difference from the true value, \[ \textrm{err} = \hat{\theta} - \theta. \] Our estimate \(\hat{\theta}\) is implicitly a function of the data \(y\) and so is \(\textrm{err}\), so we can make this explicit and write \[ \textrm{err}(y) = \hat{\theta}(y) - \theta. \]

The bias of an estimator is defined as its expected error (as averaged over the data distribution for the random variable \(Y\)), \[\begin{align} \textrm{bias} &= \mathbb{E}[\textrm{err}(Y)] \\[6pt] &= \mathbb{E}[\hat{\theta}(Y) - \theta] \\[6pt] &= \int_Y \left( \hat{\theta}(y) - \theta \right) \cdot p(y \mid \theta) \, \textrm{d}y. \end{align}\]
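
We can estimate bias by simulation. The following sketch (our own construction, with an arbitrary assumed true value theta_true) simulates many binomial data sets and averages the error of the posterior mean \((y + 1) / (N + 2)\) implied by the analytic beta posterior derived above; the estimated bias here is indistinguishable from zero at this simulation size.

rng = np.random.default_rng(123)
theta_true = 0.51     # assumed true value for the simulation
N = 215_599
y_sim = rng.binomial(N, theta_true, size=10_000)  # 10,000 simulated data sets
theta_hat = (y_sim + 1) / (N + 2)  # posterior mean under a beta(1, 1) prior
print(f"estimated bias = {np.mean(theta_hat - theta_true):+.2e}")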

Posterior mode estimator

A popular non-Bayesian estimator is the posterior mode, which is just the value at which the posterior density is highest, \[ \theta^* = \textrm{arg max}_\theta \ p(\theta \mid y). \] The estimate \(\theta^*\) is often called the maximum a posteriori (MAP) estimate. The posterior mode is not a Bayesian estimator because it does not average over uncertainty to minimize a loss function with respect to the true parameter values. It suffers the further problem that it might not even exist, either for models where the density grows without bound (e.g., hierarchical Bayesian models) or for simple distributions like \(\textrm{exponential}(1)\), whose density increases toward a point excluded from the support.

Loss functions and properties of estimators

The posterior mean is a popular Bayesian estimator for two reasons. First, it is unbiased, in the sense that its expected error is zero. Second, it has the minimum expected square error among unbiased estimators, where the square error of an estimate is defined by \[ \textrm{err}^2(y) = \left(\hat{\theta}(y) - \theta\right)^2. \] This is our first example of a loss function, which is a function of an estimate \(\hat{\theta}\) and true value \(\theta.\) Posterior means might not exist if the posterior has very wide tails, as in the standard Cauchy distribution.

The posterior median \(\theta^+\) has three pleasant properties. First, it is always well defined, even for densities with poles or very wide tails. Second, it minimizes expected absolute error. Third, the median is less sensitive to changes in outliers, as one might expect from minimizing absolute rather than squared error. This is easy to see with code.

print(f"{np.median([1, 10, 11]) = }")
print(f"{np.median([1, 10, 100]) = }")
print(f"{np.median([1, 20, 100]) = }")
print(f"{np.mean([1, 10, 11]) = }")
print(f"{np.mean([1, 10, 100]) = }")
print(f"{np.mean([1, 20, 100]) = }")
np.median([1, 10, 11]) = 10.0
np.median([1, 10, 100]) = 10.0
np.median([1, 20, 100]) = 20.0
np.mean([1, 10, 11]) = 7.333333333333333
np.mean([1, 10, 100]) = 37.0
np.mean([1, 20, 100]) = 40.333333333333336

Another way to see this is that if you have a sequence \(x = x_1, x_2, x_3\) with \(x_1 < x_2 < x_3\), then \(\frac{\textrm{d}}{\textrm{d}x_3} \textrm{median}(x) = 0,\) as seen in the first two results. On the other hand, the median is more sensitive to changes in the central value, with \(\frac{\textrm{d}}{\textrm{d}x_2} \textrm{median}(x) = 1\) versus \(\frac{\textrm{d}}{\textrm{d}x_2} \textrm{mean}(x) = \frac{1}{3}.\)

We will concentrate on posterior means in this quick introduction to Bayes and Stan.

(Markov chain) Monte Carlo error and effective sample size

The Markov chain we use to sample is itself a random variable in the sense that it is made up of a sequence of random variables (the twist is that it’s infinite dimensional, and thus takes values in \(\mathbb{R}^\mathbb{N}\)). Re-running the sampler will produce slightly different results due to Monte Carlo error (the error introduced by using only a finite sample of \(M\) draws).

Stan reports Markov chain Monte Carlo standard error along with its estimates of the mean. For a scalar parameter \(\theta_d\), the MCMC standard error is defined as \[ \textrm{mcmc-se} = \frac{\textrm{sd}[\Theta_d \mid Y = y]}{\sqrt{N^{\textrm{eff}}}}, \] where the numerator is the posterior standard deviation of the parameter \(\theta_d\) and \(N^{\textrm{eff}}\) is the effective sample size. The posterior variance and standard deviation are defined as follows.

\[\begin{align} \textrm{var}\left[\Theta_d \mid Y = y\right] &= \mathbb{E}\left[\left(\Theta_d - \mathbb{E}\left[\Theta_d \mid Y = y\right]\right)^2 \ \big| \ Y = y\right] \\[6pt] \textrm{sd}\left[\Theta_d \mid Y = y\right] &= \sqrt{\textrm{var}[\Theta_d \mid Y = y]}. \end{align}\]

In the usual central limit theorem, the sample size (number of independent draws) appears in place of \(N^{\textrm{eff}}\). The effective sample size for a sample of size \(M\) is defined to be \[ N^{\textrm{eff}} = \frac{M}{\textrm{IAT}}, \] where \(\textrm{IAT}\) is the integrated autocorrelation time. We are not going to define it formally, but it may be thought of as the interval between effectively independent draws in our Markov chain. If we have low autocorrelation, \(\textrm{IAT}\) will be close to 1; if the autocorrelation is high, the \(\textrm{IAT}\) can be much larger. If the \(\textrm{IAT}\) is much higher than 100, it can become difficult to estimate. If the autocorrelation is negative, the \(\textrm{IAT}\) is less than 1 and the effective sample size is larger than the number of draws. Thus \(N^{\textrm{eff}}\) is the number of independent draws that would lead to the same error as our correlated draws from the Markov chain.
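
As a sketch of how these pieces fit together, we can reconstruct the MCMC standard error for theta from the posterior standard deviation of the extracted draws and the effective sample size that Stan reports (the N_Eff value here is taken from the summary shown in the next section).

sd_theta = np.std(theta_draws)  # posterior standard deviation estimate
n_eff = 15_300                  # N_Eff for theta from Stan's summary below
print(f"mcmc-se = {sd_theta / np.sqrt(n_eff):.1e}")  # matches the MCSE column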

5.9 Estimating event probabilities

Laplace wasn’t looking for a point estimate for \(\theta\). He wanted to know the probability that \(\theta > \frac{1}{2}\) after observing \(y\) male births in \(N\) trials. In the notation of probability theory, he wanted to estimate an event probability.

A subset of the parameter space is known as an event. We can convert conditions on parameters into events. For example, the condition \(\theta > \frac{1}{2}\) can be turned into the event \[ A = \left\{ \theta \in \Theta : \theta > \frac{1}{2} \right\}. \] Events are what are assigned probabilities by a measure in probability theory. Given a probability measure, the probability of the event \(A\), that the rate of boy births is higher than that of girl births, will be well defined. Because we can convert conditions to events, we will be sloppy and treat the conditions as if they were events. This allows us to write \(\Pr\!\left[\Theta > \frac{1}{2} \, \big| \, N, y\right]\) for the probability of the event \(\Theta > \frac{1}{2}\).

Event probabilities via indicators

The indicator function \(\textrm{I}\) maps propositions into the value 1 if they are true and 0 if they are false. For example, \(\textrm{I}(\theta > \frac{1}{2}) = 1\) if the proposition \(\theta > \frac{1}{2}\) is true, i.e., when \(\theta\) is greater than one half.

Event probabilities are defined as posterior conditional expectations of indicator functions for events. \[\begin{align} \Pr[\Theta > 0.5 \mid N, y] &= \mathbb{E}\!\left[\textrm{I}[\Theta > 0.5] \,\big|\, N, y\right] \\[8pt] &= \int_{\Theta} \textrm{I}(\theta > 0.5) \cdot p(\theta \mid N, y) \, \textrm{d}\theta \\[8pt] &\approx \frac{1}{M} \sum_{m=1}^M \textrm{I}(\theta^{(m)} > 0.5), \end{align}\] where we assume \(\theta^{(m)} \sim p(\theta \mid N, y)\) is distributed according to the posterior for \(m \in 1{:}M\). Following physics conventions, we use square brackets for functors (functions that apply to functions); that means we write \(\textrm{I}[\cdot]\) when we apply the indicator function to a random variable and \(\textrm{I}(\cdot)\) when we apply it to a primitive scalar.
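
The same indicator-function averaging works directly in Python on the extracted draws: a boolean comparison yields an array of 0/1 values whose mean estimates the event probability. For example (a sketch), for the events \(\theta > 0.5\) and \(\theta > 0.51\):

print(f"Pr[theta > 0.50] ≈ {np.mean(theta_draws > 0.50):.3f}")
print(f"Pr[theta > 0.51] ≈ {np.mean(theta_draws > 0.51):.3f}")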

Events as indicators in Stan

In Stan, we code the value of the indicator function directly and assign it to a variable in the generated quantities block.

generated quantities {
  int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
}  

Conditional expressions like theta > 0.5 take on the value 1 if they are true and 0 if they are false. In mathematical notation, we would write \(\textrm{I}(\theta > 0.5)\), which takes on value 1 if \(\theta > 0.5\) and 0 otherwise; in Stan, like in C++, we treat > as a binary operator that returns either 0 or 1, so we just write theta > 0.5 in Stan.

The answer to Laplace’s question

The posterior mean of the variable boys_gt_girls is thus our estimate for \(\Pr[\theta > 0.5 \mid N, y]\). It is essentially 1. Printing to 15 decimal places, we see

Pr_boy_gt_girl = np.mean(boys_gt_girls_draws)
print(f"estimated Pr[boy more likely] = {Pr_boy_gt_girl:.15f}")
estimated Pr[boy more likely] = 1.000000000000000

The value of 1 returned as an estimate brings up the important problem of numerical precision. As we can see from the histogram, all of our sampled values for \(\theta\) are greater than \(\frac{1}{2}\).

Laplace calculated the result analytically, which is \[ \Pr\!\left[\Theta > \frac{1}{2} \ \bigg| \ N, y\right] \approx 1 - 10^{-27}. \] Thus we would need an astronomical number of posterior draws before we would generate a value of \(\theta\) less than \(\frac{1}{2}\). As given, the answer of 1.0 is very close to the true answer and well within our expected Monte Carlo error. As an aside, using the standard double-precision (8 byte, 64 bit) floating point representation for real numbers, the number \(1 - 10^{-27}\) will round to \(1\) because machine precision, which is defined as the smallest \(\epsilon\) such that \(1 + \epsilon \neq 1\) in floating point, is about \(10^{-16}\). Let’s try it in Python, which, like Stan, uses double-precision arithmetic by default.

print(f"{(1.0 == (1.0 - 1e-27)) = }")
(1.0 == (1.0 - 1e-27)) = True

which one would expect given \(10^{-27}\) is below the machine precision,

print(f"Machine precision: {np.finfo(float).eps}")
Machine precision: 2.220446049250313e-16

MCMC summary statistics from Stan

With Stan, we can print a summary for the variable \(\theta\) in the posterior, which reports all of these values. We just call the .summary() function on the sample.

sample.summary(sig_figs = 3)
                     Mean      MCSE    StdDev          5%          50%          95%    N_Eff  N_Eff/s  R_hat
lp__          -149000.000  0.004790  6.66e-01  -149000.00  -149000.00  -149000.00  19300.0  39600.0    1.0
theta               0.512  0.000009  1.09e-03        0.51       0.512       0.513  15300.0  31500.0    1.0
boys_gt_girls       1.000       NaN  9.38e-14        1.00       1.000       1.000      NaN      NaN    NaN

The rows are for our random variables, with lp__ indicating the unnormalized log density defined by the Stan program. The unnormalized log density includes any change-of-variables adjustments required for transforming constrained parameters; the transforms for constrained parameters and associated change-of-variables adjustments are detailed in (Stan Development Team 2023b).

The other two rows are for the variables defined in the Stan program, theta and boys_gt_girls. The number of significant figures used in the results can be controlled in the summary function, as can the quantiles being reported.

The first column reports the posterior mean, which agrees with our earlier calculations for both variables. The second column is the Monte Carlo standard error (based on an estimated effective sample size), and the third is a posterior standard deviation estimate. The next three columns are the quantiles we computed earlier, and they also agree with our calculations. Next is the effective sample size, which can vary from variable to variable, and the effective sample size rate (per second). The final column reports \(\widehat{R}\), which we discuss in the next section.

6 Warmup and convergence monitoring

When running Markov chains, we want to make sure that we have moved far enough that our draws are approximately from the posterior. A standard way to monitor convergence is to start multiple Markov chains at different initializations (ideally chosen from a diffuse initialization like a draw from the prior) and measure whether they are producing draws from the same distribution.

6.1 Warmup

During its initial warmup iterations, Stan tries to find the high probability mass region in which it should be sampling, adapt a good step size, and estimate posterior variance. The variance is used to precondition the sampler through scaling; see (Neal 2011) for details. Stan can also estimate a dense covariance matrix and precondition with rotation and scaling (i.e., with a Euclidean metric).

Warmup converges when the step size and posterior covariance estimates no longer change. With multiple chains, it’s possible to test that they have all converged to roughly the same step size and covariance estimate. Unless things are going wrong, we typically don’t bother measuring whether adaptation has converged and instead measure our end goal directly, which is whether we are getting reasonable posterior draws after warmup.

Warmup doesn’t form a single coherent Markov chain because it uses memory to adapt. Once Stan starts sampling, the result is a Markov chain. All of our posterior analysis will be with draws from the Markov chain, not from warmup. We can save and extract the warmup draws to investigate the behavior of warmup.

6.2 Potential scale reduction and \(\widehat{R}\)

Stan uses the potential scale reduction statistic \(\widehat{R}\) (pronounced “R hat”). Given a sequence of Markov chains, Stan splits each of them in half to make sure the first half and second half of the chain agree, then calculates variances within each chain and across all chains and compares. The statistic \(\widehat{R}\) converges to 1 as the Markov chains converge to the same distributions.
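
Here is a minimal sketch of the split-\(\widehat{R}\) computation for a single quantity, without the rank normalization and other refinements of Stan’s production implementation; the function name and the synthetic check are ours.

def split_rhat(chains):
    # chains: array of shape (num_chains, num_draws)
    n = chains.shape[1] // 2
    halves = np.vstack([chains[:, :n], chains[:, n:2 * n]])  # split each chain in half
    W = halves.var(axis=1, ddof=1).mean()    # mean within-chain variance
    B = n * halves.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
print(f"{split_rhat(rng.normal(size=(4, 1000))) = :.3f}")  # near 1 for well-mixed chains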

6.3 How many chains for how long?

A simple rule of thumb is to run four chains until \(\hat{R} \leq 1.01\) and effective sample size (ESS) is greater than 100. The reason we recommend an effective sample size of “only” 100 is that it means the standard error will be \(\frac{1}{10}\) the size of the standard deviation. Given that posterior standard deviation represents residual uncertainty, calculating means to higher precision is rarely worthwhile.

The easiest way to achieve \(\widehat{R} \leq 1.01\) and \(N^{\textrm{eff}} > 100\) is to start with 100 warmup iterations and 100 sampling iterations. If there are \(\widehat{R}\) values that are too large or there are effective sample size values that are too low, then double the number of sampling and warmup iterations, and try again. Running more warmup iterations is important because sampling will not be efficient if warmup has not converged. At most, using the same number of warmup and sampling iterations costs a factor of two over the optimal settings, which are not known in advance.

Even if we use more than four chains, we need to make sure that our effective sample size is at least 25 per chain. It is not that we need so many draws for inference, but that we do not trust our effective sample size estimator if it is much lower. One way to check that the ESS estimator is OK is to double the number of draws and make sure that the ESS also doubles. If it doesn’t, it’s a sign that the first ESS estimate is unreliable.

6.4 Running chains concurrently

You can set the number of chains that are run using the chains argument of the sample() method, and you can set the maximum number of chains to execute concurrently using parallel_chains. If you set the maximum number of parallel chains too low, CPU resources may go unused. If you set it too high, then either CPU or memory can bottleneck performance and make the run slower overall than running with fewer concurrent chains. The only advice I can give here is to experiment. In personal projects on our own hardware, the goal is usually the largest effective sample size in the minimum amount of time. On the other hand, I sometimes find I need to leave enough processing power left over to continue to work on documents, email, etc. In a server setting, memory usage, latency, throughput, and I/O need to be balanced even more carefully against resource usage.
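
For example, the following sketch runs four chains with up to four executing concurrently for the sex-ratio model and data above (chains and parallel_chains are CmdStanPy arguments to sample()).

sample = model.sample(data = data, seed = 123,
                      chains = 4, parallel_chains = 4,
                      iter_warmup = 1000, iter_sampling = 1000,
                      show_progress = False, show_console = False)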

7 Stan example: A/B testing

A common application of statistics is to compare two things, such as the effectiveness of a new drug versus the current drug used to treat a condition. Another application might compare the effectiveness of two different advertisement presentations in getting users to click through. This is usually called A/B testing in a nod to comparing a hypothetical option A and option B.

Let’s consider three good Mexican restaurants in New York City, Downtown Bakery II in the East Village, Taqueria Gramercy in Gramercy, and La Delicias Mexicanas in Spanish Harlem. Here’s the number of reviews and 5-star reviews for these restaurants on the web site Yelp.

  name                    5-star reviews   total reviews
  Downtown Bakery II                 141             276
  Taqueria Gramercy                   84             143
  La Delicias Mexicanas               41              87

We can estimate a few things. First, we can estimate the probability that each restaurant really is a 5-star restaurant. We will parameterize this directly with a rate of 5-star reviews parameter for each restaurant. Then we can rank the restaurants based on their probability of being a 5-star restaurant. What does it mean to “be a 5-star restaurant” in this sense? It means getting 5-star reviews from Yelp reviewers. Our model is going to treat the reviewers as exchangeable in the sense that we don’t know anything to distinguish them from one another. Now whether this notion of 5-star restaurant is useful will depend on how similar the reader is to the population of reviewers.

We will assume that the number of 5-star reviews for a restaurant \(k \in 1{:}K\) depends on its underlying quality \(\theta_k \in [0, 1]\), \[ n_k \sim \textrm{binomial}(N_k, \theta_k). \] Here \(N_k \in \mathbb{N}\) is the number of reviews for restaurant \(k\) and \(n_k \in 0{:}N_k\) the number of 5-star reviews. We further assume the probabilities of 5-star reviews have a beta distribution, \[ \theta_k \sim \textrm{beta}(\alpha, \beta). \] In a beta distribution, the sum \(\alpha + \beta\) determines how much to regularize estimates toward \(\alpha / (\alpha + \beta)\), with \(\alpha = \beta = 1\) providing a uniform distribution.

For inference, we will be interested in the posterior distribution \(p(\theta \mid N, n)\) of 5-star review probabilities. This gives us the posterior density of the restaurants’ probability of receiving a 5-star review. With this posterior, we can rank restaurants based on their probability of receiving a 5-star review and calculate the probability that each is the best restaurant, \[\begin{align} \Pr[\Theta_k = \max(\Theta) \mid N, n] &= \mathbb{E}\!\left[\textrm{I}[\Theta_k = \max(\Theta)] \mid N, n\right] \\[4pt] &= \int \textrm{I}(\theta_k = \max(\theta)) \cdot p(\theta \mid N, n) \, \textrm{d}\theta \\[4pt] &\approx \frac{1}{M} \sum_{m=1}^M \textrm{I}\!\left(\theta^{(m)}_k = \max\!\left(\theta^{(m)}\right)\right), \end{align}\] where we draw \(\theta^{(m)} \sim p(\theta \mid N, n)\) from the posterior.

7.1 Stan model for A/B testing

We will assume there are a total of \(K\) items being compared. In traditional A/B testing, \(K = 2\), but our example uses \(K = 3\) and our code works for any \(K\). We define the indicator array for our event probability estimates in the generated quantities block.

stan/ab-test.stan
data {
  int<lower=0> K;
  array[K] int<lower=0> trials;
  array[K] int<lower=0> successes;
  real<lower=0> alpha;  
  real<lower=0> beta;  
}
parameters {
  array[K] real<lower=0, upper=1> theta;
}
model {
  successes ~ binomial(trials, theta);
  theta ~ beta(alpha, beta);
}
generated quantities {
  array[K] int<lower=0, upper=1> is_best;
  for (k in 1:K) {
    is_best[k] = theta[k] == max(theta);
  }
}

We have coded the data for \(K\) items directly in terms of the number of trials and the number of successes (in binomial parameter nomenclature; here, total reviews and 5-star reviews). We have also supplied the prior parameters for the probability of success as data variables alpha and beta. The model is the same binomial as we had before, except now the likelihood and prior are vectorized. In general, Stan is able to take a distribution like the binomial, which has an integer number of trials, an integer number of successes, and a scalar success probability, and accept vectors for all of these. What we have written above is identical to what we would get with a loop,

  for (k in 1:K) {
    successes[k] ~ binomial(trials[k], theta[k]);
  }

The vectorization of theta is different in that only theta is an array, whereas alpha and beta are scalars. The sampling statement for theta is equivalent to

  for (k in 1:K) {
    theta[k] ~ beta(alpha, beta);
  }

Because alpha and beta are scalars, they are not indexed. Rather, they are broadcast, meaning that the same alpha and beta are reused for each dimension of theta.

So now let’s fit this model and print the summary. We are setting \(\alpha = \beta = 2\), which is equivalent to setting them equal to 1 and adding 2 trials with 1 success to the data for each restaurant.

M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/ab-test.stan')
data = {'K': 3, 'trials': [276, 143, 87],
        'successes': [141, 84, 41],
        'alpha': 2, 'beta': 2}
sample = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)
                Mean     MCSE  StdDev       5%      50%      95%   N_Eff  N_Eff/s  R_hat
lp__        -360.000  0.02600   1.200  -360.00  -360.00  -360.00  2100.0  29000.0    1.0
theta[1]       0.510  0.00048   0.030     0.46     0.51     0.56  3813.0  51528.0    1.0
theta[2]       0.590  0.00062   0.040     0.52     0.59     0.65  4184.0  56547.0    1.0
theta[3]       0.470  0.00071   0.051     0.39     0.47     0.56  5185.0  70066.0    1.0
is_best[1]     0.065  0.00440   0.250     0.00     0.00     1.00  3074.0  41541.0    1.0
is_best[2]     0.900  0.00540   0.300     0.00     1.00     1.00  3035.0  41017.0    1.0
is_best[3]     0.034  0.00310   0.180     0.00     0.00     0.00  3393.0  45850.0    1.0

First, we make sure that our \(\widehat{R}\) values are less than 1.01 and that our \(N^{\textrm{eff}}\) values are all greater than 25 per chain (4 chains here with default settings). In fact, we can see that our sampling is nearly as efficient as if we had independent posterior draws. The probabilities of a 5-star review are our values for theta: 51% for Downtown Bakery, 59% for Taqueria Gramercy, and 47% for La Delicias. Moving on to the probability of being the best restaurant in terms of garnering 5-star reviews, Taqueria Gramercy has the highest probability of its next review being a 5-star one. But our estimated probability that Taqueria Gramercy has the highest probability of a 5-star review is far less than 100%. This is because binomial data is weak and we only have observations in the hundreds. (All three of these restaurants are good, but La Delicias is my favorite for both food and atmosphere.)
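
As a check (a sketch; theta_ab is our name for the extracted array), we can reproduce the is_best estimates directly in Python from the posterior draws of theta, using the same indicator averaging as before.

theta_ab = sample.stan_variable('theta')                # draws-by-K array
best = theta_ab == theta_ab.max(axis=1, keepdims=True)  # indicator of being best
print(f"Pr[is best] ≈ {best.mean(axis=0)}")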

8 Stan’s execution model

Stan programs consist of several blocks. Here is a summary of the blocks, when they are executed, and what they do. None of the blocks are required, but the blocks that do appear must be in the order listed here.

block                   executed            behavior
functions               as needed           user-defined function definitions
data                    once                read data to construct model
transformed data        once                define transformed data
parameters              once / log density  constrain parameters (w. Jacobian)
transformed parameters  once / log density  define transformed parameters
model                   once / log density  evaluate model log density
generated quantities    once / draw         define generated quantities

8.1 Data and transformed data

The data block contains only variable declarations. The variables declared in the data block are read once at data load time.

The transformed data block contains variable declarations and definitions. It defines functions of the data, such as standardized predictors, constants to use for priors, etc. The transformed data block may use pseudorandom number generation. The transformed data block is executed once, after the data is read, to define the transformed data variables.

All variable declarations at the block level must declare types and sizes (which may be data dependent). Local variables within blocks are declared without constraints.

Constraints in the data block are validated as the data is read in; constraints in the transformed data block are validated at the end of the block’s execution. Constraint violations in data or transformed data raise exceptions, which terminate execution.

Transformed data variables may be assigned within the transformed data block, but after the block has executed, they may not be reassigned.

8.2 Parameters and transformed parameters

The parameters block declares the variables over which the model is defined. It contains only variable declarations with sizes. When executed, it is supplied with concrete parameter values.

Constraints on parameters are used to define a transform of the constrained space to \(\mathbb{R}^D\). For example, a variable declared with a lower=0 constraint is log-transformed to an unconstrained variable, whereas a variable declared as a covariance matrix is Cholesky factored and then a log transform is applied to the diagonal elements to render an \(N \times N\) matrix as a vector in \(\mathbb{R}^{N(N+1)/2}\). It is critical that parameters are declared with all necessary constraints so that the model has support (i.e., non-zero density, finite log density) over the entire transformed space.

The transformed parameters block defines functions of parameters and data. This is where users can define their own parameter transforms. Any constraints declared on transformed parameters are validated at the end of the block’s execution. If the constraints are violated, an exception will be thrown, which typically causes the current proposal to be rejected.

Variables declared in the parameters block are like function argument declarations. The log density function defined by a Stan program takes the parameters as an argument. Thus the parameter values are always supplied externally to a Stan program.

After the transformed parameters block has executed, variables declared in it may not be reassigned.

The only difference between variables declared as local variables in the model block and those declared in the transformed parameters block is that the transformed parameters are printed and also available in the generated quantities block.

8.3 Model

The purpose of the model block is to define the log density. Once the data is loaded, a Stan program’s main purpose is to provide an unnormalized log density function over the unconstrained parameters. External algorithms, such as optimizers, samplers, or variational inference will supply values of the unconstrained parameters to evaluate.

The unnormalized log density value that is returned by model evaluation is held in an accumulator called target. Posterior densities are defined by multiplying factors corresponding to probability density or mass functions. Posterior log densities are defined by adding terms corresponding to unnormalized log density or mass functions.

The unnormalized log density accumulator target is initialized at zero and then incremented throughout the execution of a Stan program. As mentioned in the last section, the first thing this unnormalized log density function does is constrain the parameters and add the log change of variables adjustment to the target accumulator. This is all done automatically. This provides the constrained values of the parameters to the code that will execute next in the model block.

The target log density that will be returned may be incremented directly, as in the following statement.

target += -0.5 * x^2;

Although the built-in accumulator target may not be used directly as a variable, its current value is available through the function target(); printing it can be useful for debugging.

Sampling statements are just syntactic sugar for incrementing the target log density. For example, the sampling statement

x ~ normal(0, 1);

is equivalent to the target increment statement

target += normal_lupdf(x | 0, 1);

The _lupdf indicates that it is a log unnormalized probability density function.

The vertical bar is used to separate outcome variates from the parameters. The notation lpdf denotes a log probability density function, with lpmf for log probability mass functions. The variants lupdf and lupmf, as used above, are for their unnormalized counterparts, which may drop normalizing constants that do not depend on the parameters. Unless the normalizing constants are needed, for example in a mixture model component, it is more efficient to use the lupdf and lupmf forms either explicitly by incrementing the target accumulator or implicitly through sampling statements.

8.4 Generated quantities

The generated quantities block is evaluated once per draw rather than once per log density evaluation. With algorithms like Hamiltonian Monte Carlo, each draw may require a few, dozens, or even hundreds of log density evaluations. A further advantage of generated quantities is that they are evaluated with plain double-precision floating point values because they do not require differentiation, which is a factor of four or more faster than autodiff. The generated quantities block may also use pseudorandom number generation. Any constraints are validated at the end, but exceptions do not cause rejections, just potential warnings and potentially undefined (not-a-number) values.

Generated quantities do not contribute to the definition of the log density being sampled, but they are nevertheless part of the statistical model. The typical role of generated quantities is posterior predictive inference, which is conditionally independent of the observed data given the model parameters. Posterior predictive inferences are inferences (in the sense of “inductive inference”) for new data items based on simulated parameter values; see Appendix D.2: Posterior predictive inference for a precise definition.

8.5 User-defined functions

Users can define functions in the functions block. These can be applied anywhere in a Stan program, just like built-in functions. In addition to ordinary mathematical functions, users can define several types of special-purpose functions with Stan-specific behavior.

User-defined probability density and mass functions

Defining a function with the suffix _lpdf defines a continuous log probability density function whose first argument is a real- or complex-valued variate (these may be containers like matrices). Similarly, the suffix _lpmf marks a discrete log probability mass function, and Stan will validate that the first argument is declared as an integer.

It is not possible to code a user-defined _lupdf or _lupmf function because Stan does not provide the hooks to detect whether arguments are constants and hence whether normalizing terms may be dropped.

These functions are called using standard vertical bar notation to separate the variate and its parameters and they may also be used in sampling statements. For example, consider the following custom log probability density function.

functions {
  real foo_lpdf(real y, real theta) {
    return normal_lpdf(y | 0, exp(-theta));
  }
}

With this definition, the model block may use the function to increment the target log density,

  target += foo_lpdf(z | phi);

User-defined functions ending in _lpdf may also be used directly in sampling statements after removing the suffix.

  z ~ foo(phi);

A user-defined _lpdf function implicitly defines an equivalent _lupdf function so that user-defined _lpdf functions may be used with sampling notation. Recall that user-defined functions ending in _lupdf or _lupmf are not allowed.

User-defined random number generators

Only functions whose names end in the _rng suffix may call other _rng functions, and they may be used in Stan programs only in the transformed data and generated quantities blocks (i.e., the blocks that are not part of the model definition and hence do not need to be differentiated).

For example, a simple way to generate chi-squared random variates is to literally sum a sequence of squared normal variates (in a real program, one would use the built-in chi_square_rng function).

real simple_chi_sq_rng(int n) {
  real y = 0;
  for (i in 1:n) {
    y += normal_rng(0, 1)^2;
  }    
  return y;
}

Modifying the target

Functions that use the suffix _lp (for “log probability density or mass function”) can access and modify the target log density accumulator target either directly or through sampling statements. This restricts the use of functions ending in _lp to the transformed parameters and model blocks. For example, the following function implements what Stan does implicitly for a variable declared with a lower=0 constraint.

real pos_constrain_lp(real v) {
  target += v;    // change of variables adjustment
  return exp(v);  // change of variables
}

The change of variables adjustment is just the log absolute value of the derivative of the constraining transform (see the appendix on univariate changes of variables). In this case, the constraining transform is \(\exp:\mathbb{R} \rightarrow (0, \infty)\), so the log change of variables adjustment is \[ \log \left| \exp'(v) \right| = \log \left| \exp(v) \right| = \log \exp(v) = v, \] where \(\exp'\) is the derivative of the exponential function.
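
We can sanity check this adjustment numerically in Python (a sketch using a central finite difference): the log absolute derivative of \(\exp\) at \(v\) should come back as \(v\) itself.

v, h = 1.3, 1e-6
deriv = (np.exp(v + h) - np.exp(v - h)) / (2 * h)  # finite-difference derivative of exp at v
print(f"{np.log(abs(deriv)) = :.6f}")              # approximately v = 1.3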

8.6 Automatic differentiation

Stan uses automatic differentiation to compute gradients of the log density function. This is done by building up the complete expression graph of the log density defined by the Stan model. A concrete value for the unconstrained parameters is supplied; these are constrained, and the log absolute determinant of the Jacobian of the unconstraining transform is added to the expression graph. Then each operation involving parameters is added to the expression graph. The root is the final log density value, which is the value of target. The leaves are the unconstrained parameters. A reverse pass over this expression graph propagates the partial derivatives from the log density value down to the unconstrained parameters using the chain rule for derivatives (it applies in the reverse order of the expression construction, which is a topological sort of the directed graph). See Carpenter et al. (2015) for details of Stan’s C++ architecture for automatic differentiation; current details are only in the developer documentation and code repository.

9 Stan example: Regression and prediction

In this section, we will go over simple regression models in Stan and see how to generate predictions for new items based on fitted parameters.

9.1 Fisher’s iris data set

We will analyze Fisher’s classic iris data set, which provides sepal and petal length and width (in centimeters) as well as the species of iris. First we read it in from the file stan/iris-data.csv that is supplied with the distribution, then we plot petal width versus petal length, with species indicated by color.

df = pd.read_csv('../stan/iris-data.csv')
print(
  pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
                       color='species')) +
  pn.geom_point() +
  pn.labs(x = "petal width (cm)", y = "petal length (cm)")
)

The three species are of very different sizes, but there is a roughly linear relation between petal width and petal length across the different species.

The width values all have exactly one decimal place of accuracy, which means they were recorded to the nearest millimeter. There are models that deal with this kind of measurement quantization, but we will not consider them now and will instead proceed as if there were no rounding of our measurements.

9.2 Linear regression

A simple univariate linear regression models one measurement as a linear function of another measurement. In this example, we will model an iris flower’s petal length as a linear function of its petal width. Let \(x_n\) be the petal width for flower \(n\) and let \(y_n\) be the petal length. A linear regression says that we expect \(y_n\) to be a linear function of \(x_n\), or roughly, the expected value of \(y_n\) will be \(\alpha + \beta \cdot x_n\) for some intercept \(\alpha\) and slope \(\beta\).

The linear relation in a linear regression only holds in expectation. That is, we expect the value of \(y_n\) to be \(\alpha + \beta \cdot x_n\), but in real observations there will be error due to either measurement or a failure of the linear relationship. In the Iris data plot, you can see that there are multiple irises with exactly the same measured petal width but varying petal lengths.

We can introduce a variable \(\epsilon_n\) for the difference between a petal’s length and its expected length given the linear relationship, \[ y_n = \alpha + \beta \cdot x_n + \epsilon_n. \] It is traditional to assume the error is normally distributed with a zero mean and scale \(\sigma > 0\), \[ \epsilon_n \sim \textrm{normal}(0, \sigma). \] We can rearrange terms and write this in a fashion that will mirror how it’s coded in Stan, \[ y_n \sim \textrm{normal}(\alpha + \beta \cdot x_n, \sigma). \] In this form, the linear prediction is the expectation for the random variable \(Y_n\) and the standard deviation is the scale, \[ \mathbb{E}[Y_n] = \alpha + \beta \cdot X_n \qquad \mathrm{sd}[Y_n] = \sigma. \]
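
Before coding the model in Stan, we can get a rough classical point estimate of the intercept and slope with ordinary least squares as a plausibility check (a sketch using NumPy’s polyfit and the data frame loaded above).

slope, intercept = np.polyfit(df['petal_width'], df['petal_length'], deg=1)
print(f"least squares: petal_length ≈ {intercept:.2f} + {slope:.2f} * petal_width")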

Here is a simple Stan model to regress petal length on petal width; we replace \(x\) with petal_width and \(y\) with petal_length.

stan/iris-petals.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ normal(alpha + beta * petal_width, sigma);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ lognormal(0, 1);
}

The data block says there are \(N\) observations of petal width and length. The widths and lengths are declared to have type (column) vector, with a constraint lower=0 and size (number of rows) N. We are declaring these types as vectors because we are going to apply vector arithmetic to them.

The model parameters are coded in the parameter block. The lower bound for sigma is required here as the normal distribution is only defined for positive values of sigma. Lower bounds for continuous values in Stan are exclusive mathematically, but can devolve to inclusive due to the vagaries of floating point. For example, we unconstrain a positive value using logarithms, so the inverse constraining transform is the exponential function. Mathematically, \(\exp:\mathbb{R} \rightarrow (0, \infty)\) and cannot produce a 0 value. But we don’t live in math world on computers. With floating point arithmetic, it’s possible that \(u\) is so negative (less than about \(-750\)) that \(\exp(u)\) is smaller than the smallest number that can be represented in double-precision floating point arithmetic, and it underflows to zero. If a constraining transform underflows for the scale and then we use it in a normal distribution, Stan will raise an exception, the capture of which by the inference algorithm will generate an informative warning message and reject the current iteration.

Stan requires every parameter value that satisfies the declared constraints to be in the support of the density, which is the set of values for which the density is non-zero (equivalently, the log density is finite).

The model codes the regression following the math, but without the subscripts. What’s going on is that the sampling statement for petal_length is vectorized. The variables petal_width and petal_length are both vectors of size N, whereas alpha, beta, and sigma are all scalars. The expression alpha + beta * petal_width returns a vector, each entry of which is alpha plus beta times the corresponding petal width. The scalar-vector multiplication is defined in the usual way. The reuse of alpha for each petal width is known as broadcasting in programming languages.

Now we have a vector for the location parameter of a normal distribution, a scalar scale (sigma), and a vector outcome (petal_length). Stan’s distributions will broadcast a scalar like sigma to use for each element of a container. The end result is equivalent to the following, but much more compact and efficient.

for (n in 1:N) {
  petal_length[n] ~ normal(alpha + beta * petal_width[n], sigma);
}
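
As a quick NumPy analogue (with made-up parameter values), the same broadcasting of scalars across a vector looks like this:

import numpy as np

# a scalar combined with a vector is broadcast across every element
alpha, beta = 1.1, 2.2
petal_width = np.array([0.2, 1.3, 2.1])
mu = alpha + beta * petal_width   # vector of per-flower expected lengths
print(mu)                         # [1.54 3.96 5.72]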

The rest of the model is priors for the regression coefficients and error scale. The coefficients are unconstrained, but the error scale is constrained to be positive, so we use a lognormal distribution. We could have made other choices of prior here, such as a positive half-normal or a positive half Student-t distribution. Unlike a lognormal distribution, a half-normal distribution centered at zero can accommodate scale values near zero; Student-t distributions have wider tails, which are useful if there is more prior uncertainty as to the exact scale of the model.

We can then compile and fit the model and display the resulting fit for alpha and beta as a scatterplot. Note that we are providing data for species even though it is not used; this makes it easy for us to reuse our data frame for later models that do take species into account. Stan only reads the data variables that it needs and only throws an error if a required value is missing.

def iris_data_frame():
    return pd.read_csv('../stan/iris-data.csv')

def iris_data():
    df = iris_data_frame()
    N = df.shape[0]
    petal_width = df['petal_width']
    petal_length = df['petal_length']
    species = 1 + pd.Series(df['species']).astype('category').cat.codes
    num_species = 3
    data = {'N': N,
            'K': num_species,
            'species': species,
            'petal_width': petal_width,
            'petal_length': petal_length,}
    return data

M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/iris-petals.stan')
sample = model.sample(data = iris_data(), seed = 123,
                      iter_warmup = M, iter_sampling = M,       
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)
          Mean     MCSE  StdDev     5%    50%    95%   N_Eff  N_Eff/s  R_hat
lp__     34.00  0.03400   1.200  31.00  34.00  35.00  1200.0   6800.0    1.0
alpha     1.10  0.00180   0.074   0.97   1.10   1.20  1700.0   9200.0    1.0
beta      2.20  0.00130   0.053   2.10   2.20   2.30  1700.0   9400.0    1.0
sigma     0.48  0.00071   0.028   0.44   0.48   0.53  1500.0   8300.0    1.0

This summary doesn’t give us a good feeling for the uncertainty in the linear relationship. To do that, we will plot multiple draws from the posterior along with the data.

alpha_draws = sample.stan_variable('alpha')
beta_draws = sample.stan_variable('beta')
plot =  pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
                             color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (cm)", y = "petal length (cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
    plot = plot + pn.geom_abline(intercept = a, slope = b,
                                 alpha = 0.5, size = 0.2)
print(
  plot
)

Because the single regression line has to fit all three groups of data, it doesn’t do such a great job of fitting any of them: most iris versicolor instances and the wider instances among iris setosa have underestimated petal lengths.

We didn’t include the error scale, but it’s roughly 0.5. That means it won’t be uncommon to get errors greater than 1 or less than -1. For iris setosa, this could easily result in predictions with negative lengths!

9.3 Built-in generalized linear model functions

Stan supplies built-ins for the most common generalized linear model (GLM) functions. Generalized linear models involve a linear predictor like a linear regression, but then apply a non-linear link function to map between constrained expectations and the unconstrained linear predictor. For example, a logistic regression model uses the log odds (\(\textrm{logit}\)) link, whose inverse transforms values in \(\mathbb{R}\) to \((0, 1)\). This is then coupled with an appropriate likelihood, generalizing from the normal of linear regression to a Bernoulli distribution for logistic regression (or a Poisson distribution for a count regression, after mapping \(\mathbb{R}\) to \((0, \infty)\) for the expected count value).

The GLM functions in Stan allow linear and non-linear regressions to be coded easily and efficiently. For example, our linear regression could be coded with the normal_id_glm function (indicating a normal distribution and identity link), whereas a logistic regression has a GLM function bernoulli_logit_glm (indicating a Bernoulli distribution with logit link). The linear regression for the Iris data is most compactly and efficiently written as follows.

  petal_length ~ normal_id_glm(petal_width, alpha, beta, sigma);

To accommodate the signature (argument order and types) of normal_id_glm, the data and parameter type declarations are lifted to container types.

  matrix<lower=0>[N, 1] petal_width;
  vector[N] petal_length;
  real alpha;
  vector[1] beta;
  real<lower=0> sigma;

The generalization of a vector of predictors to a matrix allows multivariate regressions with multiple predictors per item (e.g., using population density, speed limit, traffic density, and time of day to predict pedestrian traffic accidents).

9.4 Robust regression

Although we won’t consider this model in detail, Stan’s plug-and-play design makes it very simple to convert our linear regression into a robust regression by swapping out the normal error model for a Student-t error model,

  petal_length ~ student_t(dof, alpha + beta * petal_width, sigma);

We have just used dof for the degrees of freedom variable. Setting it to a value like 4 provides wider tails than the normal, and setting it to 1 produces the very wide-tailed Cauchy distribution, which does not even have a finite mean or variance.
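
To see how much heavier the tails are, here is a quick NumPy simulation (not part of the original analysis) comparing the fraction of standard normal and Student-t (4 degrees of freedom) draws that land more than 3 scale units from the center:

import numpy as np

rng = np.random.default_rng(123)
normal_draws = rng.normal(size=100_000)
t4_draws = rng.standard_t(df=4, size=100_000)
print(np.mean(np.abs(normal_draws) > 3))  # roughly 0.003
print(np.mean(np.abs(t4_draws) > 3))      # noticeably larger, roughly 0.04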

9.5 Posterior predictive inference

Now let’s say we have a new observation where we know the petal width and want to predict its length. Mathematically, to make a prediction for a new item, we use posterior predictive inference. Posterior predictive inference uses our posterior distribution over possible parameter values to predict new possible data values given new predictors.

First, let’s consider evaluating the log density of a new petal’s length (\(\tilde{y}\)) given its observed width (\(\tilde{x}\)), having observed our original data \(x\) and \(y\). We use the notation \(\tilde{y}\) and \(\tilde{x}\) to distinguish the posterior predictions and their predictors from the “training” data.

In Bayesian inference, the posterior predictive distribution is \(p(\tilde{y} \mid \tilde{x}, x, y),\) which is the posterior distribution over new data items \(\tilde{y}\) given “training” data \(x, y\) and the covariates \(\tilde{x}\) for the new data items. The full definition is \[\begin{align} p(\tilde{y} \mid \tilde{x}, x, y) &= \mathbb{E}[p(\tilde{y} \mid \tilde{x}, \Theta) \mid x, y] \\[6pt] &= \int_{\mathbb{R}^N} p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \, p\!\left(\tilde{y} \mid \tilde{x}, \theta^{(m)}\right), \end{align}\] where \(\theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid x, y)\) are draws from the posterior. The parameters are marginalized out. It’s perhaps easiest to understand by considering the Monte Carlo approximation, which just averages the sampling density over draws of the parameters from the posterior.

Sampling uncertainty and estimation uncertainty

It can help to break the posterior predictive integral down into the two components of uncertainty, sampling uncertainty due to our sampling distribution and posterior uncertainty in the values of our parameters, \[ \int_{\Theta} \underbrace{p(\tilde{y} \mid \tilde{x}, \theta)}_{\textrm{sampling uncertainty}} \cdot \underbrace{p(\theta \mid x, y)}_{\textrm{estimation uncertainty}} \, \textrm{d}\theta \] In cases where \(\tilde{y}\) and \(\tilde{x}\) are known, we can use this expression to evaluate \(p(\tilde{y} \mid \tilde{x}, x, y).\) We do this, for example, in cross-validation, a form of model evaluation. In cases where \(\tilde{x}\) is known, but not \(\tilde{y},\) we can sample \(\tilde{y} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)

In this section, we have taken \(\theta = \begin{bmatrix}\alpha & \beta & \sigma\end{bmatrix}\) and the sampling distribution is \[ p(\tilde{y} \mid \tilde{x}, \alpha, \beta, \sigma) = \textrm{normal}(\tilde{y} \mid \alpha + \beta \cdot \tilde{x}, \sigma). \] Even if we know the parameter values \(\alpha\), \(\beta\), and \(\sigma\) exactly, we can only predict \(\tilde{y}\) to within \(\epsilon\), where \(\epsilon \sim \textrm{normal}(0, \sigma)\). If we plug in a point estimate \(\widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}\) for our parameters, we might get approximate inference that takes into account sampling uncertainty, but not estimation uncertainty, e.g., \[ p(\tilde{y} \mid \tilde{x}, x, y) \approx p(\tilde{y} \mid \tilde{x}, \widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}). \]

So far, this only gives us a way to evaluate the log density of a resulting outcome \(\tilde{y}\) given a predictor \(\tilde{x}\). If we want to simulate possible \(\tilde{y}\), we have to make sure to add sampling uncertainty and draw \[ \tilde{y}^{(m)} \sim \textrm{normal}\!\left(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}, \ \sigma^{(m)}\right), \]

where \(\alpha^{(m)}, \beta^{(m)}, \sigma^{(m)} \sim p(\alpha, \beta, \sigma \mid x, y)\) are posterior draws. In general, if we then want to estimate \(\tilde{y}\), we can take posterior means of these values. In the case here, our sampling distribution is symmetric, so the expectation of the random draws and the expectation of the linear predictions are identical.
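
Here is a small NumPy sketch of this recipe. The posterior draws below are stand-ins generated for illustration only; in practice they would come from the fit, e.g., via sample.stan_variable('alpha') as above.

import numpy as np

rng = np.random.default_rng(123)
M = 1000
# stand-in posterior draws, roughly matching the fitted summary above
alpha_draws = rng.normal(1.1, 0.07, size=M)
beta_draws = rng.normal(2.2, 0.05, size=M)
sigma_draws = rng.lognormal(np.log(0.48), 0.05, size=M)

x_tilde = 1.75
# one posterior predictive draw of y-tilde per posterior draw of parameters
y_tilde_draws = rng.normal(alpha_draws + beta_draws * x_tilde, sigma_draws)
print(np.mean(y_tilde_draws), np.std(y_tilde_draws))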

If our only goal is to estimate an expectation, it is better to average expectations than to average random draws that happen to have those expectations. The sense in which it is better is that it will lead to lower variance in the estimator and hence lower expected square error (recall that expected square error has contributions from both bias and variance).

Example: Rao-Blackwell. Suppose we have two random variables, \[ Y \sim \textrm{normal}(\pi, 1) \qquad \quad Z \sim \textrm{normal}(Y, 2). \] From the definition of the normal distribution, \(\mathbb{E}[Y] = \pi\) and with a little more work, we can establish that \(\mathbb{E}[Z] = \pi.\) Although \(\textrm{sd}[Y] = 1,\) it is possible to work out analytically that \(\textrm{sd}[Z] = \sqrt{1^2 + 2^2} > 1,\) because the variance of a sequential normal like this is the sum of their variances.

Because they both have the right expected value, we can use either averages of draws of \(Y\) or averages of draws of \(Z\) to estimate \(\pi.\) The Rao-Blackwell theorem tells us that expected error using \(Z\) will be higher than using \(Y.\) Rather than going through the theory, we will provide a simulated example with code.

Here is simulation code and histograms of the result of 10,000 simulations of 100 draws each for the estimators based on simulating from \(Y\) and \(Z\), with a vertical line at \(\pi,\) the value being estimated.

N = 100
M = 10_000
y_hat = np.zeros(M)
z_hat = np.zeros(M)
for m in range(M):
    y = np.random.normal(np.pi, 1, size = N)
    z = np.random.normal(y, 2, size = N)
    y_hat[m] = np.mean(y)
    z_hat[m] = np.mean(z)
df = pd.DataFrame({
    'estimate': np.concatenate([y_hat, z_hat]),
    'estimator': np.concatenate([['mean(Y)'] * len(y_hat), ['mean(Z)'] * len(z_hat)])
})
print(pn.ggplot(df, pn.aes(x = 'estimate'))
     + pn.geom_histogram(color = 'white', binwidth=0.05)
     + pn.facet_wrap('~ estimator')
     + pn.geom_vline(xintercept = np.pi, color = "blue")
     + pn.scale_x_continuous(limits=[2, 4.5])
     + pn.theme(aspect_ratio = 1))

The plots show that using the mean of the \(Y\) simulations is more accurate than using the mean of the \(Z\) simulations. How much accuracy is gained depends on the variance of \(Z.\)

Standalone generated quantities

Let’s put all this together into a Stan program. First, let’s make posterior predictions for some new \(\tilde{y}\). We can do this using Stan’s standalone generated quantities feature that lets us run the generated quantities block of a new model given draws from another model. We could’ve also just included the generated quantities block in the original program.

stan/iris-posterior-predictive-sim.stan
data {
  int<lower=0> N_tilde;
  vector<lower=0>[N_tilde] petal_width_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
generated quantities {
  vector<lower=0>[N_tilde] E_petal_length_tilde
    = alpha + beta * petal_width_tilde;
  vector<lower=0>[N_tilde] petal_length_tilde
    = to_vector(normal_rng(E_petal_length_tilde, sigma));
}

This program declares two new data variables, N_tilde for the number of predicted items, and petal_width_tilde, a vector of petal widths of size N_tilde.

It is important that the program used for standalone generated quantities use exactly the same parameters as the original program. After our original fit, we read that sample back in for the parameters, then run generated quantities (and the transformed parameter block if there is one).

The generated quantities block first calculates the expected petal length given the petal width and assigns it to the variable E_petal_length_tilde. This code is vectorized so that it calculates all of the input petal widths at once. The second variable is then set by sampling according to the sampling distribution using the normal_rng function. This function is also vectorized, but it returns an array, so we convert it to a vector just to keep all the types the same.

data = {'N_tilde': 3,
        'petal_width_tilde': [0.4, 1.75, 3.8]}
model = csp.CmdStanModel(
     stan_file = '../stan/iris-posterior-predictive-sim.stan')
pps_sample = model.generate_quantities(data = data, seed = 123,
                                       previous_fit = sample,
                                       show_console = False)
print('Estimated petal lengths given petal widths')
for i in range(3):
  length_draws = \
      pps_sample.stan_variable('petal_length_tilde')[0:100, i]
  E_length_draws = \
      pps_sample.stan_variable('E_petal_length_tilde')[0:100, i]
  print(f"petal_width_tilde[{i}] = {data['petal_width_tilde'][i]}")
  print(f"  mean(E_petal_length_tilde[{i}]) = {np.mean(E_length_draws):.2f}")
  print(f"  sd(E_petal_length_tilde[{i}]) = {np.std(E_length_draws):.2f}")
  print(f"  mean(petal_length_tilde[{i}]) = {np.mean(length_draws):.2f}")
  print(f"  sd(petal_length[{i}]) = {np.std(length_draws):.2f}\n")
Estimated petal lengths given petal widths
petal_width_tilde[0] = 0.4
  mean(E_petal_length_tilde[0]) = 1.98
  sd(E_petal_length_tilde[0]) = 0.07
  mean(petal_length_tilde[0]) = 1.88
  sd(petal_length_tilde[0]) = 0.45

petal_width_tilde[1] = 1.75
  mean(E_petal_length_tilde[1]) = 4.98
  sd(E_petal_length_tilde[1]) = 0.06
  mean(petal_length_tilde[1]) = 4.96
  sd(petal_length_tilde[1]) = 0.50

petal_width_tilde[2] = 3.8
  mean(E_petal_length_tilde[2]) = 9.55
  sd(E_petal_length_tilde[2]) = 0.17
  mean(petal_length_tilde[2]) = 9.52
  sd(petal_length_tilde[2]) = 0.48

For each of our input petal widths \([0.4 \quad 1.75 \quad 3.8]\), we see the posterior mean prediction for petal length and its standard deviation calculated two ways. First, we take the posterior draws for the expected petal length, E_petal_length_tilde, which is just the linear prediction of petal length. Its uncertainty comes only from estimation uncertainty in \(\alpha\) and \(\beta\). Second, we take the posterior draws for petal length, petal_length_tilde, which include an additional normal error term with scale \(\sigma\); these draws include both sources of posterior uncertainty, estimation uncertainty and uncertainty from the sampling distribution. The means are roughly the same either way, but the standard deviation is much higher in the case of sampling. We generally prefer estimators with lower standard deviation as they lead to lower standard error with Monte Carlo methods (as we illustrated with the Rao-Blackwell example above).

9.6 Lognormal and transformed data

In this section, we will convert our iris model to the log scale using the transformed data block to convert the petal width to a log scale and use a lognormal regression on the width to predict the length.

The lognormal distribution is a simple transform of a normal distribution with support over positive values. If \[ U \sim \textrm{lognormal}(\mu, \sigma), \] then \[ \log U \sim \textrm{normal}(\mu, \sigma). \] This will allow us to transform our Stan program. We first translate the petal width to the log scale in the transformed data block, then model length as a lognormal regression on log width in the model block.
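
As a quick NumPy sanity check of this relationship (using arbitrary values for \(\mu\) and \(\sigma\)):

import numpy as np

# the log of lognormal(mu, sigma) draws should be normal(mu, sigma)
rng = np.random.default_rng(123)
u = rng.lognormal(mean=1.3, sigma=0.16, size=100_000)
print(np.mean(np.log(u)), np.std(np.log(u)))  # approximately 1.3 and 0.16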

stan/iris-petals-log.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
}
transformed data {
  vector[N] log_petal_width = log(petal_width);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ lognormal(alpha + beta * log_petal_width, sigma);
  alpha ~ normal(0, 3);
  beta ~ normal(0, 3);
  sigma ~ lognormal(0, 0.6);
}

We have introduced a new block for transformed data, in which we have defined a vector of log petal widths to use as a predictor. The petal lengths are already positive and we do not modify those. The effect is that we are regressing log petal length on log petal width and the sampling statement could have been replaced with

  log(petal_length) ~ normal(alpha + beta * log_petal_width, sigma);

The result would be the same up to a constant normalizing term involving petal length (which is constant) in the lognormal density.

Let’s run the model and summarize the results.

M = 100 if DRAFT else 1000
log_model = csp.CmdStanModel(stan_file =
                               '../stan/iris-petals-log.stan')
log_sample = log_model.sample(data = iris_data(), seed = 123,
                              iter_warmup = M, iter_sampling = M,      
                              show_progress = False,
                  show_console = False)
log_sample.summary(sig_figs = 2)
          Mean     MCSE  StdDev     5%    50%    95%   N_Eff  N_Eff/s  R_hat
lp__     56.00  0.02800  1.3000  53.00  56.00  57.00  2000.0  15000.0    1.0
alpha     1.30  0.00021  0.0140   1.30   1.30   1.30  4100.0  30000.0    1.0
beta      0.57  0.00022  0.0130   0.55   0.57   0.59  3800.0  27000.0    1.0
sigma     0.16  0.00017  0.0098   0.15   0.16   0.18  3300.0  24000.0    1.0

The estimated values are \(\widehat{\alpha} = 1.3\), \(\widehat{\beta} = 0.57\), and \(\widehat{\sigma} = 0.16\). One advantage of the log scale is that the regression makes more sense because \(\exp(\alpha + \beta \cdot \log w) = \exp(\alpha) \cdot w^{\beta}\). By taking exponents, the intercept \(\alpha\) becomes a multiplicative factor \(\exp(\alpha)\) and the slope \(\beta\) becomes an exponent. The error also moves to the multiplicative scale, which makes it relative. Multiplicative error makes sense here as we wouldn’t expect the same error scale for petals of width 0.5cm as for petals of width 5cm. The final relationship derived through this model is as follows. \[\begin{align} \textrm{length} &= \exp(\alpha + \beta \cdot \log(\textrm{width}) \ \pm \ 2 \cdot \sigma) \\[6pt] &= 3.67 \cdot \textrm{width}^{0.57} \cdot 1.38^{\pm 1}. \end{align}\] The plus or minus 1 on the exponent turns into either multiplication or division by 1.38.
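
The constants in the final relationship come from exponentiating the estimates, which is easy to verify:

import numpy as np

print(np.exp(1.3))       # about 3.67, the multiplicative intercept
print(np.exp(2 * 0.16))  # about 1.38, the two-sigma multiplicative error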

The posterior intervals are tighter for the lognormal model, which generally indicates a better fit to data. Let’s see what the data and twenty posterior draws of the regression look like on the log scale.

df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
alpha_draws = log_sample.stan_variable('alpha')
beta_draws = log_sample.stan_variable('beta')
plot =  pn.ggplot(df, pn.aes(x='log_petal_width',
                             y='log_petal_length',
                 color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (log cm)",
                       y = "petal length (log cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
    plot = plot + pn.geom_abline(intercept = a, slope = b,
                                 alpha = 0.5, size = 0.2)
print(plot)

Now the iris versicolor and iris virginica look good, but the smaller iris setosa is still poorly characterized. We’ll fix this problem in the next section.

9.7 Multi-indexing: varying slopes and varying intercepts

We can see from the previous section that a single regression line doesn’t fit all three species of iris well. While the lognormal model is better than the linear model, it still fails to capture the characteristics of the smaller iris setosa.

On the log scale, it does look like it will be possible to capture iris setosa’s scale, as long as we can build separate regressions for each species. The varying slopes and varying intercepts are sometimes called random effects, in contrast to the previous models’ fixed effects, which do not vary by data grouping.

Mathematically, we now have three intercepts and three slopes, one pair for each species of iris. We will represent these as 3-vectors, \(\alpha, \beta \in \mathbb{R}^3\). Given our \(N\) data items, we will introduce an indexing data item \(\textrm{species} \in \{ 1, 2, 3 \}^N\), where \(\textrm{species}_n\) is the species of the \(n\)-th data item. On the log scale, our regression is \[ \textrm{length}_n \sim \textrm{lognormal}(\alpha_{\textrm{species}_n} + \beta_{\textrm{species}_n} \cdot \log \textrm{width}_n, \sigma). \] The value \(\alpha_{\textrm{species}_n}\) is the intercept for \(\textrm{species}_n \in \{1, 2, 3\}\). We have kept the same error term \(\sigma\), though we could have also split that out on a per-species basis like the slope and intercept.

The Stan code follows the statistical notation.

stan/iris-petals-varying.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
  int<lower=0> K;
  array[N] int<lower=1, upper=K> species;
}
transformed data {
  vector[N] log_petal_width = log(petal_width);
}
parameters {
  vector[K] alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ lognormal(alpha[species] + beta[species] .* log_petal_width, sigma);
  alpha ~ normal(0, 3);
  beta ~ normal(0, 3);
  sigma ~ lognormal(0, 0.6);
}

Here, we take \(K\) to be the number of species and the species indicator thus takes on values between \(1\) and \(K\) (inclusive). Where before we had alpha and beta for the intercept and slope, we now have alpha[species] and beta[species]. These pick out vectors of size \(N\) (the number of data items), as explained below. Further, the product of the slope and width is now an elementwise product (.*), which we also consider below. The sampling statement above is equivalent to

for (n in 1:N) {
  petal_length[n]
    ~ lognormal(alpha[species[n]] + beta[species[n]] * log_petal_width[n],
                sigma);
}

Elementwise products in Stan

The elementwise product (.*) uses MATLAB syntax and is defined so that

(a .* b)[n] == a[n] * b[n]  // true

Elementwise division (./) works the same way and we don’t need elementwise addition or subtraction because those operations are already defined as regular vector addition and subtraction.

Multi-indexing in Stan

This model uses a technique we call multi-indexing, which works similarly to the way indexing works in NumPy. Suppose we have a vector of size J and an array of indexes of size N.

array[N] int<lower=1, upper=J> idxs;
vector[J] foo;

We can use the multiple indexes in idxs to index foo as foo[idxs]. The result is a size N vector (the type is taken from foo and the size from idxs), with values given by indexing into idxs. Indexing is defined as follows.

rows(foo[idxs]) == N          // true
foo[idxs][n] == foo[idxs[n]]  // true

This tells us the number of elements (rows) of foo[idxs] is N and that indexing is done by first indexing into idxs and then using the result.

This works the same way for range indexing, such as foo[1:3], and for indexing with array literals like {7, 9, 3}, e.g.,

foo[3:5] == {foo[3], foo[4], foo[5]}         // true
foo[{7, 9, 3}] == {foo[7], foo[9], foo[3]}   // true
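
Since the behavior mirrors NumPy’s integer-array indexing, a small Python example may help (keeping in mind that NumPy indexes from 0 whereas Stan indexes from 1):

import numpy as np

foo = np.array([1.0, 2.5, 3.5, 7.0, 9.5])
idxs = np.array([2, 4, 0, 0])        # integer array of indexes (0-based here)
print(foo[idxs])                     # [3.5 9.5 1.  1. ] -- size taken from idxs
print(foo[idxs][1] == foo[idxs[1]])  # True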

Fitting our varying effects model

First, we add a species column to our data frame to represent the species of the iris as an integer 1, 2, or 3. Then we will just read in the data and fit and summarize the results.

M = 100 if DRAFT else 1000
vary_model = csp.CmdStanModel(
                  stan_file = '../stan/iris-petals-varying.stan')
vary_sample = vary_model.sample(data = iris_data(), seed = 123,
                                iter_warmup = M, iter_sampling = M,     
                                show_progress = False,
                show_console = False)
vary_sample.summary(sig_figs = 2)
             Mean     MCSE  StdDev       5%      50%     95%    N_Eff  N_Eff/s  R_hat
lp__      130.000  0.04900   1.900  130.000  130.000  130.00  1500.00  1900.00    1.0
alpha[1]    0.490  0.00110   0.051    0.410    0.490    0.57  2335.00  3044.00    1.0
alpha[2]    1.300  0.00056   0.028    1.200    1.300    1.30  2551.00  3326.00    1.0
alpha[3]    1.500  0.00140   0.072    1.400    1.500    1.70  2572.00  3353.00    1.0
beta[1]     0.076  0.00069   0.033    0.024    0.076    0.13  2299.00  2997.00    1.0
beta[2]     0.590  0.00180   0.093    0.440    0.590    0.75  2549.00  3323.00    1.0
beta[3]     0.230  0.00200   0.100    0.064    0.230    0.40  2588.00  3374.00    1.0
sigma       0.100  0.00000   0.010    0.090    0.100    0.11  3305.42  4309.54    1.0

There are a few things worth noting in this summary. First, recall that the parameter estimates in the fixed effects model were \(\widehat{\alpha} = 1.3\), \(\widehat{\beta} = 0.57\), and \(\widehat{\sigma} = 0.16\). In the varying effects model, we see that the estimates for \(\alpha_k\) and \(\beta_k\) vary considerably by \(k\). For example, as we see in the data, there is very little effect of width on length for iris setosa, with \(\widehat{\beta}_1 = 0.076\), whereas there is a large effect for iris versicolor of \(\widehat{\beta}_2 = 0.59\). Furthermore, our value for \(\sigma\) is lower, which indicates lower residual error in our predictions. We will leave evaluating varying error scales to the reader.

We now have three different regression lines, which we plot on the log scale with color matching the data.

df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
plot =  pn.ggplot(df, pn.aes(x='log_petal_width',
                             y='log_petal_length',
                 color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (log cm)",
                       y = "petal length (log cm)")
plot = plot + pn.geom_abline(intercept = .49, slope = 0.076, color='red')
plot = plot + pn.geom_abline(intercept = 1.30, slope = 0.59, color='green')
plot = plot + pn.geom_abline(intercept = 1.50, slope = 0.23, color='blue')
print(
  plot
)

We now have a good fit for each of the groups and, as we might expect, a lower error scale \(\sigma\) than for the model with a single regression.

10 Containers: arrays, vectors, and matrices

Stan is strongly typed, and each of the types has a distinct purpose. Constraints act as error checks in the data, transformed data, transformed parameter, and generated quantities blocks. In the parameter block, they are used to transform from unconstrained to constrained values (with an implicit change of variables adjustment). Stan Development Team (2023b) provides complete details on the full set of transforms, inverse transforms, and their log absolute Jacobian determinants.

10.1 Integer and real primitives

The two primitive types are int and real. In the compiled C++ code, these are 32-bit signed integers and 64-bit floating point values, respectively.

The third scalar type is complex, where a complex value consists of a real component and an imaginary component, both of which are represented as 64-bit floating point values.

Constrained primitive types

Integer and real types may be constrained with lower bounds, upper bounds, or both. For example, real<lower=0> is the type for a concentration, real<lower=0, upper=1> is the type for a probability, real<lower=-1, upper=1> is the type for a correlation, int<lower=0> is the type for a count, and int<lower=1, upper=5> is the type for an ordinal response to a survey.

There is a second kind of modifier, which is not technically a constraint, but is a transform. If we declare

vector<offset=mu, multiplier=sigma>[K] alpha;

then the unconstrained representation is multiplied by sigma and then offset by mu to get the constrained value alpha. This is primarily used for non-centered parameterizations of hierarchical models, which are beyond this tutorial.
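
In NumPy terms, the transform is just an affine map from the unconstrained representation to the declared variable; here is a sketch with made-up values:

import numpy as np

# sketch of Stan's offset/multiplier transform: the value sampled on the
# unconstrained scale (alpha_raw) maps to the declared scale affinely
mu, sigma = 5.0, 2.0
alpha_raw = np.array([-1.0, 0.0, 1.0])   # unconstrained representation
alpha = mu + sigma * alpha_raw           # what the Stan program sees
print(alpha)                             # [3. 5. 7.]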

Constraints on variables in the data block are checked as data is read in. Transformed data, transformed parameter, and generated quantities variables are validated at the end of the block and may be in intermediate states prior to the end of the block that do not satisfy the constraints. Parameters are handled differently and transformed from unconstrained representations to satisfy constraints.

10.2 Arrays

For any type T, we can write array[N] T for the type of an array of size N containing elements of type T. We can also write array[M, N] T for a 2D array and so on for higher dimensionality. For example, array[3, 2] int is a two-dimensional array of integers of shape 3 by 2.

10.3 Matrices

The type matrix[M, N] is for an \(M \times N\) matrix, which contains real values (there is no integer valued matrix type in Stan). The type complex_matrix[M, N] is for an \(M \times N\) matrix with complex values.

Constrained matrix types

There are four special matrix types. The type cov_matrix is for positive definite matrices, whereas cholesky_factor_cov is for Cholesky factors of positive definite matrices. The type corr_matrix is for positive definite matrices with unit diagonals and cholesky_factor_corr is for Cholesky factors of correlation matrices. The Cholesky factors are lower triangular and should be preferred to the full matrices where possible.

10.4 Vectors and row vectors

The type vector[M] is for column vectors with M rows, whereas row_vector[N] is for row vectors with N columns. A column vector is like an \(M \times 1\) matrix and a row vector is like a \(1 \times N\) matrix. There are also complex_vector and complex_row_vector types which work the same way.

The reason we do not just represent vectors as matrices is that we are able to reduce types in arithmetic. For example, the product of a row vector and a column vector is a scalar rather than a \(1 \times 1\) matrix. Similarly, the product of a matrix and a (column) vector is a vector.
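
The analogous reductions in NumPy (used here only as an illustration) look like this:

import numpy as np

rv = np.array([1.0, 2.0, 3.0])   # think: row_vector[3]
v = np.array([4.0, 5.0, 6.0])    # think: vector[3]
print(rv @ v)                    # scalar 32.0, not a 1 x 1 matrix
A = np.eye(3)
print(A @ v)                     # a vector, not a 3 x 1 matrix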

Constrained vector types

There is a vector type ordered[M] for vectors in ascending order, and a similar type positive_ordered[M] for vectors of positive values in ascending order.

The vector type unit_vector[M] is for vectors of unit Euclidean length (i.e., the sum of squared values is one). There is also a simplex[M] type for simplexes (non-negative values that sum to one).

Assignment and function calls

Calling a function works like assigning the variables in the argument list to the variables in the function declaration. Function argument variables may not be reassigned. In the underlying implementation in C++, values are passed by constant reference so there is no allocation and copying overhead in function calls.

In general, Stan is what programming language researchers call covariant, meaning that if a type T is assignable to a type U, then a type array[N] T is assignable to array[N] U. Similarly, integer-valued primitives and containers may be assigned to their real or complex counterparts, and real-valued primitives and containers may be assigned to their complex counterparts (but not the other way around). Variables of a constrained type may be assigned to variables of an unconstrained type.

11 Stan Example: Discrete parameters and mixture models

Although Stan only allows parameters which are real or complex valued, it is able to work with models involving discrete parameters by marginalizing them out in the likelihood and working in expectation during inference.

There are two reasons Stan doesn’t allow discrete parameters. First, because Stan isn’t limited to directed graphical models, it’s technically impossible to pull out efficient conditional distributions for the discrete parameters in general (though it can be done in some limited cases). Second, common choices for discrete sampling such as Gibbs sampling or Metropolis tend not to perform very well in complicated models with dependencies among the discrete parameters. If too many variables are changed in Metropolis, acceptance rates drop too low. If too few variables are changed, as in standard variable-at-a-time Gibbs samplers, discrete samplers are prone to lock up where none of the variables can easily be changed because of the covariance among them. A classical example of this is the Ising model for spin glasses in physics, which has well known phase transitions that are hard to sample. A more statistical example is explicit sampling of responsibility parameters in unsupervised multivariate normal mixtures (aka \(K\)-means clustering) or variable selection in regressions with non-zero probability mass assigned to zero coefficient values (e.g., with spike-and-slab priors that mix a point mass at zero with a distribution with support over \(\mathbb{R}\)).

11.1 Normal mixture model

We’ll consider a simple two-component normal mixture model parameterized with a discrete responsibility parameter. This model assumes that each item \(y_n\) is drawn from one of two possible normal distributions, \(\textrm{normal}(\mu_0, \sigma_0)\) or \(\textrm{normal}(\mu_1, \sigma_1)\), with a probability \(\lambda \in [0, 1]\) of being drawn from \(\textrm{normal}(\mu_1, \sigma_1)\). Let’s assume we have \(N\) observations \(y_n \in \mathbb{R}\). We will use \(z_n \in \{ 0, 1 \}\) as the discrete parameter representing the distribution from which \(y_n\) arose. This gives us the following model \[\begin{align} z_n &\sim \textrm{bernoulli}(\lambda) \\[6pt] y_n &\sim \textrm{normal}(\mu_{z_n}, \sigma_{z_n}) \end{align}\] In frequentist settings, the parameters \(z_n\) are sometimes called missing data to finesse a philosophical aversion to probability distributions over parameters.

11.2 Marginalization to the rescue

The immediate problem is that we cannot code this model in Stan by just following the math because we do not have discrete parameters. But what we can do is marginalize out the discrete parameters. We know from the law of total probability that if \(B\) is a discrete random variable, then we can derive the marginal probability \(p(a)\) from the joint probability function \(p(a, b)\) by summing over the possible values \(b\) of \(B\), \[ p(a) = \sum_{b} p(a, b). \]

We can express our mixture model’s sampling distribution over both \(y\) and \(z\), \[ p(y, z \mid \lambda, \mu, \sigma) = \prod_{n = 1}^N p(y_n, z_n \mid \lambda, \mu, \sigma). \] This is called the complete data likelihood in frequentist settings.

Then we can evaluate the likelihood elementwise, \[ p(y_n, z_n \mid \lambda, \mu, \sigma) = \textrm{bernoulli}(z_n \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_{z_n}, \sigma_{z_n}). \] Now we can marginalize out the \(z_n\) by considering values 0 and 1, \[\begin{align} p(y_n \mid \lambda, \mu, \sigma) &= p(y_n, z_n = 1 \mid \lambda, \mu, \sigma) + p(y_n, z_n = 0 \mid \lambda, \mu, \sigma) \\[6pt] &= \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) \\[2pt] & \qquad + \ \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_0, \sigma_0). \end{align}\] We can further substitute in the values for the Bernoulli mass function, which are \[\begin{align} \textrm{bernoulli}(1 \mid \lambda) &= \lambda, \ \textrm{and} \\[4pt] \textrm{bernoulli}(0 \mid \lambda) &= 1 - \lambda, \end{align}\] to produce the standard form of the mixture model sampling distribution, \[ p(y_n \mid \lambda, \mu, \sigma) = \lambda \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \cdot \textrm{normal}(y_n \mid \mu_0, \sigma_0). \]

11.3 Simulated data: adult height of men and women

Let’s consider a specific instance, a collection of height data where measurements for men and women have not been identified. Let’s assume for the sake of simulation that men have a mean height of 175cm with standard deviation of 7.5cm and women have a mean height of 162cm with a standard deviation of 6.3cm. We’ll further assume that women make up 51% of the adult population (this varies across age bands). This will allow us to simulate and plot data for 5000 randomly selected heights; we put vertical lines at the mean female height (red) and male height (blue).

M = 100 if DRAFT else 5_000
female_mean_height_cm = 162
male_mean_height_cm = 175
female_sd_height_cm = 6.3
male_sd_height_cm = 7.5
prob_female = 0.511
prob_male = 1 - prob_female

# codes female = 0, male = 1
mu = np.array([female_mean_height_cm, male_mean_height_cm])
sigma = np.array([female_sd_height_cm, male_sd_height_cm])
sex = np.array(np.random.binomial(n = 1, p = prob_male, size=M))
heights = np.random.normal(loc = mu[sex], scale = sigma[sex], size=M)

print(
  pn.ggplot(pd.DataFrame({'height': heights}),
                 pn.aes(x = 'height')) +
  pn.geom_histogram(bins=50, color='white') +
  pn.labs(x = 'height') +
  pn.geom_vline(xintercept=162, color='red') +
  pn.geom_vline(xintercept=175, color='blue') +
  pn.theme(axis_text_y = pn.element_blank(),
           axis_title_y = pn.element_blank(),  
           axis_ticks_major_y = pn.element_blank())

)

11.4 The log scale and log sum of exponents operation

We have successfully marginalized out the \(z\) and derived \(p(y \mid \lambda, \mu, \sigma)\), but this leaves us with a residual problem—Stan works on the log scale and we have done our derivation on the linear scale. That is, we start with \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \log\! \big( \, \strut & \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_0, \sigma_0) \\[2pt] & + \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1)\, \big). \end{align}\] To reduce this further, we are going to use the log-sum-exp operation, which is defined by \[ \textrm{log-sum-exp}(a, b) = \log \left( \strut \exp(a) + \exp(b)\right). \]

The reason we use log-sum-exp is that it can be made arithmetically stable in such a way as to prevent numerical overflow by pulling out the maximum value, \[ \textrm{log-sum-exp}(a, b) = \textrm{max}(a, b) + \log\left( \strut \exp(a - \textrm{max}(a, b)) + \exp(b - \textrm{max}(a, b))\right). \] This form eliminates numerical overflow because exponentiation is only applied to the terms \(\left( a - \textrm{max}(a, b)\right)\) and \(\left(b - \textrm{max}(a, b)\right)\), both of which must be less than or equal to zero by construction. Underflow is mitigated by pulling the leading term \(\textrm{max}(a, b)\) out of the sum.
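
Here is a minimal Python sketch of the stabilized computation alongside the naive version, which overflows for large inputs:

import numpy as np

def log_sum_exp(a, b):
    # pull out the max so that exponentiation never overflows
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

print(log_sum_exp(1000.0, 1001.0))              # 1001.313...
print(np.log(np.exp(1000.0) + np.exp(1001.0)))  # inf (with an overflow warning)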

Using log-sum-exp, we can now rewrite our marginalized individual item likelihood as \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \textrm{log-sum-exp}\big( &\log \textrm{bernoulli}(0 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_0, \sigma_0), \\[2pt] &\log \textrm{bernoulli}(1 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_1, \sigma_1) \big). \end{align}\]

11.5 Stan program for mixtures

As in our earlier examples, the Stan code for mixtures just follows the mathematical definitions and uses abstract variable names mu, sigma, and lambda (it might make sense to make these names meaningful in a specific application of mixtures in a larger context).

In the Python simulation code, we coded female as 0 and male as 1. In the Stan program, we retain that coding for the Bernoulli distribution for sex. But then our indexes for vectors start at 1, so we have coded the female parameters as mu[1], sigma[1] and the male parameters as mu[2], sigma[2] for this application of the generic Stan program.

stan/heights.stan
data {
  int<lower=0> M;
  vector<lower=0>[M] heights;
}
parameters {
  real<lower=0, upper=1> lambda;
  ordered[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(170, 10);
  sigma ~ lognormal(log(7), 1);
  for (m in 1:M) {
    real lp1 = bernoulli_lpmf(0 | lambda)
               + normal_lpdf(heights[m] | mu[1], sigma[1]);
    real lp2 = bernoulli_lpmf(1 | lambda)
               + normal_lpdf(heights[m] | mu[2], sigma[2]);
    target += log_sum_exp(lp1, lp2);
  }
}

This program uses local variables lp1 and lp2. These have scope just within the for-loop and take on different values with each loop iteration. In general, local variables can be assigned and reassigned in Stan. They are declared with sizes, but cannot have constraints.

Unlike in the original model, we have to index mu and sigma from 1 because Stan arrays are indexed from 1. The biggest departure from the mathematical model as written is that we have declared mu to be of type ordered, which is the type of a vector of increasing values. The reason we do this is that the indexes are not well identified in the sense that swapping indexes preserves the same likelihood. This way, the smaller component will be the first component and the model will be identified without changing the likelihood. Betancourt (2017b) provides more details on how this works and shows that ordering does not change the underlying model.

We have included weakly informative priors for mu and sigma based on our prior knowledge of heights. The likelihood then just follows the mathematical definition. We had to resort to a loop because Stan isn’t currently able to vectorize mixtures.

11.6 Fitting the simulated data

Now we can fit our simulated data.

model = csp.CmdStanModel(stan_file = '../stan/heights.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = J, iter_sampling = J,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)          
               Mean    MCSE  StdDev         5%        50%        95%   N_Eff  N_Eff/s  R_hat
lp__      -18000.00  0.0500   1.700  -18000.00  -18000.00  -18000.00  1100.0     12.0    1.0
lambda         0.48  0.0028   0.071       0.37       0.48       0.61   650.0      6.9    1.0
mu[1]        162.00  0.0280   0.730     161.00     162.00     163.00   679.0      7.2    1.0
mu[2]        175.00  0.0470   1.200     173.00     175.00     177.00   634.0      6.7    1.0
sigma[1]       6.40  0.0099   0.290       5.90       6.40       6.80   849.0      9.0    1.0
sigma[2]       7.40  0.0180   0.460       6.70       7.40       8.20   674.0      7.2    1.0

Mixture models are quite challenging to fit when the components have similar distributions as they do here (the normal components have similar locations and scales, so much of their probability mass overlaps). Checking the fit, we find \(\widehat{R}\) values as high as 1.01 (still acceptable) and effective sample sizes of around 1/6 the number of iterations. Nevertheless, we see the Monte Carlo standard error is relatively low for all posterior means other than for \(\lambda\). Although we recover reasonable estimates for all parameters (simulated values were \(\lambda = 0.489\) for the probability of being male, \(\mu = [162 \ 175]\), and \(\sigma = [6.3 \ 7.5]\)), the posterior intervals are only well constrained for mu and sigma, with lambda being less well identified. These intervals shrink with the number of observations \(N\) at a rate of \(\mathcal{O}(1 / \sqrt{N})\).

11.7 Built-in Stan function for mixtures

For simple mixtures of this kind, Stan supplies a built-in function

log_mix(lambda, lp1, lp2)
    = log_sum_exp(log(lambda) + lp1,
                  log1m(lambda) + lp2)

which is a more efficient drop-in replacement for the explicit definition in our example code. The body of the loop in the model can be replaced with the one-liner

target += log_mix(lambda, normal_lpdf(heights[m] | mu[1], sigma[1]),
                          normal_lpdf(heights[m] | mu[2], sigma[2]));

This may run a bit faster, but it won’t mix any better as it defines exactly the same target log density as the long form.

11.8 Recovering posterior distributions over discrete parameters

We have marginalized out the discrete parameters, but what if they are of interest in the fitted model? It turns out we can sample \(z\) in the generated quantities block. All we have to do is work out the probability that \(z_n = 0\) (female) and sample. Once we have this probability, we can calculate many quantities of interest in expectation rather than working with the sample (i.e., we will Rao-Blackwellize). Because \(Z_n\) is binary, \(\mathbb{E}[Z_n \mid y, \lambda, \mu, \sigma] = \Pr[Z_n = 1 \mid y, \lambda, \mu, \sigma] = 1 - \Pr[Z_n = 0 \mid y, \lambda, \mu, \sigma]\). The calculation given values of \(\lambda\), \(\mu\), \(\sigma\), and \(y_n\) is \[ \Pr\!\left[Z_n = 0 \mid y, \lambda, \mu, \sigma\right] \ = \ \frac{\displaystyle \textrm{bernoulli}(0 \mid \lambda) \cdot p(y_n \mid \mu_0, \sigma_0)} {\displaystyle \textrm{bernoulli}(0 \mid \lambda) \cdot p(y_n \mid \mu_0, \sigma_0) + \textrm{bernoulli}(1 \mid \lambda) \cdot p(y_n \mid \mu_1, \sigma_1)}. \] On the log scale where Stan operates, that’s \[ \log \Pr[Z_n = 0 \mid y, \lambda, \mu, \sigma] = \ell_0 - \textrm{log-sum-exp}\!\left(\ell_0, \ell_1\right), \] where \(\ell_k = \log \textrm{bernoulli}(k \mid \lambda) + \log p(y_n \mid \mu_k, \sigma_k)\).

Here’s a new generated quantities block we can add to the last program, heights.stan, in order to generate the probability that the first 10 data items are heights for women, and to randomly generate the sex for the first ten individuals based on this probability.

stan/heights-post.stan
data {
  int<lower=0> M;
  vector<lower=0>[M] heights;
}
parameters {
  real<lower=0, upper=1> lambda;
  ordered[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(170, 10);
  sigma ~ lognormal(log(7), 1);
  for (m in 1:M) {
    real lp1 = bernoulli_lpmf(0 | lambda)
               + normal_lpdf(heights[m] | mu[1], sigma[1]);
    real lp2 = bernoulli_lpmf(1 | lambda)
               + normal_lpdf(heights[m] | mu[2], sigma[2]);
    target += log_sum_exp(lp1, lp2);
  }
}
generated quantities {
  vector<lower=0, upper=1>[10] Pr_female;
  for (m in 1:10) {
    real lp1 = bernoulli_lpmf(0 | lambda)
               + normal_lpdf(heights[m] | mu[1], sigma[1]);
    real lp2 = bernoulli_lpmf(1 | lambda)
               + normal_lpdf(heights[m] | mu[2], sigma[2]);
    Pr_female[m] = exp(lp1 - log_sum_exp(lp1, lp2));
  }
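  // note: bernoulli_rng(Pr_female) returns 1 with probability Pr_female,
  // so sex == 1 codes female here (the reverse of the simulation's coding)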
  array[10] int<lower=0, upper=1> sex = bernoulli_rng(Pr_female);
}

Now we can compile and fit this model.

model = csp.CmdStanModel(stan_file = '../stan/heights-post.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = J, iter_sampling = J,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)          
                    Mean    MCSE  StdDev          5%         50%        95%   N_Eff  N_Eff/s  R_hat
lp__          -18000.000  0.0470   1.700  -18000.000  -18000.000  -18000.00  1300.0     14.0    1.0
lambda             0.480  0.0027   0.073       0.370       0.470       0.60   740.0      7.9    1.0
mu[1]            162.000  0.0270   0.750     161.000     162.000     163.00   786.0      8.5    1.0
mu[2]            175.000  0.0450   1.200     173.000     175.000     177.00   745.0      8.0    1.0
sigma[1]           6.400  0.0099   0.300       5.900       6.400       6.90   888.0      9.6    1.0
sigma[2]           7.400  0.0170   0.470       6.700       7.400       8.20   782.0      8.4    1.0
Pr_female[1]       0.057  0.0013   0.037       0.013       0.049       0.13   822.0      8.9    1.0
Pr_female[2]       0.840  0.0032   0.086       0.680       0.860       0.95   723.0      7.8    1.0
Pr_female[3]       0.220  0.0034   0.095       0.086       0.210       0.40   776.0      8.4    1.0
Pr_female[4]       0.380  0.0045   0.120       0.180       0.370       0.59   753.0      8.1    1.0
Pr_female[5]       0.650  0.0046   0.120       0.430       0.660       0.83   730.0      7.9    1.0
Pr_female[6]       0.890  0.0026   0.069       0.760       0.910       0.97   724.0      7.8    1.0
Pr_female[7]       0.840  0.0032   0.086       0.680       0.860       0.95   723.0      7.8    1.0
Pr_female[8]       0.940  0.0019   0.051       0.840       0.950       0.99   729.0      7.9    1.0
Pr_female[9]       0.440  0.0047   0.130       0.230       0.440       0.65   747.0      8.1    1.0
Pr_female[10]      0.970  0.0014   0.038       0.900       0.990       1.00   767.0      8.3    1.0
sex[1]             0.055  0.0037   0.230       0.000       0.000       1.00  3733.0     40.0    1.0
sex[2]             0.850  0.0061   0.360       0.000       1.000       1.00  3507.0     38.0    1.0
sex[3]             0.230  0.0076   0.420       0.000       0.000       1.00  3007.0     32.0    1.0
sex[4]             0.390  0.0084   0.490       0.000       0.000       1.00  3354.0     36.0    1.0
sex[5]             0.640  0.0085   0.480       0.000       1.000       1.00  3176.0     34.0    1.0
sex[6]             0.890  0.0055   0.310       0.000       1.000       1.00  3139.0     34.0    1.0
sex[7]             0.850  0.0066   0.360       0.000       1.000       1.00  2995.0     32.0    1.0
sex[8]             0.940  0.0045   0.240       0.000       1.000       1.00  2871.0     31.0    1.0
sex[9]             0.430  0.0092   0.490       0.000       0.000       1.00  2917.0     31.0    1.0
sex[10]            0.970  0.0029   0.170       1.000       1.000       1.00  3209.0     35.0    1.0

The probabilities range from 0 to 1 with some intermediate values. We have calculated posterior expectations two ways, by averaging expected values (Pr_female) and by taking random draws (sex). The posterior means are similar for these variables, but sex has much higher variance due to the sampling.

We can compare the probabilities to the simulated values of \(z\), which are

print(f"{sex[0:10] =}")
sex[0:10] =array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])

This shows more than anything else that it is hard to classify whether someone is a man or a woman based on their height alone. It does not show that this is a bad model for heights.

12 Debugging Stan programs

As is usual in simple tutorials like this, we have presented a series of Stan programs that actually work as intended. When developing new programs, one often runs into problems. This section will cover some typical Stan coding errors and provide some tips on debugging.

12.1 Failure to provide support over constrained parameters

Stan requires programs to assign a finite log density to any parameters that satisfy the constraints. For example, the following program puts no constraints on sigma in the declaration, but the model block requires sigma to be greater than zero.

parameters {
  real sigma;
}
model {
  sigma ~ lognormal(0, 1);
}  

This violates Stan’s requirements because the declared constraints allow a negative value for sigma, but the model block does not. The way to fix this is to declare sigma to be of type real<lower=0>. Often this kind of bug shows up as a failure to randomly initialize. For instance, if we had vector[10] sigma, we’d only have a \(2^{-10}\) or about 1/1000 chance of generating valid initializations at random, because Stan initializes parameters using a \(\textrm{uniform}(-2, 2)\) distribution on the unconstrained scale.
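
The arithmetic behind that estimate is easy to check:

# each component has probability 1/2 of a positive uniform(-2, 2) draw,
# so the chance that all 10 components initialize in the support is
print(0.5 ** 10)  # 0.0009765625, roughly 1 in 1000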

12.2 Start with small, simulated data sets

It can be difficult to debug a model fit to real-world data at scale. Instead, simulating a small data set lets you work the kinks out of the code in a controlled context where you know the true answers. Then, when the code is working on small simulated data, it can be loosed on the real data.

Simulating data can be challenging for some models. First, it can be a lot of effort, because it requires rewriting the model using random number generation in the generated quantities block. Second, some models are not fully generative because they either have improper priors or because they have reduced rank parameterizations (e.g., intrinsic conditional autoregressive models or sum-to-zero parameter constraints).

Data can be simulated in Python or it can be simulated in Stan. The advantage of simulation in Stan is that all of the functions will match, such as gamma densities or negative binomial mass functions, which have multiple parameterizations in widespread use in the literature.

12.3 Debug by print

Stan has a built-in print() function that can take string and Stan variable arguments, which will be printed when it is executed. In particular, printing the target() value can help detect when it becomes \(-\infty\) (and thus causes rejections). This can be done line by line, for example,

model {
  y ~ normal(alpha + beta * x, sigma);
  print("step 17: target() = ", target());
  sigma ~ lognormal(0, 1);
  print("step 18: target() = ", target());
}

Then if one of the statements throws an exception, you will see the print statement just before it was called. If one of the statements leads to a not-a-number or negative infinity value, you will see the first place this affects target(). You can also print the results of intermediate calculations. Be careful printing long structures; you might want to try printing a slice, e.g.,

matrix[256, 256] a = b * c;
print("a[1:3, 5:6] = ", a[1:3, 5:6]);

12.4 Test quantities of interest

So far, we’ve only talked about making the program work as intended. We also want to test our output, as we have done in several of our running examples. If we wrote a model in order to provide posterior predictive inference, test it with posterior predictive inference (on real or simulated data).

12.5 More advice on workflow

Going into full details of how to develop Stan models is beyond this simple getting started tutorial. For more information, see Gelman et al. (2020). The various inference mechanisms targeted by the Bayesian workflow suggestions can be implemented in Stan as outlined in Part III of Stan Development Team (2023c).

13 Making Stan programs go faster

For MCMC, there are two independent components to efficiency, (1) the statistical efficiency of the sampler, which is measured by effective sample size per iteration, and (2) program efficiency, which is measured by how long the program takes to compute the log density and gradients.

13.1 Computational efficiency

There are many ways to write a Stan program to compute the log density and gradients of a model with respect to a fixed parameter vector. Some of them are faster than others. Which version is faster may wind up depending on the shape of the data (e.g., caching can be very valuable in some circumstances where variables are heavily reused, but can be a burden in sparse data circumstances where all combinations of variables are not used).

For speeding up Stan programs, most of the usual considerations in speeding up computer programs apply. Specifically, we want to optimize memory locality and we want to reduce branch points; together these are the main efficiency bottlenecks for modern code.

Get it working first, then optimize

As Donald Knuth is reputed to have said, “Premature optimization is the root of all evil.” It’s pretty much always a good idea to get a simple version of your code working on a small slice of data first. Then only worry about optimization if you need it. If you do need to optimize, don’t try it without an unoptimized form that works as a guide. Every developer has learned through hard experience that it’s often a whole lot easier to optimize working code than it is to debug optimized code.

Do not repeat yourself

The main rule for making Stan programs go fast is don’t repeat yourself. If there is a computation that is done twice, it is best to do the computation once, save the result in a local variable, and reuse the result when it is needed again. For example, don’t do this:

// DO NOT DO THIS:
for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], exp(log_sigma));
}

The problem here is that exp(log_sigma) gets computed N times. This uses more memory, as Stan records every operation with links to its operands for each evaluation of the log density. It is also slow, with N - 1 redundant applications of exponentiation in the forward pass and multiply/adds for the chain rule for derivatives in the backward pass. Instead, the code should be refactored to compute the exponentiation of log_sigma once, assign it to a local variable, and re-use it in the loop.

real sigma = exp(log_sigma);
for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}

Avoid automatic differentiation if possible

If you have a constant that’s being defined, define it in the transformed data block. That way it only gets evaluated once. For example, don’t do the following, as it repeatedly allocates and fills a \(K\)-vector and then applies useless automatic differentiation steps in the reverse pass.

data {
  real<lower=0> alpha;
}
model {
  // DO NOT DO THIS:
  vector[K] alpha_v = rep_vector(alpha, K);
  theta ~ dirichlet(alpha_v);
}  

Instead, define the constant in the transformed data block and then reuse it in the model block, as follows.

transformed data {
  vector<lower=0>[K] alphas = rep_vector(alpha, K);
}
model {
  theta ~ dirichlet(alphas);
}

The other way to get rid of autodiff is to move posterior predictions into the generated quantities block. In general, if it’s possible to move a variable to the generated quantities block, it should be moved. For example, we might be tempted to do posterior predictive simulation as follows.

parameters {
  real alpha, beta;
  real<lower=0> sigma;
  // DO NOT DO THIS:
  vector[M_tilde] y_tilde;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  y_tilde ~ normal(alpha + beta * x_tilde, sigma);
}

This is correct and will provide the right answer, but it is computationally and statistically inefficient. It’s computationally inefficient because Hamiltonian Monte Carlo is more expensive than random variate generation. It’s also statistically inefficient because it will typically produce correlated draws rather than independent draws, leading to a lower effective sample size. In situations like this, we can move y_tilde down to generated quantities because it’s conditionally independent of the data given the parameters alpha, beta, and sigma.

parameters {
  real alpha, beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  array[M_tilde] real y_tilde = normal_rng(alpha + beta * x_tilde, sigma);
}  

Vectorized _rng functions return arrays; if we really need a vector for y_tilde for some reason, we can convert with to_vector().

Do not use diagonal multivariate normal

Instead of

// DO NOT DO THIS:
y ~ multi_normal(mu, diag_matrix(sigma));

which creates a matrix then factors to compute its inverse and determinant, we can use the vectorized form of the univariate normal,

y ~ normal(mu, sigma);

The result is the same up to a constant.

Memory locality, copying, and storage order

Accessing a row or column of a matrix requires allocating memory and copying. Memory pressure is one of the biggest constraints in modern code and relieving it is one of the best things to do to make code go faster.

Matrices are organized in column-major order in memory, whereas arrays are stored in row-major order. That means that the columns of a matrix are stored together in memory, and therefore slicing out a column Sigma[1:M, n] is relatively efficient for large M, but slicing out a row using Sigma[m, 1:N] will be inefficient if N is large.

If you need to access vectors one by one but you do not need matrix operations, use an array of vectors in either order:

matrix[M, N] alpha;
array[M] row_vector[N] alpha_arr;
array[N] vector[M] alpha_t_arr;  // transposed

Accessing elements of an array does not require any copying.

Overall, copying and rearranging data structures is not very expensive in Stan. Loops are very fast, and rearranging doesn’t add anything to the expression graph for automatic differentiation.

Vectorize non-linear operations over parameters

Even though loops are very fast in Stan, every expression involving parameters leads to nodes and edges being added to the expression graph. The usual suspect is sampling in a loop, as follows.

// DO NOT DO THIS:
for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}

Written this way, Stan isn’t smart enough to realize it can compute log(sigma) once and re-use it. It will also build a huge number of expression graph edges from the nodes representing the log density of y[n] to alpha and beta. The same contribution to the log density can be executed much faster with vectorized code such as the following.

y ~ normal(alpha + beta * x, sigma);

The speed from vectorization is not because loops are slow (at least if traversed in memory-local order), it’s because doing autodiff in a loop leads to a much larger expression graph and hence much more work when updating the chain rule for derivatives. The vectorized form has a single output variable for the log density with three edges leading to alpha, beta, and sigma. It will also be clever enough to calculate log(sigma) once and reuse it. This advantage is even bigger for distributions like multivariate normal that require expensive internal determinant calculations.

13.2 Statistical efficiency

By far the best solution for making Stan programs go faster is to improve the statistical efficiency of the model.

Identifiability

The first thing to consider is identifiability. A likelihood \(p(y \mid \theta)\) is identifiable if distinct parameters lead to distinct likelihoods, or in symbols, if \(\theta \neq \theta'\) implies \(p(y \mid \theta) \neq p(y \mid \theta')\). The other way around, a likelihood is not identifiable if there exist \(\theta \neq \theta'\) such that \(p(y \mid \theta) = p(y \mid \theta').\)

The classic Bradley-Terry model for inferring quality based on pairwise comparisons has a non-identifiable likelihood. Let’s consider an application to sports team ranking (the model was originally applied to consumer products). Each team \(j\) has an ability parameter \(\alpha_j\). The log odds of team \(j\) defeating team \(k\) are modeled as \(\alpha_j - \alpha_k\). The observations are the results of games \(n\) between teams \(A_n\) and \(B_n\) (the notation is meant to be suggestive of the original application to A/B testing). This defines the likelihood \[ p(y_n \mid \alpha) = \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(\alpha_{A_n} - \alpha_{B_n})). \] In Stan, this likelihood can be coded as

model {
  // DO NOT DO THIS:
  y ~ bernoulli_logit(alpha[A] - alpha[B]);
}

where the vectorized sampling statement unfolds to the following equivalent, but less efficient, form.

  for (n in 1:N) {
    y[n] ~ bernoulli_logit(alpha[A[n]] - alpha[B[n]]);
  }    

A model is non-identifiable if two different values of the parameters provide the same sampling distribution. This model is non-identifiable because for any constant \(c\), \[ p(y_n \mid \alpha) = p(y_n \mid \alpha + c). \] This holds because \[ (\alpha + c)_{A_n} - (\alpha + c)_{B_n} = (\alpha_{A_n} + c) - (\alpha_{B_n} + c) = \alpha_{A_n} - \alpha_{B_n}. \] The problem is that we only care about relative rankings with this model, but the naive model induces a search for consistent absolute rankings, which are not well defined.
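We can check this invariance numerically with a quick Python sketch; the abilities and matchups below are made up for illustration.

# Numerical check that the Bradley-Terry likelihood is unchanged
# when a constant is added to all abilities (illustrative values).
import numpy as np

def inv_logit(v):
    return 1 / (1 + np.exp(-v))

alpha = np.array([0.3, -1.2, 0.8])                # made-up abilities
A = np.array([0, 1, 2]); B = np.array([1, 2, 0])  # matchups
c = 5.0
print(np.allclose(inv_logit(alpha[A] - alpha[B]),
                  inv_logit((alpha + c)[A] - (alpha + c)[B])))  # True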

To get around this problem, we can do three things. The first two involve reducing the number of free parameters and the third involves soft identification through a prior. First, we can pin one of the values of \(\alpha\), traditionally by setting \(\alpha_1 = 0\). This reduces the number of free parameters by one, as is reflected in the Stan parameterization.

parameters {
  vector[J - 1] alpha_rest;
}
transformed parameters {
  vector[J] alpha = append_row(0, alpha_rest);
}

We can then work with alpha directly. In the second approach, we can enforce the constraint \(\textrm{sum}(\alpha) = 0\). This also reduces the number of free parameters by one and is coded in a similar way, only with a different transform.

transformed parameters {
  vector[J] alpha = append_row(-sum(alpha_rest), alpha_rest);
}  

On the plus side, this approach is symmetric. On the downside, it is not generative in that it is not clear how to add a new team \(J + 1\).

The third way to make a Bayesian model identifiable is to add a prior. For example, we can do this,

model {
  alpha ~ normal(0, 3);
}

This provides a kind of soft identification in the posterior, where values of alpha near 0 have higher posterior density. The third approach is overparameterized but has a pleasing symmetry and generalizability to new teams.

Adaptation and preconditioning

The condition number of a positive-definite matrix is the ratio of its largest eigenvalue to its smallest. For a density, we can consider the negative inverse Hessian of its log density at a point. For a multivariate normal distribution, the negative inverse Hessian is its covariance at every point. That is, a normal distribution has constant curvature.

The larger the condition number, the harder it is to sample from a model, because multiple steps at the scale of the smallest eigenvalue are required to take a step in the direction of the largest eigenvalue. The first-order gradient-based approximation that Hamiltonian Monte Carlo uses can also fail due to high curvature; when the step is too large, the result will be divergent Hamiltonian trajectories (i.e., ones that do not preserve the Hamiltonian).
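As a quick numerical sketch (using an arbitrary positive-definite matrix, not one drawn from any model here), we can compute a condition number from the eigenvalues with NumPy.

# Condition number of an arbitrary covariance matrix: the ratio
# of its largest eigenvalue to its smallest.
import numpy as np

Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])       # positive-definite example
eigs = np.linalg.eigvalsh(Sigma)     # eigenvalues in ascending order
print(eigs[-1] / eigs[0])            # condition number, approx 62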

When the negative inverse Hessian is not positive definite, the log density is not concave, which can present an even greater challenge to sampling than poor conditioning. If the inverse Hessian is positive definite and relatively constant over the posterior, we can correct for non-unit conditioning by preconditioning, which can rotate and scale a density until it looks more like a standard normal. In fact, given a multivariate normal, preconditioning does just that: it rotates and scales the density back to a standard normal.

Stan begins adaptation in a state that assumes the posterior is standard normal, which is (a) centered at the origin, (b) independent in dimension, and (c) unit scale. The closer a posterior is to standard normal, the easier it will be for Stan to adapt and to sample.

During adaptation, Stan estimates the posterior (co)variance to use for preconditioning. If the posterior Hessian is constant, the covariance will be equal to the negative inverse Hessian everywhere.

With its default settings, Stan preconditions by scaling each of the parameters as close to unit scale as it can manage. Under this scheme, in \(D\) dimensions, the adapted metric requires \(\mathcal{O}(D)\) storage and each leapfrog step (the innermost loop in Stan) requires \(\mathcal{O}(D)\) time. By selecting a dense metric, Stan will also adapt to a fixed curvature (i.e., one that does not vary over the posterior). This requires \(\mathcal{O}(D^2)\) storage, several \(\mathcal{O}(D^3)\) Cholesky factorizations, and \(\mathcal{O}(D^2)\) time per leapfrog step to evaluate.

Reparameterization

We can often reparameterize models to make them easier to sample. The standard example is reparameterizing a varying effect to use a non-centered parameterization. The centered parameterization is

parameters {
  real<lower=0> sigma;
  vector[N] alpha;
}
model {
  sigma ~ lognormal(0, 1.5);
  // DO NOT DO THIS (UNLESS alpha IS LARGE):
  alpha ~ normal(0, sigma);
}

This model induces a strong dependence between sigma and alpha. When sigma is small, alpha must be small, and when sigma is large, alpha will vary over a wide range. This varying curvature (high curvature for low sigma, very flat expanses for high sigma) is very challenging for MCMC sampling.

We can overcome the difficulty in this parameterization by non-centering, which we can code explicitly as follows.

parameters {
  real<lower=0> sigma;
  vector[N] alpha_std;
}
transformed parameters {
  vector[N] alpha = sigma * alpha_std;
}
model {
  sigma ~ lognormal(0, 1.5);
  alpha_std ~ normal(0, 1);
}

We now have an independent standard normal distribution on alpha_std, not on alpha. It induces the correct distribution on the transformed parameter alpha, so that alpha has a normal(0, sigma) distribution. While this correction applies only to the prior, if there is not a lot of data informing each element of alpha, the posterior will be similar.

We can code the same model as above using an affine variable transform to rescale alpha with sigma.

parameters {
  real<lower=0> sigma;
  vector<multiplier=sigma>[N] alpha;  // non-centering
}
model {
  sigma ~ lognormal(0, 1.5);
  alpha ~ normal(0, sigma);  
}

Priors for speed and knowledge

Often we are tempted to use vague or overdispersed priors. When this gets too extreme, it can cause computational difficulties in navigating flat expanses of probability space, and it can cause statistical difficulties when the prior pulls most of the probability mass away from where the posterior would naturally concentrate.

If you have used vague priors to avoid thinking too hard and computation is going awry, you might consider adding priors with a bit more information. It’s often possible to impose weakly informative priors that only determine the scale of a variable, which is often known. For example, if we think a value \(\alpha\) is going to be in the single digits, we can use a prior like

alpha ~ normal(0, 5);

that puts approximately 95% of the probability mass in the region \((-10, 10)\).
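We can verify that claim with a quick Monte Carlo check in Python; this is a sketch for illustration, not part of any model fit here.

# Estimate the prior mass that normal(0, 5) places in (-10, 10);
# the interval spans two standard deviations in each direction.
import numpy as np

draws = np.random.normal(0, 5, 1_000_000)
print(np.mean(np.abs(draws) < 10))   # approx 0.954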

We do not recommend interval constraints other than for naturally constrained parameters (e.g., probabilities fall in \([0, 1]\) and concentrations in \([0, \infty)\)). The reason we do not want to use

// DO NOT DO THIS
// (UNLESS alpha IS PHYSICALLY CONSTRAINED IN (-10, 10))
alpha ~ uniform(-10, 10);

is that if the data is consistent with alpha near the boundary (-10 or 10), then probability mass gets bunched up and posterior means get pushed away from the boundary. For example, consider the following Stan program.

stan/interval-censor.stan
data {
  int<lower=0> B;
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=-B, upper=B> mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}

In this model, we have an improper uniform prior on sigma and a proper uniform prior on mu in the interval \((-B, B)\). Stan allows improper priors as long as the posterior is proper.

We can then simulate data according to the sampling distribution, \[ y_n \sim \textrm{normal}(9.9, 4). \] Although the true value of \(\mu\) is 9.9, which falls in the interval \((-10, 10)\), consider what happens when we fit 50 random draws with this model.

model = csp.CmdStanModel(stan_file = '../stan/interval-censor.stan')
N = 50
mu = 9.9
sigma = 4
y = np.random.normal(mu, sigma, N)
B = 10
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample.summary(sig_figs = 2)              
interval = (-10, 10)
|       | Mean  | MCSE   | StdDev | 5%    | 50%   | 95%   | N_Eff  | N_Eff/s | R_hat |
|-------|-------|--------|--------|-------|-------|-------|--------|---------|-------|
| lp__  | -86.0 | 0.039  | 1.10   | -89.0 | -86.0 | -85.0 | 820.0  | 14000.0 | 1.0   |
| mu    | 9.1   | 0.012  | 0.44   | 8.4   | 9.2   | 9.8   | 1300.0 | 21000.0 | 1.0   |
| sigma | 3.5   | 0.010  | 0.38   | 3.0   | 3.5   | 4.2   | 1300.0 | 22000.0 | 1.0   |

What happens is that all of the posterior mass for \(\mu\) that would fall above 10 is pushed down inside the interval. As a result, we bias \(\mu\) and \(\sigma\) so that their estimates are too low. If we were to expand the interval to 20 and fit the same data, the result is quite different.

B = 20
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample2 = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample2.summary(sig_figs = 2)             
interval = (-20, 20)
|       | Mean  | MCSE   | StdDev | 5%    | 50%   | 95%   | N_Eff  | N_Eff/s | R_hat |
|-------|-------|--------|--------|-------|-------|-------|--------|---------|-------|
| lp__  | -84.0 | 0.0220 | 0.99   | -86.0 | -84.0 | -83.0 | 2000.0 | 36000.0 | 1.0   |
| mu    | 9.2   | 0.0087 | 0.50   | 8.4   | 9.2   | 10.0  | 3300.0 | 60000.0 | 1.0   |
| sigma | 3.5   | 0.0062 | 0.36   | 3.0   | 3.5   | 4.1   | 3400.0 | 62000.0 | 1.0   |

If you were thinking we should recover \(\mu = 9.9\) and \(\sigma = 4\), it’s worth noting the sample mean and standard deviation for our simulated data y, which are

print(f"mean(y) = {np.mean(y):5.2f};  sd(y) = {np.std(y):4.2f}")
mean(y) =  9.21;  sd(y) = 3.40

13.3 A final note on priors

Priors are particularly important for controlling computation when there is not much data. We can evaluate prior sensitivity by changing the priors, as we did in the last section, and seeing what effect that has on the posterior.

In general, we would recommend using as much prior knowledge as you can during inference. Not doing so leaves useful information on the table.

14 Further reading

14.1 Scientific programming in Python

A good introduction to Python for data science applications is VanderPlas (2023). The current edition is available open access from the GitHub repository for the Python Data Science Handbook.

For intermediate Python programming, a standard reference is Fluent Python by Ramalho (2022).

14.2 Random number generation

The standard reference for pseudorandom number generation is still the book by Devroye (1986). It starts from scratch with an overview of uniform pseudorandom number generation and proceeds through advanced multivariate techniques. The author has provided a free online pdf on the home page of Non-Uniform Random Variate Generation.

14.3 Bayesian statistics

For an introduction to Bayesian statistics, I would highly recommend starting with the book by McElreath (2023). The second edition is available as a free online pdf from the GitHub repository for Statistical Rethinking.

For more advanced Bayesian statistics and more on computation, see the third edition of the classic text by Gelman et al. (2013), which is available as a free pdf from the home page for Bayesian Data Analysis 3.

We provide a summary of the mathematics of Bayesian statistics in Appendix D: Bayesian statistics, as well as background material in Appendix A: Set theory and Appendix B: Probability theory.

14.4 Markov chain Monte Carlo

A good single-source for learning about MCMC methods is the handbook edited by Brooks et al. (2011). The key chapters on basic MCMC theory, convergence monitoring, and Hamiltonian Monte Carlo are available for free from the sample chapters page of the Handbook of Markov Chain Monte Carlo.

We can also recommend the paper by Roberts and Rosenthal (2004) as a solid mathematical foundation for MCMC that is about the same mathematical level as the appendices of this introduction.

We provide a summary of the mathematics of Monte Carlo methods and Markov chain Monte Carlo methods in Appendix C: Monte Carlo methods.

14.5 Stan

The reference Stan paper is by our original development team (Carpenter et al. 2015), but there is no point in reading that paper if you have read this. It’s better to skip directly to the Stan User’s Guide (Stan Development Team 2023c), which contains much more information on modeling in Stan than this introduction. For full details on Stan’s programming language and execution, see the Stan Reference Manual (Stan Development Team 2023b); for information on the library of special mathematical and statistical functions, see the Stan Functions Reference (Stan Development Team 2023a).

There is also a wealth of information available in the form of Stan case studies and Stan conference proceedings, both of which are distributed in the form of reproducible notebooks.

14.6 Bayesian workflow

The paper that kicked off the Stan project’s focus on workflow is Gabry et al. (2019); the current state of our thinking is Gelman et al. (2020). Many of the Stan developers are working on turning the paper into a book; the current state is available from the public GitHub repository for Bayesian Workflow.

A. Set theory

Probability theory is based on relatively simple set theory. We define the set theory we need, including notions of countability, relations, and functions, in this section.

A.1 Basic set notation

One way to think of a set is as a bag of objects without duplicates. We write sets using curly braces. For example, the set containing the numbers 1 and 2 is written as \(\{ 1, 2 \}\). Sets are not ordered, so the notation \(\{ 1, 2 \}\) picks out the same set as \(\{ 2, 1 \}\).

Ellipses notation

Sets are often written with informal shorthand with ellipses (\(\ldots\)) standing in for a sequence of elements. For example, \(\{ 1, 2, \ldots, 7\}\) would pick out the set \(\{ 1, 2, 3, 4, 5, 6, 7 \}\) and \(\{ x_0, x_1, \ldots, x_4 \}\) is shorthand for \(\{ x_0, x_1, x_2, x_3, x_4 \}\).

Ellipses that end rather than separate are used to denote an unbounded sequence of elements. For example, we will write \(\{ 0, 1, 2, \ldots \}\) for the set of all counting numbers.

A.2 Membership, subsets, and equality

If \(A\) is a set, we write \(x \in A\) if \(x\) is a member of the set \(A\), and we write \(x \not\in A\) otherwise. For example, \(1 \in \{ 1, 2 \}\), but \(7 \not\in \{ 1, 2 \}\).

If \(A\) and \(B\) are sets, we write \(A \subseteq B\) and say that \(A\) is a subset of \(B\) if \(x \in A\) implies \(x \in B\). For example, \(\{ 1, 3 \} \subseteq \{ 0, 1, 2, 3, 4 \}\), but \(\{ 1, 3 \} \not\subseteq \{ 0, 2, 4 \}\).

Two sets \(A\) and \(B\) are equal, written \(A = B\), if they are subsets of each other, \(A \subseteq B\) and \(B \subseteq A\).

If \(A \subseteq B\) and \(A \neq B,\) we say that \(A\) is a proper subset of \(B\) (this notion of propriety extends to other orderings, such as those of the numbers).

A.3 Distinguished sets

Empty set

We use the notation \(\emptyset\) for the unique empty set which contains no members. In other words, for any \(x\), we know \(x \not\in \emptyset\). That means the empty set is a subset of every other set \(A\), \(\emptyset \subseteq A\).

Universal set

A universal set \(\mathcal{U}\) is one such that for every set \(A\) under consideration in some context, we have \[ A \subseteq \mathcal{U}. \] Depending on the sets under consideration, universal sets may not exist (in the sense of not being a set). We will only be considering sets of numbers here rather than infinitely nested sets of sets, so most of the complications of more general set theory do not arise.

Sets of numbers

The symbol \(\mathbb{Z}\) is conventionally used for the set of all integers (i.e., whole numbers). The symbol \(\mathbb{N}\) is used for the set of all counting numbers (i.e., non-negative whole numbers). The symbol \(\mathbb{R}\) is conventionally used for the set of all real numbers. The symbol \(\mathbb{C}\) is used for the set of all complex numbers.

A.4 Finite and infinite sets

A set \(s\) is finite if it can be written in the form \(s = \{ x_0, x_1, \ldots, x_{n-1}\}\) for some \(n \geq 0.\) If a set is not finite, it is infinite.

The size of a finite set is the number of distinct members it has. For example, the set \(s = \{ 1, 2, 3 \}\) has size 3. It is standard to use absolute-value notation for the size of a set, for example, \[ \left| \{ 1, 2, 3 \} \right| = 3. \] In general, every finite set \(s = \{ x_0, \ldots, x_{n-1} \}\) has size \(n\). The empty set \(\emptyset\) has size 0.

A.5 Comprehensions and power sets

Comprehensions are a way to turn conditions into sets. For example, I might take all the natural numbers that satisfy the condition of being odd or being less than 17 or both. We use the standard comprehension notation, \[ B = \{ x \in A \, : \, \phi(x) \}, \] for the set \(B\) that contains all of the elements of \(A\) that satisfy the condition \(\phi\).

The set of all subsets of a set \(A\) is called its power set and is defined by \[ \textrm{pow}(A) = \{ B : B \subseteq A \}. \]

Intervals

Given a set of numbers \(A,\) and two elements \(a, b \in A,\) define the open interval \[ (a, b) = \{ x \in A : a < x \textrm{ and } x < b \}; \] the closed interval includes the boundaries, \[ [a, b] = \{ x \in A : a \leq x \textrm{ and } x \leq b \}. \] We can also have mixed intervals \((a, b]\) and \([a, b)\) defined in the obvious way. This notation can be further extended to \(\infty\) and \(-\infty\) in the obvious way, e.g., \[ (0, \infty) = \{ x : 0 < x \textrm{ and } x < \infty \}, \] where \(x < \infty\) holds for all \(x.\) The interval \((0, \infty]\) does not make sense because \(\infty\) is not a number and hence \(\infty \not\in A.\)

In the discrete case, we often write \([a, b]\) as \(a{:}b\) for the set \(\{ a, a + 1, \ldots, b \}.\) This can be confusing because in some programming languages (MATLAB, R, Stan) \(a:b\) is defined as above, whereas in Python it is defined to exclude the upper bound as \([a, b).\)

A.6 Union, intersection, difference, and complement

The union of two sets \(A\) and \(B\), written \(A \cup B\), contains the elements that are in either set, \[ A \cup B = \{ x \, : \, x \in A \textrm{ or } x \in B \}. \] The intersection of two sets \(A\) and \(B\), written \(A \cap B\), contains the elements that are in both sets, \[ A \cap B = \{ x \, : \, x \in A \textrm{ and } x \in B \}. \] The sets under union and intersection form what is known as a boolean algebra, with intersection being like logical conjunction and union being like logical disjunction (and with complementation, defined below, being like logical negation). So all of the logical laws of Boolean algebra follow for sets.

Both unions and intersections can be extended to infinite sets using subscript notation. If \(A_0, A_1, \ldots\) is an infinite sequence of sets, then \[ \bigcup_{n = 0}^{\infty} A_n = \{ x \, : \, x \in A_n \textrm{ for some } n < \infty \} \] and \[ \bigcap_{n = 0}^{\infty} A_n = \{ x \, : \, x \in A_n \textrm{ for all } n < \infty \} \]

If \(A\) and \(B\) are sets, then their difference, written \(A \setminus B\), is the set of elements in \(A\) that are not in \(B\), \[ A \setminus B = \{ x \in A \, : \, x \not\in B \}. \]

If there is a universal set \(\mathcal{U}\), the complement of the set \(A\) is the set of elements in \(\mathcal{U}\) that are not in \(A\), \[ A^\complement = \mathcal{U} \setminus A = \{ x \in \mathcal{U} \, : \, x \not\in A \}. \] The complement \(A^\complement\) is sometimes written as \(\bar{A}\).

If we have a universal set \(\mathcal{U}\), then we can use it to define difference in terms of complementation through \[ A \setminus B = A \cap B^\complement. \]

A.7 Disjointness and partitions

A pair of sets \(A\) and \(B\) are disjoint if \(A \cap B = \emptyset\). Disjoint sets share no elements in the sense that if \(A\) and \(B\) are disjoint, there is no element \(x \in A\) such that \(x \in B.\)

A potentially infinite sequence of sets \(A_0, A_1, \ldots\) is pairwise disjoint if for every \(i \neq j\), \(A_i\) and \(A_j\) are disjoint, \(A_i \cap A_j = \emptyset\). In a sequence of pairwise disjoint sets, each element of their union appears in exactly one of the sets in the sequence.

A pairwise disjoint sequence of sets \(B_0, B_1, \ldots\) is a partition of a set \(A\) if \[ A = \bigcup_{n < \infty} B_n. \]

A.8 Basic laws of set theory

Commutativity

Set operations are commutative, meaning that for all sets \(A\) and \(B\), \[ A \cup B = B \cup A \] and \[ A \cap B = B \cap A. \]

Associativity

Set operations are associative, meaning that for all sets \(A\), \(B\), and \(C\), \[ (A \cup B) \cup C = A \cup (B \cup C) \] and \[ (A \cap B) \cap C = A \cap (B \cap C). \]

Distributivity

The intersection operation is distributive with respect to the union operation, so that for all sets \(A\), \(B\), and \(C\), \[ A \cap (B \cup C) = (A \cap B) \cup (A \cap C). \] Because of commutativity, this also holds on the right for \((B \cup C) \cap A\).

Involutivity

The complementation operation is an involution (i.e., equal to its own inverse), so that for all sets \(A\), \[ \left(A^\complement\right)^{\complement} = A. \]

Identities

The empty set is a left and right identity for union, so that for all sets \(A\), \[ \emptyset \cup A \ = \ A \cup \emptyset \ = \ A. \] If there is a universal set \(\mathcal{U}\), it is an identity for intersection, so that for all sets \(A\), \[ \mathcal{U} \cap A \ = \ A \cap \mathcal{U} \ = \ A. \]

Zeroes

The empty set is a zero for intersection, so that for all sets \(A\), \[ \emptyset \cap A \ = \ A \cap \emptyset \ = \ \emptyset. \] The universal set, if it exists, is a zero for union, so that for all sets \(A\), \[ \mathcal{U} \cup A \ = \ A \cup \mathcal{U} \ = \ \mathcal{U}. \]

Idempotence

The union and intersection operations are idempotent, so that for all sets \(A\), \[ A \cup A = A \] and \[ A \cap A = A. \]

De Morgan’s law

De Morgan’s law relates complementation, intersection, and union, where for all sets \(A\) and \(B\), \[ \left(A \cup B\right)^\complement = A^\complement \cap B^\complement. \] It follows that \[ \left(A \cap B\right)^\complement = A^\complement \cup B^\complement. \] De Morgan’s law can be extended to the countably infinite case, \[ \bigcap_{n \in \mathbb{N}} A_n = \left( \bigcup_{n \in \mathbb{N}} A_n^\complement \right)^{\complement}. \]
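De Morgan’s law is easy to sanity-check on finite sets with Python’s built-in set operations; the universal set and the sets A and B below are arbitrary examples.

# Check De Morgan's law on a small, arbitrary universal set:
# the complement of a union equals the intersection of complements.
U = set(range(10))
A = {1, 2, 3}
B = {3, 4, 5}
print(U - (A | B) == (U - A) & (U - B))  # True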

A.9 Tuples, cross-products, and relations

Tuples

Given elements from two sets, \(a \in A\) and \(b \in B\), we write \((a, b)\) for the tuple consisting of the pair \(a\) and \(b\) in that order (we stress the ordering to distinguish tuples from sets). The identity condition for tuples is that \((a, b) = (a', b')\) if and only if \(a = a'\) and \(b = b'.\) That means that if \(a \neq b,\) then \((a, b) \neq (b, a).\)

Given a tuple \((a, b),\) there are two projection functions, \[ \pi_1((a, b)) = a \qquad \pi_2((a, b)) = b, \] that pick out the elements. Thus for any tuple \(x,\) we have \(x = (\pi_1(x), \pi_2(x)).\)

Tuples with two elements are called pairs. The tuple construction can be extended to any number of elements inductively. For example, we can construct 3-tuples \((a, (b, c))\) by combining an element \(a\) and a pair \((b, c)\). We conventionally write a 3-tuple as \((a, b, c).\) Longer \(n\)-tuples can be constructed in the same way.

Cross products and exponents

The cross product of two sets \(A\) and \(B\) is written \(A \times B\) and defined to be the set of tuples formed from \(A\) and \(B,\) \[ A \times B = \{ (a, b) : a \in A, b \in B \}. \] This notion can be extended to any number of sets by taking the operator to be right associative and setting \[ A_1 \times A_2 \times \cdots \times A_N = A_1 \times (A_2 \times \cdots \times A_N). \]

For the cross product of a set \(A\) with itself \(n \in \mathbb{N}\) times, we define the set exponent as \[ A^n = \underbrace{A \times \cdots \times A}_{n \textrm{ times}}. \]

Relations

A relation between the sets \(A\) and \(B\) is a set \(R \subseteq A \times B.\) For example, the comparison operator \(<\) defines a relation over the natural numbers by \[ L = \{ (x, y) \in \mathbb{N} \times \mathbb{N} : x < y \} = \{ (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3), \ldots \}. \] It is a curious fact of infinite sets that \(\mathbb{N} \times \mathbb{N}\) is the same size as \(\mathbb{N}\); the above ordering is a hint as to how to lay out the pairs in a linear sequence.

A relation \(R\) is symmetric if \((a, b) \in R\) implies \((b, a) \in R\). The less-than relation \(L\) is not symmetric, but the equality relation \(E = \{ (x, x) : x \in \mathbb{R} \}\) is.

A relation \(R\) is transitive if \((a, b) \in R\) and \((b, c) \in R\) implies \((a, c) \in R.\) Both equality and less-than are transitive relations, but the within-unit-distance relation \(W = \{ (x, y) \in \mathbb{R} \times \mathbb{R} : | x - y | \leq 1 \}\) is not.

If \(R\) and \(T\) are relations, then their composition \(R \circ T\) is defined by \[ R \circ T = \{ (a, c) : (a, b) \in R, (b, c) \in T \}. \] Thus a relation \(R\) is transitive if \(R \circ R \subseteq R.\)

The inverse of a relation \(R \subseteq A \times B\) is the relation \(R^{-1} \subseteq B \times A\) defined by \((b, a) \in R^{-1}\) if and only if \((a, b) \in R.\)

There is nomenclature around many particular kinds of relations over a set \(A,\) i.e., relations \(R \subseteq A \times A.\)

  • \(R\) is transitive if \((x, y), (y, z) \in R\) implies \((x, z) \in R,\) or equivalently, \(R \circ R \subseteq R.\)

  • \(R\) is symmetric if \((x, y) \in R\) implies \((y, x) \in R,\) or equivalently, \(R = R^{-1}.\)

  • \(R\) is antisymmetric if \((x, y), (y, x) \in R\) implies \(x = y.\)

  • \(R\) is total if for all \(x, y \in A,\) \((x, y) \in R\) or \((y, x) \in R.\)

  • \(R\) is reflexive if \((x, x) \in R\) for all \(x \in A.\)

  • \(R\) is a preorder if it is reflexive and transitive.

  • \(R\) is an equivalence relation if it is reflexive, symmetric, and transitive.

  • \(R\) is a partial order if it is reflexive, antisymmetric, and transitive.

  • \(R\) is a total order if it is a partial order and total.

  • \(R\) is the identity relation if \((x, y) \in R\) if and only if \(x = y.\)

Given a partition of a set, the corresponding equivalence relation equates pairs of items if and only if they are in the same set.

A relation \(R \subseteq A \times A\) may be closed under properties such as those above by taking the smallest relation that contains \(R\) and satisfies the property.

Example. The transitive closure of a relation \(R \subseteq A \times A\) can be defined inductively. Let \(R^1 = R\) and let \[ R^{n + 1} = R \circ R^n \] be the one-step extension of \(R^n\) (i.e., \((x, z) \in R^{n + 1}\) if \((x, y) \in R\) and \((y, z) \in R^n\)); then the transitive closure of \(R\) is \[ \textrm{tr}(R) = \bigcup_{n \geq 1} R^n. \]
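The iterative construction translates directly into a short Python sketch (the relation R below is an arbitrary example): compose R with the closure accumulated so far until nothing new is added.

# Transitive closure by repeated composition, following the
# definition above; R is an arbitrary example relation.
def compose(R, S):
    return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

def transitive_closure(R):
    closure = set(R)
    while True:
        bigger = closure | compose(R, closure)
        if bigger == closure:
            return closure
        closure = bigger

R = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(R)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]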

A.10 Functions

A relation \(f \subseteq A \times B\) is a function if for every \(a \in A\) there is exactly one \(b \in B\) such that \((a, b) \in f.\) If \((a, b) \in f\) we write \(f(a) = b\) and say \(a\) maps to \(b.\) If \(f \subseteq A \times B\) is a function, we write \(f:A \rightarrow B\).

Given two sets \(A\) and \(B,\) the set of functions from \(A\) to \(B\) is written as \(B^A,\) and defined by \[ B^A = \{ f : f:A \rightarrow B \textrm{ is a function} \}. \]

Domains and ranges

If \(f:A \rightarrow B\) is a function, its domain is \(A\) and its codomain is \(B.\) The range of a function is the set of values that result from applying the function to elements of \(A.\) If we define \(f:\mathbb{R} \rightarrow \mathbb{R}\) by \(f(x) = x^2,\) then the codomain is \(\mathbb{R},\) but the range is \([0, \infty).\) The range of a function is always a subset of its codomain.

Given a function, its domain is \[ \textrm{dom}(f) = \{ x : (x, y) \in f \} = A, \] and its range is \[ \textrm{rng}(f) = \{ y : (x, y) \in f \}, \] but we cannot infer its codomain from its members. The codomain only matters in a higher-level context in which the function is considered a member of a larger family of functions.

Composition

If \(f:A \rightarrow B\) and \(g:B \rightarrow C\) are functions, then their composition \(g \circ f:A \rightarrow C\) is also a function, and satisfies \((g \circ f)(a) = g(f(a)).\)

Injective, surjective, and bijective functions

A function \(f : A \rightarrow B\) is

  • injective (aka one-to-one) if \(a \neq b\) implies \(f(a) \neq f(b),\)
  • surjective (aka onto) if \(B = \textrm{rng}(f),\) and
  • bijective if it is both injective and surjective.

Functions satisfying these properties are called injections, surjections, and bijections. A function that is not injective is said to be many to one.

Inverse functions

If \(f:A \rightarrow B\) is a bijection, it has an inverse \(f^{-1}:B \rightarrow A\) defined so that \(f^{-1}(b) = a\) if \(f(a) = b.\) This is just its inverse as a relation. The composition of a function with its inverse or vice-versa is the identity function, \(f \circ f^{-1} = \textrm{id},\) \(f^{-1} \circ f = \textrm{id}.\) Thus for a bijection \(f:A \rightarrow B,\) we have \(f^{-1}(f(a)) = a\) for all \(a \in A\) and \(f(f^{-1}(b)) = b\) for all \(b \in B.\)
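For example, the logit function and its inverse (which appear again in Appendix B below) form a bijection between \((0, 1)\) and \(\mathbb{R}\); here is a quick numerical sketch checking that composing them recovers the input.

# Numerical check that logit and its inverse compose to the identity.
import numpy as np

def logit(u):
    return np.log(u / (1 - u))

def inv_logit(v):
    return 1 / (1 + np.exp(-v))

u = np.array([0.1, 0.5, 0.9])
print(np.allclose(inv_logit(logit(u)), u))  # True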

Special functions

For a given set \(A\), the identity function \(\textrm{id}_A:A \rightarrow A\) maps each element to itself, i.e., \(\textrm{id}_A(a) = a.\) We usually drop the subscript and just write \(\textrm{id}.\)

For any set \(A\), the indicator function \(\textrm{I}_A\) is defined so that \[ \textrm{I}_A(x) = \begin{cases} 1 & \textrm{if } x \in A, \textrm{ and} \\[2pt] 0 & \textrm{otherwise}. \end{cases} \] Another common notation for power sets is \(\mathbf{2}^A,\) which follows from the set-theoretic notation \(B^A\) for the set of all functions from \(A\) to \(B\) and the notation \(\mathbf{2} = \{ 0, 1 \}\) for the two-element set denoting truth values, so that \(\mathbf{2}^A\) is the set of all indicator functions, which stands in a bijective relation to the subsets of \(A\).

Metrics

A distance metric (or just metric) over a set \(A\) is a function \(d:(A \times A) \rightarrow \mathbb{R}\) satisfying

  • non-negativity: \(d(x, y) \geq 0\) for all \(x, y \in A,\)

  • identity of indiscernibles: \(d(x, y) = 0\) implies \(x = y,\)

  • symmetry: \(d(x, y) = d(y, x),\) and the

  • triangle inequality: \(d(x, y) + d(y, z) \geq d(x, z)\) for all \(x, y, z \in A.\)

A.11 Countable and uncountable sets

We can define finiteness more directly by taking \(A\) to be finite if there exists a number \(n \in \mathbb{N}\) and a bijection from \(\{ 0, 1, \ldots, n - 1 \}\) to \(A.\) The number \(n\) is the size of the set. A set is infinite if it is not finite.

A set \(A\) is countable if there is a surjection from the counting numbers \(\mathbb{N}\) to \(A\) and is uncountable otherwise. If there is a surjection from \(\mathbb{N}\) to \(A\) then what we have done is assigned a counting number to every element of \(A.\)

A countable set may be either finite or infinite. It can be shown that the integers \(\mathbb{Z}\) are countable and the set of real numbers \(\mathbb{R}\) is uncountable.

B. Probability theory

The goal of this overview is to provide precise definitions of those aspects of probability theory required to understand the operations used in applied Bayesian statistics. This introduction is measure-theoretic only insofar as is required to write down precise definitions of notions like random variables—it turns out to be quite tricky to define a consistent notion of event probability over continuous sample spaces due to the often counter-intuitive results of real analysis.

The presentation here was heavily influenced by an appendix to Anderson and Moore (1979)’s book Optimal Filtering, (Appendix A, “Brief Review of Results of Probability Theory”) and by Rosenthal (2006)’s book A First Look at Rigorous Probability Theory.

B.1 Probability spaces

Sample spaces and \(\sigma\)-algebras

A sample space is a non-empty set \(\Omega\), the members of which are called outcomes because they represent possible ways the world might be (i.e., what are sometimes known as “possible worlds”).

A \(\sigma\)-algebra \(\mathcal{S}\) is a set of subsets of \(\Omega\), \(\mathcal{S} \subseteq \textrm{pow}(\Omega)\), that

  • contains the empty set: \(\emptyset \in \mathcal{S}\),
  • is closed under complements: if \(A \in \mathcal{S},\) then \(A^{\complement} \in \mathcal{S}\), and
  • is closed under countable unions: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N},\) then \(\bigcup_{n=0}^\infty A_n \in \mathcal{S}\).

We are defining a more general class of \(\sigma\)-algebras to pull out a class of sets of outcomes that can have probabilities assigned to them. It turns out that due to the vagaries of real analysis, it is not consistent to assign probabilities to arbitrary subsets of real numbers. So even though power sets form \(\sigma\)-algebras, they do not form the kind of \(\sigma\)-algebra we need to do continuous statistics. We provide an example of a Borel algebra for that below.

Because \(\emptyset \in \mathcal{S}\), its complement \(\Omega = \emptyset^{\complement} \in \mathcal{S}\). Closure under countable intersections follows from De Morgan’s law.

We will use the sets \(A\) to represent sets of possible outcomes in \(\Omega,\) so a set of possible outcomes \(A \cup B\) is such that \(x \in A \cup B\) if and only if \(x \in A\) or \(x \in B\). Similarly, a complement \(A^\complement\) is the set of all outcomes that are not in \(A\). The set \(\emptyset\) represents an impossible set of outcomes because there is no outcome \(\omega\) such that \(\omega \in \emptyset.\) The set \(\Omega\) is the universal set in the sense that every outcome \(\omega\) is such that \(\omega \in \Omega\).

For applied statistics, we can restrict our attention to sample spaces \(\Omega \subseteq \mathbb{R}^N\).

Example 1(a): The power set of any non-empty set contains the empty set and is closed under complementation and countable unions, and is hence a \(\sigma\)-algebra. The trivial case has a singleton sample space \(\Omega = \{ 0 \}\), with \(\sigma\)-algebra \(\mathcal{S} = \left\{ \{\}, \{ 0 \} \right\}\). The simplest non-trivial example has a binary sample space \(\Omega = \{ 0, 1 \}\), with \(\sigma\)-algebra \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\).

Example 1(b): As an example of a countably infinite sample space, we can take the natural numbers, \(\Omega = \mathbb{N},\) with \(\sigma\)-algebra \(\mathcal{S} = \textrm{pow}(\mathbb{N}),\) which is not only a \(\sigma\)-algebra, but a complete, atomic, boolean algebra.

Example 1(c): Not all discrete \(\sigma\)-algebras are power set algebras. For example, with \(\Omega = \{ 0, 1, 2, 3 \},\) there is a \(\sigma\)-algebra \[ \mathcal{S} = \{ \emptyset, \{0, 2 \}, \{ 1, 3 \}, \{ 0, 1, 2, 3 \} \}. \] The \(\sigma\)-algebra \(\mathcal{S}\) is closed under complements and finite unions, but it does not include the atomic events \(\{ 0 \}, \{ 1 \}, \{ 2 \},\) and \(\{ 3 \}.\)

Example 2(a): A simple continuous example takes a sample space \(\Omega = (0, 1)\) and defines \(\mathcal{S}\) to be the smallest \(\sigma\)-algebra containing the open intervals \((a, b) \subseteq \Omega\) (i.e., \(0 < a < b < 1\)) and closed under complements and countable unions. The smallest \(\sigma\)-algebra containing the open sets of a sample space is called a Borel algebra and can be shown to exist. This \(\sigma\)-algebra contains all of the singletons, because \(\{ a \} = ((0, a) \cup (a, 1))^{\complement}.\)

Example 2(b): Another example of a Borel algebra is the closure of the open intervals over the entire real line, \(\Omega = \mathbb{R}\).

Example 3: If \(\mathcal{S}_1\) and \(\mathcal{S}_2\) are \(\sigma\)-algebras over \(\Omega_1\) and \(\Omega_2\), then the product \(\sigma\)-algebra over \(\Omega = \Omega_1 \times \Omega_2\) is the smallest \(\sigma\)-algebra containing the product sets \(\left\{ A_1 \times A_2 : A_1 \in \mathcal{S}_1 \ \textrm{and} \ A_2 \in \mathcal{S}_2 \right\}\) (the product sets alone are not closed under complements and unions).

Measures and probability spaces

Probability measures are measures where the sample space is normalized to have unit measure. A probability measure over a \(\sigma\)-algebra \(\mathcal{S}\) is a function \(\Pr: \mathcal{S} \rightarrow [0, 1]\) such that for all \(A, A_n \in \mathcal{S},\)

  • sample space unit probability: \(\Pr[\Omega] = 1\),
  • non-negativity: if \(A \in \mathcal{S}\), then \(\Pr[A] \geq 0\), and
  • countable additivity: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N}\) is a sequence of pairwise disjoint sets (i.e., \(A_i \cap A_j = \emptyset\) if \(i \neq j\)), then \[ \Pr\!\left[\bigcup_{n = 0}^\infty A_n\right] = \sum_{n=0}^\infty \Pr\!\left[A_n\right]. \]

The sample space \(\Omega\) with a probability measure \(\Pr\) over the \(\sigma\)-algebra \(\mathcal{S}\) together make up the probability space \((\Omega, \mathcal{S}, \Pr)\).

In analysis, if \(\mathcal{S}\) is a \(\sigma\)-algebra, an element \(A \in \mathcal{S}\) is said to be a Borel set. If the \(\sigma\)-algebra is endowed with a measure, as in a probability space, an element of the \(\sigma\)-algebra is said to be a measurable set. Measurable sets \(A \in \mathcal{S}\) are also called events because they have a well-defined probability \(\Pr[A]\) of occurring.

Because for any event \(A,\) \(A^\complement \cup A = \Omega\), it follows from countable additivity that \[ \Pr[A^{\complement}] = 1 - \Pr[A], \] \[ \Pr[\emptyset] = \Pr[\Omega^\complement] = 0, \textrm{ and} \] \[ \Pr[A] \leq 1. \]

A property holds almost everywhere with respect to a probability measure if the set of outcomes for which it does not hold has zero probability. More formally, suppose that \(\phi : \Omega \rightarrow \{ 0, 1 \}\) is an indicator function assigning truth value to elements of \(\Omega\) and defining a set \(A_{\phi} = \{ \omega \in \Omega : \phi(\omega) = 1 \}.\) The property \(\phi\) holds \(\Pr\)-almost everywhere if \(\Pr[A_\phi] = 1,\) or equivalently, \(\Pr[A_\phi^\complement] = 0.\)

Example 1(a), continued: We can define a measure over the \(\sigma\)-algebra with sample space \(\Omega = \{ 0, 1 \}\) and measurable sets \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\). This involves assigning probabilities to all of the events. By construction, we know that \[ \Pr[\{ \}] = 0 \qquad \Pr[\{ 0, 1 \}] = 1. \] By the countable additive property of measures, and the fact that events \(\{ 0 \}\) and \(\{ 1 \}\) are disjoint, it follows that \[ \Pr[\{ 0 \}] + \Pr[\{ 1 \}] = \Pr[\{ 0 \} \cup \{ 1 \}] = 1, \] and therefore \[ \Pr[\{ 0 \}] = 1 - \Pr[\{ 1 \}]. \] For example, let \(\Pr[\{ 1 \}] = 0.7\) so that \(\Pr[\{ 0 \}] = 0.3.\)

If we take the sample space to represent the boolean outcome of a football match (1 for the home team winning, 0 for the visiting team winning), then \(\Pr[\{ 1 \}]\) is the probability that the home team wins and \(\Pr[\{ 0 \}]\) is the probability that the visiting team wins.

Example 1(b), cont. We can define multiple measures over the power set of the natural numbers. These are defined by supplying probabilities for the atoms, i.e., the singletons \(\{ n \}\). In this example, we will take \(\Pr[\{ n \}] = 2^{-(n + 1)}.\) This assigns a probability to the event \(\{ n \}\) equal to that of flipping \(n\) successive heads followed by a tail with a fair coin. Countable additivity gives us \(\Pr[\mathbb{N}] = 1,\) because \[ \sum_{n \in \mathbb{N}} 2^{-(n + 1)} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 1. \] It also follows from countable additivity that for any \(A \subseteq \mathbb{N},\) \[ \Pr[A] = \sum_{a \in A} \Pr[\{ a \}]. \]
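A quick numerical check (a sketch that truncates the sum at a large n) confirms that these atomic probabilities sum to 1.

# The atomic probabilities 2^{-(n + 1)} sum to 1; truncating at
# n = 59 leaves an error of only 2^{-60}.
import numpy as np

n = np.arange(60)
print(np.sum(2.0 ** -(n + 1)))  # approx 1.0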

Example 2(a), cont. We can define the Lebesgue measure over intervals \((a, b) \subseteq (0, 1)\) by \[ \Pr[(a, b)] = b - a. \] This yields the expected value for the whole space, \[ \Pr[\Omega] = \Pr[(0, 1)] = 1 - 0 = 1. \] The probability for unions of non-intersecting intervals is defined by countable additivity. Singleton sets have measure zero, because \[ \Pr[\{ a \}] = 1 - \Pr[(0, a)] - \Pr[(a, 1)] = 1 - a - (1 - a) = 0. \]
We could have used \(\Omega = [0, 1]\) as an example, but it is geometrically different from \((0, 1)\) because it contains its limiting end points. The open interval will be more convenient going forward when we consider functions from \((0, 1)\) to the real numbers.

Example 2(b), cont. The Lebesgue measure over \(\Omega = \mathbb{R}\) does not form a probability space, because the measure of the whole space \(\mathbb{R}\) is not finite. On the other hand, we can define a measure over \(\Omega = \mathbb{R}\) by defining the probability of an interval to be \[ \Pr[(a, b)] = \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a), \] where for \(u \in (0, 1)\) and \(v \in \mathbb{R},\) \[ \textrm{logit}(u) = \log \frac{u}{1 - u} \qquad \qquad \textrm{logit}^{-1}(v) = \frac{\exp(v)}{1 + \exp(v)}. \] This has the desired property that \[ \Pr[\Omega] = \lim_{\begin{array}{c}a \rightarrow -\infty \\ b \rightarrow \infty\end{array}} \Pr[(a, b)] = \lim_{\begin{array}{c}a \rightarrow -\infty \\ b \rightarrow \infty\end{array}} \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a) = 1 - 0 = 1. \]

Example 3, cont. Suppose we have a product \(\sigma\)-algebra over the open unit square \(\Omega = (0, 1)^2\) (i.e., \((0, 1) \times (0, 1)\)). We can define a basis for a probability measure using open discs centered at \(x \in (0, 1)^2\), \[ \textrm{D}(x, r) = \left\{ x' : ||x' - x|| < r \right\}. \] The probability of a disc contained in the unit square (i.e., \(\textrm{D}(x, r) \subset \Omega\)) is given by its area, \[ \Pr[\textrm{D}(x, r)] = \pi \cdot r^2. \] Probabilities for other measurable sets are defined by countable additivity and complementation. In this probability space, a measurable subset of \((0, 1)^2\) has a probability equal to its area (i.e., it’s a Lebesgue measure like Example 2(a)). Thus the entire sample space \(\Omega\) has probability 1 because it is a unit square with unit area. This construction extends to the volumes of balls in three dimensions and the hypervolumes of hyperballs in higher dimensions.
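We can check the disc probability by Monte Carlo in Python: draw points uniformly in the unit square and count the fraction that land in the disc. The center and radius below are arbitrary choices that keep the disc inside the square.

# Monte Carlo estimate of Pr[D(x, r)] for an arbitrary disc in the
# unit square, compared to its analytic area pi * r^2.
import numpy as np

center = np.array([0.5, 0.5])
r = 0.25
pts = np.random.uniform(0, 1, size=(1_000_000, 2))
inside = np.sum((pts - center) ** 2, axis=1) < r ** 2
print(np.mean(inside), np.pi * r ** 2)  # both approx 0.196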

Total variation distance

Suppose we have two different measures, \(\Pr\) and \(\textrm{Q},\) over the same measurable space \((\Omega, \mathcal{S}).\) The total variation distance between \(\Pr\) and \(\textrm{Q}\) is written as \(\left|\left| \strut \Pr - \textrm{Q} \right|\right|\) and defined by \[ \left|\left| \strut \Pr - \textrm{Q} \right|\right| = \max_{A \in \mathcal{S}} \left| \strut \Pr[A] - \textrm{Q}[A] \right|. \] Total variation distance defines a metric over the space of probability measures on a fixed \(\sigma\)-algebra.
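For discrete measures on a finite sample space, the maximizing event consists of the outcomes to which one measure assigns more probability than the other, so the distance reduces to a simple sum, as in this sketch with made-up probabilities.

# Total variation distance between two discrete measures; the
# maximum is attained at the event where P exceeds Q.
import numpy as np

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.3, 0.4])
print(np.sum(np.maximum(P - Q, 0)))  # 0.2, same as 0.5 * sum(|P - Q|)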

B.2 Joint and conditional probability

In this section, we will presuppose a fixed probability space \((\Omega, \mathcal{S}, \Pr)\).

Joint probability

If \(A\) and \(B\) are events (i.e., \(A, B \in \mathcal{S}\)), their joint probability is written \(\Pr[A, B]\) and understood as the probability that events \(A\) and \(B\) both occur, so that \[ \Pr[A, B] = \Pr[A \cap B]. \] This notation extends to any number of events, with \(\Pr[A, B, C] = \Pr[A \cap B \cap C]\) and so on.

The sets \(A_1, A_2, \ldots\) are mutually disjoint if \(i \neq j\) implies \(A_i \cap A_j = \emptyset.\) A partition of the set \(B\) is a countable sequence of mutually disjoint sets \(B_0, B_1, B_2, \ldots\) such that \(B = \bigcup_{n \in \mathbb{N}} \, B_n.\)

The law of total probability states that if \(B_0, B_1, \ldots\) is a partition of \(B\), then \[ \Pr[A, B] = \sum_{n \in \mathbb{N}} \, \Pr[A, B_n]. \] This result follows directly from countable additivity.

In the context of joint probability \(\Pr[A, B]\), we refer to \(\Pr[A]\) as the marginal probability (of \(A\)). It follows from the law of total probability that if \(B_1, B_2, \ldots\) is a partition of \(\Omega,\) then the marginal probability of \(A\) is \[ \Pr[A] = \Pr[A, \Omega] = \sum_n \Pr[A, B_n]. \]

Conditional probability

If \(A\) and \(B\) are events and \(\Pr[B] \neq 0\), the conditional probability of \(A\) given \(B\) is the probability of \(A\) occurring if \(B\) occurs, and is defined by \[ \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]}. \]

The chain rule for probabilities follows immediately, \[ \Pr[A, B] = \Pr[A \mid B] \cdot \Pr[B]. \]

Given a fixed event \(B \in \mathcal{S},\) the function \(\Pr[A \mid B]\) forms a measure over \(B.\) More formally, we can derive a new probability space with \(\Omega' = B\), \(\mathcal{S}' = \{ A \in \mathcal{S} : A \subseteq B \},\) and \(\Pr'[A] = \Pr[A \mid B].\)

Bayes’s rule

Bayes’s rule allows us to derive a conditional probability \(\Pr[A \mid B]\) given the inverse probability \(\Pr[B \mid A]\) and the marginal \(\Pr[A]\). In its simplest form, it’s just an application of the chain rule for probabilities and the definition of conditional probability, where if \(\Pr[B] \neq 0,\) \[ \Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]}. \] Bayes’s rule extends to the case where \(A_0, A_1, A_2, \cdots\) is a partition of \(\Omega\) (either finite or countably infinite), in which case \[ \Pr[A_n \mid B] = \frac{\Pr[B \mid A_n] \cdot \Pr[A_n]} {\sum_{n'} \Pr[B \mid A_{n'}] \cdot \Pr[A_{n'}]}. \]
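Here is a small numerical example of Bayes’s rule (all probabilities are made-up values for illustration) using the two-event partition \(A, A^\complement\) of the sample space.

# Bayes's rule with a two-element partition; the probabilities
# here are made-up values for illustration.
p_A = 0.01                    # Pr[A]
p_B_given_A = 0.95            # Pr[B | A]
p_B_given_not_A = 0.05        # Pr[B | complement of A]
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)  # total probability
print(p_B_given_A * p_A / p_B)   # Pr[A | B], approx 0.161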

B.3 Independent events

There are several notions of independence in statistics, which we summarize below.

Independence

A pair of events \(A\) and \(B\) are independent if \[ \Pr[A, B] = \Pr[A] \cdot \Pr[B]. \] When \(A\) and \(B\) are independent, the occurrence of \(A\) doesn’t give us any information about \(B\), and vice-versa, so that \[ \Pr[A \mid B] = \Pr[A] \qquad \qquad \Pr[B \mid A] = \Pr[B]. \] If \(A\) and \(B\) are independent, then \(A\) and \(B^\complement\) are also independent, and thus \(\Pr[A \mid B] = \Pr[A \mid B^\complement].\)

Example 3, cont. Consider our Lebesgue measure on \((0, 1)^2,\) where the probability of an event is equal to its area. We can define an event \(A\) by dividing the sample space into an area for \(A\) of 0.2 and an area for \(A^\complement\) of 0.8. In the following diagram, we represent this as a blue horizontal dividing line, with the lower half (with area 0.2) representing the event \(A.\) We do the same for \(B\) with a red vertical dividing line, taking the left half (with area 0.6) to correspond to the event \(B.\)

Code
print(pn.ggplot(pd.DataFrame())
  + pn.scale_y_continuous(breaks=[0.1, 0.6], labels=["A", "-A"], limits = (0, 1), expand=(0, 0))
  + pn.scale_x_continuous(breaks=[0.3, 0.8], labels=["B", "-B"], limits = (0, 1), expand=(0, 0))
  + pn.geom_vline(pn.aes(xintercept=0.6), color='red')
  + pn.geom_hline(pn.aes(yintercept=0.2), color='blue')
  + pn.coord_equal()
  + pn.theme(panel_grid_major = pn.element_blank(),
             panel_grid_minor = pn.element_blank(),
             axis_title = pn.element_blank(),
         axis_ticks = pn.element_blank())
)

We have \(\Pr[A] = 0.2\) and \(\Pr[B] = 0.6.\) We can work out that \[ \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]} = \frac{0.12}{0.6} = 0.2 = \Pr[A], \] and \[ \Pr[B \mid A] = \frac{\Pr[B, A]}{\Pr[A]} = \frac{0.12}{0.2} = 0.6 = \Pr[B]. \] Thus the events \(A\) and \(B\) are independent. We can also reason geometrically, noting that the ratio of \(A\)’s area to the total space’s area is the same as the ratio of \(A \cap B\)’s area to the area of \(B.\) Now consider the following manner of partitioning the sample space.

Code
print(pn.ggplot(pd.DataFrame())
  + pn.scale_x_continuous(breaks=[0.3, 0.8], labels=["C", "-C"], limits = (0, 1), expand=(0, 0))
  + pn.scale_y_continuous(breaks=[0.1, 0.6], labels=["A", "-A"], limits = (0, 1), expand=(0, 0))
  + pn.geom_segment(pn.aes(x=0.2, y=1, xend=0.7, yend=0), color='red')
  + pn.coord_equal()
  + pn.geom_hline(pn.aes(yintercept=0.2), color='blue')
  + pn.theme(panel_grid_major = pn.element_blank(),
             panel_grid_minor = pn.element_blank(),
         axis_title = pn.element_blank(),
         axis_ticks = pn.element_blank())
)

Geometric reasoning here shows that \(\Pr[A \mid C^\complement] < \Pr[A \mid C]\) and thus \(A\) and \(C\) are not independent.

Conditional independence

A pair of events \(A\) and \(B\) are conditionally independent given an event \(C\) with \(\Pr[C] > 0\) if \[ \Pr[A, B \mid C] = \Pr[A \mid C] \cdot \Pr[B \mid C]. \] As with independence, if \(A\) and \(B\) are conditionally independent given \(C\), then \(A\) doesn’t provide any information about the occurrence of \(B\) if \(C\) occurs, \[ \Pr[A \mid B, C] = \Pr[A \mid C] \qquad \qquad \Pr[B \mid A, C] = \Pr[B \mid C]. \]

Mutual independence

A sequence of events \(A_0, A_1, \ldots, A_N\) is mutually independent if

  • \(\Pr\!\left[\bigcap_{n = 0}^N A_n\right] = \prod_{n=0}^N \Pr[A_n],\) and

  • every proper subsequence \(A_{n_1}, A_{n_2}, \ldots, A_{n_K}\) is mutually independent, where a proper subsequence is any sequence formed by removing at least one of the events (i.e., it is not the whole sequence).

Conditional mutual independence is defined the same way.

B.4 Random variables

In this section we introduce univariate and multivariate random variables and show how to define new random variables as functions of random variables.

Random variables

Given a probability space \((\Omega, \mathcal{S}, \Pr)\), a random variable over this space is a function \(X:\Omega \rightarrow \mathbb{R}\) such that for every \(x \in \mathbb{R}\), the set \(\{ \omega \in \Omega : X(\omega) \leq x \} \in \mathcal{S}\) (i.e., it is measurable). The little \(x\) here is just a bound variable and this condition is just saying that for any number \(x\), the set of outcomes where \(X \leq x\) is an event in our \(\sigma\)-algebra. We will need this condition in order to assign probabilities to possible values of the random variable.

A random variable is a deterministic function, and as such is neither a variable nor random by itself. The name derives from the fact that we often use simple variables like \(X\) to refer to random variables in notation and from the fact that a random variable derives its random structure from the underlying probability measure \(\Pr,\) as we show in the rest of this section.

Given a random variable \(X\), we define its range as usual for a function, \[ \textrm{range}(X) = \{ X(\omega) : \omega \in \Omega \}. \] Because \(X(\omega) \in \mathbb{R}\), the range is always a subset of real numbers. It is conventional to overload the membership operator for random variables and write \(X \in A\) if \(X(\omega) \in A\) for every \(\omega \in \Omega.\)

Example 1, cont. Our first example has a finite sample space, \(\Omega = \{ 0, 1 \}.\) Over this sample space, we can define a random variable \(X:\Omega \rightarrow \mathbb{R}\) by \(X(0) = -2\) and \(X(1) = 10\). This has \(\textrm{range}(X) = \{ -2, 10 \}\) and represents the payoff of a 2 unit bet placed on the outcome \(\omega = 1\) at 5:1 odds.

Example 2(a), cont. In this ongoing example, we have \(\Omega = (0, 1)\). Because \(\Omega \subseteq \mathbb{R}\), we can define a random variable \(X\) by the identity function, \[ X(\omega) = \omega, \] with a range equal to the sample space, \(\textrm{range}(X) = (0, 1) = \Omega.\)

The range of a random variable is not constrained by \(\Omega\). Consider the random variable \(Y\) over \(\Omega = (0, 1)\) defined by \[ Y(\omega) = \textrm{logit}(\omega) = \log \frac{\omega}{1 - \omega} \textrm{ if } \omega \in (0, 1), \] which has \(\textrm{range}(Y) = \mathbb{R}.\) A third example is the random variable \(Z\) defined by \(Z(\omega) = \log \omega\), which has \(\textrm{range}(Z) = (-\infty, 0)\). We can even have a random variable \(T\) over \(\Omega = (0, 1)\) with a discrete range by defining \[ T(\omega) = \begin{cases} 1 & \textrm{if } \omega < 0.3, \textrm{ and} \\ 0 & \textrm{if } \omega \geq 0.3, \end{cases} \] so that \(\textrm{range}(T) = \{ 0, 1 \}.\)

Example 2(b), cont. We previously defined a probability space over the sample space \(\Omega = \mathbb{R}\). We can define a random variable \(U\) over \(\Omega\) by the identity function, \[ U(\omega) = \omega, \] which gives us \(\textrm{range}(U) = \mathbb{R}.\) We can define a second random variable \(V\) by \[ V(\omega) = \textrm{logit}^{-1}(\omega) = \frac{1}{1 + \exp(-\omega)}, \] which has \(\textrm{range}(V) = (0, 1)\). A third example is \(W\) defined by \(W(\omega) = \omega^2\), which has \(\textrm{range}(W) = [0, \infty)\).

Independence for random variables

Two random variables \(X\) and \(Y\) are independent if the events picked out by \(X \leq x\) and \(Y \leq y\) (i.e., \(\{ \omega \in \Omega : X(\omega) \leq x \}\)) are independent for all \(x\) and \(y.\)

Functions of random variables

Random variables partly derive their name from the fact that we can use them in arithmetic operations and as arguments to functions to produce new random variables. Given a random variable \(X\) defined over a sample space \(\Omega\) and a function \(f:\textrm{range}(X) \rightarrow \mathbb{R}\), we can define a new random variable \(Y\) over the same sample space by setting \[ Y(\omega) = f(X(\omega)). \] This new random variable is conventionally written as \(Y = f(X),\) which is yet another overload of conventional set-theory notation.

Functions of more than one random variable work the same way. If \(X_1, \ldots, X_N\) are random variables and \(f:\textrm{range}(X_1) \times \cdots \times \textrm{range}(X_N) \rightarrow \mathbb{R}\), then we write \(Y = f(X_1, \ldots, X_N)\) for the random variable defined by \[ Y(\omega) = f(X_1(\omega), \ldots, X_N(\omega)). \]

Example 2(a), cont. Although we have given the random variables \(X,\) \(Y,\) \(Z,\) and \(T\) direct definitions, note that \(Y = \textrm{logit}(X),\) \(Z = \textrm{log}(X),\) and \(T = \textrm{I}(X < 0.3).\)

Example 2(b), cont. In this example, we have defined \(U,\) \(V,\) and \(W\) as primitive random variables, but we also have \(V = \textrm{logit}^{-1}(U)\) and \(W = U^2.\)

B.5 Cdfs, pdfs, and pmfs

In this section, we introduce cumulative distribution functions (cdfs), probability density functions (pdfs), and probability mass functions (pmfs).

Cumulative distribution function

Given a random variable \(X\) defined over a probability space \((\Omega, \mathcal{S}, \Pr),\) its cumulative distribution function (cdf) \(F_X:\mathbb{R} \rightarrow [0, 1]\) is defined for \(x \in \mathbb{R}\) by \[ F_X(x) = \Pr[X \leq x], \] where we read \(X \leq x\) as the event \(\{ \omega \in \Omega : X(\omega) \leq x \}\). The cdf of a random variable is well-defined because all of the relevant events \(X \leq x\) must be measurable in order for \(X\) to be a random variable.

The cdf \(F_X\) of any random variable \(X\) satisfies four important properties,

  • \(F_X\) is continuous from the right,
  • \(F_X\) is monotonic increasing (but not necessarily strictly so),
  • the limit as the argument goes to negative infinity is zero, \[ \lim_{x \rightarrow -\infty} F_X(x) = 0, \] and
  • the limit as the argument goes to infinity is one, \[ \lim_{x \rightarrow \infty} F_X(x) = 1. \]

Example 1, cont.: Continuing our minimal discrete example with \(\Omega = \{ 0, 1 \}\), \(\Pr[\{ 0 \}] = 0.3,\) and \(\Pr[\{ 1 \}] = 0.7,\) consider a random variable that takes \(X(0) = -2\) and \(X(1) = 10\). The resulting cdf is \[ F_X(x) = \begin{cases} 0 & \textrm{if } x \in (-\infty, -2), \\ 0.3 & \textrm{if } x \in [-2, 10), \textrm{ and} \\ 1.0 & \textrm{if } x \in [10, \infty). \end{cases} \]

Example 2(a), cont. Taking up our example of \(\Omega = (0, 1),\) with \(\Pr[(a, b)] = b - a\) for \((a, b) \subseteq (0, 1),\) we see that the random variable \(X(\omega) = \omega\) has the cdf \(F_X:\mathbb{R} \rightarrow [0, 1]\) defined by \[ F_X(x) = \begin{cases} 0 & \textrm{if } x \in (-\infty, 0), \\ x & \textrm{if } x \in [0, 1), \textrm{ and} \\ 1 & \textrm{if } x \in [1, \infty). \end{cases} \] The result for \(x \in [0, 1)\) follows because \(F_X(x) = \Pr[X \leq x] = \Pr[(0, x)] = x.\)

For the case of \(Y(\omega) = \textrm{logit}(\omega),\) we have the cdf \(F_Y:\mathbb{R} \rightarrow [0, 1]\) defined by \[\begin{align} F_Y(y) &= \Pr[Y \leq y] \\[4pt] &= \Pr[\textrm{logit}^{-1}(Y) \leq \textrm{logit}^{-1}(y)] \\[4pt] &= \Pr[X \leq \textrm{logit}^{-1}(y)] \\[4pt] &= \textrm{logit}^{-1}(y). \end{align}\]

Example 2(b), cont. In this example, \(\Omega = \mathbb{R}\) and \(\Pr[(a, b)] = \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a).\) Let the random variable \(U\) be the identity, i.e., \(U(\omega) = \omega.\) The cdf of \(U\) is the function \(F_U:\mathbb{R} \rightarrow [0, 1]\) defined by \[\begin{align} F_U(u) &= \Pr[U \leq u] \\[3pt] &= \lim_{x \rightarrow -\infty} \, \Pr[U \in (x, u)] \\[3pt] &= \lim_{x \rightarrow -\infty} \, \textrm{logit}^{-1}(u) - \textrm{logit}^{-1}(x) \\[3pt] &= \textrm{logit}^{-1}(u) - \lim_{x \rightarrow -\infty} \, \textrm{logit}^{-1}(x) \\[3pt] &= \textrm{logit}^{-1}(u). \end{align}\] This is the same cumulative distribution function as we had for \(Y\) in Example 2(a). Consider a second random variable \(V = \textrm{logit}^{-1}(U)\). The cdf of \(V\) is given by \[ F_V(v) = \begin{cases} 0 & \textrm{if } v \in (-\infty, 0) \\ v & \textrm{if } v \in [0, 1) \\ 1 & \textrm{if } v \in [1, \infty). \end{cases} \] The middle case follows because \[\begin{align} F_V(v) &= \Pr[V \leq v] \\[4pt] &= \Pr[\textrm{logit}(V) \leq \textrm{logit}(v)] \\[4pt] &= \Pr[U \leq \textrm{logit}(v)] \\[4pt] &= \textrm{logit}^{-1}(\textrm{logit}(v)) \\[3pt] &= v. \end{align}\] Thus we see that \(F_V\) here is the same cdf as \(F_X\) in Example 2(a).
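
As an illustrative check (a minimal NumPy sketch of our own; the seed and evaluation points are arbitrary), we can verify by simulation that \(V = \textrm{logit}^{-1}(U)\) has cdf \(F_V(v) = v\) when \(U\) has a standard logistic distribution.

import numpy as np

rng = np.random.default_rng(1234)
u = rng.logistic(0, 1, size=1_000_000)   # U has cdf logit^{-1}
v = 1 / (1 + np.exp(-u))                 # V = logit^{-1}(U)
for q in [0.25, 0.5, 0.9]:
    # the empirical Pr[V <= q] should be close to q
    print(f"q = {q}: Pr[V <= q] ~ {np.mean(v <= q):.3f}")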

Probability density function

If the cumulative distribution function \(F_X\) for a random variable \(X\) is continuous and differentiable, we define its probability density function (pdf) \(p_X:\mathbb{R} \rightarrow [0, \infty)\) by \[ p_X(x) = F_X'(x), \] for \(x \in \mathbb{R},\) where \(F_X'\) is the derivative of the cdf. Thus the unit of measure for the cdf is probability, whereas the unit of measure for the pdf is probability per unit change in the variable’s value.

Even though the cdf forms the theoretical foundation for distributions, in practice we usually work with pdfs and occasionally want to go back to the cdf, which we can do by integration, \[ F_X(x) = \int_{-\infty}^x \, p_X(v) \, \textrm{d}v. \]

We know that \(\Pr[\Omega] = 1\) and that for any random variable \(X,\) \(\Pr[\Omega] = \lim_{x \rightarrow \infty} \, F_X(x) = 1\). Furthermore, we know \(F_X(x) = \int_{-\infty}^x p_X(v) \, \textrm{d}v.\) Putting this together we have

\[\begin{align} \int_{-\infty}^\infty p_X(v) \, \textrm{d}v &= \lim_{x \rightarrow \infty} \int_{-\infty}^x p_X(v) \, \textrm{d}v \\[4pt] &= \lim_{x \rightarrow \infty} F_X(x) \\[4pt] &= 1. \end{align}\]

Example 2(a), cont. We have a random variable \(X\) with cdf \(F_X(x) = x\) for \(x \in [0, 1),\) so the pdf is \[ p_X(x) = F_X'(x) = \frac{\textrm{d}}{\textrm{d}x} x = 1. \] A constant density indicates the variable has a uniform distribution. The uniform distribution has an associated density function for \(\alpha < \beta \in \mathbb{R},\) \[ \textrm{uniform}(x \mid \alpha, \beta) = \frac{1}{\beta - \alpha} \] for \(x \in [\alpha, \beta]\) (and zero otherwise). If a random variable \(X\) has a density \(p_X(x) = \textrm{uniform}(x \mid \alpha, \beta),\) we write \[ X \sim \textrm{uniform}(\alpha, \beta). \] In this notation, \(\alpha\) and \(\beta\) might themselves be random variables, which requires the notion of conditional density, which we define below.

For our second random variable \(Y\), the cdf is \(F_Y(y) = \textrm{logit}^{-1}(y),\) and so the pdf is \[ p_Y(y) = F_Y'(y) = \frac{\textrm{d}}{\textrm{d}y} \textrm{logit}^{-1}(y) = \textrm{logit}^{-1}(y) \cdot (1 - \textrm{logit}^{-1}(y)). \] Variable \(Y\) has a standard logistic distribution, and we write \[ Y \sim \textrm{logistic}(0, 1). \] The general definition of the logistic distribution’s density involves parameters \(\mu \in \mathbb{R}\) and \(\sigma \in (0, \infty),\) \[ \textrm{logistic}(y \mid \mu, \sigma) = \frac{1}{\sigma} \cdot \textrm{logit}^{-1}\!\left(\frac{y - \mu}{\sigma}\right) \cdot \left(1 - \textrm{logit}^{-1}\!\left(\frac{y - \mu}{\sigma}\right)\right). \] The use of \(\mu\) and \(\sigma\) in this configuration defines what’s known as a location-scale parameterization. The standard parameterization takes the location to be zero and the scale to be one. If \(Y \sim \textrm{logistic}(0, 1)\) (i.e., it has a standard logistic distribution), then \(Z = \mu + \sigma \cdot Y\) is such that \(Z \sim \textrm{logistic}(\mu, \sigma).\) The parameter \(\sigma\) used to scale is called a scale parameter and the parameter \(\mu\) used to translate or shift is called a location parameter.

The following code and plot shows the cdf and pdf for the standard logistic distribution.

Code
def logit(x):
    return np.log(x / (1 - x))
def inv_logit(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-8, 8, 100)
pdf_y = [inv_logit(x_n) * (1 - inv_logit(x_n)) for x_n in x]
df_pdf = pd.DataFrame({
    'x': x, 'y': pdf_y, 'type': 'pdf p_X'
})
cdf_y = [inv_logit(x_n) for x_n in x]
df_cdf = pd.DataFrame({
    'x': x, 'y': cdf_y, 'type': 'cdf F_X'
})
df = pd.concat([df_pdf, df_cdf])
print(pn.ggplot(df, pn.aes(x='x', y='y')) + pn.geom_line()
    + pn.facet_wrap('~ type', scales = 'free_y')
    + pn.labs(x='x', y='value', title = 'X ~ logistic(0, 1)')
    + pn.scale_x_continuous(breaks = [-8, -6, -4, -2, 0, 2, 4, 6, 8])
    + pn.theme(figure_size=(10, 4), panel_spacing=0.5))

The side-by-side plots illustrate how the pdf \(p_X\) is the derivative of the cdf \(F_X.\) The derivative of the cdf, and hence the pdf, is maximized at \(X = 0,\) and approaches zero as \(X \rightarrow \infty\) or \(X \rightarrow -\infty.\) The chain rule for derivatives cancels the outer negation and establishes symmetry, \[ F_X(x) = 1 - F_X(-x) \qquad \qquad F_X'(x) = F_X'(-x) \qquad \qquad p_X(x) = p_X(-x). \]

Change of variables (univariate)

Suppose we have a univariate random variable \(X:\Omega \rightarrow \mathbb{R}\) and a function \(f:\mathbb{R} \rightarrow \mathbb{R}\). Then we can define a new random variable \(Y = f(X)\) by setting \[ Y(\omega) = f(X(\omega)) \] for each point \(\omega \in \Omega\) in the sample space.

If the function \(f\) is differentiable, monotonic (up or down), and one-to-one, then the standard change of variables formula for densities applies, so that the density function for \(Y\) is \[ p_Y(y) = p_X(f^{-1}(y)) \cdot \left| (f^{-1})'(y) \right|, \] where \(f^{-1}\) is the inverse of \(f\) and \((f^{-1})'\) is the derivative of the inverse of \(f\). The absolute value is needed because a monotonically decreasing \(f\) has a negative derivative, whereas a density must be non-negative. We consider multivariate changes of variables when we introduce vector-based random variables below.
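
To make the formula concrete, here is a small simulation sketch (our own illustration using only NumPy; the evaluation points and window width are arbitrary) that compares an empirical density estimate for \(Y = \textrm{logit}(X)\) with \(X \sim \textrm{uniform}(0, 1)\) against the change-of-variables result \(p_Y(y) = \textrm{logit}^{-1}(y) \cdot (1 - \textrm{logit}^{-1}(y)).\)

import numpy as np

rng = np.random.default_rng(1234)
x = rng.uniform(size=1_000_000)
y = np.log(x / (1 - x))                      # Y = logit(X)
def inv_logit(u):
    return 1 / (1 + np.exp(-u))
for y0 in [-2.0, 0.0, 1.5]:
    h = 0.05                                 # half-width of window around y0
    empirical = np.mean(np.abs(y - y0) < h) / (2 * h)
    analytic = inv_logit(y0) * (1 - inv_logit(y0))
    print(f"y = {y0:4.1f}: empirical {empirical:.4f}, analytic {analytic:.4f}")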

Probability mass function

If the cumulative distribution function \(F_X\) for a random variable \(X\) is discrete (in the sense of being made up only of step functions), then we define its probability mass function (pmf) \(p_X:\mathbb{N} \rightarrow [0, 1]\) by \[ p_X(n) = \Pr[X = n] = F_X(n) - F_X(n - 1), \] for \(n \in \mathbb{N},\) where \(F_X(-1) = \Pr[X \leq -1] = 0.\)

The support of a random variable \(X\) is the set of values it can take on, which is formally defined as the values for which the pdf or pmf is strictly greater than 0, \[ \textrm{supp}(X) = \{ x : p_X(x) > 0 \}. \]

Example 4. Let \(U\) be the result of rolling a fair 6-sided die and \(V\) be the result of rolling another fair 6-sided die independently, and let \(X = U + V\) be the sum of the dice. The following plots show the cdf and pmf for \(X\), which can take on values between 2 and 12. There are \(6 \times 6 = 36\) different primitive outcomes, and each outcome has a probability equal to the total number of ways it can arise divided by 36 (e.g., \(\Pr[X = 2] = 1/36\) because \(X=2\) only occurs one way, with \(U = 1, V = 1,\) whereas \(\Pr[X = 5] = 4/36\) because \(X = 5\) can arise as four different combinations of \(U\) and \(V\)).

Code
support = np.arange(2, 13)
probs = [0, 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36]
cdf_values = np.cumsum(probs)
df = pd.DataFrame({
    'n': support,  'cdf': cdf_values[1:12], 'cdf_low': cdf_values[0:11],
})
plot1 = (pn.ggplot(df, pn.aes(x='n', y='cdf'))
    + pn.geom_point(size=2)
    + pn.labs(x='n', y='Pr[X <= n]')
    + pn.scale_x_continuous(limits=(1, 13),
                            breaks=support, expand=(0, 0))
    + pn.ggtitle('cdf')
    + pn.theme(strip_margin_x=5.0))
# horizontal step segments between the jump points
for n in range(0, 12):
    plot1 += pn.geom_segment(pn.aes(x = n + 1,
               y = cdf_values[n], xend = n + 2 - 0.075,
               yend = cdf_values[n]))
# open circles mark the value approached from the left at each jump
plot1 += pn.geom_point(pn.aes(x='n', y='cdf_low'),
           size=2, stroke=0.5, fill='none',
           shape='o', color='black')
df2 = pd.DataFrame({ 'n': support, 'probs': probs[1:] })
plot2 = (pn.ggplot(df2, pn.aes(x='n', y='probs'))
         + pn.geom_bar(stat='identity', color='black', fill='white')
         + pn.scale_x_continuous(breaks=support)
         + pn.scale_y_continuous(breaks=np.arange(1, 7) / 36,
               labels=['1/36', '2/36', '3/36', '4/36', '5/36', '6/36'])
         + pn.ggtitle('pmf'))
g12 = (pw.load_ggplot(plot1, figsize=(3.25, 3))
      | pw.load_ggplot(plot2, figsize=(3.25, 3)))
g12.savefig()

The characterization of random variables with pdfs and pmfs is incomplete because not all probability distributions can be classified as discrete or continuous. A random variable \(X\) can have a cdf with both continuous portions and jumps. For example, if we define a mixture of a discrete and continuous cdf (e.g., for a spike-and-slab prior in Bayesian statistics), we get a cdf that is neither discrete nor continuous (we provide an example and plot its cdf below).

Conditional cdfs, pdfs, and pmfs

Conditional cumulative distribution function

If \(X:\Omega \rightarrow \mathbb{R}\) is a random variable and \(A \subseteq \Omega\) is an event with \(\Pr[A] > 0,\) the conditional cumulative distribution function is \[ F_{X \mid A}(x) = \Pr[X \leq x \mid A], \] where we read \(X \leq x\) as the event \(\{ \omega \in \Omega : X(\omega) \leq x \}.\)

Conditional probability mass functions

If \(X\) is a discrete random variable, the conditional probability mass function is \[ p_{X \mid A}(x) = \Pr[X = x \mid A], \] where \(X = x\) picks out the event \(\{ \omega \in \Omega : X(\omega) = x \}.\)

Suppose \(Y:\Omega \rightarrow \mathbb{R}\) is a discrete random variable (i.e., \(\textrm{range}(Y)\) is countable). It is conventional to further overload notation and terminology and define the conditional probability mass function \(p_{X \mid Y}\) by taking \[ p_{X \mid Y}(x \mid y) = \Pr[X = x \mid Y = y]. \] It follows that \[ p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \]
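As a small worked illustration (our own sketch using only NumPy, not an example from the main text), we can compute a conditional pmf by dividing a joint pmf by a marginal pmf, here for the sum of two fair dice given the value of the first die.

import numpy as np

# enumerate all 36 equally likely outcomes of two fair dice
u, v = np.meshgrid(np.arange(1, 7), np.arange(1, 7))
x = u + v                                    # X = U + V
for s in [4, 7, 11]:
    p_joint = np.mean((x == s) & (u == 3))   # p_{X,U}(s, 3)
    p_u = np.mean(u == 3)                    # p_U(3) = 1/6
    print(f"p(X = {s:2d} | U = 3) = {p_joint / p_u:.3f}")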

Conditional probability density functions

If \(X\) is continuous, the conditional probability density function is defined by differentiation, \[ p_{X \mid A}(x) = F_{X \mid A}'(x). \]

If \(Y\) is also continuous, we will define a limit that provides the conditional form. Let \(Y^{y, \epsilon}\) be the event determined by the condition \(Y \in (y - |\epsilon|, y + |\epsilon|)\) and define \[ p_{X \mid Y}(x \mid y) = \lim_{\epsilon \rightarrow 0} \ p_{X \mid Y^{\mbox{$\scriptstyle y,\epsilon$}}}(x). \] It follows that \[ p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \]

Conditioning on discrete and continuous variables

In practice, we might have a variable (discrete or continuous) whose value depends on the value of one or more discrete variables and one or more continuous variables (e.g., the outcome of a drug trial based on a patient’s discrete age group and continuous blood pressure). In this case, we define the cdf by conditioning on equality for discrete variables and by taking limits of surrounding intervals (balls in higher dimensions) for continuous variables. If the resulting variable is discrete, its pmf is defined by differences of the cdf; if it is continuous, the pdf is defined by differentiating the cdf.

In all cases, we will rely on the fact that for any random variables \(X\) and \(Y\), including multivariate ones, we have \[ p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \] Here, we have exploited the overloaded \(p\) notation, which allows for discrete, continuous, or mixed random variables.

Example 5. Suppose we have a continuous random variable \(Y\) that depends on a discrete random variable \(U\) and a continuous random variable \(V.\) Then we define the conditional density as a partial derivative of a limit conditioned on the discrete outcome, \[ p_{Y \mid U, V}(y \mid u, v) = \frac{\partial}{\partial y} \lim_{\epsilon \rightarrow 0} F_{Y \mid U = u, \ V \in (v - |\epsilon|, v+ |\epsilon|)}(y). \]

Mixed continuous and discrete variables

Not all random variables can be classified as continuous or discrete—some have cumulative distribution functions that are continuous in some segments and step functions elsewhere. In our applied examples, such distributions will be constructed from simpler continuous and discrete distributions.

Example 6. Suppose we have a binary random variable \(Z\) (i.e., \(\textrm{range}(Z) = \{ 0, 1 \}\)) and a continuous random variable \(U\) with \(\textrm{range}(U) = \mathbb{R}.\) We can define a new random variable \(Y\) by \[ Y = \begin{cases} 0 & \textrm{if } Z = 1, \textrm{ and} \\[2pt] U & \textrm{if } Z = 0. \end{cases} \] The variable \(Y\) will have \(\textrm{range}(Y) = \mathbb{R},\) so it is not technically discrete. Nevertheless, we see that \(\Pr[Y = 0] = \Pr[Z = 1],\) which cannot happen in a continuous distribution (where all points have zero probability). The cdf \(F_Y\) is equal to the cdf \(F_U\) scaled by the probability that \(Z = 0\), with a jump at zero of \(\Pr[Z = 1],\) \[ F_Y(y) = \begin{cases} F_U(y) \cdot \Pr[Z = 0] & \textrm{if } y < 0, \textrm{ and} \\[4pt] \Pr[Z = 1] + F_U(y) \cdot \Pr[Z = 0] & \textrm{if } y \geq 0. \end{cases} \] Zero-inflated continuous distributions are sometimes called spike and slab distributions, because the probability mass at zero acts as a “spike” (i.e., as a delta function for integration), with the underlying continuous distribution forming the “slab” (especially when there is a high probability of extreme values).

Here is a plot of the resulting mixed cdf for the case \(\Pr[Z = 1] = 0.4\), with the standard logistic distribution as the continuous component (i.e., with cdf \(\textrm{logit}^{-1}\)).

Code
p = 0.4
def mix_cdf(y):
    cdf_y = (1 - p) * inv_logit(y)
    return cdf_y if y < 0 else cdf_y + p
y_left = np.linspace(-8, -0.065, 50)
y_right = np.linspace(0, 8, 50)
F_Y_left = [mix_cdf(y_n) for y_n in y_left]
F_Y_right = [mix_cdf(y_n) for y_n in y_right]
df_left = pd.DataFrame({ 'y': y_left, 'F_Y': F_Y_left })
df_right = pd.DataFrame({ 'y': y_right, 'F_Y': F_Y_right })
print(pn.ggplot(pn.aes(x = 'y', y = 'F_Y'))
    + pn.geom_line(data = df_left)
    + pn.geom_line(data = df_right)
    + pn.geom_point(data = pd.DataFrame({'x': [0], 'y': [(1 - p) / 2]}),
                    mapping = pn.aes(x = 'x', y = 'y'), size=2,
            shape='o', stroke = 0.5, fill='none')
    + pn.geom_point(data = pd.DataFrame({'x': [0], 'y': [1 - (1 - p) / 2]}),
                    mapping = pn.aes(x = 'x', y = 'y'), size=2)
    + pn.scale_x_continuous(breaks = np.linspace(-8, 8, 9))
    + pn.scale_y_continuous(breaks = [0, 0.3, 0.7, 1])
)

B.6 Multivariate random variables

Although we have only described univariate random variables that take values in \(\mathbb{R}\), if we take a collection of random variables \(X_1, \ldots, X_N\) we can combine them into a multivariate random variable \(X:\Omega \rightarrow \mathbb{R}^N\), by taking \[ X(\omega) = [X_1(\omega) \ \cdots \ X_N(\omega)]. \] In this case, we can write \(X = [X_1 \ \cdots \ X_N].\) The same construction may be naturally extended to matrices, tensors, etc.

Example 4. Consider a sample space \((0, 1)^2\) with a Borel \(\sigma\)-algebra and probability function given by area (i.e., the Lebesgue measure), so that if \(A \subseteq (0,1)^2,\) we have \(\Pr[A] = \int_{(0,1)^2} \, \textrm{I}(x \in A) \, \textrm{d}x.\) We can define a random variable \(X\) by \(X(\omega) = \pi_1(\omega),\) where \(\pi_1\) is the projection function defined by \(\pi_1\!\left((x,y)\strut\right) = x,\) and a second random variable \(Y\) defined by \(Y(\omega) = \pi_2(\omega),\) where \(\pi_2\!\left((x, y)\strut\right) = y.\) We can define a multivariate random variable \(Z = [X \quad Y],\) which could have also been defined directly by \(Z(\omega) = \omega.\)

Functions of multivariate random variables

As in the univariate case, we can directly apply real-valued multivariate functions to a sequence of random variables to define new random variables. For example, if \(f:\textrm{range}(X_1) \times \cdots \times \textrm{range}(X_N) \rightarrow \mathbb{R}^M,\) then we get a new random variable \(Y = f(X_1, \ldots, X_N)\) defined by \[ Y(\omega) = f(X_1(\omega), \ldots, X_N(\omega)). \]

Example 4, cont. We can take the random variables \(X\) and \(Y\) and define a new random variable \(L = \sqrt{X^2 + Y^2}\) which is the Euclidean distance from the origin to the point \((X, Y)\). We can move back and forth between univariate and multivariate functions. For example, we could have defined \(L\) as a function of the multivariate variable \(Z\) by \(L = \sqrt{Z_1^2 + Z_2^2}.\)

Change of variables (multivariate)

Suppose we have a random variable \(X \in \mathbb{R}^N\) and a smooth, invertible function \(f:\mathbb{R}^N \rightarrow \mathbb{R}^N.\) Then we define \(Y = f(X)\) to be a new random variable, and note that its density is \[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \cdot \left| \nabla \! \left( f^{-1} \right) \! \left(y\right) \right|, \] where \(\nabla g\) is the Jacobian of the function \(g\) and \(\left| A \right|\) is read as the absolute determinant of the matrix \(A.\) Here the function \(g = f^{-1}\) maps \(\mathbb{R}^N\) to \(\mathbb{R}^N,\) so the Jacobian \(\nabla\!\left(f^{-1}\right)\) is an \(N \times N\) matrix.

If the Jacobian \(\nabla f^{-1}\) is a triangular \(N \times N\) matrix, its determinant simplifies to the product of its diagonal, \[ \left| (\nabla f^{-1})(y) \right| = \prod_{n = 1}^N \left((\nabla f^{-1})(y)\right)_{n, n}, \] where \((\nabla f^{-1}(y))_{n, n}\) is the partial derivative of the \(n\)-th element of the output with respect to the \(n\)-th element of the input, evaluated at \(y,\) \[ \left((\nabla f^{-1})(y)\right)_{n, n} = \frac{\partial}{\partial u_n} f^{-1}(u)_n \ \bigg|_{u = y}. \] For example, suppose we have random variables \(X, Y\) such that \(X \in (0, 1)\) and \(Y \in (0, 1 - X).\) If we define \(Z = 1 - X - Y,\) then \([X \quad Y \quad Z]\) is a simplex (i.e., non-negative entries, sums to one). We can perform a change of variables on \(X, Y\) to convert to unconstrained parameters \(U, V \in \mathbb{R}^2\) using the following “stick-breaking” transform, \([u \quad v] = f([x \quad y]),\) defined by \[ u = \textrm{logit}(x) \qquad v = \textrm{logit}\!\left(\frac{y}{1 - x}\right). \] The inverse of this function, \([x \quad y] = f^{-1}([u \quad v]),\) is given by \[ x = \textrm{logit}^{-1}(u) \qquad y = \textrm{logit}^{-1}(v) \cdot (1 - \textrm{logit}^{-1}(u)), \] with the Jacobian matrix \[ \begin{bmatrix} \frac{\partial}{\partial u} x & \frac{\partial}{\partial v} x \\[4pt] \frac{\partial}{\partial u} y & \frac{\partial}{\partial v} y \end{bmatrix} = \begin{bmatrix} (\textrm{logit}^{-1})'(u) & 0 \\[4pt] -\textrm{logit}^{-1}(v) \cdot (\textrm{logit}^{-1})'(u) & (1 - \textrm{logit}^{-1}(u)) \cdot (\textrm{logit}^{-1})'(v) \end{bmatrix}, \] and thus has a Jacobian determinant of \[ \left| \nabla f^{-1}([u \quad v]) \right| = (\textrm{logit}^{-1})'(u) \cdot (1 - \textrm{logit}^{-1}(u)) \cdot (\textrm{logit}^{-1})'(v), \] where \[ \left(\textrm{logit}^{-1}\right)'(y) = \textrm{logit}^{-1}(y) \cdot \left( 1 - \textrm{logit}^{-1}(y) \right). \]
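
To check the algebra, here is a small numerical sketch (central finite differences with NumPy; the test point and step size are arbitrary choices of ours) comparing a numerical Jacobian determinant of the inverse stick-breaking transform with the analytic formula above.

import numpy as np

def inv_logit(t):
    return 1 / (1 + np.exp(-t))

def f_inv(u, v):
    # inverse stick-breaking transform
    x = inv_logit(u)
    y = inv_logit(v) * (1 - inv_logit(u))
    return np.array([x, y])

u, v = 0.3, -1.2                              # arbitrary test point
h = 1e-6
J = np.column_stack([(f_inv(u + h, v) - f_inv(u - h, v)) / (2 * h),
                     (f_inv(u, v + h) - f_inv(u, v - h)) / (2 * h)])
d = lambda t: inv_logit(t) * (1 - inv_logit(t))   # (logit^{-1})'
print(abs(np.linalg.det(J)))                      # numerical determinant
print(d(u) * (1 - inv_logit(u)) * d(v))           # analytic determinant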

It is convenient for derivations that the Jacobian of a function composition is the product of the Jacobians of the functions, analogously to the univariate chain rule, \[ \frac{\textrm{d}}{\textrm{d}x} f(g(x)) = f'(g(x)) \cdot g'(x), \] and hence the determinant of the composition’s Jacobian is just the product of the individual functions’ Jacobian determinants, \[ \left| \nabla (f \circ g) (x) \right| = \left| \nabla f (g(x)) \right| \cdot \left| \nabla g (x) \right|. \]

Multivariate cdfs, pmfs, and pdfs

The cumulative distribution function for a multivariate random variable \(X:\Omega \rightarrow \mathbb{R}^N\) is defined by \[ F_X(x_1, \ldots, x_N) = \Pr[X_1 \leq x_1, \ldots, X_N \leq x_N] = \Pr[\{ \omega \in \Omega : X_1(\omega) \leq x_1, \ldots, X_N(\omega) \leq x_N \}]. \]

The probability mass function for a discrete multivariate random variable is \[ p_X(x_1, \ldots, x_N) = \Pr[X_1 = x_1, \ldots, X_N = x_N]. \]

The probability density function for a continuous multivariate random variable \(X \in \mathbb{R}^N\) is defined as the function \(p_X:\mathbb{R}^N \rightarrow [0, \infty)\) that has the property that for any event \(A \in \mathcal{S},\) \[ \Pr[X \in A] = \int_A p_X(x) \, \textrm{d} x, \] or equivalently, in terms of rectangular integrations and the cdf, \[ F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_N} p_X([u_1 \quad \cdots \quad u_N]) \, \textrm{d}u_N \cdots \textrm{d}u_1. \]

We can marginalize a joint cdf with a limit, \[ F_X(x) = \lim_{y \rightarrow \infty} F_{X, Y}(x, y). \] The law of total probability holds for joint densities, \[ p_X(x) = \int_{\mathbb R} \, p_{X, Y}(x, y) \, \textrm{d}y. \]

Conditional cdfs, pmfs, and pdfs are defined in the same way as for univariate random variables. This leads to a common form of marginalization where the joint density is factored, \[ p_X(x) = \int_{\mathbb R} \, p_{X \mid Y}(x \mid y) \cdot p_Y(y) \, \textrm{d}y. \]

B.7 Expectations

Expectation

If we have a (potentially multivariate) random variable \(Z\), we write \(\mathbb{E}[Z]\) for its expectation, which, despite the fancy name, is just its average value, weighted by its density, \[ \mathbb{E}[Z] = \int_{\textrm{supp}(Z)} z \cdot p_Z(z) \, \textrm{d}z. \] Discussing random variables presupposes a background probability space \((\Omega, \mathcal{S}, \Pr)\) with respect to which the cumulative distribution function \(F_Z\) and the probability density function \(p_Z\) are defined.

Statistical notation overloads many notations, and commonly just \(Z\) rather than \(\textrm{supp}(Z)\) is used to subscript the integral.

Expectations are linear operators in that if \(a, b\) are constants and \(X\) is a random variable, then \[ \mathbb{E}[a + b \cdot X] = a + b \cdot \mathbb{E}[X]. \] Expectations are also additive in random variables, \[ \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]. \] If \(X\) and \(Y\) are independent, then we also have \[ \mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]. \]
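
These identities are easy to check by simulation. The following sketch (our own; the seed and constants are arbitrary, and only NumPy is used) validates linearity, additivity, and the product rule for independent variables.

import numpy as np

rng = np.random.default_rng(42)
M = 1_000_000
x = rng.normal(1.0, 2.0, M)      # E[X] = 1
y = rng.normal(-3.0, 1.0, M)     # E[Y] = -3, independent of X
print(np.mean(2 + 5 * x))        # ~ 2 + 5 * E[X] = 7
print(np.mean(x + y))            # ~ E[X] + E[Y] = -2
print(np.mean(x * y))            # ~ E[X] * E[Y] = -3 (by independence)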

Conditional expectation

With Bayesian statistics, we are typically interested in conditional expectations, which are the value of a random variable conditioned on the observed value of a second random variable. Suppose \(Y\) is a second random variable over the same implicit probability space and we observe that \(Y = y\). The conditional expectation of \(Z\) given \(Y = y\) is \[ \mathbb{E}[Z \mid Y = y] = \int_{Z} z \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] That is, we take a weighted average of the value of \(Z\) with weights determined by the conditional density \(p_{Z \mid Y}(z \mid y)\).

Expectations of functions of random variables

Suppose we have a random variable \(Z\) and a real-valued function \(f:\textrm{supp}(Z) \rightarrow \mathbb{R}\). We can then define a new random variable \(W = f(Z)\). To calculate the expectation of \(W\) in terms of the density for \(Z\), we use the so-called law of the unconscious statistician, which tells us we can simplify the change of variables back to a direct average over \(Z\), \[\begin{align} \mathbb{E}[f(Z)] &= \mathbb{E}[W] \\[4pt] &= \int_W w \cdot p_W(w) \, \textrm{d}w \\[4pt] &= \int_W w \cdot p_Z(f^{-1}(w)) \cdot \left| (f^{-1})'(w) \right| \, \textrm{d}w \\[4pt] &= \int_Z f(z) \cdot p_Z(z) \, \textrm{d}z, \end{align}\] where we have written \(f^{-1}\) for the inverse transform from \(W\) back to \(Z\) and \((f^{-1})'\) for its derivative. The first line is the definition, the second is a standard univariate change of variables (see above), but deriving the third line in the general case requires measure-theoretic techniques beyond this short introduction.

The conditional form follows suit, \[ \mathbb{E}[f(Z) \mid Y = y] = \int_Z f(z) \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] Either way, we just take a weighted average of the value of \(f(z)\) with weights \(p_Z(z)\) or \(p_{Z|Y}(z \mid y).\)

B.8 Variance, standard deviation, covariance, and correlation

Variance

The variance of a random variable is the expectation of its squared difference from the expectation, \[ \textrm{var}[X] = \mathbb{E}\!\left[ \left( X - \mathbb{E}[X] \right)^2 \right]. \] Variance can be decomposed as follows. \[\begin{align} \textrm{var}[X] &= \mathbb{E}\!\left[ \left( X - \mathbb{E}[X] \right)^2 \right] \\[4pt] &= \mathbb{E}\!\left[ \, X^2 - 2 \cdot X \cdot \mathbb{E}[X] + \mathbb{E}[X]^2 \, \right] \\[4pt] &= \mathbb{E}\!\left[X^2 \right] - \mathbb{E}\!\left[2 \cdot X \cdot \mathbb{E}[X]\right] + \mathbb{E}\!\left[\mathbb{E}[X]^2\right] \\[4pt] &= \mathbb{E}\!\left[X^2 \right] - 2 \cdot \mathbb{E}[X] \cdot \mathbb{E}[X] + \mathbb{E}[X]^2 \\[4pt] &= \mathbb{E}\!\left[X^2\right] - \mathbb{E}[X]^2. \end{align}\]

If \(X\) represents a distance measured in meters, then the variance \(\textrm{var}[X]\) is expressed in square meters. This can make it tricky to reason about variances.

Adding a constant \(c \in \mathbb{R}\) to a random variable \(X\) does not change its variance, \[ \textrm{var}[c + X] = \textrm{var}[X]. \] Multiplying a random variable by a constant scales the variance quadratically, \[ \textrm{var}[c \cdot X] = c^2 \cdot \textrm{var}[X]. \] If \(U\) and \(V\) are independent, the variance of their sum is the sum of their variances, \[ \textrm{var}[U + V] = \textrm{var}[U] + \textrm{var}[V]. \]
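
As with expectations, these rules can be validated with a quick simulation (a minimal sketch of our own with arbitrary seed and constants, using only NumPy).

import numpy as np

rng = np.random.default_rng(0)
M = 1_000_000
u = rng.normal(0, 1.5, M)        # var[U] = 2.25
v = rng.gamma(2.0, 1.0, M)       # var[V] = 2, independent of U
print(np.var(3.7 + u))           # ~ var[U] = 2.25
print(np.var(-2 * u))            # ~ 4 * var[U] = 9
print(np.var(u + v))             # ~ 2.25 + 2 = 4.25 (by independence)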

Divergent expectations

An expectation is defined through an integral, but that integral might diverge (i.e., not have a finite value). Examples are rampant in statistics. Suppose we have two independent, standard normal random variables \(X_1, X_2 \sim \textrm{normal}(0, 1).\) Then their quotient \(Y = \frac{X_1}{X_2}\) is a random variable such that \(\mathbb{E}[Y],\) and hence \(\textrm{var}[Y],\) diverges. The density of \(Y\) follows the Cauchy distribution, with \[ p_Y(y) = \frac{1}{\pi} \cdot \frac{1}{1 + y^2}. \] The fundamental problem is that \(p_Y(y)\) does not converge to zero quickly enough as \(y \rightarrow \infty\) or \(y \rightarrow -\infty\). Contrast the standard Cauchy density with the standard normal density, \[ p_{X_1}(x) = \frac{1}{\sqrt{2 \cdot \pi}} \exp\!\left(-\frac{1}{2}\cdot x^2\right), \] which converges to zero exponentially fast.

Standard deviation

The standard deviation of a random variable is the square root of its variance, \[ \textrm{sd}[X] = \sqrt{\textrm{var}[X]}. \] The units of the standard deviation are identical to those of the random variable \(X\). Thus we say that the standard deviation is the scale of a variable’s variation.

Standard deviations of a random variable \(X\) have the same scale as the random variable \(X\) itself. They also drop additive constants, but unlike variances, they scale linearly (in absolute value) with multiplication. For any constant \(c \in \mathbb{R},\) we have \[\begin{align} \textrm{sd}[c + X] &= \textrm{sd}[X] \\[4pt] \textrm{sd}[c \cdot X] &= |c| \cdot \textrm{sd}[X]. \end{align}\] For independent random variables \(U\) and \(V,\) the standard deviation of the sum works through the fact that variances are additive, \[ \textrm{sd}[U + V] = \sqrt{\textrm{sd}[U]^2 + \textrm{sd}[V]^2}. \]

Covariance

If we have two random variables \(X\) and \(Y\), their covariance is defined as \[ \textrm{cov}[X, Y] = \mathbb{E}\!\left[\strut (X - \mathbb{E}[X]) \cdot (Y - \mathbb{E}[Y])\right]. \] Covariance is measured in units that are the product of the units of \(X\) and the units of \(Y\). For example, if \(X\) is watts and \(Y\) is hours, then the units of \(\textrm{cov}[X, Y]\) are watt-hours. As an operation, covariance is commutative in that \(\textrm{cov}[X,Y] = \textrm{cov}[Y, X].\)

The covariance between a variable and itself is just its variance, \[ \textrm{cov}[X, X] = \mathbb{E}\!\left[ \strut (X - \mathbb{E}[X]) \cdot (X - \mathbb{E}[X])\right] = \mathbb{E}\!\left[ \strut (X - \mathbb{E}[X])^2\right] = \textrm{var}[X]. \]

Example 6. We can construct a pair of scalar random variables \((X, Y)\) that covary by generating data according to a simple linear regression with \(\mathbb{E}[Y] = \alpha + \beta \cdot X,\) \[\begin{align} p_X(x) &= \textrm{normal}(x \mid 0, \tau), \textrm{ and} \\[4pt] p_{Y \mid X}(y \mid x) &= \textrm{normal}(y \mid \alpha + \beta \cdot x, \sigma), \end{align}\] where \(\sigma, \tau \in (0, \infty)\) are scale parameters (i.e., determine the distribution’s standard deviation). Because the normal distribution uses a location-scale parameterization, we can also construct \(Y\) using a simple change of variables using a new variable \(E\) corresponding to the “error,” \[\begin{align} Y &= \alpha + \beta \cdot X + E \\[4pt] p_E(\epsilon) &= \textrm{normal}(\epsilon \mid 0, \sigma). \end{align}\] We know from regression theory that if \(Y = \alpha + \beta \cdot X + E\) where \(E\) has a normal distribution independent of \(X\), then \[ \beta = \frac{\textrm{cov}[X, Y]}{\textrm{var}[X]}, \] and hence \[ \textrm{cov}[X, Y] = \beta \cdot \textrm{var}[X] = \beta \cdot \tau^2. \] We can validate our mathematical derivation via simulation.

M = 100_000
tau = 2.3
sigma = 1.5
alpha, beta = 1.3, -0.7
x = np.random.normal(loc = 0, scale = tau, size = M)
y = np.random.normal(loc = alpha + beta * x, scale = sigma, size = M)
cov_hat = np.cov(x, y)[0, 1]
print(f"sample cov(x, y) = {cov_hat:5.2f}")
sample cov(x, y) = -3.71

Here, we expect the sample covariance to be very close to \(\beta \cdot \tau^2 = -0.7 \cdot 2.3^2 = -3.703.\)

Correlation

The correlation between two random variables is their covariance scaled by the inverse of their standard deviations, \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]} {\textrm{sd}[X] \cdot \textrm{sd}[Y]}. \] The units all cancel here—the numerator and the denominator both have units given by the product of the units of \(X\) and the units of \(Y\), so correlation itself is unitless. Like covariance, correlation is commutative.

The correlation between a variable and itself is one, \[ \textrm{corr}[X, X] = \frac{\textrm{var}[X]} {\textrm{sd}[X] \cdot \textrm{sd}[X]} = 1. \]

Example 6, cont. Given that we have computed the covariance between \(X\) and \(Y\), we can compute their correlation by dividing by their scales, \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]}{\textrm{sd}[X] \cdot \textrm{sd}[Y]}. \] We know that \(\textrm{sd}[X] = \tau,\) but \(Y\) is a compound that inherits additional uncertainty from its location parameter being a random variable itself. Nevertheless, because \(Y = \alpha + E + \beta \cdot X\), where \(\alpha\) is constant and both \(E\) and \(X\) are normal, we just add variances of the varying terms and then take the square root to return to the standard deviation scale, \[ \textrm{sd}[Y] = \sqrt{\textrm{var}[\beta \cdot X] + \textrm{var}[E]} = \sqrt{\beta^2 \cdot \textrm{var}[X] + \textrm{var}[E]} = \sqrt{\beta^2 \cdot \tau^2 + \sigma^2} \approx 2.200. \] Therefore, the correlation will be \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]}{\textrm{sd}[X] \cdot \textrm{sd}[Y]} = \frac{\beta \cdot \tau^2}{\tau \cdot \sqrt{\beta^2 \cdot \tau^2 + \sigma^2}} = \frac{-0.7 \cdot 2.3^2}{2.3 \cdot 2.2} \approx -0.732. \] The simulation results agree with the mathematics.

print(f"sd(x) = {np.std(x):5.2f};  sd(y) = {np.std(y):5.2f}")
corr_hat = np.corrcoef(x, y)[0, 1]
print(f"sample corr(x, y) = {corr_hat:5.2f}")
sd(x) =  2.30;  sd(y) =  2.21
sample corr(x, y) = -0.73

Covariance and correlation matrices

If we have a random variable \(X:\Omega \rightarrow \mathbb{R}^N\), its covariance matrix is \[ \textrm{cov}[X] = \mathbb{E}\!\left[(X - \mathbb{E}[X]) \cdot (X - \mathbb{E}[X])^{\top}\right]. \] The covariance matrix of \(X\) has entries that are covariances of its elements, \[ \textrm{cov}[X]_{i,j} = \textrm{cov}[X_i, X_j]. \] Covariance matrices are symmetric and positive semi-definite; they are positive definite (and hence full rank) unless some linear combination of the components of \(X\) has zero variance. The units of entry \((i, j)\) are the product of the units of \(X_i\) and \(X_j\).

The correlation matrix for \(X\) standardizes the covariance matrix by dividing by the scales, \[ \textrm{corr}[X] = S^{-1} \cdot \textrm{cov}[X] \cdot S^{-1}, \] where \(S\) is the diagonal matrix of scales, \(S_{n,n} = \textrm{sd}[X_n].\) The entries are the correlations between pairs of components, \[ \textrm{corr}[X]_{i,j} = \frac{1}{\textrm{sd}[X_i]} \cdot \textrm{cov}[X_i, X_j] \cdot \frac{1}{\textrm{sd}[X_j]} = \textrm{corr}[X_i, X_j]. \]

B.9 Quantiles, medians, and uncertainty intervals

The quantile function for a continuous random variable \(Z\) is its inverse cdf, \(F^{-1}_Z:(0, 1) \rightarrow \textrm{range}(Z).\) For a value \(p \in (0, 1),\) we say that \(F^{-1}_Z(p)\) is the \(p\)-quantile of \(Z\). In terms of probabilities, if \(z = F^{-1}_Z(p),\) then \(\Pr[Z \leq z] = p.\)

The 50% quantile, \(F_Z^{-1}(0.5),\) is known as the median. There is a 50% chance that a random variable takes on a value less than its median and a 50% chance that it takes on a value greater.

We can define posterior intervals for variables using quantile functions. For a random variable \(X\) and \(p \in (0, 1)\) the central \(p\) posterior interval of \(X\) is \(\left( F_X^{-1}\!\left(\frac{1 - p}{2}\right), F_X^{-1}\!\left(1 - \frac{1 - p}{2}\right) \right),\) and it has its nominal coverage by definition, \[ \Pr\!\left[X \in \left( F_X^{-1}\!\left(\frac{1 - p}{2}\right), F_X^{-1}\!\left(1 - \frac{1 - p}{2}\right) \right)\right] = p. \]
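
In practice, we compute quantiles and central intervals from draws rather than from an analytic inverse cdf. Here is a minimal sketch (our own; the seed is arbitrary and a standard normal stands in for posterior draws) using NumPy’s quantile function.

import numpy as np

rng = np.random.default_rng(7)
z = rng.normal(0, 1, 1_000_000)               # stand-in for posterior draws
lo, mid, hi = np.quantile(z, [0.05, 0.5, 0.95])
print(f"median = {mid:.3f}")
print(f"central 90% interval = ({lo:.3f}, {hi:.3f})")
# for a standard normal, roughly (-1.645, 1.645)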

B.10 The central limit theorem

The central limit theorem is expressed in terms of the convergence of a sequence of random variables to another random variable, so we will begin with formal definitions of convergence of random variables.

Convergence of sequences of random variables

Suppose \(Y = Y_1, Y_2, \ldots\) is a sequence of random variables and \(Z\) is another random variable.

The sequence \(Y\) converges in distribution to \(Z\) if for every point \(x\) at which \(F_Z\) is continuous, \[ \lim_{n \rightarrow \infty} F_{Y_n}(x) = F_{Z}(x). \]

The sequence \(Y\) converges in probability to \(Z\) if for all \(\epsilon > 0,\) \[ \lim_{n \rightarrow \infty} \Pr\!\left[ \strut \left| Y_n - Z \right| \geq \epsilon \right] = 0. \]

The sequence \(Y\) converges almost surely to \(Z\) if \[ \Pr\!\left[ \{ \omega : \lim_{n \rightarrow \infty} Y_n(\omega) = Z(\omega) \} \right] = 1. \]

Convergence of a sequence of random variables to a constant value \(\mu \in \mathbb{R}\) is defined as convergence to a constant random variable \(Z\) that always takes the value \(\mu\) (i.e., \(Z(\omega) = \mu\) for all \(\omega \in \Omega\)).

The laws of large numbers

Suppose we have an infinite sequence of random variables \(X_0, X_1, \ldots\) that are independent and identically distributed (i.i.d.) with a finite expectation, which means there is a constant \(\mu \in \mathbb{R}\) such that for all \(n \in \mathbb{N},\) \[ \mathbb{E}[X_n] = \mu. \] Next, define a sequence of variables \(Y_0, Y_1, \ldots\) to be the running averages of the \(X\) sequence, \[ Y_n = \textrm{mean}(X_{0:n}) = \frac{1}{n + 1} \sum_{i = 0}^n X_i. \] The additional one in the denominator is because there are \(n + 1\) numbers between \(0\) and \(n\) (inclusive).

The weak law of large numbers is that the sequence \(Y\) converges in probability to \(\mu.\)

The strong law of large numbers is that the sequence \(Y\) converges almost surely to \(\mu.\)

Example 7: Let \(X_n \sim \textrm{bernoulli}(0.8),\) and recall that \(Y_n = \textrm{mean}(X_{0:n}).\) We draw \(M = 100\) sequences of length \(n = 10^6\) and plot \(Y_n\) versus \(n\) in two segments. On the left is \(n \in [10^2, 10^4]\) and the right shows \(n \in [10^4, 10^6].\) The \(y\)-axes are scaled so that the range on the right is shrunk by a factor of 10—on the left, the \(y\)-axis has a range \((0.7, 0.9)\), whereas it is \((0.79, 0.81)\) on the right.

Code
N = 1_000_001
M = 100

# Simulate M sequences
df_list = []
for m in range(M):
    X = np.random.choice([0, 1], size=N, p=[0.2, 0.8])
    Y = np.cumsum(X) / (np.arange(N) + 1)
    df = pd.DataFrame({'n': np.arange(N), 'Y': Y})
    df['Simulation'] = m
    df_list.append(df)
df = pd.concat(df_list)

start, end = np.log10(100), np.log10(10_000)
log_values = np.linspace(start, end, 1000)
indices = np.unique(np.round(10**log_values)).astype(int)
df1 = df.loc[indices]
plot1 = (pn.ggplot(df1, pn.aes(x='n', y='Y', group='Simulation'))
        + pn.geom_line(alpha=0.2, size=0.25)
        + pn.scale_y_continuous(limits=[0.7, 0.9])
        + pn.scale_x_continuous(trans='log10'))

start, end = np.log10(10_000), np.log10(1_000_000)
log_values = np.linspace(start, end, 1000)
indices = np.unique(np.round(10**log_values)).astype(int)
df2 = df.loc[indices]
plot2 = (pn.ggplot(df2, pn.aes(x='n', y='Y', group='Simulation'))
        + pn.geom_line(alpha=0.2, size=0.25)
        + pn.scale_y_continuous(limits=[0.79, 0.81])
        + pn.scale_x_continuous(trans='log10'))

g12 = (pw.load_ggplot(plot1, figsize=(2.8, 2))
      | pw.load_ggplot(plot2, figsize=(2.8, 2)))
g12.savefig()

The plots look nearly identical because the error in estimating the mean in the laws of large numbers follows a predictable pattern we describe in the next section.

The central limit theorem

We will still be assuming that \(X = X_0, X_1, \ldots\) is an infinite sequence of i.i.d. random variables with finite mean and further assume they have finite variance, so that we can define \[ \sigma = \textrm{sd}[X_n]. \] The central limit theorem involves convergence to a random variable with a standard normal distribution (which particular variable we choose doesn’t matter). Because \(Y_n\) is defined as a running average of the \(X_n\), its scale is reduced by a factor of \(\sqrt{n}\) (this is what the central limit theorem establishes). Because we want a single variable to target for our limit, we also standardize the \(Y_n\) by subtracting the mean \(\mu = \mathbb{E}[X_n]\) and then dividing by the scale \(\sigma = \textrm{sd}[X_n].\) This gives us \[ U_n = \sqrt{n} \cdot \frac{Y_n - \mu}{\sigma}. \] The central limit theorem (CLT) simply says that \(U_n\) converges in distribution to a standard normal variate \(Z \sim \textrm{normal}(0, 1)\). If we let \(\Phi\) be the cdf of the standard normal distribution, we can unpack the convergence in distribution to restate the CLT as \[ \lim_{n \rightarrow \infty} \Pr[U_n \leq x] = \Phi(x). \]

The first version of the central limit theorem, provided by De Moivre (1718), was restricted to the binomial case; it was generalized nearly 100 years later by Pierre-Simon Laplace (1811) and then extended to non-identically distributed variables by Lyapunov (1900–1901) nearly another century later.

Example 7, cont.: We continue the example with \(X_n \sim \textrm{bernoulli}(0.8),\) which has an expected value of \(\mu = 0.8\) and standard deviation of \(\sigma = \sqrt{\mu \cdot (1 - \mu)} = 0.4.\) As before, \(Y_n\) is defined to be the running average of the \(X_n.\) We now take \(U_n = \sqrt{n} \cdot \frac{Y_n - \mu}{\sigma},\) as defined above, and plot histograms of draws for various \(n.\)

Code
M = 1000 if DRAFT else 1_000_000
df = pd.DataFrame(columns = ['N', 'u'])
mu = 0.8
sigma = 0.4
for N in [2, 4, 8, 16, 32, 64, 128, 256, 512]:
    # proportion of successes in N bernoulli(0.8) trials
    y = np.random.binomial(N, mu, size=M) / N
    u = np.sqrt(N) * (y - mu) / sigma
    dfN = pd.DataFrame({'N': N, 'u': u })
    df = pd.concat([df, dfN], ignore_index = True)
print(pn.ggplot(df, pn.aes(x = 'u'))
  + pn.geom_bar(stat = 'count', width=0.125)
  + pn.facet_wrap('~ N', scales='free_y')
  + pn.scale_x_continuous(breaks = [-4, -2, 0, 2, 4], limits = [-5, 5])
  + pn.theme(axis_text_y = pn.element_blank(),
             axis_title_y = pn.element_blank(),
             axis_ticks_major_y = pn.element_blank()))

With \(n \gg 100,\) the approximation is quite reasonable even with skewed Bernoulli draws. More draws would be required for chance of success greater than 0.8 (or by symmetry, less than 0.2) and fewer would be required for chances of success closer to 0.5.

B.11 Stochastic processes

A stochastic process is an infinite sequence of random variables taking their values in some set \(\mathcal{X},\) which is referred to as the state space of the process. For the purposes of parametric applied statistics, we will assume a real-valued state space of fixed dimension, \(\mathcal{X} = \mathbb{R}^N.\) We do allow variables that range over subsets of \(\mathbb{R}^N,\) such as the integers, simplexes, or positive definite matrices.

Discrete-time stochastic processes

A discrete-time stochastic process is a countably infinite sequence of random variables \(X = X_0, X_1, \ldots \in \mathcal{X}\) indexed by natural numbers \(n \in \mathbb{N}.\) Continuous-time processes are similar, only with real-valued indexes.

In the simplest case where the state space is just the real numbers, \(\mathcal{X} = \mathbb{R}\), then each \(X_n\) is scalar and the stochastic process \(X\) can be treated like an infinite-dimensional vector. Or you can think of it as a function from indexes to regular old scalar random variables.

Stationary processes

A discrete-time stochastic process \(X_0, X_1, \ldots\) is stationary if the marginal distribution of every subsequence of the same length is the same regardless of its starting position, i.e., for all sequence lengths \(k \in \mathbb{N}\) and starting positions \(n, n' \in \mathbb{N},\) \[ F_{X_{n}, \ldots, X_{n + k}} = F_{X_{n'}, \ldots, X_{n' + k}}. \] With \(k = 0,\) we have \(F_{X_n} = F_{X_{n'}}\) for all \(n, n' \in \mathbb{N}.\) In particular, the initial distribution \(F_{X_0}\) must be identical to every other \(F_{X_n}\) (more about this later).


Example 1. Any i.i.d. sequence is stationary, including a constant sequence.

B.12 Markov chains

A Markov chain is a discrete-time stochastic process consisting of a countably infinite sequence of random variables \(X = X_0, X_1, \ldots,\) where the conditional distribution of \(X_{n + 1}\) given \(X_0, \ldots, X_n\) depends only on \(X_n,\) i.e., for all \(n \in \mathbb{N},\) \[ F_{X_{n + 1} \mid X_n} = F_{X_{n + 1} \mid X_n, \ldots, X_0}. \] The distribution function for the first element of the chain is \(F_{X_0}.\)

These distribution functions may only have support over subsets of the state space \(\mathcal{X} = \mathbb{R}^N,\) such as correlations constrained to \([-1, 1]\) or rates constrained to \((0, \infty).\) In the finite state-space case, the initial distribution can be represented as a simplex (i.e., a vector with non-negative entries that sums to one) and the transition distribution as a stochastic matrix (i.e., a matrix with simplex rows).

Time-homogeneous Markov chains

We will restrict our attention to time-homogeneous Markov chains, where the conditional distribution function for the next element does not depend on \(n,\) i.e., for all \(n, n' \in \mathbb{N},\) \[ F_{X_{n + 1} \mid X_n} = F_{X_{n' + 1} \mid X_{n'}}. \]

Initial and conditional pmfs or pdfs

We will often work with either a discrete or continuous initial distribution, which will have either a pmf or pdf \(p_{X_0}.\) Similarly, transition distributions will typically be discrete or continuous and have either a conditional pmf or pdf \(p_{X_{n + 1} \mid X_n},\) which is the same for every \(n\) in the time-homogeneous case.

Example 2. Autoregressive process. Rosenthal (2006) provides the example of a Markov chain defined by an initial distribution with density \[ p_{X_0}(x_0) = \textrm{normal}(x_0 \mid 0, 1), \] and transition distribution with density \[ p_{X_{n + 1} \mid X_n}(x_{n + 1} \mid x_n) = \textrm{normal}\!\left(x_{n+1} \mid \frac{x_n}{2}, \frac{\sqrt{3}}{2}\right), \] which leads to a Markov chain with a standard normal stationary distribution \[ p_{X_n}(x_n) = \textrm{normal}(x_n \mid 0, 1). \] This process is an example of a first-order autoregressive process (AR(1)), the general form of which has a transition distribution \[ X_{n + 1} \sim \textrm{normal}\!\left(\alpha + \beta \cdot X_n, \sigma\right). \] An AR(1) process with \(\alpha = 0\) has a stationary distribution only if \(\left| \beta \right| < 1,\) in which case it has a marginal distribution of \[ X_n \sim \textrm{normal}\!\left(0, \frac{\sigma}{\sqrt{1 - \beta^2}}\right) \] and covariance \[ \textrm{cov}[X_n, X_{n + k}] = \frac{\sigma^2 \cdot \beta^k}{1 - \beta^2}. \] The covariance only depends on the number of transition steps, \(k,\) and decays exponentially in \(k\) when \(\left| \beta \right| < 1.\) This is easier to see on the correlation scale after dividing through by the marginal standard deviations, which are both \(\sigma / \sqrt{1 - \beta^2}\), \[ \textrm{corr}[X_n, X_{n + k}] = \frac{\textrm{cov}[X_n, X_{n + k}]} {\textrm{sd}[X_n] \cdot \textrm{sd}[X_{n + k}]} = \frac{\left( \frac{\sigma^2 \cdot \beta^k}{1 - \beta^2} \right)} {\left( \frac{\sigma}{\sqrt{1 - \beta^2}} \cdot \frac{\sigma}{\sqrt{1 - \beta^2}} \right)} = \beta^k. \]
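
We can check the stationary scale and the correlation decay \(\beta^k\) by simulation. The following sketch (our own; the seed is arbitrary and \(N\) is kept modest so the Python loop stays fast) simulates the AR(1) example above started from its stationary distribution.

import numpy as np

rng = np.random.default_rng(1234)
beta, sigma = 0.5, np.sqrt(3) / 2    # stationary sd = sigma / sqrt(1 - beta^2) = 1
N = 100_000
x = np.empty(N)
x[0] = rng.normal(0, 1)              # start in the stationary distribution
for n in range(N - 1):
    x[n + 1] = rng.normal(beta * x[n], sigma)
print(np.std(x))                     # ~ 1
for k in [1, 2, 3]:
    print(k, np.corrcoef(x[:-k], x[k:])[0, 1], beta**k)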

Stationarity and stationary-preserving transitions

Suppose we have a Markov chain \(X = X_0, X_1, \ldots\) with state-space \(\mathcal{X} = \mathbb{R}^N.\)

Suppose we have a measure \(\textrm{Q}\) on the state space with a density-like function \(\pi_Q\) corresponding to \(\textrm{Q},\) so that for every event \(A\) in the Borel algebra over \(\mathbb{R}^N,\) \[ \textrm{Q}[A] = \int_{A} \pi_Q(x) \, \textrm{d}x. \] We say that the measure \(\textrm{Q}\) is the stationary distribution of the Markov chain \(X\) if transitioning from a state distributed according to \(\textrm{Q}\) leaves the distribution unchanged, i.e., for every event \(A\) in the Borel algebra, \[ \textrm{Q}[A] = \int_{\mathbb{R}^N} \Pr[X_{n + 1} \in A \mid X_n = x] \cdot \pi_Q(x) \, \textrm{d}x. \]

A Markov chain \(X\) need not be stationary to have a stationary distribution. In applications, we generally construct non-stationary Markov chains with a target stationary distribution (e.g., a Bayesian posterior). If \(X\) has the stationary distribution \(\textrm{Q}\) and the initial distribution follows it (i.e., \(\Pr[X_0 \in A] = \textrm{Q}[A]\) for all events \(A\)), then \(X\) is a stationary chain.

Reversible Markov chains

A Markov chain is reversible if the joint distribution of successive elements is symmetric, i.e., \(p_{X_n, X_{n + 1}}(x, y) = p_{X_n, X_{n + 1}}(y, x)\) for all \(x, y\) and all \(n \in \mathbb{N}.\) In a reversible chain, the conditional probability of the next and previous elements are the same (i.e., \(p_{X_{n + 1} \mid X_n} = p_{X_{n - 1} \mid X_n}\) for \(n > 0\)).

Irreducibility

A Markov chain is irreducible with respect to a probability measure on its state space if every subset of the state space with non-zero probability in the specified measure will eventually be reached from any given starting point in the state space after a finite number of steps.

More formally, suppose we have a time-homogeneous Markov chain \(X = X_0, X_1, \ldots\) over state-space \(\mathbb{R}^N.\) Further, suppose we have a probability measure \(Q\) over the usual Borel algebra for \(\mathbb{R}^N.\) In the standard definition, \(Q\) can be an arbitrary measure over a general state space. Note that the measure \(Q\) is distinct from the measure \(\Pr\) defining the Markov chain.

A Markov chain \(X = X_0, X_1, \ldots\) is \(Q\)-irreducible if for every \(A \subseteq \mathbb{R}^N\) such that \(Q[A] > 0,\) and for every point \(x \in \mathbb{R}^N,\) there exists an \(m \in \mathbb{N}\) such that \[ \Pr\left[X_{m + n} \in A \mid X_n = x\right] > 0. \]

Aperiodicity

Informally, a Markov chain is periodic if there are distinct subsets of its sample space and it cycles through them in turn. Suppose we have a Markov chain \(X = X_0, X_1, \ldots\) over \(\mathbb{R}^N.\) The period of such a chain is defined to be the largest \(M \in \mathbb{N}\) such that there is a partition \(A_0, \ldots, A_{M - 1}\) of \(\mathbb{R}^N\) such that for all \(m < M,\) \[ \Pr[X_{n + 1} \in A_{m + 1 \textrm{ mod } M} \mid X_n \in A_m] = 1. \] A chain is periodic if it has a period greater than one and is aperiodic otherwise.

The ergodic theorem

This section lays out a simple form of the ergodic theorem, which entails that under certain mild conditions and for almost every starting point, the distribution of the elements of a Markov chain converges to the chain’s stationary distribution in total variation distance.

Suppose \(X = X_0, X_1, \ldots\) is a Markov chain over the state space \(\mathbb{R}^N,\) with a Borel algebra \(\mathcal{S}\) of events and a measure \(\Pr.\) To aid in the statement of the theorem, we define a new measure \(P^n_x\) over \(A \in \mathcal{S}\) by \[ P^n_x[A] = \Pr[X_n \in A \mid X_0 = x], \] which is the probability of being in set \(A\) after \(n\) transitions starting from \(x.\)

The ergodic theorem. If \(X = X_0, X_1, \ldots\) is a Markov chain over \(\mathbb{R}^N\) with stationary measure \(Q\) that is \(Q\)-irreducible and aperiodic, then for \(Q\)-almost every \(x \in \mathbb{R}^N,\) \[ \lim_{n \rightarrow \infty} \big|\big| \, P^n_x - Q \, \big|\big| = 0. \] The double bars are notation for total variation distance, which expands here to \[\begin{align} \big|\big| \, P^n_x - Q \, \big|\big| &= \sup_{A \in \mathcal{S}} \bigg| P^n_x[A] - Q[A] \bigg| \\[4pt] &= \sup_{A \in \mathcal{S}} \bigg| \Pr[X_n \in A \mid X_0 = x] - Q[A] \bigg|. \end{align}\]

C. Monte Carlo methods

In this section, we introduce the basic theory of Monte Carlo methods and then move on to more general Markov chain Monte Carlo methods. The presentation here was strongly influenced by Roberts and Rosenthal (2004) and Rosenthal (2006).

C.1 Monte Carlo methods

Monte Carlo methods use randomness to estimate expectations. In the general case, suppose a random variable \(X:\Omega \rightarrow \mathbb{R}^N\) is defined over a probability space \((\Omega, \mathcal{S}, \Pr)\) and we want to estimate an expectation \[ \mathbb{E}\!\left[f(X)\right] = \int_{\mathbb{R}^N} f(x) \cdot p_X(x) \, \textrm{d}x. \] This integral may be difficult to evaluate analytically, but if we can take independent draws \(x^{(1)}, \ldots, x^{(M)}\) where each draw is distributed as \[ x^{(m)} \sim p_X(\cdot), \] then we can approximate the integral to any desired degree of precision as \[ \mathbb{E}\!\left[f(X)\right] \approx \frac{1}{M} \sum_{m=1}^M f\!\left(x^{(m)}\right). \] Because the draws of \(x^{(m)}\) are independent, the draws of \(f\!\left(x^{(m)}\right)\) are also independent, so the ordinary central limit theorem applies. This means that we get the right answer in the limit as the number of draws grows, \[ \lim_{M \rightarrow \infty} \frac{1}{M} \sum_{m=1}^M f\!\left(x^{(m)}\right) = \mathbb{E}\!\left[f(X)\right], \] and that the error shrinks at a rate of (only) \(\mathcal{O}\!\left(\frac{1}{\sqrt{M}}\right)\) and is approximately normal for large \(M,\) \[ \mathbb{E}\!\left[f(X)\right] - \frac{1}{M} \sum_{m=1}^M f\!\left(x^{(m)}\right) \sim \textrm{normal}\!\left(0, \frac{\textrm{sd}[f(X)]}{\sqrt{M}} \right). \] As usual, this can be a good approximation for even small values, such as \(M \approx 20.\)
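
For instance, the following sketch (our own; the seed is arbitrary, and we target \(\mathbb{E}[X^2] = 1\) for \(X \sim \textrm{normal}(0, 1)\)) shows the Monte Carlo error shrinking roughly as \(1 / \sqrt{M}.\)

import numpy as np

rng = np.random.default_rng(99)
for M in [100, 10_000, 1_000_000]:
    x = rng.normal(0, 1, M)
    estimate = np.mean(x**2)     # E[X^2] = var[X] = 1
    print(f"M = {M:>9}: estimate = {estimate:.4f}, error = {abs(estimate - 1):.4f}")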

Markov chain Monte Carlo methods drop the requirement that the draws be independent and instead generate each draw from a conditional density \(q\!\left(x^{(m+1)} \mid x^{(m)}\right),\) so that the sequence of draws forms a Markov chain. We refer to \(q\) as the transition density, though we may also consider the induced measure, \[ \Pr[X_{m + 1} \in A \mid X_m = x_m] = \int_{A} q(y \mid x_m) \, \textrm{d}y. \]
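One classic way to construct such a transition density is the Metropolis algorithm (Chib and Greenberg 1995). Here is a minimal random-walk Metropolis sketch of our own, targeting a standard normal through its unnormalized log density; the proposal scale of 0.5 is an arbitrary choice for illustration.

import numpy as np

rng = np.random.default_rng(1234)

def log_p(x):
    return -0.5 * x**2            # unnormalized log density of normal(0, 1)

M = 10_000
draws = np.empty(M)
x = 0.0
for m in range(M):
    y = x + rng.normal(scale=0.5)                   # random-walk proposal
    if np.log(rng.uniform()) < log_p(y) - log_p(x):
        x = y                                       # accept; else keep x
    draws[m] = x

print(f"mean = {draws.mean():.2f}, sd = {draws.std():.2f}")   # roughly 0 and 1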

D. Bayesian statistics

In this appendix, we provide a concise mathematical overview of Bayesian statistics. Bayesian statistics is largely concerned with estimating quantities of interest based on a joint probability model of the data and parameters, given some observed data. It is also concerned with quantifying our uncertainty in these estimates. Quantities of interest are typically parameter values and deviations, event probability estimates, and forecasts/backcasts.

D.1 Bayes’s rule for densities

Bayes (1763) formalized his approach in the theorem that now bears his name. The basic idea is that if we have parameters and data, we can calculate the conditional distribution of the parameters given the data, provided we know the distribution of the data given the parameters and the prior distribution of the parameters.

Using names, the first step of Bayes’s theorem for densities is \[\begin{align} \underbrace{p(\textrm{params} \mid \textrm{data})}_{\textrm{posterior}} &= \underbrace{p(\textrm{data} \mid \textrm{params})}_{\textrm{sampling}} \cdot \underbrace{p(\textrm{params})}_{\textrm{prior}} \, / \, \underbrace{p(\textrm{data})}_{\textrm{evidence}}. \\[4pt] \end{align}\] In general discussions, we typically use the variable \(y\) for data and \(\theta\) for model parameters. The second step of Bayes’s theorem replaces the evidence in the denominator with an integral that marginalizes the parameters out of the joint density.

Theorem 1 (Bayes’s Theorem) Given a joint density \(p(y, \theta)\), the posterior density \(p(\theta \mid y)\) can be defined in terms that only involve the prior \(p(\theta)\) and sampling distribution \(p(y \mid \theta)\), as \[ p(\theta \mid y) \ = \ \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta') \cdot p(\theta') \, \textrm{d}\theta'}. \]

Proof: \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule for densities]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta') \, \textrm{d}\theta'} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta') \cdot p(\theta') \, \textrm{d}\theta'}. & \textrm{[chain rule for densities]} \end{align}\] \(\blacksquare\)

Bayes’s theorem allows us to solve the so-called inverse problem of inferring the posterior \(p(\theta \mid y)\) when all we have is the sampling distribution \(p(y \mid \theta),\) the prior \(p(\theta)\), and data \(y\).

In most cases, Stan programs do not require normalized log densities. This allows us to go a step further and drop the denominator \(p(y),\) which does not depend on \(\theta,\) and write Bayes’s rule up to a proportion as \[ p(\theta \mid y) \ \propto \ p(y \mid \theta) \cdot p(\theta). \]

This lets us proceed with only an unnormalized sampling density \(p(y \mid \theta)\) and unnormalized prior \(p(\theta)\). About the only time that a normalizing constant is required is for comparing two models’ prior predictive distributions, which is not something we would recommend doing. The prior predictive distribution is just the distribution of the data in the model (aka, the evidence), \[ p(\tilde{y}) = \int_{\Theta} p(\tilde{y} \mid \theta) \cdot p(\theta) \, \textrm{d}\theta. \] The prior predictive distribution is useful for simulating data to do prior predictive checks, but we prefer to compare models using posterior predictive checks, which condition on some “training” data \(y,\) \[ p(\tilde{y} \mid y) = \int_{\Theta} p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \] Here we have only replaced the prior \(p(\theta)\) in the prior predictive formula with the posterior \(p(\theta \mid y)\) conditioned on observed data \(y.\)

Simulating data \(y\) from potential values of \(\theta\) drawn from the prior in this way is useful for checking whether the priors are sensible.
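A minimal prior predictive simulation sketch for the beta-binomial model developed below (the values of \(\alpha,\) \(\beta,\) and \(N\) here are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(1234)
alpha, beta, N = 2, 2, 20                     # illustrative prior and trials

theta = rng.beta(alpha, beta, size=10_000)    # parameter draws from the prior
y_sim = rng.binomial(N, theta)                # prior predictive draws of y

print(f"prior predictive mean of y: {y_sim.mean():.1f} successes out of {N}")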

D.2 Bayesian inference

Bayesian inference is largely about estimating quantities of interest based on a probability model and observed data. In typical applied Bayesian inference problems, we are interested in three quantities that can be expressed as expectations: parameter estimates, event probabilities, and probabilistic prediction. All of these quantities are expressed as expectations over the posterior, meaning that they involve averaging over our uncertainty in parameter values.

We are also interested in uncertainty, which is defined by the posterior. We typically summarize uncertainty using quantiles, and in particular, posterior intervals (sometimes called “credible intervals”).

Parameter estimation

The first quantity of interest is the value of parameters. The standard Bayesian parameter estimate is the posterior mean, or conditional expectation given the data. Given a model \(p(y, \theta)\) and observed data \(y\), the Bayesian posterior mean estimate of the parameters \(\theta\) is written as \(\widehat{\theta},\) where as usual, a hat is placed over a parameter to signify the value is an estimate of the variable without the hat. The definition for the Bayesian estimator is \[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\] The posterior mean is the value that minimizes expected square error in the estimates if the model is well specified in the sense of representing the true data-generating process. Square error is the squared \(\textrm{L}_2\) norm of the difference between the estimate and the true parameter values. We can expand these definitions down to basic form.

\[\begin{align} \widehat{\theta} &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \left|\left| \, u - \Theta \, \right|\right|^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. (u - \Theta)^{\top} \cdot (u - \Theta) \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \sum_{d=1}^D \, (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \mathbb{E}\!\left[ \left. (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \int_{\Theta} (u_d - \theta_d)^2 \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]

The final estimate is a weighted average over the posterior \(p(\theta \mid y),\) which is called out by the form \(\int_{\Theta} \cdots p(\theta \mid y) \, \textrm{d}\theta,\) which we see again and again with expectations.
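In practice we rarely evaluate this integral analytically. With posterior draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) from a sampler such as Stan, the posterior mean reduces to the Monte Carlo average of the draws, and posterior intervals to quantiles of the draws. A sketch, where the \(\textrm{beta}(8, 14)\) draws are an arbitrary stand-in for a real posterior:

import numpy as np

rng = np.random.default_rng(1234)
theta_draws = rng.beta(8, 14, size=4000)   # stand-in for posterior draws

theta_hat = theta_draws.mean()             # posterior mean estimate
lo, hi = np.quantile(theta_draws, [0.05, 0.95])   # central 90% interval
print(f"estimate = {theta_hat:.3f}, 90% posterior interval = ({lo:.3f}, {hi:.3f})")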

Event probabilities

An event in statistics is a subset of the parameter space, \(A \subseteq \Theta\), where \(\Theta\) is the set of all valid parameter values. We usually pick out events using conditions on the parameter space. For example, the condition \(\theta > 0.5\) defines the event \(A = \left\{ \theta \in \Theta : \theta > 0.5 \right\}\).

The probability of an event conditioned on data is just another posterior expectation, this time of \[\begin{align} \textrm{Pr}[A \mid y] &= \mathbb{E}[\textrm{I}[\Theta \in A] \mid y] \\[6pt] &= \int_{\Theta} \textrm{I}(\theta \in A) \cdot p(\theta \mid y) \, \textrm{d}\theta, \end{align}\] where \(\textrm{I}\) is a function defined over boolean values. The expression \(\textrm{I}(\theta \in A)\) takes on value 1 if \(\theta \in A\) and value 0 otherwise. We write \(\textrm{I}[\Theta \in A]\) using square brackets because a random variable is a function, whereas we write \(\textrm{I}(\theta \in A)\) because \(\theta \in \mathbb{R}^D\) is a value; see Appendix B.4. Random variables.
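With posterior draws in hand, the indicator expectation is just the proportion of draws that fall in the event. For the event \(\theta > 0.5\) (again using hypothetical stand-in draws):

import numpy as np

rng = np.random.default_rng(1234)
theta_draws = rng.beta(8, 14, size=4000)     # stand-in posterior draws

pr_event = np.mean(theta_draws > 0.5)        # estimate of Pr[theta > 0.5 | y]
print(f"Pr[theta > 0.5 | y] is approximately {pr_event:.3f}")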

Posterior predictive inference

Often we are interested in predicting new data \(\tilde{y}\) given the observation of existing data \(y\). This is a form of posterior predictive inference. For example, \(y\) might be the price of a stock over some time period and \(\tilde{y}\) the price of the stock in the future (we use \(\tilde{y}\) as a hint that we are talking about a predictive quantity). Or \(y\) might be the result of past games and \(\tilde{y}\) the new outcome representing the winner of tomorrow’s game. Posterior predictive inference is also cast as an expectation, this time of a density.

\[\begin{align} p(\tilde{y} \mid y) &= \mathbb{E}\!\left[ \, p(\tilde{y} \mid \Theta) \mid y \, \right] \\[6pt] &= \int_{\Theta} p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]
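In sampling terms, this expectation says to draw \(\theta^{(m)}\) from the posterior and then draw \(\tilde{y}^{(m)}\) from the sampling distribution given \(\theta^{(m)}.\) A sketch for the beta-binomial model, with stand-in posterior draws and an arbitrary number of new trials:

import numpy as np

rng = np.random.default_rng(1234)
N_new = 20                                   # trials in the new data set
theta_draws = rng.beta(8, 14, size=4000)     # stand-in posterior draws

y_tilde = rng.binomial(N_new, theta_draws)   # posterior predictive draws
print(f"predicted successes: {y_tilde.mean():.1f} out of {N_new}")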

D.3 Conjugate priors

In a few limited, though very useful, cases, we can derive posterior distributions analytically.

Suppose we have a family of distributions \(\mathcal{G}\) and a family of sampling distributions \(\mathcal{F}\) that share the same parameter space. We say that \(\mathcal{G}\) is the conjugate prior family for the sampling distribution \(\mathcal{F}\) if a prior from \(\mathcal{G}\) and sampling distribution from \(\mathcal{F}\) entail that the posterior distribution will be in \(\mathcal{G}.\)

Example. The class \(\mathcal{G}\) of beta distributions, \[ \textrm{beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1}, \] is the conjugate prior for the class of binomial sampling distributions, \[ \textrm{binomial}(y \mid N, \theta) \propto \theta^{y} \cdot (1 - \theta)^{N - y}. \] Suppose we have a joint model \[ p(y, \theta \mid N, \alpha, \beta) = \textrm{beta}(\theta \mid \alpha, \beta) \cdot \textrm{binomial}(y \mid N, \theta). \]
The posterior can be derived by Bayes’s rule as \[\begin{align} p(\theta \mid y, N, \alpha, \beta) &\propto \textrm{beta}(\theta \mid \alpha, \beta) \cdot \textrm{binomial}(y \mid N, \theta) \\[4pt] &= \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1} \cdot \theta^{y} \cdot (1 - \theta)^{N - y} \\[4pt] &= \theta^{y + \alpha - 1} \cdot (1 - \theta)^{N - y + \beta - 1} \\[4pt] &\propto \textrm{beta}(\theta \mid y + \alpha, N - y + \beta). \end{align}\] From here, we can conclude that the posterior is a beta distribution, because if \[ p(\theta \mid y, N, \alpha, \beta) \propto \textrm{beta}(\theta \mid y + \alpha, N - y + \beta), \] then \[\begin{align} p(\theta \mid y, N, \alpha, \beta) &= \frac{\textrm{beta}(\theta \mid y + \alpha, N - y + \beta)} {\int_{\Theta} \textrm{beta}(\theta \mid y + \alpha, N - y + \beta) \, \textrm{d}\theta} \\[8pt] &= \frac{\textrm{beta}(\theta \mid y + \alpha, N - y + \beta)} {1} \\[6pt] &= \textrm{beta}(\theta \mid y + \alpha, N - y + \beta). \end{align}\]
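The algebra can be double-checked numerically. The following sketch of our own (with arbitrary data values) compares the analytic conjugate posterior against a brute-force grid normalization of prior times likelihood using SciPy:

import numpy as np
import scipy.stats as st

alpha, beta, N, y = 2, 2, 20, 13             # illustrative prior and data

theta = np.linspace(0.001, 0.999, 999)       # grid over (0, 1)
unnorm = st.beta.pdf(theta, alpha, beta) * st.binom.pmf(y, N, theta)
grid_posterior = unnorm / np.trapz(unnorm, theta)      # normalize on the grid

analytic = st.beta.pdf(theta, y + alpha, N - y + beta) # conjugate posterior
print(f"max abs difference: {np.max(np.abs(grid_posterior - analytic)):.4f}")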

Conjugate priors as data

Conjugate priors are always from the exponential family of distributions (which we do not characterize here). The reason for this is that the prior and sampling distribution both need to be exponential functions of the parameters so that their effects can be added and remain within the prior family. For example, normals have density functions that are exponential functions of their argument and fall in the exponential family.

Continuing the example of a \(\textrm{beta}(\alpha, \beta)\) distribution, consider the case where \(\alpha = \beta = 1.\) The resulting beta distribution is uniform over \(\theta \in (0, 1),\) \[ \textrm{beta}(\theta \mid 1, 1) \propto \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} = 1. \] Consider observing \(y\) successes out of \(N,\) and hence \(N - y\) failures. The posterior will be \(\textrm{beta}(y + 1, N - y + 1).\) Given that we started from a uniform distribution, this represents \(y\) successes and \(N - y\) failures. More generally, a \(\textrm{beta}(\alpha, \beta)\) prior acts as if we had already observed \(\alpha - 1\) successes and \(\beta - 1\) failures before seeing the data.

Chaining inferences

Suppose we have observed data \(y\) and parameter \(\theta\) with sampling distribution \(p(y \mid \theta)\) and conjugate prior \(p(\theta \mid a_0)\) for some fixed parameters \(a_0.\) Now suppose we observe data \(y_1.\) We will have a posterior \(p(\theta \mid a_1),\) where \(a_1\) collects the updated fixed parameters. This posterior after one observation \(y_1\) may be used as the prior for the next observation \(y_2.\) We can continue to chain inference this way and will wind up with the same result as if we had taken a sampling distribution with density \(\prod_{n = 1}^N p(y_n \mid \theta)\) and observed \(y_1, \ldots, y_N\) all at once.

Example (cont.) Continuing the beta-binomial example, suppose we start with a uniform prior \(\textrm{beta}(1,1).\) Now suppose we observe \(y_1 = 1, N_1 = 1\), 1 success out of 1 try. The posterior after observing \(y_1\) is \(\textrm{beta}(2, 1).\) This lets us update what we know about \(\theta\) and use \(\textrm{beta}(2, 1)\) as our prior for the next observation. Suppose the next observation is \(y_2 = 1,\) another success. This gives us a posterior of \(\textrm{beta}(3, 1),\) which becomes the prior for the next observation and so on. In this way, we can break a single binomial update down into a sequence of smaller updates, even down to \(N = 1,\) where we take the results one at a time.
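A quick sketch (with arbitrary made-up outcomes) verifying that one-at-a-time conjugate updating matches the batch update:

import numpy as np

ys = np.array([1, 1, 0, 1, 0, 0, 1, 1])   # illustrative binary outcomes

alpha, beta = 1, 1                         # uniform beta(1, 1) prior
for y_n in ys:                             # update one observation at a time
    alpha += y_n
    beta += 1 - y_n
print(f"chained posterior: beta({alpha}, {beta})")

N, y = len(ys), ys.sum()                   # batch update all at once
print(f"batch posterior:   beta({1 + y}, {1 + N - y})")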

E. Installed packages

The Python version and operating system are as follows.

import sys
import platform

print("\nSystem")
print(sys.version)

print("\nOperating System")
print(f"""  {platform.system() = }
  {platform.release() = }
  {platform.version() = }""")

System
3.11.3 (main, Apr  7 2023, 19:25:52) [Clang 14.0.0 (clang-1400.0.29.202)]

Operating System
  platform.system() = 'Darwin'
  platform.release() = '22.3.0'
  platform.version() = 'Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64'

The installed packages (i.e., the working set) are as follows.

import pkg_resources

print("\nInstalled packages:")
for dist in pkg_resources.working_set:
    print(dist)

Installed packages:
Jinja2 3.1.2
MarkupSafe 2.1.2
Pillow 9.5.0
PyYAML 6.0
Pygments 2.15.1
QtPy 2.3.1
Send2Trash 1.8.2
anyio 3.6.2
appnope 0.1.3
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
attrs 23.1.0
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.0.0
cffi 1.15.1
cmdstanpy 1.1.0
comm 0.1.3
contourpy 1.0.7
cycler 0.11.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.6
executing 1.2.0
fastjsonschema 2.16.3
filelock 3.12.0
fonttools 4.39.3
fqdn 1.5.1
idna 3.4
ipykernel 6.22.0
ipython 8.13.1
ipython-genutils 0.2.0
ipywidgets 8.0.6
isoduration 20.11.0
jedi 0.18.2
jsonpointer 2.3
jsonschema 4.17.3
jupyter 1.0.0
jupyter-client 8.2.0
jupyter-console 6.6.3
jupyter-core 5.3.0
jupyter-events 0.6.3
jupyter-server 2.5.0
jupyter-server-terminals 0.4.4
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.7
kiwisolver 1.4.4
matplotlib 3.7.1
matplotlib-inline 0.1.6
mistune 2.0.5
mizani 0.9.0
mpmath 1.3.0
mypy-extensions 1.0.0
nbclassic 0.5.6
nbclient 0.7.4
nbconvert 7.3.1
nbformat 5.8.0
nest-asyncio 1.5.6
networkx 3.1
notebook 6.5.4
notebook-shim 0.2.3
numpy 1.24.3
packaging 23.1
pandas 2.0.1
pandocfilters 1.5.0
parso 0.8.3
patchworklib 0.6.0
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1.2
platformdirs 3.5.0
plotnine 0.10.1
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 4.21.12
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
pyparsing 3.0.9
pyre-extensions 0.0.29
pyrsistent 0.19.3
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3
pyzmq 25.0.2
qtconsole 5.4.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
scipy 1.10.1
setuptools 67.6.1
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
stack-data 0.6.2
statsmodels 0.13.5
sympy 1.12
terminado 0.17.1
tinycss2 1.2.1
torch 2.0.1
tornado 6.3.1
tqdm 4.65.0
traitlets 5.9.0
typing-extensions 4.6.3
typing-inspect 0.9.0
tzdata 2023.3
uri-template 1.2.0
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.5.1
wheel 0.40.0
widgetsnbextension 4.0.7
xformers 0.0.20

References

Anderson, Brian D. O., and John B. Moore. 1979. Optimal Filtering. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Bayes, Thomas. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society of London, no. 53: 370–418.
Betancourt, Michael. 2017a. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1701.02434.
———. 2017b. “Identifying Bayesian Mixture Models.” https://betanalpha.github.io/assets/case_studies/identifying_mixture_models.html.
Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC Press/Chapman & Hall.
Carpenter, Bob, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015. “The Stan Math Library: Reverse-Mode Automatic Differentiation in C++.” arXiv, no. 1509.07164.
Casella, George, and Edward I George. 1992. “Explaining the Gibbs Sampler.” The American Statistician 46 (3): 167–74.
Chib, Siddhartha, and Edward Greenberg. 1995. “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49 (4): 327–35.
Dawid, A Philip. 1982. “The Well-Calibrated Bayesian.” Journal of the American Statistical Association 77 (379): 605–10.
De Moivre, Abraham. 1718. The Doctrine of Chances: Or, a Method of Calculating the Probabilities of Events in Play. W. Pearson, for the author.
Devroye, Luc. 1986. Non-Uniform Random Variate Generation. New York: Springer Science+Business Media, LLC.
Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate Priors for Exponential Families.” The Annals of Statistics, 269–81.
Doucet, Arnaud, Nando De Freitas, and Neil Gordon. 2001. “An Introduction to Sequential Monte Carlo Methods.” In Sequential Monte Carlo Methods in Practice, edited by Arnaud Doucet, Nando De Freitas, and Neil Gordon, 3–14. New York: Springer.
Duane, Simon, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. 1987. “Hybrid Monte Carlo.” Physics Letters B 195 (2): 216–22.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” Journal of the Royal Statistical Society Series A: Statistics in Society 182 (2): 389–402.
Galilei, Galileo. 1638. Dialogues Concerning Two New Sciences. 1954 translation by Henry Crew and Alfonso de Salvio. New York: Dover.
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. Third Edition. CRC Press.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” arXiv, no. 2011.01808.
Gilks, Walter R, and Pascal Wild. 1992. “Adaptive Rejection Sampling for Gibbs Sampling.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 41 (2): 337–48.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2): 243–68.
Hoffman, Matthew D, and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–623.
Jacob, Pierre E, John O’Leary, and Yves F Atchadé. 2020. “Unbiased Markov Chain Monte Carlo Methods with Couplings.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3).
Laplace, Pierre Simon. 1774. “Mémoire Sur La Probabilité de Causes Par Les évènements.” Mémoires de Mathématique Et de Physique Presentés à l’Académie Royale Des Sciences 6: 621–56.
Laplace, Pierre-Simon. 1811. “Mémoire Sur Les Approximations Des Formules Qui Sont Fonctions de Très Grands Nombres Et Sur Leur Application Aux Probabilités.” Mémoires de l’Académie Royale Des Sciences de Paris Année 1809: 353–415.
———. 1814. Essai Philosophique Sur Les Probabilités. Paris: Courcier.
Lin, Hanti. 2022. “Bayesian Epistemology.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta and Uri Nodelman, Fall 2022. https://plato.stanford.edu/archives/fall2022/entries/epistemology-bayesian/; Metaphysics Research Lab, Stanford University.
Little, Roderick J. 2006. “Calibrated Bayes: A Bayes/Frequentist Roadmap.” The American Statistician 60 (3): 213–23.
Livingstone, S, M Betancourt, S Byrne, and M Girolami. 2019. “On the Geometric Ergodicity of Hamiltonian Monte Carlo.” Bernoulli 25 (4A): 3109–38.
Lunn, David, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press/Chapman & Hall.
Lyapunov, Aleksandr. 1900–1901. “Nouvelle Forme Du Théorème Sur La Limite de La Probabilité.” Mémoires de l’Académie Impériale Des Sciences de St. Pétersbourg 11–12.
Marin, Jean-Michel, Pierre Pudlo, Christian P Robert, and Robin J Ryder. 2012. “Approximate Bayesian Computational Methods.” Statistics and Computing 22 (6): 1167–80.
McElreath, Richard. 2023. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second Edition. CRC Press.
Mill, John Stuart. 1882. A System of Logic: Ratiocinative and Inductive. Eighth Edition. New York: Harper & Brothers, Publishers.
Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
Ramalho, Luciano. 2022. Fluent Python. Second Edition. O’Reilly Media, Inc.
Roberts, Gareth O, and Jeffrey S Rosenthal. 2004. “General State Space Markov Chains and MCMC Algorithms.” Probability Surveys 1: 20–71.
Rosenthal, Jeffrey S. 2006. A First Look at Rigorous Probability Theory. World Scientific Publishing Co.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.
Stan Development Team. 2023a. “Stan Functions Reference.” https://mc-stan.org/docs/functions-reference/index.html.
———. 2023b. “Stan Reference Manual.” https://mc-stan.org/docs/reference-manual/index.html.
———. 2023c. “Stan User’s Guide.” https://mc-stan.org/docs/stan-users-guide/index.html.
VanderPlas, Jake. 2023. Python Data Science Handbook: Essential Tools for Working with Data. Second Edition. O’Reilly Media, Inc.
Wilkinson, Leland. 2005. The Grammar of Graphics. Second Edition. Springer.

Footnotes

  1. The statistical sampling literature often overloads “sample” to mean both a sample and a draw. We will try to stick to the notation where a sample consists of a sequence of one or more draws.↩︎