Let the random variable \(Z \in \{0, 1\}\) represent an individual's disease status, where \(Z = 1\) indicates the individual is positive for the disease and \(Z = 0\) indicates they are negative.
The prevalence of a disease (in a given [sub]population) is the probability of an individual being positive, \[ \textrm{prevalence} = \textrm{Pr}[Z = 1]. \]
Let the random variable \(Y \in \{0, 1\}\) represent an individual's test result, where \(Y = 1\) indicates a positive test result and \(Y = 0\) a negative result.
The sensitivity of a test is its accuracy for positive individuals, \[ \textrm{sensitivity} = \textrm{Pr}[Y = 1 \mid Z = 1]. \]
The specificity of a test is its accuracy for negative individuals, \[ \textrm{specificity} = \textrm{Pr}[Y = 0 \mid Z = 0]. \]
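As a concrete check (a Python sketch for illustration, not part of the original analysis), the sample sensitivity and specificity are just the observed fractions of correct results among known positives and known negatives; the counts used below are the first rows of the sensitivity and specificity tables that follow.

```python
# Illustrative sketch: sample sensitivity and specificity as observed
# fractions of correct test results among known positives/negatives.

def sample_sensitivity(true_pos: int, n_pos_subjects: int) -> float:
    """Fraction of known-positive subjects (Z = 1) with a positive test (Y = 1)."""
    return true_pos / n_pos_subjects

def sample_specificity(true_neg: int, n_neg_subjects: int) -> float:
    """Fraction of known-negative subjects (Z = 0) with a negative test (Y = 0)."""
    return true_neg / n_neg_subjects

print(round(sample_sensitivity(78, 85), 3))    # 0.918 (first sensitivity study)
print(round(sample_specificity(368, 371), 3))  # 0.992 (first specificity study)
```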
The data (Bendavid et al. 2020) consist of 17 studies: 3 sensitivity studies of known-positive subjects, 13 specificity studies of known-negative subjects, and 1 seroprevalence study of subjects with unknown status.

The sensitivity data:

pos_tests | tests | sample_sens |
---|---|---|
78 | 85 | 0.918 |
27 | 37 | 0.730 |
25 | 35 | 0.714 |
The specificity data:

neg_tests | tests | sample_spec |
---|---|---|
368 | 371 | 0.992 |
30 | 30 | 1.000 |
70 | 70 | 1.000 |
1102 | 1102 | 1.000 |
300 | 300 | 1.000 |
311 | 311 | 1.000 |
500 | 500 | 1.000 |
198 | 200 | 0.990 |
99 | 99 | 1.000 |
29 | 31 | 0.935 |
146 | 150 | 0.973 |
105 | 108 | 0.972 |
50 | 52 | 0.962 |
The unknown-status (seroprevalence) data:

pos_tests | tests | sample_prev |
---|---|---|
50 | 3330 | 0.015 |
The pooled sample sensitivity is 0.828, pooled sample specificity is 0.995, and pooled sample prevalence is 0.015.
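The pooled figures can be reproduced directly from the tables; a quick Python check, with the counts copied from the tables above:

```python
# Reproduce the pooled sample statistics from the tables above.
sens_pos = [78, 27, 25]
sens_n   = [85, 37, 35]
spec_neg = [368, 30, 70, 1102, 300, 311, 500, 198, 99, 29, 146, 105, 50]
spec_n   = [371, 30, 70, 1102, 300, 311, 500, 200, 99, 31, 150, 108, 52]

pooled_sens = sum(sens_pos) / sum(sens_n)  # 130 / 157  ~= 0.828
pooled_spec = sum(spec_neg) / sum(spec_n)  # 3308 / 3324 ~= 0.995
pooled_prev = 50 / 3330                    # ~= 0.015

print(round(pooled_sens, 3), round(pooled_spec, 3), round(pooled_prev, 3))
```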
Using sensitivity as an example (specificity is identical other than some signs), the meta-analysis assumes a test’s sensitivity and specificity vary by site with normal noise on the log odds scale. We will parameterize sensitivity and specificity using log odds, where the log odds function \(\textrm{logit}:(0, 1) \rightarrow \mathbb{R}\) is \[ \textrm{logit}(u) = \log \frac{u}{1 - u}. \] Its inverse is the logistic function, \(\textrm{logit}^{-1}:\mathbb{R} \rightarrow (0, 1),\) \[ \textrm{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)}. \]
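A minimal Python rendering of these two maps (the function names are mine):

```python
import math

def logit(u: float) -> float:
    """Log odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(u / (1 - u))

def inv_logit(v: float) -> float:
    """Logistic function: inverse of logit, maps the real line back to (0, 1)."""
    return 1 / (1 + math.exp(-v))

# Round trip: inv_logit(logit(u)) recovers u (up to floating point).
print(round(inv_logit(logit(0.828)), 6))  # 0.828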
The sensitivity of site \(k\) will be represented by the variable \(\theta_{1, k}\) (the \(1\) is for sensitivity as opposed to specificity).
The tests are assumed to be independent so that the number of positive tests at a site may be modeled as binomial, \[ y_k \sim \textrm{binomial_logit}(N_k, \theta_{1, k}). \] We’ve used a parameterization of the binomial’s chance of success parameter on the log odds scale, where \[ \textrm{binomial_logit}(N, \theta) = \textrm{binomial}(N, \textrm{logit}^{-1}(\theta)). \]
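In Python (an illustrative sketch, not the Stan code used below), a binomial-logit random number generator just transforms the log odds parameter before simulating the trials:

```python
import math
import random

def inv_logit(v: float) -> float:
    return 1 / (1 + math.exp(-v))

def binomial_logit_rng(N: int, theta: float, rng: random.Random) -> int:
    """Draw from binomial(N, inv_logit(theta)): number of successes in N
    independent trials whose success probability is given on the log odds scale."""
    p = inv_logit(theta)
    return sum(rng.random() < p for _ in range(N))

rng = random.Random(1234)
draw = binomial_logit_rng(100, 2.0, rng)  # success probability inv_logit(2.0), about 0.88
print(0 <= draw <= 100)  # True
```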
The model will assume a test sensitivity log odds \(\mu_1\) and scale of variation \(\sigma_1\) such that the sensitivity log odds \(\theta_{1, k}\) at site \(k\) is distributed as \[ \theta_{1, k} \sim \textrm{normal}(\mu_1, \sigma_1). \] This will pool (or regularize) the estimates \(\theta_{1, k}\) toward \(\mu_1\), with \(\sigma_1\) controlling the scale of inter-site variation in sensitivity.
As \(\sigma_1 \rightarrow 0\), the model approaches a complete pooling model in which there is no variation in sensitivity among the sites. As \(\sigma_1 \rightarrow \infty\), the model approaches a no-pooling model in which each site’s sensitivity is estimated independently. The hierarchical model partially pools: it estimates \(\sigma_1\) from the data, and that estimate in turn determines how strongly the \(\theta_{1, k}\) are pulled toward \(\mu_1\).
The hyperparameters \(\mu_1, \sigma_1\) should get hyperpriors based on what we know about this disease and about serum testing variation among sites. Assuming we know nothing, we could give the probability of a correct test, \(\textrm{logit}^{-1}(\mu_1)\), a uniform distribution, which implies a standard logistic distribution on the log odds scale, \[ \mu_1 \sim \textrm{logistic}(0, 1). \] The prior on the scale parameter should be consistent with zero, so we do not want to give it a lognormal or inverse-gamma family prior, both of which have density approaching zero at the origin. We do not expect huge variation on the log odds scale, so we can use a simple standard half-normal hyperprior on \(\sigma_1\), \[ \sigma_1 \sim \textrm{normal}_+(0, 1). \]
If more is known about the tests, these hyperpriors should be tighter. Without a lot of data (both per site and number of sites), the posteriors will be sensitive to these hyperpriors.
To predict future test results at site \(k\), the probability of getting \(n\) positive tests out of the next \(N\) tests where the subject is positive, given observed (training) data \(y\), is given by the posterior predictive distribution, \[\begin{eqnarray*} p(n \mid N, k) & = & \int \underbrace{\textrm{binomial_logit}(n \mid N, \theta_{1, k})} _{\textrm{sampling uncertainty}} \cdot \underbrace{p(\theta_{1, k} \mid y)} _{\textrm{estimation uncertainty}} \, \textrm{d}{\theta_{1, k}} \\[6pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{binomial_logit}(n \mid N, \theta_{1, k}^{(m)}), \end{eqnarray*}\] where \[ \theta_{1, k}^{(m)} \sim p(\theta_{1, k} \mid y) \] is a draw from the posterior (binomial-logit determines sampling uncertainty and drawing from the posterior the estimation uncertainty).
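The Monte Carlo approximation is straightforward to sketch in Python, assuming we already have posterior draws \(\theta_{1, k}^{(m)}\) (the list below is hypothetical):

```python
import math

def inv_logit(v: float) -> float:
    return 1 / (1 + math.exp(-v))

def binomial_pmf(n: int, N: int, p: float) -> float:
    return math.comb(N, n) * p**n * (1 - p)**(N - n)

def posterior_predictive(n: int, N: int, theta_draws: list) -> float:
    """Average the binomial-logit pmf over posterior draws of the log odds
    sensitivity: sampling uncertainty inside the pmf, estimation
    uncertainty from the spread of the draws."""
    M = len(theta_draws)
    return sum(binomial_pmf(n, N, inv_logit(t)) for t in theta_draws) / M

# Hypothetical posterior draws of theta_{1,k}; the predictive pmf sums to 1.
theta_draws = [1.4, 1.7, 1.5, 1.9]
print(sum(posterior_predictive(n, 10, theta_draws) for n in range(11)))
```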
Suppose we gather data from a new testing site \(k'\) not in the original data. The posterior predictive distribution for new data is identical in form, \[\begin{eqnarray*} p(n \mid N, k') & = & \int \underbrace{\textrm{binomial_logit}(n \mid N, \theta_{1, k'})} _{\begin{array}{l} \textrm{sampling} \\ \textrm{uncertainty} \end{array}} \cdot \underbrace{p(\theta_{1, k'} \mid y)} _{\begin{array}{l} \textrm{estimation} \\ \textrm{uncertainty} \end{array}} \, \textrm{d}{\theta_{1, k'}} \\[6pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{binomial_logit}(n \mid N, \theta_{1, k'}^{(m)}), \end{eqnarray*}\] where \(\theta_{1, k'}^{(m)} \sim p(\theta_{1, k'} \mid y)\) is a draw from the posterior.
The only difference is that now there is no data informing \(\theta_{1, k'}\), only the hyperprior parameters \(\mu_1, \sigma_1\) estimated through the meta-analysis, that is, as \(\theta_{1, k'} \sim \textrm{normal}(\mu_1, \sigma_1)\). This allows the sampling for MCMC to be refactored as \[ \theta_{1, k'}^{(m)} \sim \textrm{normal}(\mu_1^{(m)}, \sigma_1^{(m)}), \] where \(\mu_1^{(m)}, \sigma_1^{(m)} \sim p(\mu_1, \sigma_1 \mid y).\)
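This refactoring is easy to sketch in Python, assuming lists `mu_draws` and `sigma_draws` of posterior draws of \(\mu_1^{(m)}, \sigma_1^{(m)}\) (hypothetical names):

```python
import random

def new_site_theta_draws(mu_draws, sigma_draws, rng):
    """For each posterior draw (mu^(m), sigma^(m)) of the hyperparameters,
    draw a log odds sensitivity for an unobserved site k' from
    normal(mu^(m), sigma^(m))."""
    return [rng.gauss(mu, sigma) for mu, sigma in zip(mu_draws, sigma_draws)]

rng = random.Random(42)
# Hypothetical hyperparameter draws; one theta draw per posterior draw.
thetas = new_site_theta_draws([1.6, 1.5, 1.7], [0.2, 0.3, 0.25], rng)
print(len(thetas))  # 3
```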
Specificity meta-analysis is identical to sensitivity meta-analysis (with some signs reversed). We can now use the predictive inference for sensitivity and specificity at our new site, along with the observed number of positive tests there, to estimate the prevalence of the disease (in the sampled [sub]population). Suppose we have a site with log odds sensitivity \(\theta_{1, k}\) and log odds specificity \(\theta_{0, k}\). If the disease prevalence is \(\pi\), the probability of a positive test result is \[ \textrm{Pr}[Y = 1] = \pi \cdot \textrm{logit}^{-1}(\theta_{1, k}) + (1 - \pi) \cdot (1 - \textrm{logit}^{-1}(\theta_{0, k})), \] the sum of the true positive rate among the infected fraction and the false positive rate among the uninfected fraction. For a total of \(N\) independent tests, the number of positive tests is thus distributed binomially with the preceding probability of success, \[ n \sim \textrm{binomial}\left(N, \pi \cdot \textrm{logit}^{-1}(\theta_{1, k}) + (1 - \pi) \cdot (1 - \textrm{logit}^{-1}(\theta_{0, k}))\right). \]
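The mixture probability is worth checking numerically. In Python, with the pooled sample values from above (illustrative only; the model uses per-site parameters):

```python
def positive_test_prob(prev: float, sens: float, spec: float) -> float:
    """Pr[Y = 1]: true positives from the infected fraction plus
    false positives from the uninfected fraction."""
    return prev * sens + (1 - prev) * (1 - spec)

# With the pooled sample values, false positives (0.985 * 0.005)
# are a substantial share of all positive results.
p = positive_test_prob(0.015, 0.828, 0.995)
print(round(p, 4))  # 0.0173
```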
If we assume a simple uniform prior on the prevalence \(\pi \in (0, 1)\), the entire model is \[\begin{eqnarray*} \pi & \sim & \textrm{uniform}(0, 1) \\ \mu_0, \mu_1 & \sim & \textrm{logistic}(0, 1) \\ \sigma_0, \sigma_1 & \sim & \textrm{normal}_+(0, 1) \\ \theta_{0, k} & \sim & \textrm{normal}(\mu_0, \sigma_0) \\ \theta_{1, k} & \sim & \textrm{normal}(\mu_1, \sigma_1) \\ n^{\textrm{pos}}_k & \sim & \textrm{binomial_logit}(N^{\textrm{pos}}_k, \theta_{1, k}) \\ n^{\textrm{neg}}_k & \sim & \textrm{binomial_logit}(N^{\textrm{neg}}_k, \ \theta_{0, k}) \\ n^{\textrm{unk}}_k & \sim & \textrm{binomial}\left(N^{\textrm{unk}}_k, \pi \cdot \textrm{logit}^{-1}(\theta_{1, k}) + (1 - \pi) \cdot (1 - \textrm{logit}^{-1}(\theta_{0, k})) \right) \end{eqnarray*}\] Some care is required to avoid clashing \(k\) subscripts in an implementation, because the sites indexed in the sensitivity, specificity, and unknown-status data sets are different.
The posterior \(p(\pi \mid n^{\textrm{pos}}, N^{\textrm{pos}}, n^{\textrm{neg}}, N^{\textrm{neg}}, n^{\textrm{unk}}, N^{\textrm{unk}})\) characterizes what we know about the prevalence of the disease given the observed sensitivity data, specificity data, and unknown-status data.
Here is the full Stan model implementing this meta-analysis.
```stan
data {
  int<lower = 0> K_pos;                     // number of sensitivity studies
  int<lower = 0> N_pos[K_pos];              // tests on known positives
  int<lower = 0> n_pos[K_pos];              // positive results among them
  int<lower = 0> K_neg;                     // number of specificity studies
  int<lower = 0> N_neg[K_neg];              // tests on known negatives
  int<lower = 0> n_neg[K_neg];              // negative results among them
  int<lower = 0> K_unk;                     // number of unknown-status studies
  int<lower = 0> N_unk[K_unk];              // tests on unknown-status subjects
  int<lower = 0> n_unk[K_unk];              // positive results among them
  real<lower = 0> sigma_sigma_logit_sens;   // hyperprior scale for sigma (sens)
  real<lower = 0> sigma_sigma_logit_spec;   // hyperprior scale for sigma (spec)
}
parameters {
  real mu_logit_sens;
  real mu_logit_spec;
  real<lower = 0> sigma_logit_sens;
  real<lower = 0> sigma_logit_spec;
  // offset/multiplier gives a non-centered parameterization of the hierarchy
  vector<offset = mu_logit_sens,
         multiplier = sigma_logit_sens>[K_pos] logit_sens;
  vector<offset = mu_logit_spec,
         multiplier = sigma_logit_spec>[K_neg] logit_spec;
  vector<offset = mu_logit_sens,
         multiplier = sigma_logit_sens>[K_unk] logit_sens_unk;
  vector<offset = mu_logit_spec,
         multiplier = sigma_logit_spec>[K_unk] logit_spec_unk;
  real<lower = 0, upper = 1> pi;
}
transformed parameters {
  // probabilities on the (0, 1) scale for the mixture below
  vector[K_pos] sens = inv_logit(logit_sens);
  vector[K_neg] spec = inv_logit(logit_spec);
  vector[K_unk] sens_unk = inv_logit(logit_sens_unk);
  vector[K_unk] spec_unk = inv_logit(logit_spec_unk);
}
model {
  // sensitivity meta-analysis
  mu_logit_sens ~ normal(4, 2);
  sigma_logit_sens ~ normal(0, sigma_sigma_logit_sens);
  logit_sens ~ normal(mu_logit_sens, sigma_logit_sens);
  n_pos ~ binomial_logit(N_pos, logit_sens);

  // specificity meta-analysis
  mu_logit_spec ~ normal(4, 2);
  sigma_logit_spec ~ normal(0, sigma_sigma_logit_spec);
  logit_spec ~ normal(mu_logit_spec, sigma_logit_spec);
  n_neg ~ binomial_logit(N_neg, logit_spec);

  // prevalence estimation at the unknown-status sites
  pi ~ uniform(0, 1);
  logit_sens_unk ~ normal(mu_logit_sens, sigma_logit_sens);
  logit_spec_unk ~ normal(mu_logit_spec, sigma_logit_spec);
  n_unk ~ binomial(N_unk, pi * sens_unk + (1 - pi) * (1 - spec_unk));
}
```
Now we compile the model,

```r
model <- stan_model("meta.stan")
```

and then fit it to the data.
```r
data <-
  list(sigma_sigma_logit_sens = 0.01,
       sigma_sigma_logit_spec = 0.01,
       K_pos = sens_tests,
       N_pos = sens_df$tests,
       n_pos = sens_df$pos_tests,
       K_neg = spec_tests,
       N_neg = spec_df$tests,
       n_neg = spec_df$neg_tests,
       K_unk = unk_tests,
       N_unk = array(unk_df$tests, dim = c(length(unk_df$tests))),
       n_unk = array(unk_df$pos_tests, dim = c(length(unk_df$tests))))
fit <- sampling(model, data = data,
                iter = 4000,
                control = list(adapt_delta = 0.99),
                open_progress = FALSE, refresh = 0)
print(fit, probs = c(0.025, 0.5, 0.975), digits = 3, pars = c("pi"))
```
```
Inference for Stan model: meta.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.

     mean se_mean    sd  2.5%   50% 97.5% n_eff Rhat
pi  0.013       0 0.003 0.007 0.012 0.019  7795    1

Samples were drawn using NUTS(diag_e) at Mon May 18 18:14:51 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```
Bendavid, Eran, Bianca Mulaney, Neeraj Sood, Soleil Shah, Emilia Ling, Rebecca Bromley-Dulfano, Cara Lai, Zoe Weissberg, Rodrigo Saavedra-Walker, Jim Tedrow, Dona Tversky, Andrew Bogan, Thomas Kupiec, Daniel Eichner, Ribhav Gupta, John P. A. Ioannidis, and Jay Bhattacharya. 27 April 2020. COVID-19 antibody seroprevalence in Santa Clara County, California. Version 2.

Gelman, Andrew. 1 May 2020. Simple Bayesian analysis inference of coronavirus infection rate from the Stanford study in Santa Clara county. Statistical Modeling, Causal Inference, and Social Sciences blog.