Chapter 5 Preliminaries on semiparametric estimation
5.1 From causal to statistical quantities
- We have arrived at identification formulas that express quantities that we care about in terms of observable quantities
- That is, these formulas express what would have happened in hypothetical worlds in terms of quantities observable in this world.
- This required causal assumptions
- Many of these assumptions are empirically unverifiable
- We saw an example where we could relax the cross-world assumption, at the cost of changing the parameter interpretation
- and where we could relax the positivity assumption, also at the cost of changing the parameter interpretation
- We are now ready to tackle the estimation problem, i.e., how do we best learn the value of quantities that are observable?
- The resulting estimation problem can be tackled using statistical assumptions of various degrees of strength
- Most of these assumptions are verifiable (e.g., a linear model)
- Thus, most are unnecessary (except for convenience)
- We have worked hard to try to satisfy the required causal assumptions
- This is not the time to introduce unnecessary statistical assumptions
- The estimation approach we will introduce reduces reliance on these statistical assumptions.
5.1.1 Computing identification formulas if you know the true distribution
- The mediation parameters that we consider can be seen as a function of the joint probability distribution of \(O=(W,A,Z,M,Y)\)
- For example, under identifiability assumptions the natural direct effect is equal to \[\begin{equation*} \psi(\P) = \E[\color{Goldenrod}{\E\{\color{ForestGreen}{\E(Y \mid A=1, M, W) - \E(Y \mid A=0, M, W)}\mid A=0,W\}}] \end{equation*}\]
- The notation \(\psi(\P)\) implies that the parameter is a function of \(\P\)
- This means that we can compute it for any distribution \(\P\)
- For example, if we know the true \(\P(W,A,M,Y)\), we can compute the true value of the parameter by the following steps (carried out in the short R sketch after this list):
- Computing the conditional expectation \(\E(Y\mid A=1,M=m,W=w)\) for all values \((m,w)\)
- Computing the conditional expectation \(\E(Y\mid A=0,M=m,W=w)\) for all values \((m,w)\)
- Computing the probability \(\P(M=m\mid A=0,W=w)\) for all values \((m,w)\)
- Computing, for each value \(w\), \[\begin{align*} \color{Goldenrod}{\E\{}&\color{ForestGreen}{\E(Y \mid A=1, M, W) - \E(Y \mid A=0, M, W)}\color{Goldenrod}{\mid A=0,W=w\}} =\\ &\color{Goldenrod}{\sum_m\color{ForestGreen}{\{\E(Y \mid A=1, m, w) - \E(Y \mid A=0, m, w)\}}\P(M=m\mid A=0, W=w)} \end{align*}\]
- Computing the probability \(\P(W=w)\) for all values \(w\)
- Computing the mean over all values \(w\), weighted by \(\P(W=w)\)
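- As an illustration, the following R chunk carries out these steps exactly for a hypothetical discrete distribution with binary \(W\), \(A\), and \(M\); the functions `e_y` and `p_m` and all numeric values below are made up for illustration only:
# hypothetical true distribution: binary W, A, M (values chosen arbitrarily)
e_y <- function(a, m, w) 0.2 + 0.5 * a + 0.3 * a * m + 0.1 * w  # E(Y | A=a, M=m, W=w)
p_m <- function(a, w) plogis(-0.5 + a + 0.5 * w)                # P(M=1 | A=a, W=w)
p_w <- c(0.4, 0.6)                                              # P(W=0), P(W=1)
psi <- 0
for (w in 0:1) {
  # inner sum over m: {E(Y | 1, m, w) - E(Y | 0, m, w)} * P(M = m | A = 0, W = w)
  inner <- sum(sapply(0:1, function(m) {
    p_m0 <- p_m(0, w)^m * (1 - p_m(0, w))^(1 - m)
    (e_y(1, m, w) - e_y(0, m, w)) * p_m0
  }))
  # outer mean over the distribution of W
  psi <- psi + inner * p_w[w + 1]
}
print(psi)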
5.1.2 Estimating identification formulas
- The above is how you would compute the true value if you knew the true distribution \(\P\)
- This is exactly what we did in our R examples before
- But we can use the same logic for estimation:
- Fit a regression to estimate, say, \(\hat\E(Y\mid A=1,M=m,W=w)\)
- Fit another regression to estimate \(\hat\E(Y\mid A=0,M=m,W=w)\)
- Fit a regression to estimate \(\hat\P(M=m\mid A=0,W=w)\)
- Estimate \(\P(W=w)\) with the empirical distribution
- Evaluate \[\begin{equation*} \psi(\hat\P) = \hat\E[\hat\E\{\hat\E(Y \mid A=1, M, W) - \hat\E(Y \mid A=0, M, W)\mid A=0,W\}] \end{equation*}\]
- This is known as the g-computation estimator (implemented in the simulation of Section 5.1.6 below)
5.1.3 Weighted estimator (akin to IPW)
- An alternative expression of the parameter functional (for the NDE) is given by
\[\E\bigg[\color{RoyalBlue}{\bigg\{ \frac{I(A=1)}{\P(A=1\mid W)}\frac{\P(M\mid A=0,W)}{\P(M\mid A=1,W)} - \frac{I(A=0)}{\P(A=0\mid W)}\bigg\}} \times \color{Goldenrod}{Y}\bigg]\]
- Thus, you can also construct a weighted estimator as
\[\frac{1}{n}\sum_{i=1}^n\bigg[\color{RoyalBlue}{\bigg\{ \frac{I(A_i=1)}{\hat\P(A_i=1\mid W_i)}\frac{\hat\P(M_i\mid A=0,W_i)}{\hat\P(M_i\mid A=1,W_i)} - \frac{I(A_i=0)}{\hat\P(A_i=0\mid W_i)}\bigg\}} \times \color{Goldenrod}{Y_i}\bigg]\]
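- For illustration, here is a minimal R sketch of this weighted estimator for binary \(A\) and \(M\); the function name `ipw_nde` and the logistic working models are illustrative assumptions, not a prescribed implementation:
ipw_nde <- function(y, m, a, w) {
  # estimate P(A = 1 | W) with a logistic regression
  fit_a <- glm(a ~ w, family = binomial)
  g1 <- predict(fit_a, type = "response")
  # estimate P(M = 1 | A, W) with a logistic regression (binary M assumed)
  fit_m <- glm(m ~ a + w, family = binomial)
  p_m1_a1 <- predict(fit_m, newdata = data.frame(a = 1, w = w), type = "response")
  p_m1_a0 <- predict(fit_m, newdata = data.frame(a = 0, w = w), type = "response")
  # evaluate P(M = M_i | A = a, W = W_i) at the observed mediator values
  p_m_a1 <- ifelse(m == 1, p_m1_a1, 1 - p_m1_a1)
  p_m_a0 <- ifelse(m == 1, p_m1_a0, 1 - p_m1_a0)
  # weights from the display above
  wts <- (a / g1) * (p_m_a0 / p_m_a1) - (1 - a) / (1 - g1)
  mean(wts * y)
}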
5.1.4 How can g-computation and weighted estimation be implemented in practice?
- There are two possible ways to implement these estimators:
- Using parametric models for the above regressions
- Using flexible data-adaptive regression (a.k.a. machine learning); a sketch follows below
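- For instance, a flexible variant of the g-computation estimator could replace the linear models with generalized additive models from the `mgcv` package; the function name `gcomp_gam` and the specific formulas below are illustrative assumptions:
library(mgcv)
gcomp_gam <- function(y, m, a, w) {
  # outcome regression with a smooth term in w and an a:m interaction
  fit_y <- gam(y ~ s(w) + a + m + a:m)
  pred_y1 <- predict(fit_y, newdata = data.frame(a = 1, m = m, w = w))
  pred_y0 <- predict(fit_y, newdata = data.frame(a = 0, m = m, w = w))
  # pseudo-outcome regression, again smooth in w, predicted at a = 0
  pseudo <- as.numeric(pred_y1 - pred_y0)
  fit_pseudo <- gam(pseudo ~ s(w) + a)
  mean(predict(fit_pseudo, newdata = data.frame(a = 0, w = w)))
}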
5.1.5 Pros and cons of g-computation and weighting with parametric models
- Pros:
- Easy to understand
- Ease of implementation (standard regression software)
- Can use the Delta method or the bootstrap for computation of standard errors (see the bootstrap sketch after this list)
- Cons:
- Unless \(W\) and \(M\) contain only a few categorical variables, it is very easy to misspecify the models
- This can introduce sizable bias in the estimators
- These modelling assumptions have become less necessary with the advent of data-adaptive regression tools (a.k.a. machine learning)
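- To illustrate the bootstrap option mentioned above, the sketch below computes bootstrap standard errors for the parametric g-computation estimator; it assumes the `gcomp()` function defined in the simulation of Section 5.1.6, and the function name `boot_se_gcomp` and the number of replicates are illustrative choices:
boot_se_gcomp <- function(y, m, a, w, nboot = 200) {
  n <- length(y)
  boot_ests <- replicate(nboot, {
    # resample individuals with replacement and recompute the estimator
    idx <- sample.int(n, n, replace = TRUE)
    gcomp(y[idx], m[idx], a[idx], w[idx])
  })
  # bootstrap standard error: sd of the estimator across resamples
  sd(boot_ests)
}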
5.1.6 An example of the bias of a g-computation estimator of the natural direct effect
- The following R chunk provides simulation code to exemplify the bias of a parametric g-computation estimator in a simple situation:
mean_y <- function(m, a, w) abs(w) + a * m  # E(Y | M = m, A = a, W = w)
mean_m <- function(a, w) plogis(w^2 - a)    # P(M = 1 | A = a, W = w)
pscore <- function(w) plogis(1 - abs(w))    # P(A = 1 | W = w)
- This yields a true NDE value of approximately 0.58, which can be approximated by Monte Carlo integration:
w_big <- runif(1e6, -1, 1)  # large sample of W for Monte Carlo integration
trueval <- mean((mean_y(1, 1, w_big) - mean_y(1, 0, w_big)) *
mean_m(0, w_big) + (mean_y(0, 1, w_big) - mean_y(0, 0, w_big)) *
(1 - mean_m(0, w_big)))
print(trueval)
#> [1] 0.58062
- Let’s perform a simulation where we draw 1000 datasets from the above distribution, and compute a g-computation estimator based on misspecified linear models:
gcomp <- function(y, m, a, w) {
  # outcome regression (misspecified: ignores the a:m interaction and |w|)
  lm_y <- lm(y ~ m + a + w)
  # predicted outcomes with treatment set to 1 and 0, mediator as observed
  pred_y1 <- predict(lm_y, newdata = data.frame(a = 1, m = m, w = w))
  pred_y0 <- predict(lm_y, newdata = data.frame(a = 0, m = m, w = w))
  # regress the pseudo-outcome on (a, w) and predict at a = 0
  pseudo <- pred_y1 - pred_y0
  lm_pseudo <- lm(pseudo ~ a + w)
  pred_pseudo <- predict(lm_pseudo, newdata = data.frame(a = 0, w = w))
  # average over the empirical distribution of W
  estimate <- mean(pred_pseudo)
  return(estimate)
}
estimate <- lapply(seq_len(1000), function(iter) {
  # draw one dataset of size n from the true distribution
  n <- 1000
  w <- runif(n, -1, 1)
  a <- rbinom(n, 1, pscore(w))
  m <- rbinom(n, 1, mean_m(a, w))
  y <- rnorm(n, mean_y(m, a, w))
  est <- gcomp(y, m, a, w)
  return(est)
})
estimate <- do.call(c, estimate)
# if the estimator were unbiased, the histogram would center on the red line
hist(estimate)
abline(v = trueval, col = "red", lwd = 4)
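- We can also summarize the bias numerically, using the `estimate` vector and `trueval` computed above:
# average bias of the g-computation estimator across the 1000 repetitions
mean(estimate) - trueval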
- The bias also affects the confidence intervals:
cis <- cbind(
  estimate - qnorm(0.975) * sd(estimate),
  estimate + qnorm(0.975) * sd(estimate)
)
ord <- order(rowSums(cis))
lower <- cis[ord, 1]
upper <- cis[ord, 2]
curve(trueval + 0 * x,
  ylim = c(0, 1), xlim = c(0, 1001), lwd = 2, lty = 3, xaxt = "n",
  xlab = "", ylab = "Confidence interval", cex.axis = 1.2, cex.lab = 1.2
)
for (i in 1:1000) {
  clr <- rgb(0.5, 0, 0.75, 0.5)
  # intervals that miss the true value are drawn in red
  if (upper[i] < trueval || lower[i] > trueval) clr <- rgb(1, 0, 0, 1)
  points(rep(i, 2), c(lower[i], upper[i]), type = "l", lty = 1, col = clr)
}
text(450, 0.10, "n = 1000, repetitions = 1000", cex = 1.2)
text(450, 0.01, paste0(
  "Coverage probability = ",
  100 * mean(lower < trueval & trueval < upper), "%"
), cex = 1.2)
5.1.7 Pros and cons of g-computation or weighting with data-adaptive regression
- Pros:
- Easy to understand
- Alleviates model-misspecification bias
- Cons:
- Might be harder to implement depending on the regression procedures used
- No general approaches for computation of standard errors and confidence intervals
- For example, the bootstrap is not guaranteed to work, and it is known to fail in some cases
5.2 Semiparametric estimation
- Intuitively, semiparametric estimation offers a way to use data-adaptive regression to
- avoid model misspecification bias,
- endow the estimators with additional robustness (e.g., double robustness), while
- allowing the computation of correct standard errors and confidence intervals
- This can be achieved by adding a bias correction factor to the g-computation estimate as follows: \[\begin{equation*} \psi(\hat \P) + \frac{1}{n}\sum_{i=1}^n D(O_i) \end{equation*}\] for some function \(D(O_i)\) of the data
- The function \(D(O)\) is called the efficient influence function (EIF)
- The EIF must be found on a case-by-case basis for each parameter \(\psi(\P)\)
- For example, for estimating the standardized mean \(\psi(\P)=\E[\E(Y\mid A=1, W)]\), we have \[\begin{equation*} D(O) = \frac{A}{\hat \P(A=1\mid W)}[Y - \hat\E(Y\mid A=1, W)] + \hat\E(Y\mid A=1, W) - \psi(\hat\P) \end{equation*}\] (a numerical sketch is given at the end of this section)
- The EIF is found by using a distributional analogue of a Taylor expansion
- In this workshop we will omit the specific form of \(D(O)\) for some of the parameters that we use
- But the estimators we discuss and implement in the R packages will be based on these EIFs
- And the specific form of the EIF may be found in papers in the references
Note: the bias correction above may have an additional problem of returning parameter estimates outside of natural bounds, e.g., probabilities greater than one. A solution to this (not discussed in this workshop but implemented in some of the R packages) is targeted minimum loss-based estimation.
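As an illustration of the bias correction, the following R chunk sketches a one-step estimator of the standardized mean \(\E[\E(Y\mid A=1, W)]\) using the EIF displayed earlier; the simulated data-generating mechanism and the working models below are illustrative assumptions:
# hypothetical simulated data for illustration
n <- 500
w <- runif(n, -1, 1)
a <- rbinom(n, 1, plogis(w))
y <- rnorm(n, w + a)

# nuisance estimates: outcome regression and propensity score
fit_y <- lm(y ~ a + w)
qbar <- predict(fit_y, newdata = data.frame(a = 1, w = w))  # estimate of E(Y | A=1, W)
fit_a <- glm(a ~ w, family = binomial)
ghat <- predict(fit_a, type = "response")                   # estimate of P(A=1 | W)

# plug-in (g-computation) estimate plus the EIF bias correction
plugin <- mean(qbar)
eif <- (a / ghat) * (y - qbar) + qbar - plugin
onestep <- plugin + mean(eif)

# the EIF also provides a standard error and a Wald-type confidence interval
se <- sd(eif) / sqrt(n)
c(onestep - 1.96 * se, onestep + 1.96 * se)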