Modern Causal Mediation Analysis

Authors

Affiliations

Ivan Diaz

NYU Population Health

Nima Hejazi

Harvard Biostatistics

Kara Rudolph

Columbia Epidemiology

Nick Williams

Columbia Epidemiology

Welcome to SER 2025!

This open source, reproducible vignette accompanies a half-day workshop on modern methods for causal mediation analysis, offered at the SER 2025 annual meeting.

About this workshop

Causal mediation analysis can provide a mechanistic understanding of how an exposure impacts an outcome, a central goal in epidemiology and health sciences. However, rapid methodologic developments coupled with few formal courses presents challenges to implementation. Beginning with an overview of classical direct and indirect effects, this workshop will present recent advances that overcome limitations of previous methods, allowing for: (i) continuous exposures, (ii) multiple, non-independent mediators, and (iii) effects identifiable in the presence of intermediate confounders affected by exposure. Emphasis will be placed on flexible, stochastic and interventional direct and indirect effects, highlighting how these may be applied to answer substantive epidemiological questions from real-world studies. Multiply robust, nonparametric estimators of these causal effects, and free and open source R packages (crumble) for their application, will be introduced.

To aid translation to real-world data analysis, this workshop will incorporate hands-on R programming exercises to allow participants practice in implementing the statistical tools presented. It is recommended that participants have working knowledge of the basic notions of causal inference, including counterfactuals and identification (linking the causal effect to a parameter estimable from the observed data distribution). Familiarity with the R programming language is also recommended.

Tentative schedule

08:30A-08:45A: Introductions + mediation set-up
08:45A-9:15A: Controlled direct effects, natural direct/indirect effects, interventional direct/indirect effects
9:15A-9:25A: Choosing an estimand in real-world examples
9:25A-10:00A: What is the EIF?!
10:00A-10:30A: Break + discussion
10:30A-11:05A: Using the EIF for estimating the natural direct effect
11:05A-12:00P: Example walkthrough with R packages for effect estimation
12:00A-12:30P: Wrap-up

NOTE: All times listed in Eastern Daylight Time (EDT).

About the instructors

Iván Díaz

I am an Associate Professor of Biostatistics in the Department of Population Health at the NYU Grossman School of Medicine. My research focuses on the development of non-parametric statistical methods for causal inference from observational and randomized studies with complex datasets, using machine learning. This includes but is not limited to mediation analysis, methods for continuous exposures, longitudinal data including survival analysis, and efficiency guarantees with covariate adjustment in randomized trials. I am also interested in general semi-parametric theory, machine learning, and high-dimensional data.

Nima Hejazi

I am an Assistant Professor of Biostatistics at the Harvard Chan School of Public Health, where my research program explores how advances in causal inference, machine learning, and computational statistics help catalyze discovery in the biomedical and health sciences. I develop model-agnostic statistical methods using ideas from causal inference, semi-parametric statistics, and causal machine learning. Areas of recent emphasis have included causal mediation analysis, causal inference for continuous exposures, efficient inference under auxiliary- or outcome-dependent sampling designs, and sieve methods for causal machine learning. My methods research is directly tied to applied science problems from studies of investigational agents to treat or prevent infectious diseases, chronic diseases, and cancer. I am also interested in open-source software and high-performance computing for statistics and in reproducible statistical data science.

Kara Rudolph

I am an Associate Professor of Epidemiology at the Columbia Mailman School of Public Health. My research interests are in developing and applying causal inference methods to understand the best ways to prevent and treat substance use disorders, including understanding the mediation mechanisms underlying those relationships. More generally, my work on generalizing/ transporting findings from study samples to target populations and identifying subpopulations most likely to benefit from interventions contributes to efforts to optimally target available policy and program resources.

Nick Williams

I am a Senior Data Analyst in Columbia University’s Mailman School of Public Health, Department of Epidemiology, and an incoming PhD student in Biostatistics at the University of California, Berkeley. My interests are in the development of statistical computing tools for novel causal inference methods. I am the author and maintainer of several R packages for conducting causal analyses in R.

Reproducibility note

These workshop materials were written using the Quarto, an open-source, cross-platform technical publishing system built on RMarkdown, and the complete source is available on GitHub. This version of the book was built with R version 4.5.0 (2025-04-11), pandoc version 3.6.3, and quarto version 1.7.31. See the renv.lock file in the source repository for an up-to-date list of the packages used.

Setup instructions

R and RStudio

R and RStudio are separate downloads and installations. R is the underlying statistical computing environment. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio.

Virtual environment setup with `renv`

These instructions are intended to help with setting up the included renv virtual environment, which ensures all participants are using the same exact set of R packages (and package versions). A few important notes to keep in mind:

When R is started from the top level of this repository, renv is activated automatically. There is no further action required on your part. If renv is not installed, it will be installed automatically, assuming that you have an active internet connection.
While renv is active, the R session will only have access to the packages (and their dependencies) that are listed in the renv.lock file—that is, you should not expect to have access to any other R packages that may be installed elsewhere on the computing system in use.
Upon an initial attempt, renv will prompt you to install packages listed in the renv.lock file, by printing a message.

In any such case, please call renv::status() to review the list of packages missing and to view renv’s recommendations for fixing the issue; usually, renv::restore() will be the next step necessary to install any missing packages. Note that you do not need to manually install the packages via install.packages(), remotes::install_github(), or similar, as renv will attempt do this for you.

While unnecessary for the purposes of this workshop, if you’d like to learn more about the details of how the renv virtual environment system works, the following references may be helpful:

In some rare cases, R packages that renv automatically tries to install as part of the renv::restore() process may fail due to missing systems-level dependencies. In such cases, a reference to the missing dependencies and system-specific instructions their installation involving, e.g., Ubuntu Linux’s apt or homebrew for macOS, will usually be displayed.

--- authors: - name: Ivan Diaz affiliation: NYU Population Health orcid: 0000-0001-9056-2047 email: ivan.diaz@nyulangone.org - name: Nima Hejazi affiliation: Harvard Biostatistics orcid: 0000-0002-7127-2789 email: nhejazi@hsph.harvard.edu - name: Kara Rudolph affiliation: Columbia Epidemiology orcid: 0000-0002-9417-7960 email: kr2854@cumc.columbia.edu - name: Nick Williams affiliation: Columbia Epidemiology orcid: 0000-0002-1378-4831 email: ntw2117@cumc.columbia.edu --- ```{r} #| label: load-renv #| echo: false #| message: false renv::autoload() library(here) ``` # Welcome to SER 2025! {.unnumbered} This open source, reproducible vignette accompanies a half-day workshop on modern methods for *causal mediation analysis*, offered at the SER 2025 annual meeting. ## About this workshop {#about} Causal mediation analysis can provide a mechanistic understanding of how an exposure impacts an outcome, a central goal in epidemiology and health sciences. However, rapid methodologic developments coupled with few formal courses presents challenges to implementation. Beginning with an overview of classical direct and indirect effects, this workshop will present recent advances that overcome limitations of previous methods, allowing for: (i) continuous exposures, (ii) multiple, non-independent mediators, and (iii) effects identifiable in the presence of intermediate confounders affected by exposure. Emphasis will be placed on flexible, stochastic and interventional direct and indirect effects, highlighting how these may be applied to answer substantive epidemiological questions from real-world studies. Multiply robust, nonparametric estimators of these causal effects, and free and open source `R` packages ([`crumble`](https://github.com/nt-williams/crumble)) for their application, will be introduced. To aid translation to real-world data analysis, this workshop will incorporate hands-on `R` programming exercises to allow participants practice in implementing the statistical tools presented. It is recommended that participants have working knowledge of the basic notions of causal inference, including counterfactuals and identification (linking the causal effect to a parameter estimable from the observed data distribution). Familiarity with the `R` programming language is also recommended. ## Tentative schedule {#schedule} - 08:30A-08:45A: Introductions + mediation set-up  - 08:45A-9:15A: Controlled direct effects, natural direct/indirect effects, interventional direct/indirect effects    - 9:15A-9:25A: Choosing an estimand in real-world examples  - 9:25A-10:00A: What is the EIF?!  - 10:00A-10:30A: Break + discussion - 10:30A-11:05A: Using the EIF for estimating the natural direct effect  - 11:05A-12:00P: Example walkthrough with `R` packages for effect estimation  - 12:00A-12:30P: Wrap-up **NOTE: All times listed in Eastern Daylight Time (EDT).** ## About the instructors {#instructors} ### [Iván Díaz](https://www.idiaz.xyz/) {.unnumbered} I am an Associate Professor of Biostatistics in the [Department of Population Health at the NYU Grossman School of Medicine](https://med.nyu.edu/faculty/ivan-l-diaz). My research focuses on the development of non-parametric statistical methods for causal inference from observational and randomized studies with complex datasets, using machine learning. This includes but is not limited to mediation analysis, methods for continuous exposures, longitudinal data including survival analysis, and efficiency guarantees with covariate adjustment in randomized trials. I am also interested in general semi-parametric theory, machine learning, and high-dimensional data. ### [Nima Hejazi](https://nimahejazi.org) {.unnumbered} I am an Assistant Professor of Biostatistics at the [Harvard Chan School of Public Health](https://www.hsph.harvard.edu/profile/nima-hejazi/), where my research program explores how advances in causal inference, machine learning, and computational statistics help catalyze discovery in the biomedical and health sciences. I develop model-agnostic statistical methods using ideas from causal inference, semi-parametric statistics, and causal machine learning. Areas of recent emphasis have included causal mediation analysis, causal inference for continuous exposures, efficient inference under auxiliary- or outcome-dependent sampling designs, and sieve methods for causal machine learning. My methods research is directly tied to applied science problems from studies of investigational agents to treat or prevent infectious diseases, chronic diseases, and cancer. I am also interested in open-source software and high-performance computing for statistics and in reproducible statistical data science. ### [Kara Rudolph](https://kararudolph.github.io/) {.unnumbered} I am an Associate Professor of Epidemiology at the [Columbia Mailman School of Public Health](https://www.publichealth.columbia.edu/profile/kara-rudolph-phd). My research interests are in developing and applying causal inference methods to understand the best ways to prevent and treat substance use disorders, including understanding the mediation mechanisms underlying those relationships. More generally, my work on generalizing/ transporting findings from study samples to target populations and identifying subpopulations most likely to benefit from interventions contributes to efforts to optimally target available policy and program resources. ### [Nick Williams](https://github.com/nt-williams) {.unnumbered} I am a Senior Data Analyst in Columbia University’s Mailman School of Public Health, Department of Epidemiology, and an incoming PhD student in Biostatistics at the [University of California, Berkeley](https://publichealth.berkeley.edu/academics/biostatistics). My interests are in the development of statistical computing tools for novel causal inference methods. I am the author and maintainer of several R packages for conducting causal analyses in R. ## Reproducibility note {#repro} These workshop materials were written using the [Quarto](https://quarto.org/), an open-source, cross-platform technical publishing system built on [RMarkdown](https://rmarkdown.rstudio.com/), and the complete source is available on [GitHub](https://github.com/nhejazi/causal_mediation_workshops). This version of the book was built with `r R.version.string`, [pandoc](https://pandoc.org/) version `r rmarkdown::pandoc_version()`, and [quarto](https://quarto.org/) version `r quarto::quarto_version()`. See the [`renv.lock` file](https://github.com/nhejazi/causal_mediation_workshops/blob/master/renv.lock) in the source repository for an up-to-date list of the packages used. ## Setup instructions {#setup} ### R and RStudio **R** and **RStudio** are separate downloads and installations. R is the underlying statistical computing environment. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio. ### Virtual environment setup with `renv` {#renv} These instructions are intended to help with setting up the included [`renv` virtual environment](https://rstudio.github.io/renv/index.html), which ensures all participants are using the same exact set of `R` packages (and package versions). A few important notes to keep in mind: - When `R` is started from the top level of this repository, `renv` is activated automatically. There is no further action required on your part. If `renv` is not installed, it will be installed automatically, assuming that you have an active internet connection. - While `renv` is active, the `R` session will only have access to the packages (and their dependencies) that are listed in the `renv.lock` file---that is, you should not expect to have access to any other `R` packages that may be installed elsewhere on the computing system in use. - Upon an initial attempt, `renv` will prompt you to install packages listed in the `renv.lock` file, by printing a message. In any such case, please call `renv::status()` to review the list of packages missing and to view `renv`'s recommendations for fixing the issue; usually, `renv::restore()` will be the next step necessary to install any missing packages. Note that you **do *not* need to manually install** the packages via `install.packages()`, `remotes::install_github()`, or similar, as `renv` will attempt do this for you. While unnecessary for the purposes of this workshop, if you'd like to learn more about the details of how the `renv` virtual environment system works, the following references may be helpful: 1. [Collaborating with `renv`](https://rstudio.github.io/renv/articles/collaborating.html) 2. [Introduction to `renv`](https://rstudio.github.io/renv/articles/renv.html) In some rare cases, `R` packages that `renv` automatically tries to install as part of the `renv::restore()` process may fail due to missing systems-level dependencies. In such cases, a reference to the missing dependencies and system-specific instructions their installation involving, e.g., [Ubuntu Linux's `apt`](http://manpages.ubuntu.com/manpages/bionic/man8/apt.8.html) or [`homebrew` for macOS](https://brew.sh/), will usually be displayed.

Welcome to SER 2025!

About this workshop

Tentative schedule

About the instructors

Iván Díaz

Nima Hejazi

Kara Rudolph

Nick Williams

Reproducibility note

Setup instructions

R and RStudio

Virtual environment setup with renv

Virtual environment setup with `renv`