## Figures

## Abstract

Mediation analysis is a statistical method for evaluating the direct and indirect effects of an exposure on an outcome in the presence of a mediator. Mediation models have been widely used to determine direct and indirect contributions of genetic variants in clinical phenotypes. In genetic studies, the additive genetic model is the most commonly used model because it can detect effects from either recessive or dominant models (or any model in between). However, the existing approaches for mediation model cannot be directly applied when the genetic model is additive (e.g. the most commonly used model for SNPs) or categorical (e.g. polymorphic loci), and thus modification to measures of indirect and direct effects is warranted. In this study, we proposed overall measures of indirect, direct, and total effects for a mediation model with a categorical exposure and a censored mediator, which accounts for the frequency of different values of the categorical exposure. The proposed approach provides the overall contribution of the categorical exposure to the outcome variable. We assessed the empirical performance of the proposed overall measures via simulation studies and applied the measures to evaluate the mediating effect of a women’s age at menopause on the association between genetic variants and type 2 diabetes.

**Citation: **Wang J, Ning J, Shete S (2021) Mediation model with a categorical exposure and a censored mediator with application to a genetic study. PLoS ONE 16(10):
e0257628.
https://doi.org/10.1371/journal.pone.0257628

**Editor: **Marie-Pierre Dubé, Universite de Montreal, CANADA

**Received: **April 23, 2021; **Accepted: **September 6, 2021; **Published: ** October 12, 2021

**Copyright: ** © 2021 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The MESA CARe data used for the analyses described in this manuscript are available in public repository at the dbGaP, accession number phs000209.v13.p3.

**Funding: **Jian Wang was supported in part by the National Institutes of Health (NIH) grant 1R01AI143886. Sanjay Shete was supported in part by a Cancer Prevention Fellowship funded by the Cancer Prevention and Research Institute of Texas (CPRIT) grant award RP170259 and the Betty B. Marcus Chair in Cancer Prevention. Jing Ning was supported in part by the NIH grant R01CA193878 and the Andrew Sabin Family Fellowship. Jian Wang and Jing Ning were supported in part by the CPRIT grant RP200633 and NIH grant 1R01CA256977. Jian Wang, Jing Ning, and Sanjay Shete were supported in part by the NIH Cancer Center Support Grant P30CA016672. The MESA project has been conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA has been provided by contracts N01-HC- 95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC- 95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, and CTSA UL1-RR-024156. The MESA CARe data used for the analyses described in this manuscript were obtained through dbGaP (phs000209.v13.p3). Funding for CARe genotyping was provided by NHLBI contract N01-HC-65226.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Mediation analysis is a statistical method used to evaluate the direct and indirect effects of an exposure on an outcome in the presence of a mediator. Mediation models have been widely used to determine direct and indirect contributions of genetic variants in clinical phenotypes, such as contribution of *CHRNA3-A5* genes in lung cancer [1–7]. In many studies, one encounters right- or left-censored data instead of complete data. Approaches to assess mediation when the outcome variable is censored have been widely studied [8–15]. However, the mediator itself can also be a censored variable. For instance, genes may affect the age at which a person stops smoking, a variable that is censored for current smokers and has been associated with lung cancer–associated mortality [16]. Few studies have considered mediation models with a censored mediator. Wang and Shete [17] used a multiple imputation approach along with the accelerated failure time (AFT) model to address a censored nature of a mediator when outcome was a continuous variable, and this approach yielded accurate estimations for the coefficients of different paths, the indirect effect (*IE*), and proportion of the total effect mediated (*PM*) by the mediator. Wang et al. [18] further extended the mediation approach with a censored mediator for studies with binary outcomes (e.g., case-control studies), based on the semiparametric AFT model with an unspecified error distribution combined with a pseudo-likelihood function, which was shown to be efficient yet robust.

In genetic association studies of complex diseases—including our motivating study of the association between single nucleotide polymorphisms (SNPs), woman’s age at menopause, and type 2 diabetes—because often there is no concrete evidence of the genetic mode of inheritance, one usually uses three classic genotypic models: the additive, dominant, and recessive genetic models [19]. For example, consider a SNP with two alleles *R* and *r*, and let *R* be the risk allele and *r* be the normal allele. The additive genetic model is defined using a categorical random variable, *X* = (0, 1, 2), to denote the three genotypes (*rr*, *Rr*, *RR*), assuming that the disease risk depends on the dose of the risk allele *R*. When the dominant or recessive genetic model is assumed, we use a binary variable, *X* = (0, 1, 1) or *X* = (0, 0, 1), respectively, to denote the three genotypes (*rr*, *Rr*, *RR*). The additive genetic model is the most commonly used model because typically the mode of action of susceptibility SNPs is unknown and the additive model can detect effects from either recessive or dominant models, or any model in between [20–22]. In addition to SNPs, the highly polymorphic loci, such as human leukocyte antigen (HLA) genes, can also be involved in the mediation model as an exposure. Such genes have many different alleles, resulting in many different genotypes (i.e., more than three genotypes found in di-allelic SNPs). The previous approaches by Wang and Shete [17] and Wang et al. [18] assumed that the exposure is either continuous (e.g., gene expression) or binary (e.g., dominant or recessive mode of inheritance for a genetic variant). These methods therefore cannot be directly applied when the genetic model is additive (e.g. the most commonly used model for SNPs) or categorical (e.g. polymorphic loci), and thus, modification to measures of indirect and direct effects is warranted for these general scenarios. Therefore, we extended our approach to the scenario in which the exposure is a categorical variable which can be applied to the model where the mediator is subject to censoring.

In particular, we proposed the measures for calculating the overall *IE*, direct effect (*DE*), and total effect (*TE*) in such a mediation model, where we first assessed the *IE*, *DE*, and *TE* for each category of the exposure and then calculated the *IE*, *DE*, and *TE*, weighted by the frequency of each category (e.g., frequency of each genotype), to estimate overall effect (*IE*, *DE*, and *TE*) of the exposure on the outcome variable in the presence of a censored mediator. The proposed measures provide the overall contribution (indirect, direct and total) of the categorical exposure to the outcome variable, instead of the relative contribution comparing one category to another in previous studies [23,24]. The proposed measures are general and valid regardless of whether the outcome variable and mediator are continuous, binary, or censored.

We applied the proposed overall measures of *IE*, *DE*, and *TE* to the motivating study of the mediating effect of a woman’s age at menopause on the effect of SNPs on risk of type 2 diabetes. Type 2 diabetes is a complex disease characterized by interplay between genetic and environmental factors [25–30]. Previous studies using epidemiologic data or genetic epidemiologic data have suggested an association between a woman’s age at menopause and type 2 diabetes [31–33] as well as an association between several SNPs and a woman’s age at menopause [34,35]. Therefore, we hypothesized that there are potential dual pathways between SNPs and type 2 diabetes, one via a direct effect and the other indirectly through a women’s age at menopause, which is hypothesized as a mediator for this association and is a censored variable because not all women have had gone through menopause (Fig 1).

Nodes represent the variables being analyzed in the mediation model, including genetic variant (i.e., exposure), a women’s age at menopause (i.e., mediator) and type 2 diabetes (i.e., outcome). A direct edge implies a potential direct causal effect. A pathway from one variable (genetic variant) to another (type 2 diabetes) implies a potential causal relationship through the mediator on the path (age at menopause).

In Methods section, we introduce the notations; mediation models; definitions of the overall *IE*, *DE*, and *TE*; and the associated estimation approaches [18]. We assess the empirical performance of the proposed overall *IE*, *DE* and *TE* via simulation studies in the Simulation section, conduct a data analysis in the section of Application to the motivation study, and provide a discussion in the Discussion section.

## Methods

We first review the approach by Hayes and Preacher [23], which is a widely used mediation model in which the exposure has multiple categories, and point out its limitations in the context of our motivating study. In their approach, Hayes and Preacher proposed to code the multi-categorical exposure using different coding strategies, including dummy coding, contrast coding etc., depending on the research interest. For example, when using the dummy coding, if the multi-categorical exposure *X* has *k* groups, one can create *k*-1 dummy variables *X*_{i}, *i* = 1, … *k*-1, with *X*_{i} = 1 if the subject is in group *i* and *X*_{i} = 0 otherwise, where one group is considered the reference group in the analysis. In this way, the approach creates multiple binary exposure variables in one mediation model.

As assumed in the Hayes and Preacher model, when outcome *Y* and mediator *T* are continuous, the direct and indirect effects of *X* on *Y* are estimated using the dummy-coded multiple exposure variables, mediator, and outcome using following linear regressions:
where *a*_{i} represents the path from the dummy-coded exposure *X*_{i} to the mediator *T*, represents the path from *X*_{i} to the outcome *Y* conditional on the mediator *T*; *c*_{i} represents the relative total effect of *X*_{i} on the outcome *Y*; *i* = 1, … *k*-1; and *b* represents the effect of the mediator *T* on the outcome *Y*. To assess the mediating effects, Hayes and Preacher adopted the terms relative *IE*, relative *DE*, and relative *TE*, respectively, to refer to *a*_{i}*b*, , and ; and the effects were calculated for each of the binary exposure variables recoded from the original exposure. (see details in [23]).

The approach proposed by Hayes and Preacher has advantages over other approaches such as aggregating groups or discarding to construct a dichotomous exposure, but it still has certain limitations. Since the relative *IE* is calculated for each created binary exposure separately, it cannot provide the overall mediating effect of the mediator on the relation between the exposure and outcome variable. Also, if the exposure has many categories, the number of possible tests to be conducted can be large. In this case, multiple correction tests reduce the power of the test for the mediating effect. Importantly, the approach proposed by Hayes and Preacher [23] assumes that both outcome and mediator are continuous and normally distributed, which allows one to estimate the relative *TE* as the summation of relative *IE* and relative *DE*. Therefore, the approach cannot be directly applied in many practical situations, such as when the mediator is a non-normal distributed variable subject to censoring and the outcome is binary. More recent work has been conducted to develop approaches for the analysis of treatment/exposure with multiple categories [24,36–38]. However, these approaches compare one category to another of the treatment/exposure without providing overall direct and indirect effects of the categorical exposure variable on the outcome variable. Furthermore, the direct application of these methods to a mediation model with a censored mediator and a binary outcome is not straightforward. For instance, the approaches proposed by Samoilenko et al. focused on continuous mediator and outcome [24].

To address these limitations, we proposed an approach to assess the overall mediating effect of a mediation model in which the exposure is categorical, and the mediator is subject to censoring. In the next sections, we propose definitions for the overall measures for *IE*, *DE*, and *TE* by extending the approach proposed in Wang et al. [18]. We focused on this approach because, compared to other existing approaches, it employed the semiparametric AFT model, which does not require a parametric distribution assumption on the mediators, and the pseudo-likelihood function, which is more flexible to be extended to different outcome variables (e.g., continuous outcome). However, as mentioned above, the proposed overall measures of *IE*, *DE* and *TE* are general and valid regardless of whether the outcome variable and mediator are continuous, binary, or censored or the approaches used.

*IE* for a mediation model with a categorical exposure and a continuous outcome

We first present methodology for the scenario when outcome variable is continuous. Let *X* be the categorical exposure, *T* be the mediator subject to right censoring, *Y* be the continuous outcome variable and *Z* be the other covariates involved in a mediation model. For the mediator, given an individual *i*, *i* = 1, …, *n*, we observe *m*_{i} = min(*t*_{i}, *c*_{i}) and *δ*_{i} = *I*(*t*_{i} ≤ *c*_{i}), where *c*_{i} is the right-censored time and *δ*_{i} is the indicator for censored (*δ*_{i} = 0) or observed (*δ*_{i} = 1). For the categorical exposure, we utilize the dummy coding as suggested in Hayes and Preacher. If the categorical exposure *X* has *k* categories, we create *k*-1 dummy variables, *X*_{j}, *j* = 1, … *k*-1, with one of the categories as the reference (Fig 2), where *X*_{j} = 1 if *X* = *d*_{j}, and *X*_{j} = 0 if *X* ≠ *d*_{j}. That is, *X* = *d*_{j} means (*X*_{1} = 0, … *X*_{j} = 1, … *X*_{k−1} = 0). For the reference category, *X* = *d*_{0} is equivalent to (*X*_{1} = 0, … *X*_{j} = 0, … *X*_{k−1} = 0). For example, in our motivation study, the exposure is coded as an additive genetic model with three categories (0, 1, 2) denoting the three genotypes (*rr*, *Rr*, *RR*). Here *k* = 3 and we create two dummy variables, *X*_{1} and *X*_{2}, using the genotype *rr* (i.e., 0) as the reference group. Specifically, *X*_{1} = 1 when *X* = 1 (genotype *Rr*) and *X*_{2} = 1 when *X* = 2 (genotype *RR*).

Nodes represent the variables being analyzed in the mediation model, including the *k*-1 dummy-coded exposure variables, *X*_{1}, *X*_{2}, …, *X*_{k−1}, the mediator *T* and the outcome *Y*. A direct edge implies a potential direct causal effect. A pathway from one variable (e.g., *X*_{1}) to another (*Y*) implies a potential causal relationship through the mediator on the path (*T*).

The relationships among the variables can be specified using the linear regression model (i.e., association of exposure and mediator with outcome) and AFT model (i.e., association of exposure with mediator) as follows:
and
where *b*_{0}, *b*, , *j* = 1, …, *k* − 1, *γ*, and are the regression coefficients; ; *ε*_{yi} ~ Normal(0, *σ*^{2}); and *ε*_{ti} represents the independently and identically distributed random errors with mean zero and an unspecified distribution. Particularly, the coefficients *a*_{j}, *j* = 1, …, *k* − 1, correspond to the paths from *k*-1 dummy variables created for the original exposure, *X*_{1}, *X*_{2}, …, *X*_{k−1}, to the mediator *T*; the coefficients , *j* = 1, …, *k* − 1, correspond to the paths from *k*-1 dummy-coded exposure variables to the outcome *Y*; and *b* corresponds to the path from the mediator *T* to the outcome *Y*. In the presence of right-censoring, given a continuous outcome, the likelihood function for the observed data (*y*_{i}, *m*_{i}, *δ*_{i}, *x*_{1i}, …, *x*_{(k-1)i}, ) for an individual *i* is given as

We use the two-stage approach proposed in Wang et al. [18] to assess the coefficients for different paths *a*_{j}, , *j* = 1, …, *k* − 1, and *b*. In particular, in the first stage, based on the semiparametric AFT model [11,39,40], we use a weighted least square estimator to estimate coefficients *a*_{j}, with the closed form as
where is the Kaplan-Meier estimator for the censoring survival function which accounts for the right-censoring. Based on and the AFT model error distribution , which is estimated from the censored residues by Kaplan-Meier estimator, in the second stage, we assess the coefficients for paths *b*, and , *j* = 1, …, *k* − 1, with the use of a log-pseudo-likelihood function as below, given a sample of *n* individuals:
where is the set of parameters to be estimated and *τ* is the largest observed event time on a residual scale. The conditional probabilities Pr_{φ}(*y*_{i}|*m*_{i}, *A*_{i}) and can be formulated using the AFT model and the linear regression model. The estimators are assessed by maximizing the above log-pseudo-likelihood with the use of the minimization algorithms such as Nelder and Mead approach [41]. See Wang et al. [18] for details on parameter estimation.

We propose to assess the overall *IE*, *DE* and *TE* using an approach in which the *IE*s, *DE*s and *TE*s are calculated based on different categories of the exposure and combined using the corresponding frequencies of different categories of the exposure. The measures for *IE*, *DE*, and *TE* of each category of the exposure are derived following the counterfactual framework, which has been widely applied in mediation analysis, especially for scenarios involving nonlinearities and interactions [10,42–49]. Briefly, we denote and respectively to be the values of the outcome *Y* and mediator *T* that would have been observed if the exposure *X* had been set to *d*_{j}; and denote to be the value of *Y* that would have been observed if *T* and *X* had been set to *t* and *d*_{j} respectively [18,44,45]. Based on the counterfactual framework, conditional on the covariates *Z*, the natural *IE* is defined as , which compares the effects of the mediator *T* at values of and on the outcome *Y* when the exposure *X* is set to value of *d*_{j}; while the natural *DE* is defined as , which compares the effects of the exposure *X* on *Y* by setting the mediator *T* to the value it would have been observed if *X* had been set to be *d*_{0} (reference category). Here, the assumptions on the absence of unmeasured confounders and consistency are required, which have been extensively discussed in the literature [8,12,13,18,43–48]. The detailed derivations for calculating *IE* and *DE* and associated assumptions are shown in the online S1 Appendix.

For each of the binary dummy-coded exposure variables *X*_{j}, *j* = 1, …, *k* − 1, we evaluate the *IE* in the mediation model and denote it as *IE*_{j_versus_0} (indirect effect of category *X* = *d*_{j} versus the reference category *X* = *d*_{0}), given *x*_{j} = 0 as the reference group:

The overall indirect effect of the exposure *X* on the outcome *Y* mediated through the mediator *T* needs to account for the frequency of each possible categories *d*_{j} of *X*. Therefore, we define overall *IE* as:
(1)
where *f*_{j}, *j* = 1, …, *k* − 1, is the frequency of category *d*_{j} of the exposure. For genetic studies, *f*_{j} is the genotypic frequencies of possible genotypes of SNP *X* and can be obtained from external sources that represent general population such as the 1000 genome project data [42]; or one could estimate the genotypic frequencies from the current data.

If the models in the above equations are correctly specified, we can estimate the overall measures of *IE* based on the estimated AFT model error distribution for the mediator and the estimated coefficients and . On the basis of the counterfactual framework, the natural *IE*_{j_versus_0} for the dummy-coded exposure *X*_{j}, as defined above, measures the expected change in outcome *Y* due to the effects of the mediator *T* at values of versus when *X*_{j} is set to 1, conditional on the covariates [18,50]. In turn, the overall *IE*, calculated accounting for the frequency of different categories of the exposure, measures the average expected change in *Y* from the effects of the mediator *T* responding to change/transition in the categorical exposure *X* from any group to the reference group [50].

We can similarly define direct effects, *DE*_{j_versus_0}:

Similarly, the overall *DE* is then defined:
(2)

The overall *TE* is the summation of the *IE* and *DE*, calculated as *. The proportion of total effect of **X* on *Y* mediated by the mediator *T* (*PM*) can be estimated as *PM* = *IE/TE*, which is commonly reported when there is a significant *IE* [51].

*IE* for a mediation model with a categorical exposure and a binary outcome in a case-control study

We further extended the framework in the above section to accommodate a categorical exposure, a censored mediator and a binary outcome (e.g. presence or absence of a phenotype in case-control studies). The relationships among the variables can be specified using the logistic regression model and AFT model as follows: and

All coefficients are defined similarly as for the scenario with a continuous outcome. To account for the binary nature of the outcome variable in a case-control study where cases may be oversampled, in stage one, in addition to the weight to account for the right-censoring, we consider one additional sampling weight, i.e. *w*_{i}, in the weighted least square estimator to estimate coefficients *a*_{j}, *j* = 1, …, *k* − 1, as below:

The log-pseudo-likelihood function can be similarly defined by including the sampling weight:
where the coefficients can be evaluated to provide the coefficients for paths *b* and , *j* = 1, …, *k* − 1. Specifically, the coefficients , *j* = 1, …, *k* − 1, correspond to the paths from *k*-1 dummy variables created for the exposure, *X*_{1}, *X*_{2}, …, *X*_{k−1}, to the outcome *Y*, and *b* corresponds to the path from the mediator *T* to the outcome *Y*. The sampling weight *w*_{i} is included to account for the sampling mechanism of case-control study design. Such weighting strategy of using inverse disease prevalence is well-established [18,52–58].

To assess the overall *IE* and *DE*, we first estimate the *IE* and *DE* for each of the dummy-coded exposure variables *X*_{j}, *j* = 1, …, *k* − 1, as described above, *IE*_{j_versus_0} and *DE*_{j_versus_0}:
and

Based on the estimated AFT model error distribution for the mediator and the estimated coefficients and , the overall *IE* and *DE* of the exposure *X* for the mediation model are calculated using Eqs (1) and (2).

The study and data use were approved by US National Institute of Health and The University of Texas MD Anderson Cancer Center through Material Transfer Agreement (MTA ID: 00016197). The data of the motivating study was downloaded from dbGaP (phs000209.v13.p3) [59] for the Multi-Ethnic Study of Atherosclerosis (MESA) cohort study. The MESA study was approved by the Institutional Review Board at each site of the study and informed consent was obtained from each participant [60].

## Simulation

### Simulation approach

To examine the performance of the proposed overall measures of *IE*, *DE*, and *TE*, we conducted simulations for a mediation model with a categorical exposure, where the mediator is subject to right censoring.

#### Binary outcome.

To mimic the motivation study, we first considered a case-control study in the simulations, with a binary outcome and an additive genetic variant (SNP) as the exposure. We used the robust estimating approach by Wang et al. [18] to estimate the parameters under the mediation model. For each individual *i*, the genotype of the genetic variant *x*_{i} (exposure) was generated with the use of the genotypes’ frequencies assuming the genetic variant is in Hardy-Weinberg proportion. We assumed a genetic variant with a minor allele frequency (MAF) of 0.1, 0.3 and 0.5. The genotypic frequencies could be calculated accordingly. For example, when the MAF was 0.3, the genotypic frequencies were 0.49, 0.42, and 0.09 for the three genotypes *rr*, *rR*, and *RR*, respectively. Given the exposure *x*_{i}, the censored mediator *t*_{i} was generated using an AFT model log(*t*_{i}) = *a*_{0} + *ax*_{i} + *ε*_{t}, where *ε*_{t} ~ Normal(0, 1) and the coefficients were set as *a*_{0} = 6 and *a*_{0} = 0 or 0.4. The right-censoring time *c*_{i} was generated independently from the uniform distributions using different intervals to create different censoring percentages, ~20% and ~40%. The observed censored variable *m*_{i} and indicator *δ*_{i} for each individual were then obtained as *m*_{i} = min(*t*_{i}, *c*_{i}) and *δ*_{i} = *I*(*t*_{i} ≤ *c*_{i}). Conditioned on the values of *x*_{i} and *t*_{i}, the outcome *y*_{i} was generated using the logistic regression model, where the coefficients were set to be *b* = 0 or 0.4, and . The intercept coefficient *b*_{0} was set to various values to reflect different disease prevalence of ~10% and ~30%. In this way, we simulated a large amount of data on the population, from which we randomly sampled same numbers of cases and controls based on the outcome status. In particular, we randomly sampled 500 cases and controls for the scenarios where MAF = 0.3 and 0.5; while sampled 2000 cases and controls for the scenarios where MAF = 0.1 to ensure the sufficient sample size in *RR* genotype category, thereby, producing stable estimations of the regression coefficients.

Note that in the simulations, we employed *X* = (0, 1, 2) corresponding to the genotypes (*rr*, *rR*, *RR*). In this situation, the regression coefficients, *a* and , are interpreted as the per-allele effect, which corresponds to the effect of each copy of the deleterious allele *R*. When analyzing the mediation model, as described in the Methods section, we created two dummy variables, *X*_{1} = (0, 1, 0) and *X*_{2} = (0, 0, 1) using the genotype *rr* as the reference group. Given such a coding, it is straightforward to derive that *a*_{1} = *a*, *a*_{2} = 2*a*, , and . That is, *a*_{1} = 0 or 0.4; *a*_{2} = 0 or 0.8; ; and . We report the estimates for *a*_{1}, *a*_{2}, , and for the simulation studies.

For each of the MAF values (i.e., 0.1, 0.3, 0.5), when either coefficient *a* = 0 or *b* = 0, we considered twelve scenarios for which there is no *IE* through the mediator, with respect to different censoring percentages (~20 or 40%) and different disease prevalence (~10 or 30%). When both *a* and *b* are non-zero (i.e., *a* = *b* = 0.4), we considered four scenarios for which there is an *IE* through the mediator with respect to different censoring percentages and disease prevalence. For different scenarios, the theoretical true values of *IE*s and *PM*s were calculated using the prespecified parameters and prespecified normal distribution for the conditional probability of the mediator. In our estimating procedure to calculate the empirical *IE*s and *PM*s, the nonparametric Kaplan-Meier estimator of the censored residuals was used to assess the conditional probability of the mediator. To test the significance of the *IE*s and *PM*s, the bias-corrected and accelerated (BCa) bootstrap approach [61] was employed to determine the confidence intervals (CIs) for the *IE*s and *PM*s [17,18].

#### Continuous outcome.

When investigating the performance of the proposed overall measures for a mediation model with a continuous outcome, a categorical exposure (genetic variant) and a censored mediator, we generated the exposure *X* and the mediator *T* similarly as described above. We still assumed different MAFs for the genetic variant (i.e., 0.1, 0.3, 0.5) and different censoring percentages for the mediator (i.e., ~20%, ~40%). For each individual *i*, given the values of the exposure *x*_{i} and the mediator *t*_{i}, the outcome *y*_{i} was generated using the linear regression model. The coefficients were set to be *b*_{0} = 1, *b* = 0 or 0.4, and . We randomly generated 1000 samples for the scenarios where MAF = 0.3 and 0.5; and 4000 samples for the scenarios where MAF = 0.1. For each of the MAF values (i.e., 0.1, 0.3, 0.5), there were six scenarios for which there is no *IE* through the mediator (i.e., *a* = 0 or *b* = 0); and two scenarios for which there is an *IE* through the mediator (i.e., *a* = *b* = 0.4), with respect to different censoring percentages of the mediator.

### Simulation results

#### Binary outcome.

Tables 1 and 2 report the simulation results for all the scenarios assuming a binary outcome in a case-control study and an additive SNP with MAF = 0.3, including the scenarios for which there is no *IE* through the mediator (top panel) and those for which there is an *IE* through the mediator (bottom panel). In Table 1, we report the means and standard errors of the estimated coefficients for the different paths, *a*_{0}, *a*_{1}, *a*_{2}, *b*_{0}, *b*, , and . As expected, the robust approach provided accurate estimations for all coefficients through different scenarios. As an example of no *IE*, scenario 3, in which the prespecified values were *a*_{0} = 6, *a* = 0 (i.e., *a*_{1} = 0, *a*_{2} = 0), *b*_{0} = -5 (i.e., disease prevalence = ~10%), *b* = 0.4, and (i.e., , ), the estimated values were *a*_{0} = 5.9995, *a*_{1} = 0.0019, *a*_{2} = 0.0080, *b*_{0} = -5.0064, *b* = 0.3984, and , respectively, which were close to the prespecified parameter values in the simulation model. Similarly, under the scenarios where there is an *IE* through the mediator, all parameters were accurately estimated. For example, for scenario 15, in which the prespecified values were *a*_{0} = 6, *a* = 0.4 (i.e., *a*_{1} = 0.4, *a*_{2} = 0.8), *b*_{0} = -3.7 (i.e., disease prevalence = ~30%), *b* = 0.4, and (i.e., , ), the estimated values were *a*_{0} = 6.0043, *a*_{1} = 0.3985, *a*_{2} = 0.7809, *b*_{0} = -3.6902, *b* = 0.3975, and , respectively, which were close to the prespecified parameter values in the simulation model.

Table 2 report the means and standard errors of the estimated *IE*s and *PM*s (for the scenarios with significant *IE*s) for MAF = 0.3, obtained using the overall measures proposed in the study, and the coverage probabilities of the 95% CIs of the *IE*s and *PM*s. When estimating the *IE*s and *PM*s, our overall measures provided accurate estimations for all scenarios. For example, for scenario 16, when the theoretical *IE* and *PM* were respectively 0.043 and 0.268, the estimated values obtained using our approach were 0.0430 and 0.2767, which were close to the theoretical values. The 95% coverage probabilities for the *IE* and *PM*, based on the proposed approach, were close to the nominal value of 0.95. The proposed measures were practically not impacted by different disease prevalence values (~10 or 30%) and censoring percentages (~20 or 40%).

For the other scenarios where the MAFs of the genetic variant (exposure) were 0.1 and 0.5, the simulation results are reported in the online S1 and S2 Tables, respectively. Similar results were observed. The proposed approach provided accurate estimations for all coefficients of different paths, as well as *IE*s and *PM*s, through various scenarios. It is worth to note that a relatively small MAF for the genetic variant (e.g., 10%) would affect the parameter estimations for the paths *a*_{2}, and , which were particularly pronounced when sample size was smaller (e.g., 500 cases and controls; data not shown). This is not surprising to observe because in this situation, the expected frequency of the genotype *RR* is only 1%, resulting in a very small number of samples in this category. Larger sample sizes can help to ensure the accurate estimations of these parameters. Therefore, when MAF = 0.1, we increased the sample size to 2000 cases and 2000 controls (or 4000 samples for the continuous outcome) in the simulation studies.

#### Continuous outcome.

Tables 3 and 4 report the simulation results for all the scenarios assuming a continuous outcome and an additive SNP with MAF = 0.3. Similar to the scenarios with a binary outcome, the proposed approach provided accurate estimations for all coefficients, *IE*s and *PM*s, for mediation models with a continuous outcome, regardless of different values of *a*, *b*, and censoring percentage. For example, for scenario 8, where the prespecified values for simulation were *a*_{0} = 6, *a* = 0.4 (i.e., *a*_{1} = 0.4, *a*_{2} = 0.8), *b*_{0} = 1, *b* = 0.4, and (i.e., , ), the estimated values were *a*_{0} = 5.9954, *a*_{1} = 0.4004, *a*_{2} = 0.7979, *b*_{0} = 1.0084, *b* = 0.3990, , and , respectively, which were close to the prespecified parameter values (Table 3). Meanwhile, the estimated overall *IE* and *PM* were 0.1877 and 0.2433, respectively, which were close to the theoretical values of 0.188 and 0.242 (Table 4). The 95% coverage probabilities for the *IE* and *PM* were 0.948 and 0.944 and both were close to the nominal value of 0.95. The simulation results for scenarios where MAF = 0.1 and 0.5 are reported in the online S3 and S4 Tables, respectively. As in the binary outcome scenarios, we increased the sample size to 4000 when MAF = 0.1 for the genetic variant.

## Application to the motivation study

We applied the proposed overall measures for *IE*, *DE*, and *TE* for the mediation analysis to the data from a genetic case-control study of type 2 diabetes downloaded from dbGaP [59], relating to the Multi-Ethnic Study of Atherosclerosis (MESA) cohort study. The conceptual mediation model is shown in Fig 1, where the genetic variant is the exposure (*X*), the age at menopause is the mediator (*T*), and type 2 diabetes status is the outcome variable (*Y*).

There were 47,871 genetic variants from 2,956 women included in the mediation analysis. A woman’s age at menopause was censored if she had not gone through menopause, and the censoring percentage was ~14.5%. Assuming an additive genetic model for all the genetic variants, we conducted the association analyses of genetic variants with a woman’s age at menopause (path *a*) as well as with type 2 diabetes status (paths *b* and ). In particular, when assessing the association between a woman’s age at menopause and a genetic variant with type 2 diabetes (paths *b* and ), we used logistic regression model, where type 2 diabetes status was the dependent variable and the genetic variant and age at menopause were the predictors. We included age and ethnicity as covariates in the logistic regression model. When assessing the association between a genetic variant with a woman’s age at menopause (path *a*), we used the AFT model, where age at menopause was the dependent variable and the genetic variant was the predictor. Ethnicity was adjusted as a covariate in the AFT model. There were four categories for the ethnicity in the MESA data, including White, Caucasian; Black, African-American; Chinese American; and Hispanic. Ethnicity was considered as a categorical variable in the analysis where White, Caucasian was used as the reference category, resulting in three related coefficients in the AFT model (, and ) and the logistic regression model (*γ*_{1}, *γ*_{2} and *γ*_{3}). Age was considered as a continuous variable, resulting in one coefficient (*γ*_{4}) in the logistic regression model.

For the purpose of demonstration, we considered a threshold of 0.005 to identify top variants. Our approach identified three variants—rs12744291, rs2503182, and rs11771343—associated with both type 2 diabetes and age at menopause to be included in the mediation analysis. Specifically, the *p*-values were 1.27×10^{−3}, 2.56×10^{−3}, and 1.43×10^{−3} for rs12744291, rs2503182, and rs11771343, respectively, for their association with type 2 diabetes and 5.53×10^{−4}, 4.01×10^{−3}, and 4.12×10^{−3}, respectively, for their association with age at onset of menopause. The top and middle panels of Table 5 list the estimations for all the coefficients for the AFT model and logistic regression model, respectively, for the three top genetic variants. Consider SNP rs12744291 as an example, in the AFT model where age at menopause was the dependent variable, the estimated coefficients were *a*_{1} = 0.8768 and *a*_{2} = 1.4933 for the SNP (path *a*); and , and for ethnicity. In the logistic regression model where type 2 diabetes was the dependent variable, the estimated coefficients were *b* = -0.0090, , for the age at menopause and SNP (paths *b* and ); and *γ*_{1} = 1.4437, *γ*_{2} = 1.4859, *γ*_{3} = 1.7071, and *γ*_{4} = 0.0305 for ethnicity and age.

The bottom panel of Table 5 reports the overall *IE*s, *DE*s, *TE*s, and 95% CIs obtained from the mediation analysis of the three genetic variants, a woman’s age at menopause, and type 2 diabetes. BCa bootstrapping was used to assess the CIs for *IE*s as in the simulation studies. The overall *IE*s for the three genetic variants, rs12744291, rs2503182, and rs11771343, were reported as -0.0007, -0.0004, and 0.0005, respectively; and the 95% CIs of *IE*s for all three genetic variants include zero. These results suggest no statistically significant mediating effect of the age at menopause on the association between the three variants and type 2 diabetes risk.

## Discussion

In this study, we proposed overall measures to calculate the *IE*, *DE*, and *TE* for a single censored mediator model involving a categorical exposure. Specifically, we defined the *IE*, *DE*, and *TE* for each of the categories for the exposure first and then assessed the overall *IE*, *DE*, and *TE* of the exposure accounting for the frequencies of different categories of the categorical exposure variable.

Compared with the traditional approach for a multi-categorical exposure, the proposed measure has several advantages. First, it provides an overall *IE*, *DE*, and *TE* of the mediation model from the exposure, instead of relative *IE*, *DE*, and *TE* as described in previous studies. Second, it avoids the multiple testing issue caused by recoding the multi-categorical exposure into multiple binary exposure variables. We did not compare the proposed approach of handling categorical exposure variable with the one used in Wang et al. [18] because their approach is limited to binary exposure only.

We demonstrated the performance of proposed overall measures with simulation studies for the mediation model with a binary outcome or a continuous outcome and a right-censored mediator. Note that such measures are general and robust and can be employed regardless of whether the outcome variable is continuous or binary and the mediator is censored or not. We also investigated the performance of the proposed overall measures for mediation models in the presence of covariates using simulations (online S5 Table). In particular, we considered a mediation model with a binary outcome. Without loss of generality, we fixed the MAF at 0.3, the censoring percentage at ~20% and the disease prevalence at ~10%. We followed the same procedure as described in the Simulation section to generate data. In addition to the exposure, mediator and outcome, we generated a continuous covariate *Z*~Normal(0, 0.5^{2}), which was associated with both the mediator *T* and outcome *Y*. Based on the simulation results, we observed accurate estimations for all the coefficients, as well as *IE* and *PM*. For example, for the scenario 4 in S5 Table, the estimates of *IE* and *PM* were 0.0223 and 0.3000, respectively, which were close to the theoretical values of 0.022 and 0.309. The corresponding coverage probabilities were 0.939 and 0.944, respectively, which were close to the nominal value of 0.95. These results show that the proposed measures are robust even in presence of covariates.

Furthermore, in practice, one may encounter censored data for both outcome variable (e.g., time to onset of disease) and mediator. The approach, using the semiparametric AFT model combined with a pseudo-likelihood function, can be extended to such a mediation model where the outcome variable is also censored. Particularly, one can revise the pseudo-likelihood function to accommodate the survival component. Survival regression models, such as the commonly used Weibull regression model [62], may be employed to address this issue. However, the development of such extension is not straightforward and will need further investigation.

We applied the overall measures of *IE*, *DE*, and *TE* to the motivation study of genetic variants, a woman’s age at menopause, and type 2 diabetes risk. Assuming the additive genetic model for the genetic variants, we identified three variants, rs12744291, rs2503182, and rs11771343, to be included in the mediation analysis because they were associated with both the mediator (i.e., a woman’s age at menopause) and the outcome (i.e., type 2 diabetes status). The results from the mediation analysis showed that a woman’s age at menopause had no mediating effect on the effect of the three genetic variants on type 2 diabetes risk.

Important assumptions required for the mediation analysis have been discussed previously [17,18]. The sensitivity analysis for the assumptions about unmeasured confounders for the derivations of *IE* and *DE* have been extensively conducted and discussed for the motivation study in our previous study [18]. In addition to the “no-unmeasured-confounder” assumptions, we assumed that the mediation model was accurately specified and there were no measurement errors for all the variables in the mediation model. Specifically, for our real data application, we conceptualized the mediation model based on the literature, including the causal orders and causal directions [25–35], and assumed that all the variables, including the exposure, mediator, outcome, and covariates, had no measurement errors.

Besides the assumptions for the mediation analysis, for the parameter estimation approach using the semiparametric AFT model, we assumed that the censoring process for the mediator was independent of the mediator *T*, exposure *X*, outcome *Y*, and covariates *Z*. Sensitivity analysis was conducted previously and showed some degree of robustness for the approach to the violation of the independence assumption [18]. We used the AFT model to relate the exposure to the mediator because it provides the change in the length of survival time as a function of the effect of the exposure, which has an easy way to interpret in the mediation context [17,18]. Other semiparametric survival models, such as the most popularly used Cox proportional model, could be adapted in the mediation model with a censored mediator. However, such adaptation is not straightforward. For example, the measure of effect for the Cox proportional model is the hazard ratio. In such a case, the mediating effect could be difficult to be interpreted because it is the survival time but not the hazard ratio to be expected to have causal effect on the outcome variable. Using other semiparametric survival model in the mediation analysis is of interest to investigate; however, future work is necessary for the derivation of the *IE*, *DE* and *TE* so that these effects can be appropriately interpreted in the mediation context.

Parametric (linear and logistic regressions) and semiparametric approaches (semiparametric AFT model) were employed in the estimation of coefficients for different paths in the mediation model, which usually rely on certain modeling assumptions in one way or another [63–65]. Alternatively, non-parametric approaches to mediation analysis, which has been received a great deal of attention recently, could be considered [63,66–68]. Extension of the current proposed approach for mediation model, where the mediator is censored and the exposure is categorical, to the non-parametric framework will be worthy of future research.

## Supporting information

### S1 Table. Binary outcome: Simulation results given the minor allele frequency (MAF) = 0.1.

Means and standard errors of estimated coefficients for different paths, indirect effects (*IEs*) and proportions of the total effect mediated (*PMs*); and coverage probabilities of the 95% confidence intervals for the estimations of *IE* and *PM*, obtained based on 500 replicates, each with 2000 cases and 2000 controls, given the minor allele frequency (MAF) = 0.1. Different scenarios were considered based on different values of *a*, *b*_{0}, *b*, censoring percentage (*CP*), and disease prevalence (prev), with *a*_{0} = 6 and .

https://doi.org/10.1371/journal.pone.0257628.s001

(XLSX)

### S2 Table. Binary outcome: Simulation results given the minor allele frequency (MAF) = 0.5.

Means and standard errors of estimated coefficients for different paths, indirect effects (*IEs*) and proportions of the total effect mediated (*PMs*); and coverage probabilities of the 95% confidence intervals for the estimations of *IE* and *PM*, obtained based on 500 replicates, each with 500 cases and 500 controls, given the minor allele frequency (MAF) = 0.5. Different scenarios were considered based on different values of *a*, *b*_{0}, *b*, censoring percentage (*CP*), and disease prevalence (prev), with *a*_{0} = 6 and .

https://doi.org/10.1371/journal.pone.0257628.s002

(XLSX)

### S3 Table. Continuous outcome: Simulation results given the minor allele frequency (MAF) = 0.1.

Means and standard errors of estimated coefficients for different paths, indirect effects (*IEs*) and proportions of the total effect mediated (*PMs*); and coverage probabilities of the 95% confidence intervals for the estimations of *IE* and *PM*, obtained based on 500 replicates, each with 4000 samples, given the minor allele frequency (MAF) = 0.1. Different scenarios were considered based on different values of *a*, *b*, and censoring percentage (*CP*), with *a*_{0} = 6, *b*_{0} = 1, and .

https://doi.org/10.1371/journal.pone.0257628.s003

(XLSX)

### S4 Table. Continuous outcome: Simulation results given the minor allele frequency (MAF) = 0.5.

Means and standard errors of estimated coefficients for different paths, indirect effects (*IEs*) and proportions of the total effect mediated (*PMs*); and coverage probabilities of the 95% confidence intervals for the estimations of *IE* and *PM*, obtained based on 500 replicates, each with 1000 samples, given the minor allele frequency (MAF) = 0.5. Different scenarios were considered based on different values of *a*, *b*, and censoring percentage (*CP*), with *a*_{0} = 6, *b*_{0} = 1, and .

https://doi.org/10.1371/journal.pone.0257628.s004

(XLSX)

### S5 Table. Binary outcome: Simulation results for a mediation model in presence of a covariate.

Means and standard errors of estimated coefficients for different paths, indirect effects (*IEs*) and proportions of the total effect mediated (*PMs*); and coverage probabilities of the 95% confidence intervals for the estimations of *IE* and *PM*, obtained based on 500 replicates, each with 1000 cases and 1000 controls, given the minor allele frequency (MAF) = 0.3. Different scenarios were considered based on different values of *a*, *b*_{0}, and *b*, with *a*_{0} = 6, , , censoring percentage (*CP*) of ~20%, and disease prevalence (prev) of ~10%.

https://doi.org/10.1371/journal.pone.0257628.s005

(XLSX)

### S1 Appendix. Derivations of indirect, direct and total effects.

https://doi.org/10.1371/journal.pone.0257628.s006

(DOCX)

## Acknowledgments

We thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org. Editorial support was provided by Mr. Bryan Tutt in Scientific Publications Services, Research Medical Library.

## References

- 1. Wang J, Spitz MR, Amos CI, Wilkinson AV, Wu X, Shete S. Mediating effects of smoking and chronic obstructive pulmonary disease on the relation between the CHRNA5-A3 genetic locus and lung cancer risk. Cancer. 2010;116(14):3458–62. pmid:20564069.
- 2. Wang J, Spitz MR, Amos CI, Wu X, Wetter DW, Cinciripini PM, et al. Method for evaluating multiple mediators: mediating effects of smoking and COPD on the association between the CHRNA5-A3 variant and lung cancer risk. Plos One. 2012;7(10):e47705. Epub 2012/10/19. pmid:23077662.
- 3. VanderWeele TJ, Asomaning K, Tchetgen Tchetgen EJ, Han Y, Spitz MR, Shete S, et al. Genetic variants on 15q25.1, smoking, and lung cancer: an assessment of mediation and interaction. Am J Epidemiol. 2012;175(10):1013–20. pmid:22306564.
- 4. Rojo C, Zhang Q, Keles S. iFunMed: Integrative functional mediation analysis of GWAS and eQTL studies. Genet Epidemiol. 2019;43(7):742–60. pmid:31328826.
- 5. Bi X, Yang L, Li T, Wang B, Zhu H, Zhang H. Genome-wide mediation analysis of psychiatric and cognitive traits through imaging phenotypes. Hum Brain Mapp. 2017;38(8):4088–97. pmid:28544218.
- 6. Shan N, Wang Z, Hou L. Identification of trans-eQTLs using mediation analysis with multiple mediators. BMC bioinformatics. 2019;20(Suppl 3):126. pmid:30925861.
- 7. Parker MM, Lutz SM, Hobbs BD, Busch R, McDonald MN, Castaldi PJ, et al. Assessing pleiotropy and mediation in genetic loci associated with chronic obstructive pulmonary disease. Genet Epidemiol. 2019;43(3):318–29. pmid:30740764.
- 8. Lange T, Hansen JV. Direct and indirect effects in a survival context. Epidemiology. 2011;22(4):575–81. pmid:21552129.
- 9.
Luo P, Geng Z. Causal mediation analysis for survival outcome with unobserved mediator-outcome confounders. Comput Stat Data An. 2015.
- 10. Tchetgen Tchetgen EJ. On causal mediation analysis with a survival outcome. International Journal of Biostatistics. 2011;7(1):Article 33. pmid:22049268.
- 11.
Tein JY, MacKinnon DP. Estimating mediated effects with survival data. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman JJ, editors. New Developments in Psychometrics. Tokyo: Springer; 2003. p. 405–12.
- 12. VanderWeele TJ. Causal mediation analysis with survival data. Epidemiology. 2011;22(4):582–5. pmid:21642779.
- 13. Huang YT, Yang HI. Causal Mediation Analysis of Survival Outcome with Multiple Mediators. Epidemiology. 2017;28(3):370–8. pmid:28296661.
- 14. Lin SH, Young JG, Logan R, VanderWeele TJ. Mediation analysis for a survival outcome with time-varying exposures, mediators, and confounders. Stat Med. 2017;36(26):4153–66. pmid:28809051.
- 15. Fasanelli F, Giraudo MT, Ricceri F, Valeri L, Zugna D. Marginal Time-Dependent Causal Effects in Mediation Analysis With Survival Data. Am J Epidemiol. 2019;188(5):967–74. pmid:30689682.
- 16. Wakai K, Seki N, Tamakoshi A, Kondo T, Nishino Y, Ito Y, et al. Decrease in risk of lung cancer death in males after smoking cessation by age at quitting: findings from the JACC study. JPN J Cancer Res. 2001;92(8):821–8. pmid:11509112.
- 17. Wang J, Shete S. Estimation of indirect effect when the mediator is a censored variable. Stat Methods Med Res. 2017:962280217690414. pmid:28132585.
- 18. Wang J, Ning J, Shete S. Mediation analysis in a case-control study when the mediator is a censored variable. Stat Med. 2019;38(7):1213–29. pmid:30421436.
- 19. Zhao F, Song M, Wang Y, Wang W. Genetic model. J Cell Mol Med. 2016;20(4):765. pmid:26762596.
- 20. Gaye A, Davis SK. Genetic model misspecification in genetic association studies. BMC Res Notes. 2017;10(1):569. pmid:29115983.
- 21. So HC, Sham PC. Robust association tests under different genetic models, allowing for binary or quantitative traits and covariates. Behav Genet. 2011;41(5):768–75. pmid:21305351.
- 22. Min JL, Taylor JM, Richards JB, Watts T, Pettersson FH, Broxholme J, et al. The use of genome-wide eQTL associations in lymphoblastoid cell lines to identify novel genetic pathways involved in complex traits. Plos One. 2011;6(7):e22070. pmid:21789213.
- 23. Hayes AF, Preacher KJ. Statistical mediation analysis with a multicategorical independent variable. Br J Math Stat Psychol. 2014;67(3):451–70. pmid:24188158.
- 24. Samoilenko M, Arrouf N, Blais L, Lefebvre G. Comparing two counterfactual-outcome approaches in causal mediation analysis of a multicategorical exposure: An application for the estimation of the effect of maternal intake of inhaled corticosteroids doses on birthweight. Stat Methods Med Res. 2020;29(10):2767–82. pmid:32200753.
- 25. Almind K, Doria A, Kahn CR. Putting the genes for type II diabetes on the map. Nat Med. 2001;7(3):277–9. pmid:11231616.
- 26. Cornelis MC, Hu FB. Gene-environment interactions in the development of type 2 diabetes: recent progress and continuing challenges. Annu Rev Nutr. 2012;32:245–59. pmid:22540253.
- 27. Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536(7614):41–7. pmid:27398621.
- 28. Lyssenko V, Laakso M. Genetic screening for the risk of type 2 diabetes: worthless or valuable? Diabetes care. 2013;36 Suppl 2:S120–6. pmid:23882036.
- 29. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, Steinthorsdottir V, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet. 2012;44(9):981–90. pmid:22885922.
- 30. Willemsen G, Ward KJ, Bell CG, Christensen K, Bowden J, Dalgard C, et al. The concordance and heritability of type 2 diabetes in 34,166 twin pairs from international twin registers: the discordant twin (DISCOTWIN) consortium. Twin Res Hum Genet. 2015;18(6):762–71. pmid:26678054.
- 31. Brand JS, van der Schouw YT, Onland-Moret NC, Sharp SJ, Ong KK, Khaw KT, et al. Age at menopause, reproductive life span, and type 2 diabetes risk: results from the EPIC-InterAct study. Diabetes care. 2013;36(4):1012–9. pmid:23230098.
- 32. LeBlanc ES, Kapphahn K, Hedlin H, Desai M, Parikh NI, Liu S, et al. Reproductive history and risk of type 2 diabetes mellitus in postmenopausal women: findings from the Women’s Health Initiative. Menopause. 2017;24(1):64–72. Epub 2016/07/29. pmid:27465714.
- 33. Malacara JM, Huerta R, Rivera B, Esparza S, Fajardo ME. Menopause in normal and uncomplicated NIDDM women: physical and emotional symptoms and hormone profile. Maturitas. 1997;28(1):35–45. pmid:9391993.
- 34. Day FR, Ruth KS, Thompson DJ, Lunetta KL, Pervjakova N, Chasman DI, et al. Large-scale genomic analyses link reproductive aging to hypothalamic signaling, breast cancer susceptibility and BRCA1-mediated DNA repair. Nat Genet. 2015;47(11):1294–303. pmid:26414677.
- 35. Laven JS. Genetics of Early and Normal Menopause. Semin Reprod Med. 2015;33(6):377–83. pmid:26569518.
- 36. Ao W, Calonico S, Lee YY. Multivalued Treatments and Decomposition Analysis: An Application to the WIA Program. J Bus Econ Stat. 2021;39(1):358–71.
- 37. Linden A, Uysal SD, Ryan A, Adams JL. Estimating causal effects for multivalued treatments: a comparison of approaches. Stat Med. 2016;35(4):534–52. pmid:26482211
- 38. Cattaneo MD. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. J Econometrics. 2010;155(2):138–54.
- 39. Swindell WR. Accelerated failure time models provide a useful statistical framework for aging research. Exp Gerontol. 2009;44(3):190–200. pmid:19007875.
- 40.
Collett D. Modeling Survival Data in Medical Research. Boca Raton, FL: CRC Press; 2003.
- 41. Nelder JA, Mead R. A simplex-method for function minimization. Comput J. 1965;7(4):308–13.
- 42. Genomes Project C, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. pmid:23128226.
- 43.
Pearl J, editor Direct and indirect effects. Proceedings of the Seventeenth Conference on Uncertainty and Artificial Intelligence; 2001; San Francisco, CA: Morgan Kaufmann; 2001.
- 44. Vanderweele TJ, Vansteelandt S. Odds ratios for mediation analysis for a dichotomous outcome. Am J Epidemiol. 2010;172(12):1339–48. pmid:21036955.
- 45. VanderWeele TJ, Tchetgen Tchetgen EJ. Mediation analysis with time varying exposures and mediators. Journal of the Royal Statistical Society Series B, Statistical methodology. 2017;79(3):917–38. pmid:28824285.
- 46. Huang YT, Pan WC. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics. 2016;72(2):402–13. pmid:26414245.
- 47. Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Stat Sci. 2010;25(1):51–71.
- 48. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–55. pmid:1576220.
- 49. Holland PW. Causal inference, path analysis and recursive structural equations models. ETS Research Report Series. 1988;1988(1):i–50.
- 50. Pearl J. Interpretation and identification of causal mediation. Psychol Methods. 2014;19(4):459–81. pmid:24885338.
- 51.
Kenny DA. Mediation. 2018 [cited 2020 March 15]. http://davidakenny.net/cm/mediate.htm.
- 52. Prentice RL, Pyke R. Logistic Disease Incidence Models and Case-Control Studies. Biometrika. 1979;66(3):403–11.
- 53. Breslow NE. Statistics in epidemiology: The case-control study. J Am Stat Assoc. 1996;91(433):14–28. pmid:12155399
- 54. Rose S, van der Laan MJ. A Targeted Maximum Likelihood Estimator for Two-Stage Designs International Journal of Biostatistics. 2011;7(1). Artn 17 pmid:21556285
- 55. Hejazi NS, van der Laan MJ, Janes HE, Gilbert PB, Benkeser DC. Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials. Biometrics. 2020. pmid:32949147
- 56. Richardson DB, Rzehak P, Klenk J, Weiland SK. Analyses of case-control data for additional outcomes. Epidemiology. 2007;18(4):441–5. pmid:17473707.
- 57. Wang J, Shete S. Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genet Epidemiol. 2011;35(3):190–200. pmid:21308766.
- 58. Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. International Journal of Biostatistics. 2008;4(1):Article 19. pmid:20231910.
- 59. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007;39(10):1181–6. pmid:17898773.
- 60. Fujiyoshi A, Jacobs DR Jr., Fitzpatrick AL, Alonso A, Duprez DA, Sharrett AR, et al. Coronary Artery Calcium and Risk of Dementia in MESA (Multi-Ethnic Study of Atherosclerosis). Circ Cardiovasc Imaging. 2017;10(5). Epub 2017/05/04. pmid:28465455.
- 61. Efron B. Better Bootstrap Confidence-Intervals. J Am Stat Assoc. 1987;82(397):171–85.
- 62.
Liu X. Survival Analysis: Models and Applications. United Kingdom: John Wiley & Sons; 2012.
- 63.
Cha K, Imai K, Yam SCP, Zhang Z. Efficient nonparametric estimation of causal mediation effects. arXiv:1601.035012016 [cited 2021 March 2]. https://arxiv.org/abs/1601.03501.
- 64. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–88.
- 65. Imai K, Keele L, Tingley D. A General Approach to Causal Mediation Analysis. Psychol Methods. 2010;15(4):309–34. pmid:20954780
- 66.
Benkeser D. Nonparametric inference for interventional effects with multiple mediators. arXiv:2001.060272020 [cited 2021 March 2]. https://arxiv.org/abs/2001.06027.
- 67. Diaz I, Hejazi NS. Causal mediation analysis for stochastic interventions. J R Stat Soc B. 2020;82(3):661–83.
- 68. Diaz I, Hejazi NS, Rudolph KE, van der Laan MJ. Non-parametric efficient causal mediation with intermediate confounders. Biometrika. 2020;asaa085:1–16.