Theories behind propensity score analysis assume that the covariates are fully observed (Rosenbaum & Rubin, 1983, 1984). However, in practice, observational analyses require large administrative databases or surveys, which inevitably will have missingness in the covariates. The response patterns of people with missing covariates may be different than those of people with observed data (Mohan, Pearl, & Tian, 2013). Therefore, ways to handle missing covariate data need to be examined. The basic estimation of propensity scores using logistic regression will delete cases with missing data, which can be problematic as it can cause bias in the treatment effect estimates (Baraldi & Enders, 2010).
Missing Data Methods in Propensity Score Analysis
Below I explain three major methods used in the applied propensity score analysis literature when \(X\) is not fully observed. I also explain three other methods to handle missing data that are not commonly used in applied literature but have been proposed theoretically. I also describe the assumptions about missing data and strong ignorability underlying each of the methods. Let \(X_{obs}\) indicate the observed parts of \(X\) and \(X_{mis}\) indicate the missing parts of \(X\). \(D\) indicates the fully observed treatment indicator and \(Y\) indicates a fully observed outcome variable.
Complete Case Analysis
This approach deletes cases with missing data in any of the variables used in the analysis (Baraldi & Enders, 2010; Hill, 2004). The traditional propensity score estimation method of using logistic regression implements complete case analysis by default. Therefore, this method is commonly used in applied research. The data that remains after deleting cases with missing data are assumed to be a simple random sample of the full data (Baraldi & Enders, 2010). Missingness is not related to any study variables nor to the hypothetically complete values of itself. According to Hill (2004), the assumption underlying complete case analysis is that the joint distributions of \(X_{obs}\) and \(X_{mis}\) are same across the two treatment conditions: \[\begin{equation} X_{obs}, X_{mis} \perp\!\!\!\perp D \end{equation}\] Therefore, an unbiased causal effect estimate can be retrieved after deleting cases with missing data. Such an assumption is very stringent and unlikely to be met in the types of data required for propensity score analyses (Baraldi & Enders, 2010; Hill, 2004). As mentioned above, deleting cases can also result in loss of power. Additionally, whether \(X_{mis}\) is balanced between the treatment groups cannot be confirmed.
Multiple Imputation
Multiple imputation (MI) generates multiple sets of data with the missing values drawn from an imputation model (Mitra & Reiter, 2016; Rubin, 1987). MI will create \(m > 1\) imputed datasets that contain different imputed values (Murray, 2018; van Buuren, 2018). Analyses can be performed on each of the datasets and results from each dataset can be aggregated across to derive a final estimate, standard error, degrees of freedom, and test result. Thus, MI involves two stages: (1) imputation and creation of the \(m\) imputed datasets, and (2) analysis and pooling of estimates across the datasets (Murray, 2018; van Buuren, 2018).
There are two approaches for imputing multivariate missing data: (1) joint modeling, JM, and (2) fully conditional specification, FCS, also called multivariate imputation by chained equations, MICE (Murray, 2018; van Buuren, 2018; van Buuren & Groothuis-Oudshoorn, 2011). JM entails jointly modeling variables with missingness by drawing from a multivariate distribution (Murray, 2018; van Buuren, 2018; van Buuren & Groothuis-Oudshoorn, 2011). FCS entails univariate conditional imputation models of variables with missing data that iteratively condition on all other variables using Monte Carlo Markov chain methods (van Buuren, 2018; van Buuren & Groothuis-Oudshoorn, 2011). JM imputes all variables simultaneously whereas FCS imputes one variable at a time (van Buuren, 2018). Because JM requires specification of a joint distribution for all the variables, it may not be as flexible as FCS when dealing with a large number of covariates with missing data (Akande, Li, & Reiter, 2017). However, FCS is computationally more intensive than JM (van Buuren, 2018). FCS also has been shown to outperform JM for categorical variables and is more robust under mis-specification of imputation model (van Buuren, 2018). Therefore, van Buuren (2018) recommended to use FCS over JM.
If the missingness mechanism is MAR or MCAR and if assumptions underlying the imputation model are correct, MI will yield unbiased results, as it uses the information available in \(X_{obs}\) to impute missing values (Murray, 2018). In the causal inference context, Hill (2004) argued that MI relies on the assumption of latent ignorability, a concept introduced by Frangakis & Rubin (1999). The assumption requires that the treatment assignment mechanism is ignorable given complete covariate data including the values that are latent or missing. These missing values are derived from MI. Below, let \(e_{MI}(X)\) denote propensity scores derived after multiple imputation: \[\begin{equation} X_{obs}, X_{mis} \perp\!\!\!\perp D| e_{MI}(X) \end{equation}\] \[\begin{equation} Y(1), Y(0) \perp\!\!\!\perp D | e_{MI}(X) \end{equation}\] Hill (2004) proposed two different ways to combine propensity scores estimated in each of the m datasets:
Multiple Imputation Across (MI Across)
This approach involves creating m imputed datasets and then estimating propensity scores within each of the datasets and then averaging each unit’s m propensity scores across the m datasets (Hill, 2004). Stratification, matching or IPW can be implemented using these averaged propensity scores (Hill, 2004). Outcome models that include covariates will need to use the weights or strata derived from the averaged propensity scores and the m sets of covariate values. The weighted regression estimates will then need to be pooled.
Multiple Imputation Within (MI Within)
This approach involves creating m imputed datasets and then estimating propensity scores within each of the datasets (Hill, 2004). Instead of averaging the propensity scores across the datasets, this method entails conditioning on the propensity scores within the datasets and running the outcome analyses within each dataset (Hill, 2004). The separate regression estimates have to be pooled.
Generalized Propensity Scores
Rosenbaum & Rubin (1984) proposed the use of generalized propensity scores (GPS) as a way to address missing covariate data. The GPS represents the probability of treatment given observed covariates and missingness indicators (Rosenbaum & Rubin, 1984): \[\begin{equation} e^*(X) = P(D = 1|X_{obs}, R) \end{equation}\] Conditioning on \(e^*(X)\) will balance the treatment groups in terms of the observed covariates and missingness patterns (Rosenbaum & Rubin, 1984). The observed part of \(X\) and the missingness pattern indicators, \(R\), will be independent of treatment assignment given the GPS (Rosenbaum & Rubin, 1984): \[\begin{equation} X_{obs}, R \perp\!\!\!\perp D| e^*(X) \end{equation}\] However, conditioning on GPS will not balance the groups in terms of the unobserved values of \(X\) (Rosenbaum & Rubin, 1984): \[\begin{equation} X_{mis} \not\!\perp\!\!\!\perp D| e^*(X) \end{equation}\] Although this technique of treating missing data is not generally recommended for other types of missing data analyses, it has been recommended for use in propensity score analysis literature (Rosenbaum & Rubin, 1984; Stuart, 2010). In the context of propensity score analysis, this approach does not assume latent ignorability of treatment assignment because legitimate values for missing data are never derived. The assumption underlying this method is that balancing the treatment and control groups on \(X_{obs}\) and \(R\) is a sufficient condition to satisfy ignorability. With the GPS, the treatment and control groups are possibly not going to be balanced in terms of \(X_{mis}\).
For large studies with few missing data patterns, Rosenbaum & Rubin (1984) suggested estimating separate logit models for each missingness pattern. In practice, it is common to encounter many patterns of missing data. For these scenarios, Rosenbaum & Rubin (1984) suggested creating an additional category indicating missingness for categorical variables. For continuous variables, Stuart (2010) recommended imputing missing data with a single arbitrary value, such as the overall mean of the covariate, and then creating a missingness indicator variable. In general missing data analysis context, van Buuren (2018) noted that this method of combining arbitrary (mean) imputation along with missingness indicators can underestimate the standard error of the estimate of interest.
The CART algorithms treat missing data natively as they split missingness as a category itself. In this manner, this approach is similar to the GPS which uses missingness pattern indicators when estimating propensity scores. The missingness categories are used to estimate propensity scores and conditioning on the propensity scores should balance the treatment and control condition in terms of the patterns. However, splitting does not actually impute the missing data so it is plausible to assume that like GPS, scores derived using the splitting method will not balance the groups in terms of the latent missing data. In addition, unlike MI, there are no imputed complete datasets saved to analyze for the outcome model. Therefore, splitting would need to be combined with some other technique for outcome modeling.
Other Methods
The following methods have been discussed theoretically in literature examining missing data methods in propensity score analysis. However, these methods are not commonly used in applied literature.
Complete Variables
This method removes any variable with missing data (Hill, 2004). By removing variables with missing data, the approach assumes that the distribution of those variables (both the observed and missing parts) are the same across the two treatment groups (Hill, 2004). If this assumption does not hold, then this method can result in bias in treatment effect estimates due to removal of important confounding variables (Hill, 2004).
D’Agostino and Rubin Expectation Maximization
Another approach is a method introduced by D’Agostino & Rubin (2000), which estimates propensity scores using an Expectation Conditional Maximization (ECM) algorithm (Hill, 2004). This method, DR, works similar to GPS as it models \(X_{obs}\), \(R\), and the treatment indicator variable. However, instead of imputing \(X_{mis}\), the DR method uses ECM to estimate propensity scores in presence of missing data (Hill, 2004). The assumption underlying DR is that within each missingness pattern defined by \(R\), \(X_{mis}\) is independent of \(D\) given the observed data, \(X_{obs}\) (Hill, 2004): \[\begin{equation} X_{mis} \perp\!\!\!\perp D| X_{obs}, R \end{equation}\] Such independence is sufficient to satisfy the ignorability assumption in presence of missing covariate data. With this method, the assumption cannot be checked, however, as DR does not actually impute the missing values. This method is not readily available in commonly used software like R.
Multiple Imputation Missingness Indicator Pattern Mixutre
Qu & Lipkovich (2009) extended MI by introducing the missingness indicator pattern mixture (MIMP) approach, which is the same as MI but adds \(R\) in the propensity score estimation model. The rationale behind this approach is to use information given by missingness patterns to estimate treatment propensities. The method will assume latent ignorabilty. However, this approach should also balance the treatment group on \(R\) as \(R\) is used to estimate the propensity scores: \[\begin{equation} X_{obs}, X_{mis}, R \perp\!\!\!\perp D| e_{MIMP}(X) \end{equation}\] Qu & Lipkovich (2009) argued that extending MI by adding R to the propensity score estimation accounts for non-ignorability or MNAR (Qu & Lipkovich, 2009; van Buuren, 2018). This method allows missingness itself to provide information on missingness: \[\begin{equation} P(X| X_{obs}, R = 1) \neq P(X| X_{obs}, R = 0) \end{equation}\]