The Magic of Weighting in Causal Effect Estimation
This is a personal summary of some of the material I learned in the Causal Inference course taught by Professor Fan Li. In causal inference, a major benefit of randomized experiments is that randomization balances observed covariates, unobserved covariates, and potential outcomes. If we use $Z$ to represent the treatment variable, $X$ the observed covariates, $U$ the unobserved covariates, and $Y(z)$ the potential outcome under treatment $z$, then the advantage of randomized experiments is that they
- balance observed covariates: $Z\perp X$
- balance unobserved covariates: $Z\perp U$
- balance potential outcomes: $Z\perp (Y(1),Y(0))$
The balance of observed covariates plays an important role in causal effect estimation, since it is the foundation of an extremely useful result called model-free validity for model-based statistics. In randomized experiments, where balance holds, the desired estimator is consistent even if the model assumption is wrong. This is not the case for observational studies, where the balance of observed covariates cannot be guaranteed. There, if the covariate imbalance between the treatment group and the control group is large, model-based results rely heavily on extrapolation in regions with little overlap, which makes them sensitive to model specification. This sensitivity is undesirable, because it is usually quite hard to identify the correct model. Therefore, some mechanism to reduce the imbalance of observed covariates is required.
First of all, we need a metric to quantify the level of imbalance. The most common one is the absolute standardized difference (ASD). There are two forms of ASD. The first is defined as
$$ ASD_1=\bigl\vert \frac{\sum_{i=1}^N X_iZ_i}{\sum_{i=1}^NZ_i} - \frac{\sum_{i=1}^N X_i(1-Z_i)}{ \sum_{i=1}^N (1-Z_i)}\bigr\vert/\sqrt{s_1^2/N_1+ s_0^2/N_0} $$
where $s_z^2$ is the sample variance of the covariate in group $z$ and $N_z$ is the size of group $z$, for $z=0,1$. For a continuous covariate, $ASD_1$ is the absolute value of the standard two-sample $t$-statistic, and the threshold is based on a $t$-test. This definition is reasonable for small samples. However, when the sample size is very large, the denominator shrinks toward 0, so even a negligible difference in means is declared an imbalance. This is undesirable, which motivates the second form of ASD:
$$ ASD_2=\bigl\vert \frac{\sum_{i=1}^N X_iZ_i}{\sum_{i=1}^NZ_i} - \frac{\sum_{i=1}^N X_i(1-Z_i)}{ \sum_{i=1}^N (1-Z_i)}\bigr\vert/\sqrt{s_1^2+ s_0^2 } $$
The only difference between $ASD_1$ and $ASD_2$ is the denominator. For $ASD_2$, the $t$-distribution is not used, and the commonly used threshold is 0.1.
In a causal study, it is good practice to check covariate balance first. When many covariates are imbalanced, some adjustment is needed. There are many options, such as
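To make the sample-size argument concrete, the following sketch simulates a covariate whose group means differ by a fixed, small amount and evaluates both forms at two sample sizes. The data-generating process, the shift of 0.05, and the helper names `asd` and `simulate` are assumptions made for illustration, and the code uses only the Python standard library.

```python
import math
import random


def asd(x, z, form=1):
    """Unweighted ASD of covariate x between groups z == 1 and z == 0.

    form=1 divides by sqrt(s1^2/N1 + s0^2/N0); form=2 by sqrt(s1^2 + s0^2).
    """
    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    m1, m0 = sum(x1) / len(x1), sum(x0) / len(x0)
    # sample variances (divisor N - 1) within each group
    s12 = sum((v - m1) ** 2 for v in x1) / (len(x1) - 1)
    s02 = sum((v - m0) ** 2 for v in x0) / (len(x0) - 1)
    if form == 1:
        denom = math.sqrt(s12 / len(x1) + s02 / len(x0))
    else:
        denom = math.sqrt(s12 + s02)
    return abs(m1 - m0) / denom


def simulate(n, shift=0.05, seed=0):
    """Hypothetical data: x ~ N(shift * z, 1), with alternating group labels."""
    rng = random.Random(seed)
    z = [i % 2 for i in range(n)]
    x = [rng.gauss(shift * zi, 1.0) for zi in z]
    return x, z


for n in (200, 200_000):
    x, z = simulate(n)
    print(n, round(asd(x, z, form=1), 2), round(asd(x, z, form=2), 3))
```

With the underlying mean difference held fixed, the first form grows roughly like $\sqrt{N}$, while the second stays close to the same small value, so only the first eventually crosses its threshold.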
- Stratification
- Matching
- Weighting
and so on. Based on personal experience, I recommend weighting to reduce imbalance. Weighting, as the name suggests, assigns a weight to each unit in the dataset, and balance is then checked on the weighted covariates in the treatment and control groups. Mathematically, the weighted absolute standardized differences have forms similar to the ASDs above. The first is
$$ ASD_1=\frac{\bigl\vert \frac{\sum_{i=1}^N X_iZ_iw_1(X_i)}{\sum_{i=1}^NZ_iw_1(X_i)} - \frac{\sum_{i=1}^N X_i(1-Z_i)w_0(X_i)}{ \sum_{i=1}^N (1-Z_i)w_0(X_i)}\bigr\vert}{\sqrt{s_1^2/N_1+ s_0^2/N_0}} $$
where $s_z^2$ remains the sample variance of the unweighted covariate in group $z$ for $z=0,1$, and the threshold is determined by $t$-distribution. Similarly, the second form is
$$ ASD_2=\frac{\bigl\vert \frac{\sum_{i=1}^N X_iZ_iw_1(X_i)}{\sum_{i=1}^NZ_iw_1(X_i)} - \frac{\sum_{i=1}^N X_i(1-Z_i)w_0(X_i)}{ \sum_{i=1}^N (1-Z_i)w_0(X_i)}\bigr\vert}{\sqrt{s_1^2+ s_0^2}} $$
where the threshold is $0.1$ as before.
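The weighted forms can be written as one small function. This is a minimal sketch, not the code used for the analysis in this post: the name `weighted_asd` and the toy numbers are hypothetical, and `w[i]` is taken to be $w_1(X_i)$ for treated units and $w_0(X_i)$ for control units.

```python
import math


def weighted_asd(x, z, w, form=2):
    """Weighted ASD; w[i] is w_1(X_i) for treated units and w_0(X_i) for controls."""
    def wmean(group):
        num = sum(wi * xi for xi, zi, wi in zip(x, z, w) if zi == group)
        den = sum(wi for zi, wi in zip(z, w) if zi == group)
        return num / den

    def svar(group):
        # unweighted sample variance within the group, as in the text
        vals = [xi for xi, zi in zip(x, z) if zi == group]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / (len(vals) - 1), len(vals)

    (s12, n1), (s02, n0) = svar(1), svar(0)
    denom = math.sqrt(s12 / n1 + s02 / n0) if form == 1 else math.sqrt(s12 + s02)
    return abs(wmean(1) - wmean(0)) / denom


x = [1.0, 2.0, 3.0, 4.0]
z = [1, 1, 0, 0]
print(weighted_asd(x, z, [1, 1, 1, 1]))  # unit weights give the unweighted ASD_2
print(weighted_asd(x, z, [1, 3, 3, 1]))  # these weights pull the group means together
```

Setting every weight to 1 recovers the unweighted definitions, which is a convenient sanity check.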
The usual causal estimand is the average treatment effect (ATE), $\tau=E[Y(1)-Y(0)]$, which in a randomized experiment can be estimated by the difference in group means:
$$ \hat{\tau}=\frac{\sum_{i=1}^N Z_iY_i(1)}{\sum_{i=1}^N Z_i}-\frac{\sum_{i=1}^N (1-Z_i)Y_i(0)}{\sum_{i=1}^N(1-Z_i)} $$
Accordingly, after weighting, the target estimand becomes the weighted average treatment effect (WATE), which can be estimated by
$$ \hat{\tau}=\frac{\sum_{i=1}^N w_1(X_i)Z_iY_i(1)}{\sum_{i=1}^N w_1(X_i)Z_i}-\frac{\sum_{i=1}^N w_0(X_i) (1-Z_i)Y_i(0)}{\sum_{i=1}^N w_0(X_i) (1-Z_i)} $$
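The weighted estimator above can be sketched on simulated data. Note that $Z_iY_i(1)$ and $(1-Z_i)Y_i(0)$ are just the observed outcomes in each group, so the function below works with observed `y`. Everything about the simulation is an assumption made for illustration: a single binary confounder, a known propensity score of 0.8 or 0.2, and a true effect of 2.

```python
import random


def weighted_effect(y, z, w):
    """Weighted difference in mean observed outcomes, with normalized weights."""
    num1 = sum(wi * yi for yi, zi, wi in zip(y, z, w) if zi == 1)
    den1 = sum(wi for zi, wi in zip(z, w) if zi == 1)
    num0 = sum(wi * yi for yi, zi, wi in zip(y, z, w) if zi == 0)
    den0 = sum(wi for zi, wi in zip(z, w) if zi == 0)
    return num1 / den1 - num0 / den0


rng = random.Random(1)
n = 20_000
x = [int(rng.random() < 0.5) for _ in range(n)]    # binary confounder
e = [0.8 if xi else 0.2 for xi in x]               # propensity score, known by construction
z = [int(rng.random() < ei) for ei in e]
# outcome depends on treatment (true effect 2) and on the confounder
y = [2.0 * zi + 3.0 * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]

naive = weighted_effect(y, z, [1.0] * n)
ipw = weighted_effect(y, z, [1 / ei if zi else 1 / (1 - ei) for zi, ei in zip(z, e)])
print(round(naive, 2), round(ipw, 2))
```

In this setup the unweighted difference absorbs the confounder's effect, while the inverse probability weights bring the estimate back near the true value.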
The question then is how to choose appropriate weights for each unit. There is no single answer: weights can be defined in many ways, and different weights target different estimands, such as the ATE or the average treatment effect for the treated (ATT). The commonly used weights are based on the propensity score, which is a probability and can be viewed as a summary statistic of the covariates. The propensity score is defined as
$$ e(X) = Pr(Z=1\mid X)=E(Z\mid X) $$
Since it maps a $p$-dimensional covariate vector to a scalar, the analogy to a summary statistic makes sense. The propensity score $e(X)$ has two useful properties.
Property 1: The propensity score $e(X)$ balances the distribution of all $X$ between the treatment groups: $$Z\perp X\mid e(X)$$ Equivalently, $Pr(Z_i=1\mid X_i,e(X_i))=Pr(Z_i=1\mid e(X_i))$
Property 2: $$\{Y_i(1),Y_i(0)\}\perp Z_i\mid X_i\Rightarrow \{Y_i(1),Y_i(0)\}\perp Z_i\mid e(X_i)$$
It is common to use these properties to check whether a given conditional probability is a valid propensity score. A problem is given at the end of this post for practice. Based on the propensity score, here are some of the weight combinations.
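Property 1 follows from a short argument, sketched here for completeness. Because $e(X)$ is a function of $X$, conditioning on both carries no more information than conditioning on $X$ alone:
$$ Pr(Z=1\mid X,e(X)) = Pr(Z=1\mid X) = e(X) $$
For the other side, the law of iterated expectations gives
$$ Pr(Z=1\mid e(X)) = E[E(Z\mid X)\mid e(X)] = E[e(X)\mid e(X)] = e(X) $$
Both probabilities equal $e(X)$, so $Pr(Z=1\mid X,e(X))=Pr(Z=1\mid e(X))$, which is the statement $Z\perp X\mid e(X)$.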
target population | estimand | weight ($w_1$,$w_0$) |
---|---|---|
combined | ATE | $(\frac{1}{e(x)},\frac{1}{1-e(x)})$ |
treated | ATT | $(1,\frac{e(x)}{1-e(x)})$ |
control | ATC | $(\frac{1-e(x)}{e(x)},1)$ |
overlap | ATO | $(1-e(x),e(x))$ |
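The table translates directly into a lookup. The function name `balancing_weights` is hypothetical, and the sketch takes a single propensity score value rather than an array.

```python
def balancing_weights(e, estimand="ATE"):
    """Return (w1, w0) for a propensity score value e strictly between 0 and 1."""
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    table = {
        "ATE": (1.0 / e, 1.0 / (1.0 - e)),   # combined population (IPW)
        "ATT": (1.0, e / (1.0 - e)),         # treated population
        "ATC": ((1.0 - e) / e, 1.0),         # control population
        "ATO": (1.0 - e, e),                 # overlap population (OW)
    }
    return table[estimand]


for name in ("ATE", "ATT", "ATC", "ATO"):
    print(name, balancing_weights(0.25, name))
```

One thing the formulas make visible: as $e(x)$ approaches 0 or 1, at least one weight in the ATE, ATT and ATC rows grows without bound, whereas both ATO weights stay between 0 and 1.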
If we are interested in ATE and ATO, and use the quality of service dataset to check their performance, the propensity score can be obtained through the following code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import t
import matplotlib.pyplot as plt
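# load the space-separated data file
data_original = pd.read_csv('HW2_data.txt', sep = ' ')
# recode physician group from {1, 2} to {1, 0}: group 1 is treated, group 2 is control
data_original['pg'] = 2 - data_original['pg']
# change categorical variables to K-1 dummy variables
# (dtype = float keeps the dummies numeric; newer pandas versions default to booleans)
data = pd.get_dummies(data_original, columns = ['i_race', 'i_educ', 'i_insu', 'i_seve'], drop_first = True, dtype = float)
# covariates: every column except the outcome (i_aqoc) and the treatment (pg)
X = data[[i for i in data.columns if i not in ['i_aqoc', 'pg']]]
trt = data['pg']
y = data['i_aqoc']
# logistic regression of treatment on all covariates; the fitted probabilities are the propensity scores
x = 'pg ~ ' + '+'.join(X.columns)
PS = smf.logit(x, data).fit(disp=0).predict()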
After getting the propensity scores, we can check the covariate balance of the original dataset by visualizing the propensity scores in the treated group (Physician group 1) and the control group (Physician group 2).
Based on the plot Histogram of the Estimated Propensity Scores, the histograms of the treated group and the control group differ, indicating poor overlap. Thus there is some covariate imbalance between the two groups.
Then we can apply the weighting strategies for the combined population (also called Inverse Probability Weighting) and the overlap population (also called Overlap Weighting) to see whether balance improves. The implementation is
# IPW weights for the combined population (ATE)
w1 = 1/PS
w0 = 1/(1 - PS)
# boolean masks and sizes of the treated and control groups
Z1 = trt == 1
Z0 = trt == 0
N1 = np.sum(Z1)
N0 = np.sum(Z0)
# overlap weights (ATO)
w1_OW = 1 - PS
w0_OW = PS
ASD = []
ASD_IPW = []
ASD_OW = []
for i in range(X.shape[1]):
    # denominator of ASD_1, based on the unweighted group variances
    s12 = np.var(X.iloc[:,i][Z1], ddof = 1)
    s02 = np.var(X.iloc[:,i][Z0], ddof = 1)
    sd = np.sqrt(s12/N1 + s02/N0)
    # original ASD
    diff = abs(X.iloc[:,i][Z1].mean() - X.iloc[:,i][Z0].mean())
    ASD.append(diff/sd)
    # weighted ASD using IPW
    diff = abs(((X.iloc[:,i] * w1)[Z1]).sum()/w1[Z1].sum() -\
               ((X.iloc[:,i] * w0)[Z0]).sum()/w0[Z0].sum())
    ASD_IPW.append(diff/sd)
    # weighted ASD using OW
    diff = abs(((X.iloc[:,i] * w1_OW)[Z1]).sum()/w1_OW[Z1].sum() -\
               ((X.iloc[:,i] * w0_OW)[Z0]).sum()/w0_OW[Z0].sum())
    ASD_OW.append(diff/sd)
The three lists `ASD`, `ASD_IPW` and `ASD_OW` store the original ASD, the weighted ASD based on Inverse Probability Weighting (IPW), and the weighted ASD based on Overlap Weighting (OW), respectively. The comparison can be visualized in a Love plot, shown below.
From the plot Balance between Physician Group 1 and Physician Group 2, we can see that, by the original ASD, the variables `com_t`, `pcs_sd`, `i_race_1`, `i_race_2`, `i_educ_3`, `i_educ_4` and `i_educ_5` are imbalanced relative to the $t$-distribution threshold, which is about 1.98. However, balance improves greatly under weighting with IPW or OW. With IPW, `i_race_2` is the only imbalanced variable. With OW, no variable is imbalanced: as the plot shows, the ASDs under OW are 0 for all covariates. This is not a coincidence but a theorem about OW.
Theorem: When the propensity scores are estimated by maximum likelihood under a logistic regression model, the overlap weights lead to exact balance in the means of any included covariate between treatment and control groups:
$$\frac{\sum_{i=1}^NX_{ij}Z_i(1-\hat{e}_i)}{\sum_{i=1}^NZ_i(1-\hat{e}_i)}=\frac{\sum_{i=1}^NX_{ij}(1-Z_i)\hat{e}_i}{\sum_{i=1}^N(1-Z_i)\hat{e}_i},\text{for }j=1,\dots,p$$
Therefore, overlap weighting is an excellent choice if the goal is to balance the observed covariates between the treatment and control groups, since it eliminates the imbalance in means just like magic! The full theory and proof can be found in [1]. But one should always keep in mind that covariate balance is an intermediate step, not the final goal; causal effect estimation is what we should focus on.
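The exact-balance theorem can be checked numerically. The reason it holds is that the logistic score equations force $\sum_i (Z_i-\hat{e}_i)=0$ and $\sum_i X_{ij}(Z_i-\hat{e}_i)=0$, which makes the numerators and the denominators on the two sides of the identity equal. The sketch below is an illustration under assumptions, not the analysis from this post: it simulates one covariate, fits a logistic regression with an intercept by Newton-Raphson using only the standard library, and the names `fit_logistic` and `wmean` are hypothetical.

```python
import math
import random


def fit_logistic(x, z, iters=25):
    """MLE of logit Pr(Z=1 | x) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += zi - p                  # score for the intercept
            g1 += (zi - p) * xi           # score for the slope
            v = p * (1.0 - p)
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        # solve the 2x2 Newton system in closed form
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def wmean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


rng = random.Random(2)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
# hypothetical assignment mechanism; it is deliberately not logistic in x
z = [int(rng.random() < 0.2 + 0.6 * (xi > 0)) for xi in x]

b0, b1 = fit_logistic(x, z)
e_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

x1 = [xi for xi, zi in zip(x, z) if zi == 1]
x0 = [xi for xi, zi in zip(x, z) if zi == 0]
raw_diff = sum(x1) / len(x1) - sum(x0) / len(x0)
ow_diff = (wmean(x1, [1 - ei for ei, zi in zip(e_hat, z) if zi == 1])
           - wmean(x0, [ei for ei, zi in zip(e_hat, z) if zi == 0]))
print(raw_diff, ow_diff)
```

The unweighted difference in means is clearly nonzero here, while the overlap-weighted difference vanishes up to floating-point error, even though the fitted model is misspecified.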
Problem
Now let $V$ be gender, and let the target estimand be $E(Y(1)\mid V=1)-E(Y(0)\mid V=1)$. Consider the following three possible, but not necessarily correct, ways to estimate it:
1. Estimate the propensity score $e$ with all pre-treatment variables including $V$; then within the subgroup of $V=1$, do IPW.
2. Same as (1), except that here we exclude $V$ when estimating the propensity score $e$.
3. Within the subgroup of $V=1$, estimate $e$ with all pre-treatment variables except for $V$; then do IPW with the estimated $e$.
Please identify the correct one(s) and explain why.
For method 1, the propensity score $e$ with all pre-treatment variables including $V$ is defined as $e=P(Z=1\mid X,V)$, where $X$ denotes all covariates other than $V$, so $(X,V)\perp Z\mid e$. Therefore $X\perp Z\mid e, V=1$ also holds, and method 1 is a valid method.
For method 2, the propensity score $e$ with all pre-treatment variables except $V$ is defined as $e=P(Z=1\mid X)$, so $X\perp Z\mid e$. This condition does not guarantee $X\perp Z\mid e, V=1$. Thus method 2 is not a valid method.
For method 3, within the subgroup of $V=1$, the propensity score $e$ with all pre-treatment variables except $V$ is $e=P(Z=1\mid X,V=1)$. This implies $X\perp Z\mid e, V=1$, so method 3 is a valid method.
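A small exact calculation illustrates why method 2 can fail. The numbers are entirely hypothetical: $X$ and $V$ are independent fair coins, and `p_treat[(x, v)]` is $P(Z=1\mid X=x,V=v)$. The sketch computes population-level IPW-weighted means of $X$ among treated and control units with $V=1$; a valid propensity score should make both equal to the subgroup mean of $X$, which is 0.5 here. At the population level, methods 1 and 3 use the same score, $P(Z=1\mid X,V=1)$.

```python
# Hypothetical treatment probabilities Pr(Z=1 | X=x, V=v); X and V are independent fair coins.
p_treat = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}


def e_full(x, v):
    """Propensity score that uses both X and V (methods 1 and 3)."""
    return p_treat[(x, v)]


def e_without_v(x, v):
    """Propensity score that ignores V (method 2): Pr(Z=1 | X=x)."""
    return 0.5 * p_treat[(x, 0)] + 0.5 * p_treat[(x, 1)]


def ipw_means_in_subgroup(e, v=1):
    """Population IPW-weighted means of X among treated and controls with V=v."""
    num1 = den1 = num0 = den0 = 0.0
    for x in (0, 1):
        px = 0.5                  # Pr(X=x | V=v), by independence
        pz = p_treat[(x, v)]      # true treatment probability in this cell
        w1, w0 = 1.0 / e(x, v), 1.0 / (1.0 - e(x, v))
        num1 += px * pz * w1 * x
        den1 += px * pz * w1
        num0 += px * (1 - pz) * w0 * x
        den0 += px * (1 - pz) * w0
    return num1 / den1, num0 / den0


print("with V:   ", ipw_means_in_subgroup(e_full))
print("without V:", ipw_means_in_subgroup(e_without_v))
```

With the score that conditions on $V$, both weighted means equal 0.5. With the score that ignores $V$, the treated and control means differ from each other and from 0.5, so the weighted groups no longer represent the $V=1$ subgroup.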
Reference
[1] Li, Fan, Kari Lock Morgan, and Alan M. Zaslavsky. “Balancing covariates via propensity score weighting.” Journal of the American Statistical Association 113.521 (2018): 390-400.