This assignment involves causal inference. Note that this is an individual assignment, so you must work alone. You can discuss basic details with classmates but your final work must be yours alone! Please type your solutions using R Markdown, LaTeX or any other word processor but YOU MUST knit or convert the final output file to “.pdf”. Submissions should be made on gradescope: go to Assignments \(\rightarrow\) Data Analysis Assignment 5.
DO NOT INCLUDE R CODE OR OUTPUT IN YOUR SOLUTIONS/REPORTS All R code can be included in an appendix, and R outputs should be converted to nicely formatted tables. Feel free to use R packages such as kable
, xtable
, stargazer
, etc.
Also, you can round up ALL numbers/estimates to 2 decimal places (4 decimal places at the most to avoid exact zeros when possible).
Reminder: You are allowed and even encouraged to talk to each other about general concepts, or to the instructor/TAs. However, the write-ups, solutions, and code MUST be entirely your own work.
ASTHMA PATIENTS IN CALIFORNIA.
The data for this question can be found in the file “Asthma.txt” on Sakai.
The data set is from a study to compare the quality of services provided by two physician groups for asthma patients in California. Specifically, for patient i, let Yi(w) be the quality of service as judged by the patient (1=satisfactory, 0=not satisfactory), if the patient is served by physician group \(w\), for \(w = 1,2\). The patients who visit the two groups can differ, and so a set of covariates are measured. The variables in the data are:
Variable | Description |
---|---|
pg (treatment assignment) | physician group; values = 1 and 2 |
i_age | age (continuous) |
i_sex | sex (binary) |
i_race | race (categorical) |
i_educ | education (categorical) |
i_insu | insurance status (categorical) |
i_drug | drug coverage status (categorical) |
i_seve | severity (categorical) |
com_t | total number of comorbidity (numeric) |
pcs_sd | standard physical comorbidity scale (continuous) |
mcs_sd | standard mental comorbidity scale (continuous) |
i_aqoc (outcome) | satisfaction status of patient (binary) |
Let pr(Y(w) = 1) = pw be the fraction of patients who would be satisfied with the service provided if all patients were to be served by physician group w (again, w=1 or 2). The target estimand is the average causal effect Q = p2 - p1.
Are the covariates in this data balanced between the two groups? If no, which covariates are not? How did you assess balance?
Estimate the propensity score e using a logistic regression with all pre-treatment variables entering in the model as main effects.
Which of the methods do you consider most reliable (or feel most comfortable with) for estimating the causal effect? Why?
Notes:
pg
to a binary variable with values 0 and 1.com_t
, pcs_sd
, and mcs_sd
and use the centered versions for all analyses.pg
and i.aqoc
, and the meaning of both are very clear here.i_sex
, i_educ
and i_seve
by doing Data$i_sex <- relevel(factor(Data$i_sex), ref = 1)
, Data$i_educ <- relevel(factor(Data$i_educ), ref = 5)
, and Data$i_seve <- relevel(factor(Data$i_seve), ref = 3)
.pg = 1
) and not really ATE, because we will discard observations in the control group for which we cannot find matches for.You may find the following code useful.
#Fit the model
cov_names <- names(Data)
p_formula <- as.formula(paste("pg ~",
paste(cov_names[!cov_names %in% c("i_aqoc","pg")],
collapse = " + ")))
pscorereg <- glm(p_formula,......)
pscores <- ...... #predict propensity scores here
#Check number of observations outside of the overlap region between the two groups
#First the left tails
sum(pscores < max(min(pscores[Data$pg==1]),
min(pscores[Data$pg==2])))
#Next the right tails
sum(pscores > min(max(pscores[Data$pg==1]),
max(pscores[Data$pg==2])))
#If there are "outliers",
#get row index for observations that violate overlap.
index <- which((pscores < max(min(pscores[Data$pg==1]),
min(pscores[Data$pg==2])) |
pscores > min(max(pscores[Data$pg==1]),
max(pscores[Data$pg==2]))) == TRUE)
Data <- Data[-index,]; pscores <- pscores[-index]
20 points.