Assignment 5
Deadline: Saturday, November 30th.
Part 1: Heterogeneous treatment effects using causal trees and forests
For this part, we will be using experimental data for computing heterogeneous effects through causal trees and forests. For all exercises, the predictors $X$ are all variables that are not the outcome $Y$ or the treatment $D$.
1.1. Load the data (1 point). This is data from an experiment on the National Supported Work Demonstration (NSW) job-training program. You can find the data here, and read a description of the data here. For further details of the experiment and the program, you can use this link.
1.2. Find the ATE (1.5 points). With re78 as the outcome variable of interest, find the Average Treatment Effect of participation in the program. Specifically, you should find it by calculating the difference between the means of the treatment group and the control group (the Simple Difference of Means or SDM). What can you say about the program?
1.3. Heterogeneous effects with causal trees (3 points). Use causal trees like we saw in class. For Python, you should use the econml package; for R, use the grf package; and for Julia, you will need to create the auxiliary variable $Y^*$ and fit a decision tree regressor. Report the splits the tree finds and interpret them.
1.4. Heterogeneous effects with causal forests (3 points). Use causal forests like we saw in class. For Python, you should use the econml package; for R, use the grf package; and for Julia, you will need to use the auxiliary variable $Y^*$ computed in the previous exercise and fit a random forest regressor. Report the importance of the predictor variables.
1.5. Plot heterogeneous effects (1.5 points). Plot how the predicted treatment effect changes depending on a variable of your choice. (See the last example in PD11 for clarification of what you should do in this exercise.)
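As a rough illustration of exercises 1.2 and 1.3, the sketch below computes the SDM and builds an auxiliary variable $Y^*$ on synthetic data. Everything here is an assumption made for illustration: the data are simulated (not the NSW data), the names `treat`, `simple_difference_of_means`, and `transformed_outcome` are hypothetical, and $Y^*$ is taken to be the common transformed outcome $Y^* = Y (D - p) / (p(1 - p))$ with $p$ the share of treated units, which may differ from the definition used in class. Only the Python standard library is used so the example stays self-contained.

```python
import random
from statistics import mean


def simple_difference_of_means(y, d):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)


def transformed_outcome(y, d):
    """Assumed definition: Y* = Y * (D - p) / (p * (1 - p)), p = share treated."""
    p = mean(d)
    return [yi * (di - p) / (p * (1 - p)) for yi, di in zip(y, d)]


# Synthetic stand-in for the experiment (NOT the real NSW data):
# random assignment and a constant true effect of 1500 on earnings.
rng = random.Random(0)
treat = [rng.randint(0, 1) for _ in range(2000)]
re78 = [5000 + 1500 * di + rng.gauss(0, 3000) for di in treat]

sdm = simple_difference_of_means(re78, treat)
y_star = transformed_outcome(re78, treat)

print(f"SDM: {sdm:.1f}")
print(f"Mean of Y*: {mean(y_star):.1f}")
```

When $p$ is the sample share of treated units, the mean of $Y^*$ coincides with the SDM, which is a quick sanity check before passing $Y^*$ and $X$ to a tree or forest regressor.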
Part 2: Double/Debiased machine learning in observational data
In this part, we will be using observational data for computing the average treatment effect of the same program as in Part 1. This data is constructed by taking the treatment group from the same dataset as in Part 1, but constructing the control group from a different dataset; that is, the entire control group consists of observations from the Current Population Survey. Therefore, the treatment and control groups may not be comparable. To tackle this issue, we can use Double/Debiased machine learning.
2.1. Load the data (1 point). You can find the data here, and read a description of the data here. For further details on how this data was created, you can use this link.
2.2. Group comparisons (1.5 points). For the treatment and control group separately, report summary statistics of three variables of your choice. Can you spot any big differences between the treatment and control groups?
2.3. Compute the SDM (1.5 points). Find the simple difference of means, which we can use as a naive estimate of the ATE. How does the result in this case compare to the result in exercise 1.2?
2.4. Using DML (6 points). Use the DML procedure as we saw in the Lab to find a better estimate of the ATE. You may use the doubleML packages for Python and R, but this package does not exist for Julia, so you will have to build your own procedure like we saw in class. You will be awarded extra points for using more than one prediction method. At the end, report the treatment effect you found, as well as the MSE for $D$ and $Y$ achieved by the method(s) you used.
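For those building the procedure by hand in exercise 2.4, the following is a minimal sketch of one common DML variant, the partialling-out estimator with cross-fitting. It rests on several assumptions: the data are simulated with a single confounder (not the assignment data), the names `fit_line` and `dml_partial_out` are hypothetical, and a one-variable least-squares line stands in for whatever ML predictor you choose. The procedure shown in the Lab may differ in its details.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Least-squares line y = a + b*x; a stand-in for any ML learner."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return lambda v: a + b * v


def dml_partial_out(x, d, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of d on y."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    res_d = [0.0] * n
    res_y = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit both nuisance predictors on the training folds only.
        predict_d = fit_line([x[i] for i in train], [d[i] for i in train])
        predict_y = fit_line([x[i] for i in train], [y[i] for i in train])
        # Residualize D and Y on the held-out fold.
        for i in held_out:
            res_d[i] = d[i] - predict_d(x[i])
            res_y[i] = y[i] - predict_y(x[i])
    # Regress the Y residuals on the D residuals (no intercept).
    theta = sum(rd * ry for rd, ry in zip(res_d, res_y)) / sum(
        rd * rd for rd in res_d
    )
    mse_d = mean(r * r for r in res_d)  # out-of-fold MSE for D
    mse_y = mean(r * r for r in res_y)  # out-of-fold MSE for Y
    return theta, mse_d, mse_y


# Synthetic observational data (NOT the assignment data): x drives both
# treatment take-up and the outcome, and the true effect of d is 2.0.
rng = random.Random(1)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
d = [1 if 0.8 * xi + rng.gauss(0, 1) > 0 else 0 for xi in x]
y = [2.0 * di + 3.0 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

theta_hat, mse_d, mse_y = dml_partial_out(x, d, y)
naive = mean(yi for yi, di in zip(y, d) if di == 1) - mean(
    yi for yi, di in zip(y, d) if di == 0
)

print(f"Naive SDM: {naive:.2f}")
print(f"DML estimate: {theta_hat:.2f}")
print(f"MSE for D: {mse_d:.3f}, MSE for Y: {mse_y:.3f}")
```

In this simulation the naive SDM is pushed well above the true effect by the confounder, while the residual-on-residual estimate stays close to it. Swapping `fit_line` for other learners is one way to compare several prediction methods and their MSEs.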