Assignment 2

# Instructions

One member of each group must create a GitHub repo named **Lasso_Potential_Outcomes_RCTs** for this assignment, then add the other group members as collaborators so that everyone can work in branches.

There should be a directory for each assignment with the following structure:

  ```
High_Dimensional_Linear_Models/
├── Python/
│   ├── input/
│   ├── output/
│   └── scripts/
├── R/
│   ├── input/
│   ├── output/
│   └── scripts/
└── Julia/
    ├── input/
    ├── output/
    └── scripts/
  ```
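
As a convenience, the tree above could be scaffolded with a few lines of Python. This is only a sketch: the folder names come from the tree, while the `.gitkeep` placeholder files and the `build_tree` helper are assumptions added here so that Git tracks the otherwise empty folders.

```python
from pathlib import Path

# Folder names are taken from the tree above; where the root lives is an assumption.
ROOT = Path("High_Dimensional_Linear_Models")
LANGUAGES = ["Python", "R", "Julia"]
SUBFOLDERS = ["input", "output", "scripts"]


def build_tree(root: Path = ROOT) -> list[Path]:
    """Create <root>/<language>/<subfolder> for every combination and return the paths."""
    created = []
    for language in LANGUAGES:
        for sub in SUBFOLDERS:
            folder = root / language / sub
            folder.mkdir(parents=True, exist_ok=True)
            # Git does not track empty folders, so add a placeholder file (hypothetical choice).
            (folder / ".gitkeep").touch()
            created.append(folder)
    return created
```

Calling `build_tree()` from the repo root creates the nine folders; running it again is harmless because existing folders are left untouched.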

To clarify:
- `input/` holds every raw file or dataset your scripts use.
- `output/` holds every file your scripts generate (figures, tables, etc.), including a dataset if you are asked to produce one.
- `scripts/` holds your code. You can use a single script or split it into several, organized by section.

Your answers to the Comments section should be in .md format and placed in at least one of the three scripts folders for the assignment. To be safe, we advise you to place them in all three. You have until 23/09 at 23:59 to email us a link to the repo, indicating your group.

# Assignment 2
The points for each item are split equally across the three languages (Python, R, and Julia), so each language is graded on one third of the item's points.

## 1 Comments (3 points)

1. **1 point** Suppose you estimate LASSO with a very large $\lambda$ and with a very small $\lambda$. Describe qualitatively how the coefficients and training/test error behave in both cases. 

2. **2 points** Explain what cross-validation is and why it is useful in machine learning. Illustrate with a sketch of how data is split. 
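
To make the idea of a split concrete, here is a minimal sketch of how k-fold cross-validation partitions observation indices. The helper `k_fold_indices`, the choice of 5 folds, and the 10 toy observations are assumptions for illustration, not part of the assignment.

```python
import random


def k_fold_indices(n_obs: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    for held_out in range(k):
        validation_idx = sorted(folds[held_out])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != held_out for i in fold
        )
        yield train_idx, validation_idx


for fold_number, (train_idx, validation_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold_number}: validate on {validation_idx}, train on {train_idx}")
```

Each observation lands in the validation set exactly once and in the training set for the remaining folds.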

## 2 Lasso (8 points)

In this assignment, you will estimate the female literacy rate in India using a LASSO model. Replicate the results from the script developed in Lab 4 in Python, R, and Julia. The data is available at CausalAI-Course/Data/Districtwise_literacy_rates.xlsx. For a description of each variable, see **Sheet 2 (Metadata)** of the same file. You are free to **select the variables you consider relevant** for your models.

- **0.25 points** → Keep only the observations with no missing values  
- **1 point** → Create histograms of the female and male literacy rates and comment briefly on their distributions.
- **2 points** → Estimate a low-dimensional specification and compute the $R^2$ on the test set  
- **2 points** → Estimate a high-dimensional (flexible) specification: **interaction terms and squared terms** and compute the $R^2$ on the test set  
- **2.75 points** → For $\lambda$ ranging from 10,000 down to 0.001, plot the path of the number of nonzero coefficients and briefly comment on the result.  
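
The last item can be sketched as follows. This is not the Lab 4 script: it uses synthetic data in place of the literacy dataset and a small hand-written coordinate-descent solver (`lasso_coordinate_descent`, a hypothetical helper) instead of a library routine, purely to show how the count of nonzero coefficients is collected along a $\lambda$ grid. The penalty here is scaled as $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$; other libraries may scale $\lambda$ differently, so the same grid need not produce the same counts.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values inside the band become exactly zero."""
    return np.sign(z) * max(abs(z) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()                       # residual when beta = 0
    col_norms = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]     # remove feature j's current contribution
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam) / col_norms[j]
            residual -= X[:, j] * beta[j]     # put the updated contribution back
    return beta


# Synthetic stand-in for the real data (assumption): 20 features, 3 of which matter.
rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=n)

X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize so the penalty treats features alike
y = y - y.mean()                              # center so no intercept is needed

lambdas = np.logspace(4, -3, 15)              # 10,000 down to 0.001 on a log scale
n_nonzero = [
    int(np.count_nonzero(lasso_coordinate_descent(X, y, lam))) for lam in lambdas
]

for lam, count in zip(lambdas, n_nonzero):
    print(f"lambda = {lam:10.3f} -> {count:2d} nonzero coefficients")
```

The pairs in `lambdas` and `n_nonzero` can then be passed to any plotting library, ideally with a logarithmic axis for $\lambda$.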

# 3. Potential Outcomes and RCTs (9 points)

## 3.1 Data Simulation (3 points)

- **2 points**: Simulate a dataset with $n = 1000$ individuals. Generate:  
  - Covariates $X_1, X_2, X_3, X_4$ (continuous or binary)
  - Treatment assignment $D \sim \text{Bernoulli}(0.5)$ 
  - Outcome variable:

$$
Y = 2D + 0.5 X_1 - 0.3 X_2 + 0.2 X_3 + \epsilon, \quad \epsilon \sim N(0,1)
$$

  - Save everything in a `data.frame`.  

- **1 point**: Perform a balance check: compare the means of $X_1, X_2, X_3, X_4$ across treatment and control groups (e.g., using `t.test` or regression).
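
A possible Python version of the simulation and balance check is sketched below. The covariate distributions, the seed, and the `balance_table` helper are assumptions (the task only requires continuous or binary covariates), and plain NumPy arrays are used to keep dependencies minimal; in practice the columns would be gathered into a data frame.

```python
import numpy as np

rng = np.random.default_rng(42)    # arbitrary seed for reproducibility
n = 1000

# Illustrative covariate choices (assumption): three continuous, one binary.
X = np.column_stack([
    rng.normal(size=n),            # X1: continuous
    rng.normal(size=n),            # X2: continuous
    rng.binomial(1, 0.5, size=n),  # X3: binary
    rng.normal(size=n),            # X4: continuous, does not enter the outcome
])
D = rng.binomial(1, 0.5, size=n)   # randomized treatment, Bernoulli(0.5)
eps = rng.normal(size=n)
Y = 2 * D + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + eps


def balance_table(X, D):
    """Return (mean_treated, mean_control, Welch t-statistic) for each covariate."""
    rows = []
    for j in range(X.shape[1]):
        treated, control = X[D == 1, j], X[D == 0, j]
        se = np.sqrt(
            treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size
        )
        rows.append((treated.mean(), control.mean(), (treated.mean() - control.mean()) / se))
    return rows


for name, (m1, m0, t) in zip(["X1", "X2", "X3", "X4"], balance_table(X, D)):
    print(f"{name}: treated mean = {m1:.3f}, control mean = {m0:.3f}, t = {t:.2f}")
```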


## 3.2 Estimating the Average Treatment Effect (3 points)

- **1 point**: Estimate the treatment effect (ATE) using a simple regression:

$$
Y \sim D
$$

- **1 point**: Estimate the ATE controlling for all covariates:

$$
Y \sim D + X_1 + X_2 + X_3 + X_4
$$

- **1 point**: Compare the two estimates. Answer the following:  
  - Does the ATE change?  
  - What happens to the standard errors?
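
The two regressions can be sketched with ordinary least squares in NumPy. The data-generating lines repeat the outcome equation from 3.1 with standard-normal covariates (an assumption), and `ols` is a hypothetical helper that returns classical, homoskedastic standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
n = 1000
X = rng.normal(size=(n, 4))        # assumption: four standard-normal covariates
D = rng.binomial(1, 0.5, size=n)
Y = 2 * D + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)


def ols(y, regressors):
    """OLS with an intercept; returns (coefficients, classical standard errors)."""
    Z = np.column_stack([np.ones(len(y)), regressors])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])     # unbiased error-variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return beta, se


coef_simple, se_simple = ols(Y, D)                        # Y ~ D
coef_full, se_full = ols(Y, np.column_stack([D, X]))      # Y ~ D + X1 + X2 + X3 + X4

# Index 1 is the coefficient on D in both fits (index 0 is the intercept).
print(f"Y ~ D:          ATE = {coef_simple[1]:.3f} (SE {se_simple[1]:.3f})")
print(f"Y ~ D + X1..X4: ATE = {coef_full[1]:.3f} (SE {se_full[1]:.3f})")
```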


## 3.3 LASSO and Variable Selection (3 points)

- **1 point** Use `cv.glmnet` to fit a LASSO model of $Y$ on the covariates $X_1, \dots, X_4$, **excluding the treatment**.  
  - Report which covariates are selected at $\lambda_{\min}$.

- **1 point** Re-estimate the ATE with only the covariates selected by LASSO:

$$
Y \sim D + X_{\text{selected}}
$$

- **1 point** Compare this estimate with those from Section 3.2. Discuss whether the accuracy changes and what advantages using LASSO might have in this context.
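
For the Python version, the select-then-re-estimate workflow might look like the sketch below. It is a stand-in for `cv.glmnet`, not a reimplementation of it: `lasso_ista` is a compact proximal-gradient solver, `cv_lambda_min` picks the $\lambda$ with the lowest 5-fold validation error, and the $\lambda$ grid, seed, and standard-normal covariates are all assumptions. Standardizing on the full sample before cross-validating is a simplification.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by proximal gradient descent (ISTA)."""
    n, p = X.shape
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta + step * (X.T @ (y - X @ beta)) / n      # gradient step on the squared loss
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return beta


def cv_lambda_min(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest mean validation MSE across k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    cv_mse = []
    for lam in lambdas:
        fold_mse = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            beta = lasso_ista(X[train], y[train], lam)
            fold_mse.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
        cv_mse.append(np.mean(fold_mse))
    return lambdas[int(np.argmin(cv_mse))]


rng = np.random.default_rng(0)     # arbitrary seed
n = 1000
X = rng.normal(size=(n, 4))        # assumption: four standard-normal covariates
D = rng.binomial(1, 0.5, size=n)
Y = 2 * D + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized covariates, treatment excluded
Yc = Y - Y.mean()                           # centered outcome, so no intercept is needed

lambdas = np.logspace(0, -3, 20)            # hypothetical grid
lam_min = cv_lambda_min(Xs, Yc, lambdas)
selected = np.flatnonzero(lasso_ista(Xs, Yc, lam_min))   # indices of nonzero coefficients

# Re-estimate Y ~ D + X_selected by OLS; index 1 is the coefficient on D.
Z = np.column_stack([np.ones(n), D, X[:, selected]])
ate = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

print("lambda_min:", lam_min)
print("selected covariates:", [f"X{j + 1}" for j in selected])
print(f"ATE with selected covariates: {ate:.3f}")
```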


Deadline: September 23, 23:59.
