Assignment 2 #1160

Description

@valeriahoyosm

Instructions

One member of each group has to create a repo on GitHub called Lasso_Potential_Outcomes_RCTs to upload this assignment. Then that member should add other group members as collaborators to the repo so that they can all work together in branches.

There should be a directory for each assignment with the following structure:

High_Dimensional_Linear_Models/
├── Python/
│   ├── input/
│   ├── output/
│   └── scripts/
├── R/
│   ├── input/
│   ├── output/
│   └── scripts/
└── Julia/
  ├── input/
  ├── output/
  └── scripts/

To clarify:

  • In input goes every raw file or dataset your script will use.
  • In output goes every file (figures, tables, etc.) your script generates, including a dataset if you are asked to do so.
  • In scripts goes your code. You can have multiple scripts organized in some manner by sections, or just have one.
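The layout described above could be created with a short script rather than by hand. This is only a sketch: the function name `make_skeleton` and the `.gitkeep` placeholder files are assumptions, not something the assignment requires (the placeholders are a common convention because Git does not track empty directories).

```python
from pathlib import Path

# Folder names taken from the structure shown above.
LANGUAGES = ("Python", "R", "Julia")
SUBFOLDERS = ("input", "output", "scripts")


def make_skeleton(root, assignment="High_Dimensional_Linear_Models"):
    """Create <root>/<assignment>/<language>/<subfolder>; return the created paths."""
    created = []
    for lang in LANGUAGES:
        for sub in SUBFOLDERS:
            path = Path(root) / assignment / lang / sub
            path.mkdir(parents=True, exist_ok=True)
            # Hypothetical placeholder so the empty folder can be committed.
            (path / ".gitkeep").touch()
            created.append(path)
    return created
```

Calling `make_skeleton(".")` from the root of the cloned repo would create the nine folders in place.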

Your answers to the Comments section should be in .md format and placed in at least one of the three scripts folders for the assignment. To be safe, we advise you to place them in all three. You have until 23/09 at 23:59 to send us a link to the repo by email, indicating your group.

Assignment 2

The points for each item are divided by $3$, so that each language (Python, R, and Julia) is graded separately.

1 Comments (3 points)

  1. 1 point Suppose you estimate LASSO with a very large $\lambda$ and with a very small $\lambda$. Describe qualitatively how the coefficients and training/test error behave in both cases.

  2. 2 points Explain what cross-validation is and why it is useful in machine learning. Illustrate with a sketch of how data is split.
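As a rough illustration of how data can be split for $k$-fold cross-validation, the sketch below shuffles the row indices, deals them into $k$ folds, and lets each fold serve once as the validation set. The function names and the choice of $k = 5$ are assumptions made for the example; it uses only the Python standard library.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [idx[i::k] for i in range(k)]


def train_validation_splits(n, k=5, seed=0):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    folds = k_fold_indices(n, k, seed)
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


for train, validation in train_validation_splits(n=10, k=5):
    # Text sketch of one split: V = validation row, T = training row.
    print("".join("V" if j in validation else "T" for j in range(10)))
```

Each printed line is one split of ten rows. Across the five lines every position is marked `V` exactly once.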

2 Lasso (8 points)

In this assignment, you will estimate the female literacy rate in India using a LASSO model. Replicate the results from the script developed in Lab 4 in Python, R, and Julia. The data is available at CausalAI-Course/Data/Districtwise_literacy_rates.xlsx. To consult the description of each variable, please review Sheet 2 (Metadata) of the same file. You are free to select the variables you consider relevant for your models.

  • 0.25 points → Keep only the observations with no missing values
  • 1 point → Create histograms of the female and male literacy rates and comment briefly on their distributions.
  • 2 points → Estimate a low-dimensional specification and compute the $R^2$ on the test set.
  • 2 points → Estimate a high-dimensional (flexible) specification with interaction terms and squared terms, and compute the $R^2$ on the test set.
  • 2.75 points → For $\lambda$ ranging from 10,000 down to 0.001, plot the path of the number of nonzero coefficients and briefly comment on the result.
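The last item can be sketched on synthetic data as follows. This is not the Lab 4 script and does not use the literacy data: the design matrix, the coordinate-descent solver, and the 29-point log-spaced grid are all assumptions chosen so that the example runs with the standard library alone. The objective is assumed to be $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$, which is the scaling scikit-learn's `Lasso` uses for its `alpha`; other packages may scale or standardize differently, so the $\lambda$ values are not directly comparable across libraries.

```python
import math
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def lasso_cd(gram, xty, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    gram is X'X / n and xty is X'y / n, both computed from centered data.
    """
    p = len(xty)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that leaves feature j out.
            rho = xty[j] - sum(gram[j][k] * beta[k] for k in range(p) if k != j)
            beta[j] = soft_threshold(rho, lam) / gram[j][j]
    return beta


# Synthetic stand-in for the design matrix (NOT the literacy data):
# 200 rows, 10 features, only the first three enter the outcome.
rng = random.Random(0)
n, p = 200, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * r[0] - 2 * r[1] + r[2] + rng.gauss(0, 1) for r in X]

# Center features and outcome so that no intercept is needed.
col_means = [sum(r[j] for r in X) / n for j in range(p)]
y_mean = sum(y) / n
Xc = [[r[j] - col_means[j] for j in range(p)] for r in X]
yc = [v - y_mean for v in y]

gram = [[sum(r[j] * r[k] for r in Xc) / n for k in range(p)] for j in range(p)]
xty = [sum(r[j] * v for r, v in zip(Xc, yc)) / n for j in range(p)]

# 29 log-spaced penalties from 10^4 down to 10^-3.
lambdas = [10 ** (4 - 0.25 * i) for i in range(29)]
nonzero_counts = [sum(b != 0.0 for b in lasso_cd(gram, xty, lam)) for lam in lambdas]

for lam, count in zip(lambdas, nonzero_counts):
    print(f"lambda = {lam:10.3f}   nonzero coefficients = {count}")
```

Plotting `nonzero_counts` against `lambdas` on a logarithmic x-axis gives the requested path.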

3. Potential Outcomes and RCTs (9 points)

3.1 Data Simulation (3 points)

  • 2 points: Simulate a dataset with $n = 1000$ individuals. Generate:
    • Covariates $X_1, X_2, X_3, X_4$ (continuous or binary)
    • Treatment assignment $D \sim \text{Bernoulli}(0.5)$
    • Outcome variable:

$$ Y = 2D + 0.5 X_1 - 0.3 X_2 + 0.2 X_3 + \epsilon, \quad \epsilon \sim N(0,1) $$

  • Save everything in a data.frame.

  • 1 point: Perform a balance check: compare the means of $X_1, X_2, X_3, X_4$ across treatment and control groups (e.g., using t.test or regression).
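A minimal sketch of the simulation and balance check in Python, using only the standard library. Several choices here are assumptions, since the assignment leaves them open: $X_1, X_2$ are drawn as standard normal, $X_3, X_4$ as Bernoulli(0.5), the seed is arbitrary, and a dict of columns stands in for a data frame. The balance check uses a hand-written Welch t-statistic rather than a library routine.

```python
import math
import random
import statistics


def simulate(n=1000, seed=42):
    """Simulate the design above; returns a dict of columns (a stand-in for a data frame)."""
    rng = random.Random(seed)
    data = {name: [] for name in ("X1", "X2", "X3", "X4", "D", "Y")}
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)   # continuous, assumed N(0, 1)
        x3 = int(rng.random() < 0.5)                # binary, assumed p = 0.5
        x4 = int(rng.random() < 0.5)                # binary, assumed p = 0.5
        d = int(rng.random() < 0.5)                 # D ~ Bernoulli(0.5)
        y = 2 * d + 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + rng.gauss(0, 1)
        for name, value in zip(data, (x1, x2, x3, x4, d, y)):
            data[name].append(value)
    return data


def welch_t(treated, control):
    """Welch t-statistic for a difference in means (unequal variances allowed)."""
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(treated) - statistics.mean(control)) / se


data = simulate()
for name in ("X1", "X2", "X3", "X4"):
    treated = [x for x, d in zip(data[name], data["D"]) if d == 1]
    control = [x for x, d in zip(data[name], data["D"]) if d == 0]
    print(f"{name}: t = {welch_t(treated, control):+.2f}")
```

With samples of this size, a t-statistic well inside roughly $\pm 2$ is the usual informal sign that a covariate is balanced across the two groups.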

3.2 Estimating the Average Treatment Effect (3 points)

  • 1 point: Estimate the treatment effect (ATE) using a simple regression:

$$ Y \sim D $$

  • 1 point: Estimate the ATE controlling for all covariates:

$$ Y \sim D + X_1 + X_2 + X_3 + X_4 $$

  • 1 point: Compare the two estimates. Answer the following:
    • Does the ATE change?
    • What happens to the standard errors?
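The two regressions can be compared with a small hand-rolled OLS routine, sketched below with the standard library only. The helper `ols`, the covariate distributions, and the seed are assumptions made for the example, and the standard errors are the classical homoskedastic ones; a statistics package would normally be used instead.

```python
import math
import random


def ols(X, y):
    """OLS via the normal equations; returns (coefficients, standard errors).

    X is a list of rows WITHOUT an intercept column; one is added here.
    """
    rows = [[1.0] + list(r) for r in X]
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination on [X'X | I] to obtain (X'X)^-1.
    aug = [xtx[i] + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [v - sum(b * x for b, x in zip(beta, r)) for r, v in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se


# Simulated data: X1, X2 ~ N(0, 1) and X3, X4 ~ Bernoulli(0.5) are assumptions.
rng = random.Random(42)
covs, D, Y = [], [], []
for _ in range(1000):
    x = [rng.gauss(0, 1), rng.gauss(0, 1),
         float(rng.random() < 0.5), float(rng.random() < 0.5)]
    d = float(rng.random() < 0.5)
    covs.append(x)
    D.append(d)
    Y.append(2 * d + 0.5 * x[0] - 0.3 * x[1] + 0.2 * x[2] + rng.gauss(0, 1))

# Index 0 is the intercept, index 1 the coefficient on D.
beta_simple, se_simple = ols([[d] for d in D], Y)
beta_full, se_full = ols([[d] + x for d, x in zip(D, covs)], Y)
print(f"Y ~ D           : ATE = {beta_simple[1]:.3f}  (SE {se_simple[1]:.3f})")
print(f"Y ~ D + X1..X4  : ATE = {beta_full[1]:.3f}  (SE {se_full[1]:.3f})")
```

Printing both lines side by side makes the comparison of point estimates and standard errors direct.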

3.3 LASSO and Variable Selection (3 points)

  • 1 point: Use cv.glmnet to fit a LASSO model of $Y$ on the covariates $X_1, \dots, X_4$, excluding the treatment.
    • Report which covariates are selected at $\lambda_{\min}$.

  • 1 point: Re-estimate the ATE with only the covariates selected by LASSO:

$$ Y \sim D + X_{\text{selected}} $$

  • 1 point: Compare this estimate with those from Part 3.2. Discuss whether the accuracy changes and what advantages using LASSO might have in this context.
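cv.glmnet is an R function; in Python one would typically reach for something like scikit-learn's `LassoCV`. To stay dependency-free, the sketch below instead hand-rolls the same idea: 5-fold cross-validation over a penalty grid, selection at the penalty with the lowest validation error, then an unpenalized refit of $Y$ on $D$ and the selected covariates. Everything named here (`moments`, `lasso_cd`, `cv_lambda_min`, the grid, the seeds, the covariate distributions) is an assumption for illustration. Unlike glmnet's default, the features are not standardized, so the selected $\lambda$ will not match glmnet's numerically.

```python
import math
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def moments(cols, y):
    """Center the data and return (X'X/n, X'y/n, column means, mean of y)."""
    n, p = len(y), len(cols)
    mx = [sum(c) / n for c in cols]
    my = sum(y) / n
    cc = [[v - m for v in c] for c, m in zip(cols, mx)]
    yc = [v - my for v in y]
    gram = [[sum(a * b for a, b in zip(cc[j], cc[k])) / n for k in range(p)]
            for j in range(p)]
    xty = [sum(a * b for a, b in zip(cc[j], yc)) / n for j in range(p)]
    return gram, xty, mx, my


def lasso_cd(gram, xty, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 on centered data."""
    p = len(xty)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = xty[j] - sum(gram[j][k] * beta[k] for k in range(p) if k != j)
            beta[j] = soft_threshold(rho, lam) / gram[j][j]
    return beta


def cv_lambda_min(cols, y, lambdas, k=5, seed=0):
    """Return the penalty with the lowest mean validation MSE over k folds."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::k]) for i in range(k)]
    cv_mse = [0.0] * len(lambdas)
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        gram, xty, mx, my = moments([[c[i] for i in train] for c in cols],
                                    [y[i] for i in train])
        for m, lam in enumerate(lambdas):
            beta = lasso_cd(gram, xty, lam)
            intercept = my - sum(b * mu for b, mu in zip(beta, mx))
            sse = sum((y[i] - intercept - sum(b * c[i] for b, c in zip(beta, cols))) ** 2
                      for i in fold)
            cv_mse[m] += sse / len(fold) / k
    return lambdas[min(range(len(lambdas)), key=cv_mse.__getitem__)]


# Simulated data: X1, X2 ~ N(0, 1) and X3, X4 ~ Bernoulli(0.5) are assumptions.
rng = random.Random(42)
X = [[], [], [], []]          # one list per covariate column
D, Y = [], []
for _ in range(1000):
    x = [rng.gauss(0, 1), rng.gauss(0, 1),
         float(rng.random() < 0.5), float(rng.random() < 0.5)]
    d = float(rng.random() < 0.5)
    for col, v in zip(X, x):
        col.append(v)
    D.append(d)
    Y.append(2 * d + 0.5 * x[0] - 0.3 * x[1] + 0.2 * x[2] + rng.gauss(0, 1))

lambdas = [10 ** (-0.15 * i) for i in range(21)]   # hypothetical grid: 1 down to 10^-3
lam_min = cv_lambda_min(X, Y, lambdas)
gram, xty, _, _ = moments(X, Y)                    # LASSO of Y on X1..X4, treatment excluded
selected = [j for j, b in enumerate(lasso_cd(gram, xty, lam_min)) if b != 0.0]
print("lambda_min:", lam_min, "selected:", [f"X{j + 1}" for j in selected])

# Post-LASSO refit of Y on D and the selected covariates. With lam = 0 the
# coordinate descent is Gauss-Seidel on the normal equations, so it converges
# to the OLS coefficients (no standard errors are computed in this sketch).
gram2, xty2, _, _ = moments([D] + [X[j] for j in selected], Y)
ate = lasso_cd(gram2, xty2, 0.0)[0]
print(f"ATE with selected covariates: {ate:.3f}")
```

Reusing the solver with a zero penalty keeps the example short; for standard errors, the refit would need a regular OLS routine.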

Deadline: September 23, 23:59.
