Instructions
One member of each group has to create a repo on GitHub called High_Dimensional_Linear_Models to upload this assignment. Then that member should add other group members as collaborators to the repo so that they can all work together in branches.
There should be a directory for each assignment with the following structure:
High_Dimensional_Linear_Models/
├── Python/
│ ├── input/
│ ├── output/
│ └── scripts/
├── R/
│ ├── input/
│ ├── output/
│ └── scripts/
└── Julia/
    ├── input/
    ├── output/
    └── scripts/
To clarify:
- In input goes every raw file or dataset your script will use.
- In output goes every file (figures, tables, etc.) your script generates, including a dataset if you are asked to do so.
- In scripts goes your code. You can have multiple scripts organized in some manner by sections, or just have one.
Your answers to the math section should be in .tex format in at least one of the three scripts folders for the assignment. To be safe, we advise you to place them in all three. You have until the end of Friday 05/09 to send us a link to the repo by email, indicating your group.
Assignment 1
Each point is divided by $3$, so that each language is graded for one third of it.
1 Math (3 points)
Consider the linear regression model:
$$
y = X_1 \beta_1 + X_2 \beta_2 + u
$$
where:
- $y$ is an $n \times 1$ vector of outcomes,
- $X_1$ is an $n \times k_1$ matrix of regressors of interest,
- $X_2$ is an $n \times k_2$ matrix of control variables,
- $u$ is an $n \times 1$ vector of errors.
Prove the Frisch-Waugh-Lovell (FWL) theorem, which states that the OLS estimate of $\beta_1$ in the regression of $y$ on $[X_1 \ X_2]$ is equal to the OLS estimate obtained from the following partialling-out procedure:
- Regress $y$ on $X_2$ and obtain the residuals $\tilde{y} = M_{X_2} y$, where $M_{X_2} = I - X_2 (X_2'X_2)^{-1} X_2'$.
- Regress $X_1$ on $X_2$ and obtain the residuals $\tilde{X}_1 = M_{X_2} X_1$.
- Regress $\tilde{y}$ on $\tilde{X}_1$ and show that the resulting coefficient vector is equal to $\hat{\beta}_1$ from the full regression.
Formally, show that:
$$
\hat{\beta}_1 = (\tilde{X}_1' \tilde{X}_1)^{-1} \tilde{X}_1' \tilde{y}.
$$
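Before writing the proof, it can help to check the statement numerically. The following is a minimal sketch on simulated data, not part of the required answer: the sample size, dimensions, coefficient values and seed are all arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n, k1, k2 = 200, 2, 3            # illustrative sizes, not taken from the assignment

# Simulated regressors and outcome (coefficients are made up).
X1 = rng.normal(size=(n, k1))
X2 = rng.normal(size=(n, k2))
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([0.5, 0.0, 3.0]) + rng.normal(size=n)

# Full regression of y on [X1 X2]; the first k1 coefficients estimate beta_1.
X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta1_full = beta_full[:k1]

# Annihilator matrix M_{X2} = I - X2 (X2'X2)^{-1} X2'.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)

# Residualize y and X1 on X2, then regress residuals on residuals.
y_tilde = M2 @ y
X1_tilde = M2 @ X1
beta1_fwl = np.linalg.lstsq(X1_tilde, y_tilde, rcond=None)[0]

# Both routes should agree up to floating-point error.
print(np.allclose(beta1_full, beta1_fwl))
```

A numerical match like this is only a sanity check; the exercise still asks for an algebraic proof.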
2 Overfitting (8 points)
Simulate a data generating process like in Lab2 with only 2 variables X and Y, for $n=1000$. For practical purposes, make the intercept parameter of your data generating process be equal to zero. You are asked to do the following in a loop:
- Estimate a linear model with just one feature, $x_1 = x$.
- Estimate a linear model with two features, $x_1 = x$ and $x_2 = x^2$.
- Repeat the same process for each of the following numbers of features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000.
In each iteration you should keep the R-squared, adjusted R-squared, and the out-of-sample R-squared. Only for the out-of-sample R-squared you'll need to split the data into 75% training and 25% testing.
(1 point - Variable generation and adequate loop)
(1 point - Estimation on full sample)
(2 points - Estimation on train/test split)
(1 point - R-squared computation and storage)
Finally, in three separate graphs you should plot the respective R-squared on the y-axis and the number of features on the x-axis. Comment on your results and provide intuition.
(3 points - One for each graph; keep in mind titles and axis ranges and labels)
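One possible shape for the loop is sketched below. It is an illustration under several assumptions, not the Lab2 code: the data generating process (uniform $x$, slope 2, standard normal noise) is made up, the $k$-th feature is taken to be $x^k$ by extending the pattern of the first two steps, the feature list is shortened to keep the example quick, and the helper names `r2` and `poly_features` are hypothetical. The adjusted R-squared uses the no-intercept convention $1 - (1 - R^2)(n-1)/(n-k)$, which is undefined when $k = n$.

```python
import numpy as np

def r2(y, y_hat):
    """R-squared computed as 1 - SSR/SST, with SST taken around the mean of y."""
    ssr = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ssr / sst

def poly_features(x, k):
    """Matrix with columns x, x^2, ..., x^k (no intercept column)."""
    return np.column_stack([x ** p for p in range(1, k + 1)])

rng = np.random.default_rng(0)         # arbitrary seed
n = 1000
x = rng.uniform(0, 1, size=n)          # assumed distribution of x
y = 2.0 * x + rng.normal(size=n)       # assumed DGP with zero intercept

# First 75% of rows for training, last 25% for testing (rows are i.i.d. here).
train = np.arange(n) < int(0.75 * n)

results = []
for k in [1, 2, 5, 10]:                # shortened list, for illustration only
    X = poly_features(x, k)

    # Full-sample fit: in-sample R-squared and adjusted R-squared.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r2_full = r2(y, X @ beta)
    adj_r2 = 1.0 - (1.0 - r2_full) * (n - 1) / (n - k)

    # Fit on the training rows only, evaluate on the held-out rows.
    beta_tr = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    r2_oos = r2(y[~train], X[~train] @ beta_tr)

    results.append((k, r2_full, adj_r2, r2_oos))

for row in results:
    print(row)
```

With very high polynomial degrees the feature matrix becomes badly conditioned, so a least-squares routine that tolerates rank deficiency (as `np.linalg.lstsq` does) is a safer choice than inverting $X'X$ directly.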
3 Real data (9 points)
In this assignment, you will estimate a hedonic pricing model using data on apartment prices in Poland. A hedonic pricing model estimates the value of a good based on its features. For apartments, the price depends on attributes such as area, number of rooms, distance to points of interest, etc.
Data is available at CausalAI-Course/Data/apartments.csv. Below, you will find a detailed description of each variable in the dataset. Make sure to carefully review these variable definitions.
3a Cleaning (2 points)
In this section you'll need to do the following:
- Create a variable area2 that's the square of area (0.25 points)
- Convert 'hasparkingspace', 'hasbalcony', 'haselevator', 'hassecurity', 'hasstorageroom' to dummy variables (where 'yes' 'no' get mapped to 1, 0) (0.75 points)
- For each possible last digit of area (i.e. 0, 1, ..., 9), create a dummy variable equal to 1 if the last digit of area is that number and 0 otherwise. Name your variables accordingly (e.g. end_0, end_1, ..., end_9). (1 point)
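The cleaning steps above can be sketched with pandas as follows. The small data frame is a made-up stand-in for apartments.csv, only two of the five yes/no columns are shown, and the sketch assumes that area holds whole numbers and that the column names are lowercase as written in the task list; adjust both to match the real file.

```python
import pandas as pd

# Toy stand-in for apartments.csv; the values are invented for illustration.
df = pd.DataFrame({
    "area": [50, 47, 120, 33],
    "hasparkingspace": ["yes", "no", "no", "yes"],
    "hasbalcony": ["no", "no", "yes", "yes"],
})

# Square of area.
df["area2"] = df["area"] ** 2

# Map 'yes'/'no' to 1/0 for each amenity column.
for col in ["hasparkingspace", "hasbalcony"]:
    df[col] = df[col].map({"yes": 1, "no": 0})

# One dummy per possible last digit of area (assumes integer-valued area).
last_digit = df["area"].astype(int) % 10
for d in range(10):
    df[f"end_{d}"] = (last_digit == d).astype(int)

print(df[["area", "area2", "end_0", "end_3", "end_7"]])
```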
3b Linear model estimation (4 points)
Regress 'price' against the following covariates:
- Area's last digit dummies (omit 9 to have a base category)
- Area, area squared
- Distance from apartment to point of interest (such as school, clinic, postoffice, etc.)
- 'hasparkingspace', 'hasbalcony', 'haselevator', 'hassecurity', 'hasstorageroom'
- Month, type, rooms, ownership, buildingmaterial (treat these as categorical variables)
Print a summary table and comment on your results for the dummy indicating that the area's last digit is 0 (end_0).
(2 points)
Perform the same regression but this time by partialling-out. Your target parameter will be the one associated with end_0. Print a summary table and verify the coefficients are the same with both methods.
(2 points)
3c Price premium for area that ends in 0-digit (3 points)
In this section we'll attempt to see if apartments whose area ends in 0 are valued higher than what their features would suggest. Perform the following tasks.
- Train the model
Estimate the same linear regression model, but only using apartments whose area does not end in 0.
(1.25 points)
- Predict prices
Using the estimated coefficients from step 1, predict apartment prices for the entire sample, including those apartments whose area ends in 0.
(1.25 points)
- Compare averages
For apartments whose area ends in 0, compute both the average actual price and the average predicted price.
Based on this comparison, try to determine whether apartments with areas ending in 0 are sold at a higher price than what the model predicts. (You don't need to run a statistical test; an informed guess based on your results is enough.)
(0.5 points)
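The three tasks above can be sketched on synthetic data as follows. Everything about the data is an assumption made for illustration: prices are simulated, a premium of 5000 is built in on purpose, and the model uses only an intercept, area and area squared rather than the full covariate list from 3b.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
n = 2000
area = rng.integers(20, 150, size=n)           # made-up integer areas
ends_in_0 = (area % 10 == 0)

premium = 5000.0                               # made-up premium built into the toy prices
price = 1000.0 * area + premium * ends_in_0 + rng.normal(scale=2000.0, size=n)

# Reduced design matrix: intercept, area, area squared.
X = np.column_stack([np.ones(n), area, area ** 2])

# Train: fit only on apartments whose area does not end in 0.
beta = np.linalg.lstsq(X[~ends_in_0], price[~ends_in_0], rcond=None)[0]

# Predict: apply the fitted coefficients to the entire sample.
pred = X @ beta

# Compare averages among apartments whose area ends in 0.
avg_actual = price[ends_in_0].mean()
avg_pred = pred[ends_in_0].mean()
gap = avg_actual - avg_pred

print(avg_actual, avg_pred, gap)
```

In this toy setup a positive gap simply recovers the premium that was put into the simulated prices. With the real data, the sign and size of the gap are what you are asked to interpret.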
Dataset Description
- price: Apartment price in PLN (Polish złoty).
- month: Month of year.
- id: Unique identifier for each listing.
- type: Type of apartment (e.g., flat, studio, etc.).
- area: Total usable area of the apartment (in m²).
- rooms: Number of rooms.
- schoolDistance: Distance to the nearest school (in km).
- clinicDistance: Distance to the nearest clinic or hospital (in km).
- postOfficeDistance: Distance to the nearest post office (in km).
- kindergartenDistance: Distance to the nearest kindergarten (in km).
- restaurantDistance: Distance to the nearest restaurant (in km).
- collegeDistance: Distance to the nearest college/university (in km).
- pharmacyDistance: Distance to the nearest pharmacy (in km).
- ownership: Type of ownership (e.g., freehold, cooperative).
- buildingMaterial: Main material used for construction (e.g., brick, concrete).
- hasParkingSpace: Boolean (1/0) indicating if a parking space is available.
- hasBalcony: Boolean (1/0) indicating if the apartment has a balcony.
- hasElevator: Boolean (1/0) indicating if the building has an elevator.
- hasSecurity: Boolean (1/0) indicating if the building has security features.
- hasStorageRoom: Boolean (1/0) indicating if the apartment has a storage room.
Deadline = September 7 23:59.