An implementation of LASSO regression techniques for high-dimensional data analysis and potential outcomes modeling for randomized controlled trials.
This project analyzes female literacy rates across Indian districts using advanced statistical learning methods. The analysis employs both traditional econometric approaches and modern machine learning techniques to understand the determinants of educational outcomes and demonstrate causal inference methods.
- Low-dimensional specification: Uses economic intuition to select key variables while avoiding highly correlated predictors (e.g., excluding male literacy and overall literacy)
- High-dimensional specification: Comprehensive feature engineering including interactions and polynomial terms, resulting in 400+ variables
- Regularization analysis: Complete LASSO path evaluation with cross-validation
- Performance metrics: Out-of-sample R² evaluation and feature selection assessment
- Randomized controlled trial simulation: Synthetic data generation for treatment effect analysis
- Average Treatment Effect (ATE) estimation: Multiple approaches including simple regression, LASSO-assisted covariate adjustment, and randomization inference
- Variable selection: LASSO-guided covariate adjustment for improved causal estimates
- Balance assessment: Statistical tests for randomization validity
Lasso_Potential_Outcomes_RCTs/
├── Python/ # Python implementation
│ ├── scripts/
│ │ ├── comments.pdf # Answers for the comments section
│ │ ├── lasso_analysis.ipynb # Female literacy LASSO analysis
│ │ └── potential_outcomes_rcts.ipynb # Causal inference methods
│ ├── requirements.txt # Python dependencies
│ ├── input/ # Data directory
│ └── output/ # Results and visualizations
├── R/ # R implementation
│ ├── scripts/
│ │ ├── comments.pdf # Answers for the comments section
│ │ ├── lasso_analysis.ipynb # Female literacy LASSO analysis
│ │ └── potential_outcomes_rcts.ipynb # Causal inference methods
│ ├── requirements.txt # R package dependencies
│ ├── input/ # Data directory
│ └── output/ # Results and visualizations
├── Julia/ # Julia implementation
│ ├── scripts/
│ │ ├── comments.pdf # Answers for the comments section
│ │ ├── lasso_analysis.ipynb # Female literacy LASSO analysis
│ │ └── potential_outcomes_rcts.ipynb # Causal inference methods
│ ├── requirements.txt # Julia package dependencies
│ ├── input/ # Data directory
│ └── output/ # Results and visualizations
├── .gitignore # Version control exclusions
├── LICENSE # MIT License
└── README.md # This documentation
- Dataset: District-wise literacy rates from Indian Census data (680 districts, 100 variables)
- Target variable: Female literacy rate (FEMALE_LIT)
- Exploratory analysis: Distribution analysis, missing value treatment, correlation assessment
- Visualization: Statistical graphics showing literacy distributions and relationships between variables
The low-dimensional specification follows economic theory and avoids multicollinearity by carefully selecting variables:
Selected Variables (5 features):
- GROWTHRATE: Population growth rate (demographic transition indicator)
- SEXRATIO: Sex ratio (gender equality proxy)
- ENR501: Schools with enrollment ≤ 50 (primary education access)
- TCH1: Primary school teachers (educational infrastructure)
- SCHTOT: Total number of schools (educational capacity)
Rationale: This specification avoids highly correlated variables such as male literacy and overall literacy rates, which would create multicollinearity issues. Instead, it focuses on underlying structural factors that influence educational outcomes through economic development, gender equality, and educational infrastructure channels.
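The multicollinearity concern can be illustrated with variance inflation factors (VIFs). The sketch below is not taken from the notebooks: it uses simulated stand-ins (`male_lit`, `overall_lit`, `sex_ratio` are hypothetical arrays, not census columns) and a small helper `vif` introduced here for illustration. It relies on the standard identity that the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 680  # same order as the number of districts; all values are simulated

# Hypothetical stand-ins: overall literacy is built to track male literacy
# closely, while the sex ratio is drawn independently of both.
male_lit = rng.normal(size=n)
overall_lit = 0.95 * male_lit + 0.1 * rng.normal(size=n)
sex_ratio = rng.normal(size=n)


def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


names = ["male_lit", "overall_lit", "sex_ratio"]
vifs = vif(np.column_stack([male_lit, overall_lit, sex_ratio]))
for name, v in zip(names, vifs):
    print(f"{name:12s} VIF = {v:.1f}")
```

In this simulated setup the two literacy measures get very large VIFs while the independent variable stays near 1, which is the pattern the low-dimensional specification is designed to avoid.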
The high-dimensional specification maximizes predictive power through extensive feature engineering:
Feature Construction (400+ variables):
- Base variables: All available numeric variables (excluding highly correlated literacy measures)
- Interaction terms: All pairwise interactions between base variables
- Polynomial features: Squared terms for all continuous variables
- Result: 400+ engineered features for LASSO selection
Purpose: This specification tests LASSO's ability to perform automatic feature selection in high-dimensional settings while maintaining predictive accuracy.
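A minimal sketch of this workflow with scikit-learn, under stated assumptions: the data are synthetic (10 simulated predictors rather than the census variables), the outcome equation is invented for illustration, and the pipeline layout is one reasonable choice rather than the notebooks' actual code. `PolynomialFeatures(degree=2)` produces both the squared terms and all pairwise interactions, and `LassoCV` picks the penalty by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the district data (hypothetical, not the census file).
rng = np.random.default_rng(0)
n, p = 680, 10
X = rng.normal(size=(n, p))
# Assumed outcome: two main effects, one interaction, plus unit-variance noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] * X[:, 3] + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# degree=2 without the bias column yields p linear terms, p squares,
# and p*(p-1)/2 pairwise interactions.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),  # LASSO penalties are scale-sensitive
    LassoCV(cv=5, random_state=0, max_iter=10000),
)
model.fit(X_train, y_train)

n_features = model.named_steps["polynomialfeatures"].n_output_features_
lasso = model.named_steps["lassocv"]
n_selected = int(np.sum(lasso.coef_ != 0))
r2_test = model.score(X_test, y_test)  # out-of-sample R^2

print(f"engineered features: {n_features}")
print(f"selected by LASSO:   {n_selected}")
print(f"chosen alpha:        {lasso.alpha_:.4f}")
print(f"out-of-sample R^2:   {r2_test:.3f}")
```

Standardizing after the expansion keeps the penalty comparable across raw, squared, and interaction columns; fitting the scaler inside the pipeline also keeps test-set information out of training.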
- Synthetic RCT: 1000 observations with random treatment assignment
- Treatment effect: Simulated effect size with realistic noise structure
- Covariates: Four baseline variables (X1, X2, X3, X4) with known relationships
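The data-generating step and a basic balance check might look like the sketch below. The coefficients, the effect size `true_ate`, and the helper `welch_t` are assumptions made for illustration; the notebooks may use a different design and different tests.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Four baseline covariates X1..X4; the outcome equation below is hypothetical.
X = rng.normal(size=(n, 4))
D = rng.binomial(1, 0.5, size=n)  # random assignment, independent of X
true_ate = 2.0  # assumed effect size, for illustration only
Y = true_ate * D + X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)


def welch_t(a, b):
    """Welch t-statistic for a difference in means between two samples."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se


# Balance check: under valid randomization, covariate means should not differ
# systematically between treated and control units.
balance_t = np.array([welch_t(X[D == 1, j], X[D == 0, j]) for j in range(4)])
for j, t in enumerate(balance_t, start=1):
    print(f"X{j}: t = {t:+.2f}")
```

Large absolute t-statistics on baseline covariates would suggest a problem with the assignment mechanism; with true randomization they should mostly fall within the usual ±2 range.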
- Simple regression: Y ~ D (naive treatment effect)
- LASSO-assisted: Uses LASSO variable selection for covariate adjustment
- Randomization inference: Permutation-based hypothesis testing
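The three estimators can be sketched side by side as follows. This is an illustrative version on simulated data, not the notebooks' code: the outcome equation and `true_ate` are assumed, and the LASSO-assisted step shown here (select covariates by regressing Y on X with `LassoCV`, then run OLS of Y on D plus the selected covariates) is one common variant among several.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 4))          # baseline covariates X1..X4
D = rng.binomial(1, 0.5, size=n)     # randomized treatment indicator
true_ate = 2.0                       # assumed effect size (hypothetical)
Y = true_ate * D + X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)


def diff_in_means(y, d):
    """Treated-minus-control mean; equals the OLS slope in Y ~ D."""
    return y[d == 1].mean() - y[d == 0].mean()


# 1) Simple regression Y ~ D.
ate_simple = diff_in_means(Y, D)

# 2) LASSO-assisted adjustment: pick covariates that predict Y,
#    then regress Y on D and the selected covariates.
lasso = LassoCV(cv=5, random_state=0).fit(X, Y)
selected = np.flatnonzero(lasso.coef_ != 0)
Z = np.column_stack([D, X[:, selected]])
ate_lasso = LinearRegression().fit(Z, Y).coef_[0]

# 3) Randomization inference: reshuffle treatment labels to build the
#    distribution of the estimate under the sharp null of no effect.
n_perm = 2000
perm_stats = np.array(
    [diff_in_means(Y, rng.permutation(D)) for _ in range(n_perm)]
)
p_value = (1 + np.sum(np.abs(perm_stats) >= abs(ate_simple))) / (1 + n_perm)

print(f"simple ATE:         {ate_simple:.3f}")
print(f"LASSO-assisted ATE: {ate_lasso:.3f} (covariates kept: {selected + 1})")
print(f"permutation p-value: {p_value:.4f}")
```

Because treatment is randomized, both point estimates target the same ATE; the covariate-adjusted version typically has a smaller standard error when the selected covariates explain outcome variation.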
- Core libraries: scikit-learn, pandas, numpy, matplotlib, seaborn, statsmodels
- Features: Advanced data preprocessing, comprehensive visualization suite
- Strengths: Extensive machine learning ecosystem, excellent documentation
- Core libraries: glmnet, tidyverse, ggplot2, caret, broom
- Features: Native statistical computing, publication-quality graphics
- Strengths: Mature statistical packages, built-in hypothesis testing
- Core libraries: GLMNet.jl, DataFrames.jl, MLJ.jl, Plots.jl
- Features: High-performance computing, modern syntax
- Strengths: Speed optimization, growing ecosystem for scientific computing
cd Python/
pip install -r requirements.txt
jupyter notebook scripts/lasso_analysis.ipynb
cd R/
# Install required packages using the requirements list
pip install -r requirements.txt # For pip-style format
# OR use the R script: Rscript requirements.R
jupyter notebook scripts/lasso_analysis.ipynb
cd Julia/
# Install packages from requirements.txt
julia -e 'using Pkg; [Pkg.add(split(line, ">=")[1]) for line in readlines("requirements.txt") if !startswith(line, "#") && !isempty(strip(line))]'
jupyter notebook scripts/lasso_analysis.ipynb
This project is licensed under the MIT License; see the LICENSE file for details.
This repository demonstrates advanced statistical learning methods with emphasis on reproducible research, professional documentation, and cross-platform statistical computing for modern econometric and causal inference applications.