This project demonstrates how to use SQL for data preprocessing and feature engineering to train a machine learning model for network intrusion detection.
The data used in this project is simulated network traffic from an enterprise network. It includes features such as source and destination IP addresses, ports, protocols, and timestamps. The dataset can be downloaded from: https://unsw-my.sharepoint.com/:f:/g/personal/z5025758_ad_unsw_edu_au/EnuQZZn3XuNBjgfcUu4DIVMBLCHyoLHqOswirpOQifr1ag?e=gKWkLS
The "sql" folder contains the following SQL scripts:
- data_cleaning.sql: Cleans the raw data and handles missing values.
- feature_engineering.sql: Creates new features from the raw data, including aggregations and time-series features.
- data_export.sql: Exports the processed data in CSV format for model training.
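As a rough illustration of what the three scripts do, here is a minimal sketch that runs a cleaning query, a feature-engineering query, and a CSV export against an in-memory SQLite database. The table names, column names, and sample rows are hypothetical and are not taken from the actual scripts or dataset. SQLite is used only so that the example is self-contained, so the SQL may need adjusting for PostgreSQL or MySQL.

```python
import csv
import io
import sqlite3

# Hypothetical schema and sample rows; the real dataset's columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_traffic (
        src_ip TEXT, dst_ip TEXT, dst_port INTEGER,
        protocol TEXT, ts INTEGER, bytes INTEGER
    )
""")
rows = [
    ("10.0.0.1", "10.0.0.9", 80, "tcp", 100, 500),
    ("10.0.0.1", "10.0.0.9", 443, "TCP", 130, None),
    ("10.0.0.1", "10.0.0.8", 22, "tcp", 160, 300),
    ("10.0.0.2", "10.0.0.9", 53, None, 110, 60),
]
conn.executemany("INSERT INTO raw_traffic VALUES (?, ?, ?, ?, ?, ?)", rows)

# Cleaning step: normalise protocol case and fill missing values with defaults.
conn.execute("""
    CREATE TABLE clean_traffic AS
    SELECT src_ip, dst_ip, dst_port,
           LOWER(COALESCE(protocol, 'unknown')) AS protocol,
           ts,
           COALESCE(bytes, 0) AS bytes
    FROM raw_traffic
""")

# Feature engineering step: per-source aggregations plus a simple time feature.
features = conn.execute("""
    SELECT src_ip,
           COUNT(*) AS conn_count,
           COUNT(DISTINCT dst_port) AS distinct_ports,
           SUM(bytes) AS total_bytes,
           MAX(ts) - MIN(ts) AS active_seconds
    FROM clean_traffic
    GROUP BY src_ip
    ORDER BY src_ip
""").fetchall()

# Export step: write the features as CSV (to a string here, for demonstration).
header = ["src_ip", "conn_count", "distinct_ports", "total_bytes", "active_seconds"]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(features)
csv_text = buffer.getvalue()
print(csv_text)
```

Per-source counts of connections and distinct destination ports are a common starting point for spotting scanning behaviour, but which aggregations the project's scripts actually compute is not shown here.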
The "python" folder contains Python code for training a Random Forest model using scikit-learn.
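The training step might look roughly like the sketch below, which reads a CSV of engineered features and fits a scikit-learn Random Forest. This is not the project's actual training code: the column names, the synthetic data generated in place of the exported file, and the hyperparameters are all assumptions made for illustration.

```python
import csv
import io
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Build a synthetic stand-in for the exported CSV (hypothetical columns).
# In practice, open the file produced by data_export.sql instead.
rng = random.Random(0)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["conn_count", "distinct_ports", "total_bytes", "label"])
for _ in range(200):
    label = rng.randint(0, 1)  # 1 = malicious, 0 = benign
    conn_count = rng.randint(50, 200) if label else rng.randint(1, 60)
    distinct_ports = rng.randint(20, 100) if label else rng.randint(1, 25)
    total_bytes = rng.randint(100, 10_000)
    writer.writerow([conn_count, distinct_ports, total_bytes, label])
buffer.seek(0)

# Load features and labels from the CSV.
feature_names = ["conn_count", "distinct_ports", "total_bytes"]
X, y = [], []
for row in csv.DictReader(buffer):
    X.append([float(row[name]) for name in feature_names])
    y.append(int(row["label"]))

# Hold out a stratified test split so accuracy is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

Fixing `random_state` keeps the split and the forest reproducible, which makes it easier to compare runs with and without the engineered features.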
- Load the raw data into a SQL database (e.g., PostgreSQL, MySQL).
- Run the SQL scripts in the following order: data_cleaning.sql, feature_engineering.sql, data_export.sql.
- Use the exported CSV file to train the machine learning model using the Python code.
The new features developed in this project improved the Random Forest model accuracy by 12% in identifying malicious network activity.