A cloud-native, serverless ETL pipeline that automates real-time sales data generation, ingestion, and visualization.
Click here to view the Live Dashboard
(Note: If the app is asleep, please allow 30-60 seconds for the container to wake up.)
The objective of this project was to architect a scalable data pipeline that simulates real-time transaction processing. The system automatically generates synthetic sales records, stores them in a data lake, and provides immediate business intelligence insights through a custom dashboard.
- Fully Automated: No manual intervention required. Data is generated and ingested automatically via Amazon EventBridge triggers.
- Serverless Ingestion: Uses AWS Lambda to generate and process data without provisioning servers.
- Scalable Storage: Leverages Amazon S3 as a durable Data Lake for JSON documents.
- Real-Time Analytics: Features a cloud-deployed Streamlit dashboard for instant KPI visualization.
- Infrastructure as Code: Uses Python `boto3` for programmatic interaction with AWS services.
This project moves away from traditional server-based architectures to a fully Event-Driven Serverless model.
graph LR
A[EventBridge Scheduler] -- Trigger (Every 5 min) --> B[AWS Lambda]
B -- Generate & Load JSON --> C[Amazon S3 Data Lake]
C -- Fetch Data --> D[Streamlit Dashboard]
D -- Visualize --> E[End User]
- Orchestration (Amazon EventBridge): Acts as the cron scheduler, triggering the data generation event every 5 minutes to simulate a live production environment.
- Compute (AWS Lambda): A Python-based serverless function that generates synthetic transaction records (Sales, Quantity, Product Category).
- Storage (Amazon S3): Acts as a durable Data Lake, storing raw JSON ingestion files.
- Visualization (Streamlit): A cloud-deployed Python application that ingests data from S3, performs transformation (Pandas), and renders real-time KPIs.
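The compute and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `lambda_function.py`: the category list, the `transaction_id` and `timestamp` fields, the batch size of 10, the `sales/` key prefix, and the `s3_client` and `bucket` parameters are all assumptions made for the example. Only `put_object` is a real `boto3` S3 client call.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical categories; the real list lives in lambda_function.py.
CATEGORIES = ["Electronics", "Clothing", "Home", "Books"]


def generate_record(rng=random):
    """Build one synthetic transaction with sales, quantity, and category."""
    return {
        "transaction_id": str(uuid.uuid4()),                  # assumed field
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field
        "product_category": rng.choice(CATEGORIES),
        "quantity": rng.randint(1, 5),
        "sales": round(rng.uniform(5.0, 500.0), 2),
    }


def lambda_handler(event, context, s3_client=None, bucket="your_s3_bucket_name"):
    """Generate a small batch and write it to S3 as one JSON document."""
    if s3_client is None:
        import boto3  # imported lazily so the sketch can be exercised offline
        s3_client = boto3.client("s3")

    records = [generate_record() for _ in range(10)]
    stamp = f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}"
    key = f"sales/{stamp}-{uuid.uuid4().hex[:8]}.json"  # assumed key layout
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        ContentType="application/json",
    )
    return {"statusCode": 200, "key": key, "count": len(records)}
```

Passing the client in as a parameter keeps the handler testable without AWS access; when Lambda invokes it with only `event` and `context`, the default client is created.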
Before running this project, ensure you have the following:
- AWS Account: Access to the AWS Console (Free Tier is sufficient).
- Python 3.10+: Installed on your local machine.
- AWS CLI: Installed and configured with valid IAM credentials.
git clone https://github.com/shivamgravity/aws-serverless-data-pipeline
cd aws-serverless-data-pipeline
It is recommended to use a virtual environment to manage dependencies.
# Create venv
python -m venv venv
# Activate venv (Windows)
venv\Scripts\activate
# Activate venv (Mac/Linux)
source venv/bin/activate
pip install -r requirements.txt
(Dependencies include: streamlit, boto3, pandas, matplotlib, seaborn, python-dotenv)
To allow the local dashboard to read from S3, rename the template file and add your keys:
- Rename `.env.example` to `.env`, or create a new `.env` file.
- Add your credentials:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=your_s3_bucket_region
AWS_S3_BUCKET_NAME=your_s3_bucket_name
(Note: For Cloud deployment, use Streamlit Secrets instead of .env)
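One way to read these four settings from either source is sketched below. The `load_aws_config` helper and `REQUIRED_KEYS` are hypothetical names, not taken from `dashboard.py`. The sketch assumes that locally the values are already in the process environment (for example after calling `load_dotenv()` from python-dotenv), and that on the cloud a mapping-like secrets object is passed in instead.

```python
import os

REQUIRED_KEYS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_S3_BUCKET_NAME",
)


def load_aws_config(source=None):
    """Return the four settings from a mapping, defaulting to os.environ."""
    source = os.environ if source is None else source
    # Treat empty strings the same as missing values.
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {key: source[key] for key in REQUIRED_KEYS}
```

Failing early with the names of the missing keys tends to be easier to debug than an "Access Denied" response from S3 later on.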
The pipeline is automated via Amazon EventBridge.
- Navigate to AWS Lambda > Functions > `GenerateSalesData`.
- Verify that the EventBridge trigger is active (e.g., set to run every 5 minutes).
- Verification: Check your S3 bucket to see new `.json` files appearing automatically.
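The S3 check can also be done from Python instead of the console. The sketch below is one possible approach: `list_json_keys` is a hypothetical helper, and the empty default prefix is an assumption about where the Lambda writes its files. It relies on the `boto3` S3 paginator for `list_objects_v2`.

```python
def list_json_keys(s3_client, bucket, prefix=""):
    """Return (key, last_modified) pairs for .json objects, newest first."""
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # An empty bucket or prefix yields a page with no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                found.append((obj["Key"], obj["LastModified"]))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling it with `boto3.client("s3")` and your bucket name, then printing the first few entries, should show timestamps roughly five minutes apart if the schedule is active.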
Option A: Run Locally
Run the Streamlit application from your terminal:
streamlit run dashboard.py
The dashboard will open at http://localhost:8501.
Option B: Create a live cloud deployment on Streamlit Cloud.
- `lambda_function.py`: The logic deployed to AWS Lambda. It uses the `random` library to simulate sales and `boto3` to write to S3.
- `dashboard.py`: The main application file. It handles the S3 connection, data parsing, and UI rendering using Streamlit and Matplotlib. It automatically detects whether it is running locally or on the cloud.
- `requirements.txt`: List of Python libraries required to run the dashboard.
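To give a feel for the data-parsing step, here is a dependency-free sketch of turning raw JSON documents into KPI totals. The dashboard itself uses Pandas for this. The `compute_kpis` helper and the field names `sales`, `quantity`, and `product_category` are assumptions for illustration, as is the layout of one JSON list of records per file.

```python
import json
from collections import defaultdict


def compute_kpis(json_documents):
    """Aggregate totals from raw JSON strings, each holding a list of records."""
    total_sales = 0.0
    total_quantity = 0
    sales_by_category = defaultdict(float)
    for document in json_documents:
        for record in json.loads(document):
            total_sales += record["sales"]
            total_quantity += record["quantity"]
            sales_by_category[record["product_category"]] += record["sales"]
    return {
        "total_sales": round(total_sales, 2),
        "total_quantity": total_quantity,
        "sales_by_category": {
            name: round(value, 2) for name, value in sales_by_category.items()
        },
    }
```

With Pandas, the same result would typically come from loading the records into a DataFrame and grouping by category; the plain-Python version is shown only to keep the example self-contained.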
- "No Data Found": Ensure you have run the Lambda function at least once. Check your S3 bucket permissions.
- "Access Denied" Error: This usually means your local AWS credentials are missing or incorrect. Re-run
`aws configure` or check your `.env` file.
- "Module Not Found": Ensure you activated your virtual environment before running `streamlit run`.
- Integration: Add AWS Glue for schema inference and cataloging.
- Querying: Implement Amazon Athena to run SQL queries directly on S3 JSON data.
- Alerting: Configure SNS (Simple Notification Service) to alert on unusually high-value transactions.
