A production-grade financial data pipeline that normalizes multi-currency bank transactions to a single base currency (SGD) using real-time exchange rates. Built with AWS services and PySpark for scalable processing and analytics.
- Business Overview
- Project Aim
- Architecture
- Key Features
- Tech Stack
- Data Description
- Approach
- Prerequisites
- Project Structure
- Setup Instructions
- Usage
- Analytics Queries
- Key Takeaways
- Contributing
- License
In the application of remittance and currency conversion, the normalization of values plays a crucial role in various industries, especially the banking sector. Most FinTech and bank sector companies extend loans to individuals and businesses only after evaluating their creditworthiness based on rigorous analysis of past transactions and remittances.
Customers who are frequent travelers or engage in international transactions often perform banking activities such as deposits and withdrawals in different currencies like INR, USD, and SGD. This inconsistency creates challenges when assessing their financial health for services like lending.
This comprehensive data pipeline integrates raw bank transaction data with real-time API exchange rates, then maps transactions to standardized currency values by date. This solution:
- Streamlines workflows for currency normalization
- Empowers organizations to focus on lending decisions
- Drives insights rapidly for business growth
| Use Case | Description |
|---|---|
| Credit Worthiness Analysis | Evaluate customer creditworthiness based on deposit-to-withdrawal ratios |
| Currency Exposure Analysis | Track deposits and withdrawals across 160+ currencies |
| Account Balance Tracking | Monitor balance changes over time in standardized SGD |
| Transaction Analytics | Perform ad-hoc analysis on banking transactions |
The project aims to normalize raw bank transaction data into a single base currency (Singapore Dollars - SGD) using AWS services. The primary goal is to transform customers' transaction values with precise reference to real-time exchange rates fetched from the Open Exchange API at the corresponding transaction dates.
Pipeline Flow:
- Docker container on EC2 fetches real-time exchange rates from Open Exchange API
- Exchange rates uploaded to S3 alongside bank transaction data
- Spark transformations run on EMR cluster to calculate normalized transaction values
- Processed data stored in S3 as partitioned Parquet files
- Glue Crawler catalogs data for Athena SQL analysis
flowchart LR
subgraph Input["Input Layer"]
API[🌐 API]
subgraph Docker["🐳 Docker"]
Python[🐍 Python]
end
end
API -->|GET/Response| Python
Python -->|Ingest if status 200| S3Raw[☁️ S3<br/>data/currency]
S3Raw -->|Data Manipulation| EMR[🔥 EMR]
subgraph Output["Output Layer"]
S3Proc[☁️ S3<br/>Processed]
Athena[📊 Athena]
end
EMR --> S3Proc
S3Proc --> Athena
| Feature | Description |
|---|---|
| Data Extraction | API calls in Docker container on EC2 ensure isolated, scalable, and consistent execution across environments |
| Central Location | Raw data, API data, and processed data stored in single S3 bucket for simplified monitoring and updates |
| Complex Workflow Handling | Cloud-based pipeline efficiently manages complex data workflows with scalability and robustness |
| Data Analysis | Transformed data analyzed using SQL queries in Athena within the cloud workspace |
| Historical Backfill | Bulk historical data loading capability for past exchange rates |
| Partitioned Storage | Year/month partitioning for optimized query performance |
| Service | Description |
|---|---|
| AWS S3 | Object storage for data lake - stores raw transactions, API responses, and processed data |
| AWS EC2 | Scalable virtual servers hosting Docker containers for API data extraction |
| AWS EMR | Managed Spark clusters for distributed big data processing and transformations |
| AWS Glue | Serverless ETL with Data Catalog for metadata management and table discovery |
| AWS Athena | Serverless SQL queries directly on S3 data without loading into a database |
| Docker | Containerization ensuring consistent execution across dev, test, and production |
| PySpark | Distributed data processing library for large-scale transformations |
| Python | Primary programming language for API ingestion and orchestration |
This project uses two distinct data sources:
Historical exchange rates with time-specific timestamps for converting transaction amounts to SGD.
| Column | Description |
|---|---|
account_no |
Unique identifier for every bank account |
date |
Date when the transaction occurred |
transaction_details |
Describes the nature of the transaction |
chq_no |
Cheque number used, if applicable |
value_date |
Date when transaction value is applied to account |
withdrawal_amt |
Amount withdrawn in the transaction |
withdrawal_currency |
Currency of the withdrawal (e.g., USD, INR, EUR) |
deposit_amt |
Amount deposited into the account |
deposit_currency |
Currency of the deposit |
balance_amt |
Account balance after transaction |
flowchart LR
A[1️⃣ Load Raw Data<br/>to S3] --> B[2️⃣ Create EC2<br/>Install Docker]
B --> C[3️⃣ Run Docker<br/>Fetch Rates]
C --> D[4️⃣ Create EMR<br/>Cluster]
D --> E[5️⃣ CSV to<br/>Parquet]
E --> F[6️⃣ PySpark<br/>Normalize SGD]
F --> G[7️⃣ Store in S3]
G --> H[8️⃣ Glue Crawler]
H --> I[9️⃣ Athena<br/>Analytics]
Step-by-Step:
- Raw Data Loading: Bank transaction data loaded into S3 bucket as central location
- EC2 + Docker Setup: EC2 instance created with Docker installed for isolated execution
- API Data Extraction: Dockerfile executed within container to fetch currency exchange rates
- EMR Cluster Creation: Big data cluster provisioned for Spark processing
- Data Format Conversion: Spark job converts CSV to optimized Parquet format
- Currency Normalization: PySpark transforms raw data, normalizing all currencies to SGD
- Processed Storage: Final data stored in S3 as partitioned Parquet files
- Catalog Population: Glue Crawler creates metadata in Glue Data Catalog
- Analytics Queries: Athena SQL queries analyze the processed data
- AWS Account - Sign up for free tier
- OpenExchangeRates API Key - Get free API key
- Docker - Install Docker Desktop
- AWS CLI - Installation Guide
- Amazon S3 (Storage)
- Amazon EMR (Spark Processing)
- Amazon Athena (SQL Queries)
- AWS Glue Data Catalog
financial-data-pipeline/
├── Code/
│ └── pysparkapi/
│ ├── main.py # Main API ingestion script
│ ├── backfill.py # Historical data backfill script
│ ├── source_processing.py # CSV to Parquet conversion
│ ├── Data Manipulation.py # PySpark transformation job
│ ├── Athena SQL Queries.sql # Analytics queries
│ ├── Dockerfile # Docker configuration
│ ├── requirements.txt # Python dependencies
│ ├── aws.txt # AWS credentials template
│ ├── Connect to EMR.txt # EMR connection guide
│ └── crontab.txt # Cron schedule configuration
├── Data/
│ └── banktxn.csv # Sample bank transaction data
└── README.md
git clone https://github.com/yourusername/financial-data-pipeline.git
cd financial-data-pipeline# Install AWS CLI
# macOS
brew install awscli
# Configure credentials
aws configure
# Enter your AWS Access Key ID, Secret Access Key, Region# Create buckets for the pipeline
aws s3 mb s3://your-bucket-name
# Create folder structure
aws s3api put-object --bucket your-bucket-name --key api_response/
aws s3api put-object --bucket your-bucket-name --key banktxn/csv/
aws s3api put-object --bucket your-bucket-name --key banktxn/parquet/
aws s3api put-object --bucket your-bucket-name --key banktxn/processed/aws s3 cp Data/banktxn.csv s3://your-bucket-name/banktxn/csv/cd Code/pysparkapi
pip install -r requirements.txt# Build Docker image
docker build -t financial-pipeline . -f Dockerfile
# Run container
docker run -dit financial-pipeline
# Copy AWS credentials to container
docker cp ~/.aws/credentials <container-id>:/root/.aws/credentials# Fetch exchange rates for a specific date
python main.py --run_ts "2024-01-15" --config '{
"app_id": "YOUR_OPENEXCHANGE_API_KEY",
"s3_out_location": "s3://your-bucket-name/api_response/",
"s3_error_out_location": "s3://your-bucket-name/error_temp_dir/"
}'
# Fetch today's exchange rates (omit --run_ts)
python main.py --config '{
"app_id": "YOUR_OPENEXCHANGE_API_KEY",
"s3_out_location": "s3://your-bucket-name/api_response/",
"s3_error_out_location": "s3://your-bucket-name/error_temp_dir/"
}'To load historical exchange rates:
python backfill.py- SSH into EMR Cluster:
ssh -i your-key.pem hadoop@your-emr-master-dns- Convert CSV to Parquet:
spark-submit source_processing.py- Run Data Transformation:
spark-submit "Data Manipulation.py"Run these queries in Amazon Athena after setting up the Glue Data Catalog:
SELECT
account_no,
SUM(COALESCE(withdrawal_sgd_amt, 0)) AS total_withdrawals_sgd,
SUM(COALESCE(deposit_sgd_amt, 0)) AS total_deposits_sgd
FROM "pyspark-database"."processed"
GROUP BY account_no;SELECT
account_no,
SUM(deposit_sgd_amt) AS total_deposits_sgd,
SUM(withdrawal_sgd_amt) AS total_withdrawals_sgd,
SUM(deposit_sgd_amt) / NULLIF(SUM(withdrawal_sgd_amt), 0) AS credit_ratio,
CASE
WHEN SUM(deposit_sgd_amt) / NULLIF(SUM(withdrawal_sgd_amt), 0) > 1.5
THEN 'High Creditworthiness'
WHEN SUM(deposit_sgd_amt) / NULLIF(SUM(withdrawal_sgd_amt), 0) BETWEEN 1.0 AND 1.5
THEN 'Moderate Creditworthiness'
ELSE 'Low Creditworthiness'
END AS creditworthiness
FROM "pyspark-database"."processed"
GROUP BY account_no;SELECT
deposit_currency,
SUM(deposit_sgd_amt) AS total_deposit
FROM "pyspark-database"."processed"
GROUP BY deposit_currency;Set up automated daily ingestion:
# Edit crontab
crontab -e
# Add this line to run daily at 8 AM
0 8 * * * /usr/bin/python3 /path/to/main.py --config '{"app_id":"YOUR_KEY","s3_out_location":"s3://bucket/api_response/","s3_error_out_location":"s3://bucket/error/"}'| Variable | Description |
|---|---|
AWS_ACCESS_KEY_ID |
AWS access key |
AWS_SECRET_ACCESS_KEY |
AWS secret key |
AWS_DEFAULT_REGION |
AWS region (e.g., us-east-1) |
OPENEXCHANGE_APP_ID |
OpenExchangeRates API key |
s3://your-bucket/
├── api_response/
│ └── response/
│ └── YYYY-MM-DD/
│ └── currency_details_sgd.parquet
├── banktxn/
│ ├── csv/
│ │ └── banktxn.csv
│ ├── parquet/
│ │ └── *.parquet
│ └── processed/
│ └── *.parquet
└── error_temp_dir/
└── error_error/
└── YYYY-MM-DD/
└── currency_details_sgd_error.parquet
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
By completing this project, you will gain:
- ✅ Understanding of real-world FinTech use cases and datasets
- ✅ Hands-on experience with AWS services (S3, EC2, EMR, Glue, Athena)
- ✅ Ability to visualize and build complete data pipelines
- ✅ SSH connection to AWS EC2 instances
- ✅ Docker installation and container management on EC2
- ✅ Building Docker images and running containers
- ✅ Processing API data within containerized environments
- ✅ Creating and managing AWS EMR clusters
- ✅ PySpark data processing and transformations
- ✅ Working with Glue Crawlers and Data Catalog
- ✅ Writing analytical SQL queries in Athena
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenExchangeRates for providing the exchange rate API
- AWS for cloud infrastructure
- Apache Spark for big data processing capabilities
⭐ If you found this project helpful, please consider giving it a star!