Pattern-aware fraud analytics ✦ Interactive dashboard ✦ CSV-to-insight pipeline ✦ Explainable model scoring
Introduction • Patterns • Model Performance • Frontend • Usage • Deployment
🔥 Highlights
- Full fraud workflow: upload raw transaction CSVs, clean them, engineer features, detect fraud patterns, train models, and visualize the results.
- Explainable fraud rules: every suspicious transaction can be tied back to explicit business-style patterns, not just a black-box score.
- Dashboard-ready API: the backend returns dataset summaries, quality scoring, distributions, top risky transactions, confusion matrix, thresholds, and feature importance.
- Hackathon-friendly storytelling: this project is built to clearly show what was cleaned, what patterns were triggered, what the model learned, and how the UI helps investigators drill down.
🧠 Core idea
- Rule + ML hybrid: the system combines interpretable fraud patterns with machine learning.
- Cleaning-first pipeline: the model only works after standardizing dirty amounts, timestamps, cities, IPs, devices, and statuses.
- Interactive investigation: users can inspect distributions, filter by pattern, and click into risky transactions from the frontend.
Fraud datasets are rarely clean, and fraud detection is rarely just a modeling problem. The real challenge is building a pipeline that can:
- handle messy CSVs from the real world
- normalize inconsistent transaction fields
- detect suspicious behavioral patterns
- produce explainable fraud flags
- score transactions with machine learning
- return outputs that a frontend can actually visualize and investigate
This project does exactly that.
It combines a FastAPI backend and a React + Vite frontend into a fraud-intelligence system that accepts raw transaction files and returns:
- cleaned, model-ready data
- quality-of-data diagnostics
- fraud pattern summaries
- transaction-level fraud predictions
- model metrics and benchmark reports
- frontend-friendly distributions for charts and drilldowns
- Normalizes raw schema aliases
- Cleans:
  - transaction amounts
  - timestamps
  - user and merchant cities
  - payment methods
  - merchant categories
  - device identifiers
  - statuses
  - IP addresses
- Builds engineered fraud features
- Detects rule-based fraud patterns
- Trains and scores:
  - Random Forest
  - XGBoost
- Returns rich analytics payloads for the UI
- Uploads CSVs
- Shows dataset summary and data quality
- Visualizes:
  - all city distributions
  - merchant city distributions
  - payment method distributions
  - merchant category distributions
  - device type distributions
  - fraud pattern counts
- Shows model performance cards
- Supports transaction drilldown and pattern-based filtering
Instead of treating this as only a model-training app, the project is designed as a fraud intelligence pipeline.

    graph LR
        A["Raw Transaction CSV"] --> B["Cleaning + Standardization Pipeline"]
        B --> C["Feature Engineering"]
        C --> D["Fraud Pattern Detection"]
        D --> E["Model Training / Scoring"]
        E --> F["Analytics API Response"]
        F --> G["Interactive React Dashboard"]

The current fraud rule engine uses 8 patterns.

| Pattern | Meaning | Why it matters |
|---|---|---|
| `pattern_location_mismatch` | User city and merchant city do not match | Useful for spotting suspicious geographic behavior |
| `pattern_odd_hour_transaction` | Transaction happens between 12 AM and 4 AM with other suspicious context | Fraud often spikes during low-supervision hours |
| `pattern_high_amount_vs_balance` | Amount is unusually high relative to account balance | Captures draining behavior and risky spend |
| `pattern_unknown_device` | Device is unseen, malformed, or suspicious for that user | Strong account takeover signal |
| `pattern_failed_high_value` | High-value transaction fails | Often seen in fraud probing and card testing |
| `pattern_ip_risk` | Invalid IP, shared IP, or high IP traffic volume | Helps catch shared infrastructure abuse |
| `pattern_velocity` | Too many transactions in a short window | Useful for burst attacks and rapid retries |
| `pattern_post_failure_success` | Success occurs after a streak of failures | Classic attack-recovery pattern |

- Velocity is now stricter to reduce overfiring:
  - `txn_count_1min > 8`, or
  - `txn_count_1h > 25` with `time_diff < 5`
- Post-failure success now requires:
  - `success` after at least `3` consecutive failures

This tuning reduced over-flagging on smaller sample files while keeping the 10k fraud dataset closely aligned with its intended label distribution.
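
The two tuned rules above can be read as simple predicates. The following is a minimal sketch, not the project's actual rule engine: the function names are hypothetical, the unit of `time_diff` is not stated in this README, and the normalized status value `"success"` is an assumption.

```python
def velocity_flag(txn_count_1min: int, txn_count_1h: int, time_diff: float) -> bool:
    """Fire when either tuned velocity condition holds."""
    burst = txn_count_1min > 8                         # many transactions within one minute
    sustained = txn_count_1h > 25 and time_diff < 5    # busy hour plus a very short gap
    return burst or sustained


def post_failure_success_flag(status: str, consecutive_failures: int) -> bool:
    """Fire on a success that follows at least 3 consecutive failures."""
    # Assumes statuses were already normalized to lower-case strings.
    return status == "success" and consecutive_failures >= 3
```

Keeping each rule as a small pure function makes it easy to show an investigator exactly which condition fired for a given transaction.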
The system does not rely only on raw columns. It derives fraud-focused features such as:
- `clean_amount`
- `account_balance`
- `txn_count_1min`
- `txn_count_1h`
- `time_diff`
- `consecutive_failures`
- `device_user_degree`
- `ip_velocity_all_users`
- `ip_user_degree`
- `payment_method_entropy_10m`
- `balance_depletion_ratio`
- `amount_to_balance_ratio`
- `is_cross_city`
- `hour`
- `is_odd_hour`
- `is_post_failure_success`
- `anomaly_score`

These become the bridge between raw transaction data and model-ready signals.
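
To make a couple of these features concrete, here is a rough sketch of how `txn_count_1min` and `amount_to_balance_ratio` could be derived. It is an illustration under assumptions, not the project's feature code: the input key names (`user_id`, `timestamp`), the choice to count the current transaction in its own window, and the `0.0` fallback for non-positive balances are all hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def add_velocity_features(rows):
    """Annotate rows with txn_count_1min and amount_to_balance_ratio.

    `rows` is a list of dicts with keys user_id, timestamp (datetime),
    clean_amount and account_balance; these key names are assumptions.
    """
    window = timedelta(minutes=1)
    recent = defaultdict(deque)  # user_id -> timestamps within the last minute
    out = []
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        seen = recent[row["user_id"]]
        # Drop timestamps that are more than one minute older than this row.
        while seen and row["timestamp"] - seen[0] > window:
            seen.popleft()
        seen.append(row["timestamp"])  # the current transaction counts too
        balance = row["account_balance"]
        ratio = row["clean_amount"] / balance if balance > 0 else 0.0
        out.append({**row, "txn_count_1min": len(seen), "amount_to_balance_ratio": ratio})
    return out
```

The returned list is sorted by timestamp, so callers that need the original order should carry an index column through.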
On the labeled benchmark file transactions_10k_with_fraud.csv, the tuned rule engine produced:
| Metric | Value |
|---|---|
| Total rows | 10,000 |
| Source fraud labels | 3,454 |
| Tuned pipeline fraud flags | 3,435 |
| Tuned pipeline non-fraud flags | 6,565 |

| Pattern | Trigger count |
|---|---|
| `pattern_location_mismatch` | 2,060 |
| `pattern_odd_hour_transaction` | 1,156 |
| `pattern_unknown_device` | 506 |
| `pattern_failed_high_value` | 397 |
| `pattern_high_amount_vs_balance` | 237 |
| `pattern_post_failure_success` | 5 |
| `pattern_ip_risk` | 0 |
| `pattern_velocity` | 0 |

This is a much healthier result than the earlier over-aggressive rules: the tuned rules flag 3,435 rows against 3,454 labeled frauds, so the overall flag rate stays very close to the dataset's actual fraud rate.
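
Trigger counts like the ones above are column sums over per-transaction boolean flags. A minimal sketch, assuming each scored row is a dict whose pattern columns carry the `pattern_` prefix and hold a truthy value when the rule fired; `pattern_summary` is a hypothetical helper, not the project's function:

```python
from collections import Counter


def pattern_summary(rows):
    """Count how often each pattern_* flag fired, most frequent first."""
    counts = Counter()
    for row in rows:
        for name, fired in row.items():
            # Only columns with the pattern_ prefix are treated as rule flags.
            if name.startswith("pattern_") and fired:
                counts[name] += 1
    return counts.most_common()
```

Patterns that never fire do not appear in the result, so a caller that wants explicit zero rows has to add them.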
The current benchmark below was run on:
- Dataset: `transactions_10k_with_fraud.csv`
- Rows used: `10,000`
- Feature count: `21`
- Target column: `raw_fraud_label`
- Label mode: `real`

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Random Forest | 0.9480 | 0.9292 | 0.9184 | 0.9238 | 0.9798 |
| XGBoost | 0.9475 | 0.9291 | 0.9169 | 0.9230 | 0.9832 |

| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| Random Forest | 1266 | 48 | 56 | 630 |
| XGBoost | 1266 | 48 | 57 | 629 |

| Model | Fraud predicted | Non-fraud predicted | Fraud rate |
|---|---|---|---|
| Random Forest | 3,372 | 6,628 | 33.72% |
| XGBoost | 3,387 | 6,613 | 33.87% |

- Accuracy tells us overall correctness
- Precision tells us how many flagged frauds are truly fraud
- Recall tells us how many frauds we successfully catch
- F1 balances precision and recall
- ROC-AUC tells us how well the model separates fraud from non-fraud across thresholds
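
As a quick sanity check, the threshold-dependent metrics can be recomputed from a confusion matrix. The sketch below uses the first confusion matrix reported above (TN 1266, FP 48, FN 56, TP 630) and reproduces the Random Forest row of the benchmark table. The helper name is hypothetical, and it assumes at least one predicted and one actual fraud so that no denominator is zero. ROC-AUC is not included because it needs the full score distribution, not a single matrix.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of flagged rows that are truly fraud
    recall = tp / (tp + fn)      # share of fraud rows that were flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


rf = metrics_from_confusion(tn=1266, fp=48, fn=56, tp=630)
print({name: round(value, 4) for name, value in rf.items()})
# {'accuracy': 0.948, 'precision': 0.9292, 'recall': 0.9184, 'f1': 0.9238}
```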
For a hackathon demo, this is useful because it shows both:
- interpretable fraud rules
- strong predictive performance
Random Forest is an ensemble of decision trees trained on bootstrapped samples of the data.
- strong baseline
- robust on tabular fraud data
- handles mixed numeric and categorical features well after preprocessing
- easy to compare against XGBoost
- Training is roughly proportional to:
  - number of trees
  - number of rows
  - number of features sampled per split
- Intuitively:
  - more trees = better stability, more compute
XGBoost is a gradient-boosted tree algorithm that builds trees sequentially to correct prior errors.
- one of the strongest tabular ML algorithms
- excellent for fraud scoring
- handles non-linear behavior well
- supports class imbalance tuning
- Training is roughly proportional to:
  - number of boosting rounds
  - number of rows
  - number of features
  - tree depth
- Intuitively:
  - more boosting rounds and deeper trees = more power, more compute
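
The class imbalance tuning mentioned above is commonly done through XGBoost's `scale_pos_weight` parameter, often set to the ratio of negative to positive training rows. That parameter is not part of the configuration reported in this README, so the sketch below only illustrates the arithmetic, using the benchmark label counts (3,454 fraud labels in 10,000 rows) as an assumed input. The helper name is hypothetical.

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Common heuristic: weight positives by the negative-to-positive ratio."""
    return n_negative / n_positive


weight = scale_pos_weight(n_negative=10_000 - 3_454, n_positive=3_454)
print(round(weight, 3))  # 1.895
```

In practice the ratio would be computed on the training split only, not on the full dataset.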
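
    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=450,       # number of boosting rounds
        max_depth=5,            # maximum depth of each tree
        learning_rate=0.03,     # shrinkage applied to each round
        subsample=0.75,         # fraction of rows sampled per tree
        colsample_bytree=0.75,  # fraction of features sampled per tree
        min_child_weight=5,     # minimum summed instance weight in a child
        gamma=1.0,              # minimum loss reduction required to split
        reg_lambda=4.0,         # L2 regularization on leaf weights
        reg_alpha=0.5,          # L1 regularization on leaf weights
    )

The frontend is not just a file uploader. It is designed like a lightweight fraud investigation console.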
- upload flow for raw CSVs
- KPI cards:
  - rows
  - columns
  - fraud counts
  - quality score
- data quality badge: `good`, `warning`, `poor`
- full distributions for:
  - cities
  - merchant cities
  - payment methods
  - merchant categories
  - device types
  - statuses
- pattern summary cards
- confusion matrix and threshold tables
- top risky transactions
- transaction explorer with pattern filtering
- React
- TypeScript
- Vite
- React Router
- custom analytics UI

The frontend helps judges immediately understand:
- what data came in
- how dirty it was
- what the pipeline fixed
- which fraud patterns fired
- how the models performed
- which transactions are most suspicious
The cleaning pipeline explicitly handles:
- inconsistent timestamps
- malformed or missing city names
- invalid IP addresses
- messy amount formats
- unknown payment methods
- malformed device IDs
- status normalization

`invalid_ip`, `amount_normalization`, `timestamp_parsing`, `city_normalization`, `device_normalization`, `status_normalization`
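
Two of the cleaning steps listed above, amount normalization and IP validation, can be sketched with the standard library alone. This is an illustrative approximation rather than the project's cleaning code: the function names are hypothetical, and the amount parser assumes a dot as the decimal separator and treats commas as thousands separators.

```python
import ipaddress
import re


def clean_amount(raw):
    """Parse a messy amount string into a float, or None if unparseable."""
    if raw is None:
        return None
    # Keep digits, minus sign and decimal point; drop currency symbols, commas, spaces.
    text = re.sub(r"[^0-9.\-]", "", str(raw))
    try:
        return float(text)
    except ValueError:
        return None


def is_valid_ip(raw):
    """Return True when `raw` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(str(raw).strip())
        return True
    except ValueError:
        return False
```

Returning `None` instead of raising keeps a bad cell from aborting the whole upload, and the count of `None` results can feed the data quality score.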
The API returns a quality score and level so the frontend can visually communicate dataset reliability.
That means the UI can say:
- this dataset is clean and trustworthy
- this dataset is partially inconsistent
- this dataset is poor quality and needs caution
The backend returns more than predictions. It returns a complete analytics payload.
`POST /api/v1/clean-csv` returns:
- row count
- column count
- preview rows
- cleaning actions
- quality metrics
- all major distributions
- pattern summary
- cleaned CSV download URL
`POST /api/v1/predict-csv` returns:
- dataset summary
- quality metrics
- distributions
- pattern summary
- top risky transactions
- per-model metrics
- confusion matrix
- threshold table
- feature importance
- predictions download URLs
- threshold report download URLs
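
The threshold table in the payload above can be thought of as precision and recall recomputed at several decision cutoffs over the model's fraud scores. A minimal sketch under assumptions: the cutoffs `0.3`, `0.5`, `0.7`, the row field names, and the helper name are all hypothetical and do not describe the API's actual response schema.

```python
def threshold_table(y_true, scores, thresholds=(0.3, 0.5, 0.7)):
    """Return one precision/recall row per decision threshold."""
    rows = []
    for t in thresholds:
        preds = [score >= t for score in scores]  # flag as fraud at or above the cutoff
        tp = sum(p and y == 1 for p, y in zip(preds, y_true))
        fp = sum(p and y == 0 for p, y in zip(preds, y_true))
        fn = sum((not p) and y == 1 for p, y in zip(preds, y_true))
        rows.append({
            "threshold": t,
            "flagged": tp + fp,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return rows
```

Raising the cutoff typically trades recall for precision, which is the trade-off an investigator would read off such a table.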
Install dependencies:

    # Frontend
    cd fraudlenz
    npm install

    # Backend
    cd server
    pip install -r requirements.txt

Run locally:

    # Backend
    cd server
    uvicorn app.server:app --reload

    # Frontend
    cd fraudlenz
    npm run dev

Clean a CSV (Windows `cmd` line continuation shown):

    curl -X POST "http://127.0.0.1:8000/api/v1/clean-csv" ^
      -F "file=@C:\path\to\sample.csv"

Score a CSV:

    curl -X POST "http://127.0.0.1:8000/api/v1/predict-csv" ^
      -F "file=@C:\path\to\sample.csv"

The backend is deployed on Render because the Python ML stack is too large for a small Vercel serverless Python function bundle.
The frontend can point to the backend with:

    VITE_API_BASE_URL=https://bastikahasti-ml.onrender.com

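
The same upload that the curl calls in the usage section perform can be scripted from Python against either the local or the deployed base URL. The sketch below builds the multipart request with the standard library only; `build_upload_request` is a hypothetical helper, the sample CSV content is made up, and the only details taken from this README are the endpoint path and the `file` form field.

```python
import urllib.request
import uuid


def build_upload_request(base_url: str, endpoint: str, filename: str, content: bytes):
    """Build a multipart/form-data POST carrying the CSV in the `file` field."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: text/csv",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_upload_request(
    "http://127.0.0.1:8000", "/api/v1/predict-csv", "sample.csv", b"amount,status\n10,success\n"
)
# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     payload = response.read()
```
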

    BastiKaHasti_ML/
    ├─ server/
    │  ├─ app/
    │  │  ├─ api/
    │  │  ├─ core/
    │  │  ├─ model/
    │  │  ├─ schemas/
    │  │  └─ server.py
    │  └─ storage/
    └─ fraudlenz/
       ├─ src/
       └─ package.json

This project is not only about training a fraud model. It is about making fraud detection:
- explainable
- visual
- benchmarked
- deployable
- usable by a frontend team
It shows the full story from raw CSV to cleaned data to fraud patterns to model performance to interactive investigation.
If you want to build a hackathon demo that feels like a real fraud-ops product instead of just a notebook, this is that system.