Skip to content

Satharva2004/Fraud-Detections-ML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Fraud Intelligence Models FastAPI React + Vite

BastiKaHasti ML: Fraud Detection Intelligence

Pattern-aware fraud analyticsΒ  β—¦ Β Interactive dashboardΒ  β—¦ Β CSV-to-insight pipelineΒ  β—¦ Β Explainable model scoring

πŸ“– IntroductionΒ  β€’ Β  πŸ•΅οΈ PatternsΒ  β€’ Β  πŸ“Š Model PerformanceΒ  β€’ Β  πŸ–₯️ FrontendΒ  β€’ Β  βš™οΈ UsageΒ  β€’ Β  πŸš€ Deployment

πŸ”₯ Highlights

  • Full fraud workflow: upload raw transaction CSVs, clean them, engineer features, detect fraud patterns, train models, and visualize the results.
  • Explainable fraud rules: every suspicious transaction can be tied back to explicit business-style patterns, not just a black-box score.
  • Dashboard-ready API: the backend returns dataset summaries, quality scoring, distributions, top risky transactions, confusion matrix, thresholds, and feature importance.
  • Hackathon-friendly storytelling: this project is built to clearly show what was cleaned, what patterns were triggered, what the model learned, and how the UI helps investigators drill down.

🧠 Core idea

  • Rule + ML hybrid: the system combines interpretable fraud patterns with machine learning.
  • Cleaning-first pipeline: the model only works after standardizing dirty amounts, timestamps, cities, IPs, devices, and statuses.
  • Interactive investigation: users can inspect distributions, filter by pattern, and click into risky transactions from the frontend.

πŸ“– Introduction

Fraud datasets are rarely clean, and fraud detection is rarely just a modeling problem. The real challenge is building a pipeline that can:

  • handle messy CSVs from the real world
  • normalize inconsistent transaction fields
  • detect suspicious behavioral patterns
  • produce explainable fraud flags
  • score transactions with machine learning
  • return outputs that a frontend can actually visualize and investigate

This project does exactly that.

It combines a FastAPI backend and a React + Vite frontend into a fraud-intelligence system that accepts raw transaction files and returns:

  • cleaned, model-ready data
  • quality-of-data diagnostics
  • fraud pattern summaries
  • transaction-level fraud predictions
  • model metrics and benchmark reports
  • frontend-friendly distributions for charts and drilldowns

🎯 What The System Does

Backend responsibilities

  • Normalizes raw schema aliases
  • Cleans:
    • transaction amounts
    • timestamps
    • user and merchant cities
    • payment methods
    • merchant categories
    • device identifiers
    • statuses
    • IP addresses
  • Builds engineered fraud features
  • Detects rule-based fraud patterns
  • Trains and scores:
    • Random Forest
    • XGBoost
  • Returns rich analytics payloads for the UI

Frontend responsibilities

  • Uploads CSVs
  • Shows dataset summary and data quality
  • Visualizes:
    • all city distributions
    • merchant city distributions
    • payment method distributions
    • merchant category distributions
    • device type distributions
    • fraud pattern counts
  • Shows model performance cards
  • Supports transaction drilldown and pattern-based filtering

πŸ—οΈ Architecture

Instead of treating this as only a model-training app, the project is designed as a fraud intelligence pipeline.

graph LR
    A["Raw Transaction CSV"] --> B["Cleaning + Standardization Pipeline"]
    B --> C["Feature Engineering"]
    C --> D["Fraud Pattern Detection"]
    D --> E["Model Training / Scoring"]
    E --> F["Analytics API Response"]
    F --> G["Interactive React Dashboard"]
Loading

πŸ•΅οΈ Patterns We Detect

The current fraud rule engine uses 8 patterns.

Pattern Meaning Why it matters
pattern_location_mismatch User city and merchant city do not match Useful for spotting suspicious geographic behavior
pattern_odd_hour_transaction Transaction happens between 12 AM and 4 AM with other suspicious context Fraud often spikes during low-supervision hours
pattern_high_amount_vs_balance Amount is unusually high relative to account balance Captures draining behavior and risky spend
pattern_unknown_device Device is unseen, malformed, or suspicious for that user Strong account takeover signal
pattern_failed_high_value High-value transaction fails Often seen in fraud probing and card testing
pattern_ip_risk Invalid IP, shared IP, or high IP traffic volume Helps catch shared infrastructure abuse
pattern_velocity Too many transactions in a short window Useful for burst attacks and rapid retries
pattern_post_failure_success Success occurs after a streak of failures Classic attack-recovery pattern

Current tuned pattern logic

  • Velocity is now stricter to reduce overfiring:
    • txn_count_1min > 8
    • or txn_count_1h > 25 with time_diff < 5
  • Post-failure success now requires:
    • success after at least 3 consecutive failures

This tuning helped reduce false aggressiveness on smaller sample files while keeping the 10k fraud dataset closely aligned with its intended label distribution.


πŸ“¦ Engineered Features

The system does not rely only on raw columns. It derives fraud-focused features such as:

  • clean_amount
  • account_balance
  • txn_count_1min
  • txn_count_1h
  • time_diff
  • consecutive_failures
  • device_user_degree
  • ip_velocity_all_users
  • ip_user_degree
  • payment_method_entropy_10m
  • balance_depletion_ratio
  • amount_to_balance_ratio
  • is_cross_city
  • hour
  • is_odd_hour
  • is_post_failure_success
  • anomaly_score

These become the bridge between raw transaction data and model-ready signals.


πŸ“Š Benchmark Pattern Summary

On the labeled benchmark file transactions_10k_with_fraud.csv, the tuned rule engine produced:

Metric Value
Total rows 10,000
Source fraud labels 3,454
Tuned pipeline fraud flags 3,435
Tuned pipeline non-fraud flags 6,565

Pattern counts on the same benchmark

Pattern Trigger count
pattern_location_mismatch 2,060
pattern_odd_hour_transaction 1,156
pattern_unknown_device 506
pattern_failed_high_value 397
pattern_high_amount_vs_balance 237
pattern_post_failure_success 5
pattern_ip_risk 0
pattern_velocity 0

This is a much healthier result than the earlier over-aggressive rules, because the tuned fraud count stays very close to the dataset's actual fraud distribution.


πŸ€– Model Performance

The current benchmark below was run on:

  • Dataset: transactions_10k_with_fraud.csv
  • Rows used: 10,000
  • Feature count: 21
  • Target column: raw_fraud_label
  • Label mode: real

Model comparison

Model Accuracy Precision Recall F1 ROC-AUC
Random Forest 0.9480 0.9292 0.9184 0.9238 0.9798
XGBoost 0.9475 0.9291 0.9169 0.9230 0.9832

Confusion matrix snapshot

Random Forest

TN FP FN TP
1266 48 56 630

XGBoost

TN FP FN TP
1266 48 57 629

Full-dataset fraud predicted

Model Fraud predicted Non-fraud predicted Fraud rate
Random Forest 3,372 6,628 33.72%
XGBoost 3,387 6,613 33.87%

Why these metrics matter

  • Accuracy tells us overall correctness
  • Precision tells us how many flagged frauds are truly fraud
  • Recall tells us how many frauds we successfully catch
  • F1 balances precision and recall
  • ROC-AUC tells us how well the model separates fraud from non-fraud across thresholds

For a hackathon demo, this is useful because it shows both:

  • interpretable fraud rules
  • strong predictive performance

⚑ Algorithms Used

1. Random Forest

Random Forest is an ensemble of decision trees trained on bootstrapped samples of the data.

Why we use it

  • strong baseline
  • robust on tabular fraud data
  • handles mixed numeric and categorical features well after preprocessing
  • easy to compare against XGBoost

Practical complexity

  • Training is roughly proportional to:
    • number of trees
    • number of rows
    • number of features sampled per split
  • In intuition terms:
    • more trees = better stability, more compute

2. XGBoost

XGBoost is a gradient-boosted tree algorithm that builds trees sequentially to correct prior errors.

Why we use it

  • one of the strongest tabular ML algorithms
  • excellent for fraud scoring
  • handles non-linear behavior well
  • supports class imbalance tuning

Practical complexity

  • Training is roughly proportional to:
    • number of boosting rounds
    • number of rows
    • number of features
    • tree depth
  • In intuition terms:
    • more boosting rounds and deeper trees = more power, more compute

Current tuned XGBoost parameters

XGBClassifier(
    n_estimators=450,
    max_depth=5,
    learning_rate=0.03,
    subsample=0.75,
    colsample_bytree=0.75,
    min_child_weight=5,
    gamma=1.0,
    reg_lambda=4.0,
    reg_alpha=0.5,
)

πŸ–₯️ Frontend Experience

The frontend is not just a file uploader. It is designed like a lightweight fraud investigation console.

What the UI shows

  • upload flow for raw CSVs
  • KPI cards:
    • rows
    • columns
    • fraud counts
    • quality score
  • data quality badge:
    • good
    • warning
    • poor
  • full distributions for:
    • cities
    • merchant cities
    • payment methods
    • merchant categories
    • device types
    • statuses
  • pattern summary cards
  • confusion matrix and threshold tables
  • top risky transactions
  • transaction explorer with pattern filtering

Frontend stack

  • React
  • TypeScript
  • Vite
  • React Router
  • custom analytics UI

Why this matters in a hackathon

The frontend helps judges immediately understand:

  • what data came in
  • how dirty it was
  • what the pipeline fixed
  • which fraud patterns fired
  • how the models performed
  • which transactions are most suspicious

🧼 Data Cleaning and Quality Layer

The cleaning pipeline explicitly handles:

  • inconsistent timestamps
  • malformed or missing city names
  • invalid IP addresses
  • messy amount formats
  • unknown payment methods
  • malformed device IDs
  • status normalization

Example cleaning actions exposed to the frontend

  • invalid_ip
  • amount_normalization
  • timestamp_parsing
  • city_normalization
  • device_normalization
  • status_normalization

Data quality scoring

The API returns a quality score and level so the frontend can visually communicate dataset reliability.

That means the UI can say:

  • this dataset is clean and trustworthy
  • this dataset is partially inconsistent
  • this dataset is poor quality and needs caution

πŸ”Œ API Outputs For The Frontend

The backend returns more than predictions. It returns a complete analytics payload.

POST /api/v1/clean-csv

Returns:

  • row count
  • column count
  • preview rows
  • cleaning actions
  • quality metrics
  • all major distributions
  • pattern summary
  • cleaned CSV download URL

POST /api/v1/predict-csv

Returns:

  • dataset summary
  • quality metrics
  • distributions
  • pattern summary
  • top risky transactions
  • per-model metrics
  • confusion matrix
  • threshold table
  • feature importance
  • predictions download URLs
  • threshold report download URLs

βš™οΈ Usage

1. Install dependencies

# Frontend
cd fraudlenz
npm install

# Backend
cd server
pip install -r requirements.txt

2. Run locally

# Backend
cd server
uvicorn app.server:app --reload

# Frontend
cd fraudlenz
npm run dev

3. Example API calls

curl -X POST "http://127.0.0.1:8000/api/v1/clean-csv" ^
  -F "file=@C:\path\to\sample.csv"
curl -X POST "http://127.0.0.1:8000/api/v1/predict-csv" ^
  -F "file=@C:\path\to\sample.csv"

πŸš€ Deployment

Live backend

Deployment note

The backend is deployed on Render because the Python ML stack is too heavy for a small serverless Vercel Python function bundle.

The frontend can point to the backend with:

VITE_API_BASE_URL=https://bastikahasti-ml.onrender.com

πŸ“ Project Structure

BastiKaHasti_ML/
β”œβ”€ server/
β”‚  β”œβ”€ app/
β”‚  β”‚  β”œβ”€ api/
β”‚  β”‚  β”œβ”€ core/
β”‚  β”‚  β”œβ”€ model/
β”‚  β”‚  β”œβ”€ schemas/
β”‚  β”‚  └─ server.py
β”‚  └─ storage/
└─ fraudlenz/
   β”œβ”€ src/
   └─ package.json

⭐ Why This Project Stands Out

This project is not only about training a fraud model. It is about making fraud detection:

  • explainable
  • visual
  • benchmarked
  • deployable
  • usable by a frontend team

It shows the full story from raw CSV to cleaned data to fraud patterns to model performance to interactive investigation.

If you want to build a hackathon demo that feels like a real fraud-ops product instead of just a notebook, this is that system.

About

Hackathon Project - Fraud Detection using XGBoost and Random Forest

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors