A comprehensive data engineering platform demonstrating production-grade ETL pipelines that ingest, transform, validate, and deliver analytical data from 20+ heterogeneous sources into a Snowflake data warehouse, modeled with the Data Vault 2.0 methodology.
```text
 DATA SOURCE LAYER
 Excel Reports · CSV Exports · SFTP Vendor · API Platforms · Parquet Files
        │
        ▼
 AWS S3 DATA LAKE
 s3://data-lake/                 s3://pipeline/
 ├── excel_reports/              ├── stage1/   (extracted)
 ├── csv_exports/                ├── stage2/   (mapped)
 ├── vendor_data/                ├── stage3/   (distributed)
 └── api_dumps/                  └── exports/  (delivered)
        │
        ▼
 APACHE AIRFLOW ORCHESTRATION
 1. EXTRACT PHASE
    • Excel Extractor        (skiprows, multi-sheet, column rename)
    • CSV Platform Extractor (S3 Select, chunked read, server-side SQL filter)
    • FTP Scanner → FTP-to-S3 Transfer
                             (buffered streaming, incremental detect,
                              metadata tracking)
 2. VALIDATE PHASE (Great Expectations, source data)
    • Column schema checks      • Value range validation
    • Null checks               • Row count bounds
    • Date format validation    • Type checking
 3. TRANSFORM PHASE
    • Category Mapping                 (raw names → standard codes,
                                        unmapped report, campaign remap)
    • National → Regional Distribution (weighted geo extrapolation,
                                        population shares)
    • Region Code Conversion           (DMA → RMA cross-reference map)
 4. LOAD PHASE
    • Classic Pipeline (S3 → Snowflake):
      1. Create Schema  2. Create Stage  3. Create File Format
      4. COPY INTO table  5. Auto-detect columns
    • Data Vault Pipeline (S3 → Hub/Link/Sat):
      1. Load YAML config  2. Generate Hub INSERTs  3. Generate Link INSERTs
      4. Generate Sat INSERTs  5. Hash key generation
 5. VALIDATE PHASE (Warehouse)
    • Post-load row counts      • Referential integrity
    • Value distributions       • Schema consistency
        │
        ▼
 SNOWFLAKE DATA WAREHOUSE
 ├── DATA_VAULT DB: h_category, h_platform, h_creative, h_region,
 │                  nhl_* (links), lsat_* / hsat_* (satellites)
 ├── PIPELINE DB:   source tables, staging tables
 └── REPORTING DB:  _UNIFIED_METRICS, PIVOT_CATEGORY, PIVOT_PORTFOLIO,
                    PIVOT_SECTOR; exports → S3, dashboards
```
```text
Pipeline Failure
 │
 ├── Slack Notification (channel routing by DAG owner)
 │    • DAG ID, Task ID, Execution Date
 │    • Error message, Airflow UI link
 │
 └── Sentry Error Tracking
      • Exception capture with tags
      • DAG/task/data_source context
      • Validation failure alerts
```
```text
pandas-cloud-project/
│
├── dags/                                # Airflow DAG definitions
│   ├── constants.py                     # Central config: connections, buckets, DBs
│   ├── sample_excel_pipeline.py         # Excel → S3 → Snowflake (full chain)
│   ├── sample_csv_pipeline.py           # CSV with S3 Select + parallel branches
│   ├── sample_data_vault_pipeline.py    # Data Vault 2.0 loading (Hub/Link/Sat)
│   ├── sample_ftp_ingestion.py          # SFTP monitor → S3 → Parquet → Snowflake
│   ├── delivery/
│   │   └── sample_pivot_tables.py       # Aggregation + pivot tables + export
│   └── operators/
│       └── data_vault_operator.py       # YAML-driven Data Vault loader
│
├── pipeline/                            # Core processing modules
│   ├── extractors/
│   │   ├── base_extractor.py            # Abstract extractor interface
│   │   ├── excel_extractor.py           # Multi-sheet Excel parsing
│   │   ├── csv_extractor.py             # Chunked CSV + S3 Select
│   │   └── multi_sheet_extractor.py     # Regex-classified multi-channel Excel
│   ├── operators/
│   │   ├── extract_operator.py          # S3 source → extract → S3 dest
│   │   ├── category_map_operator.py     # Raw names → standard codes
│   │   ├── geo_distribution_operator.py # National → regional weighted split
│   │   └── worker_operator.py           # Parallel work partitioning base
│   ├── delivery/
│   │   ├── create_schema.py             # Snowflake CREATE SCHEMA
│   │   ├── create_stage.py              # Snowflake external stage (S3)
│   │   ├── create_file_format.py        # CSV/Parquet format definitions
│   │   ├── create_table_from_s3.py      # Auto-detect + COPY INTO
│   │   ├── create_strategy.py           # OR REPLACE / IF NOT EXISTS enum
│   │   ├── utils.py                     # SQL template reader, schema naming
│   │   └── sql/                         # SQL templates
│   │       ├── create_schema.sql
│   │       ├── create_stage.sql
│   │       ├── create_file_format.sql
│   │       ├── create_table_from_s3.sql
│   │       ├── aggregate_metrics.sql    # Multi-source UNION + LEFT JOIN
│   │       └── create_pivot_table.sql   # Time-granularity aggregation
│   ├── connectors/
│   │   ├── ftp_to_s3.py                 # Buffered SFTP → S3 transfer
│   │   ├── ftp_scanner.py               # FTP directory monitoring sensor
│   │   └── snowflake_bulk_hook.py       # Batched INSERT operations
│   ├── validation/
│   │   ├── source_validation_operator.py    # Pre-load GE validation (S3)
│   │   └── warehouse_validation_operator.py # Post-load GE validation (SF)
│   ├── notifications/
│   │   ├── slack_notifications.py       # Formatted Slack alerts
│   │   └── error_tracking.py            # Sentry exception capture
│   ├── utils/
│   │   ├── date_utils.py                # Week normalization, ISO year
│   │   ├── s3_utils.py                  # DataFrame ↔ S3 I/O
│   │   └── parallel_utils.py            # Multiprocessing + Dask helpers
│   └── decorators.py                    # Runtime arg injection (Variable/XCom/Conf)
│
├── data_vault_config/
│   └── sample_display_platform.yaml     # Hub-Link-Satellite YAML schema
│
├── great_expectations/
│   ├── great_expectations.yml           # GE config (S3 + Snowflake datasources)
│   └── expectations/
│       └── sample_source_suite.json     # Column/value/type expectations
│
├── alembic/
│   └── versions/
│       └── 001_initial_data_vault_schema.py # Hub/Link/Sat DDL + seed data
│
├── docker/
│   ├── Dockerfile                       # Production Airflow image
│   └── docker-compose.yaml              # Local dev (Postgres + S3 mock + Airflow)
│
├── tests/
│   └── test_sample_extractor.py         # Unit tests for extractors + utils
│
├── .argo-ci/
│   └── pipeline.yaml                    # CI/CD: test → build → deploy (Kaniko + K8s)
│
└── README.md                            # This file
```
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | Apache Airflow 1.10 | DAG scheduling, task dependencies, UI |
| Processing | pandas, NumPy | DataFrame transformations, aggregations |
| Parallel | Dask, multiprocessing | Distributed/multi-core data processing |
| Warehouse | Snowflake | Cloud data warehouse (COPY INTO, stages) |
| Data Model | Data Vault 2.0 | Hub-Link-Satellite schema (Alembic migrations) |
| Validation | Great Expectations | Automated data quality checks |
| Storage | AWS S3 | Data lake, staging, exports |
| Ingestion | SFTP, S3 Select | Vendor file transfer, server-side filtering |
| Formats | CSV, Parquet, Excel | Multi-format extraction (openpyxl, pyarrow) |
| Notifications | Slack, Sentry | Pipeline failure alerts, error tracking |
| Containers | Docker, Docker Compose | Reproducible environments |
| CI/CD | Argo Workflows, Kaniko | GitOps deployment to Kubernetes |
| Infrastructure | Kubernetes, ECR | Container orchestration, image registry |
1. Excel Report Pipeline (dags/sample_excel_pipeline.py)
Pattern: Excel workbook β multi-stage S3 processing β Snowflake table
- Skips header rows in vendor reports
- Renames vendor-specific columns to standard schema
- Handles fallback category classification
- Filters invalid data (negative values)
- Normalizes dates to week boundaries
- Full chain: Extract β Validate β Map β Distribute β Create Schema β Load β Validate
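The cleanup steps above (column renaming, negative-value filtering, week normalization) can be sketched in plain Python. This is a hypothetical illustration, not the project's actual code; the real pipeline operates on pandas DataFrames, and the column names here are invented:

```python
from datetime import date, timedelta

# Hypothetical vendor → standard column mapping (illustrative names only).
COLUMN_MAP = {"Vendor Spend ($)": "spend", "Report Date": "report_date"}

def week_start(d: date) -> date:
    """Snap a date to the Monday that begins its week."""
    return d - timedelta(days=d.weekday())

def clean_rows(rows):
    """Rename vendor columns, drop negative spend, align dates to week start."""
    cleaned = []
    for row in rows:
        row = {COLUMN_MAP.get(k, k): v for k, v in row.items()}
        if row["spend"] < 0:  # filter invalid (negative) values
            continue
        row["report_date"] = week_start(row["report_date"])
        cleaned.append(row)
    return cleaned
```

The same shape applies per sheet in the multi-sheet case; only the source of `rows` changes.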
2. CSV Platform Pipeline (dags/sample_csv_pipeline.py)
Pattern: Large CSV exports β server-side filtering β parallel branches β merge
- S3 Select: SQL-based server-side row filtering before download
- Chunked reading: Memory-efficient processing of multi-GB files (1M rows/chunk)
- Parallel branches: Impressions and spend processed independently
- Epoch-based week normalization: Integer arithmetic for date alignment
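The epoch-based normalization can be sketched as pure integer arithmetic; the Monday anchor is an assumption here (the actual pipeline may align weeks differently):

```python
# Epoch-based week normalization: no datetime parsing, just modular arithmetic.
# 1970-01-01 (epoch zero) was a Thursday, i.e. 3 days after a Monday, so we
# shift by 3 days before taking the remainder.
SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY
MONDAY_SHIFT = 3 * SECONDS_PER_DAY  # assumption: weeks start on Monday

def normalize_to_week(epoch_seconds: int) -> int:
    """Return the epoch timestamp of Monday 00:00 UTC of the containing week."""
    return epoch_seconds - (epoch_seconds + MONDAY_SHIFT) % SECONDS_PER_WEEK
```

Because this is plain integer math, it vectorizes cheaply over a whole DataFrame column of epoch timestamps.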
3. Data Vault Pipeline (dags/sample_data_vault_pipeline.py)
Pattern: S3 files β YAML-configured Data Vault loading β Hub/Link/Satellite tables
- YAML-driven schema mapping: No hardcoded SQL per source
- Hash key generation: MD5-based entity identification
- Multi-channel parallel loading: Display, Search, Video loaded independently
- S3 sensor: Waits for new files before processing
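Hash key generation in the Data Vault style can be sketched as follows. The delimiter and normalization rules (trim, uppercase) are common Data Vault conventions assumed here, not necessarily the project's exact choices:

```python
import hashlib

def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Deterministic MD5 hash key over normalized business keys.

    Normalizing (strip + uppercase) before hashing makes the key stable
    across sources that format the same entity differently.
    """
    normalized = delimiter.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

The same function serves Hubs (one business key) and Links (the concatenated keys of the Hubs they relate), which is what makes the parallel per-channel loads independent.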
4. FTP Ingestion Pipeline (dags/sample_ftp_ingestion.py)
Pattern: SFTP monitoring β incremental transfer β format conversion β warehouse load
- FTP Scanner sensor: Polls SFTP, tracks processed files in S3 metadata
- Buffered transfer: 1MB chunk streaming for large files
- Format conversion: CSV/Excel β Parquet (columnar, compressed)
- Scheduled weekly: Cron-based execution
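The buffered transfer amounts to a fixed-chunk copy loop. In this sketch, plain file-like objects stand in for the SFTP handle and S3 upload stream used by the real connector:

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, matching the buffered transfer above

def stream_copy(src, dst) -> int:
    """Copy src to dst in fixed-size chunks so large files never load fully
    into memory. Returns the number of bytes transferred."""
    total = 0
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

The returned byte count can be recorded alongside the S3 metadata the scanner uses for incremental detection.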
5. Delivery Pipeline (dags/delivery/sample_pivot_tables.py)
Pattern: Multi-source aggregation β pivot tables β access grants β S3 export
- UNION-based aggregation: Combines all source tables into unified view
- Multi-granularity pivots: Week, Month, Quarter roll-ups
- RBAC grants: Automatic role-based access control
- S3 export: Parquet export for external consumption
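The roll-up the pivot SQL performs can be illustrated with SQLite standing in for Snowflake; table and column names are invented, and `strftime` substitutes for Snowflake's `DATE_TRUNC`:

```python
import sqlite3

# In-memory stand-in for the unified metrics view built by the UNION step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unified_metrics (metric_date TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO unified_metrics VALUES (?, ?)",
    [("2024-01-08", 100.0), ("2024-01-15", 50.0), ("2024-02-05", 25.0)],
)

# Month-granularity roll-up; week and quarter variants differ only in the
# grouping expression.
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m', metric_date) AS month, SUM(spend) AS total_spend
    FROM unified_metrics
    GROUP BY month
    ORDER BY month
    """
).fetchall()
```

Swapping the grouping expression per granularity is what lets one SQL template produce the week, month, and quarter pivot tables.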
Data moves through numbered stages (stage1/, stage2/, stage3/) in S3.
Each stage is the output of one operator, enabling:
- Debugging: Inspect intermediate results at any stage
- Idempotency: Re-run any stage without re-processing upstream
- Incremental processing: Skip already-processed files
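A minimal sketch of the staged key scheme, assuming keys are laid out as `<stage>/<source>/<filename>` (the exact scheme is an assumption based on the layout above):

```python
# Ordered stage prefixes under s3://pipeline/; each operator reads one stage
# and writes the next, so any stage can be re-run or inspected in isolation.
STAGES = ["stage1", "stage2", "stage3", "exports"]

def stage_key(stage: str, source: str, filename: str) -> str:
    """Build the S3 key for a file at a given processing stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"{stage}/{source}/{filename}"

def next_stage(stage: str) -> str:
    """The stage an operator writes to, given the stage it reads from."""
    return STAGES[STAGES.index(stage) + 1]
```

Checking for an existing key at the output stage before processing is what makes re-runs skip already-processed files.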
The `OverrideArgs` decorator pattern allows operators to receive values from Airflow Variables, XCom, or DAG-run config at execution time rather than at DAG parse time. This enables:
- Environment-specific schemas (DEV/STAGING/PROD)
- Version-based delivery schemas
- Dynamic configuration without DAG code changes
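The injection idea can be sketched with a plain decorator. `override_args` is an illustrative stand-in for the project's `OverrideArgs`, and the `runtime_context` dict stands in for Airflow Variables / XCom / `dag_run.conf`:

```python
import functools

def override_args(*overrides):
    """Resolve the named keyword arguments from a runtime context at call
    time, so defaults baked in at DAG parse time can be overridden per run."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, runtime_context=None, **kwargs):
            runtime_context = runtime_context or {}
            for name in overrides:
                if name in runtime_context:
                    kwargs[name] = runtime_context[name]  # runtime value wins
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@override_args("target_schema")
def load_table(table, target_schema="DEV"):
    return f"{target_schema}.{table}"
```

With this shape, triggering a run with `{"target_schema": "PROD"}` in the run config redirects the load without touching DAG code.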
Raw data is loaded into a Data Vault schema (Hubs, Links, Satellites) before transformation. This provides:
- Auditability: Full load history with timestamps
- Agility: Add new sources without schema changes
- Parallel loading: Independent loads per source
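The YAML that drives a Data Vault load might look roughly like this. The shape below is an illustration only; the field names are assumptions, not the actual schema of `data_vault_config/sample_display_platform.yaml`:

```yaml
# Illustrative Hub/Link/Satellite mapping for one source (names invented).
source: display_platform
hubs:
  - table: h_platform
    business_key: platform_name
  - table: h_category
    business_key: category_code
links:
  - table: nhl_platform_category
    hubs: [h_platform, h_category]
satellites:
  - table: hsat_platform_details
    parent: h_platform
    attributes: [impressions, spend, report_week]
```

Because the operator generates INSERTs from this mapping, onboarding a new source means adding a YAML file rather than writing SQL.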
Great Expectations validates data at two points:
- Source validation (S3): Before transformation, catch bad source files early
- Warehouse validation (Snowflake): After loading, verify data integrity
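A source-side expectation suite might look like the sketch below. The expectation type names are standard Great Expectations expectations; the column name and bounds are invented for illustration and are not taken from `sample_source_suite.json`:

```json
{
  "expectation_suite_name": "sample_source_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "spend"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "spend", "min_value": 0}
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {"min_value": 1, "max_value": 5000000}
    }
  ]
}
```

The same suite format works against both datasources, which is what lets one set of expectations cover the S3 and Snowflake validation points.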
| Source Type | Extractor | Example |
|---|---|---|
| Excel workbook (single sheet) | ExcelReportExtractor | Vendor spend reports |
| Excel workbook (multi-sheet, regex-classified) | MultiSheetExtractor | Platform channel reports |
| CSV (large, server-side filtered) | CSVPlatformExtractor | Ad platform exports |
| CSV (V2 simplified schema) | CSVPlatformExtractorV2 | Updated format exports |
| SFTP file transfer | FTPToS3Operator | Vendor data delivery |
| SFTP directory monitoring | FTPScannerSensor | Incremental file detection |
| Parquet (columnar) | CreateTableFromS3 | Converted/optimized data |
| Snowflake table-to-table | DataVaultOperator | Cross-database transforms |
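All of these share one contract. The sketch below is a hypothetical reading of the `base_extractor.py` interface based on the module layout; the method names are assumptions, not the actual code:

```python
from abc import ABC, abstractmethod
import csv
import io

class BaseExtractor(ABC):
    """Every extractor turns raw source bytes into standardized row dicts."""

    @abstractmethod
    def extract(self, raw: bytes) -> list[dict]:
        """Parse raw bytes into rows keyed by standard column names."""

class SimpleCSVExtractor(BaseExtractor):
    """Minimal concrete extractor; the project's CSVPlatformExtractor layers
    chunking and S3 Select on top of this basic contract."""

    def extract(self, raw: bytes) -> list[dict]:
        return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
```

Downstream operators only depend on the `extract` contract, so adding a new source type means adding one subclass.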
```shell
# 1. Start the local development environment
cd docker
docker-compose up -d

# 2. Open the Airflow UI
open http://localhost:8080

# 3. Run tests
pytest tests/ -v

# 4. Trigger a sample DAG (Airflow 1.x CLI syntax)
airflow trigger_dag media_report
```
```text
PR Created                 PR Merged to Master
    │                              │
    ▼                              ▼
Lint + Test                  Check Version
                                   │
                                   ▼
                             Bump Version
                                   │
                                   ▼
                          Kaniko Docker Build
                                   │
                                   ▼
                             Push to ECR
                                   │
                                   ▼
                         Update K8s Manifests
```
This project demonstrates data engineering patterns for multi-source analytics pipelines. All names, values, and configurations are sanitized samples for portfolio demonstration.