A tiny PySpark job packaged with uv (PEP 621) and a built-in CLI to deploy to EMR Serverless 🚀
Before you can deploy the application, you need to provision the necessary AWS resources. This project uses Terraform to manage this infrastructure as code.
For a complete guide to setting up the S3 buckets, ECR repository, and IAM roles with Terraform, see the Infrastructure Setup with Terraform Guide.
If you want to manually create IAM roles and policies, or need a detailed overview of the required permissions for EMR Serverless, CloudWatch logging, and Apache Iceberg integration, see the EMR Serverless, CloudWatch & Iceberg Setup Guide.
If you're new to EMR Serverless, check out this helpful YouTube tutorial: Getting Started with EMR Serverless (YouTube)
The GitHub repository used in the tutorial can be found here: johnny-chivers/emr-serverless
We upload two artifacts to S3 for each deploy:
- `main.py` -- used as the Spark `entryPoint` (S3 URI)
- `code_*.zip` -- added via `--py-files` (S3 URI). Spark automatically adds the ZIP root to `PYTHONPATH`.
📦 The pack step copies `src/` contents into the ZIP root, so imports like `import emr_dummy` work on the cluster without extra setup.
A frequent point of confusion with EMR Serverless is why you need both a Docker image and a zip file of your code in S3. Here's a clear breakdown of their roles:
The Docker image:
- What it is: A self-contained environment with the operating system, Python, and all your heavy third-party dependencies (like `pandas`, `pyspark`, `boto3`) installed via `uv`.
- When to build: You only need to build and push a new image when your `pyproject.toml` or `uv.lock` file changes.
- Purpose: Provides a stable, reproducible, and quickly-startable runtime for your job.
The code ZIP:
- What it is: A lightweight archive containing only your application's source code (i.e., the contents of your `src` directory). It should NOT contain any dependencies.
- When to build: This is created and uploaded on every single deployment.
- Purpose: Allows for rapid iteration. You can change your Python logic and redeploy in seconds without waiting for a multi-minute Docker build.
At runtime, EMR Serverless performs these steps:
- Starts a container from your specified Docker image.
- Downloads your `code.zip` and `main.py` from S3.
- Places the contents of the zip file onto the `PYTHONPATH`.
- Executes your `main.py` entry point.
This separation is powerful: you get the stability of a Docker image for your environment and the speed of a simple file upload for your code.
- 🐍 Python 3.10 (required for EMR 6.x + PySpark 3.5 compatibility)
- 📦 `uv` installed (e.g. `brew install uv`)
- 🔑 AWS CLI with credentials configured (`aws configure` or `aws sso login`)
- ☁️ An EMR Serverless application created in AWS
The diagram below shows the main paths a developer can take after setting up prerequisites and environment variables.
```mermaid
flowchart TD
    A([Start]) --> B[Set up prerequisites & environment variables]
    B --> C{What do you want to do?}
    C -- Validate config (no run) --> E[Run dry-run deployment]
    C -- Deploy to EMR Serverless --> F[Full deploy to EMR Serverless]
    C -- Schedule for production --> G[Use Airflow or another orchestrator]
    E --> H([End])
    F --> H
    G --> H
```
Related Sections In This README:
This repository is designed to serve as a template for building Reproducible Analytical Pipelines (RAP) for PySpark on AWS EMR Serverless.
RAP is a methodology for data analysis that incorporates software engineering best practices to create processes that are reproducible, auditable, efficient, and high-quality. The goal is to automate analytical pipelines from end-to-end, minimising manual steps and maximising trust in the results.
This project embodies the core principles of RAP:
- 📦 Environment Reproducibility: The `Dockerfile` and `uv.lock` file guarantee a consistent, reproducible environment with pinned dependencies for every job run.
- 🤖 Automation: The `deploy-to-emr` CLI script automates the entire deployment process -- packaging code, managing infrastructure, and submitting jobs -- eliminating manual, error-prone steps.
- 🔍 Auditability: By using version control (like Git) and the immutable, versioned S3 layout for artifacts, every deployment creates a full audit trail. The generated `manifest.json` links a specific code version to the exact artifacts used in a run.
- ✅ Quality Assurance: The structure encourages modern development practices like code linting, testing, and peer review through pull requests, leading to higher-quality analysis code.
- 📖 Embedded Documentation: Code is well-commented, and documentation is included and version-controlled within the project itself.
By using this template, you can build robust, production-ready PySpark data pipelines that are efficient, transparent, and easy to maintain.
```bash
uv venv
source .venv/bin/activate   # macOS/Linux
# Windows:
# .venv\Scripts\activate

# Install with dev extras so PySpark works locally
# (cluster already has PySpark -- no need to ship it)
uv pip install -e ".[dev]"
```

💡 Tip: uv is fast -- dependency resolution and installs are near-instant compared to Poetry or pip.
Set these in your shell or a .env file (we auto-load it):
```bash
REGION=eu-west-2
S3_BUCKET=your-artifacts-bucket
EMR_APP_ID=00fulej7qh7jt90t
EMR_EXECUTION_ROLE=arn:aws:iam::<your-aws-account-id>:role/YourEmrServerlessExecutionRole
DEPLOY_ENV=dev  # optional, used in S3 prefix
IMAGE_URI=<your-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com/emr-pyspark:7.9.0
APP_NAME=emr-spark-uv
RELEASE_LABEL=emr-7.9.0
```

- `EMR_EXECUTION_ROLE` must be an IAM role ARN that EMR Serverless can assume.
- Do NOT use the "Application ARN" -- it will fail with a `ValidationException`.
- PySpark is a dev dependency only -- EMR Serverless clusters already include it.
- Use env vars instead of hardcoding for portability and security.
- For production deployments 🚀, it's recommended to orchestrate EMR Serverless job runs with a workflow scheduler such as Apache Airflow.
  - Airflow can trigger this CLI or use the `boto3` EMR Serverless API directly (see the DAG sketch after this list).
  - Benefits include:
    - Automatic retries & failure alerts
    - Dependency management between multiple jobs
    - Scheduling and SLA monitoring
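If you go the Airflow route, a minimal DAG sketch using the Amazon provider's EMR Serverless operator might look like this (assumes `apache-airflow-providers-amazon` is installed; the application ID, role ARN, and S3 paths are the example values used elsewhere in this README):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

with DAG(
    dag_id="emr_serverless_pyspark_job",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    submit_job = EmrServerlessStartJobOperator(
        task_id="submit_pyspark_job",
        application_id="00fulej7qh7jt90t",
        execution_role_arn="arn:aws:iam::<your-aws-account-id>:role/YourEmrServerlessExecutionRole",
        job_driver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/main_ab12cd34.py",
                "sparkSubmitParameters": "--py-files s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/code_ab12cd34.zip",
            }
        },
        configuration_overrides={},
    )
```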
You can now run the entire deployment process with a single command, or select specific steps.
Full Deploy (Build Image, Create App, Package, and Submit):

```bash
uv run deploy-to-emr
```

Run Specific Steps:

```bash
# Build and push the Docker image
uv run deploy-to-emr --build-image

# Create or update the EMR Serverless application
uv run deploy-to-emr --create-app

# Package the application and upload to S3
uv run deploy-to-emr --package

# Submit the job to EMR Serverless
uv run deploy-to-emr --submit
```

A full deploy will:
- Build & push the Docker image with your dependencies.
- Create or update the EMR Serverless application with the new image.
- Build & package your Python code and dependencies into a `.zip` ready for Spark (`--py-files`).
- Upload artifacts (`main.py` and the `.zip`) to a versioned, immutable S3 release folder.
- Generate a manifest (`manifest.json`) capturing package metadata, Python version, and artifact paths.
- Submit the job via the AWS `StartJobRun` API to EMR Serverless (sketched below).
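Under the hood, that submit step roughly amounts to a `boto3` `start_job_run` call -- a sketch, not the CLI's exact code, using the example IDs and S3 paths from the dry-run payload shown later:

```python
import boto3

client = boto3.client("emr-serverless", region_name="eu-west-2")

response = client.start_job_run(
    applicationId="00fulej7qh7jt90t",
    executionRoleArn="arn:aws:iam::<your-aws-account-id>:role/YourEmrServerlessExecutionRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/main_ab12cd34.py",
            "sparkSubmitParameters": "--py-files s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/code_ab12cd34.zip",
        }
    },
    executionTimeoutMinutes=60,
)
print(response["jobRunId"])  # track the run in the console or via get_job_run
```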
- ✅ Full end-to-end run in the actual EMR Serverless environment.
- ✅ Validate packaging, dependency resolution, and S3 artifact uploads.
- ✅ Use before production runs to ensure parity with live infrastructure.
💡 Tip: for production scheduling, consider using Apache Airflow or another orchestrator to trigger this command as part of a managed pipeline.
To avoid incurring ongoing AWS charges, you can easily stop and delete the EMR Serverless application when you are finished.
```bash
uv run deploy-to-emr --cleanup
```

This command will:
- Stop the EMR Serverless application.
- Wait for it to fully stop.
- Delete the application permanently.
- Remove the local `.emr_app_id` file.

The command uses `EMR_APP_ID` from your `.env` file or falls back to the `.emr_app_id` file created by the `--create-app` step.
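The lookup order is roughly this (an illustrative sketch, not the CLI's actual code):

```python
import os
from pathlib import Path

def resolve_app_id() -> str | None:
    """Prefer EMR_APP_ID from the environment/.env, else the .emr_app_id file."""
    app_id = os.environ.get("EMR_APP_ID")
    if not app_id:
        marker = Path(".emr_app_id")
        if marker.exists():
            app_id = marker.read_text().strip()
    return app_id
```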
You can tag your deployment with a human-readable note:
```bash
uv run deploy-to-emr --deployment-note "Testing new partitioning logic"
```

📝 This deployment note is saved in `manifest.json` alongside:
- Package version
- Python version
- Original entry-point name
- Exact S3 keys used for this deploy
💡 Why it's useful:
- Perfect for audit trails 🕵️
- Makes it easier to roll back or investigate a deployment
- Great for CI/CD tagging
Preview the exact AWS StartJobRun payload without actually submitting the job
to EMR Serverless. This is useful for validating all arguments, environment variables,
and generated S3 artifact paths before launching.
```bash
uv run deploy-to-emr --dry-run
```

- Normal packaging steps still occur -- your code and dependencies are staged exactly as in a real deploy, ensuring the S3 paths will be correct.
- Payload is constructed -- the script generates the JSON body for `boto3.client("emr-serverless").start_job_run(...)`.
- No API call is made -- instead of submitting to AWS, the payload is pretty-printed to the console via the logger.
```json
{
  "applicationId": "00fulej7qh7jt90t",
  "executionRoleArn": "arn:aws:iam::<your-aws-account-id>:role/YourEmrServerlessExecutionRole",
  "executionTimeoutMinutes": 60,
  "jobDriver": {
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/main_ab12cd34.py",
      "entryPointArguments": [],
      "sparkSubmitParameters": "--py-files s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/code_ab12cd34.zip"
    }
  },
  "configurationOverrides": {}
}
```

- ✅ Validate CLI arguments & environment variables before an actual run.
- ✅ Confirm S3 artifact paths -- ensures your release folder and file names match expectations.
- ✅ Debug or share payload details for change approval or troubleshooting without triggering a real cluster run.
- ✅ No execution -- this mode does not run your Spark job or incur EMR runtime costs; it is purely for inspection.
💡 Tip: Use dry-run mode in CI pipelines to validate that deployment scripts and environment settings are correct before allowing production runs.
- Compile dependencies via `uv pip compile` from `pyproject.toml`.
- Install them into a staging folder.
- Copy the `src/` package + `main.py` to the staging root.
- Zip the staging folder and upload it to S3 (see the sketch after the ZIP example below).
Example ZIP structure:

```
code_<deployment_id>.zip
├── emr_dummy/
│   ├── __init__.py
│   └── job.py
└── main_<deployment_id>.py
```
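A rough sketch of those pack steps in Python -- illustrative only; the staging paths and exact `uv` invocations are assumptions, not the CLI's actual implementation:

```python
import shutil
import subprocess
from pathlib import Path

staging = Path("build/staging")
staging.mkdir(parents=True, exist_ok=True)

# 1. Compile pinned requirements from pyproject.toml
subprocess.run(
    ["uv", "pip", "compile", "pyproject.toml", "-o", "build/requirements.txt"],
    check=True,
)

# 2. Install the resolved dependencies into the staging folder
subprocess.run(
    ["uv", "pip", "install", "-r", "build/requirements.txt", "--target", str(staging)],
    check=True,
)

# 3. Copy the src/ package and entry point to the staging root
shutil.copytree("src/emr_dummy", staging / "emr_dummy", dirs_exist_ok=True)
shutil.copy("main.py", staging / "main.py")

# 4. Zip the staging folder; uploading to S3 (e.g. via boto3) would follow
deployment_id = "20250808_123456-ab12cd34"  # placeholder
shutil.make_archive(f"build/code_{deployment_id}", "zip", root_dir=staging)
```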
Every deployment gets a unique `deployment_id`:

```
s3://<bucket>/emr-code/<package>/<env>/
├── releases/
│   └── 20250808_123456-ab12cd34/
│       ├── code_20250808_123456-ab12cd34.zip
│       ├── main_20250808_123456-ab12cd34.py
│       └── manifest.json
└── logs/
    └── 20250808_123456-ab12cd34/
```
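One plausible way to build such an ID (the CLI's actual scheme may differ, e.g. it could derive the suffix from a git short hash rather than a random value):

```python
import uuid
from datetime import datetime, timezone

# UTC timestamp plus a short random suffix, e.g. '20250808_123456-ab12cd34'
deployment_id = f"{datetime.now(timezone.utc):%Y%m%d_%H%M%S}-{uuid.uuid4().hex[:8]}"
```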
- ✅ Immutable releases -- every `main.py` matches its `code.zip`
- ✅ Full audit trail -- each `manifest.json` contains:
  - `package_name`, `package_version`, `python_requires`, `python_version`
  - S3 URIs of all artifacts
  - Optional `deployment_note`
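For illustration, a manifest might look like this -- the keys follow the list above, but the exact field names and nesting are assumptions, not the CLI's actual output:

```json
{
  "package_name": "emr_pyspark_dummy",
  "package_version": "0.1.0",
  "python_requires": ">=3.10",
  "python_version": "3.10.14",
  "artifacts": {
    "entry_point": "s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/main_ab12cd34.py",
    "py_files": "s3://my-bucket/emr-code/emr_pyspark_dummy/dev/releases/20250808_123456-ab12cd34/code_ab12cd34.zip"
  },
  "deployment_note": "Testing new partitioning logic"
}
```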
- Always use a deployment note in production for traceability.
- Store `.env` in a secure location, not in Git.
- Use `--dry-run` in CI to lint deployments without launching jobs.
- Use different `DEPLOY_ENV` values (`dev`, `staging`, `prod`) to avoid mixing releases.
Copy the example file and edit it:
```bash
cp .env.example .env
```

Then set `REGION`, `S3_BUCKET`, `EMR_APP_ID`, `EMR_EXECUTION_ROLE`, and optional `DEPLOY_ENV`/`DEPLOYMENT_NOTE`.
EMR Serverless job submissions now enable CloudWatch Logs by default.
Control this via your .env:
```bash
ENABLE_CLOUDWATCH_LOGGING=true  # default
```

Logs are still written to the S3 `log_uri` as before.

💡 Tip: Toggle CloudWatch logging from the CLI:

```bash
uv run deploy-to-emr --enable-cw   # or --no-enable-cw
```

The bundled job writes a tiny DataFrame to an Apache Iceberg table backed by the AWS Glue Catalog and S3.
Add the following to your .env (see .env.example):
```bash
ICEBERG_CATALOG_NAME=glue_catalog
ICEBERG_GLUE_DB=your_glue_database
ICEBERG_S3_BUCKET=your-data-bucket
# Optional override (otherwise derived from bucket):
# ICEBERG_WAREHOUSE_PATH=s3://your-data-bucket/iceberg/warehouse
```

A sample `config.toml` is included to mimic a future config-driven workflow. The table name comes from `config.toml` (e.g. `dom_iceberg_table`).
All Spark catalog configs are set via `--conf` in the EMR Serverless job submission payload, for portability and environment standardisation.

Best practice: Centralise these settings in the deploy payload so they're auditable per deployment.
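For context, standard Iceberg-on-Glue settings passed this way typically look like the following. These are stock Iceberg configuration keys; the exact set in this project's payload may differ:

```python
# Standard Iceberg + Glue catalog Spark confs (warehouse path is a placeholder)
ICEBERG_CONFS = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse": "s3://your-data-bucket/iceberg/warehouse",
}

# Rendered into the submit payload's sparkSubmitParameters string:
spark_submit_parameters = " ".join(f"--conf {k}={v}" for k, v in ICEBERG_CONFS.items())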
Environment variables from your local `.env` are not automatically visible to the EMR Serverless driver. The deploy CLI reads your local `.env` and injects the values as Spark `--conf` keys (e.g. `spark.emr_dummy.ICEBERG_GLUE_DB`). The job reads them via `spark.conf`, with environment variables as a fallback.
- Iceberg Spark configs are passed via the EMR Serverless submit payload (`--conf`).
- The job reads (see the sketch after this list):
  - `spark.emr_dummy.ICEBERG_CATALOG_NAME` (defaults to `glue_catalog` if not provided)
  - `spark.emr_dummy.ICEBERG_GLUE_DB` (required)
  - `table_name` from the packaged `config.toml`
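A minimal job-side sketch of that lookup and the Iceberg write -- the helper name `get_setting` is hypothetical and the bundled job's actual code may differ:

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def get_setting(key: str, default: str | None = None) -> str | None:
    """Read a spark.emr_dummy.* conf, falling back to an environment variable."""
    value = spark.conf.get(f"spark.emr_dummy.{key}", None)
    return value if value is not None else os.environ.get(key, default)

catalog = get_setting("ICEBERG_CATALOG_NAME", "glue_catalog")
database = get_setting("ICEBERG_GLUE_DB")  # required

# Write a tiny DataFrame to the Iceberg table named in config.toml
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.writeTo(f"{catalog}.{database}.dom_iceberg_table").createOrReplace()
```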
Use `S3_BUCKET` only. Scripts derive code/log prefixes like:

```
s3://$S3_BUCKET/jobs/$APP_NAME/
s3://$S3_BUCKET/logs/$APP_NAME/
```
```bash
REGION=eu-west-2
S3_BUCKET=my-artifacts-bucket
EMR_APP_ID=00fulej7qh7jt90t
EMR_EXECUTION_ROLE=arn:aws:iam::<your-aws-account-id>:role/YourEmrServerlessExecutionRole
APP_NAME=emr-spark-uv
RELEASE_LABEL=emr-7.9.0
DEPLOY_ENV=dev
```

- `EMR_APP_ID` is the EMR Serverless Application ID from the AWS Console.
- `EMR_EXECUTION_ROLE` is the IAM Execution Role ARN that jobs assume (not the application ARN).
- Scripts auto-derive paths:
  - Code zips → `s3://$S3_BUCKET/jobs/$APP_NAME/`
  - Logs → `s3://$S3_BUCKET/logs/$APP_NAME/` (override with `S3_LOG_PREFIX`)
Copy `.env.example` to `.env` and fill in values for your account.
- No, if you let the scripts create the app. Run `uv run deploy-to-emr --create-app` and we'll save the resulting ID to a local file `.emr_app_id`.
- Yes, if you want to reuse an existing app created in the AWS Console. Put its ID into `.env` as `EMR_APP_ID`.

The submission script will use `EMR_APP_ID` from `.env` if set; otherwise it falls back to reading `.emr_app_id`.
At minimum you need:
- `REGION` -- AWS region (e.g., `eu-west-2`)
- `S3_BUCKET` -- bucket for artifacts/logs
- `EMR_EXECUTION_ROLE` -- Execution Role ARN for EMR Serverless
- `IMAGE_URI` -- your custom ECR image (bakes in your Python deps)
- (optional) `APP_NAME`, `RELEASE_LABEL`, `DEPLOY_ENV`, `S3_LOG_PREFIX`

See `.env.example` for a complete, commented template.
```bash
# Build & push custom image (from Dockerfile)
uv run deploy-to-emr --build-image

# Package code (zip main.py + src/) and upload to s3://$S3_BUCKET/jobs/$APP_NAME/
uv run deploy-to-emr --package

# Create/reuse EMR Serverless app and persist its ID to .emr_app_id
uv run deploy-to-emr --create-app

# Submit a job using the latest uploaded zip
uv run deploy-to-emr --submit

# Or do everything at once:
uv run deploy-to-emr
```

- Change Python code? Run `uv run deploy-to-emr --package --submit` to re-zip and re-run quickly.
- Change dependencies? Rebuild the image with `uv run deploy-to-emr --build-image`, then `uv run deploy-to-emr --submit`.
- Switching between apps? Set `EMR_APP_ID` in `.env` or swap `.emr_app_id` locally.
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.