This directory contains practical examples demonstrating PyMetal's capabilities for GPU computing and graphics programming using Apple's Metal API.
- macOS with Metal-compatible GPU
- Python 3.9+
- NumPy
- SciPy (for example 01 only)
Install dependencies:
pip install numpy scipy

| Example | Type | Difficulty | Key Focus |
|---|---|---|---|
| 01_image_blur.py | Compute | Beginner | Shader basics, performance comparison |
| 02_matrix_multiply_naive.py | Compute | Beginner | API demonstration (intentionally slow) |
| 02_matrix_multiply_tiled.py | Compute | Intermediate | Shared memory, tiling optimization |
| 02_matrix_multiply_optimized.py | Compute | Advanced | Bank conflicts, loop unrolling |
| 03_triangle_rendering.py | Graphics | Intermediate | Complete graphics pipeline |
| 04_advanced_features.py | Advanced | Advanced | Events, capture scopes, debugging |
Demonstrates GPU-accelerated Gaussian blur with a performance comparison against a CPU implementation.
Key Features:
- Compute shader compilation and execution
- 2D thread grid configuration
- Zero-copy NumPy buffer integration
- CPU vs GPU performance comparison
- Multiple image sizes (256x256 to 1024x1024)
Run:
python examples/01_image_blur.py

Expected Output:
Performance Comparison: CPU vs GPU
Small (256x256):
CPU: 15.23 ms
GPU: 2.45 ms
Speedup: 6.22x
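
For readers new to the algorithm, the sketch below shows the arithmetic a Gaussian blur performs, written in plain Python rather than Metal Shading Language. It is not the code from 01_image_blur.py: the helper names `gaussian_weights` and `blur_row`, the radius, and the sigma are assumptions chosen for illustration. A GPU kernel would typically assign one thread per output pixel to compute the same weighted sum.

```python
import math


def gaussian_weights(radius, sigma):
    """Return 2*radius + 1 Gaussian weights normalized to sum to 1 (hypothetical helper)."""
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def blur_row(row, weights):
    """Blur one row of pixel values, clamping reads at the left and right edges."""
    radius = len(weights) // 2
    last = len(row) - 1
    out = []
    for x in range(len(row)):
        acc = 0.0
        for k, w in enumerate(weights):
            src = min(max(x + k - radius, 0), last)  # clamp to the valid index range
            acc += w * row[src]
        out.append(acc)
    return out


# Illustrative parameters only; the example script may use different ones.
weights = gaussian_weights(radius=2, sigma=1.0)
blurred = blur_row([0.0, 0.0, 1.0, 0.0, 0.0], weights)
```

Blurring a single bright pixel spreads it into a copy of the kernel, which is a quick way to sanity-check any blur implementation, CPU or GPU.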
Educational implementation showing PyMetal API usage with a simple, unoptimized algorithm.
Important: NumPy will be faster because it uses Apple's Accelerate framework with the AMX matrix coprocessor. This demo prioritizes code clarity over performance to demonstrate API patterns.
Key Features:
- Simple O(N³) matrix multiplication
- Multi-buffer management
- 2D thread group dispatch
- GFLOPS calculation
- Performance comparison showing CPU advantages
Run:
python examples/02_matrix_multiply_naive.py

Expected Output:
Large (512x512) @ Large (512x512):
NumPy: 0.35 ms (766 GFLOPS)
Metal: 6.85 ms (39 GFLOPS)
NumPy wins - optimized Accelerate framework
Why NumPy is Faster:
- Uses Apple Accelerate BLAS (hand-tuned assembly)
- Leverages AMX matrix coprocessor on Apple Silicon
- Naive GPU implementation doesn't use shared memory or tiling
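
The GFLOPS figures in the output above can be reproduced from the timings. The sketch below assumes the conventional count of 2·M·N·K floating-point operations for a matrix product (one multiply and one add per inner-loop step); the function name `matmul_gflops` is hypothetical and not part of PyMetal or the example script.

```python
def matmul_gflops(m, n, k, elapsed_ms):
    """GFLOPS for an (m x k) @ (k x n) product that took elapsed_ms milliseconds."""
    flops = 2.0 * m * n * k          # one multiply and one add per inner-loop step
    seconds = elapsed_ms / 1000.0
    return flops / seconds / 1e9


# Timing taken from the sample output above (512x512, Metal naive kernel).
metal_gflops = matmul_gflops(512, 512, 512, 6.85)
print(f"Metal: {metal_gflops:.0f} GFLOPS")  # prints "Metal: 39 GFLOPS"
```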
Production-quality implementation using advanced GPU optimization techniques.
Key Features:
- Tiled algorithm with threadgroup (shared) memory
- Coalesced memory access patterns
- Reduced global memory bandwidth
- Proper synchronization barriers
- Competitive performance with NumPy for large matrices
Run:
python examples/02_matrix_multiply_tiled.py

Expected Output:
Large (1024x1024) @ Large (1024x1024):
NumPy (Accelerate): 2.98 ms (719 GFLOPS)
Metal (Tiled): X.XX ms (XXX GFLOPS)
Improved over naive by ~X-Xx
Optimization Techniques:
- 16×16 tile blocking
- Shared memory caching
- Reduced memory bandwidth by reusing data
- Better cache utilization
Compare with 02_matrix_multiply_naive.py to see the impact of GPU optimization.
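
To make the tiling idea concrete without a GPU, here is a CPU-side sketch in plain Python. It is not the Metal kernel from 02_matrix_multiply_tiled.py; the function names are hypothetical, and the small `a_tile` and `b_tile` lists only stand in for the threadgroup memory a real kernel would use. Each value staged into a tile is reused for up to `tile` output elements, which is where the reduction in global memory traffic comes from.

```python
def matmul_naive(a, b):
    """Reference product: every output element re-reads a full row and column."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]


def matmul_tiled(a, b, tile=16):
    """Blocked product: work proceeds one tile x tile block of the output at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # one "threadgroup" per output tile
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):      # walk the shared dimension tile by tile
                # On a GPU these two sub-blocks would be staged in threadgroup memory.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c


# Small integer matrices whose sizes are not multiples of the tile size.
a = [[i + j for j in range(5)] for i in range(4)]   # 4 x 5
b = [[i * j for j in range(3)] for i in range(5)]   # 5 x 3
assert matmul_tiled(a, b, tile=2) == matmul_naive(a, b)
```

In a real kernel the loop over `p0` is also where the synchronization barriers go: one after loading a tile and one before overwriting it.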
Highly optimized implementation with advanced GPU techniques for maximum performance.
Key Features:
- Bank conflict avoidance in shared memory
- Loop unrolling with #pragma unroll
- Optimal thread group sizing for M1 GPU
- Better instruction pipeline utilization
Run:
python examples/02_matrix_multiply_optimized.py

Expected Output:
Huge (2048x2048):
NumPy: 21.35 ms (804 GFLOPS)
Metal: 88.08 ms (195 GFLOPS) - 2.5× slower but improving
Massive (4096x4096):
NumPy: 158.32 ms (868 GFLOPS)
Metal: 617.51 ms (222 GFLOPS) - Getting closer!
Optimizations Applied:
- TILE_SIZE+1 padding prevents bank conflicts
- #pragma unroll for better instruction pipelining
- Optimal 16×16 tile size for M1 occupancy
- ~10-20% improvement over basic tiled version
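
Why one extra column of padding helps can be shown with a small model of banked memory. The sketch below assumes 32 banks of 4-byte words, a common textbook model that is not taken from the example or from Apple documentation, and a hypothetical helper `banks_hit_by_column`. It counts how many distinct banks are touched when 16 threads each read one element of the same tile column.

```python
NUM_BANKS = 32  # assumption: 32 banks of 4-byte words; not taken from the example


def banks_hit_by_column(row_stride, rows=16, column=0):
    """Banks touched when `rows` threads each read one element of the same column."""
    return {(r * row_stride + column) % NUM_BANKS for r in range(rows)}


unpadded = banks_hit_by_column(row_stride=16)   # tile declared as [16][16]
padded = banks_hit_by_column(row_stride=17)     # tile declared as [16][17]
print(len(unpadded), len(padded))  # prints "2 16"
```

Under this model, the unpadded tile funnels all 16 reads through 2 banks, so they serialize, while the padded tile spreads them across 16 different banks.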
Performance Progression:
- Naive: ~100 GFLOPS
- Tiled: ~200 GFLOPS
- Optimized: ~220 GFLOPS
NumPy remains faster thanks to the dedicated AMX matrix hardware; the optimized GPU kernel shows how far software optimization alone can go.
Comparing all three implementations on Apple M1 (4096×4096 matrices):
| Implementation | GFLOPS | Time (ms) | vs Naive | Notes |
|---|---|---|---|---|
| Naive | ~100 | ~1373 | 1.0× | Baseline - no optimization |
| Tiled | ~205 | ~670 | 2.0× | 16×16 shared memory tiles |
| Optimized | ~222 | ~618 | 2.2× | Bank conflict avoidance + unrolling |
| NumPy/AMX | ~868 | ~158 | 8.7× | Dedicated matrix hardware wins |
Key Takeaways:
- GPU optimization techniques provide a 2.0-2.2× improvement over the naive kernel
- Specialized hardware (AMX) is about 4× faster than the optimized GPU kernel for matmul
- GPU excels at custom operations where specialized hardware doesn't exist
- ~220 GFLOPS is respectable for M1 GPU (~8% of 2.6 TFLOPS peak)
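
The ratios quoted above follow directly from the times in the table. A short check, using only the table's own numbers (the variable names are illustrative):

```python
# Times in milliseconds for 4096x4096 matrices, copied from the table above.
times_ms = {"Naive": 1373, "Tiled": 670, "Optimized": 618, "NumPy/AMX": 158}

baseline = times_ms["Naive"]
speedups = {name: baseline / t for name, t in times_ms.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x vs naive")

amx_vs_optimized = times_ms["Optimized"] / times_ms["NumPy/AMX"]  # about 3.9
peak_fraction = 220 / 2600  # ~220 GFLOPS against the 2.6 TFLOPS peak, about 8%
```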
Complete graphics pipeline demonstration with offscreen rendering, depth testing, and image output.
Key Features:
- Vertex and fragment shaders
- Render pass configuration
- Color and depth attachments
- Depth testing setup
- Triangle rasterization with color interpolation
- Blit encoder for texture-to-buffer copy
- PPM image file output
Run:
python examples/03_triangle_rendering.py

Output:
- Renders a colored triangle (red, green, blue vertices)
- Saves to /tmp/pymetal_triangle.ppm
- View with: open /tmp/pymetal_triangle.ppm
Expected Output:
Rendering 512x512 triangle on Apple M1
Compiling shaders...
Creating render pipeline...
Rendering triangle...
✓ Image saved to: /tmp/pymetal_triangle.ppm
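
PPM is convenient here because it needs no image library. The sketch below writes RGBA8 pixel bytes as a binary (P6) PPM. It is an assumption-laden illustration, not the code from 03_triangle_rendering.py: the README does not say whether the example writes P6 or the text-based P3 variant, the helper name `write_ppm` is hypothetical, and the pixel data is assumed to be tightly packed with no row padding.

```python
import os
import tempfile


def write_ppm(path, width, height, rgba):
    """Write tightly packed RGBA8 pixel bytes as a binary (P6) PPM, dropping alpha."""
    if len(rgba) != width * height * 4:
        raise ValueError("expected width * height * 4 bytes of RGBA data")
    rgb = bytearray()
    for i in range(0, len(rgba), 4):
        rgb += rgba[i:i + 3]          # keep R, G, B; skip A
    with open(path, "wb") as f:
        f.write(f"P6\n{width} {height}\n255\n".encode("ascii"))
        f.write(rgb)


# A 2x1 test image: one red pixel and one green pixel.
pixels = bytes([255, 0, 0, 255, 0, 255, 0, 255])
path = os.path.join(tempfile.gettempdir(), "ppm_sketch.ppm")
write_ppm(path, 2, 1, pixels)
```

If the buffer copied back from the texture uses a bytes-per-row value larger than `width * 4`, each row would need to be trimmed before being passed to a function like this.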
Demonstrates advanced Metal features including event system, shared events, binary archives, and capture scopes.
Key Features:
- Event-based synchronization
- Shared events with signaled values
- Binary archive API (pipeline caching)
- Capture scopes for GPU debugging
- Multi-pass compute operations
- Fine-grained command synchronization
Run:
python examples/04_advanced_features.py

Expected Output:
=== Event Synchronization Demo ===
Event synchronization verified: all values = 3.0
=== Shared Events Demo ===
Testing shared event signaling...
Initial value: 0
After signal: 100
Final value: 999
=== Capture Scopes Demo ===
Capture scope began - GPU work is now traceable
Computation verified: first 5 results = [0. 2. 4. 6. 8.]
import pymetal as pm
device = pm.create_system_default_device()
queue = device.new_command_queue()

# 1. Compile shader
library = device.new_library_with_source(shader_source)
function = library.new_function("kernel_name")
pipeline = device.new_compute_pipeline_state(function)
# 2. Create buffers
buffer = device.new_buffer(size, pm.ResourceStorageModeShared)
# 3. Encode commands
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.compute_command_encoder()
encoder.set_compute_pipeline_state(pipeline)
encoder.set_buffer(buffer, 0, 0)
encoder.dispatch_threadgroups(grid_w, grid_h, 1, thread_w, thread_h, 1)
encoder.end_encoding()
# 4. Execute
cmd_buffer.commit()
cmd_buffer.wait_until_completed()

# 1. Create render targets
color_desc = pm.TextureDescriptor.texture2d_descriptor(
    pm.PixelFormat.RGBA8Unorm, width, height, False
)
color_texture = device.new_texture(color_desc)
# 2. Configure render pass
render_pass = pm.RenderPassDescriptor.render_pass_descriptor()
color_att = render_pass.color_attachment(0)
color_att.texture = color_texture
color_att.load_action = pm.LoadAction.Clear
color_att.store_action = pm.StoreAction.Store
# 3. Create pipeline
pipeline_desc = pm.RenderPipelineDescriptor.render_pipeline_descriptor()
pipeline_desc.vertex_function = vertex_func
pipeline_desc.fragment_function = fragment_func
pipeline = device.new_render_pipeline_state(pipeline_desc)
# 4. Render
cmd_buffer = queue.command_buffer()
encoder = cmd_buffer.render_command_encoder(render_pass)
encoder.set_render_pipeline_state(pipeline)
encoder.draw_primitives(pm.PrimitiveType.Triangle, 0, 3)
encoder.end_encoding()
cmd_buffer.commit()

import numpy as np

# Write to GPU buffer from NumPy
data = np.array([1, 2, 3, 4], dtype=np.float32)
buffer = device.new_buffer(data.nbytes, pm.ResourceStorageModeShared)
np.copyto(np.frombuffer(buffer.contents(), dtype=np.float32), data)
# Read from GPU buffer to NumPy
result = np.frombuffer(buffer.contents(), dtype=np.float32, count=4)

Use the GPU for:
- ✓ Custom operations not available in optimized libraries
- ✓ Highly parallel workloads (thousands/millions of independent operations)
- ✓ Large datasets (overhead is amortized)
- ✓ Fused operations (combining multiple steps reduces memory traffic)
- ✓ Memory-bound operations where parallelism helps bandwidth

Use the CPU/NumPy for:
- ✓ Standard operations (matmul, FFT, etc.) - use Accelerate/MKL
- ✓ Small datasets (GPU overhead dominates)
- ✓ Sequential algorithms (limited parallelism)
- ✓ Prototyping (faster development, easier debugging)
- ✓ Matrix operations on Apple Silicon, which has an AMX coprocessor for them
Many real applications use both:
- NumPy/Accelerate for standard linear algebra
- GPU for custom kernels, image processing, simulations
- CPU for control flow, data preparation
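
The "large datasets amortize overhead" point can be made concrete with a simple cost model: the GPU pays a fixed launch and transfer overhead but processes each element faster. The sketch below is only a model; the function name and every number passed to it are hypothetical placeholders, not measurements from these examples.

```python
import math


def break_even_elements(launch_overhead_s, cpu_elems_per_s, gpu_elems_per_s):
    """Smallest n where overhead + n/gpu_rate <= n/cpu_rate, or None if the GPU never wins."""
    if gpu_elems_per_s <= cpu_elems_per_s:
        return None
    per_elem_saving = 1.0 / cpu_elems_per_s - 1.0 / gpu_elems_per_s
    return math.ceil(launch_overhead_s / per_elem_saving)


# Placeholder values: 200 microseconds of overhead, CPU at 1e9 and GPU at 5e9 elements/s.
n = break_even_elements(200e-6, 1e9, 5e9)
print(f"GPU pays off above roughly {n:,} elements")
```

With these placeholder numbers the break-even point lands at roughly a quarter of a million elements; measuring the three inputs on the target machine gives a more useful answer than any rule of thumb.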
- Use Shared Storage Mode for CPU-GPU data transfer
    buffer = device.new_buffer(size, pm.ResourceStorageModeShared)
- Optimize Thread Group Size based on problem size
    threads_per_group = min(16, device.max_threads_per_threadgroup.width)
- Avoid Synchronous Waits when possible
    # Instead of: cmd_buffer.wait_until_completed()
    # Use completion handlers or fence for async operation
- Batch Operations to reduce command buffer overhead
    # Submit multiple operations in one command buffer
    encoder.dispatch_threadgroups(...)  # Operation 1
    encoder.dispatch_threadgroups(...)  # Operation 2
    encoder.end_encoding()
export METAL_DEVICE_WRAPPER_TYPE=1
export MTL_DEBUG_LAYER=1
python examples/01_image_blur.py

manager = pm.shared_capture_manager()
scope = manager.new_capture_scope_with_command_queue(queue)
scope.label = "My Debug Capture"
scope.begin_scope()
# ... GPU work ...
scope.end_scope()

Then capture in Xcode: Product > Perform Action > Capture GPU Frame
device = pm.create_system_default_device()
print(f"Device: {device.name}")
print(f"Max threads per threadgroup: {device.max_threads_per_threadgroup.width}")
print(f"Supports family: {device.supports_family(pm.GPUFamilyApple8)}")

Problem: Shader compilation fails
- Check Metal Shading Language syntax
- Ensure kernel/vertex/fragment functions are correctly declared
- Verify buffer bindings match [[buffer(N)]] indices
Problem: Results don't match expected
- Verify thread group size covers entire data range
- Check for race conditions in shared memory
- Ensure proper synchronization between passes
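
A common cause of the first item is truncating division when computing the number of threadgroups, which leaves the last partial row or column of data unprocessed. The sketch below rounds up instead; the helper name `threadgroups_for` is hypothetical, and the 16×16 threadgroup size is just the value used elsewhere in this README. The results would be passed as the `grid_w` and `grid_h` arguments of `dispatch_threadgroups`.

```python
def threadgroups_for(width, height, threads_w=16, threads_h=16):
    """Threadgroup counts per dimension, rounded up so the grid covers all the data."""
    groups_w = (width + threads_w - 1) // threads_w
    groups_h = (height + threads_h - 1) // threads_h
    return groups_w, groups_h


print(threadgroups_for(1000, 600))  # prints "(63, 38)"
```

Because 63 × 16 = 1008 exceeds 1000, some threads fall outside the data, so the kernel needs a bounds check that returns early for out-of-range thread positions.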
Problem: Performance is slower than expected
- Profile thread group configuration
- Check for CPU-GPU transfer bottlenecks
- Consider using Private storage mode for GPU-only data
- Batch multiple operations into single command buffer
- Metal Shading Language Specification
- Metal Best Practices Guide
- PyMetal Test Suite - Comprehensive API examples
Found a bug or want to add an example? Please open an issue or pull request on the PyMetal repository.