Commit 92d14bf

Author: Basu Jindal
Commit message: updates
1 parent cf0d052

5 files changed: 186 additions & 95 deletions

File: blogs/index.html (0 additions, 2 deletions)
@@ -30,8 +30,6 @@ <h1 class="page-title fade-in">Blogs</h1>
  { slug: 'parameter_estimation', description: 'Notes on parameter estimation techniques.' },
  { slug: 'probability_statistics', description: 'Foundations of probability and statistics.' },
  { slug: 'llm', description: 'Notes on LLMs' },
- { slug: 'cuda', description: 'Basics of GPU Programming' },
- { slug: 'quantization', description: 'Quantization 101' },
  ];

  async function loadBlogs() {
Lines changed: 71 additions & 43 deletions
@@ -327,19 +327,19 @@ ldmatrix.sync.aligned.m8n8.x1.shared::cta.b16 {d}, [addr];
## Swizzling

I recommend reading the [excellent blog by Yifan Yang](https://yang-yifan.github.io/blogs/mma_swizzle/mma_swizzle.html#6-how-transposed-input-is-handled) to understand swizzling. This section is essentially a condensed summary of that blog.
Swizzling refers to arranging data in SMEM in a manner that avoids bank conflicts when reading it back from SMEM.
<img src="https://yang-yifan.github.io/blogs/mma_swizzle/figures/swizzle_none_k.png" alt="Swizzle pattern for shared memory" style="max-width: 600px; display: block; margin: 0 auto;">
The Tensor core instruction requires 8 x 16B of data from GMEM, which can be loaded into SMEM by the threads // check if any other way// and fed into the tensor core using the above `ldmatrix` instruction. This works well since there are no bank conflicts in this case: thread 0 loads from 32-bit bank 0, thread 1 from bank 1, and so on up to thread 31 from bank 31.
This works well, but notice that while loading from GMEM we load 8 chunks of 16B contiguous memory, which means 8 load instructions, whereas GPUs support up to 128B contiguous loads. Also, since loading from GMEM has a much higher latency than loading from L2, SMEM, or registers, we would like to load larger chunks.
But if we load 8 chunks of 32B contiguous memory and store them in SMEM contiguously, we will have bank conflicts while reading from SMEM.
Now we will have multiple 2-way bank conflicts, since both $\text{thread}\_0$ and $\text{thread}\_{16}$ will read from bank 0, and likewise for every pair $\text{thread}\_i$ and $\text{thread}\_{i+16}$. If we add 32B swizzling, we avoid these bank conflicts.
<img src="https://yang-yifan.github.io/blogs/mma_swizzle/figures/why_swizzle.png" alt="Swizzle pattern for shared memory" style="max-width: 800px; display: block; margin: 0 auto;">
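The effect of swizzling can be seen in a small model. The following Python sketch is my own illustration (a simple column access across 32 banks, not the exact 32B pattern above): XOR-ing the column index with the row index permutes each row's words so a strided access spreads across all banks.

```python
# Toy model of SMEM bank conflicts: 32 banks of 4B, a 32x32-word array
# stored row-major, 32 threads each reading one word of column 0.
from collections import Counter

NUM_BANKS = 32
WORDS_PER_ROW = 32

def bank_of(row, col, swizzle):
    # XOR swizzle: permute the columns of each row by the row index
    c = col ^ row if swizzle else col
    return (row * WORDS_PER_ROW + c) % NUM_BANKS

def worst_conflict(swizzle):
    # Worst case = max number of threads hitting the same bank
    banks = Counter(bank_of(r, 0, swizzle) for r in range(32))
    return max(banks.values())

print(worst_conflict(False))  # 32: every thread hits bank 0
print(worst_conflict(True))   # 1: conflict-free
```

The same XOR idea, applied at 16B/32B granularity instead of single words, is what the hardware swizzle modes implement.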
@@ -352,48 +352,47 @@ A new concept called 16B atomicity. This is saying for a 16B chunk that is conti
- [CUDA Mode Video on Tensor Cores](https://www.youtube.com/watch?v=hQ9GPnV0-50&t=3968s)
<!-- ### Ping-Pong -->
<!-- For Ping-Pong, each warp group takes on a specialized role of either Data producer or Data consumer. The producer warp group focuses on producing data movement to fill the shared memory buffers (via TMA). Two other warp groups are dedicated consumers that process the math (MMA) portion with tensor cores, and then do any follow up work and write their results back to global memory (epilogue)

The producer can feed data to Tensor cores of Consumers. While one consumer is using the Tensor cores for Main Loop (MMA), the other can work on Epilogue which uses the CUDA cores. Thereby maximizing the utilization of Tensor cores -->

## Architecture Comparison
| Category | Feature | A100 (Ampere) | H100 (Hopper) | B100 (Blackwell) | B200 (Blackwell) |
| ----------------------- | ----------------------------------- | ---------------: | ------------: | ---------------: | ---------------: |
| **Compute / SM** | Tensor Core generation | 3rd Gen | 4th Gen | 5th Gen | 5th Gen |
| | FP32 cores / SM | 64 | 128 | 128 | 128 |
| | FP64 cores / SM | 32 | 64 | 64 | 64 |
| | INT32 cores / SM | 64 | 64 | 64* | 64* |
| | SM count *(check)* | 108 | 132 | 140 | 148 |
| **Scheduling / Limits** | Max resident warps / SM | 64 | 64 | 64 | 64 |
| | Register file / SM (32-bit regs) | 65,536 | 65,536 | 65,536 | 65,536 |
| | Max registers / thread | 255 | 255 | 255 | 255 |
| | Max threads / block | 1024 | 1024 | 1024 | 1024 |
| | Max threads / SM | 2048 | 2048 | 2048 | 2048 |
| | Max thread blocks / SM | 32 | 32 | 32 | 32 |
| **On-chip memory** | L1/Texture + Shared (combined) / SM | 192 KB | 256 KB | 256 KB | 256 KB |
| | Shared memory capacity / SM (max) | 164 KB | 228 KB | 228 KB | 228 KB |
| | Max shared / thread block (opt-in) | | 227 KB | 227 KB | 227 KB |
| | Tensor Memory / SM | N/A | N/A | 256 KB | 256 KB |
| **HBM / Bandwidth** | Total memory | 40 / 80 GB HBM2e | 80 GB HBM3 | 192 GB HBM3e | 192 GB HBM3e |
| | Memory bandwidth *(check one way)* | 1.6–2.0 TB/s | 3.35 TB/s | ~8.0 TB/s | 8.0 TB/s |
| **Numeric formats** | FP8 support | No | Yes | Yes | Yes |
| | FP4 / FP6 support | No | No | Yes | Yes |
| **Interconnect** | NVLink | v3 (600 GB/s) | v4 (900 GB/s) | v5 (1.8 TB/s) | v5 (1.8 TB/s) |
| **Power / Silicon** | TDP (max) *(check)* | 400 W | 700 W | 700 W | 1000 W |
| | Transistor count *(check)* | 54B | 80B | 208B | 208B |
Notice that the **Scheduling / Limits** have not changed across generations.

## Blackwell

[GTC video on CuTe for Blackwell](https://www.nvidia.com/en-us/on-demand/session/gtc25-s72720/)

[Tuning guide for Blackwell](https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html)
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-10-x

@@ -414,7 +413,35 @@ A Streaming Multiprocessor (SM) level:
Block level: The maximum shared memory per thread block is 227 KB.

Thread level: The maximum number of registers per thread is 255.
### GEMM flow

Full GEMM: (Gemm_M × Gemm_N) output, iterating over Gemm_K

Cluster Tile: Multiple CTAs in a cluster TOGETHER compute a larger tile
│ Size: (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N)
CTA Tile: Each CTA within the cluster computes its portion
│ Size: MmaTile_M × MmaTile_N (one CTA's responsibility)

MMA Atom: The hardware instruction (tcgen05.mma)
Size: e.g., 64×256×16 for SM100

So the relationship is:

| Level | What computes it | Size |
|---|---|---|
| Full output | Entire grid | Gemm_M × Gemm_N |
| Cluster tile | 1 cluster (multiple CTAs) | (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N) |
| CTA tile | 1 CTA (thread block) | MmaTile_M × MmaTile_N |
| MMA atom | 1 MMA instruction | ~64×256×16 |

Example:

cluster_shape = (2, 1, 1) // 2 CTAs per cluster in M
MmaTile_M = 128, MmaTile_N = 256

// One CLUSTER handles: (2 × 128) × (1 × 256) = 256 × 256 output tile
// Each CTA in the cluster handles: 128 × 256 (half the M dimension)

The cluster doesn't work on ONE MMA tile together. Rather, multiple CTAs in a cluster each handle their own MMA tile, but they can share data via distributed shared memory and synchronize.
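The tiling arithmetic above can be sanity-checked in a few lines of Python (my own sketch; the full GEMM size of 1024 × 1024 is an assumed example, the tile sizes are the ones from the text):

```python
# Tiling hierarchy arithmetic for the example above (illustrative only).
Gemm_M, Gemm_N = 1024, 1024          # assumed full output size
MmaTile_M, MmaTile_N = 128, 256      # one CTA's tile
cluster_M, cluster_N = 2, 1          # cluster_shape = (2, 1, 1)

# One cluster covers a (cluster_M * MmaTile_M) x (cluster_N * MmaTile_N) tile
cluster_tile_M = cluster_M * MmaTile_M   # 2 * 128 = 256
cluster_tile_N = cluster_N * MmaTile_N   # 1 * 256 = 256

# How many clusters (and CTAs) tile the whole output
clusters = (Gemm_M // cluster_tile_M) * (Gemm_N // cluster_tile_N)
ctas = clusters * cluster_M * cluster_N

print(cluster_tile_M, cluster_tile_N)  # 256 256
print(clusters, ctas)                  # 16 32
```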
### Loading from SMEM in Blackwell
@@ -433,12 +460,13 @@ GPU Memory controller can issue upto 128B load from SMEM in a single cycle. Also
### Load tiles into SMEM using TMA

Load the BMxBK and BKxBN tiles, each a 64x64 fp16 tile (8192B), from GMEM to SMEM. TMA and the tensor cores operate on "core matrices", which are 8 x 16B of data; for half precision that is an 8x8 tile of elements. This means we need to load (64/8) x (64/8) = 8x8 core matrices. While loading data into SMEM we need to keep in mind that it will be fed to the Tensor cores (tcgen05), which expect the data in a certain format. TMA can load a column of 8 core matrices (1024B, shape (8,1)) at a time, which means loading the full 8192B takes 8 loads. Use the `tcgen05.mma` instruction and store the results in TMEM, then move the results from TMEM to registers and finally to GMEM.
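The core-matrix arithmetic in this paragraph works out as follows (my own sketch of the counts, nothing beyond the numbers already stated):

```python
# Core-matrix bookkeeping for one 64x64 fp16 tile.
tile_m = tile_k = 64
elem_bytes = 2                                  # fp16
tile_bytes = tile_m * tile_k * elem_bytes       # 64*64*2 = 8192 B

core_rows, core_row_bytes = 8, 16               # a core matrix is 8 x 16B
core_cols = core_row_bytes // elem_bytes        # 8 fp16 elements per row
core_bytes = core_rows * core_row_bytes         # 128 B per core matrix

core_matrices = (tile_m // core_rows) * (tile_k // core_cols)  # 8*8 = 64
tma_load_bytes = 8 * core_bytes                 # one (8,1) column = 1024 B
num_tma_loads = tile_bytes // tma_load_bytes    # 8 loads per tile

print(tile_bytes, core_matrices, tma_load_bytes, num_tma_loads)
```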
## References & Recommended resources
- [Articles by Colfax Research](https://research.colfax-intl.com/blog/)
- [CUDA Training Series by NVIDIA and OLCF](https://www.olcf.ornl.gov/cuda-training-series/)
- [CUDA Training Series YouTube Playlist](https://www.youtube.com/playlist?app=desktop&list=PL6RdenZrxrw-zNX7uuGppWETdxt_JxdMj)
- [CUDA Training Exercises](https://github.com/olcf/cuda-training-series/tree/master/exercises)
Lines changed: 113 additions & 3 deletions
@@ -94,7 +94,42 @@ If you allocate your host memory with cudaMallocHost and initialize the data the
### Compiling CUDA code
CUDA code can be compiled using the `nvcc` command:
```bash
nvcc -std=c++17 \
  -arch=sm_100a \
  -I/Users/basujindal/cutlass/include \
  -I/Users/basujindal/cutlass/examples/cute/tutorial/blackwell \
  -o mma \
  examples/cute/tutorial/blackwell/mma.cu
```
Or use a Makefile:
```Makefile
NVCC = nvcc
FLAGS = -std=c++17 -arch=sm_100a
SRC = examples/cute/tutorial/blackwell/mma.cu
INCLUDES = -I/opt/cutlass/include -I/opt/cutlass/examples/cute/tutorial/blackwell

mma: mma.o
	$(NVCC) $(FLAGS) -o $@ $^

mma.o: $(SRC)
	$(NVCC) $(FLAGS) $(INCLUDES) -dc -o $@ $<

clean:
	rm -f *.o mma
```
Run the command:
```bash
make && ./mma
```
`nvcc` does code-generation in two stages:

Stage | What nvcc produces | Option that drives it
1. Front-end | PTX for a virtual architecture ("compute_XX…") | `arch=` inside `-gencode` (or `--gpu-architecture`)
@@ -159,6 +194,82 @@ This will generate PTX code for compute capability 3.0, 5.2, and 7.0. Generate S
`compute_XX` refers to a PTX (virtual architecture) version and `sm_XX` refers to a cubin (real architecture) version. The `arch=` clause must always be a PTX version, while the `code=` clause can be cubin, PTX, or both.
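As a concrete sketch (file names and architectures are placeholder assumptions, not from the text), a fat-binary build that embeds cubins for two real architectures plus PTX for forward compatibility might look like:

```shell
# Embed sm_80 and sm_90 cubins, plus compute_90 PTX so newer GPUs can JIT-compile.
nvcc -std=c++17 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_90,code=compute_90 \
  -o app app.cu
```

Note the last clause: `code=compute_90` embeds the PTX itself, which is what allows a future architecture to JIT at load time.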
# CUTLASS/CuTe Development
Use clangd for IDE features like Go to definition, hover documentation, auto-completion, and error diagnostics. Clangd is part of the LLVM/Clang project and understands C++ deeply. Unlike simple syntax highlighting, clangd actually compiles the code in the background to understand types, templates, and symbols.
- Install the `clangd` extension in VS Code. If prompted, let it download the clangd binary.
- If you have Microsoft's "C/C++" extension installed, disable its IntelliSense to avoid conflicts: Settings → search `C_Cpp.intelliSenseEngine` → set to `disabled`.
- Create a `.clangd` file in the project root. This file tells clangd how to compile your code.
---

### Example Configuration
```yaml
CompileFlags:
  Add:
    - "-xc++"
    - "-std=c++17"
    - "-I/path/to/cutlass/include"
    - "-I/path/to/cutlass/tools/util/include"
  Remove:
    - "-forward-unknown-to-host-compiler"
    - "--generate-code*"
    - "-gencode*"
```
### Flag Explanations

| Flag | Purpose |
|------|---------|
| `-xc++` | Treat `.cu` files as C++ (clangd doesn't understand CUDA natively) |
| `-std=c++17` | Use C++17 standard (CUTLASS requires C++17) |
| `-I/path/to/include` | Include paths, i.e., where to find headers |
### Remove Flags

These are nvcc-specific flags that clang doesn't understand:

- `-forward-unknown-to-host-compiler`
- `--generate-code*`
- `-gencode*`
---

- Finally, restart clangd: `Cmd+Shift+P` → `clangd: Restart language server`
---

## Optional: Thrust/CUB Support

If you want Thrust headers to work (for `thrust::device_vector`, etc.), download them separately:
```bash
# Clone Thrust (header-only library)
git clone https://github.com/NVIDIA/thrust.git ~/thrust

# Clone CUB (Thrust dependency)
git clone https://github.com/NVIDIA/cub.git ~/cub
```
Then add to `.clangd`:

```yaml
CompileFlags:
  Add:
    # ... existing flags ...
    - "-I/Users/yourusername/thrust"
    - "-I/Users/yourusername/cub"
```
---

### Check clangd status
- `Cmd+Shift+P` → "clangd: Check status"

### View clangd logs
- View → Output → select "clangd" from the dropdown

---
## Profiling and Debugging

**nsys**: CLI for Nsight Systems, which supports system-wide profiling.
@@ -168,7 +279,6 @@ This will generate PTX code for compute capability 3.0, 5.2, and 7.0. Generate S
**nvprof**: CLI for the NVIDIA Visual Profiler, which supports profiling and tracing of CUDA applications. It was deprecated in CUDA 11.0 and will be removed in a future release.
- [CUDA Debugging Video](https://www.youtube.com/watch?v=nAsMhH1tnYw)
- [CUDA Debugging by vLLM](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)

@@ -203,4 +313,4 @@ That’s your faulting PC in hex. You can disassemble around it:
```bash
(cuda-gdb) disassemble $errorpc-0x40, $errorpc+0x40
```
File renamed without changes.
