Commit 92d14bf

Author: Basu Jindal
Commit message: updates
1 parent cf0d052

5 files changed: 186 additions & 95 deletions

File: blogs/index.html (0 additions, 2 deletions)
@@ -30,8 +30,6 @@ <h1 class="page-title fade-in">Blogs</h1>
  { slug: 'parameter_estimation', description: 'Notes on parameter estimation techniques.' },
  { slug: 'probability_statistics', description: 'Foundations of probability and statistics.' },
  { slug: 'llm', description: 'Notes on LLMs' },
- { slug: 'cuda', description: 'Basics of GPU Programming' },
- { slug: 'quantization', description: 'Quantization 101' },
  ];

  async function loadBlogs() {
Lines changed: 71 additions & 43 deletions
@@ -327,19 +327,19 @@ ldmatrix.sync.aligned.m8n8.x1.shared::cta.b16 {d}, [addr];
## Swizzling

I recommend reading the [excellent blog by Yifan Yang](https://yang-yifan.github.io/blogs/mma_swizzle/mma_swizzle.html#6-how-transposed-input-is-handled) to understand swizzling. This section is essentially a condensed summary of that blog.
Swizzling refers to arranging data in SMEM in a manner that avoids bank conflicts when reading it back from SMEM.
<img src="https://yang-yifan.github.io/blogs/mma_swizzle/figures/swizzle_none_k.png" alt="Swizzle pattern for shared memory" style="max-width: 600px; display: block; margin: 0 auto;">
The Tensor core instruction requires 8 x 16B of data from GMEM, which can be loaded into SMEM by the threads // check if any other way// and fed into the tensor core using the above `ldmatrix` instruction. This works well since there are no bank conflicts in this case: thread 0 loads from 32-bit bank 0, thread 1 from bank 1, and so on up to thread 31 from bank 31.
This works well, but notice that while loading from GMEM we load 8 chunks of 16B contiguous memory, which means 8 load instructions, whereas GPUs support up to 128B contiguous loads. Also, since loading from GMEM has a much higher latency than loading from L2, SMEM, or registers, we would like to load larger chunks.
But if we load 8 chunks of 32B contiguous memory and store them in SMEM contiguously, we will have bank conflicts while reading from SMEM.
Now we will have multiple 2-way bank conflicts, since both $\text{thread}\_0$ and $\text{thread}\_{16}$ will read from bank 0, and likewise for every pair $\text{thread}\_i$ and $\text{thread}\_{i+16}$. If we add 32B swizzling, we avoid these bank conflicts.
<img src="https://yang-yifan.github.io/blogs/mma_swizzle/figures/why_swizzle.png" alt="Swizzle pattern for shared memory" style="max-width: 800px; display: block; margin: 0 auto;">
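The effect of swizzling can be seen in a small model. The following Python sketch is my own illustration (a simple column access across 32 banks, not the exact 32B pattern above): XOR-ing the column index with the row index permutes each row's words so a strided access spreads across all banks.

```python
# Toy model of SMEM bank conflicts: 32 banks of 4B, a 32x32-word array
# stored row-major, 32 threads each reading one word of column 0.
from collections import Counter

NUM_BANKS = 32
WORDS_PER_ROW = 32

def bank_of(row, col, swizzle):
    # XOR swizzle: permute the columns of each row by the row index
    c = col ^ row if swizzle else col
    return (row * WORDS_PER_ROW + c) % NUM_BANKS

def worst_conflict(swizzle):
    # Worst case = max number of threads hitting the same bank
    banks = Counter(bank_of(r, 0, swizzle) for r in range(32))
    return max(banks.values())

print(worst_conflict(False))  # 32: every thread hits bank 0
print(worst_conflict(True))   # 1: conflict-free
```

The same XOR idea, applied at 16B/32B granularity instead of single words, is what the hardware swizzle modes implement.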
@@ -352,48 +352,47 @@ A new concept called 16B atomicity. This is saying for a 16B chunk that is conti
- [CUDA Mode Video on Tensor Cores](https://www.youtube.com/watch?v=hQ9GPnV0-50&t=3968s)
<!-- ### Ping-Pong -->
<!-- For Ping-Pong, each warp group takes on a specialized role of either Data producer or Data consumer. The producer warp group focuses on producing data movement to fill the shared memory buffers (via TMA). Two other warp groups are dedicated consumers that process the math (MMA) portion with tensor cores, and then do any follow up work and write their results back to global memory (epilogue)

The producer can feed data to Tensor cores of Consumers. While one consumer is using the Tensor cores for Main Loop (MMA), the other can work on Epilogue which uses the CUDA cores. Thereby maximizing the utilization of Tensor cores -->

## Architecture Comparison
| Category | Feature | A100 (Ampere) | H100 (Hopper) | B100 (Blackwell) | B200 (Blackwell) |
| ----------------------- | ----------------------------------- | ---------------: | ------------: | ---------------: | ---------------: |
| **Compute / SM** | Tensor Core generation | 3rd Gen | 4th Gen | 5th Gen | 5th Gen |
| | FP32 cores / SM | 64 | 128 | 128 | 128 |
| | FP64 cores / SM | 32 | 64 | 64 | 64 |
| | INT32 cores / SM | 64 | 64 | 64* | 64* |
| | SM count *(check)* | 108 | 132 | 140 | 148 |
| **Scheduling / Limits** | Max resident warps / SM | 64 | 64 | 64 | 64 |
| | Register file / SM (32-bit regs) | 65,536 | 65,536 | 65,536 | 65,536 |
| | Max registers / thread | 255 | 255 | 255 | 255 |
| | Max threads / block | 1024 | 1024 | 1024 | 1024 |
| | Max threads / SM | 2048 | 2048 | 2048 | 2048 |
| | Max thread blocks / SM | 32 | 32 | 32 | 32 |
| **On-chip memory** | L1/Texture + Shared (combined) / SM | 192 KB | 256 KB | 256 KB | 256 KB |
| | Shared memory capacity / SM (max) | 164 KB | 228 KB | 228 KB | 228 KB |
| | Max shared / thread block (opt-in) | | 227 KB | 227 KB | 227 KB |
| | Tensor Memory / SM | N/A | N/A | 256 KB | 256 KB |
| **HBM / Bandwidth** | Total memory | 40 / 80 GB HBM2e | 80 GB HBM3 | 192 GB HBM3e | 192 GB HBM3e |
| | Memory bandwidth *(check one way)* | 1.6–2.0 TB/s | 3.35 TB/s | ~8.0 TB/s | 8.0 TB/s |
| **Numeric formats** | FP8 support | No | Yes | Yes | Yes |
| | FP4 / FP6 support | No | No | Yes | Yes |
| **Interconnect** | NVLink | v3 (600 GB/s) | v4 (900 GB/s) | v5 (1.8 TB/s) | v5 (1.8 TB/s) |
| **Power / Silicon** | TDP (max) *(check)* | 400 W | 700 W | 700 W | 1000 W |
| | Transistor count *(check)* | 54B | 80B | 208B | 208B |
Notice that the **Scheduling / Limits** have not changed across generations.

## Blackwell

[GTC video on CuTe for Blackwell](https://www.nvidia.com/en-us/on-demand/session/gtc25-s72720/)

[Tuning guide for Blackwell](https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html)
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-10-x

@@ -414,7 +413,35 @@ A Streaming Multiprocessor (SM) level:
Block level: The maximum shared memory per thread block is 227 KB.

Thread level: The maximum number of registers per thread is 255.
### GEMM flow

Full GEMM: (Gemm_M × Gemm_N) output, iterating over Gemm_K

Cluster Tile: Multiple CTAs in a cluster TOGETHER compute a larger tile
│ Size: (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N)
CTA Tile: Each CTA within the cluster computes its portion
│ Size: MmaTile_M × MmaTile_N (one CTA's responsibility)

MMA Atom: The hardware instruction (tcgen05.mma)
Size: e.g., 64×256×16 for SM100

So the relationship is:

| Level | What computes it | Size |
|---|---|---|
| Full output | Entire grid | Gemm_M × Gemm_N |
| Cluster tile | 1 cluster (multiple CTAs) | (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N) |
| CTA tile | 1 CTA (thread block) | MmaTile_M × MmaTile_N |
| MMA atom | 1 MMA instruction | ~64×256×16 |

Example:

cluster_shape = (2, 1, 1) // 2 CTAs per cluster in M
MmaTile_M = 128, MmaTile_N = 256

// One CLUSTER handles: (2 × 128) × (1 × 256) = 256 × 256 output tile
// Each CTA in the cluster handles: 128 × 256 (half the M dimension)

The cluster doesn't work on ONE MMA tile together. Rather, multiple CTAs in a cluster each handle their own MMA tile, but they can share data via distributed shared memory and synchronize.
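The tiling arithmetic above can be sanity-checked in a few lines of Python (my own sketch; the full GEMM size of 1024 × 1024 is an assumed example, the tile sizes are the ones from the text):

```python
# Tiling hierarchy arithmetic for the example above (illustrative only).
Gemm_M, Gemm_N = 1024, 1024          # assumed full output size
MmaTile_M, MmaTile_N = 128, 256      # one CTA's tile
cluster_M, cluster_N = 2, 1          # cluster_shape = (2, 1, 1)

# One cluster covers a (cluster_M * MmaTile_M) x (cluster_N * MmaTile_N) tile
cluster_tile_M = cluster_M * MmaTile_M   # 2 * 128 = 256
cluster_tile_N = cluster_N * MmaTile_N   # 1 * 256 = 256

# How many clusters (and CTAs) tile the whole output
clusters = (Gemm_M // cluster_tile_M) * (Gemm_N // cluster_tile_N)
ctas = clusters * cluster_M * cluster_N

print(cluster_tile_M, cluster_tile_N)  # 256 256
print(clusters, ctas)                  # 16 32
```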
### Loading from SMEM in Blackwell
@@ -433,12 +460,13 @@ GPU Memory controller can issue upto 128B load from SMEM in a single cycle. Also
### Load tiles into SMEM using TMA

Load the BMxBK and BKxBN tiles, each a 64x64 fp16 tile (8192B), from GMEM to SMEM. TMA and the tensor cores operate on "core matrices", which are 8 x 16B of data; for half precision that is an 8x8 tile of elements. This means we need to load (64/8) x (64/8) = 8x8 core matrices. While loading data into SMEM we need to keep in mind that it will be fed to the Tensor cores (tcgen05), which expect the data in a certain format. TMA can load a column of 8 core matrices (1024B, shape (8,1)) at a time, which means loading the full 8192B takes 8 loads. Use the `tcgen05.mma` instruction and store the results in TMEM, then move the results from TMEM to registers and finally to GMEM.
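The core-matrix arithmetic in this paragraph works out as follows (my own sketch of the counts, nothing beyond the numbers already stated):

```python
# Core-matrix bookkeeping for one 64x64 fp16 tile.
tile_m = tile_k = 64
elem_bytes = 2                                  # fp16
tile_bytes = tile_m * tile_k * elem_bytes       # 64*64*2 = 8192 B

core_rows, core_row_bytes = 8, 16               # a core matrix is 8 x 16B
core_cols = core_row_bytes // elem_bytes        # 8 fp16 elements per row
core_bytes = core_rows * core_row_bytes         # 128 B per core matrix

core_matrices = (tile_m // core_rows) * (tile_k // core_cols)  # 8*8 = 64
tma_load_bytes = 8 * core_bytes                 # one (8,1) column = 1024 B
num_tma_loads = tile_bytes // tma_load_bytes    # 8 loads per tile

print(tile_bytes, core_matrices, tma_load_bytes, num_tma_loads)
```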
## References & Recommended resources
- [Articles by Colfax Research](https://research.colfax-intl.com/blog/)
- [CUDA Training Series by NVIDIA and OLCF](https://www.olcf.ornl.gov/cuda-training-series/)
- [CUDA Training Series YouTube Playlist](https://www.youtube.com/playlist?app=desktop&list=PL6RdenZrxrw-zNX7uuGppWETdxt_JxdMj)
- [CUDA Training Exercises](https://github.com/olcf/cuda-training-series/tree/master/exercises)
Lines changed: 113 additions & 3 deletions
@@ -94,7 +94,42 @@ If you allocate your host memory with cudaMallocHost and initialize the data the
### Compiling CUDA code
CUDA code can be compiled using the `nvcc` command:
```bash
nvcc -std=c++17 \
  -arch=sm_100a \
  -I/Users/basujindal/cutlass/include \
  -I/Users/basujindal/cutlass/examples/cute/tutorial/blackwell \
  -o mma \
  examples/cute/tutorial/blackwell/mma.cu
```
Or use a Makefile:
```Makefile
NVCC = nvcc
FLAGS = -std=c++17 -arch=sm_100a
SRC = examples/cute/tutorial/blackwell/mma.cu
INCLUDES = -I/opt/cutlass/include -I/opt/cutlass/examples/cute/tutorial/blackwell

mma: mma.o
	$(NVCC) $(FLAGS) -o $@ $^

mma.o: $(SRC)
	$(NVCC) $(FLAGS) $(INCLUDES) -dc -o $@ $<

clean:
	rm -f *.o mma
```
Run the command:
```bash
make && ./mma
```
`nvcc` does code-generation in two stages:

Stage | What nvcc produces | Option that drives it
1. Front-end | PTX for a virtual architecture ("compute_XX…") | `arch=` inside `-gencode` (or `--gpu-architecture`)
@@ -159,6 +194,82 @@ This will generate PTX code for compute capability 3.0, 5.2, and 7.0. Generate S
`compute_XX` refers to a PTX (virtual architecture) version and `sm_XX` refers to a cubin (real architecture) version. The `arch=` clause must always be a PTX version, while the `code=` clause can be cubin, PTX, or both.
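As a concrete sketch (file names and architectures are placeholder assumptions, not from the text), a fat-binary build that embeds cubins for two real architectures plus PTX for forward compatibility might look like:

```shell
# Embed sm_80 and sm_90 cubins, plus compute_90 PTX so newer GPUs can JIT-compile.
nvcc -std=c++17 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_90,code=compute_90 \
  -o app app.cu
```

Note the last clause: `code=compute_90` embeds the PTX itself, which is what allows a future architecture to JIT at load time.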
# CUTLASS/CuTe Development
Use clangd for IDE features like Go to definition, hover documentation, auto-completion, and error diagnostics. Clangd is part of the LLVM/Clang project and understands C++ deeply. Unlike simple syntax highlighting, clangd actually compiles the code in the background to understand types, templates, and symbols.
- Install the `clangd` extension in VS Code. If prompted, let it download the clangd binary.
- If you have Microsoft's "C/C++" extension installed, disable its IntelliSense to avoid conflicts: Settings → search `C_Cpp.intelliSenseEngine` → set to `disabled`.
- Create a `.clangd` file in the project root. This file tells clangd how to compile your code.
---

### Example Configuration
```yaml
CompileFlags:
  Add:
    - "-xc++"
    - "-std=c++17"
    - "-I/path/to/cutlass/include"
    - "-I/path/to/cutlass/tools/util/include"
  Remove:
    - "-forward-unknown-to-host-compiler"
    - "--generate-code*"
    - "-gencode*"
```
### Flag Explanations

| Flag | Purpose |
|------|---------|
| `-xc++` | Treat `.cu` files as C++ (clangd doesn't understand CUDA natively) |
| `-std=c++17` | Use C++17 standard (CUTLASS requires C++17) |
| `-I/path/to/include` | Include paths, i.e., where to find headers |
### Remove Flags

These are nvcc-specific flags that clang doesn't understand:

- `-forward-unknown-to-host-compiler`
- `--generate-code*`
- `-gencode*`
---

- Finally, restart clangd: `Cmd+Shift+P` → `clangd: Restart language server`
---

## Optional: Thrust/CUB Support

If you want Thrust headers to work (for `thrust::device_vector`, etc.), download them separately:
```bash
# Clone Thrust (header-only library)
git clone https://github.com/NVIDIA/thrust.git ~/thrust

# Clone CUB (Thrust dependency)
git clone https://github.com/NVIDIA/cub.git ~/cub
```
Then add to `.clangd`:

```yaml
CompileFlags:
  Add:
    # ... existing flags ...
    - "-I/Users/yourusername/thrust"
    - "-I/Users/yourusername/cub"
```
---

### Check clangd status
- `Cmd+Shift+P` → "clangd: Check status"

### View clangd logs
- View → Output → select "clangd" from the dropdown

---
## Profiling and Debugging

**nsys**: CLI for Nsight Systems, which supports system-wide profiling.
@@ -168,7 +279,6 @@ This will generate PTX code for compute capability 3.0, 5.2, and 7.0. Generate S
**nvprof**: CLI for the NVIDIA Visual Profiler, which supports profiling and tracing of CUDA applications. It was deprecated in CUDA 11.0 and will be removed in a future release.
- [CUDA Debugging Video](https://www.youtube.com/watch?v=nAsMhH1tnYw)
- [CUDA Debugging by vLLM](https://blog.vllm.ai/2025/08/11/cuda-debugging.html)

@@ -203,4 +313,4 @@ That’s your faulting PC in hex. You can disassemble around it:
```bash
(cuda-gdb) disassemble $errorpc-0x40, $errorpc+0x40
```
File renamed without changes.
