Baby's First CUDA

2024-11-13

The guide is here.

CUDA

I wonder how much the word "kernel" being overloaded has confused me before. I expect a bit. Like these kernels I was a bit confused about before.
Threads and thread ids. Seems legit.
I'm going to just go implement. This is the best.

Vector add very easy
Matrix add also very easy, just getting back up to speed with C code and mallocing arrays.
Matrix multiply also easy. The number of threads is a function of the size of the output array.

"CUDA error: invalid configuration argument" - this likely means you used more threads than you should have.
If we out of bounds access in the CUDA kernel, we don't get an error, we just get all zeros in the resulting array. I guess undefined behavior is wacky -- that's not fun at all.

In the OpenAI SAE example they use, they are writing Triton code. Let's go look at that.
Let's go read a little bit about Triton and then write a Triton program.
Fused kernels just means that you do multiple operations in the same kernel. This is more efficient for obvious reasons: namely, you have the memory loaded, you might as well do the things there.
I really love I don't need to write in C, but at the same time we're still working with raw memory - so I wonder much better things really are.
Note that you are responsible for doing the backwards pass manually. So we're a bit back to basics in that sense - but that seems reasonable.

Vector addition. Very simple. Works with blocks, and masks.
Softmax. Ok a fair bit more complicated, but mostly fundamentally sensical. Overall, I feel the benefits of Triton aren't so apparently to me until I've really spent a fair bit of time writing CUDA programs by hand. This makes sense._