Baby's First CUDA

2024-11-13

The guide is here.

CUDA

Raw Notes

  • I wonder how much the overloading of the word "kernel" has confused me before. A bit, I expect. These kernels, for example, are ones I was a bit confused about before.
  • Threads and thread ids. Seems legit.
  • I'm going to just go implement. This is the best.
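To make the "threads and thread ids" note concrete, here is a minimal sketch, not taken from the guide. Each thread derives a global index from the built-in `blockIdx`, `blockDim` and `threadIdx` variables and writes it out. The kernel name `write_thread_ids`, the helper `thread_ids_demo` and the block size of 256 are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Each thread computes its global index from its block and its position in the block.
__global__ void write_thread_ids(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end of the array
        out[i] = i;
    }
}

// Hypothetical host helper (expects n > 0): true if every out[i] came back equal to i.
bool thread_ids_demo(int n) {
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) return false;

    int threads = 256;                         // threads per block (an assumed choice)
    int blocks = (n + threads - 1) / threads;  // round up so all n elements are covered
    write_thread_ids<<<blocks, threads>>>(d_out, n);

    int *h_out = (int *)malloc(n * sizeof(int));
    // cudaMemcpy waits for the kernel to finish before copying the results back.
    bool ok = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == i);

    free(h_out);
    cudaFree(d_out);
    return ok;
}
```

Compiled with `nvcc` and called from `main`, `thread_ids_demo(1000)` should return true: 1000 is not a multiple of 256, so the bounds check in the kernel is actually exercised.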

Implementing some simple algos in C

  • Vector add very easy
  • Matrix add also very easy, just getting back up to speed with C code and mallocing arrays.
  • Matrix multiply also easy. The number of threads is a function of the size of the output array.
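As a rough sketch of how the thread count can follow from the output size, the example below assumes one thread per element of the output matrix and 16x16 thread blocks. Both are illustrative choices, not necessarily what my own implementation did. The names `matmul` and `matmul_demo` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// C (M x N) = A (M x K) * B (K x N), all row-major. One thread per element of C.
__global__ void matmul(const float *A, const float *B, float *C, int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Hypothetical host helper: multiplies two all-ones matrices, so every entry of C should be K.
bool matmul_demo(int M, int K, int N) {
    size_t a_bytes = (size_t)M * K * sizeof(float);
    size_t b_bytes = (size_t)K * N * sizeof(float);
    size_t c_bytes = (size_t)M * N * sizeof(float);
    float *hA = (float *)malloc(a_bytes);
    float *hB = (float *)malloc(b_bytes);
    float *hC = (float *)malloc(c_bytes);
    for (int i = 0; i < M * K; ++i) hA[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = 1.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, a_bytes);
    cudaMalloc((void **)&dB, b_bytes);
    cudaMalloc((void **)&dC, c_bytes);
    cudaMemcpy(dA, hA, a_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, b_bytes, cudaMemcpyHostToDevice);

    // The grid is sized from the output: enough 16x16 blocks to cover N columns and M rows.
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x, (M + threads.y - 1) / threads.y);
    matmul<<<blocks, threads>>>(dA, dB, dC, M, K, N);

    bool ok = cudaMemcpy(hC, dC, c_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < M * N; ++i) ok = (hC[i] == (float)K);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return ok;
}
```

Note that `K`, the shared inner dimension, never appears in the launch configuration. It only sets how long each thread's loop runs.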

Errors

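  • "CUDA error: invalid configuration argument" - this likely means the launch configuration was invalid, most often more threads per block than the device allows (1024 on current NVIDIA GPUs). A block or grid dimension of zero triggers the same error.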
  • If we access out of bounds in the CUDA kernel, we don't get an error (at least not without explicitly checking for one); we just get all zeros in the resulting array. I guess undefined behavior is wacky -- that's not fun at all.
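Kernel launches do not return an error code, so errors like the ones above only show up if you ask for them. Below is a small sketch of one way to do that. The helpers `max_threads_per_block` and `try_launch` are hypothetical names; `cudaDeviceGetAttribute`, `cudaGetLastError` and `cudaDeviceSynchronize` are CUDA runtime calls.

```cuda
#include <cuda_runtime.h>

__global__ void noop() {}

// Asks the runtime how many threads a single block may hold on device 0.
int max_threads_per_block() {
    int limit = 0;
    cudaDeviceGetAttribute(&limit, cudaDevAttrMaxThreadsPerBlock, 0);
    return limit;
}

// Hypothetical helper: launches one block of the given size and returns the first error seen.
cudaError_t try_launch(int threads_per_block) {
    noop<<<1, threads_per_block>>>();
    cudaError_t err = cudaGetLastError();  // bad launch configurations are reported here
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors raised while the kernel runs surface here
}
```

Under these assumptions, `try_launch(max_threads_per_block())` should return `cudaSuccess`, while `try_launch(max_threads_per_block() + 1)` should return `cudaErrorInvalidConfiguration`, whose `cudaGetErrorString` text is "invalid configuration argument". Out-of-bounds accesses are not guaranteed to be caught this way, so the usual defense is a bounds check such as `if (i < n)` inside the kernel. Running the program under NVIDIA's `compute-sanitizer` tool is another option.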

Triton

Raw Notes

  • The OpenAI SAE example they use is written in Triton. Let's go look at that.
  • Let's go read a little bit about Triton and then write a Triton program.
  • A fused kernel just means that you do multiple operations in the same kernel. This is more efficient for a simple reason: once the data is loaded from memory, you might as well do all the work on it there, instead of writing intermediate results out and reading them back in.
  • I really love that I don't need to write C, but at the same time we're still working with raw memory - so I wonder how much better things really are.
  • Note that you are responsible for doing the backwards pass manually. So we're a bit back to basics in that sense - but that seems reasonable.
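The fusion idea is not specific to Triton, so here is a minimal sketch of it in plain CUDA. It computes `max(a * x + b, 0)` once as two kernels with an intermediate array, and once as a single fused kernel. The operation and all the names (`scale`, `add_relu`, `scale_add_relu`, `fused_matches_unfused`) are made up for illustration and are not from the OpenAI example.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <math.h>

// Unfused, step 1: tmp = a * x. The intermediate is written out to global memory.
__global__ void scale(const float *x, float *tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}

// Unfused, step 2: y = max(tmp + b, 0). The intermediate is read back in.
__global__ void add_relu(const float *tmp, float *y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i] + b, 0.0f);
}

// Fused: same result in one kernel; the intermediate never leaves a register.
__global__ void scale_add_relu(const float *x, float *y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(a * x[i] + b, 0.0f);
}

// Hypothetical helper: runs both versions on small integer-valued inputs (exact in float)
// and returns true if both match a host-side reference.
bool fused_matches_unfused(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    float *h_unfused = (float *)malloc(bytes);
    float *h_fused = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = (float)(i % 7 - 3);  // values in -3..3

    float *d_x, *d_tmp, *d_unfused, *d_fused;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_tmp, bytes);
    cudaMalloc((void **)&d_unfused, bytes);
    cudaMalloc((void **)&d_fused, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    const float a = 2.0f, b = 1.0f;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, d_tmp, a, n);                // two launches, one extra array
    add_relu<<<blocks, threads>>>(d_tmp, d_unfused, b, n);
    scale_add_relu<<<blocks, threads>>>(d_x, d_fused, a, b, n);  // one launch, no extra array

    bool ok = cudaMemcpy(h_unfused, d_unfused, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(h_fused, d_fused, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) {
        float expected = fmaxf(a * h_x[i] + b, 0.0f);
        ok = (h_unfused[i] == expected) && (h_fused[i] == expected);
    }

    cudaFree(d_x); cudaFree(d_tmp); cudaFree(d_unfused); cudaFree(d_fused);
    free(h_x); free(h_unfused); free(h_fused);
    return ok;
}
```

The fused version reads `x` once and writes `y` once, while the unfused version also writes and re-reads `tmp`. For a memory-bound operation like this one, that extra round trip is where most of the cost would be expected to go.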

Some Simple Algorithms

  • Vector addition. Very simple. Works with blocks and masks.
  • Softmax. OK, a fair bit more complicated, but mostly sensible. Overall, I feel the benefits of Triton won't be so apparent to me until I've spent a fair bit of time writing CUDA programs by hand. This makes sense.
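For contrast with the Triton version, here is a rough sketch of what a hand-written CUDA softmax might look like. It is a deliberately naive design that assumes one thread per row of a row-major matrix; a tuned kernel would split each row across many threads. The names `softmax_rows` and `softmax_demo` are hypothetical. The early return on `r >= rows` plays roughly the role a mask plays in Triton.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>
#include <math.h>

// Naive row-wise softmax: one thread handles one whole row of a (rows x cols) matrix.
__global__ void softmax_rows(const float *in, float *out, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;  // threads past the last row do nothing
    const float *x = in + (size_t)r * cols;
    float *y = out + (size_t)r * cols;

    float m = x[0];  // pass 1: row maximum, subtracted below for numerical stability
    for (int j = 1; j < cols; ++j) m = fmaxf(m, x[j]);

    float sum = 0.0f;  // pass 2: exponentiate and accumulate the row sum
    for (int j = 0; j < cols; ++j) {
        y[j] = expf(x[j] - m);
        sum += y[j];
    }

    for (int j = 0; j < cols; ++j) y[j] /= sum;  // pass 3: normalize
}

// Hypothetical helper: the softmax of a constant row should be 1/cols everywhere.
// The constant is large on purpose: expf(1000) would overflow without the max subtraction.
bool softmax_demo(int rows, int cols) {
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < rows * cols; ++i) h_in[i] = 1000.0f;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    softmax_rows<<<blocks, threads>>>(d_in, d_out, rows, cols);

    bool ok = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < rows * cols; ++i) {
        ok = fabsf(h_out[i] - 1.0f / cols) < 1e-6f;
    }

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return ok;
}
```

Even this simple version needs explicit index arithmetic and three passes over each row. That is some of the bookkeeping Triton's block-level loads and reductions are meant to hide.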