SVDQuant and Nunchaku
Introduction
This is a super brief summary of SVDQuant. Please refer to the original paper if you're interested.
\[Q_X = \operatorname{round}(X / s_X)\]
where $s_X = \max(\vert X \vert) / q_{max}$ and $q_{max}$ is the maximum value representable in the quantized format.

\[Q(X) = s_X\,Q_X\]
where $Q(X)$ denotes the dequantization of $X$. The product $XW$ can then be approximated by

\[XW \approx Q(X)\,Q(W) = s_X\,s_W\,Q_X\,Q_W\]

SVDQuant introduces two-path quant...
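To make the notation concrete, here is a minimal sketch of symmetric int8 quantization and dequantization (so $q_{max} = 127$); the function names are my own, not from the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// s_X = max(|X|) / q_max, with q_max = 127 for int8.
float compute_scale(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    return amax / 127.0f;
}

// Q_X = round(X / s_X)
std::vector<int8_t> quantize(const std::vector<float>& x, float s) {
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(x[i] / s));
    return q;
}

// Q(X) = s_X * Q_X (dequantization)
std::vector<float> dequantize(const std::vector<int8_t>& q, float s) {
    std::vector<float> x(q.size());
    for (size_t i = 0; i < q.size(); ++i)
        x[i] = s * static_cast<float>(q[i]);
    return x;
}

int main() {
    std::vector<float> x = {0.1f, -1.7f, 2.3f, 0.9f};
    float s = compute_scale(x);
    auto q = quantize(x, s);
    auto xr = dequantize(q, s);
    for (size_t i = 0; i < x.size(); ++i)
        std::printf("%+.4f -> %+.4f\n", x[i], xr[i]);  // original vs. round-tripped
}
```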
Thoughts
Resolution: I'm a proud Nvidian. I work hard and ethically.
A plan for getting involved in an open source project
I have little experience with open source: currently I'm neither associated with nor active in any open source project. I want to be, and I have a plan for getting there. I believe it's a good plan, so other people who want to get involved but are struggling might find it helpful. Keep in mind that this is my plan and may not suit everyone.
Precaution
While many guides paint a rosy picture, keep in mind two key points: ...
CUDA Matmul with wmma
Matmul
There are several ways to do matmul in CUDA:
- Use libraries
  - CUTLASS
  - cuBLAS
- Implement your own
  - the naive way (sketched below)
  - tiled matmul, where you can also decide whether to use shared memory
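As a reference point, here is a minimal sketch of the naive approach: one thread per output element, no tiling, no shared memory (row-major matrices assumed).

```cpp
// C[M x N] = A[M x K] * B[K x N], one thread per output element.
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot(row of A, col of B)
        C[row * N + col] = acc;
    }
}
```

Every thread re-reads a full row of A and a full column of B from global memory, which is exactly the traffic that the tiled, shared-memory variant avoids.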
That's not all: yet another way to implement matmul is to use wmma. This acro...
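To give a flavor of the API, here is a minimal single-warp wmma kernel multiplying one 16x16 half-precision tile into a float accumulator. This is my own sketch using the standard 16x16x16 fragment shape; it needs a Volta-or-newer GPU (e.g. compile with -arch=sm_70).

```cpp
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile (half in, float out).
__global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // start the C tile at zero
    wmma::load_matrix_sync(a_frag, A, 16);  // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on tensor cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp: wmma_16x16x16<<<1, 32>>>(dA, dB, dC);
```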
Learning with chatGPT
Recently I've been learning things with ChatGPT, and I've found it extremely useful for learning something new.
Some Pros:
- I can dive into a codebase first with little prior knowledge. ChatGPT isn't very good at generating code yet, but it is good at explaining concepts like "Vector" in MLIR.
- It enables semantic search, not keyword-matching queries. For example, I can ask something like "what is that doing ..."; this kind of query yields bad answers with Google search, mo...
Implementing FlashAttention V1 naively
Warning
This is not a comprehensive tutorial. It's more of a note to myself recording the decisions I made while implementing naive FlashAttention V1, so sadly it also reflects the limits of my skills.
I already posted an introductory post about CUDA a year ago, but I haven't been using CUDA actively since writing it. It would be great if I continue to develop Parallel Computing sinc...
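For context on what FlashAttention V1 revolves around: its core device is the online-softmax update, which rescales running partial sums whenever a new maximum appears, so attention can be computed block by block in one pass. A minimal scalar sketch of that update (my own illustration, not code from the post):

```cpp
#include <cmath>
#include <cstdio>

// Online softmax: stream over scores while keeping a running max m,
// running denominator l, and running (unnormalized) output o.
int main() {
    float scores[6] = {0.5f, 2.0f, -1.0f, 3.0f, 0.0f, 1.5f};
    float values[6] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};

    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running softmax denominator
    float o = 0.0f;       // running weighted sum of values

    for (int i = 0; i < 6; ++i) {
        float m_new = std::fmax(m, scores[i]);
        float corr = std::exp(m - m_new);       // rescales earlier partial sums
        float p = std::exp(scores[i] - m_new);  // weight of the new element
        l = l * corr + p;
        o = o * corr + p * values[i];
        m = m_new;
    }
    // Equals dot(softmax(scores), values), computed in a single pass.
    std::printf("attention output: %f\n", o / l);
}
```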