I ran their matrix multiplication code from GitHub (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) and got a mean squared error of approximately 0.056 for two 4096x4096 matrices containing random values between 0 and 1.

I think this error is large enough that referring to it as FP32 is misleading.
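
For anyone who wants to run a similar check, here is a minimal sketch of how a mean squared error against a higher-precision reference can be measured. It is not the kernel from the repository: it assumes, purely for illustration, that the only precision loss comes from rounding the inputs to half precision, and it uses small matrices and the standard library so it runs anywhere. The helper names (`to_fp16`, `matmul`, `mse`, `fp16_input_mse`) are hypothetical.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def matmul(a, b):
    """Naive matrix product of two lists of rows, accumulated in float64."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def mse(c, ref):
    """Mean squared error between two equally shaped matrices."""
    count = sum(len(row) for row in ref)
    total = sum((x - y) ** 2 for rc, rr in zip(c, ref) for x, y in zip(rc, rr))
    return total / count


def fp16_input_mse(n: int, seed: int = 0) -> float:
    """MSE caused by rounding both n x n inputs to half precision.

    Assumption: values are uniform in [0, 1), as in the comment above.
    The real kernel may lose precision elsewhere (accumulator, output).
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    ref = matmul(a, b)  # float64 reference result
    a16 = [[to_fp16(x) for x in row] for row in a]
    b16 = [[to_fp16(x) for x in row] for row in b]
    return mse(matmul(a16, b16), ref)


if __name__ == "__main__":
    # Small size so the pure-Python loop finishes quickly.
    print(fp16_input_mse(64))
```

To check an actual kernel, the same `mse` comparison would be applied to its output and a reference product computed in FP32 or FP64 from the same inputs.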

Also, the performance gains do not carry over to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks hardware optimized for half precision.

But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.