I think this error is large enough that referring to it as FP32 is misleading.
Also, the performance gains do not translate to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks the optimized hardware for half precision.
But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.
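For anyone who wants a feel for the size of the precision gap, here is a minimal pure-Python sketch. It assumes the error comes from rounding the inputs to FP16 while accumulating in FP32, which I have not verified against the kernel itself. The helper names (`to_fp16`, `to_fp32`, `dot`, `relative_errors`) are made up for this illustration, and the rounding is emulated on the CPU with `struct`, so it says nothing about speed.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_input, round_acc):
    """Dot product with configurable input rounding and accumulator rounding."""
    acc = 0.0
    for x, y in zip(a, b):
        product = round_acc(round_input(x) * round_input(y))
        acc = round_acc(acc + product)
    return acc


def relative_errors(k: int = 1024, trials: int = 8, seed: int = 0):
    """Mean error of FP32 and of FP16-input/FP32-accumulate dot products.

    Errors are measured against a float64 reference and normalized by
    sum(|a_i * b_i|). The mixed scheme is an assumption, not the kernel's
    confirmed behavior.
    """
    rng = random.Random(seed)
    fp32_total = 0.0
    mixed_total = 0.0
    for _ in range(trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        exact = dot(a, b, float, float)      # float64 reference
        fp32 = dot(a, b, to_fp32, to_fp32)   # true single precision
        mixed = dot(a, b, to_fp16, to_fp32)  # assumed mixed-precision scheme
        scale = sum(abs(x * y) for x, y in zip(a, b))
        fp32_total += abs(fp32 - exact) / scale
        mixed_total += abs(mixed - exact) / scale
    return fp32_total / trials, mixed_total / trials


if __name__ == "__main__":
    fp32_err, mixed_err = relative_errors()
    print(f"FP32 mean relative error:  {fp32_err:.3e}")
    print(f"mixed mean relative error: {mixed_err:.3e}")
```

Running it prints both mean errors side by side. If the mixed figure is far above the FP32 one, that is the gap I mean when I say the FP32 label is misleading.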