In recent years, auto-tuning has been used to accelerate operators in AI compilers, but it has several limitations. For example, auto-tuned results are usually worse than vendor libraries on real-world industrial online models, and tuning times are extremely long: tuning an end-to-end model inference on a GPU can take hours to days. In this talk we will discuss how we address these limitations. By integrating CUTLASS into TVM, we can now generate kernels on NVIDIA GPUs with performance comparable to state-of-the-art hand-written, closed-source vendor libraries (cuBLAS and cuDNN) without extensive tuning. At the same time, TVM can perform more aggressive graph-level optimizations than existing AI compiler systems. The first part of the talk introduces CUTLASS, and the second part illustrates how we integrate it into TVM.
This session is broken into two parts: a 20-minute talk and a 10-minute community breakout session.