OpenTransformer
perf: optimized AVX2 kernel + COM6-inspired matmul dispatch (0.2 -> 3.43 t/s)
8f4b822 verified