OpenTransformer
perf: optimized AVX2 kernel + COM6-inspired matmul dispatch (0.2 -> 3.43 t/s)
165bcc5 verified