I chose two small, recent, and different MoE models that fit my VRAM for a quick assessment.
I wanted to use MoE models to evaluate MXFP4, and an imatrix to evaluate the smallest quantization variants.
LFM2-8B-A1B, which activates 4 of its 32 experts.
OLMoE-1B-7B-0924-Instruct, which activates 8 of its 64 experts.
Conclusion:
While MXFP4 is highly efficient for LFM2-8B-A1B, it underperforms on OLMoE-1B-7B. For LFM2-8B-A1B, the Q8_0, Q5_0, and MXFP4 quants reach lower PPL than BF16, likely due to the imatrix optimization and/or overtraining of the model.
LFM2-8B-A1B
| Quant Type | PPL | Size (MiB) | BPW | Prompt (t/s) | Gen (t/s) |
|---|---|---|---|---|---|
| BF16 | 15.2248 | 15910.31 | 16.00 | OOM | OOM |
| Q8_0 | 15.1931 | 8455.31 | 8.50 | 5072.10 | 162.41 |
| Q6_K | 15.5124 | 6529.44 | 6.57 | 4436.58 | 175.56 |
| Q5_1 | 15.4030 | 5979.31 | 6.01 | 4625.45 | 209.11 |
| Q5_K_M | 16.0200 | 5643.04 | 5.68 | 4584.63 | 200.70 |
| Q5_0 | 14.8000 | 5499.06 | 5.53 | 4874.52 | 216.30 |
| Q5_K_S | 15.6033 | 5490.31 | 5.52 | 4697.02 | 209.59 |
| Q4_1 | 15.9842 | 5001.31 | 5.03 | 4770.76 | 232.50 |
| Q4_K_M | 15.8978 | 4808.79 | 4.84 | 4809.82 | 214.11 |
| Q4_K_S | 15.3757 | 4530.31 | 4.56 | 4877.01 | 221.24 |
| MXFP4 | 14.8134 | 4528.31 | 4.55 | 4992.58 | 198.64 |
| Q4_0 | 15.4652 | 4521.06 | 4.55 | 4993.89 | 232.26 |
| IQ4_NL | 15.7842 | 4512.31 | 4.54 | 5183.51 | 231.71 |
| IQ4_XS | 15.4901 | 4267.81 | 4.29 | 5169.28 | 226.73 |
| Q3_K_L | 16.7625 | 4123.39 | 4.15 | 4464.09 | 164.34 |
| Q3_K_M | 16.2523 | 3810.14 | 3.83 | 4497.96 | 166.04 |
| IQ3_M | 16.5738 | 3495.76 | 3.52 | 4802.77 | 191.22 |
| IQ3_S | 20.6474 | 3473.19 | 3.49 | 4798.82 | 190.23 |
| Q3_K_S | 16.9538 | 3473.19 | 3.49 | 4345.90 | 149.62 |
| IQ3_XS | 19.9761 | 3282.78 | 3.30 | 4812.42 | 195.83 |
| IQ3_XXS | 15.7687 | 3088.69 | 3.11 | 4913.44 | 204.55 |
| Q2_K | 16.7071 | 2934.70 | 2.95 | 3790.56 | 193.37 |
| Q2_K_S | 17.5891 | 2711.37 | 2.73 | 3626.85 | 217.85 |
| IQ2_M | 18.6788 | 2619.83 | 2.64 | 4259.97 | 209.24 |
| IQ2_S | 18.8633 | 2380.64 | 2.39 | 4175.02 | 211.03 |
| IQ2_XS | 19.9971 | 2363.04 | 2.38 | 4142.97 | 212.15 |
| IQ2_XXS | 23.3637 | 2123.11 | 2.14 | 5026.99 | 214.72 |
| IQ1_M | 29.3541 | 1824.12 | 1.83 | 2631.43 | 215.11 |
| IQ1_S | 49.0474 | 1644.73 | 1.65 | 4613.59 | 236.96 |
OLMoE-1B-7B-0924-Instruct
| Quant Type | PPL | Size (MiB) | BPW | Prompt (t/s) | Gen (t/s) |
|---|---|---|---|---|---|
| F16 | 10.1857 | 13201.51 | 16.01 | OOM | OOM |
| Q8_0 | 10.1944 | 7017.29 | 8.51 | 5259.40 | 187.13 |
| Q6_K | 10.2089 | 5419.70 | 6.57 | 4714.04 | 197.17 |
| Q5_1 | 10.2445 | 4962.79 | 6.02 | 4903.92 | 236.51 |
| Q5_K_M | 10.2588 | 4696.90 | 5.69 | 4922.98 | 224.95 |
| Q5_0 | 10.2994 | 4572.65 | 5.54 | 5109.75 | 240.62 |
| Q5_K_S | 10.2546 | 4556.65 | 5.52 | 4863.71 | 233.73 |
| Q4_1 | 10.3775 | 4150.51 | 5.03 | 4836.63 | 254.41 |
| Q4_K_M | 10.3730 | 4016.62 | 4.87 | 4924.75 | 232.58 |
| Q4_K_S | 10.3988 | 3778.37 | 4.58 | 5108.39 | 244.35 |
| Q4_0 | 10.4737 | 3760.37 | 4.56 | 5225.58 | 250.00 |
| MXFP4 | 10.8994 | 3753.29 | 4.55 | 5212.85 | 234.47 |
| IQ4_NL | 10.3706 | 3744.37 | 4.54 | 5487.97 | 256.29 |
| IQ4_XS | 10.3900 | 3541.30 | 4.29 | 5496.66 | 250.08 |
| Q3_K_L | 10.5341 | 3442.32 | 4.17 | 4730.45 | 195.50 |
| Q3_K_M | 10.6027 | 3187.32 | 3.86 | 4765.81 | 197.51 |
| IQ3_M | 10.8151 | 2932.32 | 3.56 | 5042.41 | 213.32 |
| IQ3_S | 10.9400 | 2881.32 | 3.49 | 5051.42 | 209.55 |
| Q3_K_S | 10.9314 | 2881.32 | 3.49 | 4616.22 | 173.28 |
| IQ3_XS | 11.0259 | 2731.32 | 3.31 | 5191.34 | 217.23 |
| IQ3_XXS | 11.4085 | 2563.27 | 3.11 | 5207.91 | 226.50 |
| Q2_K | 12.3217 | 2442.34 | 2.96 | 4187.02 | 214.87 |
| Q2_K_S | 14.0056 | 2281.34 | 2.77 | 3978.48 | 247.06 |
| IQ2_M | 12.1105 | 2218.77 | 2.69 | 4672.60 | 232.21 |
| IQ2_S | 13.1473 | 2030.77 | 2.46 | 4588.92 | 231.39 |
| IQ2_XS | 13.7881 | 1985.79 | 2.41 | 4542.42 | 236.08 |
| IQ2_XXS | 15.6348 | 1795.79 | 2.18 | 5272.91 | 236.27 |
| IQ1_M | 21.0811 | 1560.79 | 1.89 | 2805.94 | 238.75 |
| IQ1_S | 27.0239 | 1419.79 | 1.72 | 4901.74 | 246.70 |
Setup:
CPU: Intel Core i3-12100F
RAM: 64 GB of DDR4, dual channel
GPU: RTX 3060 with 12 GB of VRAM (core clock fixed at 1882 MHz via a curve, VRAM at 8210 MHz)
OS: Windows 11, Nvidia driver 591.74
Build: precompiled llama.cpp b8116 (492bc3197) for CUDA 13.1
Details:
LFM2-8B-A1B was quantized from unsloth/LFM2-8B-A1B-GGUF using LFM2-8B-A1B-BF16.gguf and the provided imatrix_unsloth.gguf file.
OLMoE-1B-7B-0924-Instruct was quantized from bartowski/OLMoE-1B-7B-0924-Instruct-GGUF using OLMoE-1B-7B-0924-Instruct-f16.gguf; I created its imatrix from wiki.train.raw.
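As a rough sketch, this is the kind of llama.cpp workflow involved; the output names and the Q4_K_M target below are just illustrative, and exact flags can differ between builds:

```sh
# Build an importance matrix from calibration text (done here for OLMoE).
llama-imatrix -m OLMoE-1B-7B-0924-Instruct-f16.gguf -f wiki.train.raw -o imatrix.gguf

# Quantize the high-precision GGUF to a target type, guided by the imatrix.
llama-quantize --imatrix imatrix.gguf OLMoE-1B-7B-0924-Instruct-f16.gguf \
  OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf Q4_K_M
```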
PPL is calculated on wiki.test.raw with a context of 512 tokens, while t/s is measured generating 2048 tokens with a context of 8192 tokens.
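The measurements correspond roughly to invocations like these (a sketch from memory, with -ngl 99 assumed for full GPU offload):

```sh
# Perplexity over wiki.test.raw, evaluated in 512-token chunks.
llama-perplexity -m OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf -f wiki.test.raw -c 512 -ngl 99

# Throughput: 8192-token prompt processing and 2048-token generation.
llama-bench -m OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf -p 8192 -n 2048
```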
