I chose two small, recent, and different MoE models that fit my VRAM for a quick assessment.
I wanted to use MoE models to evaluate MXFP4, and an imatrix to evaluate the smallest quantization variants.
LFM2-8B-A1B, which activates 4 of its 32 experts.
OLMoE-1B-7B-0924-Instruct, which activates 8 of its 64 experts.
Conclusion:
While MXFP4 is highly efficient for LFM2-8B-A1B, it underperforms on OLMoE-1B-7B. For LFM2-8B-A1B, the Q8_0, Q5_0, and MXFP4 quants reach lower PPL than BF16, likely due to the imatrix optimization and/or overtraining of the model.
LFM2-8B-A1B
| Quant Type | PPL | Size (MiB) | BPW | Prompt (t/s) | Gen (t/s) |
|---|---|---|---|---|---|
| BF16 | 15.2248 | 15910.31 | 16.00 | OOM | OOM |
| Q8_0 | 15.1931 | 8455.31 | 8.50 | 5072.10 | 162.41 |
| Q6_K | 15.5124 | 6529.44 | 6.57 | 4436.58 | 175.56 |
| Q5_1 | 15.4030 | 5979.31 | 6.01 | 4625.45 | 209.11 |
| Q5_K_M | 16.0200 | 5643.04 | 5.68 | 4584.63 | 200.70 |
| Q5_0 | 14.8000 | 5499.06 | 5.53 | 4874.52 | 216.30 |
| Q5_K_S | 15.6033 | 5490.31 | 5.52 | 4697.02 | 209.59 |
| Q4_1 | 15.9842 | 5001.31 | 5.03 | 4770.76 | 232.50 |
| Q4_K_M | 15.8978 | 4808.79 | 4.84 | 4809.82 | 214.11 |
| Q4_K_S | 15.3757 | 4530.31 | 4.56 | 4877.01 | 221.24 |
| MXFP4 | 14.8134 | 4528.31 | 4.55 | 4992.58 | 198.64 |
| Q4_0 | 15.4652 | 4521.06 | 4.55 | 4993.89 | 232.26 |
| IQ4_NL | 15.7842 | 4512.31 | 4.54 | 5183.51 | 231.71 |
| IQ4_XS | 15.4901 | 4267.81 | 4.29 | 5169.28 | 226.73 |
| Q3_K_L | 16.7625 | 4123.39 | 4.15 | 4464.09 | 164.34 |
| Q3_K_M | 16.2523 | 3810.14 | 3.83 | 4497.96 | 166.04 |
| IQ3_M | 16.5738 | 3495.76 | 3.52 | 4802.77 | 191.22 |
| IQ3_S | 20.6474 | 3473.19 | 3.49 | 4798.82 | 190.23 |
| Q3_K_S | 16.9538 | 3473.19 | 3.49 | 4345.90 | 149.62 |
| IQ3_XS | 19.9761 | 3282.78 | 3.30 | 4812.42 | 195.83 |
| IQ3_XXS | 15.7687 | 3088.69 | 3.11 | 4913.44 | 204.55 |
| Q2_K | 16.7071 | 2934.70 | 2.95 | 3790.56 | 193.37 |
| Q2_K_S | 17.5891 | 2711.37 | 2.73 | 3626.85 | 217.85 |
| IQ2_M | 18.6788 | 2619.83 | 2.64 | 4259.97 | 209.24 |
| IQ2_S | 18.8633 | 2380.64 | 2.39 | 4175.02 | 211.03 |
| IQ2_XS | 19.9971 | 2363.04 | 2.38 | 4142.97 | 212.15 |
| IQ2_XXS | 23.3637 | 2123.11 | 2.14 | 5026.99 | 214.72 |
| IQ1_M | 29.3541 | 1824.12 | 1.83 | 2631.43 | 215.11 |
| IQ1_S | 49.0474 | 1644.73 | 1.65 | 4613.59 | 236.96 |
OLMoE-1B-7B-0924-Instruct
| Quant Type | PPL | Size (MiB) | BPW | Prompt (t/s) | Gen (t/s) |
|---|---|---|---|---|---|
| F16 | 10.1857 | 13201.51 | 16.01 | OOM | OOM |
| Q8_0 | 10.1944 | 7017.29 | 8.51 | 5259.40 | 187.13 |
| Q6_K | 10.2089 | 5419.70 | 6.57 | 4714.04 | 197.17 |
| Q5_1 | 10.2445 | 4962.79 | 6.02 | 4903.92 | 236.51 |
| Q5_K_M | 10.2588 | 4696.90 | 5.69 | 4922.98 | 224.95 |
| Q5_0 | 10.2994 | 4572.65 | 5.54 | 5109.75 | 240.62 |
| Q5_K_S | 10.2546 | 4556.65 | 5.52 | 4863.71 | 233.73 |
| Q4_1 | 10.3775 | 4150.51 | 5.03 | 4836.63 | 254.41 |
| Q4_K_M | 10.3730 | 4016.62 | 4.87 | 4924.75 | 232.58 |
| Q4_K_S | 10.3988 | 3778.37 | 4.58 | 5108.39 | 244.35 |
| Q4_0 | 10.4737 | 3760.37 | 4.56 | 5225.58 | 250.00 |
| MXFP4 | 10.8994 | 3753.29 | 4.55 | 5212.85 | 234.47 |
| IQ4_NL | 10.3706 | 3744.37 | 4.54 | 5487.97 | 256.29 |
| IQ4_XS | 10.3900 | 3541.30 | 4.29 | 5496.66 | 250.08 |
| Q3_K_L | 10.5341 | 3442.32 | 4.17 | 4730.45 | 195.50 |
| Q3_K_M | 10.6027 | 3187.32 | 3.86 | 4765.81 | 197.51 |
| IQ3_M | 10.8151 | 2932.32 | 3.56 | 5042.41 | 213.32 |
| IQ3_S | 10.9400 | 2881.32 | 3.49 | 5051.42 | 209.55 |
| Q3_K_S | 10.9314 | 2881.32 | 3.49 | 4616.22 | 173.28 |
| IQ3_XS | 11.0259 | 2731.32 | 3.31 | 5191.34 | 217.23 |
| IQ3_XXS | 11.4085 | 2563.27 | 3.11 | 5207.91 | 226.50 |
| Q2_K | 12.3217 | 2442.34 | 2.96 | 4187.02 | 214.87 |
| Q2_K_S | 14.0056 | 2281.34 | 2.77 | 3978.48 | 247.06 |
| IQ2_M | 12.1105 | 2218.77 | 2.69 | 4672.60 | 232.21 |
| IQ2_S | 13.1473 | 2030.77 | 2.46 | 4588.92 | 231.39 |
| IQ2_XS | 13.7881 | 1985.79 | 2.41 | 4542.42 | 236.08 |
| IQ2_XXS | 15.6348 | 1795.79 | 2.18 | 5272.91 | 236.27 |
| IQ1_M | 21.0811 | 1560.79 | 1.89 | 2805.94 | 238.75 |
| IQ1_S | 27.0239 | 1419.79 | 1.72 | 4901.74 | 246.70 |
Setup:
CPU: Intel Core i3-12100F
RAM: 64 GB of DDR4, dual channel
GPU: RTX 3060 with 12 GB of VRAM (core clock fixed at 1882 MHz via a curve, VRAM at 8210 MHz)
OS: Windows 11, Nvidia driver 591.74
Build: precompiled llama.cpp b8116 (492bc3197) for CUDA 13.1
Details:
LFM2-8B-A1B was quantized from unsloth/LFM2-8B-A1B-GGUF using LFM2-8B-A1B-BF16.gguf and the provided imatrix_unsloth.gguf file.
OLMoE-1B-7B-0924-Instruct was quantized from bartowski/OLMoE-1B-7B-0924-Instruct-GGUF using OLMoE-1B-7B-0924-Instruct-f16.gguf; I created its imatrix from wiki.train.raw.
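As a rough sketch, this is the kind of llama.cpp workflow involved; the output names and the Q4_K_M target below are just illustrative, and exact flags can differ between builds:

```sh
# Build an importance matrix from calibration text (done here for OLMoE).
llama-imatrix -m OLMoE-1B-7B-0924-Instruct-f16.gguf -f wiki.train.raw -o imatrix.gguf

# Quantize the high-precision GGUF to a target type, guided by the imatrix.
llama-quantize --imatrix imatrix.gguf OLMoE-1B-7B-0924-Instruct-f16.gguf \
  OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf Q4_K_M
```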
PPL is calculated on wiki.test.raw with a context of 512 tokens, while t/s is measured generating 2048 tokens with a context of 8192 tokens.
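The measurements correspond roughly to invocations like these (a sketch from memory, with -ngl 99 assumed for full GPU offload):

```sh
# Perplexity over wiki.test.raw, evaluated in 512-token chunks.
llama-perplexity -m OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf -f wiki.test.raw -c 512 -ngl 99

# Throughput: 8192-token prompt processing and 2048-token generation.
llama-bench -m OLMoE-1B-7B-0924-Instruct-Q4_K_M.gguf -p 8192 -n 2048
```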
