ASHQ1 — Autonomous Selective Hybrid Quantization

ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximize theoretical signal-to-noise ratio per megabyte. Instead of allocating a uniform bit depth or blocking layers heuristically, ASHQ1 treats tied tensor groups as single units and greedily upgrades them in order of importance-weighted theoretical MSE reduction per megabyte.

The Breakthrough

By replacing empirical quality heuristics with theoretical MSE reduction, ASHQ1 reaches a lower measured perplexity than uniform Q6_K in the run below, at a smaller file size.

Quant	Method	Size	PPL (ctx=1024)	ΔPPL vs Q6_K
Q6_K (baseline)	Uniform	7,008 MiB	7.5876 ± 0.0495	—
ASHQ1 v6	Priority queue + MSE	5,713 MiB	7.5411 ± 0.0487	-0.047

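ASHQ1 at 5,713 MiB (5,700 MiB target) measures 0.047 PPL lower than uniform Q6_K (7.5411 vs 7.5876) while being about 18% smaller (1,295 MiB saved). The PPL difference is within the reported ± intervals, so the safe reading is parity or better at the smaller size.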

Note: The model file is still uploading. My internet connection is very slow (~100 KB/s), so the full 5.6 GiB upload takes around 20 hours. If the GGUF file is not in the repo yet, please be patient.

I have put a lot of effort into developing this quantization method. ASHQ1 may not be released as open-source on GitHub due to a shadowban on my account and the difficulty of maintaining the project. This HF repo is the primary distribution channel.

How It Works

The classifier makes a single pass over a max-heap of candidate upgrades, applying them in priority order until a fixed size budget is spent.

1. Initialization (Strict Floors)

All upgradeable tensors start at a Q4_K floor. Specific architectural classes are hardcoded to prevent degradation:

norms / ssm_params → F16
token_embd → Q5_K
MTP head (blk.32) → Q5_K
Everything else → Q4_K
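
A minimal Python sketch of the floor assignment above. The name patterns (norm, ssm_a, ssm_dt, ssm_conv1d, token_embd) and the rule order are assumptions inferred from the tensor names that appear in this card, not the classifier's actual matching code. FLOOR_RULES and floor_tier are hypothetical names.

```python
import re

# Hypothetical (pattern, floor tier) rules, checked in order; first match wins.
# The MTP head rule comes first so that every blk.32 tensor lands on Q5_K,
# which mirrors the blk.32 overrides in the llama-quantize command in this card.
FLOOR_RULES = [
    (re.compile(r"^blk\.32\."), "Q5_K"),                       # MTP head
    (re.compile(r"norm"), "F16"),                              # all norm tensors
    (re.compile(r"\.ssm_(a|dt|conv1d)(\.|$)"), "F16"),         # assumed ssm_params
    (re.compile(r"^token_embd\."), "Q5_K"),                    # token embedding
]
DEFAULT_FLOOR = "Q4_K"  # everything else


def floor_tier(tensor_name: str) -> str:
    """Return the starting tier for a tensor before any upgrade is applied."""
    for pattern, tier in FLOOR_RULES:
        if pattern.search(tensor_name):
            return tier
    return DEFAULT_FLOOR
```

For example, floor_tier("blk.5.ffn_down.weight") falls through to the Q4_K default, while floor_tier("blk.5.attn_norm.weight") is pinned to F16.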

Note: The Q4_K floor is critical. Earlier iterations starting at IQ4_XS suffered PPL stagnation because non-linear 4-bit blocks cause disproportionately high activation noise in deep layers. The strict Q4_K floor eliminates this entirely.

2. Importance Weighting

Tensor importance is derived from the imatrix, scaled by architectural depth:

timp[t] = imp[t] × depth_factor(layer)

First layer (0): 2.0x
Last 5 layers: 1.5x
Middle layers: 1.0x
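
The depth scaling above can be sketched in Python as follows. The n_layers parameter is an assumption: the card does not say whether the MTP block (blk.32) counts toward the "last 5 layers", so the caller decides what to pass.

```python
def depth_factor(layer: int, n_layers: int) -> float:
    """Depth multiplier: 2.0 for layer 0, 1.5 for the last five layers, else 1.0."""
    if layer == 0:
        return 2.0
    if layer >= n_layers - 5:
        return 1.5
    return 1.0


def weighted_importance(imp: float, layer: int, n_layers: int) -> float:
    """timp[t] = imp[t] * depth_factor(layer)."""
    return imp * depth_factor(layer, n_layers)
```

With n_layers = 32, layers 27 through 31 get the 1.5x factor and layers 1 through 26 stay at 1.0x.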

3. Tied Group Aggregation

Numerically identical tensors (e.g., ffn_gate = ffn_up) are detected and treated as a single entity in the queue. Their importance is summed (sum(timp)) and the group is upgraded as a whole, so its utility per MB stays comparable with that of groups of any other size.
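
One way to detect tied tensors is to hash their raw bytes and group names that share a digest. This is a sketch under that assumption; the card does not say how ASHQ1 detects identical tensors, and tied_groups and group_importance are hypothetical helpers.

```python
import hashlib
from collections import defaultdict


def tied_groups(tensors: dict[str, bytes]) -> list[list[str]]:
    """Group tensor names whose raw data is byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, raw in tensors.items():
        by_digest[hashlib.sha256(raw).hexdigest()].append(name)
    # Untied tensors simply end up as groups of size one.
    return [sorted(names) for names in by_digest.values()]


def group_importance(group: list[str], timp: dict[str, float]) -> float:
    """Summed depth-weighted importance of every tensor in the group."""
    return sum(timp[name] for name in group)
```

A group produced this way enters the queue as one item, carrying its summed importance and the combined size of its members.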

4. The Priority Queue

All possible single-step upgrades are pushed into a max-heap. The utility metric is defined as:

utility_per_mb = sum(timp[group]) × ΔMSE / cost_delta

Where the theoretical MSE reduction is:

ΔMSE = 2^(-2 × bpw_cur) - 2^(-2 × bpw_next)

The queue drains by popping the highest utility-per-MB upgrade, applying it, and pushing the next possible upgrade for that group until the target size budget is exhausted. Zero-cost upgrades are assigned inf priority to ensure they always apply.
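
The drain loop described above can be sketched with Python's heapq (a min-heap, so utilities are negated). Several details are assumptions: the upgrade ladder Q4_K → Q5_K → Q6_K → Q8_0, the use of the same bits-per-weight values for both size cost and MSE, and the choice to skip an upgrade that no longer fits instead of stopping. The budget here is the headroom above the floor size, and allocate is a hypothetical name.

```python
import heapq
import math

MSE_BPW = {"Q4_K": 4.50, "Q5_K": 5.50, "Q6_K": 6.5625, "Q8_0": 8.50}
LADDER = ["Q4_K", "Q5_K", "Q6_K", "Q8_0"]  # assumed upgrade order


def delta_mse(cur: str, nxt: str) -> float:
    """Theoretical MSE reduction: 2^(-2*bpw_cur) - 2^(-2*bpw_next)."""
    return 2.0 ** (-2 * MSE_BPW[cur]) - 2.0 ** (-2 * MSE_BPW[nxt])


def utility(importance: float, cur: str, nxt: str, cost_mb: float) -> float:
    """utility_per_mb; zero-cost upgrades get infinite priority."""
    if cost_mb <= 0:
        return math.inf
    return importance * delta_mse(cur, nxt) / cost_mb


def allocate(groups: dict[str, tuple[float, int]], budget_mb: float):
    """groups maps name -> (summed importance, weight count). Returns (tiers, leftover MiB)."""
    tiers = {name: LADDER[0] for name in groups}
    heap = []

    def push_next(name: str) -> None:
        idx = LADDER.index(tiers[name])
        if idx + 1 == len(LADDER):
            return  # already at the top tier
        cur, nxt = LADDER[idx], LADDER[idx + 1]
        importance, n_weights = groups[name]
        cost = n_weights * (MSE_BPW[nxt] - MSE_BPW[cur]) / 8 / 2**20  # MiB
        heapq.heappush(heap, (-utility(importance, cur, nxt, cost), name, nxt, cost))

    for name in groups:
        push_next(name)

    remaining = budget_mb
    while heap:
        _, name, nxt, cost = heapq.heappop(heap)
        if cost > remaining:
            continue  # assumption: skip upgrades that do not fit
        tiers[name] = nxt
        remaining -= cost
        push_next(name)  # queue this group's next possible step
    return tiers, remaining
```

With two groups of 8 Mi weights each, importances 10 and 1, and 2.5 MiB of headroom, the important group is upgraded twice (to Q6_K) before the other group gets anything, because its second step still has higher utility per MiB than the other group's first step.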

MSE_BPW Calibration

MSE_BPW is the effective bits-per-weight used in the MSE calculation. IQ4_XS is lowered empirically from its nominal 4.25 to 4.00 to reflect its actual noise profile in deep transformers.

Tier	MSE_BPW
F16	16.0
Q8_0	8.50
Q6_K	6.5625
Q5_K	5.50
Q4_K	4.50
IQ4_NL	4.40
IQ4_XS	4.00 (empirically corrected)
Q3_K	3.4375
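
To make the table concrete, the following Python sketch evaluates the 2^(-2 × bpw) term for a few tier steps and shows what the IQ4_XS correction does to its modeled noise. The function names are illustrative; the values come straight from the table.

```python
# Effective bits-per-weight, copied from the MSE_BPW table.
MSE_BPW = {
    "F16": 16.0, "Q8_0": 8.50, "Q6_K": 6.5625, "Q5_K": 5.50,
    "Q4_K": 4.50, "IQ4_NL": 4.40, "IQ4_XS": 4.00, "Q3_K": 3.4375,
}


def mse_proxy(bpw: float) -> float:
    """Theoretical relative MSE term 2^(-2 * bpw)."""
    return 2.0 ** (-2.0 * bpw)


def delta_mse(cur: str, nxt: str) -> float:
    """MSE reduction from upgrading tier cur to tier nxt."""
    return mse_proxy(MSE_BPW[cur]) - mse_proxy(MSE_BPW[nxt])


for cur, nxt in [("Q4_K", "Q5_K"), ("Q5_K", "Q6_K"), ("Q6_K", "Q8_0")]:
    print(f"{cur} -> {nxt}: {delta_mse(cur, nxt):.3e}")

# Ratio of modeled noise at the corrected 4.00 bpw vs the nominal 4.25 bpw.
print(f"IQ4_XS noise ratio: {mse_proxy(4.00) / mse_proxy(4.25):.3f}")
```

The three steps come out at about 1.465e-03, 3.763e-04, and 1.043e-04, so each successive upgrade buys roughly a quarter of the previous one before importance and cost are factored in. Lowering IQ4_XS from 4.25 to 4.00 raises its modeled noise by a factor of 2^0.5, about 1.41.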

This Quant

Property	Value
File	Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf
Size	5,713 MiB (5.6 GiB)
Target	5,700 MiB
Budget overshoot	+13 MiB (GGUF overhead)
Base type	Q5_K_M
PPL	7.5411 ± 0.04865
MTP head tier	Q5_K
Tier distribution	Q5_K=68, Q6_K=97, Q4_K=100, F16=177

Speed (GTX 1070 + MTP Speculation)

Mode	Tokens/sec
MTP speculation	~34 t/s

Note: At 5,700 MiB, the budget is too tight to allocate Q8_0 to attention tensors. At this compression level the MSE queue trades inference speed for the lowest possible PPL. At larger budgets (6,800 MiB and up), the queue upgrades attention to Q8_0 on its own, which improves decoding speed without sacrificing PPL.

Usage

MTP Speculative Decoding

llama-cli \
  -m Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Your prompt" \
  -ngl 99 --flash-attn on \
  -c 4096

Recommended Sampling

temperature 0.6, top_k 20, top_p 0.95, min_p 0. If the output starts looping, set repeat_penalty to 1.05.

Reproducibility

Full llama-quantize command generated by the ASHQ1 classifier:

/home/maxyag27/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q5_K \
  --tensor-type "(blk|BLK)\.(32)\.nextn=Q5_K" \
  --tensor-type "(blk|BLK)\.(31)\.attn_output=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_beta=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_alpha=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_qkv=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_gate=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_v=Q6_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_k=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.post_attention_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_v=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_k_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_q_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_q=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:31|32))\.ffn_down=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30))\.ffn_down=Q4_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.post_attention_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_a=F16" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.attn_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_dt=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_conv1d=F16" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k_norm=F16" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q_norm=F16" \
  --tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_up=Q5_K" \
  --tensor-type "(blk|BLK)\.(27|32)\.attn_output=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_gate=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_gate=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_alpha=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:28|29|30))\.ssm_out=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_qkv=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_beta=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_up=Q4_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_gate=Q4_K" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26))\.ssm_out=Q4_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23)\.attn_output=Q4_K" \
  --tensor-type ".*output_norm.*=F16" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf
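
To check which overrides in the command above touch a given tensor, the patterns can be tested with Python's re module. This sketch assumes each pattern is applied as a regex search against the tensor name, and it does not model how llama-quantize resolves overlapping matches. Only a handful of the overrides are copied here, and matching_tiers is a hypothetical helper.

```python
import re

# A few (pattern, tier) pairs copied verbatim from the --tensor-type arguments.
OVERRIDES = [
    (r"(blk|BLK)\.(32)\.nextn", "Q5_K"),
    (r"(blk|BLK)\.(31)\.attn_output", "Q6_K"),
    (r"(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q", "Q6_K"),
    (r"(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_up", "Q5_K"),
    (r"(blk|BLK)\.(3|7|11|15|19|23)\.attn_output", "Q4_K"),
]


def matching_tiers(tensor_name: str) -> list[str]:
    """Return the tier of every listed override whose pattern matches the name."""
    return [tier for pattern, tier in OVERRIDES if re.search(pattern, tensor_name)]
```

For instance, matching_tiers("blk.31.attn_output.weight") returns only the Q6_K override, and matching_tiers("blk.3.attn_output.weight") returns only the Q4_K one.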

All Results (Qwen3.5-9B fine-tunes)

Target (MiB)	Model	MTP	Actual size	PPL
5500	Qwable	--	5,503 MiB	7.4334
5700	Qwythos	yes	5,713 MiB	7.5411

References

Original Qwythos MTP: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M
ASHQ1 source: https://huggingface.co/wepiqx/ASHQ1


Model tree for wepiqx/ASHQ1: Qwen/Qwen3.5-9B-Base (base model) → Qwen/Qwen3.5-9B (finetuned) → empero-ai/Qwythos-9B-Claude-Mythos-5-1M (finetuned) → this model