ASHQ1: Autonomous Selective Hybrid Quantization

ASHQ1 is a post-training quantization method for GGUF models. It uses an imatrix-driven priority queue to maximize the theoretical signal-to-noise ratio per megabyte. Instead of uniform bit-depth allocation or heuristic layer-blocking, ASHQ1 treats tied tensor groups as single entities and greedily upgrades them according to a computed utility score.

The Breakthrough

By replacing empirical quality heuristics with a theoretical MSE-reduction estimate, ASHQ1 reaches lower perplexity than uniform Q6_K quantization at a significantly smaller file size.

| Quant | Method | Size | PPL (ctx=1024) | vs Q6_K |
|---|---|---|---|---|
| Q6_K (baseline) | Uniform | 7,008 MiB | 7.5876 ± 0.0495 | n/a |
| ASHQ1 v6 | Priority queue + MSE | 5,713 MiB | 7.5411 ± 0.0487 | -0.047 |

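At a 5,700 MiB target (5,713 MiB actual), ASHQ1 scores 0.047 PPL lower than uniform Q6_K while being about 18.5% smaller. The gap is within the reported ± ranges of both runs.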

Note: The model file is being uploaded now. My internet connection is very slow (~100 KB/s), so a full 5.6 GB upload takes around 20 hours. If the GGUF file is not yet available in the repo, it is still being uploaded. Please be patient.

I have put a lot of effort into developing this quantization method. ASHQ1 may not be released as open-source on GitHub due to a shadowban on my account and the difficulty of maintaining the project. This HF repo is the primary distribution channel.

How It Works

The classifier makes a single pass over a max-heap of candidate upgrades, applying them until a fixed size budget is spent.

1. Initialization (Strict Floors)

All upgradeable tensors start at a Q4_K floor. Specific architectural classes are hardcoded to prevent degradation:

  • norms / ssm_params → F16
  • token_embd → Q5_K
  • MTP head (blk.32) → Q5_K
  • Everything else → Q4_K

Note: The Q4_K floor is critical. Earlier iterations starting at IQ4_XS suffered PPL stagnation because non-linear 4-bit blocks cause disproportionately high activation noise in deep layers. The strict Q4_K floor eliminates this entirely.
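
The floor rules above can be sketched as a small name-based classifier. This is only an illustration: the function name, the regular expression, and the rule order are assumptions, not the ASHQ1 source. The MTP check is placed first because the llama-quantize command in the Reproducibility section assigns Q5_K to the blk.32 norm tensors.

```python
import re

# Assumed name patterns for "norms / ssm_params", modelled on the tensors
# that the llama-quantize command in this card pins to F16.
NORM_OR_SSM_PARAM = re.compile(r"(_norm|\.ssm_a|\.ssm_dt|\.ssm_conv1d)(\.|$)")
MTP_BLOCK = 32  # blk.32 holds the MTP head in this model


def initial_tier(name: str) -> str:
    """Return the starting tier for one tensor name (hypothetical helper)."""
    block = re.match(r"blk\.(\d+)\.", name)
    # Assumed precedence: the MTP block wins over the norm rule.
    if block and int(block.group(1)) == MTP_BLOCK:
        return "Q5_K"
    if NORM_OR_SSM_PARAM.search(name):
        return "F16"
    if name.startswith("token_embd"):
        return "Q5_K"
    return "Q4_K"  # floor for every other (upgradeable) tensor


for tensor in ("blk.3.attn_norm.weight", "token_embd.weight",
               "blk.32.attn_q.weight", "blk.5.ffn_down.weight"):
    print(tensor, "->", initial_tier(tensor))
```

Only tensors that land on Q4_K or Q5_K here would enter the upgrade queue; F16 tensors have nowhere further to go.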

2. Importance Weighting

Tensor importance is derived from the imatrix and scaled by architectural depth:

timp[t] = imp[t] × depth_factor(layer)

  • First layer (0): 2.0x
  • Last 5 layers: 1.5x
  • Middle layers: 1.0x
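
A minimal sketch of the depth weighting, assuming layers are indexed 0..n_layers-1. The card does not say whether the MTP block counts toward the "last 5 layers", so n_layers is left as a parameter. The helper names and the layer_of mapping are hypothetical.

```python
def depth_factor(layer: int, n_layers: int, tail: int = 5) -> float:
    """Multiplier applied to a tensor's imatrix importance."""
    if layer == 0:
        return 2.0              # first layer
    if layer >= n_layers - tail:
        return 1.5              # last `tail` layers
    return 1.0                  # middle layers


def weighted_importance(imp, layer_of, n_layers):
    """timp[t] = imp[t] * depth_factor(layer) for every tensor t."""
    return {t: imp[t] * depth_factor(layer_of[t], n_layers) for t in imp}


# Toy importances for three tensors in a 32-layer model.
imp = {"blk.0.attn_q": 0.5, "blk.10.attn_q": 2.0, "blk.30.attn_q": 1.0}
layer_of = {"blk.0.attn_q": 0, "blk.10.attn_q": 10, "blk.30.attn_q": 30}
print(weighted_importance(imp, layer_of, n_layers=32))
```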

3. Tied Group Aggregation

Numerically identical tensors (e.g., ffn_gate = ffn_up) are detected and treated as a single entity in the queue. Their importance is summed (sum(timp)). Because the cost of upgrading a group grows with its size in the same way, utility per MB stays comparable between groups of different sizes.
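
One way to find such groups is to hash the raw tensor data and bucket names by digest. The sketch below assumes byte-identical data as a stand-in for "numerically identical"; how ASHQ1 actually detects ties is not stated here, and both function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def tied_groups(tensors: dict[str, bytes]) -> list[list[str]]:
    """Group tensor names whose raw data is byte-identical."""
    by_digest = defaultdict(list)
    for name, data in tensors.items():
        key = (len(data), hashlib.sha256(data).hexdigest())
        by_digest[key].append(name)
    return [sorted(names) for names in by_digest.values()]


def group_importance(group, timp):
    """Summed depth-weighted importance of one tied group."""
    return sum(timp[name] for name in group)


# Toy data: gate and up share their bytes, down does not.
tensors = {
    "blk.0.ffn_gate": b"\x01\x02",
    "blk.0.ffn_up": b"\x01\x02",
    "blk.0.ffn_down": b"\x03\x04",
}
print(tied_groups(tensors))
```

A singleton is just a group of size one, so the queue can treat every entry the same way.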

4. The Priority Queue

All possible single-step upgrades are pushed into a max-heap. The utility metric is defined as:

utility_per_mb = sum(timp[group]) × ΔMSE / cost_delta

Where the theoretical MSE reduction is:

ΔMSE = 2^(-2 × bpw_cur) - 2^(-2 × bpw_next)

The queue drains by popping the highest utility-per-MB upgrade, applying it, and pushing the next possible upgrade for that group until the target size budget is exhausted. Zero-cost upgrades are assigned inf priority to ensure they always apply.
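
The drain loop can be sketched with Python's heapq, which is a min-heap, so utilities are pushed negated. Several details are assumptions rather than confirmed ASHQ1 behaviour: the four-tier ladder, the estimate of storage cost from the same bpw figures, the budget being counted as MiB above the floor, and the choice to skip an upgrade that does not fit rather than stop at it.

```python
import heapq
import math

MSE_BPW = {"Q4_K": 4.50, "Q5_K": 5.50, "Q6_K": 6.5625, "Q8_0": 8.50}
LADDER = ["Q4_K", "Q5_K", "Q6_K", "Q8_0"]  # assumed upgrade order


def delta_mse(cur, nxt):
    return 2.0 ** (-2 * MSE_BPW[cur]) - 2.0 ** (-2 * MSE_BPW[nxt])


def size_mib(n_weights, tier):
    # Assumption: storage bpw equals MSE_BPW for the tiers used here.
    return n_weights * MSE_BPW[tier] / 8 / 2**20


def drain(groups, budget_mib):
    """groups: {name: (sum_timp, n_weights)} -> {name: final tier}."""
    tier = {g: LADDER[0] for g in groups}
    heap = []

    def push_next(g):
        i = LADDER.index(tier[g])
        if i + 1 == len(LADDER):
            return  # already at the top tier
        nxt = LADDER[i + 1]
        imp, n = groups[g]
        cost = size_mib(n, nxt) - size_mib(n, tier[g])
        utility = math.inf if cost <= 0 else imp * delta_mse(tier[g], nxt) / cost
        heapq.heappush(heap, (-utility, g, nxt, cost))

    for g in groups:
        push_next(g)
    remaining = budget_mib
    while heap:
        _, g, nxt, cost = heapq.heappop(heap)
        if cost > remaining:
            continue  # assumed policy: skip, cheaper upgrades may still fit
        tier[g] = nxt
        remaining -= cost
        push_next(g)
    return tier


groups = {"a": (10.0, 8 * 2**20), "b": (1.0, 8 * 2**20)}
print(drain(groups, budget_mib=2.1))
```

In this toy run the high-importance group takes two upgrades before the low-importance one gets any, which is the behaviour the utility formula is meant to produce.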

MSE_BPW Calibration

The table below lists the effective bits-per-weight used in the MSE calculation. Note that IQ4_XS is empirically lowered from its theoretical 4.25 to 4.00 to reflect its actual noise profile in deep transformers.

| Tier | MSE_BPW |
|---|---|
| F16 | 16.0 |
| Q8_0 | 8.50 |
| Q6_K | 6.5625 |
| Q5_K | 5.50 |
| Q4_K | 4.50 |
| IQ4_NL | 4.40 |
| IQ4_XS | 4.00 (empirically corrected) |
| Q3_K | 3.4375 |
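
To see what these values imply, the snippet below plugs them into the ΔMSE formula from the previous section. The dictionary copies the table; the function names and the choice of upgrade steps printed are illustrative assumptions.

```python
MSE_BPW = {
    "F16": 16.0, "Q8_0": 8.50, "Q6_K": 6.5625, "Q5_K": 5.50,
    "Q4_K": 4.50, "IQ4_NL": 4.40, "IQ4_XS": 4.00, "Q3_K": 3.4375,
}


def modeled_mse(tier):
    """Theoretical noise term 2^(-2 * bpw) for one tier."""
    return 2.0 ** (-2 * MSE_BPW[tier])


def delta_mse(cur, nxt):
    """Modeled MSE reduction when moving from tier `cur` to tier `nxt`."""
    return modeled_mse(cur) - modeled_mse(nxt)


for cur, nxt in [("Q4_K", "Q5_K"), ("Q5_K", "Q6_K"), ("Q6_K", "Q8_0")]:
    print(f"{cur} -> {nxt}: {delta_mse(cur, nxt):.3e}")

# Effect of the IQ4_XS correction relative to its theoretical 4.25 bpw.
print(modeled_mse("IQ4_XS") / 2.0 ** (-2 * 4.25))
```

Each step up the ladder buys a smaller absolute reduction than the one before it, so later upgrades need higher importance or lower cost to win a place in the queue. Lowering IQ4_XS from 4.25 to 4.00 raises its modeled noise by a factor of 2^0.5, roughly 1.41.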

This Quant

| Property | Value |
|---|---|
| File | Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf |
| Size | 5,713 MiB (5.6 GB) |
| Target | 5,700 MiB |
| Deviation from target | +13 MiB (GGUF overhead) |
| Base type | Q5_K_M |
| PPL | 7.5411 ± 0.04865 |
| MTP head tier | Q5_K |
| Tier distribution | Q5_K=68, Q6_K=97, Q4_K=100, F16=177 |

Speed (GTX 1070 + MTP Speculation)

| Mode | Tokens/sec |
|---|---|
| MTP speculation | ~34 t/s |

Note: At 5700 MiB, the budget is too tight to allocate Q8_0 to attention tensors, so the MSE queue trades inference speed for the lowest achievable PPL at this compression level. At larger budgets (6800 MiB and up), the queue upgrades attention tensors to Q8_0, which improves decoding speed without hurting PPL.

Usage

MTP Speculative Decoding

llama-cli \
  -m Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Your prompt" \
  -ngl 99 --flash-attn on \
  -c 4096

Recommended Sampling

temperature 0.6, top_k 20, top_p 0.95, min_p 0. If the model starts looping, add repeat_penalty 1.05.
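
For scripted runs, these settings map onto llama.cpp's sampling flags (--temp, --top-k, --top-p, --min-p, --repeat-penalty). The helper below is a hypothetical convenience that builds the flag list to append to the llama-cli command shown above; it is not part of ASHQ1.

```python
import shlex

# Recommended values from this card, keyed by llama.cpp flag name.
SAMPLING = {"temp": 0.6, "top-k": 20, "top-p": 0.95, "min-p": 0.0}


def sampling_flags(looping: bool = False) -> list[str]:
    """Build llama-cli sampling arguments (hypothetical helper)."""
    flags = []
    for key, value in SAMPLING.items():
        flags += [f"--{key}", str(value)]
    if looping:
        # Only add the penalty when the model gets stuck repeating itself.
        flags += ["--repeat-penalty", "1.05"]
    return flags


print(shlex.join(sampling_flags(looping=True)))
```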

Reproducibility

Full llama-quantize command generated by the ASHQ1 classifier:

/home/maxyag27/llm-tools/llama.cpp/build/bin/llama-quantize \
  --imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
  --output-tensor-type Q5_K \
  --token-embedding-type Q5_K \
  --tensor-type "(blk|BLK)\.(32)\.nextn=Q5_K" \
  --tensor-type "(blk|BLK)\.(31)\.attn_output=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_beta=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_alpha=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_qkv=Q6_K" \
  --tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_gate=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q=Q6_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_v=Q6_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_k=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.post_attention_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_v=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_k_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_q_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_norm=Q5_K" \
  --tensor-type "(blk|BLK)\.(32)\.attn_q=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:31|32))\.ffn_down=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30))\.ffn_down=Q4_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.post_attention_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_a=F16" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.attn_norm=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_dt=F16" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_conv1d=F16" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k_norm=F16" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q_norm=F16" \
  --tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_up=Q5_K" \
  --tensor-type "(blk|BLK)\.(27|32)\.attn_output=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_gate=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_gate=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_alpha=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:28|29|30))\.ssm_out=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_qkv=Q5_K" \
  --tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_beta=Q5_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_up=Q4_K" \
  --tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_gate=Q4_K" \
  --tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26))\.ssm_out=Q4_K" \
  --tensor-type "(blk|BLK)\.(3|7|11|15|19|23)\.attn_output=Q4_K" \
  --tensor-type ".*output_norm.*=F16" \
  /mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
  Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf
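
The --tensor-type arguments above follow a regular shape: one pattern per (tensor name, tier) pair, with the matching layer indices joined in an alternation. The sketch below shows how a tier assignment could be turned into such arguments. It is an assumption about the generator, not its actual code, and it does not compress runs of layers into ranges such as [0-2] the way the real output does.

```python
from collections import defaultdict


def tensor_type_args(assignment):
    """assignment: {(layer, tensor_name): tier} -> llama-quantize arguments."""
    layers = defaultdict(list)
    for (layer, name), tier in assignment.items():
        layers[(name, tier)].append(layer)
    args = []
    for (name, tier), indices in layers.items():
        alternation = "|".join(str(i) for i in sorted(indices))
        # Same pattern shape as the command above: (blk|BLK)\.(...)\.name=TIER
        pattern = rf"(blk|BLK)\.({alternation})\.{name}={tier}"
        args += ["--tensor-type", pattern]
    return args


assignment = {
    (3, "attn_k"): "Q6_K",
    (7, "attn_k"): "Q6_K",
    (32, "attn_k"): "Q5_K",
}
for arg in tensor_type_args(assignment):
    print(arg)
```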

All Results (Qwen3.5-9B fine-tunes)

| Target (MiB) | Model | MTP | Actual | PPL |
|---|---|---|---|---|
| 5500 | Qwable | -- | 5,503 MiB | 7.4334 |
| 5700 | Qwythos | yes | 5,713 MiB | 7.5411 |

