Qwen3-14B-AWQ

A 4-bit AWQ-quantized version of Qwen/Qwen3-14B.

Model Details

  • Base Model: Qwen/Qwen3-14B
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Bits: 4-bit
  • Group Size: 128
  • Zero Point: True
  • Model Size: ~9.5 GB (from ~28 GB)
  • Context Length: 32,768

Quantization Configuration

{
  "zero_point": true,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM"
}
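
For reference, a quantization with this configuration can be reproduced with AutoAWQ along the following lines (a minimal sketch: the output path is an assumption, and AutoAWQ's defaults match the pileval calibration with 128 samples noted under Limitations):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-14B"
quant_path = "Qwen3-14B-AWQ"  # output directory (assumed)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 base model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate (AutoAWQ defaults to 128 pileval samples) and quantize to 4-bit
model.quantize(tokenizer, quant_config=quant_config)

# Write the quantized checkpoint and tokenizer files
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)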

Usage

With vLLM (Recommended for faster inference)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model froogai/Qwen3-14B-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
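
The server exposes an OpenAI-compatible API (port 8000 by default; recent vLLM releases also offer the equivalent vllm serve command). A minimal client sketch using the openai package, assuming a default local setup:

from openai import OpenAI

# vLLM's OpenAI-compatible server listens on localhost:8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="froogai/Qwen3-14B-AWQ",
    messages=[{"role": "user", "content": "Write a hello world in Python:"}],
    max_tokens=100,
)
print(response.choices[0].message.content)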

With AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)

# Generate
prompt = "Write a hello world in Python:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)
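
The snippet above only loads the model. A minimal generation example on top of it, using the tokenizer's built-in chat template (see Chat Format below):

# Build a ChatML prompt from chat messages and generate
messages = [{"role": "user", "content": "Write a hello world in Python:"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))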

Hardware Requirements

  • VRAM: ~10-12 GB on a single GPU, or ~6 GB per GPU with 2-way tensor parallelism
  • RAM: ~2 GB during loading
  • Storage: ~9.5 GB

Performance

  • Inference Speed: 15-25 tokens/second (dual RTX 5060 Ti 16GB)
  • Memory Reduction: ~66% smaller than FP16 (~9.5 GB vs. ~28 GB)
  • Quality: Minimal accuracy loss (typically <1% with 4-bit AWQ)

Chat Format

Qwen3 uses the ChatML-style chat format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you!<|im_end|>
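
You rarely need to assemble this string by hand: the tokenizer ships with a chat template that renders it for you. A quick sketch, reusing a tokenizer loaded as in the Usage section:

# apply_chat_template renders messages into the ChatML format shown above
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)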

Limitations

  • Requires a library that supports the AWQ format (vLLM, AutoAWQ, or Transformers with AutoAWQ installed)
  • Calibration was performed on the pileval dataset (128 samples)

License

This model inherits the license from the base Qwen3-14B model. Please refer to the original model card for license details.

Credits

  • Original model: Qwen Team
  • Quantization: AutoAWQ
  • Hardware: Dual NVIDIA RTX 5060 Ti 16GB

Note: This model was quantized using AutoAWQ with activation-aware weight quantization (AWQ) to achieve 4-bit compression while maintaining model quality.
