# Qwen3-14B-AWQ

4-bit AWQ-quantized version of Qwen/Qwen3-14B.
## Model Details

- **Base Model:** Qwen/Qwen3-14B
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Bits:** 4-bit
- **Group Size:** 128
- **Zero Point:** True
- **Model Size:** ~9.5 GB (down from ~28 GB in FP16)
- **Context Length:** 32,768 tokens
## Quantization Configuration

```json
{
  "zero_point": true,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM"
}
```
## Usage

### With vLLM (recommended for faster inference)

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model your-username/Qwen3-14B-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```
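The server exposes an OpenAI-compatible API on port 8000 by default. A request can be built and sent roughly like this (the model name and port are assumptions matching the launch command above):

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    # Payload for the OpenAI-compatible /v1/completions endpoint
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


payload = build_completion_request(
    "your-username/Qwen3-14B-AWQ",
    "Write a hello world in Python:",
)
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # vLLM's default port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
```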
### With AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)

prompt = "Write a hello world in Python:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With Transformers

Recent `transformers` versions can load AWQ checkpoints directly (with `autoawq` installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)

inputs = tokenizer("Write a hello world in Python:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Hardware Requirements

- **VRAM:** ~10-12 GB (single GPU) or ~6 GB per GPU (2x GPUs with tensor parallelism)
- **RAM:** ~2 GB for loading
- **Storage:** ~9.5 GB
## Performance

- **Inference Speed:** 15-25 tokens/second (dual RTX 5060 Ti)
- **Memory Reduction:** ~66% smaller than FP16
- **Quality:** minimal accuracy loss (<1%)
## Chat Format

Qwen3 uses the following ChatML-style chat format:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you!<|im_end|>
```
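In practice, prefer `tokenizer.apply_chat_template`, which renders this format for you. The layout itself can be sketched by hand (a minimal illustration, not the tokenizer's authoritative template):

```python
def format_chatml(messages: list[dict]) -> str:
    """Render messages in the ChatML-style layout shown above (illustrative sketch)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # Open the assistant turn so the model continues from here
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
])
print(prompt)
```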
## Limitations

- Requires a library that supports the AWQ format (vLLM, AutoAWQ, or a compatible `transformers` version)
- Calibration was performed on the pileval dataset (128 samples)
## License

This model inherits the license of the base Qwen3-14B model. Please refer to the original model card for license details.
## Credits

- **Original model:** Qwen Team
- **Quantization:** AutoAWQ
- **Hardware:** Dual NVIDIA RTX 5060 Ti 16GB
**Note:** This model was quantized with AutoAWQ (activation-aware weight quantization) to achieve 4-bit compression while preserving model quality.