# Qwen3-14B-AWQ

4-bit AWQ-quantized version of Qwen/Qwen3-14B.
## Model Details

- **Base Model:** Qwen/Qwen3-14B
- **Quantization Method:** AWQ (Activation-aware Weight Quantization)
- **Bits:** 4-bit
- **Group Size:** 128
- **Zero Point:** True
- **Model Size:** ~9.5 GB (down from ~28 GB in FP16)
- **Context Length:** 32,768 tokens
## Quantization Configuration

```json
{
  "zero_point": true,
  "q_group_size": 128,
  "w_bit": 4,
  "version": "GEMM"
}
```
## Usage

### With vLLM (recommended for faster inference)

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model your-username/Qwen3-14B-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```
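The server exposes an OpenAI-compatible API on port 8000 by default. A request can be built and sent roughly like this (the model name and port are assumptions matching the launch command above):

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 100) -> dict:
    # Payload for the OpenAI-compatible /v1/completions endpoint
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


payload = build_completion_request(
    "your-username/Qwen3-14B-AWQ",
    "Write a hello world in Python:",
)
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # vLLM's default port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
```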
### With AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)

prompt = "Write a hello world in Python:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With Transformers

Recent `transformers` versions can load AWQ checkpoints directly (with `autoawq` installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/Qwen3-14B-AWQ",
    trust_remote_code=True,
)

inputs = tokenizer("Write a hello world in Python:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Hardware Requirements

- **VRAM:** ~10-12 GB (single GPU) or ~6 GB per GPU (2x GPUs with tensor parallelism)
- **RAM:** ~2 GB for loading
- **Storage:** ~9.5 GB
## Performance

- **Inference Speed:** 15-25 tokens/second (dual RTX 5060 Ti)
- **Memory Reduction:** ~66% smaller than FP16
- **Quality:** minimal accuracy loss (<1%)
## Chat Format

Qwen3 uses the following ChatML-style chat format:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you!<|im_end|>
```
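In practice, prefer `tokenizer.apply_chat_template`, which renders this format for you. The layout itself can be sketched by hand (a minimal illustration, not the tokenizer's authoritative template):

```python
def format_chatml(messages: list[dict]) -> str:
    """Render messages in the ChatML-style layout shown above (illustrative sketch)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # Open the assistant turn so the model continues from here
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
])
print(prompt)
```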
## Limitations

- Requires a library that supports the AWQ format (vLLM, AutoAWQ, or a compatible `transformers` version)
- Calibration was performed on the pileval dataset (128 samples)
## License

This model inherits the license of the base Qwen3-14B model. Please refer to the original model card for license details.
## Credits

- **Original model:** Qwen Team
- **Quantization:** AutoAWQ
- **Hardware:** Dual NVIDIA RTX 5060 Ti 16GB
**Note:** This model was quantized with AutoAWQ (activation-aware weight quantization) to achieve 4-bit compression while preserving model quality.