Instructions for using inference-optimization/Llama-3.2-0.5B-Instruct with libraries and local apps.

Libraries

How to use inference-optimization/Llama-3.2-0.5B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="inference-optimization/Llama-3.2-0.5B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("inference-optimization/Llama-3.2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("inference-optimization/Llama-3.2-0.5B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inference-optimization/Llama-3.2-0.5B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inference-optimization/Llama-3.2-0.5B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Llama-3.2-0.5B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
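
The same request can be built from Python. The sketch below uses only the standard library and assumes the server from the commands above is listening on localhost:8000; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response shape is assumed to follow the OpenAI chat-completions format that the endpoint advertises.

```python
import json
import urllib.request

MODEL_ID = "inference-optimization/Llama-3.2-0.5B-Instruct"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body with choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.full_url)
```

With a server running, `send_chat_request(request)` would perform the call. Passing `base_url="http://localhost:30000"` should target a server started on port 30000 instead.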

SGLang

How to use inference-optimization/Llama-3.2-0.5B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inference-optimization/Llama-3.2-0.5B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Llama-3.2-0.5B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inference-optimization/Llama-3.2-0.5B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Llama-3.2-0.5B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use inference-optimization/Llama-3.2-0.5B-Instruct with Docker Model Runner:
```
docker model run hf.co/inference-optimization/Llama-3.2-0.5B-Instruct
```

Llama-3.2-0.5B-Instruct

This is a tiny version of meta-llama/Llama-3.2-1B-Instruct created for testing and development.

Model Details

Base Model: meta-llama/Llama-3.2-1B-Instruct
Architecture: llama
Total Parameters: 0.51B
Activated Parameters: 0.51B (non-MoE)

Configuration Changes

Only num_hidden_layers was reduced from the original model; the other parameters are listed to show that they are unchanged:

Parameter	Original	Tiny
num_hidden_layers	16	4
hidden_size	2048	2048
intermediate_size	8192	8192
num_attention_heads	32	32
num_key_value_heads	8	8
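
As a rough cross-check of the reported 0.51B total, the parameter count can be estimated from these configuration values. This is a sketch under stated assumptions: `vocab_size=128256` and tied input/output embeddings do not appear in the table and are taken from the public base-model configuration, and the function name `llama_param_count` is hypothetical.

```python
def llama_param_count(
    num_hidden_layers: int,
    hidden_size: int = 2048,
    intermediate_size: int = 8192,
    num_attention_heads: int = 32,
    num_key_value_heads: int = 8,
    vocab_size: int = 128256,          # assumption: not listed in the table
    tie_word_embeddings: bool = True,  # assumption: lm_head shares the embedding
) -> int:
    """Estimate the weight count of a Llama-style decoder without biases."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj and down_proj each hold hidden x intermediate weights.
    mlp = 3 * hidden_size * intermediate_size
    # input_layernorm and post_attention_layernorm.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    embedding = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else embedding
    final_norm = hidden_size
    return num_hidden_layers * per_layer + embedding + lm_head + final_norm


tiny = llama_param_count(4)
original = llama_param_count(16)
print(f"tiny: {tiny / 1e9:.2f}B, original: {original / 1e9:.2f}B, "
      f"ratio: {tiny / original:.0%}")
# prints: tiny: 0.51B, original: 1.24B, ratio: 41%
```

Under these assumptions the embedding table alone accounts for roughly 0.26B parameters, which is why cutting the layers to a quarter does not cut the total to a quarter.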

Checkpoint Structure

This model uses a single model.safetensors file containing all weights. The checkpoint structure is identical to the original model, with the standard Llama architecture tensors:

model.embed_tokens.weight
model.layers.*.self_attn.{q,k,v,o}_proj.weight
model.layers.*.mlp.{gate,up,down}_proj.weight
model.layers.*.{input,post_attention}_layernorm.weight
model.norm.weight
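
The patterns above can be expanded into a concrete list of names and compared against a checkpoint. The following is a minimal sketch, not the validation code used for this model; the helper names are hypothetical, and it only covers the tensors listed above.

```python
def expected_llama_tensor_names(num_hidden_layers: int) -> set[str]:
    """Expand the tensor-name patterns listed above for a given layer count."""
    names = {"model.embed_tokens.weight", "model.norm.weight"}
    for i in range(num_hidden_layers):
        prefix = f"model.layers.{i}"
        names.update(f"{prefix}.self_attn.{p}_proj.weight" for p in "qkvo")
        names.update(f"{prefix}.mlp.{p}_proj.weight" for p in ("gate", "up", "down"))
        names.update(
            f"{prefix}.{n}_layernorm.weight" for n in ("input", "post_attention")
        )
    return names


def diff_tensor_names(actual_names, num_hidden_layers: int):
    """Return (missing, unexpected) tensor names relative to the expected set."""
    expected = expected_llama_tensor_names(num_hidden_layers)
    actual = set(actual_names)
    return sorted(expected - actual), sorted(actual - expected)


# Self-check against the expected set for the 4-layer model.
missing, unexpected = diff_tensor_names(expected_llama_tensor_names(4), 4)
print(len(expected_llama_tensor_names(4)), missing, unexpected)
# prints: 38 [] []
```

For a real file, the names could be read with the safetensors library, for example `safe_open("model.safetensors", framework="pt")` followed by `.keys()`, and passed as `actual_names`. A checkpoint that stores a separate output head would show `lm_head.weight` as unexpected here, since that tensor is not in the list above.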

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("inference-optimization/Llama-3.2-0.5B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("inference-optimization/Llama-3.2-0.5B-Instruct")

input_ids = tokenizer("According to all known laws", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))

Validation

Success: 1.0247299671173096 <= 10.0

==================================================
Generating sample text:
According to all known laws of aviation, there is no way a bee should be able to fly
==================================================
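
The first line of this output compares the measured perplexity with a threshold of 10.0. The validation script itself is not shown, so the following is only a sketch of how such a check is commonly computed: perplexity is the exponential of the mean negative log-likelihood per token. The function names, the sample log-probabilities, and the failure message format are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def check_perplexity(value, threshold=10.0):
    """Format a pass/fail line in the style of the output above."""
    if value <= threshold:
        return f"Success: {value} <= {threshold}"
    return f"Failure: {value} > {threshold}"  # hypothetical failure format


# Hypothetical per-token log-probabilities for a short validation text.
log_probs = [-0.02, -0.03, -0.01, -0.04]
print(check_perplexity(perplexity(log_probs)))
```

A perplexity of about 1.02 corresponds to an average per-token probability of roughly 0.98, which suggests the model has largely memorized the validation text.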

Creation Process

This model was created using the llm-compressor create-tiny-model Claude skill, in the following steps:

1. Inspected the original model configuration to identify key parameters
2. Created a tiny version by reducing num_hidden_layers from 16 to 4
3. Fine-tuned the model on a toy dataset (famous copypastas) to validate learning capability
4. Reached a perplexity of ~1.02 on the validation text, well under the 10.0 threshold
5. Validated that the checkpoint structure matches the original model format
6. Confirmed successful loading and inference
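
The configuration step can be illustrated with plain dictionaries. This is a sketch only: how the skill actually derives the tiny configuration is not described here, `make_tiny_config` is a hypothetical helper, and the base values are copied from the Configuration Changes table.

```python
import copy


def make_tiny_config(base_config: dict, num_hidden_layers: int = 4) -> dict:
    """Return a copy of the config with only num_hidden_layers reduced."""
    if not 0 < num_hidden_layers <= base_config["num_hidden_layers"]:
        raise ValueError("num_hidden_layers must be between 1 and the base value")
    tiny_config = copy.deepcopy(base_config)
    tiny_config["num_hidden_layers"] = num_hidden_layers
    return tiny_config


# Values from the Configuration Changes table.
base = {
    "num_hidden_layers": 16,
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
tiny = make_tiny_config(base)
changed = {k: (base[k], tiny[k]) for k in base if base[k] != tiny[k]}
print(changed)
# prints: {'num_hidden_layers': (16, 4)}
```

With Transformers, a dictionary like this could be turned into a randomly initialized model through `LlamaConfig(**tiny)` and `AutoModelForCausalLM.from_config`; whether the skill initializes the weights that way is not stated.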

Notes

This model was fine-tuned on a small corpus of internet copypastas to ensure it can learn effectively
The model maintains the same Llama 3.2 architecture (including RoPE parameters) as the base model, just with fewer layers
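Because only the layer count was reduced, this model keeps 25% of the original model's transformer layers; the embedding table is unchanged, so the 0.51B total is a larger fraction (roughly 40%) of the original model's parameters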
This is intended for development and testing purposes, not production use

Safetensors

Model size: 0.5B params
Tensor type: BF16

Model tree for inference-optimization/Llama-3.2-0.5B-Instruct

Base model: meta-llama/Llama-3.2-1B-Instruct
Finetuned (1754): this model