Instructions for using nur-dev/farabi-0.6B with libraries and local apps.

Libraries

How to use nur-dev/farabi-0.6B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-0.6B")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
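
The pipeline call above returns its result without printing it. As a minimal sketch, assuming the chat-style output of the Transformers text-generation pipeline (a list with one dict whose "generated_text" holds the whole conversation, ending with the assistant turn), the reply can be pulled out as shown here. The helper name last_assistant_reply is illustrative, and fake_output is a stand-in, not real model output.

```python
def last_assistant_reply(pipe_output):
    """Return the assistant's text from a chat-style pipeline result."""
    conversation = pipe_output[0]["generated_text"]
    # With chat input, "generated_text" is the message list with the new
    # assistant message appended at the end.
    last = conversation[-1]
    if last.get("role") != "assistant":
        raise ValueError("no assistant message at the end of the conversation")
    return last["content"]


# Stand-in for the value returned by `pipe(messages)`; not real model output.
fake_output = [{
    "generated_text": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "(model reply here)"},
    ]
}]
print(last_assistant_reply(fake_output))
```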

Local Apps

vLLM

How to use nur-dev/farabi-0.6B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-0.6B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use nur-dev/farabi-0.6B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-0.6B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-0.6B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use nur-dev/farabi-0.6B with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-0.6B
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.
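
Because access is gated, downloads generally need an authenticated Hugging Face token once the conditions have been accepted. A minimal sketch, assuming the token is stored in the HF_TOKEN environment variable (the same variable used in the Docker command above); the helper name hf_auth_kwargs is hypothetical.

```python
import os


def hf_auth_kwargs(env_var="HF_TOKEN"):
    """Build keyword arguments carrying an access token for from_pretrained."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a Hugging Face access token after accepting "
            "the model's access conditions."
        )
    return {"token": token}


# Usage (assumes transformers is installed and the token has been granted access):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-0.6B", **hf_auth_kwargs())
```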

Farabi-0.6B

Farabi-0.6B is a compact, multilingual instruction-tuned language model with a primary focus on Kazakh, alongside strong Russian and English support. It is designed for everyday assistant use, reasoning, retrieval-grounded answering, and tool / function calling in agentic applications.

The model speaks fluent Kazakh and is intended to make high-quality conversational AI more accessible for the Kazakh language, where well-aligned models remain scarce.

Created by Nurgali Kadyrbek.

It is built on nur-dev/farabi-0.6B-base — a Kazakh-adapted base model that was itself continually pre-trained from Qwen3-0.6B — and then instruction-tuned to produce this assistant.

Highlights

🇰🇿 Kazakh-first: the majority of the instruction data is native Kazakh, with Russian and English mixed in for cross-lingual robustness.
🧠 Reasoning: supports an optional step-by-step "thinking" mode that can be toggled on or off at request time.
🔧 Tool calling: emits Hermes-style <tool_call> blocks and is compatible with the OpenAI-style function-calling interface and agent frameworks.
📚 Grounded answering: trained to answer from provided documents and context, including longer inputs.
🪶 Small & deployable: 0.6B parameters, runs comfortably on a single modest GPU.
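
To illustrate the tool-calling point, here is a minimal parsing sketch. It assumes the common Hermes convention in which each <tool_call> block wraps a JSON object with "name" and "arguments" keys; the exact payload emitted by Farabi-0.6B is not specified in this card, and the sample string and the get_weather tool are made up. In a served setup, a server-side parser would normally do this work.

```python
import json
import re

# Matches the JSON payload between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs from Hermes-style <tool_call> blocks."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing
        if not isinstance(obj, dict) or "name" not in obj:
            continue  # skip payloads that do not look like a call
        calls.append((obj["name"], obj.get("arguments", {})))
    return calls


# Made-up sample in the assumed format; not real model output.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Almaty"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```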

Languages

Language	Approx. share of instruction data
Kazakh (kk)	~56%
English (en)	~33%
Russian (ru)	~10%

Data coverage by domain

The model was instruction-tuned on a broad, internally curated mixture. Its approximate composition by domain, described in general terms only, is:

Domain	Approx. share
General instruction following & multi-turn conversation	~45%
Reasoning & step-by-step problem solving	~27%
Retrieval-grounded answering, long context & document Q&A	~13%
Tool use, function calling & agentic interaction	~7%
Knowledge, culture, news & encyclopedic content	~4%
Mathematics, language tasks (grammar / translation), safety & appropriate refusal, device & environment control, and assistant identity	~4%

Shares are approximate and reflect general domain proportions rather than exact figures.

Data provenance & acknowledgments

The training datasets were created internally by the author, including original synthesis as well as additionally processed and enriched material.

Approximately 5.4% of all data used for instruction tuning was derived (with additional processing and enrichment) from resources of two organizations, whose contributions to the Kazakh language are gratefully acknowledged:

Институт языкознания имени А. Байтурсынова — Institute of Linguistics named after A. Baitursynov
ННПЦ «Тіл-Қазына» имени Шайсултана Шаяхметова — Sh. Shayakhmetov National Research and Practical Center "Til-Qazyna"

Recommended sampling parameters

A good starting point for general use:

{
  "temperature": 0.15,
  "top_p": 0.95,
  "max_tokens": 1024,
  "repetition_penalty": 1.05,
  "stream": true,
  "chat_template_kwargs": {
    "enable_thinking": true
  },
  "continue_final_message": true
}

Set "enable_thinking": false to get direct answers without an explicit reasoning step. Raise temperature for more creative / open-ended generation.
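
With thinking enabled, the output can be expected to contain a reasoning segment ahead of the final answer. The sketch below separates the two, assuming the reasoning is wrapped in <think>...</think> tags as in the Qwen3 family this model descends from. That tag format is an assumption and is not confirmed by this card; the function name split_thinking and the sample text are illustrative.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (reasoning, answer).

    Assumes at most one reasoning block, placed before the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Made-up sample in the assumed format; not real model output.
sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```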

Serving with vLLM

Start an OpenAI-compatible server with tool-calling enabled:

vllm serve nur-dev/farabi-0.6B \
  --served-model-name farabi-0.6b \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Query it with the standard OpenAI client (and the recommended sampling params):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="farabi-0.6b",
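    messages=[
        # System prompt: "You are a helpful and accurate assistant."
        {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
        # User prompt: "Tell me briefly about Almaty."
        {"role": "user", "content": "Алматы туралы қысқаша айтып бер."},
    ],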
    temperature=0.15,
    top_p=0.95,
    max_tokens=1024,
    extra_body={
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    stream=True,
)
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Tool calling works through the standard tools=[...] argument — the model returns function calls that the server parses into structured tool_calls.
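
A sketch of that flow follows. The served model name farabi-0.6b comes from the command above; the get_weather tool, its schema, and the helper names are hypothetical and exist only for illustration. The client argument is expected to be an OpenAI client like the one created above.

```python
import json

# Hypothetical tool, described in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city):
    """Stub implementation; a real tool would query a weather service."""
    return {"city": city, "forecast": "unknown"}


# Maps tool names the model may request to local Python functions.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments_json):
    """Dispatch one parsed tool call to the matching local function."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(client, question, model="farabi-0.6b"):
    """Send one question and execute any tool calls the server returns."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
        temperature=0.15,
    )
    # tool_calls is None when the model answers in plain text.
    calls = resp.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

Rejecting unknown tool names before dispatch is a deliberate choice here, since a small model may request a call that does not match any registered tool.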

Serving with PyTorch / Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nur-dev/farabi-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

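messages = [
    # System prompt: "You are a helpful and accurate assistant."
    {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
    # User prompt: "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
]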

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,        # set False for direct answers
    tokenize=True,
    return_dict=True,            # dict with input_ids and attention_mask
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.15,
    top_p=0.95,
    repetition_penalty=1.05,
)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

Evaluation

⚠️ Interim results. The numbers below were measured on an early checkpoint (~17% through instruction tuning). They are expected to improve as training continues, but already show meaningful capability.

Tool / function calling — BFCL v4

Berkeley Function-Calling Leaderboard (v4), 1,040 cases, evaluated with the HuggingFace backend.

Category	Accuracy	n	What it measures
Simple	80.5%	322/400	one call, one tool available
Multiple	71.5%	143/200	pick the right tool from several
Parallel	65.5%	131/200	several calls in one turn
Irrelevance	5.4%	13/240	abstain when no tool fits
Overall	58.6%	609/1040
Function-calling avg	74.5%	596/800	excludes irrelevance

Takeaways:

Strong calling ability for a 0.6B model. When a call is warranted it is correct ~74.5% of the time — right tool, valid arguments, clean JSON — including 65.5% on the hard parallel / multi-call category.
The weakness is abstention, not calling. On queries that match no available tool, the model usually still emits a call: it abstained correctly in only 13 of 240 irrelevance cases (5.4%). This over-triggering is the main driver of the lower overall score and the clearest area for improvement.
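
The aggregate rows can be reproduced from the per-category counts. The short calculation below shows that the "Function-calling avg" row is a pooled accuracy (total correct over total cases across the three calling categories), which differs from an unweighted mean of the three percentages. The counts are copied from the table above; the variable names are illustrative.

```python
# (correct, total) per category, copied from the BFCL v4 table.
results = {
    "simple": (322, 400),
    "multiple": (143, 200),
    "parallel": (131, 200),
    "irrelevance": (13, 240),
}
calling = ["simple", "multiple", "parallel"]


def pooled(names):
    """Accuracy over all cases in the named categories taken together."""
    correct = sum(results[n][0] for n in names)
    total = sum(results[n][1] for n in names)
    return correct / total


calling_pooled = pooled(calling)   # 596 / 800
overall = pooled(results)          # 609 / 1040
# Unweighted mean of the three per-category accuracies, for comparison.
calling_macro = sum(results[n][0] / results[n][1] for n in calling) / len(calling)
# Irrelevance cases in which the model emitted a call instead of abstaining.
over_triggered = results["irrelevance"][1] - results["irrelevance"][0]

print(f"pooled calling accuracy: {calling_pooled:.1%}")
print(f"overall accuracy:        {overall:.1%}")
print(f"macro calling average:   {calling_macro:.1%}")
print(f"over-triggered cases:    {over_triggered} of {results['irrelevance'][1]}")
```

The pooled figure weights the larger Simple category more heavily, which is why it sits above the unweighted mean of 72.5%.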

Multilingual comprehension — 4-way multiple choice

Multiple-choice comprehension across the model's three languages (random baseline = 25%), evaluated with the chat template and enable_thinking=False.

Language	Accuracy
English	53.7% ±1.7
Russian	50.0% ±1.7
Kazakh	41.8% ±1.6

Takeaways:

Well above the 25% random baseline in all three languages — real comprehension in English, Russian, and Kazakh.
Resource ordering (en > ru > kk) is as expected; Kazakh at 41.8% is clearly non-trivial.
Evaluating with the chat template and enable_thinking=False adds ~5–6 points per language versus a raw prompt — another reason to serve the model with its chat template (see serving instructions above).

Intended use & limitations

Farabi-0.6B is intended as a helpful general-purpose and agentic assistant, with a focus on Kazakh-language use cases. As a small model, it can make factual mistakes, so its outputs should be verified in high-stakes or fact-critical applications. It should be used responsibly and in accordance with applicable laws and the base model's license.

Citation

If you use this model, please credit the author:

Nurgali Kadyrbek — Farabi-0.6B. https://www.linkedin.com/in/nurgali-kadyrbek-504260231/

Model size: 0.6B params (Safetensors)
Tensor type: BF16
Base model: nur-dev/farabi-0.6B-base (this model is fine-tuned from it)