qwen_4b_sql

qwen_4b_sql is a Qwen3-4B-Base model fine-tuned for text-to-SQL generation via full supervised fine-tuning (SFT) on a cleaned split of PipableAI/pip-txt-to-sql-spider-bird-dataset.

This repository tracks the stronger 4B checkpoint from our H20 single-GPU training runs. In our internal comparison, this checkpoint outperformed the corresponding Qwen3-1.7B-Base baseline on Spider execution accuracy.

Base Model

  • Base model: Qwen/Qwen3-4B-Base
  • Finetuning framework: LLaMA-Factory
  • Training mode: Full SFT
  • Task: schema + question -> SQL only

Training Data

  • Dataset: PipableAI/pip-txt-to-sql-spider-bird-dataset (cleaned split)
  • Format: single-turn pairs of schema + natural-language question mapped to a gold SQL query

Training Setup

  • Hardware: single NVIDIA H20 96GB
  • Precision: bf16
  • Context length: 2048
  • Per-device train batch size: 1
  • Gradient accumulation steps: 8
  • Effective batch size: 8
  • Learning rate: 5e-6
  • Scheduler: cosine
  • Warmup steps: 300
  • Epochs: 4.0
  • Template: qwen3_nothink
  • Best-checkpoint selection: load_best_model_at_end = true
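
For reproducibility, the settings above map onto a LLaMA-Factory full-SFT YAML config roughly like the sketch below. The `dataset` name and `output_dir` are illustrative placeholders, not values taken from this card.

```yaml
# Sketch of a LLaMA-Factory full-SFT config matching the settings above.
# `dataset` and `output_dir` are placeholders.
model_name_or_path: Qwen/Qwen3-4B-Base
stage: sft
do_train: true
finetuning_type: full
template: qwen3_nothink
dataset: spider_bird_cleaned        # placeholder name for the cleaned split
cutoff_len: 2048
per_device_train_batch_size: 1
gradient_accumulation_steps: 8      # effective batch size: 1 x 8 = 8
learning_rate: 5.0e-6
lr_scheduler_type: cosine
warmup_steps: 300
num_train_epochs: 4.0
bf16: true
load_best_model_at_end: true
output_dir: saves/qwen_4b_sql       # placeholder
```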

Spider Benchmark

The following numbers are from Spider dev using the official evaluation tooling:

  • Official exact-match evaluation from test-suite-sql-eval
  • Official Spider Test Suite execution evaluation

Main Results

| Metric                                | Score |
|---------------------------------------|-------|
| Spider official exact match           | 35.0% |
| Spider Test Suite execution accuracy  | 67.6% |

Difficulty Breakdown

| Difficulty | Exact Match | Test Suite Exec |
|------------|-------------|-----------------|
| Easy       | 64.9%       | 87.5%           |
| Medium     | 37.4%       | 72.9%           |
| Hard       | 16.1%       | 50.0%           |
| Extra      | 3.6%        | 42.2%           |
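
As a sanity check, the per-difficulty scores can be combined into overall numbers. The split sizes below are the commonly reported Spider dev counts (an assumption, not stated in this card); with them, the weighted averages land on the headline 67.6% / 35.0% figures.

```python
# Consistency check: weight per-difficulty scores by the commonly reported
# Spider dev split sizes (assumed, not taken from this card).
counts   = {"easy": 248, "medium": 446, "hard": 174, "extra": 166}
exec_acc = {"easy": 0.875, "medium": 0.729, "hard": 0.500, "extra": 0.422}
exact    = {"easy": 0.649, "medium": 0.374, "hard": 0.161, "extra": 0.036}

total = sum(counts.values())  # 1034 dev examples

def weighted(scores):
    """Example-weighted average over the four difficulty buckets."""
    return sum(counts[d] * scores[d] for d in counts) / total

print(f"weighted exec:  {weighted(exec_acc):.1%}")  # ~67.6%
print(f"weighted exact: {weighted(exact):.1%}")     # ~35.0%
```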

Notes

  • This model is stronger under execution-based Spider evaluation than our best Qwen3-1.7B-Base run.
  • In our experiments, exact-match scores ran well below execution-based scores, because semantically equivalent SQL rewrites often differ textually from the Spider gold query.
  • A later 4B rerun with altered training settings underperformed this checkpoint on Spider and is not the checkpoint published here.

Intended Use

This model is intended for:

  • text-to-SQL research baselines
  • schema-conditioned SQL generation experiments
  • single-turn SQL generation from natural language plus schema text

It is not validated for:

  • production-grade database access control
  • unrestricted execution over arbitrary enterprise schemas
  • multi-turn agent workflows without extra prompting / tooling
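
A minimal helper for the single-turn schema-plus-question format can be sketched as follows. The instruction wording mirrors the Example Usage section; treat it as an assumption about the prompt format, not a guaranteed training-time template.

```python
def build_prompt(schema_ddl: str, question: str) -> str:
    """Format one schema + question pair into a single-turn prompt.

    The instruction line mirrors this card's Example Usage snippet;
    it is a sketch, not a documented training template.
    """
    return (
        "Generate SQL from the given schema and question. Output SQL only.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Question:\n{question}\n"
    )

prompt = build_prompt(
    "CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, Lang TEXT);",
    "How many tweets are in English?",
)
print(prompt)
```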

Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bsq1989/qwen_4b_sql"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks bf16 on supported GPUs
    device_map="auto",
    trust_remote_code=True,
)

prompt = """Generate SQL from the given schema and question. Output SQL only.

Schema:
CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, LocationID INTEGER, Lang TEXT, ...);
CREATE TABLE location (LocationID INTEGER, Country TEXT, City TEXT, ...);

Question:
How many tweets are in English?
"""

# Single-turn chat: schema + question in, SQL out.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
generated = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
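
Raw generations may need light post-processing before execution. The helper below pulls out the first SQL statement; whether the model wraps output in markdown fences is an assumption, so both fenced and plain output are handled.

```python
import re

# Markdown code fence, built indirectly so this example stays well-formed.
FENCE = "`" * 3

def extract_sql(text: str) -> str:
    """Pull the first SQL statement out of raw model output.

    Handles plain SQL as well as output wrapped in markdown code fences
    (fenced output is an assumption, not documented behavior).
    """
    fenced = re.search(
        FENCE + r"(?:sql)?\s*(.*?)" + FENCE,
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if fenced:
        text = fenced.group(1)
    # Keep only the first statement and normalize the trailing semicolon.
    return text.strip().split(";")[0].strip() + ";"

raw = FENCE + "sql\nSELECT COUNT(*) FROM twitter WHERE Lang = 'en';\n" + FENCE
print(extract_sql(raw))
```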

Limitations

  • Performance drops on SQL benchmarks that are more open-ended and heterogeneous than Spider.
  • The model can still produce invalid column references on out-of-distribution schemas.
  • Benchmark numbers here reflect our current internal setup and should be reproduced with the same evaluation pipeline for strict comparison.