CLI Agent: Llama 3 8B GRPO Fine-tune
A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions.
Model Details
Model Description
- Developed by: Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
- Model type: Causal Language Model (LoRA adapter)
- Language(s) (NLP): English
- License: Meta Llama 3 Community License
- Finetuned from model: unsloth/llama-3-8b-Instruct
Uses
Direct Use
Given a natural language description of a CLI task, the model outputs the corresponding shell command with no explanation, no markdown, and no backticks.
Example:
- Input: "Count the number of lines in /tmp/data/log.txt"
- Output: `wc -l /tmp/data/log.txt`
Out-of-Scope Use
- Not intended for general conversation
- Not suitable for tasks outside Linux CLI command generation
- Should not be used for destructive or malicious shell commands
Bias, Risks, and Limitations
- Model may generate incorrect or harmful shell commands; always review before executing
- Trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
- Performance degrades on complex multi-step tasks
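Because the card advises reviewing every generated command before running it, a thin pre-execution filter can catch the most obviously destructive patterns. This is an illustrative sketch only; the pattern list and function names are hypothetical and not part of the released model:

```python
import re

# Hypothetical denylist: block obviously destructive commands before execution.
# The patterns are illustrative, not exhaustive -- human review is still required.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b",  # rm -rf / rm -fr
    r"\bmkfs(\.\w+)?\b",     # filesystem formatting
    r"\bdd\b.*\bof=/dev/",   # raw writes to a device
    r">\s*/dev/sd[a-z]",     # redirecting output onto a block device
]

def is_safe(command: str) -> bool:
    """Return False if the command matches a known destructive pattern."""
    return not any(re.search(p, command) for p in DESTRUCTIVE_PATTERNS)
```

A caller would run `is_safe(command)` on the model's output and refuse to execute anything that fails the check.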
How to Get Started with the Model
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch the model into inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
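Even with the strict system prompt, downstream code may want to normalize the model's text before executing it. The helper below is a hypothetical addition (not part of the model card's code) that strips stray code fences and backticks:

```python
def normalize_command(text: str) -> str:
    """Strip code fences, backticks, and surrounding whitespace from model output."""
    text = text.strip()
    if text.startswith("```"):
        # Drop the opening fence (with any language tag) and the closing fence.
        lines = [line for line in text.splitlines() if not line.startswith("```")]
        text = "\n".join(lines)
    return text.strip("` \n")

print(normalize_command("```bash\nwc -l /tmp/data/log.txt\n```"))
# wc -l /tmp/data/log.txt
```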
Training Details
Training Data
60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, expected output, and a reward function for GRPO training.
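The dataset schema itself is not published; inferring from the description above, one plausible record shape (hypothetical field names and values) looks like this:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CLITask:
    # Hypothetical schema inferred from the card's description,
    # not the actual dataset format.
    prompt: str                         # natural-language task description
    setup: List[str]                    # shell commands that prepare the environment
    expected_output: str                # output of the reference solution
    reward_fn: Callable[[str], float]   # scores a generated command for GRPO

task = CLITask(
    prompt="Count the number of lines in /tmp/data/log.txt",
    setup=["mkdir -p /tmp/data", "printf 'a\\nb\\nc\\n' > /tmp/data/log.txt"],
    expected_output="3 /tmp/data/log.txt",
    reward_fn=lambda cmd: 5.0 if "wc -l" in cmd else -2.0,  # toy placeholder
)
```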
Training Hyperparameters
- Training regime: bf16 mixed precision
- Method: GRPO (Group Relative Policy Optimization)
- Learning rate: 3e-6 with linear scheduler
- Warmup ratio: 0.1
- Batch size: 2 (per device)
- Gradient accumulation steps: 2
- Total steps: 10000
- LoRA rank: 32, alpha: 64
- KL coefficient: 0.05
- Number of generations: 4
- Max sequence length: 512
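Assuming training used trl's `GRPOTrainer` (which the Software section suggests), the hyperparameters above map onto a `GRPOConfig` roughly as in this sketch; the `output_dir` and the exact mix of arguments are assumptions, since the training script is not published:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="cli-agent-grpo",        # assumption: any local path
    learning_rate=3e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,                  # completions sampled per prompt for GRPO
    beta=0.05,                          # KL coefficient
    max_completion_length=512,          # per the card's 512 max sequence length
    bf16=True,
)
```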
Speeds, Sizes, Times
- Training time: ~3h 13min
- Checkpoint size: ~524MB (LoRA adapter only)
- Final train loss: 0.0141
- Final reward: 8.0/8.0 on easy tasks, ~6.0 average
Evaluation
Metrics
Reward function scoring up to 8 points per task:
- +5 for correct output match
- +3 for command success with partial match
- -2 for command failure or wrong output
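One reading of this rubric consistent with the reported maximum of 8.0 (+5 for the exact match plus +3 for the successful run) can be sketched as a reward function. The actual reward code is not published, so the function name, the additive interpretation, and the partial-match test are all assumptions:

```python
import subprocess

def score_command(command: str, expected_output: str, timeout: float = 5.0) -> float:
    """Hypothetical reconstruction of the card's rubric:
    exact match on a successful run -> 8.0 (+5 match, +3 success),
    successful run with partial match -> 3.0, anything else -> -2.0."""
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return -2.0          # treat a hang as a failure
    if result.returncode != 0:
        return -2.0          # command failed
    out = result.stdout.strip()
    expected = expected_output.strip()
    if out == expected:
        return 8.0           # +5 exact output match, +3 successful run
    if expected and expected in out:
        return 3.0           # ran successfully with a partial match
    return -2.0              # wrong output
```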
Results
- Best reward: 8.0
- Average reward (final steps): ~6.0
- Train loss: 0.0141
Environmental Impact
- Hardware Type: H100 SXM 80GB
- Hours used: ~3.5
- Cloud Provider: Vast.ai
Technical Specifications
Model Architecture
- Base: Meta-Llama-3-8B-Instruct
- Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
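These adapter settings correspond to a PEFT `LoraConfig` along the following lines (a sketch reconstructed from the numbers above, not the actual training script):

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the card's numbers.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```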
Software
- unsloth 2026.3.3
- trl 0.24.0
- transformers 4.56.1
- torch 2.6.0+cu124
- PEFT 0.18.1
Model Card Authors
Jose Alvarez
Model Card Contact
https://github.com/Alvarez-Jose/unsloth-grpo-project