CLI Agent: Llama 3 8B GRPO Fine-tune

A LoRA adapter fine-tuned on Meta-Llama-3-8B-Instruct using GRPO (Group Relative Policy Optimization) to generate correct Linux shell commands from natural language task descriptions.

Model Details

Model Description

  • Developed by: Jose Alvarez, Carson Chiem, Prisha Bhattacharyya, Vishal Tyagi
  • Model type: Causal Language Model (LoRA adapter)
  • Language(s) (NLP): English
  • License: Meta Llama 3 Community License
  • Finetuned from model: unsloth/llama-3-8b-Instruct

Uses

Direct Use

Given a natural language description of a CLI task, the model outputs the correct shell command with no explanation, no markdown, and no backticks.

Example:

  • Input: "Count the number of lines in /tmp/data/log.txt"
  • Output: wc -l /tmp/data/log.txt
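The example can be verified end-to-end in a sandbox; the file contents below are illustrative stand-ins:

```shell
# Minimal sketch: create a throwaway file and run the generated command.
mkdir -p /tmp/data
printf 'one\ntwo\nthree\n' > /tmp/data/log.txt
wc -l /tmp/data/log.txt
# prints the line count followed by the path, e.g. "3 /tmp/data/log.txt"
```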

Out-of-Scope Use

  • Not intended for general conversation
  • Not suitable for tasks outside Linux CLI command generation
  • Should not be used for destructive or malicious shell commands

Bias, Risks, and Limitations

  • The model may generate incorrect or harmful shell commands; always review them before executing
  • Trained on a limited set of ~60 task types, so it may not generalize to all CLI scenarios
  • Performance degrades on complex multi-step tasks

How to Get Started with the Model

from unsloth import FastLanguageModel

# Load the LoRA adapter on top of the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jalva182/cli-agent-model",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference mode

messages = [
    {"role": "system", "content": "You are a CLI expert. Given a task, output exactly the shell commands required. No explanation, no markdown, no backticks."},
    {"role": "user", "content": "Count the number of lines in /tmp/data/log.txt"},
]

# add_generation_prompt=True appends the assistant header so the model
# starts generating the command rather than continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
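The decoded string above includes the prompt as well as the answer; to recover just the command, slice off the prompt tokens before decoding. A minimal sketch of the slicing logic (the helper name and dummy ids are illustrative, not from the project):

```python
def extract_generated(output_ids, prompt_len):
    """Keep only the tokens the model produced after the prompt."""
    return output_ids[prompt_len:]

# Stand-ins for inputs[0] and outputs[0] from the snippet above:
prompt_ids = [101, 7, 42, 9]          # tokenized chat template
output_ids = prompt_ids + [55, 13]    # generate() echoes the prompt first
print(extract_generated(output_ids, len(prompt_ids)))  # [55, 13]
```

With real tensors, this corresponds to `tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)`.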

Training Details

Training Data

60 validated CLI tasks covering file operations, text processing (grep, awk, sed), sorting, archives, system info, permissions, and environment variables. Each task includes setup commands, expected output, and a reward function for GRPO training.
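One plausible shape for such a task record, with a check against the expected output (field names here are assumptions for illustration, not the project's actual schema):

```python
# Hypothetical task record: setup commands, expected output, and a reference.
task = {
    "description": "Count the number of lines in /tmp/data/log.txt",
    "setup": ["mkdir -p /tmp/data", "printf 'a\\nb\\nc\\n' > /tmp/data/log.txt"],
    "expected_output": "3",
    "reference_command": "wc -l < /tmp/data/log.txt",
}

def check(actual_output: str, task: dict) -> bool:
    """Compare a command's stdout against the task's expected output."""
    return actual_output.strip() == task["expected_output"]

print(check("3\n", task))  # True
print(check("4\n", task))  # False
```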

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Method: GRPO (Group Relative Policy Optimization)
  • Learning rate: 3e-6 with linear scheduler
  • Warmup ratio: 0.1
  • Batch size: 2 (per device)
  • Gradient accumulation steps: 2
  • Total steps: 10000
  • LoRA rank: 32, alpha: 64
  • KL coefficient: 0.05
  • Number of generations: 4
  • Max sequence length: 512
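Assuming training used trl's GRPOTrainer (consistent with the versions listed under Software), these hyperparameters map onto a config roughly like the sketch below; argument names follow trl's GRPOConfig and are not taken from the project's actual training script:

```python
from trl import GRPOConfig

# Sketch of the hyperparameters above as a trl GRPOConfig (assumed setup).
config = GRPOConfig(
    learning_rate=3e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=10_000,
    num_generations=4,   # completions sampled per prompt for the group baseline
    beta=0.05,           # KL coefficient against the reference policy
    bf16=True,
)
```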

Speeds, Sizes, Times

  • Training time: ~3h 13min
  • Checkpoint size: ~524MB (LoRA adapter only)
  • Final train loss: 0.0141
  • Final reward: 8.0/8.0 on easy tasks, ~6.0 average

Evaluation

Metrics

Reward function scoring up to 8 points per task:

  • +5 for correct output match
  • +3 for command success with partial match
  • -2 for command failure or wrong output
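The rubric above can be sketched as a small scoring function. The exact aggregation is an assumption (the card does not state whether the components are additive, though the 8.0 maximum suggests +5 and +3 combine):

```python
def score(command_ok: bool, exact_match: bool, partial_match: bool) -> int:
    """Map a rollout's outcome to a reward, following the rubric above."""
    reward = 0
    if exact_match:
        reward += 5
    if command_ok and partial_match:
        reward += 3
    if not command_ok or not (exact_match or partial_match):
        reward -= 2  # command failure or wrong output
    return reward

print(score(command_ok=True, exact_match=True, partial_match=True))    # 8
print(score(command_ok=True, exact_match=False, partial_match=True))   # 3
print(score(command_ok=False, exact_match=False, partial_match=False)) # -2
```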

Results

  • Best reward: 8.0
  • Average reward (final steps): ~6.0
  • Train loss: 0.0141

Environmental Impact

  • Hardware Type: H100 SXM 80GB
  • Hours used: ~3.5 hours
  • Cloud Provider: Vast.ai

Technical Specifications

Model Architecture

  • Base: Meta-Llama-3-8B-Instruct
  • Adapter: LoRA (rank=32, alpha=64, dropout=0.05)
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
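These adapter settings correspond to a peft LoraConfig along the following lines. This is a sketch: the original run used Unsloth's PEFT wrapper, and the exact call is not shown in this card.

```python
from peft import LoraConfig

# Assumed LoRA setup matching the architecture details above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```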

Software

  • unsloth 2026.3.3
  • trl 0.24.0
  • transformers 4.56.1
  • torch 2.6.0+cu124
  • PEFT 0.18.1

Model Card Authors

Jose Alvarez

Model Card Contact

https://github.com/Alvarez-Jose/unsloth-grpo-project
