
Introduction

Models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

GitHub: https://github.com/GUI-Libra/GUI-Libra
Website: https://GUI-Libra.github.io

Usage

1) Start an OpenAI-compatible vLLM server

pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
  • Endpoint: http://localhost:8000/v1
  • The api_key here must match --api-key.

2) Minimal Python example (prompt + image β†’ request)

Install dependencies:

pip install -U openai

Create minimal_infer.py:

import base64
from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-3B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_b64 = b64_image("screen.png")

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
    ## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
    ## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
    ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
    ## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
    ## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)

task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
from PIL import Image  # pip install pillow

img_size = Image.open("screen.png").size  # (width, height)
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)

Run:

python minimal_infer.py
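The reply follows the <think>/<answer> format requested in the prompt, so the action JSON can be extracted with the standard library. A minimal sketch (the `parse_reply` helper is illustrative, not part of the released code; it assumes the model emits valid JSON inside the <answer> tags):

```python
import json
import re


def parse_reply(text: str) -> dict:
    """Extract the action JSON from a <think>...</think><answer>...</answer> reply."""
    m = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", text, re.DOTALL)
    if not m:
        raise ValueError("no <answer> block found in model reply")
    return json.loads(m.group(1))
```

For example, `parse_reply(resp.choices[0].message.content)` would return a dict with `action_type`, `action_target`, `value`, and `point_2d` keys that you can dispatch to your click/type/scroll executor.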

Notes

  • Replace screen.png with your own screenshot file.
  • If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
  • The example assumes your vLLM server is running locally on port 8000.