Introduction
The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".
GitHub: https://github.com/GUI-Libra/GUI-Libra Website: https://GUI-Libra.github.io
Usage
1) Start an OpenAI-compatible vLLM server
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-8B --port 8000 --api-key token-abc123
- Endpoint:
http://localhost:8000/v1 - The
api_keyhere must match--api-key.
2) Minimal Python example (prompt + image β request)
Install dependencies:
pip install -U openai
Create minimal_infer.py:
import base64
from openai import OpenAI
MODEL = "GUI-Libra/GUI-Libra-8B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
def b64_image(path: str) -> str:
with open(path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
# 1) Your screenshot path
img_b64 = b64_image("screen.png")
system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates
action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu
action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None
action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard
action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""
# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)
query = query + '\n' + '''The response should be structured in the following format:
<thinking>Your step-by-step thought process here...</thinking>
<answer>
{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''
resp = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful GUI agent."},
{"role": "user", "content": [
{"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
{"type": "text", "text": prompt},
]},
],
temperature=0.0,
max_completion_tokens=1024,
)
print(resp.choices[0].message.content)
Run:
python minimal_infer.py
Notes
- Replace
screen.pngwith your own screenshot file. - If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
- The example assumes your vLLM server is running locally on port
8000.
- Downloads last month
- 23
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support
Model tree for GUI-Libra/GUI-Libra-8B
Base model
Qwen/Qwen3-VL-8B-Instruct