
Visione

A local-first AI creative production suite for consumer GPUs. Read the full documentation here.


What is Visione?

Visione is a self-contained desktop application for AI-driven creative production. It runs entirely on local hardware: no cloud services, no external APIs, no subscriptions. Every model runs on-device; inference never leaves your machine.

The pipeline covers the full creative arc: text-to-image and video generation, retouching, stylization, enhancement, and sound design, all within a single application.

Stack: Python 3.12 + FastAPI + SSE · React 18 + TypeScript + Zustand · Tauri 2 desktop shell · ComfyUI headless for video inference · PyTorch 2.7 + CUDA
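Progress from the FastAPI backend reaches the React client as Server-Sent Events over localhost. A minimal sketch of the wire format (the function name and payload fields here are illustrative, not Visione's actual API):

```python
import json

def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: an `event:` line, a
    `data:` line carrying a JSON payload, and the blank-line terminator
    that marks the end of the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# A generation job could stream frames like this over a
# text/event-stream response:
frame = sse_frame("progress", {"step": 12, "total": 30})
print(frame, end="")
# event: progress
# data: {"step": 12, "total": 30}
```

In the real backend these frames would be yielded from an async generator wrapped in a streaming response; the sketch only shows the frame format the frontend parses.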


Components

| Tab | What it does |
| --- | --- |
| Imagine | Image and video generation from text or a reference image |
| Retouch | Image editing: upscaling, denoising, face restore, multi-reference compositing, LUTs, adjustments, background removal |
| Retexture | Video style transfer: LoRA-based preset styles and reference-image stylization via depth conditioning |
| Enhance | Image and video enhancement: upscaling, denoising, frame interpolation |
| Storyboard | 12-stage AI-assisted filmmaking pipeline: multi-agent concept development, character library, shot generation, and export |
| Sound Studio | Music generation, voiceover (preset, cloned, and voice-designed), and video-to-audio foley |
| Characters | Persistent character library with 5-shot reference generation for cross-shot consistency |
| Gallery | Unified asset browser across all components |

Models

| Model | Purpose |
| --- | --- |
| Z-Image Turbo FP8 | Image generation |
| Z-Image Qwen 3 4B | Text encoder |
| Z-Image VAE | VAE |
| Z-Image LoRAs (38) | Style presets |
| Flux2 Klein 4B FP8 | Image generation / editing |
| Flux2 Klein 9B BF16 | Image generation / high-quality editing |
| Flux2 VAE | VAE |
| ControlNet Union 2.1 | Structural conditioning (Retexture) |
| Patina LoRAs (21) | Stylization presets |
| SPAN 4x Upscaler | Image upscaling |
| SCUNet Denoiser | Image denoising |
| CodeFormer | Face enhancement |
| LTX-2.3 22B FP8 | Video generation |
| LTX-2 Gemma 3 12B FP4 | Video text encoder |
| LTX-2.3 22B Distilled LoRA | Fast video sampling |
| LTX-2.3 Spatial Upscaler | 2× video upscaling |
| LTX-2.3 Audio VAE | Audio generation |
| VEnhancer FP16 | Video enhancement |
| SeedVR2 3B FP8 | Video upscaling |
| RIFE v4.26 | Frame interpolation |
| ACE-Step SFT + Base | Music generation |
| ACE-Step LM 1.7B | Music language model |
| ACE-Step VAE + TextEnc | Music pipeline |
| Qwen3-TTS 1.7B (3 variants) | Text-to-speech |
| HunyuanVideo-Foley XL | Video-to-audio |
| Wan 2.1 T2V 1.3B | StyleMaster backbone |
| StyleMaster checkpoints | Style injection weights |
| CLIP ViT-H-14 | Style extraction |
| IS-Net (rembg) | Background removal (CPU) |
| LatentSync 1.6 | Lip sync (quality) |
| MuseTalk 1.5 | Lip sync (fast) |
| InsightFace buffalo_l | Face detection/swap |
| Inswapper_128.onnx | Face swap model |

Architecture

Visione runs as a local client-server application. The React frontend communicates with a FastAPI backend over localhost; real-time progress streams back via SSE. Heavy inference runs in-process (diffusers, PyTorch) or via a ComfyUI headless subprocess for video pipelines. Models load and unload sequentially to fit within a 16GB VRAM budget, one active model at a time.
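The one-model-at-a-time policy can be sketched as a small manager that releases the resident model before loading the next. Class and model names below are illustrative, not Visione's internals; a CUDA-backed version would also free cached VRAM after dropping a model:

```python
class SequentialModelManager:
    """Keeps at most one model resident at a time so peak usage
    stays inside a fixed VRAM budget."""

    def __init__(self, loaders):
        self.loaders = loaders      # name -> zero-arg loader callable
        self.active_name = None
        self.active_model = None

    def acquire(self, name):
        if name == self.active_name:
            return self.active_model        # already resident, reuse it
        if self.active_model is not None:
            self.active_model = None        # drop the old model first
            # a PyTorch backend would call torch.cuda.empty_cache() here
        self.active_model = self.loaders[name]()
        self.active_name = name
        return self.active_model

# Illustrative usage: switching from an image pipeline to a video one
mgr = SequentialModelManager({
    "z-image-turbo": lambda: "image-pipeline",
    "ltx-2.3": lambda: "video-pipeline",
})
mgr.acquire("z-image-turbo")
mgr.acquire("ltx-2.3")   # the image model is released before the video model loads
```

The sequential policy trades switching latency for a hard VRAM ceiling: loads are slower when tabs alternate, but no workload can exceed the budget by stacking models.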

The desktop shell (Tauri 2) wraps the frontend as a native window and manages backend process lifecycle. All assets, models, and outputs live on local storage; nothing is transmitted externally.

Components share models where possible. Image generation models are reused across Imagine, Retouch, Retexture, and Storyboard; video models feed through from Imagine into Retexture and Sound Studio. The Video Editor and Gallery operate CPU-side, assembling outputs produced by the GPU components.
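The sharing described above amounts to a component-to-model mapping in which a repeated key means one set of weights is reused rather than reloaded. The mapping below is a hedged illustration based on the tables above, not Visione's actual registry:

```python
# Illustrative mapping: components listing the same model key share
# one copy of the loaded weights instead of loading it twice.
COMPONENT_MODELS = {
    "imagine":    ["z-image-turbo", "ltx-2.3"],
    "retouch":    ["z-image-turbo", "span-4x", "codeformer"],
    "retexture":  ["z-image-turbo", "controlnet-union-2.1"],
    "storyboard": ["z-image-turbo"],
}

def shared_between(a: str, b: str) -> set[str]:
    """Model keys two components can reuse without a reload."""
    return set(COMPONENT_MODELS[a]) & set(COMPONENT_MODELS[b])

print(shared_between("imagine", "retexture"))
# {'z-image-turbo'}
```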


License

MIT


