LuluV2 Native-bf16
LuluV2 is a local-first text-generation model from OpenMachine. This release is packaged as a custom PyTorch runtime with native bfloat16 weights. It is not a standard Transformers AutoModelForCausalLM repository, and it is not meant to be loaded through pipeline(...), AutoModelForCausalLM.from_pretrained(...), llama.cpp, Ollama, GGUF, GPTQ, AWQ, or any other quantized conversion path.
Clone the repo, keep the checkpoint as LULUV2-bf16.pt, and run it directly.
Key points
- Architecture: native RPT/VWM-style LuluV2 architecture, not a normal decoder-only Transformers layout.
- Runtime: custom PyTorch inference code included in this repository.
- Precision: native bfloat16 checkpoint.
- Quantization: no quantization step is required or recommended for the intended release path.
- Mode: inference-only public package.
- Tokenizer: local tokenizer files are included under tokenizer/.
- Status: experimental research/runtime release; some inference paths are still Python-heavy and not fully optimized.
What is included
.
├── LULUV2-bf16.pt               # native bf16 checkpoint, required
├── README.md                    # model card and run guide
├── app.py                       # local Gradio chat UI
├── run_inference.py             # command-line generation runner
├── luluv2_inference_runtime.py  # custom architecture + checkpoint loader
├── luluv2_live_inference.py     # streaming generation engine
├── luluv2_optimized_engine.py   # optimized local engine variant
├── config.json                  # HF metadata only; not an AutoModel config
├── generation_config.json       # default generation settings
├── requirements.txt
├── run_chat.sh                  # Linux/macOS launcher
├── run_chat.ps1                 # Windows PowerShell launcher
└── tokenizer/                   # local tokenizer files
What this is not
This repository is not:
- a standard Transformers model package
- a hosted inference-widget-first package
- a training repository
- a dataset release
- a quantized model release
- a GGUF / llama.cpp / Ollama release
Use the included runtime scripts.
Requirements
Recommended environment:
- Python 3.10 or newer
- PyTorch 2.1 or newer
- CUDA GPU with bf16 support for best performance
- Local disk space for the checkpoint
Install dependencies:
pip install -r requirements.txt
The repo is local-first. The runtime uses the included checkpoint and tokenizer folder. It does not download external model weights at startup.
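To confirm that a machine matches the recommended environment before launching anything, a short check like the one below can help. This is a minimal sketch and not a file shipped in the repository: the function name `check_environment` is hypothetical. It relies only on `sys.version_info` and, when PyTorch is installed, on `torch.__version__`, `torch.cuda.is_available()`, and `torch.cuda.is_bf16_supported()`.

```python
import sys


def check_environment(min_python=(3, 10)):
    """Return a dict describing whether this machine matches the recommended setup."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_version": None,
        "cuda_available": False,
        "bf16_supported": False,
    }
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return report

    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Ask PyTorch whether the current CUDA device can run bf16.
        report["bf16_supported"] = bool(torch.cuda.is_bf16_supported())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `bf16_supported` comes back false on a CUDA machine, the fp16 fallback mentioned later in this README is the option to try.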
Quick start: chat UI
From the cloned repository:
python app.py --inbrowser
Equivalent explicit command:
python app.py \
  --ckpt ./LULUV2-bf16.pt \
  --model-py ./luluv2_inference_runtime.py \
  --tokenizer-dir ./tokenizer \
  --device cuda \
  --dtype bf16 \
  --max-context 32768 \
  --inbrowser
The UI starts a local Gradio chat app. By default it binds to 127.0.0.1 on port 7862.
Launcher scripts
Linux/macOS:
bash ./run_chat.sh
Windows PowerShell:
.\run_chat.ps1
Quick start: command line
python run_inference.py \
  --ckpt ./LULUV2-bf16.pt \
  --tokenizer-dir ./tokenizer \
  --device cuda \
  --dtype bf16 \
  --prompt "Write a short introduction to LuluV2."
Generation controls:
python run_inference.py \
  --ckpt ./LULUV2-bf16.pt \
  --prompt "Explain why native bf16 inference matters." \
  --max-new-tokens 700 \
  --temperature 0.65 \
  --top-p 0.90 \
  --top-k 40
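For readers unfamiliar with these three flags, the sketch below shows how temperature, top-k, and top-p are conventionally combined in text-generation samplers, using a toy distribution in plain Python. It illustrates the standard technique only. Whether the LuluV2 runtime applies the filters in exactly this order is an assumption, and `filter_distribution` is a hypothetical name that does not exist in this repository.

```python
import math


def filter_distribution(logits, temperature=0.65, top_k=40, top_p=0.90):
    """Apply temperature, then top-k, then top-p (nucleus) filtering.

    Returns a dict {token_index: probability} whose values sum to 1.
    """
    # Temperature: divide logits before the softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(filter_distribution(toy_logits, temperature=0.65, top_k=3, top_p=0.90))
```

Lowering `--temperature`, `--top-k`, or `--top-p` each narrows the set of tokens that can be sampled, which usually makes output more focused and less varied.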
Python usage
import torch
from luluv2_live_inference import LULUV2LiveEngine, GenerationConfig

# Load the checkpoint and tokenizer from local files only.
engine = LULUV2LiveEngine(
    ckpt_path="./LULUV2-bf16.pt",
    model_py="./luluv2_inference_runtime.py",
    tokenizer_dir="./tokenizer",
    device="cuda",
    dtype="bf16",
    local_files_only=True,
    no_config_download=True,
)

# Sampling settings for this run.
cfg = GenerationConfig(
    max_new_tokens=512,
    temperature=0.65,
    top_p=0.90,
    top_k=40,
)

history = []
system_prompt = "You are LuluV2, a helpful local AI assistant."

# Stream the reply and print each piece as it arrives.
for partial_text in engine.generate_stream(
    "What are you?",
    history,
    system_prompt,
    cfg,
):
    print(partial_text, end="", flush=True)
print()

# Release unused cached GPU memory once generation is done.
torch.cuda.empty_cache()
Native bf16: no quantization step
LuluV2 is intended to run from the native bf16 checkpoint:
LULUV2-bf16.pt
Start with:
--dtype bf16
Do not quantize the model before using this package. The intended baseline is direct native-bf16 execution.
Fallback modes are available for hardware compatibility:
--dtype fp16
--dtype fp32
These are compatibility modes, not separate quantized releases.
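The dtype guidance above can be written down as a small decision helper. This is a minimal sketch, not code from the repository: `pick_dtype` is a hypothetical name, and the rule it encodes (bf16 on CUDA when supported, fp16 on older CUDA GPUs, fp32 everywhere else) is one reading of this README rather than documented runtime behavior.

```python
def pick_dtype(device, bf16_supported):
    """Map a device string and a bf16 capability flag to a --dtype value."""
    if device == "cuda":
        # Native bf16 is the intended baseline; fp16 is the fallback for older GPUs.
        return "bf16" if bf16_supported else "fp16"
    # Everything else uses fp32, matching the CPU commands in this README.
    return "fp32"


# Print the flags for the three situations this README describes.
for device, bf16_ok in [("cuda", True), ("cuda", False), ("cpu", False)]:
    print(f"--device {device} --dtype {pick_dtype(device, bf16_ok)}")
```

On a real machine, the `bf16_supported` flag could come from `torch.cuda.is_bf16_supported()`.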
Hardware notes
- CUDA / NVIDIA: recommended path. Use --device cuda --dtype bf16 when your GPU supports bf16.
- Windows / PC: use the same CUDA command above. If memory is tight, reduce --max-context first.
- macOS: CPU mode is the safer compatibility path at the moment. MPS may produce unstable logits in some environments, so use --device cpu --dtype fp32 if MPS causes issues.
- CPU: supported for testing and compatibility, but generation will be much slower than GPU inference.
CPU test command:
python app.py --ckpt ./LULUV2-bf16.pt --device cpu --dtype fp32 --max-context 4096 --inbrowser
If CUDA runs out of memory, reduce context length:
python app.py --ckpt ./LULUV2-bf16.pt --device cuda --dtype bf16 --max-context 8192 --inbrowser
Context length
The runtime exposes --max-context and the architecture may support long context settings. Start with a practical context such as 4096 or 8192, then increase if your hardware has enough memory and the runtime remains stable.
Example:
python app.py --ckpt ./LULUV2-bf16.pt --max-context 4096 --inbrowser
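One way to follow the "start small, then increase" advice is to step through a fixed list of context sizes and stop at the last one that runs stably. The sketch below only generates that list and the matching commands. `context_ladder` is a hypothetical helper, doubling is an arbitrary choice, and the 32768 ceiling is taken from the explicit command earlier in this README, not from a verified architectural limit.

```python
def context_ladder(start=4096, ceiling=32768):
    """Return context sizes to try in order, doubling from start up to ceiling."""
    if start <= 0 or ceiling < start:
        raise ValueError("need 0 < start <= ceiling")
    sizes = []
    size = start
    while size <= ceiling:
        sizes.append(size)
        size *= 2
    return sizes


for size in context_ladder():
    # Each line is the README's app.py command with a different --max-context value.
    print(f"python app.py --ckpt ./LULUV2-bf16.pt --max-context {size} --inbrowser")
```

If a step runs out of memory or becomes unstable, fall back to the previous value in the list.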
Checkpoint format
The runtime expects a PyTorch checkpoint file named:
LULUV2-bf16.pt
The checkpoint should contain a model state dictionary under the model key. Optional fields may include runtime metadata and second-pass refinement weights.
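To check that a file matches this layout before launching the UI, something like the sketch below can inspect a loaded checkpoint. It is not part of the repository: `describe_checkpoint` and `load_on_cpu` are hypothetical names. Only the `model` key comes from this README, and no names are assumed for the optional fields.

```python
def describe_checkpoint(ckpt):
    """Summarize a loaded checkpoint object and verify the 'model' state dict exists."""
    if not isinstance(ckpt, dict):
        raise TypeError(f"expected a dict checkpoint, got {type(ckpt).__name__}")
    if "model" not in ckpt:
        raise KeyError("checkpoint has no 'model' key; is this the LuluV2 .pt file?")
    state = ckpt["model"]
    return {
        "top_level_keys": sorted(ckpt),
        "num_tensors": len(state),
        # Collect dtype names only for entries that expose a .dtype attribute.
        "dtypes": sorted({str(v.dtype) for v in state.values() if hasattr(v, "dtype")}),
    }


def load_on_cpu(path):
    """Load a checkpoint onto the CPU so inspection does not need GPU memory."""
    import torch  # imported here so describe_checkpoint stays usable without PyTorch

    # torch.load relies on pickle; only load checkpoints from sources you trust.
    return torch.load(path, map_location="cpu")


# Stand-in dictionary for illustration; a real call would be
# describe_checkpoint(load_on_cpu("./LULUV2-bf16.pt")).
print(describe_checkpoint({"model": {"w": [0.0]}, "meta": {}}))
```

For the native checkpoint, the reported dtypes would be expected to include `torch.bfloat16`. A missing `model` key corresponds to the "Checkpoint missing model state dict" error covered under Troubleshooting.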
If your checkpoint has a different filename, pass it explicitly:
python app.py --ckpt /path/to/your-checkpoint.pt --tokenizer-dir ./tokenizer --inbrowser
Troubleshooting
FileNotFoundError: LULUV2-bf16.pt
The checkpoint is missing or has a different name. Put LULUV2-bf16.pt in the repository root or pass the full path with --ckpt.
Checkpoint missing model state dict
The file passed to --ckpt is not the expected LuluV2 PyTorch checkpoint. Make sure you are using the native .pt checkpoint, not a config file, tokenizer file, or unrelated export.
CUDA out of memory
Lower the context window, the number of generated tokens, or both:
python app.py --ckpt ./LULUV2-bf16.pt --max-context 4096 --inbrowser
Older GPU does not support bf16
Try fp16:
python app.py --ckpt ./LULUV2-bf16.pt --device cuda --dtype fp16 --inbrowser
Mac MPS issues
Use CPU mode for compatibility:
python app.py --ckpt ./LULUV2-bf16.pt --device cpu --dtype fp32 --inbrowser
Model status
This is an early public inference release. The model has been tested for chat behavior, including English and multilingual prompts, but evaluation is still ongoing. The public package is intentionally stripped to the files needed for local inference.
Fine-tuning code and additional technical notes are planned separately. A mobile-oriented native runtime is also being explored.
Feedback
Try it locally, inspect the runtime, break things, and report what is missing. Useful feedback includes:
- hardware and OS
- command used
- context length
- dtype
- error logs
- examples of good or bad generations
Safety and intended use
LuluV2 is a local text-generation model. It can produce incorrect, incomplete, or unsafe outputs. Do not treat generations as verified facts for medical, legal, financial, security, or other high-stakes decisions. Use human review where accuracy matters.
License and notices
This repository is marked as Apache-2.0. Before publishing the final weights, make sure the repository includes any notices, attribution, or license text required by the checkpoint, training sources, tokenizer, or dependencies.