LuluV2 Native-bf16

LuluV2 is a local-first text-generation model from OpenMachine. This release is packaged as a custom PyTorch runtime with native bfloat16 weights. It is not a standard Transformers AutoModelForCausalLM repository, and it is not meant to be loaded through pipeline(...), AutoModelForCausalLM.from_pretrained(...), llama.cpp, Ollama, GGUF, GPTQ, AWQ, or any other quantized conversion path.

Clone the repo, keep the checkpoint as LULUV2-bf16.pt, and run it directly.

Key points

Architecture: native RPT/VWM-style LuluV2 architecture, not a normal decoder-only Transformers layout.
Runtime: custom PyTorch inference code included in this repository.
Precision: native bfloat16 checkpoint.
Quantization: no quantization step is required or recommended for the intended release path.
Mode: inference-only public package.
Tokenizer: local tokenizer files are included under tokenizer/.
Status: experimental research/runtime release; some inference paths are still Python-heavy and not fully optimized.

What is included

.
├── LULUV2-bf16.pt                  # native bf16 checkpoint, required
├── README.md                       # model card and run guide
├── app.py                          # local Gradio chat UI
├── run_inference.py                # command-line generation runner
├── luluv2_inference_runtime.py     # custom architecture + checkpoint loader
├── luluv2_live_inference.py        # streaming generation engine
├── luluv2_optimized_engine.py      # optimized local engine variant
├── config.json                     # HF metadata only; not an AutoModel config
├── generation_config.json          # default generation settings
├── requirements.txt
├── run_chat.sh                     # Linux/macOS launcher
├── run_chat.ps1                    # Windows PowerShell launcher
└── tokenizer/                      # local tokenizer files

What this is not

This repository is not:

a standard Transformers model package
a package built around a hosted inference widget
a training repository
a dataset release
a quantized model release
a GGUF / llama.cpp / Ollama release

Use the included runtime scripts.

Requirements

Recommended environment:

Python 3.10 or newer
PyTorch 2.1 or newer
CUDA GPU with bf16 support for best performance
Local disk space for the checkpoint

Install dependencies:

pip install -r requirements.txt

The repo is local-first. The runtime uses the included checkpoint and tokenizer folder. It does not download external model weights at startup.

Quick start: chat UI

From the cloned repository:

python app.py --inbrowser

Equivalent explicit command:

python app.py \
  --ckpt ./LULUV2-bf16.pt \
  --model-py ./luluv2_inference_runtime.py \
  --tokenizer-dir ./tokenizer \
  --device cuda \
  --dtype bf16 \
  --max-context 32768 \
  --inbrowser

The UI starts a local Gradio chat app. By default it binds to 127.0.0.1 on port 7862.
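
If the UI fails to start, another process may already be listening on that port. The following is a minimal sketch of a pre-launch check using only the Python standard library. The helper name port_is_free is hypothetical and is not part of this repository.

```python
import socket


def port_is_free(host: str = "127.0.0.1", port: int = 7862) -> bool:
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepts the connection.
        return probe.connect_ex((host, port)) != 0


if __name__ == "__main__":
    state = "free" if port_is_free() else "already in use"
    print(f"127.0.0.1:7862 is {state}")
```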

Launcher scripts

Linux/macOS:

bash ./run_chat.sh

Windows PowerShell:

.\run_chat.ps1

Quick start: command line

python run_inference.py \
  --ckpt ./LULUV2-bf16.pt \
  --tokenizer-dir ./tokenizer \
  --device cuda \
  --dtype bf16 \
  --prompt "Write a short introduction to LuluV2."

Generation controls:

python run_inference.py \
  --ckpt ./LULUV2-bf16.pt \
  --prompt "Explain why native bf16 inference matters." \
  --max-new-tokens 700 \
  --temperature 0.65 \
  --top-p 0.90 \
  --top-k 40
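
For readers unfamiliar with these three controls, the sketch below shows one common way temperature, top-k, and top-p are combined when choosing the next token. It is a generic illustration in plain Python and is not taken from the LuluV2 runtime, whose sampling code may order or renormalize these steps differently. The function name filter_probs is hypothetical.

```python
import math


def filter_probs(logits: list[float], temperature: float,
                 top_k: int, top_p: float) -> list[float]:
    """Return renormalized probabilities after temperature, top-k, then top-p."""
    # Temperature below 1.0 sharpens the distribution; above 1.0 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely and keep at most top_k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Zero out everything else and renormalize the survivors.
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
```

Lower values of any of the three settings narrow the set of candidate tokens, which generally makes output more conservative.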

Python usage

import torch
from luluv2_live_inference import LULUV2LiveEngine, GenerationConfig

engine = LULUV2LiveEngine(
    ckpt_path="./LULUV2-bf16.pt",
    model_py="./luluv2_inference_runtime.py",
    tokenizer_dir="./tokenizer",
    device="cuda",
    dtype="bf16",
    local_files_only=True,
    no_config_download=True,
)

cfg = GenerationConfig(
    max_new_tokens=512,
    temperature=0.65,
    top_p=0.90,
    top_k=40,
)

history = []
system_prompt = "You are LuluV2, a helpful local AI assistant."

for partial_text in engine.generate_stream(
    "What are you?",
    history,
    system_prompt,
    cfg,
):
    print(partial_text, end="", flush=True)

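print()

# Release cached GPU memory once generation is done (only relevant on CUDA).
if torch.cuda.is_available():
    torch.cuda.empty_cache()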
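
This guide does not state whether generate_stream yields the full text so far on each step or only the newly generated piece. If it yields the full text, printing every item as-is would repeat earlier output. The sketch below handles both cases behind an explicit flag. print_stream is a hypothetical helper, not part of the runtime, and the right value of cumulative is an assumption to check against luluv2_live_inference.py.

```python
from typing import Iterable


def print_stream(chunks: Iterable[str], cumulative: bool) -> str:
    """Print streamed text without repeats and return the final text.

    cumulative=True:  each chunk is the full text generated so far.
    cumulative=False: each chunk is only the newly generated piece.
    """
    shown = ""
    for chunk in chunks:
        if cumulative:
            new = chunk[len(shown):]  # print only the unseen suffix
            shown = chunk
        else:
            new = chunk
            shown += chunk
        print(new, end="", flush=True)
    print()
    return shown
```

With the engine from the example above, this could be called as print_stream(engine.generate_stream("What are you?", history, system_prompt, cfg), cumulative=True).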

Native bf16: no quantization step

LuluV2 is intended to run from the native bf16 checkpoint:

LULUV2-bf16.pt

Start with:

--dtype bf16

Do not quantize the model before using this package. The intended baseline is direct native-bf16 execution.

Fallback modes are available for hardware compatibility:

--dtype fp16
--dtype fp32

These are compatibility modes, not separate quantized releases.

Hardware notes

CUDA / NVIDIA: recommended path. Use --device cuda --dtype bf16 when your GPU supports bf16.
Windows / PC: use the same CUDA command above. If memory is tight, reduce --max-context first.
macOS: CPU mode is the safer compatibility path at the moment. MPS may produce unstable logits in some environments, so use --device cpu --dtype fp32 if MPS causes issues.
CPU: supported for testing and compatibility, but generation will be much slower than GPU inference.
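
The notes above can be sketched as a small selection helper. This is a sketch under assumptions, not code from the repository: the function names are hypothetical, and the fallback order (bf16 on capable CUDA GPUs, fp16 on other CUDA GPUs, fp32 on CPU) simply mirrors the recommendations in this guide. MPS is deliberately left out because the CPU path is described as safer.

```python
def choose_device_and_dtype(cuda_available: bool, bf16_supported: bool) -> tuple[str, str]:
    """Map hardware capabilities to the --device / --dtype values used in this guide."""
    if cuda_available and bf16_supported:
        return "cuda", "bf16"   # recommended path
    if cuda_available:
        return "cuda", "fp16"   # compatibility mode for older GPUs
    return "cpu", "fp32"        # slow, but the safest fallback


def probe_torch() -> tuple[bool, bool]:
    """Ask PyTorch what the local machine supports; (False, False) if it is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return cuda, bf16


if __name__ == "__main__":
    device, dtype = choose_device_and_dtype(*probe_torch())
    print(f"--device {device} --dtype {dtype}")
```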

CPU test command:

python app.py --ckpt ./LULUV2-bf16.pt --device cpu --dtype fp32 --max-context 4096 --inbrowser

If CUDA runs out of memory, reduce context length:

python app.py --ckpt ./LULUV2-bf16.pt --device cuda --dtype bf16 --max-context 8192 --inbrowser

Context length

The runtime exposes --max-context, and the architecture may support long contexts (the explicit launch command in the chat UI quick start uses 32768). Memory use grows with the context window, so start with a practical value such as 4096 or 8192, then increase it if your hardware has enough memory and the runtime remains stable.

Example:

python app.py --ckpt ./LULUV2-bf16.pt --max-context 4096 --inbrowser
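
One way to step up gradually is to double the value between runs and stop at the first size that runs out of memory or behaves unstably. The sketch below only prints the candidate commands. The helper name context_ladder is hypothetical, and the 32768 ceiling is taken from the explicit launch command in this guide rather than from any documented architectural limit.

```python
def context_ladder(start: int = 4096, limit: int = 32768) -> list[int]:
    """Return context sizes from start up to limit, doubling at each step."""
    sizes = []
    size = start
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    for size in context_ladder():
        print(f"python app.py --ckpt ./LULUV2-bf16.pt --max-context {size} --inbrowser")
```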

Checkpoint format

The runtime expects a PyTorch checkpoint file named:

LULUV2-bf16.pt

The checkpoint should contain a model state dictionary under the model key. Optional fields may include runtime metadata and second-pass refinement weights.
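
To check a file before launching the runtime, a short inspection script along these lines may help. It is a sketch, not the repository's loader: the function names are hypothetical, and only the presence of the model key is verified because the names of the optional fields are not documented here. torch.load unpickles the file, so only use it on checkpoints you trust. Depending on the PyTorch version, checkpoints with extra metadata may also need weights_only=False.

```python
from collections.abc import Mapping


def describe_checkpoint(ckpt: object) -> list[str]:
    """Return the sorted top-level keys, raising if the 'model' entry is missing."""
    if not isinstance(ckpt, Mapping) or "model" not in ckpt:
        raise ValueError("Checkpoint missing model state dict")
    return sorted(str(key) for key in ckpt)


def load_and_describe(path: str) -> list[str]:
    """Load a checkpoint on the CPU and list its top-level keys."""
    import torch  # imported lazily so describe_checkpoint works without PyTorch

    # map_location="cpu" avoids allocating GPU memory just to inspect the file.
    ckpt = torch.load(path, map_location="cpu")
    return describe_checkpoint(ckpt)
```

For example, print(load_and_describe("./LULUV2-bf16.pt")) lists the keys of the bundled checkpoint.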

If your checkpoint has a different filename, pass it explicitly:

python app.py --ckpt /path/to/your-checkpoint.pt --tokenizer-dir ./tokenizer --inbrowser

Troubleshooting

`FileNotFoundError: LULUV2-bf16.pt`

The checkpoint is missing or has a different name. Put LULUV2-bf16.pt in the repository root or pass the full path with --ckpt.

`Checkpoint missing model state dict`

The file passed to --ckpt is not the expected LuluV2 PyTorch checkpoint. Make sure you are using the native .pt checkpoint, not a config file, tokenizer file, or unrelated export.

CUDA out of memory

Lower the context window with --max-context and, if needed, the generated token count (--max-new-tokens in run_inference.py):

python app.py --ckpt ./LULUV2-bf16.pt --max-context 4096 --inbrowser

Older GPU does not support bf16

Try fp16:

python app.py --ckpt ./LULUV2-bf16.pt --device cuda --dtype fp16 --inbrowser

Mac MPS issues

Use CPU mode for compatibility:

python app.py --ckpt ./LULUV2-bf16.pt --device cpu --dtype fp32 --inbrowser

Model status

This is an early public inference release. The model has been tested for chat behavior, including English and multilingual prompts, but evaluation is still ongoing. The public package is intentionally stripped to the files needed for local inference.

Fine-tuning code and additional technical notes are planned separately. A mobile-oriented native runtime is also being explored.

Feedback

Try it locally, inspect the runtime, break things, and report what is missing. Useful feedback includes:

hardware and OS
command used
context length
dtype
error logs
examples of good or bad generations
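
Most of these details can be gathered automatically. The sketch below collects them into a JSON snippet that can be pasted into a report. collect_report is a hypothetical helper, not part of the repository, and the command, context length, and dtype still have to be supplied by hand.

```python
import json
import platform
import sys


def collect_report(command: str, max_context: int, dtype: str) -> dict:
    """Gather hardware, OS, and run settings for a feedback report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "command": command,
        "max_context": max_context,
        "dtype": dtype,
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    print(json.dumps(collect_report("python app.py --inbrowser", 32768, "bf16"), indent=2))
```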

Safety and intended use

LuluV2 is a local text-generation model. It can produce incorrect, incomplete, or unsafe outputs. Do not treat generations as verified facts for medical, legal, financial, security, or other high-stakes decisions. Use human review where accuracy matters.

License and notices

This repository is marked as Apache-2.0. Before publishing the final weights, make sure the repository includes any notices, attribution, or license text required by the checkpoint, training sources, tokenizer, or dependencies.
