ArtiGen V1.0: Adaptive Reasoning Token-Informed Generative Engine

What is ArtiGen?

A novel, lightweight, mobile-friendly text-to-image generation architecture designed specifically for anime/illustration art. It runs in under 3 GB of RAM on consumer devices and trains on the Colab Free Tier.

Why a New Architecture?

  • Existing models (SDXL, FLUX) are too heavy for mobile.
  • Quantization destroys aesthetic quality.
  • Old models (SD 1.5) lack prompt adherence and visual quality.
  • Attention-based transformers have O(N²) memory that explodes on high-resolution latent grids.

Core Innovations

  1. CARTEL Backbone: Hybrid SSM (Mamba-style) + RWKV + Liquid Time-Constant gates. O(N) complexity, no heavy attention.
  2. PHI-SCAN: Physics-informed multi-directional scanning (Hilbert, zigzag-diagonal, row/column-major) preserving 2D spatial continuity. Zero extra parameters.
  3. ASDL (Art-Style Disentangled Latent Space): Modular heads that natively learn style, content, concept, mood, and composition as separate vectors in latent space. Users can tweak vectors to invent new art styles.
  4. Flow Matching + Spectral Smoothness: Replaces unstable diffusion training with rectified flow matching. Spectral Laplacian penalty reduces artifacts at 1024px native resolution.
  5. Progressive Modular Curriculum: 5-stage freeze/thaw training that forces each module to specialize before end-to-end tuning. Prevents loss explosion.
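The flow-matching objective in (4) can be sketched as a single PyTorch training step. The `model` call signature `model(z_t, t, text_embed)` returning the velocity `v_t(z_t)` is an assumption for illustration, not the actual ArtiGen API:

```python
import torch

def rectified_flow_loss(model, z0, text_embed):
    """One rectified flow matching training step (sketch).

    z0: clean latents (B, C, H, W). Noise and data are mixed linearly and
    the network regresses the constant velocity (z0 - noise) along the path.
    """
    noise = torch.randn_like(z0)
    t = torch.rand(z0.shape[0], device=z0.device)   # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    z_t = (1 - t_) * noise + t_ * z0                # linear interpolation path
    target_v = z0 - noise                           # constant target velocity
    pred_v = model(z_t, t, text_embed)              # hypothetical model signature
    return torch.nn.functional.mse_loss(pred_v, target_v)
```

Because the target velocity is constant along each path, there is no exploding variance near the endpoints, which is the stability advantage over DDPM noted below.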

Architecture

Text Prompt ──► Text Encoder ──► φ_text
                                        │
Timestep t ────► t_embed ──────►        │
                                        ▼
Latent z_t ────► Patchify ─────► PHI-SCAN ──► [CARTEL Block × N] ──► v_t(z_t)
                    ▲                              │
                    └────── Long Skip ─────────────┘
                           │
                    ASDL Heads (style, content, concept, mood, composition)
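One of the PHI-SCAN orders, the zigzag row-major scan, can be sketched as a pure index permutation (the Hilbert and column-major orders are built the same way). This is a minimal illustration of the zero-extra-parameter claim, not the actual ArtiGen code:

```python
import torch

def zigzag_scan_indices(h, w):
    """Zigzag (boustrophedon) scan order for an h x w latent grid.

    Reversing every other row keeps consecutive tokens spatially adjacent,
    preserving the 2D continuity the SSM relies on. No parameters: it is
    just a permutation applied to the token sequence before scanning.
    """
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)   # reverse odd rows
    return idx.view(-1)

# Usage: reorder patch tokens of shape (B, H*W, D) into scan order:
#   tokens_scanned = tokens[:, zigzag_scan_indices(32, 32)]
```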

Memory Footprint

| Component        | Parameters | FP16 VRAM |
|------------------|------------|-----------|
| CARTEL Backbone  | ~80M       | ~160 MB   |
| ASDL Heads       | ~20M       | ~40 MB    |
| Pretrained VAE   | ~50M       | ~100 MB   |
| **Total**        | **~150M**  | **~300 MB** |

With recurrent state caches, activations, and overhead: under 1.5 GB at inference. Training on the Colab Free Tier (batch_size=2, embed_dim=256, 16 layers) fits in a T4's 15 GB of VRAM.
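As a sanity check on the table above, FP16 stores 2 bytes per parameter, so the VRAM column follows directly from the parameter counts:

```python
# FP16 stores 2 bytes per parameter, so VRAM (MB) = params * 2 / 1e6.
components = {"CARTEL Backbone": 80e6, "ASDL Heads": 20e6, "Pretrained VAE": 50e6}
for name, params in components.items():
    print(f"{name}: {params * 2 / 1e6:.0f} MB")

total_mb = sum(components.values()) * 2 / 1e6
print(f"Total: {total_mb:.0f} MB")  # 300 MB
```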

Training Stages

| Stage | Module Trained      | Losses                         | Purpose                       |
|-------|---------------------|--------------------------------|-------------------------------|
| 1     | Style Head          | L_flow + L_style               | Learn artistic styles         |
| 2     | Content Head        | L_flow + L_content             | Learn semantic objects/scenes |
| 3     | Concept Head        | L_flow + L_concept             | Learn abstract relationships  |
| 4     | Mood + Composition  | L_flow + L_mood                | Learn emotion & layout        |
| 5     | All (unfrozen)      | L_flow + all aux + L_spectral  | End-to-end fine-tuning        |

Key Design Decisions

  • SSM+RWKV over Transformers: Linear O(N) vs quadratic O(N²). For 1024px → 32×32 latent = 1024 tokens, attention needs ~1M pairwise operations per layer; the SSM needs ~1K.
  • Flow Matching over DDPM: Stable training, fewer sampling steps (1–4), no exploding losses as t→0.
  • Wavelet spectral smoothness: Penalizes unnatural high-frequency noise, native 1024px quality without upsampling hacks.
  • Modular curriculum: Prevents catastrophic forgetting, forces each ASDL head to learn a clean, separable subspace.
  • LTC Gate: Liquid Time-Constant residual dynamically adapts between fast (textures) and slow (structures) pathways.
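The LTC gate in the last bullet can be sketched as a learned, input-dependent residual mix; this is an illustrative layer under assumed shapes, not the exact ArtiGen implementation:

```python
import torch
import torch.nn as nn

class LTCGate(nn.Module):
    """Liquid Time-Constant style residual gate (sketch).

    A per-channel mixing rate alpha, derived from the input, blends the
    block output f(x) with the residual stream: alpha near 1 reacts fast
    (textures), alpha near 0 changes slowly (global structure).
    """
    def __init__(self, dim):
        super().__init__()
        self.to_alpha = nn.Linear(dim, dim)

    def forward(self, x, fx):
        # alpha in (0, 1): input-dependent rate, one value per channel
        alpha = torch.sigmoid(self.to_alpha(x))
        return x + alpha * (fx - x)   # leaky integration toward f(x)
```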

Datasets (Suggested)

| Stage | Dataset                              | Source                                      |
|-------|--------------------------------------|---------------------------------------------|
| 1     | Anime illustrations with style tags  | Danbooru / Safebooru filtered               |
| 2–3   | Detailed caption dataset             | none-yet/anime-captions, latentcat/animesfw |
| 4     | Mood-labeled artwork                 | Self-annotated via CLIP clustering          |
| 5     | Full quality mix                     | Curated high-quality anime illustration set |

Usage

1. Generate Image (with pretrained VAE)

from artigen.model import ArtiGen
from artigen.sampling import sample
from diffusers import AutoencoderTiny
import torch

# Load lightweight VAE (madebyollin/taesd is a Tiny AutoEncoder, loaded via AutoencoderTiny)
vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to("cuda")

# Build model
model = ArtiGen(
    embed_dim=256, num_layers=16,
    latent_h=32, latent_w=32,
).to("cuda")
model.load_state_dict(torch.load("artigen_stage5.pt")["ema"])

# Text embed (e.g., CLIP)
text_embed = torch.randn(1, 768).to("cuda")

# Sample latent
z0 = sample(model, text_embed, latent_shape=(4, 32, 32), num_steps=4, cfg_scale=2.0)

# Decode
img = vae.decode(z0).sample

2. Invent a New Art Style

# Extract ASDL vectors (z_t, t taken from a sampling step as in example 1)
with torch.no_grad():
    _, asdl = model(z_t, t, text_embed, return_asdl=True)
    style_vec = asdl["style_vec"]  # (1, 64)

# Interpolate between two styles (style_a, style_b are style_vec tensors
# extracted as above from two different reference prompts)
new_style = 0.7 * style_a + 0.3 * style_b
# Inject during generation by conditioning text_embed with the style vector

3. Train (Colab Free Tier)

# In a Colab notebook cell
!git clone https://github.com/<repo>/artigen.git
%cd artigen
!python -m artigen.train \
    --epochs 5 --bs 2 --dim 256 --layers 16 \
    --latent_h 32 --latent_w 32 --device cuda

Citation & References

Architecture inspired by:

  1. DiM (2405.14224): SSM-based diffusion with multi-directional scan
  2. Zigzag Mamba (2403.13802): Spatial continuity via zigzag scanning
  3. Diffusion-RWKV (2404.04478): RWKV for diffusion generation
  4. MobileMamba (2411.15941): Three-stage wavelet-enhanced SSM backbone
  5. MILR (2509.22761): Test-time latent reasoning in unified space
  6. Unified Thinker (2601.03127): Reasoning-decoupled generation core
  7. LatentMorph (2602.02227): Implicit latent reasoning without decode loops
  8. LFM (2307.08698): Flow matching in pretrained VAE latent space
  9. Liquid Time-Constant Networks (2006.04439): Adaptive continuous-time gates
  10. Disentanglement via Latent Quantization (2305.18378): Modular latent decomposition

License

MIT License β€” free to use, modify, and deploy.
