arxiv:2602.06964

Learning a Generative Meta-Model of LLM Activations

Published on Feb 6
Submitted by Grace Luo on Feb 9

Abstract

Training diffusion models on neural network activations creates meta-models that learn internal state distributions and improve intervention fidelity without restrictive structural assumptions.

AI-generated summary

Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

Community

Grace Luo (paper author and submitter):

Check out our codebase; everything’s ready to go! The code even runs on Nvidia RTX 4090s.

Page: https://generative-latent-prior.github.io
Code: https://github.com/g-luo/generative_latent_prior
Paper: https://arxiv.org/abs/2602.06964

Here are the main results from the paper "Learning a Generative Meta-Model of LLM Activations," which introduces **GLP (Generative Latent Prior)**—a diffusion model trained on LLM activations.

1. Core Concept: Generative Meta-Modeling

The paper proposes training diffusion models directly on LLM residual stream activations to learn the underlying distribution of internal states without imposing structural assumptions (like linearity or sparsity). This creates a "meta-model" that can serve as a learned prior for downstream interpretability tasks.

Figure 1: GLP is an activation diffusion model trained on LLM residual streams. It serves as a prior for on-manifold steering and exhibits reliable scaling laws.
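
For intuition, here is a minimal sketch of what training such an activation diffusion model could look like: a DDPM-style denoiser over d-dimensional activation vectors, trained with the standard noise-prediction objective. The MLP architecture, linear noise schedule, and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a diffusion "meta-model" over residual-stream activations.
# Architecture, schedule, and sizes are assumptions, not the paper's setup.
import torch
import torch.nn as nn

D_MODEL = 4096   # residual stream width (assumption; e.g., a Llama8B-scale model)
T = 1000         # number of diffusion timesteps (assumption)

class Denoiser(nn.Module):
    def __init__(self, d=D_MODEL, hidden=8192):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)            # timestep conditioning
        self.inp = nn.Linear(d, hidden)
        self.mid = nn.Sequential(nn.SiLU(), nn.Linear(hidden, hidden), nn.SiLU())
        self.out = nn.Linear(hidden, d)

    def forward(self, x_t, t):
        h = self.inp(x_t) + self.t_embed(t)               # condition on noise level
        return self.out(self.mid(h))                      # predict the added noise

betas = torch.linspace(1e-4, 0.02, T)                     # linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0):
    """One DDPM step on a batch of activations x0: (batch, D_MODEL)."""
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps          # forward-noise x0
    return nn.functional.mse_loss(model(x_t, t), eps)     # the "diffusion loss"
```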

2. Scaling Laws: Predictable Improvement with Compute

A key finding is that GLP training follows smooth power-law scaling, and crucially, diffusion loss reliably predicts downstream utility.

Figure 2: GLP scales predictably with compute. (a) Diffusion loss follows a power law with estimated irreducible error of 0.52. (b) On-manifold sentiment steering performance improves with compute, tracking the loss. (c) 1-D probing performance for 113 binary tasks also improves with compute.

The scaling analysis (models from 0.5B to 3.3B parameters trained on 1B activations) shows:

  • Loss follows the form $L(C) = E + A \cdot C^{-\alpha}$ with $E = 0.52$ (irreducible error)
  • Each 60× increase in compute halves the gap to the irreducible floor (a quick check of the implied exponent follows this list)
  • Both steering and probing performance correlate strongly with diffusion loss
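
Taken together, the reported floor and the 60× halving rule pin down the power-law exponent. A quick arithmetic check (the amplitude A is a placeholder assumption, not a fitted value):

```python
import math

E = 0.52  # irreducible error reported in the paper's fit
# "Each 60x increase in compute halves the gap to the floor" implies
# 60**(-alpha) = 1/2, so alpha = ln 2 / ln 60.
alpha = math.log(2) / math.log(60)
print(f"implied alpha ~ {alpha:.3f}")   # ~0.169

def loss(C, A=1.0):
    """L(C) = E + A * C**(-alpha). A is a placeholder, not fitted."""
    return E + A * C ** (-alpha)

print(loss(1.0) - E, loss(60.0) - E)    # gap to the floor halves: 1.0 -> 0.5
```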

3. On-Manifold Steering: Preserving Fluency During Intervention

Activation steering often pushes activations "off-manifold" (into unrealistic regions of activation space), degrading output fluency. GLP solves this by post-processing steered activations through diffusion sampling (an activation-space analog of SDEdit).

Figure 4: The on-manifold steering algorithm. Given a steered activation, noise is added and then removed via GLP denoising, projecting the activation back onto the learned manifold while preserving semantic content.
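
A minimal sketch of this SDEdit-style procedure, reusing the DDPM parameterization from the training sketch above and using deterministic DDIM denoising steps for brevity (the paper's sampler and noise level may differ):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def project_on_manifold(x_steered, glp, t_start=200):
    """SDEdit-style projection: partially noise a steered activation,
    then denoise it with the GLP. t_start trades edit strength against
    fidelity; all values here are illustrative assumptions.
    """
    b = x_steered.shape[0]
    a = alphas_bar[t_start]
    x = a.sqrt() * x_steered + (1 - a).sqrt() * torch.randn_like(x_steered)
    for t in range(t_start, 0, -1):
        eps = glp(x, torch.full((b,), t, dtype=torch.long))      # predicted noise
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # estimate clean x0
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps   # DDIM (eta=0) step
    return x
```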

Results on Steering Tasks:

Improving SAE Steering:
Figure 5: GLP post-processing expands the Pareto frontier of concept strength vs. fluency for SAE feature steering (Llama8B-Base).

Persona Elicitation:
Figure 6: GLP post-processing improves persona vector steering for behavioral traits (evil, insecure, power-seeking) in Llama8B-Instruct, achieving higher concept scores at the same fluency level.

4. Generation Quality: Near-Indistinguishable Activations

GLP generates activation samples that are statistically nearly identical to real LLM activations, as verified by multiple metrics:

Figure 3: (a-d) PCA visualizations showing convergence of generated (pink) vs. real (yellow) activations as sampling steps increase. (e) Fréchet distance quantitatively confirms generation quality, with ~20 steps needed for convergence.

Key Metrics:

  • Fréchet distance: GLP achieves a lower FD than SAE reconstructions, even though SAEs start from real activations while GLP samples start from pure noise (a standard FD estimator is sketched after these bullets)
  • Delta LM loss: Surprisingly, GLP achieves better reconstruction loss than a comparable SAE (0.21 vs. 0.28 for Llama8B), despite never being trained explicitly for reconstruction
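
For reference, the Fréchet distance can be estimated in the standard way (as in FID) by fitting Gaussians to the real and generated activation sets; this generic sketch is not necessarily the paper's exact estimator:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x_real, x_gen):
    """Fréchet distance between Gaussians fit to two (n, d) activation sets."""
    mu_r, mu_g = x_real.mean(0), x_gen.mean(0)
    cov_r = np.cov(x_real, rowvar=False)
    cov_g = np.cov(x_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```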

5. Interpretability: Superior 1-D Probing via "Meta-Neurons"

GLP's intermediate representations ("meta-neurons") isolate interpretable concepts into individual units more effectively than existing methods.

| Method | Llama1B AUC | Llama8B AUC |
|---|---|---|
| GLP (Ours) | 0.89 | 0.87 |
| Raw Layer Output | 0.82 | 0.83 |
| SAE Features | 0.80 | 0.82 |
| Raw MLP Neurons | 0.84 | 0.84 |

Table 4: GLP meta-neurons substantially outperform SAE features and raw activations on 1-D probing across 113 binary classification tasks.

Notable finding: GLPs trained on Llama1B activations (0.5B-3.3B parameters) outperform all raw-activation baselines from Llama8B, suggesting that training a GLP can be more effective than scaling the base LLM for obtaining interpretable representations.
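
As a rough illustration of the 1-D probing protocol above, the sketch below scores each unit in isolation by AUC on a binary task and keeps the best-scoring unit; `acts` and `labels` are hypothetical inputs, and the paper's exact protocol (task suite, splits) is not reproduced here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_unit_auc(acts, labels):
    """1-D probing sketch: score each unit alone by AUC, keep the best.

    acts: (n_examples, n_units) activations (raw neurons or GLP meta-neurons);
    labels: (n_examples,) binary task labels.
    """
    aucs = np.array([roc_auc_score(labels, acts[:, j])
                     for j in range(acts.shape[1])])
    aucs = np.maximum(aucs, 1.0 - aucs)   # a unit may encode either sign
    best = int(np.argmax(aucs))
    return best, float(aucs[best])
```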

Qualitative Examples:
Meta-neurons discovered via 1-D probing show consistent activation patterns:

  • A "baseball" meta-neuron activates on baseball terms
  • A "contradiction" meta-neuron activates on expressions of disagreement

Summary of Key Contributions

  1. Scalable Alternative to Structural Assumptions: Unlike PCA or SAEs, GLP imposes no linearity/sparsity constraints yet improves with predictable scaling
  2. Practical Utility: Diffusion loss serves as a reliable proxy for downstream task performance (steering and probing)
  3. On-Manifold Regularization: Post-processing with GLP improves steering fluency across SAE features, DiffMean, and persona vectors
  4. Superior Feature Isolation: Meta-neurons outperform SAE features on 1-D probing, suggesting better concept disentanglement

The paper concludes that generative meta-models offer a promising path toward LLM interpretability that improves predictably with compute, without requiring hand-crafted structural assumptions.
