Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)
This model was aligned with TRL PPO using the following reward model:
- payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline (tag: baseline)
Key settings:
- Prompt pool: restricted to the same fixed 1k prompt subset used for reward-model (RM) training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-5
- LoRA: r=16, alpha=32, dropout=0.05
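The key settings above can be sketched as TRL/peft configuration objects. This is a minimal sketch assuming TRL's classic `PPOConfig` API (which has changed across TRL versions) and `peft` for LoRA; only the hyperparameter values come from this card, everything else is illustrative, not the exact training script.

```python
# Sketch of the PPO + LoRA configuration described in "Key settings".
# Assumes TRL's classic PPOConfig API (pre-0.12) and peft; verify field
# names against the TRL version actually used for training.
from trl import PPOConfig
from peft import LoraConfig

ppo_config = PPOConfig(
    learning_rate=1e-5,  # lr from "Key settings"
    batch_size=4,        # batch size from "Key settings"
)

lora_config = LoraConfig(
    r=16,                # LoRA rank
    lora_alpha=32,       # LoRA scaling factor
    lora_dropout=0.05,   # LoRA dropout
    task_type="CAUSAL_LM",
)
```

The 200 PPO updates from the card would then be the outer training-loop count driven by the (omitted) prompt-pool dataloader.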
Model tree for payelb/aligned_tinyllama_ultrafeedback_fixed1k_baseline
- Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
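Since the base model is TinyLlama-1.1B-Chat-v1.0, prompts at inference time should follow its chat format. A minimal sketch of that formatting is below, assuming the base model's Zephyr-style template; the authoritative way is `tokenizer.apply_chat_template` on the model's own tokenizer, so treat the literal tags here as an assumption to verify.

```python
def format_chat(user_msg: str, system_msg: str = "You are a helpful assistant.") -> str:
    # Zephyr-style chat format used by TinyLlama-1.1B-Chat-v1.0 (assumption;
    # confirm with tokenizer.apply_chat_template on the actual tokenizer).
    return (
        f"<|system|>\n{system_msg}</s>\n"
        f"<|user|>\n{user_msg}</s>\n"
        f"<|assistant|>\n"
    )

prompt = format_chat("Summarize the UltraFeedback dataset in one sentence.")
```

The resulting string can be tokenized and passed to the aligned model's `generate` call as usual.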