Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)
This model was aligned with TRL PPO using the following reward model:
- payelb/UltraFeedback_openbmb_deberta_1k_fixed_baseline (tag: baseline)
Key settings:
- Prompt pool: restricted to the same fixed 1k prompt subset used for reward-model (RM) training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-5
- LoRA: r=16, alpha=32, dropout=0.05
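The key settings above can be sketched as TRL/peft configuration objects. This is a minimal sketch assuming TRL's classic `PPOConfig` API (which has changed across TRL versions) and `peft` for LoRA; only the hyperparameter values come from this card, everything else is illustrative, not the exact training script.

```python
# Sketch of the PPO + LoRA configuration described in "Key settings".
# Assumes TRL's classic PPOConfig API (pre-0.12) and peft; verify field
# names against the TRL version actually used for training.
from trl import PPOConfig
from peft import LoraConfig

ppo_config = PPOConfig(
    learning_rate=1e-5,  # lr from "Key settings"
    batch_size=4,        # batch size from "Key settings"
)

lora_config = LoraConfig(
    r=16,                # LoRA rank
    lora_alpha=32,       # LoRA scaling factor
    lora_dropout=0.05,   # LoRA dropout
    task_type="CAUSAL_LM",
)
```

The 200 PPO updates from the card would then be the outer training-loop count driven by the (omitted) prompt-pool dataloader.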
Model tree for payelb/aligned_tinyllama_ultrafeedback_fixed1k_baseline
- Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
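Since the base model is TinyLlama-1.1B-Chat-v1.0, prompts at inference time should follow its chat format. A minimal sketch of that formatting is below, assuming the base model's Zephyr-style template; the authoritative way is `tokenizer.apply_chat_template` on the model's own tokenizer, so treat the literal tags here as an assumption to verify.

```python
def format_chat(user_msg: str, system_msg: str = "You are a helpful assistant.") -> str:
    # Zephyr-style chat format used by TinyLlama-1.1B-Chat-v1.0 (assumption;
    # confirm with tokenizer.apply_chat_template on the actual tokenizer).
    return (
        f"<|system|>\n{system_msg}</s>\n"
        f"<|user|>\n{user_msg}</s>\n"
        f"<|assistant|>\n"
    )

prompt = format_chat("Summarize the UltraFeedback dataset in one sentence.")
```

The resulting string can be tokenized and passed to the aligned model's `generate` call as usual.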