Colox, a reasoning AI model. I am currently working on a model smarter than GPT o1 that thinks before it speaks. It is coming tomorrow in the afternoon.
I've made yet another merge of reasoning models with incremental gains on the current Open LLM leaderboard. open-llm-leaderboard/open_llm_leaderboard
Merging in DeepSeek R1 distillation to Llama 3.1 8B (at 10% task arithmetic weight, using the Llama 3.1 8B base model as the case rather than the instruct model) with a prior best merge resulted in a slightly lower IFEval, but a higher result in every other benchmark save for MMLU-PRO, which went down only marginally. MATH Lvl5 and GPQA went up palpably. grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B
This result is currently my best Llama 3.1 8B merge result to date. The actual R1 distillation itself scored quite badly, so this would seem to be another case of unexpected formatting (reflected in IFEval) hurting the evaluation results, obscuring the strength of a model.
It is also possible to use the text generation feature of this model to generate roleplay completions. Based on informal testing, this model's bias toward problem-solving will subtly impact narration.
๐ Excited to share that our paper, "SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models", has been accepted to #ICRA2025! ๐ Preprint: https://arxiv.org/pdf/2409.19471
We introduce SELP (Safe Efficient LLM Planner), a novel approach for generating plans that adhere to user-specified constraints while optimizing for time-efficient execution. By leveraging linear temporal logic (LTL) to interpret natural language commands, SELP effectively handles complex commands and long-horizon tasks. ๐ค
๐กSELP presents three key insights: 1๏ธโฃ Equivalence Voting: Ensures robust translations from natural language instructions into LTL specifications. 2๏ธโฃ Constrained Decoding: Uses the generated LTL formula to guide the autoregressive inference of plans, ensuring the generated plans conform to the LTL. 3๏ธโฃ Domain-Specific Fine-Tuning: Customizes LLMs for specific robotic tasks, boosting both safety and efficiency.
๐ Experiment: Our experiments demonstrate SELPโs effectiveness and generalizability across diverse tasks. In drone navigation, SELP outperforms state-of-the-art LLM planners by 10.8% in safety rate and by 19.8% in plan efficiency. For robot manipulation, SELP achieves a 20.4% improvement in safety rate.