Thanks! I updated the app today. Both the model and the app are Apache-2.0 licensed, so feel free to build with them and experiment. While the model probably won't be as good as a conversational assistant, we can only find out where it really shines through experimentation. Apparently, it works very well as a "semantic compressor" and on classification tasks. Maybe with audio? Let's see.
Massimo Roberto Scamarcia
Each orange lattice is a DSRNBlock slow state manifold. The red sphere is live entropy. The right panel shows the surprise gate firing token by token as the model converges on [TAKEAWAY_ORDER].
I built this because I'm a visual learner and I wanted to see the surprise gate open and close on each token. I needed to see what was happening inside the network, not just trust that it was working.
Turns out it's also a decent way to explain the architecture to someone who's never heard of this.
Made with Google Antigravity (Gemini)
mrs83/Echo-DSRN-114M-Telemetry-3D
The core problem we set out to solve: financial data (ledgers, earnings calls, tick streams) blows up the memory footprint of standard Transformers.
KV-Cache scaling makes federated training on the edge increasingly difficult. You cannot preserve data privacy if your decentralized nodes keep running out of memory.
Echo-DSRN addresses this at the architectural level. It uses a dual recurrent state design: a GRU fast path for short-range dynamics, and a surprise-gated slow memory whose write intensity is modulated by prediction error.
The result is O(1) memory regardless of context length. Runs on CPU, AMD ROCm, Apple MPS, NVIDIA GPUs.
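A minimal sketch of the surprise-gated slow-state update described above. This is illustrative only: the gate gain `k`, threshold `tau`, and the tanh candidate write are assumptions, not the model's published internals; the point is that the state has a fixed size, so memory stays O(1) however long the stream runs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def surprise_gated_step(slow_state, candidate, prediction_error, k=4.0, tau=1.0):
    """Blend a candidate write into the fixed-size slow state.

    The gate opens as prediction error (surprise) exceeds tau, so
    high-surprise tokens write strongly while predictable tokens leave
    the slow memory almost untouched. The state never grows, which is
    where the O(1) memory claim comes from.
    """
    gate = sigmoid(k * (prediction_error - tau))  # scalar in (0, 1)
    return (1.0 - gate) * slow_state + gate * np.tanh(candidate)

rng = np.random.default_rng(0)
state = np.zeros(512)                 # fixed 512-dim slow state
for err in [0.1, 0.2, 5.0]:           # the last token is "surprising"
    state = surprise_gated_step(state, rng.normal(size=512), err)
```

With a fixed gate the update reduces to a plain leaky memory; tying it to prediction error is what lets the slow state ignore predictable stretches of the stream.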
Combined with the Flower federated framework, financial institutions can now run local fine-tuning on proprietary data without it ever leaving their infrastructure.
Results on standard financial sentiment benchmarks:
- FPB: 70.2%
- TFNS: 70.2%
- FIQA: 63.8%
This is a 114M baseline. The next step is scaling.
The surprise gating mechanism independently converged on what
Flower Hub: https://flower.ai/apps/mrs83/echo-dsrn-114m-finance
Hugging Face: ethicalabs/FlowerTune-Echo-DSRN-114M-Finance-PEFT
While traditional models target general conversational reasoning, Echo-DSRN(N) is a specialized structural prototype.
It is a dual-state recurrent neural network engineered strictly for low-latency semantic compression and continuous text streaming with a permanent O(1) memory footprint.
Echo-DSRN (114M parameters) manages context via continuous structural compression:
- 8 Layers | 512 Hidden Dim
- Transformer Fast State + DSRN/GRU Recurrent Slow State + Surprise Gating
Initial pre-training on a single AMD Instinct MI300X, followed by localized refinement across AMD Radeon PRO GPUs and an AMD Ryzen AI Max+ 395 (Strix Halo).
A Hugging Face Space showcasing the architecture is currently running on the free shared CPU tier.
- The Compressor: Ingest a long document and crush it into a fixed 2048-dimensional .npy state vector.
- Vector Similarity: Upload two compressed .npy states to instantly calculate cosine similarity for ultra-lightweight RAG pre-filtering.
- The CPU Streamer: Continuous, fluent text generation running on raw CPU compute.
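The Vector Similarity step above is plain cosine similarity over the exported 2048-dimensional state vectors. A sketch, with placeholder filenames (the Space's actual export names are not given in the post):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two compressed state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice you would load the .npy states exported by the compressor:
#   a = np.load("doc_a_state.npy"); b = np.load("doc_b_state.npy")
# Here we use random 2048-dim vectors just to exercise the function.
a = np.random.default_rng(0).normal(size=2048)
b = np.random.default_rng(1).normal(size=2048)
score = cosine_similarity(a, b)  # near 0 for unrelated random vectors
```

For RAG pre-filtering, a fixed-size dot product per candidate document is what makes this "ultra-lightweight": no model forward pass is needed at query time.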
Disclaimer: This is a structural prototype. It has internalized formatting and conversational syntax, but it possesses zero world knowledge. It will confidently hallucinate. Use it for streaming transcription, style mimicry, and local semantic hashing, not for factual reasoning.
Try the CPU Demo: ethicalabs/Echo-DSRN-114M
Try the Model: ethicalabs/Echo-DSRN-114M
Now available on PyPI · GitHub · ClawHub · HuggingFace
AI models sense they could be wrong, but they can't actually fix what's broken.
Live A/B test: VIDraft/MARL
We evaluated 9 SOTA models (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, etc.) across 1,800 assessments in FINAL Bench and found a 39.2-percentage-point gap between "recognizing potential errors" (MA = 0.694) and "actually finding and fixing them" (ER = 0.302).
MARL (Model-Agnostic Runtime Middleware for LLMs) was built to close this metacognitive gap. It decomposes a single LLM call into a 5-stage expert pipeline (Hypothesis → Solver → Auditor → Adversarial Verifier → Synthesizer), transforming "answer in one shot" into "think, doubt, correct, and rewrite."
No weight modification: it works instantly with GPT-5.4, Claude, Gemini, Llama, or any OpenAI API-compatible LLM by changing one line, base_url. It ships with 9 domain-specific emergence engines (invention, pharma, genomics, chemistry, ecology, law, and more; 5,538 expert data items), activated by a simple tag like model="gpt-5.4::pharma".
pip install marl-middleware
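Based on the claims above, the `model="base::engine"` tag convention can be sketched as a simple parser; this is an assumed illustration of the convention the post describes, not MARL's published parsing code, and the client snippet in the comments uses a hypothetical local endpoint.

```python
def parse_model_tag(model: str):
    """Split a 'base::engine' tag such as 'gpt-5.4::pharma'.

    Assumed sketch of the tagging convention described in the post;
    the middleware's real parsing rules are not published here.
    """
    base, sep, engine = model.partition("::")
    return base, (engine if sep else None)

# With any OpenAI API-compatible client, the switch is then just base_url:
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="...")  # hypothetical endpoint
#   client.chat.completions.create(model="gpt-5.4::pharma", ...)
```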
MARL is also officially registered on ClawHub, the skill marketplace of OpenClaw, an AI agent platform with 260K+ developers and 3,200+ skills. It's the first middleware in the Reasoning Enhancement category. One command, clawhub install marl-middleware, gives your AI agent a metacognition upgrade.
Technical deep dive: https://huggingface.co/blog/FINAL-Bench/marl-middleware
PyPI: https://pypi.org/project/marl-middleware/
GitHub: https://github.com/Vidraft/MARL
ClawHub: https://clawhub.ai/Cutechicken99/marl-middleware
#MARL #LLM #Hallucination #Metacognition #MultiAgent #AIMiddleware #FINALBench #OpenClaw #ClawHub #PyPI #AGI #HuggingFace #ReasoningAI #SelfCorrection #GlassBoxAI
We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.
Community Article: https://huggingface.co/blog/FINAL-Bench/smol-worldcup
Live Leaderboard: ginigen-ai/smol-worldcup
Dataset: ginigen-ai/smol-worldcup
What we found:
- Gemma-3n-E4B (4B, 2GB RAM) outscores Qwen3-8B (8B, 5.5GB). Doubling parameters gained only 0.4 points. RAM cost: 2.75x more.
- GPT-OSS-20B fits in 1.5GB yet matches Champions-league dense models requiring 8.5GB. MoE architecture is the edge AI game-changer.
- Thinking models hurt structured output. DeepSeek-R1-7B scores 8.7 points below same-size Qwen3-8B and runs 2.7x slower.
- A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. The Qwen3 family hits 100% trap detection across all sizes.
- Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. The latest architecture at 1.7B beats older architectures at 14B.
What makes this benchmark different?
Most benchmarks ask "how smart?" β we measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric WCS = sqrt(SHIFT x PIR_norm) rewards models that are both high-quality AND efficient. Smart but massive? Low rank. Tiny but poor? Also low.
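The ranking metric follows directly from the formula; a quick sketch with made-up inputs (these are not the leaderboard's actual SHIFT or PIR values):

```python
import math

def wcs(shift: float, pir_norm: float) -> float:
    """World-Cup-style score: geometric mean of efficiency (SHIFT) and
    quality (normalized PIR), so a model must score on BOTH axes to
    rank high; either axis at zero zeroes the whole score."""
    return math.sqrt(shift * pir_norm)

# Illustrative values only:
balanced = wcs(80.0, 80.0)   # strong on both axes
lopsided = wcs(98.0, 40.0)   # smart but massive: ranks lower
```

The geometric mean is the key design choice: unlike an arithmetic mean, it cannot be gamed by maxing one axis while tanking the other.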
Top 5 by WCS:
1. GPT-OSS-20B · WCS 82.6 · 1.5GB · Raspberry Pi tier
2. Gemma-3n-E4B · WCS 81.8 · 2.0GB · Smartphone tier
3. Llama-4-Scout · WCS 79.3 · 240 tok/s · Fastest model
4. Qwen3-4B · WCS 76.6 · 2.8GB · Smartphone tier
5. Qwen3-1.7B · WCS 76.1 · 1.2GB · IoT tier
Built in collaboration with the FINAL Bench research team. Interoperable with ALL Bench Leaderboard for full small-to-large model comparison.
Dataset is open under Apache 2.0 (125 questions, 7 languages). We welcome new model submissions.
We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether AI knows it is wrong.
Dataset: FINAL-Bench/Metacognitive | 100 Tasks | 15 Domains | 8 TICOS Types | Apache 2.0
Leaderboard: FINAL-Bench/Leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/metacognitive
Core Innovation
Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly to the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
Three Findings Across 9 SOTA Models
We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:
1. ER Dominance. 94.8% of MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct: the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).
from datasets import load_dataset
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs
FINAL Bench is the first tool to tell apart what AI truly knows from what it merely pretends to know.
Thanks for sharing, we are using a similar recipe for our small models!
Tiny Aya is just a language model. It doesn't support tool calling, the key capability that turns frontier models into powerful *agents*.
So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns out… it's simple, thanks to Hugging Face TRL.
We're sharing a hands-on example showing how to fine-tune Tiny Aya into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.
Small model. Global reach. Agent capabilities.
https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb
After I began learning MLOps, I realized I needed some kind of home lab: there are a lot of GPUs I need to learn how to set up and test.
So I spent some time researching which platform to buy or build.
My requirements were:
- Limited budget
- Power supply 1 kW or higher
- A few PCIe slots, to be able to install more than one GPU
- Zero maintenance cost: I don't want to spend a lot of time or money maintaining lab hardware, except for the GPUs
I chose the Intel Mac Pro 7.1:
- Prices on eBay acceptable
- Excellent cooling
- 1.4 kW power supply
- 7 PCIe slots
- Zero maintenance: I don't need to do anything with the Mac Pro hardware; it just works
- Classic UEFI boot loader
It requires a bit of OS preparation:
1. Install Ubuntu 24.04 (it works with the general PC ISO image)
2. Set up T2 drivers
sudo apt install -y dkms linux-headers-$(uname -r) applesmc-t2 apple-bce lm-sensors
3. Install t2fanrd to manually manage the fans (/etc/t2fand.conf): https://wiki.t2linux.org/guides/fan/
4. Fix PCIe BAR allocation: add pci=realloc to GRUB_CMDLINE_LINUX_DEFAULT so the Linux kernel properly initializes server GPUs that lack a Graphics Output Protocol
5. Install NVIDIA GPU driver:
sudo apt install nvidia-driver-570
And it works!
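For step 4, the kernel parameter goes into /etc/default/grub; a sketch of the edit (the quiet/splash defaults are the stock Ubuntu values, so only pci=realloc is the addition):

```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"

# Then regenerate the boot configuration and reboot:
#   sudo update-grub
```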
I was able to run a server-grade Nvidia Tesla P100 (it required a DIY air duct) and consumer Nvidia Titan X, Titan V, and GTX 1080 cards on the old Mac Pro 7.1, even three in parallel.
For MLX-LM we can only use Apple MPS. Regarding GPUs, at the moment I only have an AMD Ryzen AI Max+ 395 and an AMD Instinct MI300X for rent.
@maxxafits00 federated learning is definitely the path forward, and it's something we've already begun experimenting with using the flower.ai framework.
Regarding the release, we are currently in mid-training and prioritizing a rigorous "safety-first" pipeline.
We are conducting extensive evaluations on model plasticity, red-teaming for prompt injection, and most importantly, stress-testing for malicious use cases.
We want to ensure the model is robust before it hits the wild.
The current roadmap includes:
- Completing the Knowledge Expansion phase.
- A comprehensive DPO (Direct Preference Optimization) pass to align the "Kurtis" persona and reasoning capabilities.
- Peer review and final validation.
A quick technical spoiler:
The base model pre-training is pure PyTorch and fully multi-GPU compatible. We are utilizing a Curriculum Learning strategy: starting with a small context length and gradually scaling up. This is paired with an enormous batch size and small data chunks.
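The curriculum described above can be sketched as a simple schedule that grows the context length stage by stage; the concrete lengths here are assumptions for illustration, since the post doesn't publish the real numbers.

```python
def context_schedule(stages=4, start_len=256, final_len=2048):
    """Yield (stage, context_length) pairs that double the context each
    stage, capped at final_len: the small-to-large curriculum strategy.
    Concrete lengths are hypothetical, not the actual training config."""
    length = start_len
    for stage in range(stages):
        yield stage, min(length, final_len)
        length *= 2

schedule = list(context_schedule())
# Each stage would then train on data chunks of the current length
# with a large batch size before advancing to the next stage.
```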
@maxxafits00 if you are on a budget, I suggest starting small. I unfortunately don't have enough compute to scale right now. To evaluate a pretraining or distillation framework (such as arcee-ai's distillkit) or a new model architecture, you can start from datasets such as TinyStories and move to FineWeb-EDU, Cosmopedia, etc. later.
Wait for the training and architecture to be stable and validated before moving to a bigger dataset/model. Also, 7-8B parameters is probably too big for small-scale pre-training experiments.
You should try to target 0.5B, max 3B, especially if you use consumer-grade hardware, or a single GPU for rent.
Interesting. Yes, as you noticed as well, a few billion tokens aren't enough. SmolLM2 360M was trained on 4 trillion tokens.
But I am not sure how to explain these results on piqa and sciq:
uv run lm_eval --model hf --model_args pretrained=models/Echo-DSRN-Small-Instruct-Kurtis,trust_remote_code=True,device_map="auto" --tasks hellaswag,winogrande,piqa,sciq --output_path ./results_final
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | 0.2927 | ±0.0045 |
| | | none | 0 | acc_norm | 0.3199 | ±0.0047 |
| piqa | 1 | none | 0 | acc | 0.6230 | ±0.0113 |
| | | none | 0 | acc_norm | 0.6202 | ±0.0113 |
| sciq | 1 | none | 0 | acc | 0.7380 | ±0.0139 |
| | | none | 0 | acc_norm | 0.6480 | ±0.0151 |
| winogrande | 1 | none | 0 | acc | 0.5020 | ±0.0141 |
I can share more details in this convo, but this is probably uncharted territory for a hybrid RNN with 4 attention heads.
Now available at https://huggingface.co/spaces/ethicalabs/Echo-DSRN-Small-Next-Word-Prediction ... on the shared CPU HF resources it runs slowly, but on my MacBook M4 and AMD Strix Halo it is blazing fast. The memory footprint is low. I am now expanding to 1B using Net2Net, and today I tested an SFT run (QLoRA, 4-bit, bf16) on consumer hardware with trl, with apparently no catastrophic forgetting.
10 years ago, getting an LSTM to output coherent English was a struggle.
10 years later, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating.
We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge.
The ambitious goal is to build a small instruct model with RAG and tool usage capabilities ( ethicalabs/Kurtis-EON1)
The Benchmarks (Size: 400M)
For a model this size (trained on <10B tokens), the specialized performance is surprising:
*SciQ*: 73.8% (this rivals billion-parameter models in pure fact retrieval).
*PIQA*: 62.3% (solid physical intuition for a sub-1B model).
The Reality Check:
HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and 10B tokens training.
We are hitting the "Reasoning Wall" which confirms we need to scale to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong. The model is convinced it is in a classroom ("In this course, we explore...").
The Instruct Model is not ready yet and we are currently using curriculum learning to test model plasticity.
Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries.
Call for Collaboration: I am looking for peer reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let's connect!
Training diary: ethicalabs/Kurtis-EON1
DataMuncher-Labs/UltraMath-Reasoning-Small