Residual Thinking Embeddings: How Language Models Transform Text Through Deliberative Generation
0.8B and I are going to be good friends.
I've managed to condense a prototype to a substantially smaller size, but it's not as accurate as the original because the generic topology is more challenging. I'm working it out though.
I've figured out many new formulas based on the last set of results, which enable more deterministic projection rather than requiring the learning process to be dispersed across so many different subsystems.
I've also managed to form a 5D deterministic projection scaffold that should let the entire structure be even smaller, assuming I can work out the edge cases.
It's considerably cheaper than expected to keep volume valid. This looks like a partial regression for now, but I can improve it a bit before heading back in the original direction. Hopefully it's worth the time spent on the potentially improved, sleeker structure.
The smaller one can handle more shapes per scene, considerably more, at a much higher complexity than voxel association. This has drawbacks though: these are essentially a gate set for now, and the gates aren't perfect. They CAN find the correct potential, but the subprocessing isn't enabled yet, meaning our little 400k-param set here is powerful, just in a different kind of way.
I've started making pushes to include the missing pieces, so the Colab will start to comply with the training regime and geovocab2 will no longer be required.
The majority of the geovocab2-specific formulas and factories will be represented directly in the vocabulary directory, optimized to a better state than the originals. They will include both NumPy and Torch synthesis, as well as NumPy and Torch optimizations for worker creation and transforms.
With this I will include the more robust shape factory from the original and expand it to include deformation perturbation. This will be a learned behavior of the model, allowing the deformation of shapes to be directly aligned and trained in bulk along with multiple overlapping shapes, multiple sectorized shapes, sub-shapes, deviant shapes, and everything related directly to shape pooling, rather than using a hard-set spectrum of shapes projected into space.
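As a rough illustration of the intent (placeholder names and a toy MLP, not the actual geovocab2 factory): a factory emits deterministic base shapes as point sets, and a small learned module applies the deformation perturbation, so the deformation itself is trained rather than hard-set.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of "learned deformation perturbation" over a shape factory.
# Shape names, sizes, and the MLP below are illustrative placeholders.

def base_shape(name: str, n_points: int = 256) -> torch.Tensor:
    """Deterministic base shapes as (n_points, 3) point clouds."""
    t = torch.linspace(0, 2 * torch.pi, n_points)
    if name == "circle":
        return torch.stack([t.cos(), t.sin(), torch.zeros_like(t)], dim=-1)
    if name == "helix":
        return torch.stack([t.cos(), t.sin(), t / (2 * torch.pi)], dim=-1)
    raise ValueError(f"unknown shape: {name}")

class LearnedDeformation(nn.Module):
    """Predicts a small per-point offset, so deformation is a trainable behavior."""
    def __init__(self, scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.GELU(), nn.Linear(64, 3))
        self.scale = scale

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return points + self.scale * self.net(points)

deform = LearnedDeformation()
scene = torch.cat([deform(base_shape("circle")), deform(base_shape("helix"))])
print(scene.shape)  # torch.Size([512, 3]) — two overlapping deformed shapes in one scene
```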
These patches will essentially be alignment sectorization in their first state for the first 8-piece prototype of the chunk, since I can train that on the currently available G4 issued by Colab.
This is a required element for increasing the learner to full definition capacity, and a required hurdle before the patchwork can be expanded to a full chunk. The experiments leading to this point are promising, and as I snap pieces together from the successful experiments, the system will begin to converge exactly where the expectation rests.
After that, it's just a matter of expanding upward to the necessary architecture and introducing the weights in sequential linear interpolative sequencing, which is something transformers are uniquely capable of handling with minimal calculation after the pre-calculations.
So far so good.
I'll be running multiple alucard fusion ablations on the patchwork before defaulting to the dual-stream slit-light superposition crystal topology architecture that I've proven works for the smaller patchmaker. My hope is that I can approximate the behavior in a more concise way without requiring the full spread of geometric globalization, but there are no guarantees yet. This could save a huge chunk of training time if it works, and alucard's internal step scheduling system will have a place. This may cut a huge percentage of the overall follow-up training, potentially allowing training on fewer machines. The topology architecture may be fully required, so hopefully I can just avoid it all through some clever math and be done with it.
Avoiding the full multi-tower Beatrix oscillation system would be absolutely fantastic, but I think the predictions afforded by that system may be fully required, and the oscillation system will likely need to be tuned into a new form for this use case as well.
I do apologize for the nasty code, but Claude tends to be very difficult to get to cooperate once you drive the code too far outside Claude's context window. Much of my organization has helped, but not enough; Claude DOES afford rapid prototyping capacity, though. The current repo itself houses a mostly incomplete representation of the outcome, but I want to make sure at least SOME of the formulas align before I start pushing further iterations.
Fair organization can be found in the router section of the geofractal router, the hierarchy spectrum of the geovocab, and the entire system of the pytorch-wide-compiler. They are ugly though, and evolved in their own way; I just let Claude work sometimes, because otherwise it would take 4x as long to organize things in a reusable fashion.
MOST of the code compiles, but I believe there are some .item() edge cases in the current code that cause graph breaks. I'm working on it.
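For anyone hitting the same thing: .item() converts a tensor to a Python scalar, which torch.compile can't trace, so it splits the graph at that point. A minimal illustration of the general pattern (not code from the repo):

```python
import torch

# .item() pulls a value back into Python, which torch.compile cannot trace,
# so the graph is split there (or, with fullgraph=True, compilation errors out).

def scale_by_norm_breaks(x: torch.Tensor) -> torch.Tensor:
    s = x.norm().item()            # graph break: tensor -> Python float
    return x / (s + 1e-6)

def scale_by_norm_fused(x: torch.Tensor) -> torch.Tensor:
    s = x.norm()                   # stays a tensor, stays inside the compiled graph
    return x / (s + 1e-6)

compiled = torch.compile(scale_by_norm_fused, fullgraph=True)
print(compiled(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```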
I'll HOPEFULLY be pushing a fairly organized update to the geolip repo this afternoon with a more complete interpretation of the subsystems, but the formulas aren't perfect yet. I have a couple of prototype patchmakers in training, but they have some bugs. I'll try to keep them organized.
I need to clean up this sewer, honestly; the code got nasty. It's fast more often than not. It might be worth porting all classes directly to the geolip repo, which would centralize things for AI development rather than leaving everything out in divergent systems.
In the gaming industry we call this "YOUR PRODUCER IS CONFUSED AND MAD BECAUSE TECH DEBT"
It's coming together, but the repo is pretty outdated.
Its effective computation is associative recall. Outputs are selected from memory rather than produced through internal transformation. A reasoning system must evolve internal state before emitting an answer:
$$\frac{dx}{dt} = F(x, t)$$
Without state evolution, responses remain recombinations of retrieved content.
The Hamiltonian is measured but not used to guide cognition. True reasoning requires optimization across trajectories:
$$H = T + V$$
Energy must shape evolution, not remain a passive metric.
Criticality regulation is also missing. Biological systems maintain coherence near a critical branching ratio:
$$\frac{d\sigma}{dt} = \alpha (\sigma_c - \sigma)$$
Without push–pull stabilization, activity fragments or saturates. Research suggests roughly 60 effective connections per neuron are needed for coherent oscillation. Below that, the system behaves as isolated retrieval islands.
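For reference, the ODE above is a simple linear relaxation: with constant α, the branching ratio decays exponentially toward its critical value. This is just the standard closed-form solution, not a system-specific result:

$$\sigma(t) = \sigma_c + \left(\sigma(0) - \sigma_c\right) e^{-\alpha t}$$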
Current metrics show partial integration. Phi < 1 and entropy remains elevated. The system integrates information but does not dynamically transform it.
To move from retrieval to reasoning, the architecture needs an internal multi-step simulation loop, energy minimization across trajectories, enforced coherence thresholds, and higher-order interactions beyond pairwise attention. The required shift is architectural, not just scaling. Answers must emerge from internal dynamical evolution rather than direct memory selection.
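To make the required shift concrete, here is a toy sketch of "answer from internal dynamics" rather than direct retrieval. Everything in it (the step count, the quadratic energy, the norm-based stand-in for a branching ratio) is my own illustrative assumption, not the measured system or a proposed architecture:

```python
import torch

torch.manual_seed(0)

dim = 64
W = torch.randn(dim, dim) / dim**0.5       # stand-in for learned recurrent weights
readout = torch.randn(dim, 10) / dim**0.5  # stand-in for an output head

def energy(x: torch.Tensor) -> torch.Tensor:
    # Hamiltonian-style scalar: interaction term plus a quadratic potential.
    return 0.5 * (x @ W @ x) + 0.5 * (x * x).sum()

def reason(x0: torch.Tensor, steps: int = 16, lr: float = 0.05,
           sigma_c: float = 1.0, alpha: float = 0.1) -> torch.Tensor:
    """Evolve internal state for several steps before emitting an answer."""
    x = x0.clone()
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        e = energy(x)
        (grad,) = torch.autograd.grad(e, x)
        x = x - lr * grad                        # dx/dt = F(x, t): descend the energy
        sigma = x.norm() / x0.norm()             # crude stand-in for a branching ratio
        x = x * (1 + alpha * (sigma_c - sigma))  # push-pull regulation toward sigma_c
    return x.detach() @ readout                  # answer read out only after evolution

answer = reason(torch.randn(dim))
print(answer.shape)  # torch.Size([10])
```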
Is it ensemble or hierarchical?
I actually think I found a series of solutions for this exact problem using a Claude skill without waiting for Opus 4.7.
It doesn't change the baseline though, so beware.
The Slop Code Problem: A Field Guide to Working With Your LLM Coding Companion
https://github.com/AbstractEyes/glip-autoencoder
To tinker with the topology directly you can play with it here, though I admit it's imperfect in this form - it's quite the tinker toy to see the effects of patching.
https://claude.ai/public/artifacts/697287e4-fa18-4753-8b57-904d5e2022ed
This is the repo that will contain the next experimental stage, which is based entirely on the research and structural boundaries applied by said research. It'll be a little rigid while I get Claude set up.
In order to train these layered topological response patchworks directly, you must install and use the geovocab2, geofractal, and wide_compiler repos.
This is because of wide_compiler's wide_linear high-speed ensemble processing, geovocab2's factory structure with multiple formulas (including highly efficient designs meant for kernel compilation), and a series of reusable utilities in geofractal, including some of the more complex losses and the hard-to-tune gate structures surrounding them.
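For context on why a wide-linear layer matters for ensembles: the general trick is to run many small linear layers as one batched matmul instead of a Python loop. This is only my sketch of that general idea with made-up sizes, not wide_compiler's actual wide_linear API:

```python
import torch

E, B, d_in, d_out = 8, 32, 16, 24          # ensemble size, batch, dims (illustrative)
weight = torch.randn(E, d_in, d_out) * d_in**-0.5
bias = torch.zeros(E, 1, d_out)

x = torch.randn(E, B, d_in)                # one input slice per ensemble member
y = torch.baddbmm(bias, x, weight)         # (E, B, d_out) in a single batched matmul

# Equivalent, slower reference: loop over members one at a time.
y_ref = torch.stack([x[e] @ weight[e] + bias[e, 0] for e in range(E)])
assert torch.allclose(y, y_ref, atol=1e-5)
```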
Many of the underlying formulas are outlined here:
AbstractPhil/geometric-experiment-history
Utilization and training USING the pretrained or untrained geolip patchwork will be as simple as loading the model in PyTorch, and will not require external dependencies beyond the geolip package, NumPy, or PyTorch, depending on the task. It will come packaged with recommended losses, but I encourage experimentation because I simply cannot cover every spectrum.
More details to come as development progresses. The system is coming together, and the usable state of the autoencoder will be ready within a couple of weeks. The entire system is built for convenience and reusability, so the structure will mirror autoencoder systems that currently exist, with a few tweaks here and there for important elements, so the interface will be familiar to those who use it.
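Based on that description, usage should end up looking roughly like the sketch below. Every name here (the import path, class, checkpoint filename, and input shape) is a placeholder of mine, not the released geolip interface:

```python
import torch

# Hypothetical usage sketch; geolip, GeoLipAutoencoder, and the checkpoint
# name are placeholders standing in for whatever the released package exposes.
from geolip import GeoLipAutoencoder

model = GeoLipAutoencoder()                                    # untrained patchwork
state = torch.load("geolip_patchwork.pt", map_location="cpu")  # placeholder checkpoint
model.load_state_dict(state)
model.eval()

with torch.no_grad():
    z = model.encode(torch.randn(1, 3, 224, 224))  # placeholder input shape
    x_hat = model.decode(z)
```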
Train something that withstands the loss. I've trained a ton of theoretical models based on similar topics that collapse to noise given almost zero scrutiny. Some even got pretty capable, and yet they fell short of the marks.
Simply put, if the model survives the math, it's sustainably capable of surviving bulk averages. If not, back to the drawing board.
Geometric Structural Vocabulary: Scaling Deterministic Mathematics Into Universal Model Conditioning
They aren't releasing their weights, so other studios have to do it the slow way. This seems like a huge waste of computation, and responding to it in any way other than a utilitarian one is just going to make the problem worse.
The reasonable solution would be to simply distribute curated distillations to prevent this sort of problem and save global power consumption.
Distillations with expert expectations are very difficult to fine-tune in a reasonable fashion. They often take more compute to even reach a similar state than the original did.
Distill, snap the experts off, and boom: you have yourself a distilled computation that companies can utilize on their own hardware, and then people will stop trying to reverse-engineer and bulk-extract information from your hardware. They'll be using their own internal hardware in a different and more cost-effective fashion.
Make them good, reusable, and expandable within reason, and this problem will evolve into distillation research. By that point the next generation of big models will be out, and the next series of distillations can be made, obsoleting the previous ones.
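For readers unfamiliar with the mechanics, the core of a plain distillation objective looks something like the standard soft-target loss below. This is the textbook formulation, not a description of any particular lab's pipeline; the temperature and mixing weight are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft-target distillation: match softened teacher probabilities,
    blended with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale so gradients don't shrink with T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 4, 10 classes.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```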