Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features. e4d8561 verified rtferraz committed 4 days ago
Fix model loading: use from_pretrained() instead of torch.load() for safetensors format 165b138 verified rtferraz committed 4 days ago
Add 03_ecommerce_finetune.ipynb - next-purchase prediction with JointFusion, LightGBM baseline comparison 857ec9a verified rtferraz committed 4 days ago
Add e-commerce pre-training report - successful demo, behavioral clusters found, future improvements noted 2b3e3af verified rtferraz committed 4 days ago
Update 02_ecommerce notebook: add HF login, memory-free cell, subsample option for <64GB RAM machines 2410b7e verified rtferraz committed 4 days ago
CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences a9c4a62 verified rtferraz committed 4 days ago
Add 02_ecommerce_pretrain.ipynb - REES46 e-commerce pre-training with sequential entropy check, wandb, push to hub d60868a verified rtferraz committed 8 days ago
Add finance pre-training report - honest analysis of results and lessons learned 709a7e2 verified rtferraz committed 8 days ago
Add .gitignore - Python, Jupyter, training artifacts, IDE files 9211898 verified rtferraz committed 8 days ago
Fix notebook: total_mem → total_memory, add hub_model_id push, add wandb logging support 65ecf7e verified rtferraz committed 8 days ago
Add 01_finance_pretrain.ipynb - Phase 3.1 notebook for pre-training on 5M Nigerian financial transactions 2c3ddfa verified rtferraz committed 9 days ago
Phase 3.0: Pipeline validation demo on mindweave/bank-transactions-us - ALL 10 CHECKS PASSED 6e5b80d verified rtferraz committed 9 days ago
Add ADR-002: Dataset selection for Phase 3 demos - research findings, rationale, phased plan 756d197 verified rtferraz committed 9 days ago
Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API 7aac458 verified rtferraz committed 9 days ago
Add fine-tuning test suite - 15 tests covering dataset, batching, forward/backward, Trainer smoke, multiclass abab711 verified rtferraz committed 9 days ago
Add finetune.py - finetune_domain_model (HF Trainer Pattern A, auto tabular_features passthrough) 46a6d37 verified rtferraz committed 9 days ago
Phase 2D: Fine-tuning pipeline - DomainFinetuneDataset, finetune_domain_model, 139 total tests passing 256963c verified rtferraz committed 9 days ago
Update README v0.3.0 - add usage example, update roadmap status, add implementation report link f580186 verified rtferraz committed 10 days ago
Add Phase 2A-2C implementation report - technical decisions, architecture summary, test results 6c4ad4d verified rtferraz committed 10 days ago
Add training test suite - 19 tests covering data pipeline, packing, collation, integration, Trainer smoke test 345d9e3 verified rtferraz committed 10 days ago
Add pretrain.py - pretrain_domain_model with HF Trainer, cosine schedule, DataCollatorForLanguageModeling 6ccb9e6 verified rtferraz committed 10 days ago
Add data_pipeline.py - tokenize_user_sequences, pack_sequences, prepare_clm_dataset 1dfd4e2 verified rtferraz committed 10 days ago
Phase 2C: Pre-training pipeline - data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing 28118c7 verified rtferraz committed 10 days ago
Add model test suite - 33 tests covering config, model, PLR, DCNv2, joint fusion, integration ab8a8b6 verified rtferraz committed 10 days ago
Add DCNv2 + JointFusionModel (nuFormer-style Transformer + tabular fusion) e881ea3 verified rtferraz committed 10 days ago
Add DomainTransformerForCausalLM - GPT-style NoPE model with SDPA attention, weight tying, HF Trainer compatible 0dec8e4 verified rtferraz committed 10 days ago
Add DomainTransformerConfig with presets (24M/85M/330M) 15fbfea verified rtferraz committed 10 days ago
Phase 2B: Model architecture - DomainTransformerForCausalLM (NoPE, GPT-style), PLR embeddings, DCNv2 + JointFusion, 105 passing tests 2f5969e verified rtferraz committed 10 days ago
Add comprehensive test suite - 72 passing tests covering all components 8efa945 verified rtferraz committed 10 days ago
Add predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE) c00ac2c verified rtferraz committed 10 days ago
Add domain_tokenizer.py - DomainTokenizerBuilder (core assembler, HF integration) 818a2e9 verified rtferraz committed 10 days ago
Add field_tokenizers.py - Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical tokenizers 511f3aa verified rtferraz committed 10 days ago
Add schema.py - DomainSchema, FieldSpec, FieldType definitions 1a9dad0 verified rtferraz committed 10 days ago
Phase 2A: Core tokenizer library - schema, field tokenizers, composite builder, predefined schemas, 72 passing tests 0c1ca58 verified rtferraz committed 10 days ago
Update README: add ADR reference, update documentation table and repo structure a239d6e verified rtferraz committed 10 days ago
Add ADR-001: Implementation framework decision with detailed roadmap 25a1093 verified rtferraz committed 10 days ago
Update README with Nubank case study and expanded repo structure e30a14d verified rtferraz committed 10 days ago
Add Nubank nuFormer reverse-engineering analysis - full pipeline reconstruction 51149fa verified rtferraz committed 10 days ago
Add comprehensive research report on domain-specific tokenization be86e60 verified rtferraz committed 10 days ago
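
Note on the label-leakage fix (e4d8561): the commit message describes a per-user temporal split, where the first 70% of events form the model input and the label records whether a purchase occurs in the remaining 30%. The sketch below is one possible reading of that description, not the repository's code. The function name `temporal_split`, the `(timestamp, event_type)` tuple layout, and the `"purchase"` string are all assumptions made for illustration.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, event_type); hypothetical layout


def temporal_split(events: List[Event], input_pct: int = 70) -> Tuple[List[Event], int]:
    """Split one user's events by time into (history, label).

    history: the earliest `input_pct` percent of events (model input).
    label:   1 if any event in the later portion is a purchase, else 0.
    """
    ordered = sorted(events, key=lambda e: e[0])   # oldest first
    cut = len(ordered) * input_pct // 100          # integer math avoids float rounding
    history, future = ordered[:cut], ordered[cut:]
    label = int(any(kind == "purchase" for _, kind in future))
    return history, label


# Ten events for one user; the only purchase falls in the last 30%.
events = [(t, "view") for t in range(1, 11)]
events[8] = (9, "purchase")
history, label = temporal_split(events)
```

Because the label depends only on events after the cut, any feature computed from `history` cannot see the outcome. Whole-sequence aggregates such as n_purchases or purchase_rate would see it, which is consistent with the commit removing them from the feature set.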