Archaea-74M

Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained using BF16 mixed precision.

This release is an intermediate checkpoint: approximately 1.23 billion tokens trained out of a planned 1.6-billion-token pretraining run (18,800 of 25,000 steps).


Model Card

Attribute           Value
Model ID            GODELEV/Archaea-74M
Parameters          ~74 million
Architecture        Decoder-only Transformer (LLaMA-style)
Attention           Grouped Query Attention (GQA)
Context Length      1024 tokens
Tokenizer           GPT-2
Training Precision  BF16
Framework           PyTorch + Transformers
License             MIT

Architecture

Transformer Configuration

Parameter          Value
Hidden Size        512
Intermediate Size  1408
Layers             8
Attention Heads    8
KV Heads           2
GQA Ratio          4:1 (query heads per KV head)
Activation         SiLU
Normalization      RMSNorm
Context Length     1024 tokens
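As a rough cross-check of the ~74M figure, the parameter count can be reconstructed from the configuration above. This sketch assumes a standard LLaMA-style layout (bias-free linear layers, a gated SiLU MLP with three projections, two RMSNorms per layer), the GPT-2 vocabulary size of 50,257, and untied input and output embeddings. None of these details are stated explicitly in this card, so treat the result as an estimate.

```python
# Estimate the parameter count from the published configuration.
# Assumptions (not confirmed by the card): vocab of 50,257 (GPT-2),
# untied embeddings, no bias terms, gated MLP (gate/up/down projections).
hidden, intermediate, layers = 512, 1408, 8
heads, kv_heads, vocab = 8, 2, 50_257
head_dim = hidden // heads  # 64

attn = (
    hidden * heads * head_dim            # query projection
    + 2 * hidden * kv_heads * head_dim   # key and value projections (shared KV heads)
    + heads * head_dim * hidden          # output projection
)
mlp = 3 * hidden * intermediate          # gate, up, and down projections
norms = 2 * hidden                       # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

embeddings = vocab * hidden              # token embedding matrix
lm_head = vocab * hidden                 # separate output projection (untied)
total = layers * per_layer + embeddings + lm_head + hidden  # + final RMSNorm

print(f"per layer: {per_layer:,}")       # per layer: 2,819,072
print(f"total:     {total:,}")           # total:     74,016,256
```

Under these assumptions the total comes to about 74.0M, with roughly 51M of it in the two embedding matrices. With tied embeddings the same configuration would come to about 48M, which is why the untied assumption is made here.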

The model uses Grouped Query Attention: each of the 2 key/value heads is shared by 4 query heads, so the KV cache is a quarter of the size it would be with standard multi-head attention at the same head count.
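To make the KV-cache saving concrete, the following sketch compares the cache size for one full-length sequence under the configured 2 KV heads against a hypothetical multi-head variant with 8 KV heads. The helper name `kv_cache_bytes` is illustrative, and 2 bytes per element assumes the cache is held in BF16 or FP16.

```python
def kv_cache_bytes(tokens, layers=8, kv_heads=2, head_dim=64, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

gqa = kv_cache_bytes(1024)               # 2 KV heads, as configured
mha = kv_cache_bytes(1024, kv_heads=8)   # hypothetical: one KV head per query head

print(gqa // 2**20, "MiB with GQA")      # 4 MiB with GQA
print(mha // 2**20, "MiB with MHA")      # 16 MiB with MHA
```

If the cache is kept in FP32 instead, both figures double, but the 4:1 ratio between them is unchanged.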


Training

Dataset

Archaea-74M was pretrained on GODELEV/BetterDataset-2M, a multi-source corpus composed of:

  • General web text
  • Conversational content
  • Knowledge-focused material
  • Educational content
  • Instruction-like examples
  • Technical and programming text

The complete corpus contains approximately 1.6 billion tokens.

Training Progress

Metric           Value
Planned Tokens   ~1.6B
Tokens Trained   ~1.23B
Completion       ~77%
Planned Steps    25,000
Completed Steps  18,800
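The token counts follow from the step counts if each optimizer step processes the effective batch of 64 sequences at the full 1024-token length. That is an assumption: the card does not say whether every sequence is packed to full length.

```python
# Tokens processed = steps x sequences per step x tokens per sequence.
# Assumes every sequence is packed to the full 1024 tokens.
batch_size, seq_len = 64, 1024
tokens_per_step = batch_size * seq_len   # 65,536

completed = 18_800 * tokens_per_step
planned = 25_000 * tokens_per_step

print(f"{completed:,}")                  # 1,232,076,800
print(f"{planned:,}")                    # 1,638,400,000
print(f"{completed / planned:.1%}")      # 75.2%
```

The step-based ratio is 75.2%; dividing the rounded token figures (1.23B / 1.6B) gives the ~77% listed in the table.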

Optimization

Parameter             Value
Optimizer             AdamW
Scheduler             OneCycleLR
Peak Learning Rate    6e-4
Weight Decay          0.1
Gradient Clipping     1.0
Sequence Length       1024 tokens
Effective Batch Size  64
Precision             BF16
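The card names OneCycleLR but does not list its arguments. The sketch below approximates the shape of a cosine one-cycle schedule in plain Python, assuming PyTorch's documented defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) and 25,000 total steps. The function name `one_cycle_lr` is illustrative, and the values actually used in training may differ.

```python
import math

def one_cycle_lr(step, total_steps=25_000, max_lr=6e-4,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Approximate one-cycle learning rate with cosine warmup and decay."""
    initial_lr = max_lr / div_factor          # starting learning rate
    min_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_end = pct_start * total_steps      # step at which the peak is reached

    def cos_anneal(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    return cos_anneal(max_lr, min_lr,
                      (step - warmup_end) / (total_steps - warmup_end))

print(f"{one_cycle_lr(0):.2e}")        # 2.40e-05
print(f"{one_cycle_lr(7_500):.2e}")    # 6.00e-04
print(f"{one_cycle_lr(25_000):.2e}")   # 2.40e-09
```

Under these assumed defaults, the released checkpoint at step 18,800 sits in the decay phase, well after the peak at step 7,500.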

Training Statistics

Metric           Value
Initial Loss     10.9223
Final Loss       2.9488
Best Loss        2.8071
Final Perplexity 19.08
Best Perplexity  16.56
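The perplexity values are consistent with the loss values if the loss is the mean per-token cross-entropy in nats, in which case perplexity is simply exp(loss). The short check below also compares the initial loss with the loss of a uniform guess over the vocabulary, assuming the standard GPT-2 vocabulary size of 50,257.

```python
import math

final_loss, best_loss = 2.9488, 2.8071

# Perplexity is the exponential of the mean cross-entropy loss (in nats).
print(f"{math.exp(final_loss):.2f}")   # 19.08
print(f"{math.exp(best_loss):.2f}")    # 16.56

# Loss of a uniform guess over the GPT-2 vocabulary (assumed 50,257 tokens).
uniform_loss = math.log(50_257)
print(f"{uniform_loss:.4f}")           # 10.8249
```

The initial loss of 10.9223 is close to this uniform-guess value, as expected for a freshly initialized model.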

Training Loss Curve

[Figure: Archaea74M_Training_Loss_Curve.png]

Learning Rate Schedule

[Figure: Archaea74M_Learning_Rate_Schedule.png]


Evaluation

Evaluated using EleutherAI LM Evaluation Harness.

Benchmark Results

All scores are zero-shot.

Benchmark       Metric           Score
HellaSwag       acc_norm         27.31%
PIQA            acc_norm         58.54%
WinoGrande      acc              51.54%
BoolQ           acc              56.33%
ARC-Easy        acc_norm         39.06%
ARC-Challenge   acc_norm         22.70%
OpenBookQA      acc_norm         26.00%
CommonsenseQA   acc              19.66%
LAMBADA         acc              18.01%
BLiMP           acc              74.91%
MMLU            acc              25.07%
SciQ            acc_norm         57.70%
COPA            acc              61.00%
RACE            acc              24.78%
SWAG            acc_norm         41.98%
TruthfulQA MC2  acc              46.46%
WikiText-2      word perplexity  68.06 (lower is better)
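For context, the multiple-choice scores can be compared with the accuracy of random guessing. The sketch below pairs a subset of the scores above with an assumed number of answer options per task. These chance levels are assumptions based on each benchmark's usual format (ARC is nominally four-way, with a few items that differ), not figures reported by the harness.

```python
# (score in %, assumed number of answer options) for multiple-choice tasks.
# Option counts are assumptions based on each benchmark's usual format.
scores = {
    "HellaSwag": (27.31, 4), "PIQA": (58.54, 2), "WinoGrande": (51.54, 2),
    "BoolQ": (56.33, 2), "ARC-Easy": (39.06, 4), "ARC-Challenge": (22.70, 4),
    "OpenBookQA": (26.00, 4), "CommonsenseQA": (19.66, 5), "BLiMP": (74.91, 2),
    "MMLU": (25.07, 4), "SciQ": (57.70, 4), "COPA": (61.00, 2),
    "RACE": (24.78, 4), "SWAG": (41.98, 4),
}

# Margin over chance, in percentage points.
margins = {name: score - 100 / n for name, (score, n) in scores.items()}

for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {margin:+6.2f}")
```

Under these assumptions, ARC-Challenge, CommonsenseQA, MMLU, and RACE land at or slightly below chance, while SciQ, BLiMP, SWAG, and ARC-Easy are clearly above it.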

Evaluation Notes

Two tasks failed in earlier evaluation runs and are not included in the results above:

  • SocialIQA: dataset loading incompatibility.
  • Arithmetic benchmark: task identifier changes in recent lm-eval versions.

Sample Generations

The following examples were generated directly from the pretrained base model.

Example 1

Input

Once upon a time

Output

Once upon a time

In the days after, I was told that I could not make a living. So I said to myself,
“I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
or something and do it with you.”

The doctor asked me how I felt about this. He said he did. But I said,
“You can't have a baby.”

Example 2

Input

The horizon looked

Output

the horizon looked like a big black hole. The sun was shining and the wind was blowing.

The sun was shining with the sun, but it was still there. The stars were on top of each other.

The sun was falling as a result of the weather. The sun was going up.

Example 3

Input

what is the future of ai

Output

what is the future of Ai?

I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.

It doesn't matter if it doesn't take away your wandering in a positive direction.

The answer is simple: I'll leave you with a feeling of uncertainty and optimism.

These outputs illustrate the model's ability to continue prompts coherently while also demonstrating typical limitations of small-scale pretrained language models, including repetition, topic drift, and inconsistent factual reasoning.


Usage

Installation

pip install torch transformers accelerate

The accelerate package is needed because the examples below pass device_map="auto".

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in half precision on GPU and full precision on CPU.
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

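prompt = "The future of artificial intelligence"

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,  # discourages the repetition seen in the samples
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizer defines no pad token
    )

# The decoded text includes the prompt followed by the continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))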

Repository Structure

Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Training_Loss_Curve.png
├── Archaea74M_Learning_Rate_Schedule.png
└── README.md

Limitations

Archaea-74M is a base pretrained model and has not undergone:

  • Instruction tuning
  • RLHF
  • Preference optimization
  • Safety alignment

Known limitations:

  • Hallucinations and factual inaccuracies
  • Limited reasoning due to model scale
  • Sensitivity to prompt phrasing
  • Fixed 1024-token context window
  • Not suitable for high-stakes applications
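Because of the fixed 1024-token window, the prompt and the generated continuation must fit in 1024 tokens together. One way to handle long prompts is to keep only the most recent tokens, as in the sketch below. The helper `fit_to_context` is hypothetical and works on a plain list of token IDs; keeping the tail rather than the head is a design choice that suits continuation-style prompts.

```python
def fit_to_context(token_ids, max_new_tokens=200, context_length=1024):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_length - max_new_tokens   # tokens left for the prompt
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return token_ids[-budget:]

prompt_ids = list(range(1500))     # stand-in for real tokenizer output
trimmed = fit_to_context(prompt_ids)

print(len(trimmed))                # 824
print(trimmed[0], trimmed[-1])     # 676 1499
```

Prompts already shorter than the budget are returned unchanged.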

Future Work

  • Instruction tuning
  • Expanded benchmark coverage
  • Longer context lengths
  • Improved data quality and curriculum design

Citation

@misc{archaea74m,
  title={Archaea-74M},
  author={Akshit Kumar},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/GODELEV/Archaea-74M}
}