Archaea-74M

Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained using BF16 mixed precision.

This release is an intermediate checkpoint: the model has been trained on approximately 1.23 billion of the planned 1.6 billion pretraining tokens (about 77% of the run). The remaining tokens are left for a future release.

Model Card

Attribute	Value
Model ID	GODELEV/Archaea-74M
Parameters	~74 Million
Architecture	Decoder-only Transformer (LLaMA-style)
Attention	Grouped Query Attention (GQA)
Context Length	1024
Tokenizer	GPT-2
Training Precision	BF16
Framework	PyTorch + Transformers
License	MIT

Architecture

Transformer Configuration

Parameter	Value
Hidden Size	512
Intermediate Size	1408
Layers	8
Attention Heads	8
KV Heads	2
GQA Ratio	4:1
Activation	SiLU
Normalization	RMSNorm
Context Length	1024
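
As a sanity check on the stated ~74M parameters, the configuration above can be turned into a rough count. This is a sketch under assumptions that the card does not state: a 50,257-token GPT-2 vocabulary, no bias terms, and untied input and output embeddings.

```python
# Rough parameter count for a LLaMA-style decoder using the table values.
# Assumptions (not stated in the card): GPT-2 vocabulary of 50,257 tokens,
# no bias terms, and untied input/output embedding matrices.
hidden_size = 512
intermediate_size = 1408
num_layers = 8
num_heads = 8
num_kv_heads = 2
vocab_size = 50_257  # assumed: standard GPT-2 vocabulary

head_dim = hidden_size // num_heads  # 64
kv_dim = num_kv_heads * head_dim     # 128

# Attention: Q and O are hidden x hidden; K and V are hidden x kv_dim under GQA.
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# Gated MLP with SiLU: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms
embeddings = vocab_size * hidden_size  # input embedding
lm_head = vocab_size * hidden_size     # assumed untied output projection
final_norm = hidden_size

total = embeddings + num_layers * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the total is about 74.0M, in line with the stated size. With tied embeddings the same configuration would come to roughly 48.3M, so the estimate suggests the embeddings are untied, though the card does not confirm this.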

The model implements Grouped Query Attention: 8 query heads share 2 key/value heads, so the KV cache is one quarter the size it would be under standard multi-head attention with the same head dimension.
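
The KV-cache saving can be illustrated with a short calculation. This is a sketch, assuming a batch size of 1 and 2 bytes per cached value (BF16 or FP16); the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the key/value cache for one sequence, in bytes."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

head_dim = 512 // 8  # hidden size / attention heads

# Cache at the full 1024-token context with 2 KV heads (GQA, as in this model).
gqa = kv_cache_bytes(1024, num_layers=8, num_kv_heads=2, head_dim=head_dim)
# Same model shape with 8 KV heads (plain multi-head attention), for comparison.
mha = kv_cache_bytes(1024, num_layers=8, num_kv_heads=8, head_dim=head_dim)

print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, ratio: {mha // gqa}x")
```

With these assumptions the cache is 4 MiB per sequence at full context, against 16 MiB for the multi-head equivalent.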

Training

Dataset

Archaea-74M was pretrained on GODELEV/BetterDataset-2M, a multi-source corpus composed of:

General web text
Conversational content
Knowledge-focused material
Educational content
Instruction-like examples
Technical and programming text

The complete corpus contains approximately 1.6 billion tokens.

Training Progress

Metric	Value
Planned Tokens	~1.6B
Tokens Trained	~1.23B
Completion	~77%
Planned Steps	25,000
Completed Steps	18,800
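
The token counts can be cross-checked against the step counts. The sketch below assumes that the effective batch size of 64 listed under Optimization counts full 1024-token sequences per optimizer step; the card does not say so explicitly.

```python
seq_len = 1024        # sequence length from the Optimization table
effective_batch = 64  # assumed: sequences per optimizer step

tokens_per_step = seq_len * effective_batch
completed_tokens = 18_800 * tokens_per_step
planned_tokens = 25_000 * tokens_per_step

print(f"tokens per step:  {tokens_per_step:,}")
print(f"completed tokens: {completed_tokens:,}")
print(f"planned tokens:   {planned_tokens:,}")
print(f"completion:       {completed_tokens / planned_tokens:.1%}")
```

Under this assumption the step counts reproduce the ~1.23B and ~1.6B figures. Completion measured in steps is about 75%; the ~77% in the table appears to come from dividing the rounded token counts (1.23B / 1.6B).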

Optimization

Parameter	Value
Optimizer	AdamW
Scheduler	OneCycleLR
Peak Learning Rate	6e-4
Weight Decay	0.1
Gradient Clipping	1.0
Sequence Length	1024
Effective Batch Size	64
Precision	BF16
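
For readers who want to reproduce a similar setup, the settings above map onto standard PyTorch components roughly as follows. This is a minimal sketch, not the actual training script: the tiny `nn.Linear` model and the placeholder loss are hypothetical stand-ins, the OneCycleLR warmup fraction and annealing shape are left at PyTorch defaults (the card does not specify them), and BF16 autocast is omitted for brevity.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real model; only the optimizer wiring matters here.
model = nn.Linear(16, 16)

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=6e-4,         # peak learning rate from the table
    total_steps=25_000,  # planned steps from the Training Progress table
)  # other OneCycleLR settings are PyTorch defaults (an assumption)

for step in range(3):  # a few dummy steps for illustration
    x = torch.randn(4, 16)
    loss = model(x).pow(2).mean()  # placeholder loss, not a language-modeling loss
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to 1.0, as listed in the table.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per optimizer step

print(scheduler.get_last_lr())
```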

Training Statistics

Metric	Value
Initial Loss	10.9223
Final Loss	2.9488
Best Loss	2.8071
Final Perplexity	19.08
Best Perplexity	16.56
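
The perplexity figures follow from the losses as the exponential of the cross-entropy (in nats). The short check below also compares the initial loss with the loss of a uniform prediction over the vocabulary, assuming a 50,257-token GPT-2 vocabulary (an assumption, not stated in the card).

```python
import math

final_loss = 2.9488
best_loss = 2.8071
initial_loss = 10.9223

# Perplexity is exp(cross-entropy loss) when the loss is measured in nats.
final_ppl = math.exp(final_loss)
best_ppl = math.exp(best_loss)

# Loss of a uniform guess over the vocabulary (assumed size: 50,257).
uniform_loss = math.log(50_257)

print(f"final perplexity: {final_ppl:.2f}")
print(f"best perplexity:  {best_ppl:.2f}")
print(f"uniform-guess loss: {uniform_loss:.2f} vs initial loss {initial_loss}")
```

The initial loss sits close to the uniform-guess value, which is what one would expect from a randomly initialized model.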

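Training Loss Curve

[Figure: training loss curve; see Archaea74M_Training_Loss_Curve.png in the repository]

Learning Rate Schedule

[Figure: learning rate schedule; see Archaea74M_Learning_Rate_Schedule.png in the repository]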

Evaluation

The model was evaluated with the EleutherAI LM Evaluation Harness (lm-eval).

Benchmark Results

All scores are zero-shot.

Benchmark	Metric	Score
HellaSwag	acc_norm	27.31%
PIQA	acc_norm	58.54%
WinoGrande	acc	51.54%
BoolQ	acc	56.33%
ARC-Easy	acc_norm	39.06%
ARC-Challenge	acc_norm	22.70%
OpenBookQA	acc_norm	26.00%
CommonsenseQA	acc	19.66%
LAMBADA	acc	18.01%
BLiMP	acc	74.91%
MMLU	acc	25.07%
SciQ	acc_norm	57.70%
COPA	acc	61.00%
RACE	acc	24.78%
SWAG	acc_norm	41.98%
TruthfulQA MC2	acc	46.46%
WikiText-2	Word Perplexity	68.06
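
To put the multiple-choice scores in context, it can help to compare them with random guessing. The sketch below pairs scores from the table with a nominal number of answer options per task; these option counts are assumptions based on the usual task formats, not values from the card, and tasks without a fixed option count (LAMBADA, TruthfulQA MC2, WikiText-2) are left out.

```python
# (score in percent, assumed number of answer options) per benchmark.
results = {
    "HellaSwag": (27.31, 4),
    "PIQA": (58.54, 2),
    "WinoGrande": (51.54, 2),
    "BoolQ": (56.33, 2),
    "ARC-Easy": (39.06, 4),
    "ARC-Challenge": (22.70, 4),
    "OpenBookQA": (26.00, 4),
    "CommonsenseQA": (19.66, 5),
    "BLiMP": (74.91, 2),
    "MMLU": (25.07, 4),
    "SciQ": (57.70, 4),
    "COPA": (61.00, 2),
    "RACE": (24.78, 4),
    "SWAG": (41.98, 4),
}

# Margin over uniform random guessing, in percentage points.
margins = {}
for name, (score, n_options) in results.items():
    chance = 100.0 / n_options
    margins[name] = score - chance
    print(f"{name:14s} score {score:6.2f}  chance {chance:5.1f}  margin {margins[name]:+6.2f}")
```

Read this way, the margins separate tasks where the model is clearly above guessing from those where it is at or near chance.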

Evaluation Notes

The following tasks failed in earlier evaluation runs and are not reported above:

SocialIQA: dataset loading incompatibility.
Arithmetic benchmark: task identifier changes in recent lm-eval versions.

Sample Generations

The following examples were generated directly from the pretrained base model.

Example 1

Input

Once upon a time

Output

Once upon a time

In the days after, I was told that I could not make a living. So I said to myself,
“I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
or something and do it with you.”

The doctor asked me how I felt about this. He said he did. But I said,
“You can't have a baby.”

Example 2

Input

The horizon looked

Output

the horizon looked like a big black hole. The sun was shining and the wind was blowing.

The sun was shining with the sun, but it was still there. The stars were on top of each other.

The sun was falling as a result of the weather. The sun was going up.

Example 3

Input

what is the future of ai

Output

what is the future of Ai?

I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.

It doesn't matter if it doesn't take away your wandering in a positive direction.

The answer is simple: I'll leave you with a feeling of uncertainty and optimism.

These outputs illustrate the model's ability to continue prompts coherently while also demonstrating typical limitations of small-scale pretrained language models, including repetition, topic drift, and inconsistent factual reasoning.

Usage

Installation

pip install torch transformers accelerate

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use half precision on GPU to save memory; fall back to float32 on CPU.
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

Text Generation

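from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The future of artificial intelligence"

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens; no gradients are needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,  # discourages the repetition seen in small models
        pad_token_id=tokenizer.eos_token_id  # the GPT-2 tokenizer has no pad token
    )

# The decoded text includes the prompt followed by the continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))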

Repository Structure

Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Training_Loss_Curve.png
├── Archaea74M_Learning_Rate_Schedule.png
└── README.md

Limitations

Archaea-74M is a base pretrained model and has not undergone:

Instruction tuning
RLHF
Preference optimization
Safety alignment

Known limitations:

Hallucinations and factual inaccuracies
Limited reasoning due to model scale
Sensitivity to prompt phrasing
Fixed 1024-token context window
Not suitable for high-stakes applications
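
Because of the fixed 1024-token window, the prompt and the generated tokens together must fit within 1024 positions. One way to handle long prompts is sketched below; the helper `fit_prompt` is hypothetical, and keeping the most recent tokens (rather than the earliest) is a design choice, not something the card prescribes.

```python
def fit_prompt(token_ids, max_new_tokens, context_length=1024):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    # Negative slicing keeps the last `budget` tokens; shorter prompts pass through.
    return token_ids[-budget:]

prompt_ids = list(range(1500))  # stand-in for a long list of tokenizer output ids
trimmed = fit_prompt(prompt_ids, max_new_tokens=200)
print(len(trimmed))  # number of prompt tokens kept
```

With `max_new_tokens=200`, as in the generation example, this leaves room for 824 prompt tokens.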

Future Work

Instruction tuning
Expanded benchmark coverage
Longer context lengths
Improved data quality and curriculum design

Citation

@misc{archaea74m,
  title={Archaea-74M},
  author={Akshit Kumar},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/GODELEV/Archaea-74M}
}
