zmsBERT - Zero-Millisecond Security

Real-time AI DNS threat classification by doxx.net

zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 categories (benign plus 10 threat and content types) in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm) domains, and other threats that static blocklists miss - using the domain name string alone, with no network lookup required.

Files

The model requires the following files to run:

File	Size	Description
`weights.bin`	423 MB	Model weights (flat float32 binary)
`config.json`	1 KB	Model architecture config (layers, heads, hidden size, labels)
`vocab.json`	567 KB	BPE vocabulary (token to ID mapping)
`merges.json`	377 KB	BPE merge rules (31,173 pairs)
`manifest.json`	28 KB	Tensor layout manifest (name, shape, offset for each weight tensor)

All files are included in this repository. Download them to a single directory and point ZMS at it with `-weights /path/to/dir`.
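
To show how `manifest.json` and `weights.bin` fit together, here is a minimal Python sketch that reads a single tensor. The manifest schema is an assumption: a JSON list of objects with `name`, `shape`, and a byte `offset`, with values stored as little-endian float32. The actual field names and byte order in the shipped files may differ, and `load_tensor` is a hypothetical helper, not part of ZMS.

```python
import json
import struct
from pathlib import Path


def load_tensor(model_dir, tensor_name):
    """Read one tensor from weights.bin using the layout in manifest.json.

    Assumed (hypothetical) manifest schema: a JSON list of objects with
    "name", "shape" (list of ints) and "offset" (byte offset into weights.bin).
    Values are assumed to be little-endian float32.
    """
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text())
    entry = next(e for e in manifest if e["name"] == tensor_name)

    # Number of float32 values is the product of the shape dimensions.
    count = 1
    for dim in entry["shape"]:
        count *= dim

    with open(model_dir / "weights.bin", "rb") as f:
        f.seek(entry["offset"])
        raw = f.read(4 * count)
    if len(raw) != 4 * count:
        raise ValueError(f"weights.bin is truncated for tensor {tensor_name!r}")

    values = struct.unpack(f"<{count}f", raw)
    return entry["shape"], list(values)
```

Because the file is a flat array with explicit offsets, a loader like this only needs to seek and read; no framework-specific deserializer is involved.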

Additionally, these optional data files improve classification accuracy:

File	Description
`domain_categories.json`	Parent domain trust categories (1.6M+ domains mapped to hosting types)
`spam_tlds.txt`	Risky TLD list (437 TLDs from hagezi spam-tlds)

These are available in the ZMS repo.

How It Works

The Problem

Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the zero-day gap. zmsBERT closes this gap by classifying domains from their name alone.

The Insight

Attackers face an unsolvable naming problem. A malicious domain must do one of three things:

Deceive humans (phishing): secure-paypal-login.xyz, microsoft365-verify.club
Be algorithmically generated (DGA/C2): w10b8jin2uib3a6fl.shop, nexozerapexidexoviro.digital
Mimic legitimate patterns (typosquatting): staemcommuniity.com, m1cr0s0ft.com.ru

In all cases, the domain string carries signal that a language model can learn.

Three Context Signals

Each domain gets three context tags prepended before classification:

1. Hosting Provider (25 categories)

The model knows who hosts the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:

[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro  -> benign (Akamai, enterprise CDN)
[CDN_FREE]       [TLD_SAFE] [GEO_NEUTRAL] claim-150pro  -> phishing (Cloudflare free tier)

Categories are split by actual abuse rates:

CDN_ENTERPRISE: Akamai International, Imperva, Edgecast (<3% abuse)
CDN_STANDARD: Fastly, Akamai Connected Cloud (~17% abuse)
CDN_FREE: Cloudflare (~27% abuse, free tier)
TECH_CURATED: Apple, Microsoft (<5% abuse)
TECH_CLOUD: Amazon AWS, Google Cloud (~17% abuse)
HOST_FREESITE: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
HOST_BUDGET: Hostinger, Namecheap, GoDaddy, OVH
Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more

2. TLD Risk

TLD_SAFE: .com, .org, .net, etc.
TLD_RISKY: 437 spam TLDs (.xyz, .top, .club, .live, etc.)
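
The TLD tag can be sketched as a set lookup on the last label of the domain. This assumes `spam_tlds.txt` holds one TLD per line (optionally with a leading dot or `*.` prefix, and `#` comments); the real file layout is not shown here, and both function names are hypothetical.

```python
def load_risky_tlds(lines):
    """Build a set of risky TLDs from an iterable of text lines.

    Assumption: one TLD per line, optionally prefixed with "." or "*.",
    with "#" starting a comment. The real spam_tlds.txt layout may differ.
    """
    tlds = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()
        entry = entry.lstrip("*.")  # drop any leading "*" and "." characters
        if entry:
            tlds.add(entry)
    return tlds


def tld_risk_tag(domain, risky_tlds):
    """Return TLD_RISKY if the last label of the domain is in the risky set."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in risky_tlds else "TLD_SAFE"
```

For example, with a risky set containing `xyz`, `tld_risk_tag("secure-paypal-login.xyz", risky)` yields `TLD_RISKY`, while `google.com` yields `TLD_SAFE`. This sketch only checks the final label, so multi-label suffixes would need extra handling.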

3. Geographic Risk

Based on MaxMind GeoLite2 ASN lookup of the hosting IP:

GEO_HOSTILE: RU, CN, IR, KP, SY, BY
GEO_SKETCHY: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
GEO_MODERATE: BR, ID, TH, PK, BD, BG, MY, etc.
GEO_NEUTRAL: US, DE, NL, CA, SG, AU, SE, etc.
GEO_TRUSTED: JP, GB, FR, IE, KR, PL, FI, CH, etc.
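
As a rough illustration, the country-to-tag mapping above could be expressed as a lookup table. Only the country codes listed in this section are included; the lists above end in "etc.", so the table is incomplete, and returning `None` for unlisted countries is an assumption made for this sketch. Resolving the hosting IP to a country is out of scope here.

```python
# Country codes copied from the lists in this section (incomplete by design).
GEO_TAGS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}

# Invert to a flat country -> tag dictionary for constant-time lookup.
COUNTRY_TO_TAG = {cc: tag for tag, codes in GEO_TAGS.items() for cc in codes}


def geo_risk_tag(country_code):
    """Map an ISO 3166-1 alpha-2 country code to a GEO_* tag.

    Returns None for countries not listed in this section; how the real
    pipeline treats them is not documented here.
    """
    return COUNTRY_TO_TAG.get(country_code.strip().upper())
```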

Subdomain Isolation

When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:

claim-150pro.firebaseapp.com  -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
helix-go-webview.uber.com     -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview

This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.
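
The stripping step can be sketched as a longest-suffix match against a table of known parents. Here `parent_categories` is a plain dict standing in for `domain_categories.json` (whose real schema is not shown), and the `[UNKNOWN]` fallback is an assumption borrowed from the `"parent_tag":"UNKNOWN"` value in the sample query output. `build_model_input` is a hypothetical name.

```python
def build_model_input(domain, parent_categories, tld_tag, geo_tag):
    """Strip the longest known parent domain and prepend the three context tags.

    parent_categories: dict mapping parent domain -> hosting category
    (a stand-in for domain_categories.json; the real schema may differ).
    """
    name = domain.rstrip(".").lower()
    labels = name.split(".")

    # i = 0 tries the full name, then progressively shorter suffixes, so the
    # longest known parent wins when several suffixes are in the table.
    for i in range(len(labels)):
        parent = ".".join(labels[i:])
        if parent in parent_categories:
            remainder = ".".join(labels[:i])
            tagged = f"[{parent_categories[parent]}] [{tld_tag}] [{geo_tag}] {remainder}"
            return tagged.rstrip()  # no trailing space when the domain is the parent itself

    # No known parent: keep the full name. The UNKNOWN tag is an assumption.
    return f"[UNKNOWN] [{tld_tag}] [{geo_tag}] {name}"
```

With `{"firebaseapp.com": "HOST_FREESITE"}` as the table, `claim-150pro.firebaseapp.com` becomes `[HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro`, matching the first example above.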

The classifier assigns each domain one of 11 labels:

ID	Label	Description	Examples or training sources
0	benign	Legitimate domains	google.com, zoom.us
1	malware	Malware C2, distribution	urlhaus, malware_filter sources
2	phishing	Phishing, credential theft, fake shops	phishing_filter, hagezi fake
3	ads	Advertising networks	adguard, goodbyeads
4	mixed	Multi-category blocklist domains	stevenblack unified
5	trackers	Tracking, native telemetry	hagezi tif/pro/ultimate, native device telemetry
6	content	Gambling, adult, social media, fake news	Combined content categories
7	dga	Domain generation algorithm	hagezi dga7, campaign-deduplicated
8	nrd	Newly registered domains (past 7 days)	hagezi nrd7
9	piracy	Piracy-related domains	hagezi anti.piracy
10	bypass	DoH/VPN/proxy bypass	hagezi doh-vpn-proxy-bypass
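
The table above maps directly onto an index-to-label list. The sketch below turns 11 raw scores into a label and a confidence value using a softmax; that the reported confidence is a softmax probability is an assumption, and `top_label` is a hypothetical helper.

```python
import math

# Index in this list equals the ID column of the label table.
LABELS = [
    "benign", "malware", "phishing", "ads", "mixed", "trackers",
    "content", "dga", "nrd", "piracy", "bypass",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```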

Architecture

Base model: DomURLs_BERT (110M parameters)
Classifier head: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
Tokenizer: BPE with 31,173 merge rules + 36 special context tag tokens
Max sequence length: 128 tokens
Training data: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
Oversampling: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign from live DNS (50x)
Weight format: Flat float32 binary (no PyTorch, no ONNX)
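
The classifier head listed above can be sketched in plain Python to make the shapes concrete. Dropout is an identity at inference time, so it is omitted. Two details are assumptions: that the head consumes a single 768-dimensional pooled vector from the encoder, and that each weight matrix is stored as rows of shape (out, in).

```python
def linear(x, weight, bias):
    """Compute y = W x + b, where weight is a list of rows of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def classifier_head(pooled, w1, b1, w2, b2):
    """Linear(768, 256) -> ReLU -> Linear(256, 11), with dropout omitted.

    pooled: encoder output vector (assumed 768 floats)
    w1, b1: first layer, 256 rows of 768 weights and 256 biases
    w2, b2: second layer, 11 rows of 256 weights and 11 biases
    """
    hidden = relu(linear(pooled, w1, b1))
    return linear(hidden, w2, b2)


# Shape check with all-zero placeholder weights (not real model values).
pooled = [0.0] * 768
w1, b1 = [[0.0] * 768 for _ in range(256)], [0.0] * 256
w2, b2 = [[0.0] * 256 for _ in range(11)], [0.0] * 11
logits = classifier_head(pooled, w1, b1, w2, b2)
```

Counting weights and biases, the head has 768 x 256 + 256 + 256 x 11 + 11 = 199,691 parameters, a small fraction of the 110M-parameter base model.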

Performance

Metric	Value
Model load time	249 ms
First classification	30-50 ms
Cached classification	<1 microsecond
CPU throughput	30 domains/sec
GPU throughput	4,585 domains/sec
Model size	423 MB
Binary size	~10 MB (static Go binary)

Usage

This model is designed for use with the ZMS inference engine - a pure Go BERT implementation with no Python or ONNX dependencies:

# Download the model
zms -update-model

# Start the DNS classifier
zms -bind-ipv4 127.0.0.1 -listen 54

# Query via DNS TXT
dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
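
A client consuming these answers needs to decode the JSON carried in the TXT record. The sketch below parses one line of `dig +short` output. `dig` normally wraps TXT data in double quotes and backslash-escapes inner quotes, while the sample above is shown unquoted, so both forms are accepted. `parse_txt_answer` is a hypothetical helper, and answers split across several quoted strings are not handled.

```python
import json


def parse_txt_answer(answer):
    """Decode one line of `dig +short TXT` output into a dict.

    Accepts either bare JSON or a single double-quoted TXT string with
    backslash-escaped inner quotes.
    """
    text = answer.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        # Strip the outer quotes and undo dig's escaping of inner quotes.
        text = text[1:-1].replace('\\"', '"')
    return json.loads(text)


sample = '{"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}'
result = parse_txt_answer(sample)
```

After this runs, `result["label"]` is `"phishing"` and `result["confidence"]` is `0.995`.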

Zero-Day Catch Examples

99.8% malware   narr9-vector.aurorift.in.net            [FREE_HOSTING]
99.7% phishing  pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev  [CLOUD_STORAGE]
99.7% phishing  reappeal-site-c9843io.vercel.app        [FREE_HOSTING]
99.6% malware   solflare-blocklist.moonshot.workers.dev  [FREE_HOSTING]
99.6% phishing  mintptojects211.vercel.app               [FREE_HOSTING]
99.5% malware   svc2base.absolutecontinuity.in.net       [FREE_HOSTING]
99.4% phishing  claim-nwomyboxpro.firebaseapp.com        [FREE_HOSTING]
99.4% phishing  smartwebcontractdapps.netlify.app         [FREE_HOSTING]
99.3% phishing  blocksdappsrectify.vercel.app             [FREE_HOSTING]
99.1% phishing  trustwalletsupport.vercel.app             [FREE_HOSTING]
98.5% phishing  claim-150pro.firebaseapp.com              [FREE_HOSTING]

Correctly benign (no false positives on infrastructure):

99.6% benign    google.com                               [TECH_PLATFORM]
99.6% benign    microsoft.com                            [TECH_PLATFORM]
98.4% benign    apple.com                                [TECH_PLATFORM]
98.8% benign    zoom.us                                  [COMMS]
97.4% benign    statuspage.io                            [ENTERPRISE_APP]

Training Data Sources

Blocklists: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists
Benign: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic

License

This model is released under the MIT License with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.

This model is a derivative work based on BERT (Apache 2.0, Google) and DomURLs_BERT (Abdelkader El Mahdaouy et al.).

Citation

zmsBERT: Zero-Millisecond Security DNS Classifier
doxx.net, 2026
https://huggingface.co/doxxnet/zmsBERT

doxx.net - Privacy without compromise
