zmsBERT - Zero-Millisecond Security
Real-time AI DNS threat classification by doxx.net
zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 threat categories in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm), and other threats that static blocklists miss - from the domain name string alone, with no network lookup required.
Files
The model requires the following files to run:
| File | Size | Description |
|---|---|---|
| weights.bin | 423 MB | Model weights (flat float32 binary) |
| config.json | 1 KB | Model architecture config (layers, heads, hidden size, labels) |
| vocab.json | 567 KB | BPE vocabulary (token to ID mapping) |
| merges.json | 377 KB | BPE merge rules (31,173 pairs) |
| manifest.json | 28 KB | Tensor layout manifest (name, shape, offset for each weight tensor) |
All files are included in this repository. Download them to a single directory and point ZMS at it with -weights /path/to/dir.
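Because the weights are a flat float32 file described by a manifest, a single tensor can be read without any ML framework. The following is a minimal sketch, not the ZMS loader: it assumes that manifest.json is a list of objects with "name", "shape", and "offset" keys, that offsets are counted in bytes, and that values are little-endian. The helper name load_tensor is hypothetical.

```python
import json
import struct
from math import prod
from pathlib import Path


def load_tensor(model_dir, name):
    """Read one tensor from weights.bin using its manifest.json entry."""
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text())
    # Assumed layout: a list of {"name", "shape", "offset"} objects,
    # with "offset" counted in bytes from the start of weights.bin.
    entry = next(e for e in manifest if e["name"] == name)
    count = prod(entry["shape"])
    with open(model_dir / "weights.bin", "rb") as f:
        f.seek(entry["offset"])
        raw = f.read(4 * count)  # float32 is 4 bytes per value
    # Assumed byte order: little-endian.
    return entry["shape"], list(struct.unpack(f"<{count}f", raw))
```

If the real manifest uses different key names or element offsets instead of byte offsets, only the lookup and the seek need to change.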
Additionally, these optional data files improve classification accuracy:
| File | Description |
|---|---|
| domain_categories.json | Parent domain trust categories (1.6M+ domains mapped to hosting types) |
| spam_tlds.txt | Risky TLD list (437 TLDs from hagezi spam-tlds) |
These are available in the ZMS repo.
How It Works
The Problem
Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the zero-day gap. zmsBERT closes this gap by classifying domains from their name alone.
The Insight
Attackers face an unsolvable naming problem. Malicious domains must either:
- Deceive humans (phishing): secure-paypal-login.xyz, microsoft365-verify.club
- Be algorithmically generated (DGA/C2): w10b8jin2uib3a6fl.shop, nexozerapexidexoviro.digital
- Mimic legitimate patterns (typosquatting): staemcommuniity.com, m1cr0s0ft.com.ru
In all cases, the domain string carries signal that a language model can learn.
Three Context Signals
Each domain gets three context tags prepended before classification:
1. Hosting Provider (25 categories)
The model knows who hosts the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:
[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro -> benign (Akamai, enterprise CDN)
[CDN_FREE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro -> phishing (Cloudflare free tier)
Categories are split by actual abuse rates:
- CDN_ENTERPRISE: Akamai International, Imperva, Edgecast (<3% abuse)
- CDN_STANDARD: Fastly, Akamai Connected Cloud (~17% abuse)
- CDN_FREE: Cloudflare (~27% abuse, free tier)
- TECH_CURATED: Apple, Microsoft (<5% abuse)
- TECH_CLOUD: Amazon AWS, Google Cloud (~17% abuse)
- HOST_FREESITE: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
- HOST_BUDGET: Hostinger, Namecheap, GoDaddy, OVH
- Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more
2. TLD Risk
- TLD_SAFE: .com, .org, .net, etc.
- TLD_RISKY: 437 spam TLDs (.xyz, .top, .club, .live, etc.)
3. Geographic Risk
Based on MaxMind GeoLite2 ASN lookup of the hosting IP:
- GEO_HOSTILE: RU, CN, IR, KP, SY, BY
- GEO_SKETCHY: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
- GEO_MODERATE: BR, ID, TH, PK, BD, BG, MY, etc.
- GEO_NEUTRAL: US, DE, NL, CA, SG, AU, SE, etc.
- GEO_TRUSTED: JP, GB, FR, IE, KR, PL, FI, CH, etc.
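The TLD and geographic tags are plain lookups. The sketch below uses only the countries and TLDs named above; the full 437-entry TLD list and the countries hidden behind "etc." are not reproduced. Two assumptions are illustrative and not confirmed by ZMS: any TLD not on the risky list is treated as TLD_SAFE, and a country that is not listed returns None. The function names are hypothetical.

```python
GEO_TAGS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}

# Only the four TLDs named in this document, not the full spam_tlds.txt list.
RISKY_TLDS = {"xyz", "top", "club", "live"}


def geo_tag(country_code):
    """Map an ISO country code to a GEO_* tag, or None if it is not listed."""
    code = country_code.upper()
    for tag, countries in GEO_TAGS.items():
        if code in countries:
            return tag
    return None


def tld_tag(domain):
    """Return TLD_RISKY if the last label is on the risky list (assumed rule)."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in RISKY_TLDS else "TLD_SAFE"
```

In practice RISKY_TLDS would be populated from spam_tlds.txt and the country code would come from the GeoLite2 lookup of the hosting IP.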
Subdomain Isolation
When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:
claim-150pro.firebaseapp.com -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
helix-go-webview.uber.com -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview
This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.
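The stripping step can be sketched as a longest-suffix match against the parent-domain table. This is an illustration under stated assumptions, not the ZMS implementation: PARENT_CATEGORIES is a two-entry stand-in for domain_categories.json (whose real format is not shown here), the fallback tag UNKNOWN is borrowed from the parent_tag value in the sample query output, and the function names are hypothetical.

```python
# Hypothetical stand-in for domain_categories.json, using the two examples above.
PARENT_CATEGORIES = {
    "firebaseapp.com": "HOST_FREESITE",
    "uber.com": "ENTERPRISE_APP",
}


def isolate_subdomain(domain, parents=PARENT_CATEGORIES):
    """Return (parent_tag, remaining_labels) for a domain name."""
    labels = domain.lower().rstrip(".").split(".")
    # Try the longest candidate parent first, then shorter suffixes.
    for i in range(1, len(labels)):
        parent = ".".join(labels[i:])
        if parent in parents:
            return parents[parent], ".".join(labels[:i])
    # Assumed fallback when no parent is known.
    return "UNKNOWN", ".".join(labels)


def build_input(domain, tld_risk, geo_risk, parents=PARENT_CATEGORIES):
    """Prepend the three context tags to the isolated subdomain."""
    parent_tag, rest = isolate_subdomain(domain, parents)
    return f"[{parent_tag}] [{tld_risk}] [{geo_risk}] {rest}"
```

For example, `build_input("claim-150pro.firebaseapp.com", "TLD_SAFE", "GEO_NEUTRAL")` yields the same string as the first mapping shown above.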
Categories
| ID | Label | Description | Examples |
|---|---|---|---|
| 0 | benign | Legitimate domains | google.com, zoom.us |
| 1 | malware | Malware C2, distribution | urlhaus, malware_filter sources |
| 2 | phishing | Phishing, credential theft, fake shops | phishing_filter, hagezi fake |
| 3 | ads | Advertising networks | adguard, goodbyeads |
| 4 | mixed | Multi-category blocklist domains | stevenblack unified |
| 5 | trackers | Tracking, native telemetry | hagezi tif/pro/ultimate, native device telemetry |
| 6 | content | Gambling, adult, social media, fake news | Combined content categories |
| 7 | dga | Domain generation algorithm | hagezi dga7, campaign-deduplicated |
| 8 | nrd | Newly registered domains (past 7 days) | hagezi nrd7 |
| 9 | piracy | Piracy-related domains | hagezi anti.piracy |
| 10 | bypass | DoH/VPN/proxy bypass | hagezi doh-vpn-proxy-bypass |
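The ID column maps directly onto the 11 outputs of the classifier. As a small sketch, assuming the reported confidence is the softmax probability of the highest-scoring class (this is not confirmed by the source), a logits vector can be turned into a label like this:

```python
import math

# Index in this list equals the ID column of the table above.
LABELS = ["benign", "malware", "phishing", "ads", "mixed", "trackers",
          "content", "dga", "nrd", "piracy", "bypass"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```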
Architecture
- Base model: DomURLs_BERT (110M parameters)
- Classifier head: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
- Tokenizer: BPE with 31,173 merge rules + 36 special context tag tokens
- Max sequence length: 128 tokens
- Training data: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
- Oversampling: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign from live DNS (50x)
- Weight format: Flat float32 binary (no PyTorch, no ONNX)
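The classifier head is small enough to write out in full. The sketch below is plain Python for illustration only. It assumes the head receives a single 768-dimensional pooled vector and that weights are stored in [out][in] order, neither of which is stated in the source. Dropout is omitted because it is the identity at inference time.

```python
def linear(x, weight, bias):
    """Compute W x + b, with weight stored as rows of [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def head_forward(pooled, w1, b1, w2, b2):
    """Linear(768, 256) -> ReLU -> Linear(256, 11), dropout omitted."""
    return linear(relu(linear(pooled, w1, b1)), w2, b2)


def head_param_count(hidden=768, mid=256, n_labels=11):
    """Weights plus biases for the two linear layers."""
    return (hidden * mid + mid) + (mid * n_labels + n_labels)
```

With the dimensions listed above, `head_param_count()` gives 199,691 parameters, a small fraction of the 110M-parameter base model.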
Performance
| Metric | Value |
|---|---|
| Model load time | 249ms |
| First classification | 30-50ms |
| Cached classification | <1 microsecond |
| CPU throughput | 30 domains/sec |
| GPU throughput | 4,585 domains/sec |
| Model size | 423 MB |
| Binary size | ~10 MB (static Go binary) |
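The gap between a first classification (30-50ms) and a cached one (<1 microsecond) is what an in-memory result cache would produce. How ZMS caches results is not described here. The sketch below only illustrates the general idea with Python's functools.lru_cache; the wrapper name and the maxsize value are hypothetical.

```python
from functools import lru_cache


def make_cached_classifier(classify, maxsize=100_000):
    """Wrap a slow classify(domain) function with an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached(name):
        return classify(name)

    def lookup(domain):
        # DNS names are case-insensitive, so normalize before the cache
        # lookup and let spelling variants share one entry.
        return cached(domain.lower().rstrip("."))

    return lookup
```

Repeat queries for the same name then skip the model entirely, which matters at the CPU throughput of 30 domains/sec.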
Usage
This model is designed for use with the ZMS inference engine - a pure Go BERT implementation with no Python or ONNX dependencies:
# Download the model
zms -update-model
# Start the DNS classifier
zms -bind-ipv4 127.0.0.1 -listen 54
# Query via DNS TXT
dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
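A client can decode the TXT answer with any JSON parser. The sketch below parses a record shaped like the sample output above and applies a simple block decision. It is an illustration only: the 0.9 threshold and the rule of blocking every non-benign label are hypothetical policy choices, and the unwrapping of one quoted string assumes that dig prints the TXT data as a single JSON-compatible quoted string.

```python
import json


def parse_zms_txt(txt):
    """Decode a ZMS TXT answer into a dict."""
    txt = txt.strip()
    if txt.startswith('"'):
        # Assumed: one quoted string with backslash-escaped inner quotes.
        txt = json.loads(txt)
    return json.loads(txt)


def should_block(result, threshold=0.9):
    """Hypothetical policy: block any confident non-benign verdict."""
    return result["label"] != "benign" and result["confidence"] >= threshold
```

A deployment that only cares about threats might instead block a subset of labels such as malware, phishing, and dga.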
Zero-Day Catch Examples
99.8% malware narr9-vector.aurorift.in.net [FREE_HOSTING]
99.7% phishing pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev [CLOUD_STORAGE]
99.7% phishing reappeal-site-c9843io.vercel.app [FREE_HOSTING]
99.6% malware solflare-blocklist.moonshot.workers.dev [FREE_HOSTING]
99.6% phishing mintptojects211.vercel.app [FREE_HOSTING]
99.5% malware svc2base.absolutecontinuity.in.net [FREE_HOSTING]
99.4% phishing claim-nwomyboxpro.firebaseapp.com [FREE_HOSTING]
99.4% phishing smartwebcontractdapps.netlify.app [FREE_HOSTING]
99.3% phishing blocksdappsrectify.vercel.app [FREE_HOSTING]
99.1% phishing trustwalletsupport.vercel.app [FREE_HOSTING]
98.5% phishing claim-150pro.firebaseapp.com [FREE_HOSTING]
Correctly benign (no false positives on infrastructure):
99.6% benign google.com [TECH_PLATFORM]
99.6% benign microsoft.com [TECH_PLATFORM]
98.4% benign apple.com [TECH_PLATFORM]
98.8% benign zoom.us [COMMS]
97.4% benign statuspage.io [ENTERPRISE_APP]
Training Data Sources
- Blocklists: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists
- Benign: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic
License
This model is released under the MIT License with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.
This model is a derivative work based on BERT (Apache 2.0, Google) and DomURLs_BERT (Abdelkader Mekaoui).
Citation
zmsBERT: Zero-Millisecond Security DNS Classifier
doxx.net, 2026
https://huggingface.co/doxxnet/zmsBERT
doxx.net - Privacy without compromise