zmsBERT - Zero-Millisecond Security

Real-time AI DNS threat classification by doxx.net

zmsBERT is a fine-tuned BERT model that classifies DNS domain names into 11 threat categories in real time. It catches zero-day phishing, malware, DGA (domain generation algorithm) domains, and other threats that static blocklists miss, working from the domain name string alone with no network lookup required.

Files

The model requires the following files to run:

File            Size     Description
weights.bin     423 MB   Model weights (flat float32 binary)
config.json     1 KB     Model architecture config (layers, heads, hidden size, labels)
vocab.json      567 KB   BPE vocabulary (token to ID mapping)
merges.json     377 KB   BPE merge rules (31,173 pairs)
manifest.json   28 KB    Tensor layout manifest (name, shape, offset for each weight tensor)

All files are included in this repository. Download them to a single directory and point ZMS at it with -weights /path/to/dir.

Additionally, these optional data files improve classification accuracy:

File                     Description
domain_categories.json   Parent domain trust categories (1.6M+ domains mapped to hosting types)
spam_tlds.txt            Risky TLD list (437 TLDs from hagezi spam-tlds)

These are available in the ZMS repo.
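
As a rough illustration of how manifest.json and the flat weights.bin fit together, the sketch below reads one tensor by offset. The manifest schema used here (a JSON list of objects with "name", "shape", and a byte "offset") and the little-endian byte order are assumptions made for illustration, and load_tensor is a hypothetical helper; check manifest.json for the actual field names.

```python
import json
import math
import struct
from pathlib import Path


def load_tensor(model_dir, name):
    """Return (shape, values) for one tensor read from weights.bin.

    Assumed manifest layout (hypothetical): a JSON list of objects with
    "name", "shape" (list of ints) and "offset" (byte offset into weights.bin).
    Values are assumed to be little-endian float32.
    """
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / "manifest.json").read_text())
    entry = next((e for e in manifest if e["name"] == name), None)
    if entry is None:
        raise KeyError(f"tensor {name!r} not found in manifest")

    count = math.prod(entry["shape"])  # number of float32 values
    with open(model_dir / "weights.bin", "rb") as f:
        f.seek(entry["offset"])
        raw = f.read(count * 4)  # 4 bytes per float32
    if len(raw) != count * 4:
        raise ValueError(f"weights.bin is too short for tensor {name!r}")

    return entry["shape"], list(struct.unpack(f"<{count}f", raw))
```

Because the file is a flat array of floats, a loader like this needs no tensor library, which fits the no-PyTorch, no-ONNX design described in this README.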

How It Works

The Problem

Static DNS blocklists are reactive - a malicious domain must be discovered, reported, analyzed, and added to a list before it's blocked. The window between when an attacker registers a domain and when it appears on blocklists is the zero-day gap. zmsBERT closes this gap by classifying domains from their name alone.

The Insight

Attackers face an unsolvable naming problem. Malicious domains must either:

  • Deceive humans (phishing): secure-paypal-login.xyz, microsoft365-verify.club
  • Be algorithmically generated (DGA/C2): w10b8jin2uib3a6fl.shop, nexozerapexidexoviro.digital
  • Mimic legitimate patterns (typosquatting): staemcommuniity.com, m1cr0s0ft.com.ru

In all cases, the domain string carries signal that a language model can learn.

Three Context Signals

Each domain gets three context tags prepended before classification:

1. Hosting Provider (25 categories)

The model knows who hosts the domain, not just what it looks like. The same suspicious subdomain means different things on different infrastructure:

[CDN_ENTERPRISE] [TLD_SAFE] [GEO_TRUSTED] claim-150pro  -> benign (Akamai, enterprise CDN)
[CDN_FREE]       [TLD_SAFE] [GEO_NEUTRAL] claim-150pro  -> phishing (Cloudflare free tier)

Categories are split by actual abuse rates:

  • CDN_ENTERPRISE: Akamai International, Imperva, Edgecast (<3% abuse)
  • CDN_STANDARD: Fastly, Akamai Connected Cloud (~17% abuse)
  • CDN_FREE: Cloudflare (~27% abuse, free tier)
  • TECH_CURATED: Apple, Microsoft (<5% abuse)
  • TECH_CLOUD: Amazon AWS, Google Cloud (~17% abuse)
  • HOST_FREESITE: Wix, Squarespace, Vercel, Netlify (free tier, high abuse)
  • HOST_BUDGET: Hostinger, Namecheap, GoDaddy, OVH
  • Plus: ENTERPRISE_APP, CLOUD_PROVIDER, ECOMMERCE, SOCIAL, MEDIA, COMMS, FINANCE, SEARCH, CODE_HOSTING, GAMING, DYNDNS, and more

2. TLD Risk

  • TLD_SAFE: .com, .org, .net, etc.
  • TLD_RISKY: 437 spam TLDs (.xyz, .top, .club, .live, etc.)

3. Geographic Risk

Based on MaxMind GeoLite2 ASN lookup of the hosting IP:

  • GEO_HOSTILE: RU, CN, IR, KP, SY, BY
  • GEO_SKETCHY: CY, VG, SC, IS, MD, LV, HK, PA (bulletproof hosting havens)
  • GEO_MODERATE: BR, ID, TH, PK, BD, BG, MY, etc.
  • GEO_NEUTRAL: US, DE, NL, CA, SG, AU, SE, etc.
  • GEO_TRUSTED: JP, GB, FR, IE, KR, PL, FI, CH, etc.
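
The TLD and geographic tags above amount to simple lookups. The following is a minimal sketch that uses only the example values listed in this section; the real lists are longer (437 TLDs in spam_tlds.txt, and more countries than shown here), and tld_tag and geo_tag are hypothetical helper names, not part of ZMS.

```python
# Sample of risky TLDs named in this README; the full list lives in spam_tlds.txt.
TLD_RISKY_SAMPLE = frozenset({"xyz", "top", "club", "live"})

# Country codes exactly as listed above; each tier is truncated ("etc.") in the README.
GEO_TIERS = {
    "GEO_HOSTILE": {"RU", "CN", "IR", "KP", "SY", "BY"},
    "GEO_SKETCHY": {"CY", "VG", "SC", "IS", "MD", "LV", "HK", "PA"},
    "GEO_MODERATE": {"BR", "ID", "TH", "PK", "BD", "BG", "MY"},
    "GEO_NEUTRAL": {"US", "DE", "NL", "CA", "SG", "AU", "SE"},
    "GEO_TRUSTED": {"JP", "GB", "FR", "IE", "KR", "PL", "FI", "CH"},
}


def tld_tag(domain, risky_tlds=TLD_RISKY_SAMPLE):
    """Return TLD_RISKY if the last label of the domain is in the risky set."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return "TLD_RISKY" if tld in risky_tlds else "TLD_SAFE"


def geo_tag(country_code):
    """Map an ISO country code to a tier, or None if it is not in this sample."""
    code = country_code.upper()
    for tag, countries in GEO_TIERS.items():
        if code in countries:
            return tag
    return None  # how ZMS treats unlisted countries is not stated here
```

In this sketch the country code is supplied by the caller, since resolving a hosting IP to a country is outside its scope.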

Subdomain Isolation

When a known parent domain is found, it's stripped from the input and replaced with its trust category. The model learns subdomain patterns conditioned on the parent's context:

claim-150pro.firebaseapp.com  -> [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
helix-go-webview.uber.com     -> [ENTERPRISE_APP] [TLD_SAFE] [GEO_NEUTRAL] helix-go-webview

This prevents false positives on legitimate infrastructure subdomains (Apple courier servers, Microsoft SmartScreen, Zoom internal APIs) while still catching threats on free hosting platforms.
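
The stripping step can be sketched as a longest-suffix match against the parent domain table. This assumes domain_categories.json can be loaded as a plain mapping from parent domain to category, which may not match the real file layout; isolate_subdomain and build_model_input are hypothetical names, and the UNKNOWN fallback is borrowed from the parent_tag value shown in the Usage output.

```python
def isolate_subdomain(domain, parent_categories):
    """Split off the longest known parent domain.

    Returns (parent_tag, remainder). The remainder is whatever labels are
    left once the parent is stripped, or the full name if no parent matches.
    """
    labels = domain.rstrip(".").lower().split(".")
    # Try the longest suffix first, so "b.example.com" wins over "example.com".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in parent_categories:
            # The remainder is empty when the domain is itself a known parent.
            return parent_categories[candidate], ".".join(labels[:i])
    # Assumption: unmatched domains keep their full name and get UNKNOWN.
    return "UNKNOWN", ".".join(labels)


def build_model_input(domain, parent_categories, tld_risk, geo_risk):
    """Prepend the three context tags to the isolated subdomain."""
    parent_tag, remainder = isolate_subdomain(domain, parent_categories)
    return f"[{parent_tag}] [{tld_risk}] [{geo_risk}] {remainder}".rstrip()


# Two-entry stand-in for the 1.6M+ entry table.
categories = {"firebaseapp.com": "HOST_FREESITE", "uber.com": "ENTERPRISE_APP"}
print(build_model_input("claim-150pro.firebaseapp.com", categories, "TLD_SAFE", "GEO_NEUTRAL"))
# [HOST_FREESITE] [TLD_SAFE] [GEO_NEUTRAL] claim-150pro
```

Matching the longest suffix first matters when both a platform and one of its subdomains appear in the table, because the more specific entry should decide the tag.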

Categories

ID  Label     Description                               Examples
0   benign    Legitimate domains                        google.com, zoom.us
1   malware   Malware C2, distribution                  urlhaus, malware_filter sources
2   phishing  Phishing, credential theft, fake shops    phishing_filter, hagezi fake
3   ads       Advertising networks                      adguard, goodbyeads
4   mixed     Multi-category blocklist domains          stevenblack unified
5   trackers  Tracking, native telemetry                hagezi tif/pro/ultimate, native device telemetry
6   content   Gambling, adult, social media, fake news  Combined content categories
7   dga       Domain generation algorithm               hagezi dga7, campaign-deduplicated
8   nrd       Newly registered domains (past 7 days)    hagezi nrd7
9   piracy    Piracy-related domains                    hagezi anti.piracy
10  bypass    DoH/VPN/proxy bypass                      hagezi doh-vpn-proxy-bypass
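
To show how the 11 IDs above relate to a label and confidence score, here is a small sketch that turns raw classifier outputs (logits) into a result. It assumes the reported confidence is the softmax probability of the top class, which this README does not state explicitly; decode is a hypothetical helper.

```python
import math

# Index in this list == category ID from the table above.
LABELS = [
    "benign", "malware", "phishing", "ads", "mixed", "trackers",
    "content", "dga", "nrd", "piracy", "bypass",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Pick the highest-probability category and report its probability."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "confidence": probs[best]}
```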

Architecture

  • Base model: DomURLs_BERT (110M parameters)
  • Classifier head: Dropout(0.1) -> Linear(768, 256) -> ReLU -> Dropout(0.1) -> Linear(256, 11)
  • Tokenizer: BPE with 31,173 merge rules + 36 special context tag tokens
  • Max sequence length: 128 tokens
  • Training data: 13.5M+ samples from 27+ blocklist sources, Tranco top-1M, live DNS traffic
  • Oversampling: Fortune 1000 domains (300x), Tranco top-10K (150x), whitelists (150x), synthetic infrastructure patterns (75x), real benign from live DNS (50x)
  • Weight format: Flat float32 binary (no PyTorch, no ONNX)
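
The classifier head is small enough to write out directly. The sketch below is a plain-Python forward pass for illustration only, not the ZMS Go implementation. It assumes weights are stored as [out_features][in_features] rows (the PyTorch convention) and that the head receives a single 768-dimensional vector from the encoder; neither detail is confirmed here.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight given as one row of inputs per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def classifier_head(pooled, w1, b1, w2, b2):
    """Linear(768, 256) -> ReLU -> Linear(256, 11).

    Dropout is the identity at inference time, so both Dropout(0.1)
    layers are omitted here.
    """
    hidden = relu(linear(pooled, w1, b1))
    return linear(hidden, w2, b2)  # 11 raw logits, one per category


def head_param_count(hidden=768, mid=256, classes=11):
    """Weights plus biases for the two linear layers."""
    return hidden * mid + mid + mid * classes + classes


print(head_param_count())  # 199691
```

At roughly 0.2M parameters the head is a tiny fraction of the 110M-parameter base model, so nearly all inference cost sits in the encoder.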

Performance

Metric                  Value
Model load time         249ms
First classification    30-50ms
Cached classification   <1 microsecond
CPU throughput          30 domains/sec
GPU throughput          4,585 domains/sec
Model size              423 MB
Binary size             ~10 MB (static Go binary)
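
The gap between first and cached classification is what makes a result cache worthwhile: at 30 domains/sec on CPU, each uncached classification costs about 33 ms. How ZMS implements its cache is not described here, so the sketch below only illustrates the general idea with an in-memory LRU cache; make_cached_classifier and the classify callable are hypothetical.

```python
from functools import lru_cache


def make_cached_classifier(classify, maxsize=100_000):
    """Wrap a slow classify(domain) callable in an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached(domain):
        return classify(domain)

    def lookup(domain):
        # Normalise case and trailing dot so "Example.COM." and
        # "example.com" share one cache entry.
        return cached(domain.rstrip(".").lower())

    lookup.cache_info = cached.cache_info  # expose hit/miss counters
    return lookup
```

DNS traffic is highly repetitive, so even a modest cache lets most queries skip the model entirely.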

Usage

This model is designed for use with the ZMS inference engine - a pure Go BERT implementation with no Python or ONNX dependencies:

# Download the model
zms -update-model

# Start the DNS classifier
zms -bind-ipv4 127.0.0.1 -listen 54

# Query via DNS TXT
dig @127.0.0.1 -p 54 TXT suspicious-domain.xyz +short
# {"label":"phishing","confidence":0.995,"parent_tag":"UNKNOWN","tld_risk":"TLD_RISKY"}
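
For scripting against the TXT interface, the answer needs to be decoded back into JSON. dig +short normally prints TXT data as one or more double-quoted chunks with inner quotes escaped, so the sketch below accepts both that form and the bare JSON shown in the comment above. The field names come from the example output; parse_txt_answer is a hypothetical helper.

```python
import json
import shlex


def parse_txt_answer(raw):
    """Decode a zms TXT answer, as printed by dig +short, into a dict."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Already bare JSON, as in the example comment above.
        return json.loads(raw)
    # dig prints TXT strings as "chunk" "chunk" ... with inner quotes
    # written as \" ; shlex undoes the quoting, then the chunks are joined.
    chunks = shlex.split(raw)
    return json.loads("".join(chunks))


answer = r'"{\"label\":\"phishing\",\"confidence\":0.995}"'
print(parse_txt_answer(answer)["label"])  # phishing
```

Joining chunks matters because a single TXT string is limited to 255 bytes, so longer answers arrive split into several quoted pieces.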

Zero-Day Catch Examples

99.8% malware   narr9-vector.aurorift.in.net            [FREE_HOSTING]
99.7% phishing  pub-a7aa109e9db04b97ba2fc89747a05209.r2.dev  [CLOUD_STORAGE]
99.7% phishing  reappeal-site-c9843io.vercel.app        [FREE_HOSTING]
99.6% malware   solflare-blocklist.moonshot.workers.dev  [FREE_HOSTING]
99.6% phishing  mintptojects211.vercel.app               [FREE_HOSTING]
99.5% malware   svc2base.absolutecontinuity.in.net       [FREE_HOSTING]
99.4% phishing  claim-nwomyboxpro.firebaseapp.com        [FREE_HOSTING]
99.4% phishing  smartwebcontractdapps.netlify.app         [FREE_HOSTING]
99.3% phishing  blocksdappsrectify.vercel.app             [FREE_HOSTING]
99.1% phishing  trustwalletsupport.vercel.app             [FREE_HOSTING]
98.5% phishing  claim-150pro.firebaseapp.com              [FREE_HOSTING]

Correctly classified as benign (no false positives on infrastructure):

99.6% benign    google.com                               [TECH_PLATFORM]
99.6% benign    microsoft.com                            [TECH_PLATFORM]
98.4% benign    apple.com                                [TECH_PLATFORM]
98.8% benign    zoom.us                                  [COMMS]
97.4% benign    statuspage.io                            [ENTERPRISE_APP]

Training Data Sources

  • Blocklists: 27+ sources including urlhaus, malware-filter, phishing-filter, adguard, goodbyeads, hagezi, stevenblack, and native telemetry lists
  • Benign: Tranco top-1M, hagezi whitelists, synthetic infrastructure patterns, real benign subdomains from live DNS traffic

License

This model is released under the MIT License with a commercial use restriction. Non-commercial use is freely permitted. Commercial use requires written permission from Barrett Lyon / Doxx Corp. Contact legal@doxx.net for licensing.

This model is a derivative work based on BERT (Apache 2.0, Google) and DomURLs_BERT (Abdelkader El Mahdaouy et al.).

Citation

zmsBERT: Zero-Millisecond Security DNS Classifier
doxx.net, 2026
https://huggingface.co/doxxnet/zmsBERT

doxx.net - Privacy without compromise
