protonator-models
Minimal-dependency (torch + rdkit + numpy) D-MPNN model weights for
protonator. Each model is a 5-fold ensemble;
protonator returns the mean prediction with the across-fold standard deviation as a
calibrated uncertainty. Weights are fetched automatically at a pinned revision via
huggingface_hub.
| Folder | Endpoint | Accuracy |
|---|---|---|
| pka_mpnn/ | microscopic (per-site) aqueous pKa | scaffold-5CV RMSE ~1.30, random-5CV ~1.08; Enamine external RMSE 0.55 (R² 0.96) |
| logp_mpnn/ | octanol–water logP | 5-fold CV RMSE 0.77, MAE 0.50, R² 0.86 |
| logs_mpnn/ | aqueous logS (log₁₀ mol/L, ~298 K) | 5-fold CV RMSE 0.54, MAE 0.35, R² 0.92 |
Each folder holds fold_0.pt…fold_4.pt + config.json (per-fold output denormalization
and a featurizer-version contract validated at load).
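The ensemble aggregation and per-fold denormalization described above can be sketched as follows. This is a minimal illustration, not protonator's actual code: the function names, the `mean`/`std` keys, and the choice of population standard deviation are assumptions made for the example, and the real config.json layout may differ.

```python
import statistics

def denormalize(z, mean, std):
    """Map a normalized network output back to the target scale."""
    return z * std + mean

def ensemble_predict(fold_outputs, fold_stats):
    """Combine per-fold outputs into (mean prediction, across-fold spread).

    fold_outputs: one normalized output per fold.
    fold_stats:   one {"mean", "std"} dict per fold (hypothetical key names).
    """
    values = [
        denormalize(z, s["mean"], s["std"])
        for z, s in zip(fold_outputs, fold_stats)
    ]
    # Assumption: population standard deviation over the folds.
    return statistics.fmean(values), statistics.pstdev(values)

# Synthetic numbers, for illustration only.
fold_outputs = [0.0, 0.1, -0.1, 0.05, -0.05]
fold_stats = [{"mean": 7.0, "std": 2.0} for _ in range(5)]
prediction, uncertainty = ensemble_predict(fold_outputs, fold_stats)
```

Because each fold carries its own denormalization statistics, the outputs are mapped back to the target scale before averaging, so the spread is expressed in the endpoint's units.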
pka_mpnn — per-site pKa
Microscopic (per-ionization-site) aqueous pKa for drug-like small molecules, 2D-only (SMILES / molecular graph; no 3D conformers, no QM). Given a SMILES and an ionization-center (IC) atom, predicts that site's pKa with an ensemble uncertainty.
Architecture (per fold)
- depth-3 directed-bond D-MPNN, hidden 1024
- distance-conditioned IC-centric attention readout (attention+dist): a learned shortest-path-distance bias routes any substituent, at any topological distance, to the ionization center in O(1), so a shallow model is sensitive to remote substituent effects
- inductive descriptor at the IC (Taft σ_I / Swain–Lupton σ_F / Kier–Hall E-state, with no distance cutoff)
- dropout 0.15 + weight-decay; per-fold output denormalization; 5-fold ensemble
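The distance-conditioned readout listed above can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the real model uses learned projections and a learned bias table, while here the query is simply the IC atom's state, `dist_bias` is a hand-set list, and the function names are hypothetical.

```python
import math
from collections import deque

def shortest_path_distances(n_atoms, bonds, source):
    """BFS topological distance from `source` to every atom (-1 if unreachable)."""
    adjacency = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = [-1] * n_atoms
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ic_attention_readout(atom_states, bonds, ic_index, dist_bias):
    """Pool atom states into one vector centered on the ionization center.

    Each atom's attention logit is a scaled dot product with the IC state
    plus a bias looked up by its shortest-path distance to the IC.
    """
    dist = shortest_path_distances(len(atom_states), bonds, ic_index)
    query = atom_states[ic_index]
    scale = math.sqrt(len(query))
    last = len(dist_bias) - 1
    logits = []
    for state, d in zip(atom_states, dist):
        content = sum(q * k for q, k in zip(query, state)) / scale
        bucket = min(d, last) if d >= 0 else last  # clamp far/unreachable atoms
        logits.append(content + dist_bias[bucket])
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * state[j] for w, state in zip(weights, atom_states))
        for j in range(len(query))
    ]
    return pooled, weights

# Four-atom chain with the IC at atom 0; states are synthetic.
chain_bonds = [(0, 1), (1, 2), (2, 3)]
states = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
_, uniform_weights = ic_attention_readout(states, chain_bonds, 0, [0.0, 0.0, 0.0, 0.0])
_, remote_weights = ic_attention_readout(states, chain_bonds, 0, [0.0, 0.0, 0.0, 5.0])
```

The point of the design is visible in `remote_weights`: an atom three bonds away can dominate the readout in a single attention step, without needing three or more message-passing iterations to reach the IC.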
Benchmarks
Out-of-fold 5-fold cross-validation on a curated, residual-denoised combination of ChEMBL / iBonD / IUPAC experimental pKa (~17.6k per-site measurements):
| Split | RMSE | MAE |
|---|---|---|
| scaffold 5-fold CV | ~1.30 | ~0.95 |
| random 5-fold CV | ~1.08 | — |
Held-out external set (Enamine fluoro, 158 molecules, not used in training):
| Set | RMSE | MAE | R² |
|---|---|---|---|
| Enamine fluoro | 0.55 | 0.40 | 0.96 |
Remote-substituent sensitivity (the headline improvement over the prior PKaGIN model):
| Probe | this model | prior PKaGIN |
|---|---|---|
| Hammett ρ (para-benzoic series; target 1.00) | → 1.0 | 0.34 (95% CI [−0.00, 0.77]) |
| fluoro matched-pair Δ sign-accuracy | 1.00 | 0.33 |
The prior model is degenerate on remote substituents: it predicts near-identical pKa for a molecule and an analog whose substituent lies beyond its receptive field. This model fixes that while matching or exceeding the prior model's overall accuracy.
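The two probes in the table can be computed along the following lines. This is a sketch of the standard definitions, not the evaluation script behind the reported numbers: the Hammett relation pKa(H) − pKa(X) = ρ·σ is fit by ordinary least squares, the σ_para values are the commonly tabulated ones, and all pKa values and deltas below are synthetic.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def hammett_rho(sigma, pka, reference="H"):
    """Slope of pKa(reference) - pKa(X) against sigma(X)."""
    substituents = sorted(sigma)
    x = [sigma[s] for s in substituents]
    y = [pka[reference] - pka[s] for s in substituents]
    return ols_slope(x, y)

def sign_accuracy(predicted_deltas, experimental_deltas):
    """Fraction of matched pairs whose predicted shift has the right sign."""
    hits = sum(1 for p, e in zip(predicted_deltas, experimental_deltas) if p * e > 0)
    return hits / len(experimental_deltas)

# Commonly tabulated para Hammett constants.
SIGMA_PARA = {"H": 0.00, "CH3": -0.17, "OCH3": -0.27, "Cl": 0.23, "CN": 0.66, "NO2": 0.78}

# Synthetic predictions: an ideal model and one blind to the para substituent.
ideal = {s: 4.20 - sig for s, sig in SIGMA_PARA.items()}
degenerate = {s: 4.20 for s in SIGMA_PARA}

rho_ideal = hammett_rho(SIGMA_PARA, ideal)
rho_degenerate = hammett_rho(SIGMA_PARA, degenerate)
pair_accuracy = sign_accuracy([-0.5, 0.2, -1.0], [-0.7, -0.3, -0.4])
```

A model that cannot see the para substituent returns the same pKa for the whole series, so its fitted ρ collapses toward zero. That makes ρ a direct test of remote-substituent sensitivity.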
Required input standardization
The model was trained on neutral, desalted, largest-organic-fragment SMILES that were not
tautomer-canonicalized. protonator.predict_sites applies the matching standardization
(desalt + largest-fragment + neutralize, no tautomer canonicalization) before detecting
ionization centers, so charged species and salts are handled correctly. Do not bypass it for
arbitrary user input.
logp_mpnn — octanol/water logP
D-MPNN, 5-fold ensemble. 5-fold CV: RMSE 0.77, MAE 0.50, R² 0.86.
logs_mpnn — aqueous logS
Aqueous log solubility (log₁₀ mol/L, ~298 K); shares the D-MPNN trunk with logP, trained jointly. 5-fold CV: RMSE 0.54, MAE 0.35, R² 0.92.
Usage
protonator fetches these automatically (pinned revision). Manual load:
```python
from protonator.ml.models.pka_mpnn import PKaPredictor

pred = PKaPredictor(weights_dir="<pka_mpnn folder>", device="cpu")
sites = pred.predict_sites("[Na+].CC(=O)[O-]")  # auto-standardized -> Carboxylic Acid ~4.96
```
Accuracy figures are out-of-fold cross-validation on the experimental training data plus a held-out external set; they are not directly comparable across endpoints (different data and splits).
Citation
Isayev lab, protonator — https://github.com/isayevlab/protonator
solvation_mpnn (solvation free energy, dG_solv)
solvation_mpnn/ — solute-in-solvent solvation free energy (dG_solv, kcal/mol at 298.15 K).
Dual-encoder D-MPNN: separate solute and solvent encoders (hidden 2048, depth 6, 72-dim atom
features) feeding an FFN over both pooled vectors plus per-molecule RDKit SlogP_VSA descriptors
(4120 -> 1024 -> 1024 -> 1); 5-fold ensemble. Self-contained (torch + rdkit + numpy only).
| Metric (5-fold CV, out-of-fold, 21,214 solute/solvent pairs) | dG_solv |
|---|---|
| RMSE | 0.95 kcal/mol |
| MAE | 0.51 kcal/mol |
| R² | 0.978 |
The same model also drives octanol-water logP and arbitrary phase log-partition coefficients via the
thermodynamic cycle (dG_a - dG_b) / (RT ln 10). The folder holds ensemble_fold_0.pt..ensemble_fold_4.pt
(bare state_dicts) plus config.json (informational provenance only; the architecture is fixed in
protonator.ml.models._common.ENCODER_CONFIG and is not parsed at load).
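The thermodynamic cycle mentioned above amounts to a one-line conversion from two solvation free energies to a log-partition coefficient. The sketch below assumes the usual convention that phase a is the reference phase (water for logP) and phase b is the target phase (octanol), so a more favorable solvation in b gives a positive value. The function name and the input values are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def log_partition(dg_a, dg_b, temperature=298.15):
    """log10 partition coefficient from phase a into phase b.

    dg_a, dg_b: solvation free energies (kcal/mol) of the same solute
    in phase a and phase b. Implements (dG_a - dG_b) / (RT ln 10).
    """
    return (dg_a - dg_b) / (R_KCAL * temperature * math.log(10.0))

# Synthetic example: solute solvated 2 kcal/mol more favorably in octanol.
dg_water, dg_octanol = -4.0, -6.0
log_p = log_partition(dg_water, dg_octanol)
```

At 298.15 K the denominator RT ln 10 is about 1.364 kcal/mol, so each 1.364 kcal/mol difference in solvation free energy shifts the partition coefficient by one log unit.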