protonator-models

Minimal-dependency (torch + rdkit + numpy) D-MPNN model weights for protonator. Each model is a 5-fold ensemble; protonator returns the mean prediction with the across-fold standard deviation as a calibrated uncertainty. Weights are fetched automatically at a pinned revision via huggingface_hub.
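The ensemble aggregation described above can be sketched as follows. This is a minimal illustration, not protonator's own code: the function name `ensemble_summary` is hypothetical, and because the text does not say whether the population or the sample standard deviation is used, the population form here is an assumption.

```python
from statistics import fmean, pstdev


def ensemble_summary(fold_predictions):
    """Return (mean, std) over the per-fold predictions for one input."""
    if not fold_predictions:
        raise ValueError("need at least one fold prediction")
    mean = fmean(fold_predictions)
    # Assumption: population standard deviation (ddof=0) across folds.
    spread = pstdev(fold_predictions)
    return mean, spread


# Illustrative values only, not real model output.
mean, spread = ensemble_summary([4.71, 4.80, 4.76, 4.69, 4.84])
print(f"{mean:.2f} +/- {spread:.2f}")
```

A large spread means the five folds disagree on that input, which is the signal the ensemble exposes as its uncertainty.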

Folder       Endpoint                              Accuracy
pka_mpnn/    microscopic (per-site) aqueous pKa    scaffold 5-fold CV RMSE ~1.30, random 5-fold CV RMSE ~1.08; Enamine external RMSE 0.55 (R² 0.96)
logp_mpnn/   octanol–water logP                    5-fold CV RMSE 0.77, MAE 0.50, R² 0.86
logs_mpnn/   aqueous logS (log₁₀ mol/L, ~298 K)    5-fold CV RMSE 0.54, MAE 0.35, R² 0.92

Each folder holds fold_0.pt..fold_4.pt + config.json (per-fold output denormalization and a featurizer-version contract validated at load).
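The two roles of config.json can be sketched as below. This is a guess at the general shape only: the keys `featurizer_version` and `folds`, the `mean`/`std` fields, the version string, and the z-score form of the denormalization are all assumptions made for illustration, not the actual schema.

```python
import json

# Hypothetical: the featurizer version implemented by the running code.
FEATURIZER_VERSION = "1"


def load_fold_scalers(config_text):
    """Parse config JSON, enforce the version contract, return per-fold scalers."""
    config = json.loads(config_text)
    found = config.get("featurizer_version")  # hypothetical key
    if found != FEATURIZER_VERSION:
        raise ValueError(
            f"featurizer version mismatch: weights expect {found!r}, "
            f"code provides {FEATURIZER_VERSION!r}"
        )
    return config["folds"]  # hypothetical key: one {"mean", "std"} entry per fold


def denormalize(raw_output, scaler):
    """Map a normalized network output back to target units (assumed z-score)."""
    return raw_output * scaler["std"] + scaler["mean"]
```

Failing at load time rather than at prediction time means weights are never silently paired with a featurizer that lays out atom and bond features differently.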


pka_mpnn — per-site pKa

Microscopic (per-ionization-site) aqueous pKa for drug-like small molecules, 2D-only (SMILES / molecular graph; no 3D conformers, no QM). Given a SMILES and an ionization-center (IC) atom, predicts that site's pKa with an ensemble uncertainty.

Architecture (per fold)

  • depth-3 directed-bond D-MPNN, hidden 1024
  • distance-conditioned IC-centric attention readout (attention+dist): a learned shortest-path-distance bias routes any substituent — at any topological distance — to the ionization center in O(1), so a shallow model is sensitive to remote substituent effects
  • inductive descriptor at the IC (Taft σ_I / Swain–Lupton σ_F / Kier–Hall E-state, with no distance cutoff)
  • dropout 0.15 + weight-decay; per-fold output denormalization; 5-fold ensemble
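The idea behind the distance-conditioned readout can be sketched in plain Python. This is a toy under stated assumptions, not the model's implementation: the function names, the distance bucketing, and the way content scores are supplied (in the real model they would come from learned parameters) are all hypothetical.

```python
import math
from collections import deque


def shortest_path_distances(n_atoms, bonds, source):
    """Breadth-first search: bond-count distance from `source` to every atom."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [-1] * n_atoms  # -1 marks atoms not reachable from the source
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def ic_attention_readout(atom_vectors, scores, dist, dist_bias):
    """Softmax-pool atom vectors with logits = content score + distance bias."""
    last = len(dist_bias) - 1
    # Distances beyond the last bucket (or unreachable atoms) share that bucket.
    logits = [s + dist_bias[d if 0 <= d < last else last] for s, d in zip(scores, dist)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]  # numerically stable softmax
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(atom_vectors[0])
    pooled = [sum(w * vec[k] for w, vec in zip(weights, atom_vectors)) for k in range(dim)]
    return pooled, weights
```

Every atom contributes to the pooled vector in a single step regardless of how many bonds separate it from the ionization center, which is why a depth-3 message-passing trunk can still respond to a substituent many bonds away.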

Benchmarks

Out-of-fold 5-fold cross-validation on a curated, residual-denoised combination of ChEMBL / iBonD / IUPAC experimental pKa (~17.6k per-site measurements):

Split                 RMSE    MAE
scaffold 5-fold CV    ~1.30   ~0.95
random 5-fold CV      ~1.08   -
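The difference between the two splits is that a scaffold split keeps every molecule sharing a scaffold in the same fold, so each held-out fold contains chemotypes unseen in training. The exact procedure used for these benchmarks is not described here; the sketch below is one common greedy scheme. It assumes a scaffold key (for example a Bemis-Murcko scaffold SMILES) has already been computed for each record, and the function name is hypothetical.

```python
from collections import defaultdict


def scaffold_folds(scaffolds, n_folds=5):
    """Assign record indices to folds so that no scaffold spans two folds."""
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)
    folds = [[] for _ in range(n_folds)]
    # Place the largest scaffold groups first, each into the currently smallest fold.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])
    return folds
```

A random split would instead shuffle individual records, letting close analogs land on both sides of the split, which is consistent with the lower random-split RMSE.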

Held-out external set (Enamine fluoro, 158 molecules, not used in training):

Set               RMSE    MAE     R²
Enamine fluoro    0.55    0.40    0.96
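For reference, the three reported metrics can be computed from paired experimental and predicted values as sketched below. The function name is hypothetical, and R² is taken here as the coefficient of determination (1 - SS_res / SS_tot); that this definition, rather than a squared Pearson correlation, is the one behind the reported numbers is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired experimental and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all experimental values are equal
    return rmse, mae, r2
```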

Remote-substituent sensitivity (the headline improvement over the prior PKaGIN model):

Probe                                           this model    prior PKaGIN
Hammett ρ (para-benzoic series; target 1.00)    → 1.0         0.34 (95% CI [−0.00, 0.77])
fluoro matched-pair Δ sign-accuracy             1.00          0.33

The prior model is degenerate on remote substituents: it predicts near-identical pKa for a molecule and an analog whose substituent lies beyond its receptive field. This model fixes that while matching or exceeding the prior model's overall accuracy.
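Both probes are simple to compute from predictions. The Hammett equation gives pKa(H) - pKa(X) = ρ·σ, and ρ is 1 by definition for the ionization of benzoic acids in water at 25 °C, hence the target of 1.00. The sketch below is an assumption about how such probes could be evaluated, not the benchmark code; the function names are hypothetical, and counting zero differences as misses is my choice.

```python
def hammett_rho(sigmas, pka_values, pka_parent):
    """Least-squares slope of (pKa_parent - pKa_X) against the substituent sigma."""
    ys = [pka_parent - p for p in pka_values]
    n = len(sigmas)
    mean_x = sum(sigmas) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in sigmas)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sigmas, ys))
    return sxy / sxx


def sign_accuracy(predicted_deltas, experimental_deltas):
    """Fraction of matched pairs whose predicted pKa shift has the correct sign."""
    hits = sum(1 for p, e in zip(predicted_deltas, experimental_deltas) if p * e > 0)
    return hits / len(predicted_deltas)


# Standard para sigma constants for NO2, Cl, CH3, OCH3.
sigma_para = [0.78, 0.23, -0.17, -0.27]
# A degenerate model that predicts the same pKa for every analog yields rho = 0.
flat_predictions = [4.20, 4.20, 4.20, 4.20]
print(hammett_rho(sigma_para, flat_predictions, pka_parent=4.20))
```

A model that cannot see the para substituent returns a slope near zero, while one that responds correctly recovers a slope near 1.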

Required input standardization

The model was trained on neutral, desalted, largest-organic-fragment SMILES that were not tautomer-canonicalized. protonator.predict_sites applies the matching standardization (desalt + largest-fragment + neutralize, no tautomer canonicalization) before detecting ionization centers, so charged species and salts are handled correctly. Do not bypass it for arbitrary user input.


logp_mpnn — octanol/water logP

D-MPNN, 5-fold ensemble. 5-fold CV: RMSE 0.77, MAE 0.50, R² 0.86.

logs_mpnn — aqueous logS

Aqueous log solubility (log₁₀ mol/L, ~298 K); shares the D-MPNN trunk with logP, trained jointly. 5-fold CV: RMSE 0.54, MAE 0.35, R² 0.92.


Usage

protonator fetches these automatically (pinned revision). Manual load:

from protonator.ml.models.pka_mpnn import PKaPredictor
pred = PKaPredictor(weights_dir="<pka_mpnn folder>", device="cpu")
sites = pred.predict_sites("[Na+].CC(=O)[O-]")   # auto-standardized -> Carboxylic Acid ~4.96

Accuracy figures are out-of-fold cross-validation on the experimental training data plus a held-out external set; they are not directly comparable across endpoints (different data and splits).

Citation

Isayev lab, protonator: https://github.com/isayevlab/protonator

solvation_mpnn (solvation free energy, dG_solv)

solvation_mpnn/ — solute-in-solvent solvation free energy (dG_solv, kcal/mol at 298.15 K). Dual-encoder D-MPNN: separate solute and solvent encoders (hidden 2048, depth 6, 72-dim atom features) feeding an FFN over both pooled vectors plus per-molecule RDKit SlogP_VSA descriptors (4120 -> 1024 -> 1024 -> 1); 5-fold ensemble. Self-contained (torch + rdkit + numpy only).

5-fold CV (out-of-fold, 21,214 solute/solvent pairs)
dG_solv: RMSE 0.95 kcal/mol / MAE 0.51 kcal/mol / R² 0.978

The same model also drives octanol-water logP and arbitrary phase log-partition coefficients via the thermodynamic cycle (dG_a - dG_b) / (RT ln 10), where dG_a and dG_b are the solvation free energies of the same solute in phases a and b.

Files: ensemble_fold_0.pt..ensemble_fold_4.pt (bare state_dicts) + config.json (informational provenance; the architecture is fixed in protonator.ml.models._common.ENCODER_CONFIG and is not parsed from config.json at load).
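The thermodynamic cycle above amounts to a one-line conversion. The sketch below is illustrative only: the function name is hypothetical, and reading phase a as water and phase b as octanol (so that a positive result means the solute prefers phase b) is my interpretation of the sign convention, not something the text spells out.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)
T_KELVIN = 298.15
RT_LN10 = R_KCAL * T_KELVIN * math.log(10.0)  # about 1.364 kcal/mol


def log_partition(dg_a, dg_b):
    """log10 partition coefficient (phase b over phase a) from dG_solv in kcal/mol."""
    return (dg_a - dg_b) / RT_LN10


# Illustrative numbers only: a solute solvated 2.0 kcal/mol more favorably
# in octanol (phase b) than in water (phase a).
log_p = log_partition(dg_a=-6.0, dg_b=-8.0)
print(round(log_p, 2))
```

At 298.15 K each log unit corresponds to roughly 1.36 kcal/mol, so the reported dG_solv RMSE of 0.95 kcal/mol translates to errors on the order of a log unit when two independently erring predictions are subtracted.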
