DNABERT-S
Pre-trained model on multi-species genomes using a contrastive learning objective for species-aware DNA embeddings.
Disclaimer
This is an UNOFFICIAL implementation of DNABERT-S: pioneering species differentiation with species-aware DNA embeddings by Zhihan Zhou et al.
The OFFICIAL repository of DNABERT-S is at MAGICS-LAB/DNABERT_S.
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing DNABERT-S did not write this model card, so this model card has been written by the MultiMolecule team.
Model Details
DNABERT-S is a BERT-style model built upon DNABERT-2 and fine-tuned with contrastive learning for species-aware DNA embeddings. The model was trained with the proposed Curriculum Contrastive Learning (C²LR) strategy and the Manifold Instance Mixup (MI-Mix) training objective.
DNABERT-S shares the same architecture as DNABERT-2: it uses Byte Pair Encoding (BPE) tokenization, Attention with Linear Biases (ALiBi) instead of learned position embeddings, and incorporates a GeGLU (GELU-gated linear unit) MLP and FlashAttention for improved efficiency.
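ALiBi replaces position embeddings with a static, head-specific linear penalty added to the attention logits, which is what lets the model extrapolate beyond its training length. The following is a minimal sketch of how those biases are constructed, not the model's internal code; it uses simplified geometric slopes, whereas the ALiBi paper prescribes a slightly different scheme for non-power-of-two head counts:

```python
import torch


def alibi_biases(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build ALiBi attention biases: each head adds -m * |i - j| to its
    attention logits, where m is a head-specific slope drawn from a
    geometric sequence (simplified from Press et al.'s recipe)."""
    # Slopes: 2^(-8/num_heads), 2^(-16/num_heads), ... (one per head)
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    # Relative distance |i - j| between every query i and key j
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()  # (seq_len, seq_len)
    # Broadcast to (num_heads, seq_len, seq_len); added to attention scores
    return -slopes[:, None, None] * dist


biases = alibi_biases(num_heads=12, seq_len=4)
```

Because the bias depends only on relative distance, no learned parameter ties the model to a fixed maximum sequence length.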
Model Specification
| Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
|---|---|---|---|---|---|---|---|
| 12 | 768 | 12 | 3072 | 117.07 | 125.83 | 62.92 | 512 |
Links
- Code: multimolecule.dnaberts
- Data: GenBank
- Paper: DNABERT-S: pioneering species differentiation with species-aware DNA embeddings
- Developed by: Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, Han Liu
- Model type: BERT - MosaicBERT
- Original Repository: MAGICS-LAB/DNABERT_S
Usage
The model file depends on the multimolecule library. You can install it using pip:
pip install multimolecule
Direct Use
Feature Extraction
You can use this model directly with a pipeline for feature extraction:
import multimolecule # you must import multimolecule to register models
from transformers import pipeline
predictor = pipeline("feature-extraction", model="multimolecule/dnaberts")
output = predictor("ATCGATCGATCG")
Downstream Use
Extract Features
Here is how to use this model to get the features of a given sequence in PyTorch:
from multimolecule import DnaBertSModel
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("multimolecule/dnaberts")
model = DnaBertSModel.from_pretrained("multimolecule/dnaberts")
text = "ATCGATCGATCGATCG"
input = tokenizer(text, return_tensors="pt")
output = model(**input)
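Since DNABERT-S is designed to produce one embedding per sequence, a common pattern is to mean-pool the last hidden state over non-padding tokens and compare sequences by cosine similarity. This is a sketch of that pattern, not an official recipe from the DNABERT-S authors; the `mean_pool` helper and the variable names in the comments are illustrative:

```python
import torch


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the last hidden state over non-padding tokens so each
    sequence maps to a single fixed-size embedding vector."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


# With the model outputs from the snippet above (hypothetical usage):
#   embedding = mean_pool(output.last_hidden_state, input["attention_mask"])
# Two such embeddings can then be compared with
#   torch.nn.functional.cosine_similarity(embedding_a, embedding_b)

# Shape check with a dummy hidden state (hidden size 768, one padded token):
pooled = mean_pool(torch.ones(1, 4, 768), torch.tensor([[1, 1, 1, 0]]))
```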
Sequence Classification / Regression
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
import torch
from multimolecule import DnaBertSForSequencePrediction
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("multimolecule/dnaberts")
model = DnaBertSForSequencePrediction.from_pretrained("multimolecule/dnaberts")
text = "ATCGATCGATCGATCG"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])
output = model(**input, labels=label)
Token Classification / Regression
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a nucleotide-level task in PyTorch:
import torch
from multimolecule import DnaBertSForTokenPrediction
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("multimolecule/dnaberts")
model = DnaBertSForTokenPrediction.from_pretrained("multimolecule/dnaberts")
text = "ATCGATCGATCGATCG"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, input["input_ids"].shape[1]))  # one label per token; BPE tokens do not map 1:1 to nucleotides
output = model(**input, labels=label)
Contact Classification / Regression
This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
import torch
from multimolecule import DnaBertSForContactPrediction
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("multimolecule/dnaberts")
model = DnaBertSForContactPrediction.from_pretrained("multimolecule/dnaberts")
text = "ATCGATCGATCGATCG"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, input["input_ids"].shape[1], input["input_ids"].shape[1]))  # one label per token pair; BPE tokens do not map 1:1 to nucleotides
output = model(**input, labels=label)
Training Details
DNABERT-S uses a two-phase Curriculum Contrastive Learning (C²LR) strategy. In phase I, the model is trained with Weighted SimCLR for one epoch. In phase II, the model is further trained with Manifold Instance Mixup (MI-Mix) for two epochs. The training starts from the pre-trained DNABERT-2 checkpoint.
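Both phases optimize a contrastive objective: embeddings of two non-overlapping fragments from the same species are pulled together, while all other embeddings in the batch are pushed apart. The sketch below shows the standard SimCLR-style InfoNCE loss underlying this setup, with the paper's temperature of 0.05; it is not the exact Weighted SimCLR or MI-Mix variants, which reweight negatives and mix hidden states between instances:

```python
import torch
import torch.nn.functional as F


def simclr_loss(z1: torch.Tensor, z2: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over a batch of positive pairs (z1[i], z2[i]).
    Each embedding's positive is its paired fragment; every other
    embedding in the 2N-sized batch serves as a negative."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.T / temperature                          # cosine-similarity logits
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # Row i's positive sits at index i + n (and vice versa)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)


loss = simclr_loss(torch.randn(8, 128), torch.randn(8, 128))
```

The low temperature sharpens the softmax, so the loss focuses on the hardest in-batch negatives.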
Training Data
The DNABERT-S model was trained on pairs of non-overlapping DNA sequences from the same species, sourced from GenBank. The dataset consists of 47,923 pairs from 17,636 viral genomes, 1 million pairs from 5,011 fungi genomes, and 1 million pairs from 6,402 bacteria genomes. From the total of 2,047,923 pairs, 2 million were randomly selected for training and the rest were used as validation data. All DNA sequences are 10,000 bp in length.
Training Procedure
Pre-training
The model was trained on 8 NVIDIA A100 80GB GPUs.
- Temperature (τ): 0.05
- Hyperparameter (α): 1.0
- Epochs: 1 (phase I, Weighted SimCLR) + 2 (phase II, MI-Mix)
- Optimizer: Adam
- Learning rate: 3e-6
- Batch size: 48
- Checkpointing: Every 10,000 steps; the best checkpoint was selected by validation loss
- Training time: ~48 hours
Citation
@article{zhou2025dnaberts,
title={{DNABERT-S}: pioneering species differentiation with species-aware {DNA} embeddings},
author={Zhou, Zhihan and Wu, Weimin and Ho, Harrison and Wang, Jiayi and Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han},
journal={Bioinformatics},
volume={41},
pages={i255--i264},
year={2025},
doi={10.1093/bioinformatics/btaf188}
}
The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the DNABERT-S paper for questions or comments on the paper/model.
License
This model is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
SPDX-License-Identifier: AGPL-3.0-or-later