Feature generation with this model requires very high memory
Generating features for protein sequences of about 1000 residues uses a very large amount of memory. On Google Colab's 12 GB GPU, I got an 'out of memory' error after processing just 6 such sequences.
Maybe try casting the model to half-precision before running feature extraction.
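A rough sketch of what that could look like, assuming PyTorch and a CUDA GPU. The helper names (`format_sequence`, `load_half_precision_model`, `embed`) are made up for illustration. The space-separated input and the mapping of rare residues (U, Z, O, B) to X follow the preprocessing described on the ProtBert model card, so double-check them against the card for your version.

```python
import re


def format_sequence(sequence: str) -> str:
    """Upper-case the sequence, map rare residues (U, Z, O, B) to X,
    and separate residues with spaces."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def load_half_precision_model(model_name: str = "Rostlab/prot_bert"):
    """Hypothetical helper: load tokenizer and model, casting to fp16 on GPU."""
    # Heavy dependencies are imported lazily so format_sequence works without them.
    import torch
    from transformers import BertModel, BertTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = BertModel.from_pretrained(model_name)
    if device == "cuda":
        # Half precision roughly halves the memory used by the weights.
        model = model.half()
    return tokenizer, model.to(device).eval(), device


def embed(sequence: str, tokenizer, model, device):
    """Hypothetical helper: return per-residue embeddings as a CPU tensor."""
    import torch

    inputs = tokenizer(format_sequence(sequence), return_tensors="pt").to(device)
    with torch.no_grad():  # no autograd graph, so activations are not retained
        hidden = model(**inputs).last_hidden_state
    # Drop the [CLS] and [SEP] positions and move the result off the GPU.
    return hidden[0, 1:-1].float().cpu()
```

Usage would be something like `tokenizer, model, device = load_half_precision_model()` followed by `embed("MKTAYIAKQR", tokenizer, model, device)`. The cast is only applied on GPU here because fp16 inference on CPU tends to be slow or unsupported for some operations.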
Also, I would recommend using our ProtT5-XL model because it performed better in all of our benchmarks:
https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc
Also, if you only hit OOM after embedding 6 sequences of identical length, you have a memory leak somewhere.
Once you have managed to embed a single protein of e.g. 1k residues, it should not make any difference how many times you repeat the process.
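One way to check for such a leak is to record GPU memory after every sequence and see whether it keeps climbing. The sketch below is only an illustration: `embed_and_track` and `grows_steadily` are hypothetical helpers, and `embed_fn` / `read_memory_fn` stand in for your own embedding function and for something like `torch.cuda.memory_allocated`.

```python
def embed_and_track(sequences, embed_fn, read_memory_fn):
    """Embed sequences one at a time, recording memory after each one.

    embed_fn:       callable taking a sequence and returning its embedding
                    (ideally a CPU tensor computed under torch.no_grad()).
    read_memory_fn: callable returning currently allocated bytes,
                    e.g. torch.cuda.memory_allocated.
    """
    embeddings, readings = [], []
    for sequence in sequences:
        embeddings.append(embed_fn(sequence))
        readings.append(read_memory_fn())
    return embeddings, readings


def grows_steadily(readings, tolerance_bytes=0):
    """Return True if every reading exceeds the previous one by more than
    the tolerance, which hints at a leak for inputs of identical length."""
    return len(readings) > 1 and all(
        later - earlier > tolerance_bytes
        for earlier, later in zip(readings, readings[1:])
    )
```

If the readings keep growing for inputs of the same length, the usual suspects are running the model without `torch.no_grad()` or collecting the outputs in a list while they still live on the GPU.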
When using the https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc model, the tokenizer raises the error "Exception: You're trying to run a Unigram model but you're file was trained with a different algorithm".
Yeah, I guess you are running into this issue: https://github.com/huggingface/transformers/issues/9871
I think your problem should be solved by loading BertTokenizer or T5Tokenizer explicitly instead of AutoTokenizer.
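Something along these lines, as a sketch only. `pick_tokenizer_class_name` and `load_tokenizer` are hypothetical helpers, and choosing the class from the model name is just a convenience I am assuming here; you can equally well import the class you need directly. `do_lower_case=False` follows the usage shown on the model cards, and T5Tokenizer needs the `sentencepiece` package installed.

```python
def pick_tokenizer_class_name(model_name: str) -> str:
    """Map a model name to the explicit tokenizer class to load."""
    name = model_name.lower()
    if "t5" in name:
        return "T5Tokenizer"
    if "bert" in name:
        return "BertTokenizer"
    raise ValueError(f"No explicit tokenizer known for {model_name!r}")


def load_tokenizer(model_name: str):
    """Load the slow tokenizer class directly, bypassing AutoTokenizer."""
    # Imported lazily so the name lookup above works without transformers.
    import transformers

    tokenizer_class = getattr(transformers, pick_tokenizer_class_name(model_name))
    return tokenizer_class.from_pretrained(model_name, do_lower_case=False)
```

For example, `load_tokenizer("Rostlab/prot_t5_xl_half_uniref50-enc")` would go through T5Tokenizer rather than AutoTokenizer.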