Model for feature generation requires very high memory.

by dipayan26 - opened Oct 31, 2023

Oct 31, 2023

Feature generation for protein sequences of about 1,000 residues uses a very large amount of RAM, and Google Colab's 12 GB of GPU memory hit an 'out of memory' error after embedding just 6 of those sequences.

mheinz

Rostlab org Oct 31, 2023

Maybe try casting the model to half-precision before running feature extraction.
Also, I would recommend using our ProtT5-XL model because it proved to be better in all of our benchmarks:
https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc
Also, if you only hit OOM after embedding 6 sequences of identical length, you have a memory leak somewhere.
Once you have managed to embed a single protein of e.g. 1k residues, it should not make any difference how many times you repeat the process.
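
To illustrate the two points above (half-precision and not accumulating memory between proteins), here is a minimal sketch. It is not code from this thread. The helper names (`prepare_sequence`, `load_encoder`, `embed_one_by_one`) are hypothetical, and the preprocessing (rare residues mapped to X, residues separated by spaces) is an assumption about what ProtT5 expects as input. A common cause of this kind of leak is keeping GPU tensors or the autograd graph alive across iterations, so the sketch runs under `torch.no_grad()` and moves each result to the CPU before storing it.

```python
import re

MODEL_NAME = "Rostlab/prot_t5_xl_half_uniref50-enc"


def prepare_sequence(seq: str) -> str:
    # Assumption: ProtT5 expects space-separated residues, with the rare
    # amino acids U, Z, O and B replaced by X.
    return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))


def load_encoder(model_name: str = MODEL_NAME):
    # Imports are local so the helper above can be used without torch installed.
    import torch
    from transformers import T5EncoderModel

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = T5EncoderModel.from_pretrained(model_name)
    # Half-precision on GPU; fall back to full precision on CPU, where
    # fp16 inference is often unsupported or slow.
    model = model.half() if device.type == "cuda" else model.float()
    return model.to(device).eval(), device


def embed_one_by_one(model, tokenizer, sequences, device):
    import torch

    embeddings = []
    for seq in sequences:
        batch = tokenizer(prepare_sequence(seq), return_tensors="pt").to(device)
        # no_grad: do not build an autograd graph, which would hold on to
        # activations and grow memory use from protein to protein.
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state
        # Keep only the per-residue vectors (drop the trailing special token)
        # and store them on the CPU so no GPU tensor outlives the iteration.
        embeddings.append(hidden[0, : len(seq)].float().cpu())
    return embeddings
```

With this pattern, peak GPU memory should be set by the longest single protein rather than by how many proteins have been embedded so far.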

dipayan26

Nov 4, 2023

When using the https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc model, the tokenizer raises the error "Exception: You're trying to run a Unigram model but you're file was trained with a different algorithm".

mheinz

Rostlab org Nov 7, 2023

Yeah, I guess you are running into this issue: https://github.com/huggingface/transformers/issues/9871
I think your problem should be solved by loading BertTokenizer or T5Tokenizer instead of AutoTokenizer.
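
A minimal sketch of that workaround for the ProtT5 checkpoint linked earlier in this thread, assuming the `transformers` and `sentencepiece` packages are installed (the slow `T5Tokenizer` depends on SentencePiece). The function name `load_prott5_tokenizer` is hypothetical.

```python
def load_prott5_tokenizer(model_name: str = "Rostlab/prot_t5_xl_half_uniref50-enc"):
    # Import inside the function so that defining it does not require
    # transformers to be installed.
    from transformers import T5Tokenizer

    # Request the slow, SentencePiece-based tokenizer class explicitly
    # instead of letting AutoTokenizer pick one.
    return T5Tokenizer.from_pretrained(model_name)
```

Calling `load_prott5_tokenizer()` downloads the tokenizer files on first use. Whether this removes the Unigram error in a given environment is untested here.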

mheinz changed discussion status to closed Nov 16, 2023
