AutoProcessor.from_pretrained fails on transformers >= 5.0 due to `bpe_tokenizer` attribute / repo-layout mismatch

#20
by HecklesL - opened

Bug

AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True) fails on
transformers >= 5.0.0 with the following error:

ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.

The error is misleading: sentencepiece and tiktoken being installed does NOT fix it. The real
cause is a path-layout mismatch (see below).

Repro

# env: transformers==5.2.0, tokenizers latest, both sentencepiece+tiktoken installed
from transformers import AutoProcessor
proc = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
# -> ValueError as above

Direct loading via PreTrainedTokenizerFast.from_pretrained("physical-intelligence/fast")
or AutoTokenizer.from_pretrained("physical-intelligence/fast") works fine; only the
AutoProcessor flow fails.

Root cause

UniversalActionProcessor declares its tokenizer attribute as bpe_tokenizer:

class UniversalActionProcessor(ProcessorMixin):
    attributes: ClassVar[list[str]] = ["bpe_tokenizer"]
    bpe_tokenizer_class: str = "AutoTokenizer"

In transformers >= 5.0, ProcessorMixin._load_tokenizer_from_pretrained
(processing_utils.py L1453–1471)
checks is_primary = sub_processor_type == "tokenizer". Since the attribute is named
bpe_tokenizer (not tokenizer), is_primary is False, and the loader is forced to look in
a <repo>/bpe_tokenizer/ subfolder:

tokenizer_subfolder = os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
tokenizer = auto_processor_class.from_pretrained(
    pretrained_model_name_or_path, subfolder=tokenizer_subfolder, **kwargs,
)

The physical-intelligence/fast repo puts tokenizer.json, tokenizer_config.json,
special_tokens_map.json at the root, not in a bpe_tokenizer/ subdirectory. Hence the
subfolder lookup returns nothing, and PreTrainedTokenizerFast.__init__ falls through to its
"couldn't instantiate backend" error path.
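
The lookup rule described above can be modeled in a few lines. This is a simplified sketch, not the transformers source: resolve_tokenizer_subfolder is a hypothetical helper that only mirrors the is_primary check and the os.path.join expression quoted above. It assumes, per the description in this post, that a primary tokenizer is loaded from the caller's subfolder unchanged.

```python
import os


def resolve_tokenizer_subfolder(sub_processor_type: str, subfolder: str = "") -> str:
    """Hypothetical model of where the loader looks for tokenizer files.

    Assumption: an attribute named exactly "tokenizer" is primary and is
    loaded from the caller's subfolder (the repo root by default); any
    other attribute name is appended as an extra subfolder.
    """
    is_primary = sub_processor_type == "tokenizer"
    if is_primary:
        return subfolder
    return os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type


# An empty string stands for the repo root.
print(repr(resolve_tokenizer_subfolder("tokenizer")))      # ''
print(repr(resolve_tokenizer_subfolder("bpe_tokenizer")))  # 'bpe_tokenizer'
```

Under this model, the attribute name alone decides whether the root-level tokenizer.json is ever looked at.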

This is confirmed by HuggingFace cache .no_exist/ markers: after the failed load, the cache
records that bpe_tokenizer/tokenizer.json, bpe_tokenizer/tokenizer_config.json etc.
"do not exist" in the repo.
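
To inspect those markers yourself, something like the following sketch may help. It assumes the cache stores them as empty files under <repo cache dir>/.no_exist/<commit_hash>/<relative path>; list_no_exist_markers is a hypothetical helper name, and the default cache location is taken from the paths used elsewhere in this post.

```python
from pathlib import Path


def list_no_exist_markers(repo_cache_dir: Path) -> dict[str, list[str]]:
    """Map each cached revision to the repo-relative paths marked as missing.

    Assumes the layout <repo_cache_dir>/.no_exist/<commit_hash>/<relative path>.
    """
    markers: dict[str, list[str]] = {}
    no_exist_root = repo_cache_dir / ".no_exist"
    if not no_exist_root.is_dir():
        return markers
    for revision_dir in sorted(p for p in no_exist_root.iterdir() if p.is_dir()):
        # Each marker is a file whose path below the revision dir is the missing repo path.
        markers[revision_dir.name] = sorted(
            f.relative_to(revision_dir).as_posix()
            for f in revision_dir.rglob("*")
            if f.is_file()
        )
    return markers


if __name__ == "__main__":
    repo_cache = Path.home() / ".cache/huggingface/hub/models--physical-intelligence--fast"
    for revision, missing in list_no_exist_markers(repo_cache).items():
        print(revision)
        for path in missing:
            print("  missing:", path)
```

If the diagnosis here is right, the output after a failed AutoProcessor load should include entries under bpe_tokenizer/.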

In transformers < 5.0 the subfolder check was less strict and the loader could fall back to
root, which is why this hasn't surfaced before.

Proposed fix (repo side)

Move the three tokenizer files into a bpe_tokenizer/ subdirectory of the repo:

physical-intelligence/fast/
├── processing_action_tokenizer.py
├── processor_config.json
├── README.md
└── bpe_tokenizer/
    ├── tokenizer.json
    ├── tokenizer_config.json
    └── special_tokens_map.json

This is a pure layout change with no code change in the repo. Anyone calling
PreTrainedTokenizerFast.from_pretrained directly would need to pass subfolder="bpe_tokenizer"
to keep loading the tokenizer.

Alternative: rename the attribute in processing_action_tokenizer.py from bpe_tokenizer
to tokenizer, which would make it is_primary and load from root. But the first option is
less invasive for downstream users.

Local workaround (until upstream is fixed)

# Shell: stage a local copy with the bpe_tokenizer/ layout
DEST=~/.cache/huggingface/FAST_processor_local
SRC=~/.cache/huggingface/hub/models--physical-intelligence--fast/snapshots/<commit_hash>
mkdir -p "$DEST/bpe_tokenizer"
cp -L "$SRC"/{processing_action_tokenizer.py,processor_config.json} "$DEST/"
cp -L "$SRC"/{tokenizer.json,tokenizer_config.json,special_tokens_map.json} "$DEST/bpe_tokenizer/"

# Python: then load via local absolute path (bypasses HF Hub manifest check)
import os
from transformers import AutoProcessor
DEST = os.path.expanduser("~/.cache/huggingface/FAST_processor_local")
proc = AutoProcessor.from_pretrained(DEST, trust_remote_code=True)
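
For anyone who prefers to do the staging step from Python as well, here is a rough equivalent of the copy commands above. It is a sketch under the same assumptions as the workaround: stage_local_processor is a hypothetical helper name, and the file lists are the ones named in this post.

```python
import shutil
from pathlib import Path

PROCESSOR_FILES = ("processing_action_tokenizer.py", "processor_config.json")
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json")


def stage_local_processor(snapshot_dir: Path, dest_dir: Path) -> Path:
    """Copy a flat snapshot into a layout with a bpe_tokenizer/ subfolder."""
    tokenizer_dir = dest_dir / "bpe_tokenizer"
    tokenizer_dir.mkdir(parents=True, exist_ok=True)
    for name in PROCESSOR_FILES:
        # shutil.copy follows symlinks (like `cp -L`), so real files are written.
        shutil.copy(snapshot_dir / name, dest_dir / name)
    for name in TOKENIZER_FILES:
        shutil.copy(snapshot_dir / name, tokenizer_dir / name)
    return dest_dir
```

The returned directory can then be passed as a string to AutoProcessor.from_pretrained with trust_remote_code=True, exactly as in the load step above.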

A symlink-based fix inside the HF cache snapshot dir does NOT work because HF Hub checks the
remote manifest in addition to the local cache; subfolder arguments are validated against
the remote file list, which still lacks bpe_tokenizer/.

Env

transformers==5.2.0
tokenizers==<...>
sentencepiece==0.2.1
tiktoken==0.12.0
python==3.10

Try LeRobot's official version: lerobot/fast-action-tokenizer
That one should work fine.
