Paper: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph (arXiv:2511.18622)
A 16,384-token BPE tokenizer for OpenGloss OGBERT embedding models.
# Load the tokenizer from the Hugging Face Hub
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")
# Encode text into a list of token IDs
tokens = tokenizer.encode("hello world")
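To see how a string is split, it can help to look at the token strings next to their IDs and to decode the IDs back to text. The following is a minimal sketch, assuming the tokenizer exposes the standard Hugging Face methods `encode`, `convert_ids_to_tokens`, and `decode`. The helper names `inspect_encoding` and `load_ogbert_tokenizer` are hypothetical and are not part of the OpenGloss release.

```python
def inspect_encoding(tokenizer, text):
    """Return (ids, tokens, decoded) for `text`.

    `tokenizer` is assumed to be a Hugging Face-style tokenizer object.
    """
    # Token IDs, including any special tokens the tokenizer adds by default
    ids = tokenizer.encode(text)
    # Human-readable token strings for each ID
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Round-trip back to text, dropping special tokens such as <|cls|> or <|sep|>
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, tokens, decoded


def load_ogbert_tokenizer():
    """Load the tokenizer from the Hub (needs network access and `transformers`)."""
    # Imported lazily so inspect_encoding can be used without transformers installed
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-16k")
```

Calling `inspect_encoding(load_ogbert_tokenizer(), "hello world")` should return the three values for the example above. Whether the decoded text matches the input exactly depends on the tokenizer's normalization, which is not described here.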
Special tokens: <|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>
@misc{bommarito2025opengloss,
title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
author={Michael J. Bommarito II},
year={2025},
eprint={2511.18622},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License: Apache 2.0