Missing example for running the model
This model needs a bounding box to specify which widget to describe.
But there is no example for this on the model card.
What is unclear is how the bounding box should be specified.
As I understand it, the code should look something like this:
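from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")
image = Image.open("screenshot.png")  # hypothetical path to the app screenshot
question = "? bounding box ?"  # placeholder: the expected format is the open question
inputs = processor(images=image, text=question, return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))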
Same issue here.
The model seems to return the same caption regardless of the bounding box.
Has anyone solved it yet?
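One thing that might be worth trying, as a guess rather than a confirmed answer: the Pix2Struct paper describes the widget captioning input as a screenshot annotated with a bounding box, which suggests the box may need to be drawn onto the image itself instead of being passed as text. The sketch below assumes that. The helper names `to_pixel_box` and `mark_widget`, the normalized `(x_min, y_min, x_max, y_max)` box format, and the outline colour and width are all hypothetical choices, not taken from the model card, and may need to match whatever the training preprocessing used.

```python
def to_pixel_box(box, width, height):
    """Convert a normalized (x_min, y_min, x_max, y_max) box to pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )


def mark_widget(image, box, outline="red", line_width=3):
    """Return a copy of a PIL image with the target widget outlined.

    Assumption: the model was trained on screenshots with the target box
    rendered on them. The colour and width here are guesses.
    """
    # Imported lazily so to_pixel_box can be used without Pillow installed.
    from PIL import ImageDraw

    marked = image.convert("RGB")  # convert() returns a new image
    ImageDraw.Draw(marked).rectangle(
        to_pixel_box(box, *marked.size), outline=outline, width=line_width
    )
    return marked
```

The marked image would then go to the processor in place of the raw screenshot, for example `processor(images=mark_widget(image, (0.25, 0.5, 0.75, 1.0)), return_tensors="pt")`, followed by `model.generate` as in the snippet above. Whether a text prompt is also needed is not something this sketch settles.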