tarsier

Tarsier is a library for preparing web page content for LLM interaction. It generates a simplified, tagged text representation of a page, including its interactable elements, so that even text-only LLMs can understand the page's visual structure and choose actions.

Questions & Answers

What is Tarsier?
Tarsier is a Python library designed to transform web pages into structured text representations optimized for Large Language Model (LLM) interaction. It visually tags interactable elements and converts page content into a whitespace-structured string.
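To make "whitespace-structured string" concrete, here is a minimal sketch of the general idea: words with pixel positions (as an OCR service might return them) are placed on a character grid so that horizontal and vertical spacing survives in plain text. This is not Tarsier's actual implementation; the `Word` class, the `layout_words` function, and the `char_width` and `line_height` parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def layout_words(words, char_width=10, line_height=20):
    """Place words on a character grid derived from their pixel positions.

    Hypothetical helper for illustration only, not part of Tarsier's API.
    """
    if not words:
        return ""

    # Group words into text rows by their vertical position.
    rows = {}
    for word in words:
        rows.setdefault(word.y // line_height, []).append(word)

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row, []), key=lambda w: w.x):
            col = word.x // char_width
            if line and len(line) >= col:
                # Target column already occupied: fall back to a single space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += word.text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


if __name__ == "__main__":
    page = [Word("Search", 0, 0), Word("[$1]Go", 200, 0), Word("Results", 0, 40)]
    print(layout_words(page))
```

Empty rows are kept as blank lines so that vertical gaps on the page remain visible to the model.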
Who is Tarsier intended for?
Tarsier is for developers and researchers building web interaction agents or automation tools that rely on LLMs. It addresses two problems such agents face: how to present a web page to an LLM, and how to map the LLM's chosen action back to a specific element on the page.
How does Tarsier improve upon traditional web scraping for LLMs?
Traditional scraping extracts text only. Tarsier instead acts as a perception system: it also visually tags each interactable element with a unique ID, so an LLM can refer to an element by its tag. Its OCR-based "ASCII art" representation preserves the page's layout in plain text, which helps text-only LLMs understand visual context; the project reports that this approach often outperforms direct GPT-4V usage in its benchmarks.
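The tagging idea can be sketched as follows: each interactable element gets a unique numeric ID, the ID is rendered as a bracketed tag in front of the element's label, and a lookup table maps each ID back to a locator such as an XPath. This is a simplified illustration rather than Tarsier's code; `tag_elements`, the `(kind, label, xpath)` tuples, and the prefix scheme (`#` for text inputs, `@` for links, `$` for other elements) are assumptions made for this example.

```python
# Hypothetical prefix scheme used only in this sketch.
PREFIX_BY_KIND = {"input": "#", "link": "@"}
DEFAULT_PREFIX = "$"


def tag_elements(elements):
    """Assign a unique ID to each element and build an ID -> XPath lookup.

    `elements` is a list of (kind, label, xpath) tuples. Illustrative
    helper only, not part of Tarsier's API.
    """
    tagged_labels = []
    tag_to_xpath = {}
    for tag_id, (kind, label, xpath) in enumerate(elements):
        prefix = PREFIX_BY_KIND.get(kind, DEFAULT_PREFIX)
        tagged_labels.append(f"[{prefix}{tag_id}] {label}")
        tag_to_xpath[tag_id] = xpath
    return tagged_labels, tag_to_xpath


if __name__ == "__main__":
    elements = [
        ("input", "Search", "//input[@name='q']"),
        ("link", "Docs", "//a[1]"),
        ("button", "Submit", "//button[1]"),
    ]
    labels, lookup = tag_elements(elements)
    print(labels)
    # If the LLM answers "click 2", the agent resolves the ID to a locator.
    print(lookup[2])
```

Keeping the lookup table outside the prompt means the LLM only has to name a short ID, and the agent resolves that ID to a concrete element afterwards.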
When should one use Tarsier?
Use Tarsier when developing autonomous web agents that require an LLM to understand a webpage's layout, identify interactable elements, and execute actions. It's particularly useful when fine-grained visual understanding is critical for successful web automation.
What OCR engines does Tarsier support?
Tarsier currently supports Google Cloud Vision and Microsoft Azure Computer Vision for processing page screenshots into its whitespace-structured text representation. It requires valid service account credentials for the chosen OCR service.