Palladia

A Benchmark Project for Vision LLMs

Introducing a benchmark for comparing the performance of state-of-the-art visual language models (VLMs) on historical document images, based on the GT4HistOCR dataset.

Note: This project does not aim to provide a fair or objective assessment of which model should be used for automated historical document analysis, nor to make any specific recommendations. Rather, the goal is to examine how flagship and secondary market models (excluding those specifically created or fine-tuned for this task) are narrowing the gap in accurately extracting text from historical documents. The results may also be out of date.

Metrics Explanation

The criteria used to benchmark the models on the documents

Accuracy

Percentage of characters that match exactly between the OCR output and ground truth text. Higher values indicate better performance.
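
As a rough illustration of this metric, the sketch below counts the characters that fall inside matching blocks found by Python's `difflib.SequenceMatcher` and divides by the length of the ground truth. The alignment method and the function name `char_accuracy` are assumptions made for this example; the benchmark may compute its accuracy differently.

```python
from difflib import SequenceMatcher


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Percentage of ground-truth characters matched in the OCR output.

    Assumption: characters are aligned with difflib's matching blocks,
    which approximates (but is not guaranteed to equal) an optimal alignment.
    """
    if not ground_truth:
        # Convention chosen for this sketch: empty vs. empty is a perfect match.
        return 100.0 if not ocr_output else 0.0
    matcher = SequenceMatcher(None, ground_truth, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ground_truth)
```

For instance, `char_accuracy("abcd", "abxd")` matches three of the four ground-truth characters.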

Character Error Rate (CER)

Ratio of character-level errors to the total number of characters. Lower values indicate better performance.
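
A common way to compute this ratio is the Levenshtein edit distance (substitutions, deletions, and insertions) divided by the number of characters in the ground truth. The sketch below follows that convention as an assumption; the names `edit_distance` and `cer` are illustrative and not taken from the project.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences, using two DP rows."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)
```

Under this definition the value can exceed 1.0 when the OCR output contains many spurious insertions.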

Word Error Rate (WER)

Ratio of word-level errors to the total number of words. Lower values indicate better performance.
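
The same edit-distance idea can be applied to words instead of characters. The sketch below assumes whitespace tokenization and division by the number of ground-truth words; both choices, and the function names, are assumptions for illustration only.

```python
def word_edit_distance(reference: list, hypothesis: list) -> int:
    """Levenshtein distance computed over word tokens."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deleted word
                current[j - 1] + 1,                   # inserted word
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(ground_truth: str, ocr_output: str) -> float:
    """Word error rate relative to the number of ground-truth words."""
    reference = ground_truth.split()  # assumption: split on whitespace
    hypothesis = ocr_output.split()
    if not reference:
        raise ValueError("ground truth must contain at least one word")
    return word_edit_distance(reference, hypothesis) / len(reference)
```

With the ground truth "the quick brown fox" and the output "the quik brown fox", one of four words is wrong, so `wer` returns 0.25. A single wrong character therefore costs a whole word here, which is why WER is typically higher than CER on the same text.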

Execution Time

Average time taken by each model to process an image. Lower values indicate faster processing.
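
A per-model average of this kind could be measured roughly as follows. This is a minimal sketch: `process_image` stands in for whatever call sends one image to a model, and is a hypothetical placeholder rather than part of the project.

```python
import time
from typing import Any, Callable, Iterable


def average_processing_time(
    process_image: Callable[[Any], Any],
    images: Iterable[Any],
) -> float:
    """Mean wall-clock seconds that process_image takes per image."""
    durations = []
    for image in images:
        start = time.perf_counter()
        process_image(image)  # hypothetical model call
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("at least one image is required")
    return sum(durations) / len(durations)
```

Wall-clock timing like this includes network latency when the model is called through a remote API, so results may vary between runs.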

Palladia - Federico Dassiè, 2026