Palladia
A Benchmark Project for Vision LLMs
Introducing a benchmark for comparing the performance of state-of-the-art visual language models (VLMs) on historical document images, based on the GT4HistOCR dataset.
Note: This project does not aim to provide a fair or objective assessment of which model should be used for automated historical document analysis, nor to make any specific recommendations. Rather, the goal is to examine how flagship and secondary-market models, excluding those specifically created or fine-tuned for this task, are narrowing the gap in accurately extracting text from historical documents. The results shown here may also be out of date.
Metrics Explanation
The criteria used to benchmark the models
Accuracy
Percentage of characters that match exactly between the OCR output and ground truth text. Higher values indicate better performance.
Character Error Rate (CER)
Ratio of character-level errors to the total number of characters. Lower values indicate better performance.
Word Error Rate (WER)
Ratio of word-level errors to the total number of words. Lower values indicate better performance.
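As a rough illustration of the two error rates above, the sketch below computes CER and WER in Python. It assumes that "errors" means the Levenshtein edit distance (insertions, deletions, and substitutions) and that the denominator is the length of the ground truth; the function names `levenshtein`, `cer`, and `wer` are hypothetical and are not taken from the Palladia code, which may normalize text or count errors differently.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_output):
    """Character Error Rate: character edits / ground-truth characters (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def wer(ground_truth, ocr_output):
    """Word Error Rate: word edits / ground-truth words, splitting on whitespace (assumed)."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(ref_words, ocr_output.split()) / len(ref_words)
```

Under these assumptions, a single substituted character in a four-character line gives a CER of 0.25, and a single misrecognized word in a four-word line gives a WER of 0.25. Both rates can exceed 1.0 when the OCR output is much longer than the ground truth.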
Execution Time
Average time taken by each model to process an image. Lower values indicate faster processing.
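One way such an average could be measured is sketched below. The helper `average_execution_time` and its `process_image` callback are hypothetical names introduced for illustration; the sketch assumes wall-clock time per image, measured with `time.perf_counter`, and may not match how Palladia records timings.

```python
import time
from statistics import mean


def average_execution_time(process_image, images):
    """Return the mean wall-clock time in seconds that `process_image`
    takes per item in `images` (hypothetical helper, not Palladia's code)."""
    durations = []
    for image in images:
        start = time.perf_counter()
        process_image(image)  # e.g. a call that sends the image to a model
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("at least one image is required")
    return mean(durations)
```

For models accessed over a network, a measurement like this would include request latency as well as the model's own processing time.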
Palladia - Federico Dassiè, 2026