OCR with Tesseract

This is a simple and fast way to create OCR for the digital collections of Universitätsbibliothek Tübingen. Processing the 167 pages of the example below takes less than 20 minutes on a MacBook, which is at most about 7 seconds per page.

Example for OCR process with Tesseract

In this example, OCR is created for a German book from 1556, Wahrhafftig Historia unnd beschreibung einer Landtschafft der Wilden, Nacketen, Grimmigen Menschfresser Leuthen in der Newen welt America gelegen (https://opendigi.ub.uni-tuebingen.de/opendigi/FoXXVIII9_qt). The process produces ALTO XML files (compatible with DFG viewer) and a TEI file (suitable for OpenDigi).

The following shell script includes all steps:

#!/usr/bin/env bash

# Name of book from Universitätsbibliothek Tübingen (part of the URL).
name=FoXXVIII9_qt

# Get IIIF manifest.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest

# Get METS file.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/mets

# Get images.
mkdir -p jpg
for url in $(tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/'); do
    # Derive the local file name from the .jp2 name in the image URL.
    jpg=$(echo "$url" | perl -pe 's:^.*/(.*)\.jp2.*:$1.jpg:')
    echo "$url"
    curl --silent -o "jpg/$jpg" "$url"
done

# Create ALTO and text files with Tesseract.
MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
for jpg in *.jpg; do
    echo "$jpg"
    # This path points to a locally built tesseract binary; adjust it to your installation.
    ~/src/github/tesseract-ocr/tesseract/bin/ndebug/clang/tesseract "$jpg" "../tess_$MODEL/${jpg%.jpg}" -l "ubma/$MODEL" alto txt
done 2>&1 | tee "../tess_$MODEL/ocr.log"
)

# Create TEI file from ALTO files.
tei alto2tei --manifest=https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest tess_$MODEL/*.xml > $name.tei

Instead of german_print, other models such as GT4HistOCR or frak2021 can be used.
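
To try several models on the same images, the OCR step of the script can be wrapped in a function that writes each model's output to its own tess_<model> directory. This is only a minimal sketch under a few assumptions: the function name ocr_with_model and the TESSERACT override variable are hypothetical and not part of the original script, the tesseract binary is expected on PATH unless TESSERACT is set, and the models are assumed to be installed under the ubma/ prefix as in the script above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run the OCR step for one model.
# Usage: ocr_with_model MODEL [IMAGE_DIR]
# TESSERACT is an assumed override variable (for example TESSERACT=echo
# for a dry run); by default the tesseract binary on PATH is used.
ocr_with_model() {
    local model=$1 img_dir=${2:-jpg} out_dir jpg
    out_dir="tess_$model"
    mkdir -p "$out_dir"
    for jpg in "$img_dir"/*.jpg; do
        [ -e "$jpg" ] || continue    # no images found, nothing to do
        # Same arguments as in the script above: ALTO and text output.
        "${TESSERACT:-tesseract}" "$jpg" "$out_dir/$(basename "$jpg" .jpg)" \
            -l "ubma/$model" alto txt
    done
}

# Example: one run per model, each with its own output directory.
# for model in german_print GT4HistOCR frak2021; do ocr_with_model "$model"; done
```

Keeping one directory per model means the results can be compared page by page afterwards, because the file names are identical across directories.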

Comparison of OCR results

Two OCR results, or an OCR result and a perfect transcription (ground truth), can be compared using DingleHopper. The following commands create an HTML report for each page.

# Compare different ALTO files.
dir1=tess_german_print_11
dir2=tess_german_print_12
mkdir -p reports/${dir1}_${dir2}
for xml in "$dir1"/*.xml; do
    page=$(basename "$xml" .xml)
    dinglehopper "$xml" "$dir2/$page.xml" "$page" "reports/${dir1}_${dir2}"
done
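
If one of the two directories is incomplete, for example because an OCR run was interrupted, the loop above calls dinglehopper with a file that does not exist. The following sketch skips such pages and reports them on standard error. The function name compare_dirs, the optional third argument, and the DINGLEHOPPER override variable are assumptions made for illustration; the dinglehopper arguments themselves are the same as in the commands above.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around the comparison loop above.
# Usage: compare_dirs DIR1 DIR2 [REPORT_DIR]
# DINGLEHOPPER is an assumed override variable (for example
# DINGLEHOPPER=echo for a dry run); by default dinglehopper on PATH is used.
compare_dirs() {
    local dir1=$1 dir2=$2 reports=${3:-} xml page
    # Default report directory, named after both input directories.
    [ -n "$reports" ] || reports="reports/$(basename "$dir1")_$(basename "$dir2")"
    mkdir -p "$reports"
    for xml in "$dir1"/*.xml; do
        [ -e "$xml" ] || continue    # no ALTO files found
        page=$(basename "$xml" .xml)
        if [ ! -e "$dir2/$page.xml" ]; then
            echo "skipping $page: no counterpart in $dir2" >&2
            continue
        fi
        "${DINGLEHOPPER:-dinglehopper}" "$xml" "$dir2/$page.xml" "$page" "$reports"
    done
}

# Example with the directory names used above:
# compare_dirs tess_german_print_11 tess_german_print_12
```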
