OCR with Tesseract

Stefan Weil edited this page Nov 13, 2023 · 6 revisions

This is a simple and very fast method for creating OCR for the digital collections of Universitätsbibliothek Tübingen. For the 167 pages of the example below, OCR processing can take less than 60 seconds on a MacBook Pro, or as little as 10 seconds on a server.

Example for OCR process with Tesseract

Here OCR is created for a German book from 1556, Wahrhafftig Historia unnd beschreibung einer Landtschafft der Wilden, Nacketen, Grimmigen Menschfresser Leuthen in der Newen welt America gelegen (https://opendigi.ub.uni-tuebingen.de/opendigi/FoXXVIII9_qt). The process produces ALTO XML files (compatible with the DFG viewer) and a TEI file (suitable for OpenDigi).

Single threaded with image download

The following shell script includes all steps and runs single-threaded. The OCR part finishes in less than 20 minutes on a MacBook Pro.

#!/usr/bin/env bash

# Name of book from Universitätsbibliothek Tübingen (part of the URL).
name=FoXXVIII9_qt

# Get IIIF manifest.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest

# Get METS file.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/mets

# Get images. Be patient, that takes some time (1:30 min with a good connection).
mkdir -p jpg
for url in $(tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/'); do
  # Derive the local file name from the ".jp2" segment of the image URL.
  jpg=$(echo "$url" | perl -pe 's:^.*/(.*)\.jp2.*:$1.jpg:')
  echo "$url"
  curl --silent -o "jpg/$jpg" "$url"
done

# Create ALTO and text files with Tesseract.
MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
for jpg in *.jpg; do
  echo "$jpg" && tesseract "$jpg" "../tess_$MODEL/${jpg%.jpg}" -l "ubma/$MODEL" alto txt
done 2>&1 | tee "../tess_$MODEL/ocr.log"
)

# Create TEI file from ALTO files.
tei alto2tei --manifest=https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest tess_$MODEL/*.xml > $name.tei

That example uses the latest model, german_print, but older models such as GT4HistOCR or frak2021 can be used as well.
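
The file name derivation in the download loop can also be done without Perl. The following is only a minimal sketch using shell parameter expansion. The function name `url_to_jpg` and the sample URL are illustrative assumptions, not part of the original script. It assumes that the image URL contains the page name as a `<name>.jp2` path segment, which is what the Perl substitution above relies on.

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the local JPEG name from an image URL.
# Assumption: the URL contains "<name>.jp2" as one path segment.
url_to_jpg() (
  url=$1
  stem=${url%%.jp2*}   # drop the first ".jp2" and everything after it
  stem=${stem##*/}     # drop everything up to the last remaining slash
  printf '%s.jpg\n' "$stem"
)

# Usage with a made-up URL of the expected shape:
# url_to_jpg "https://example.org/image/book_0001.jp2/full/full/0/default.jpg"
```

Inside the loop, `jpg=$(url_to_jpg "$url")` could then replace the `perl` call.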

Multi threaded with image download

Instead of running single threaded, all available CPU cores can be used. This reduces the time for the OCR to less than a minute on a MacBook Pro:

MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
ls *.jpg | /usr/bin/time -p parallel --keep-order tesseract {} ../tess_$MODEL/{.} -l ubma/$MODEL alto txt 2>&1 | tee ../tess_$MODEL/ocr.log
)
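
After a parallel run it can be useful to check that every page actually produced results. The following sketch lists images that have no ALTO (`.xml`) or text (`.txt`) file in the output directory. The function name `missing_outputs` is a hypothetical helper introduced here for illustration, and the check only tests that the files exist, not that their content is valid.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every image that lacks an ALTO (.xml) or
# text (.txt) result. Arguments: image directory, output directory.
missing_outputs() (
  img_dir=$1
  out_dir=$2
  for img in "$img_dir"/*.jpg; do
    [ -e "$img" ] || continue   # no images at all: the glob stayed literal
    base=$(basename "$img" .jpg)
    if [ ! -e "$out_dir/$base.xml" ] || [ ! -e "$out_dir/$base.txt" ]; then
      printf '%s\n' "$base.jpg"
    fi
  done
)

# Usage, from the directory that contains jpg/ and tess_$MODEL/:
# missing_outputs jpg "tess_$MODEL"
```

An empty result means that every image has both output files. Any listed page can be looked up in `ocr.log` to see why Tesseract failed.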

Multi threaded without explicit image download

Tesseract can process not only local image files but also image URLs, so the images do not need to be stored locally. The whole process takes only 12 seconds on a server.

MODEL=german_print
mkdir -p tess_$MODEL
tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/' | /usr/bin/time -p parallel --keep-order tesseract {} tess_$MODEL/'{= $_ =~ s,.*/([^/]*).jp2.*,$1, =}' -l ubma/$MODEL alto txt 2>&1 | tee tess_$MODEL/ocr.log
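
Because `/usr/bin/time -p` writes its summary to standard error and the commands above redirect that into `ocr.log`, the elapsed time can be read back from the log. The following is a sketch that assumes the POSIX `time -p` format, that is, a line of the form `real <seconds>`. The function name `ocr_real_seconds` is a hypothetical helper, not part of the original workflow.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the wall-clock seconds recorded by "time -p".
# Assumption: the log contains a two-field line "real <seconds>".
# If several such lines exist, the last one wins.
ocr_real_seconds() {
  awk '$1 == "real" && NF == 2 { t = $2 } END { if (t != "") print t }' "$1"
}

# Usage:
# ocr_real_seconds "tess_$MODEL/ocr.log"
```

This makes it easy to compare the single-threaded, multi-threaded and URL-based variants on the same machine.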

Comparison of OCR results

Different OCR results, or an OCR result and a perfect transcription (ground truth), can be compared using DingleHopper. The following commands create an HTML report for each page.

# Compare different ALTO files.
dir1=tess_german_print_11
dir2=tess_german_print_12
mkdir -p reports/${dir1}_${dir2}
for xml in $dir1/*.xml; do
  page=$(basename "$xml" .xml)
  dinglehopper "$xml" "$dir2/$page.xml" "$page" "reports/${dir1}_${dir2}"
done
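
To get a single number for a whole book, the per-page results can be averaged. The following sketch rests on an assumption that is not stated on this page: that the report directory also contains one pretty-printed JSON file per page with a line of the form `"cer": <number>`. The function name `mean_cer` is a hypothetical helper. The result is the unweighted mean over pages, so short pages count as much as long ones.

```shell
#!/usr/bin/env bash

# Hypothetical helper: average the character error rate over all JSON
# reports in a directory.
# Assumption: each report has its own line of the form   "cer": 0.0123,
mean_cer() {
  awk -F: '/^[[:space:]]*"cer"[[:space:]]*:/ { sum += $2; n++ }
           END { if (n) printf "%.4f\n", sum / n }' "$1"/*.json
}

# Usage:
# mean_cer "reports/${dir1}_${dir2}"
```

If the reports do not have that layout, a JSON-aware tool is the safer choice for extracting the values.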