OCR with Tesseract
This is a simple and very fast method for creating OCR for the digital collections of Universitätsbibliothek Tübingen. Processing the 167 pages of the example below can take less than 60 seconds on a MacBook Pro, or as little as 10 seconds on a server.
The example creates OCR for a German book from 1556, Wahrhafftig Historia unnd beschreibung einer Landtschafft der Wilden, Nacketen, Grimmigen Menschfresser Leuthen in der Newen welt America gelegen (https://opendigi.ub.uni-tuebingen.de/opendigi/FoXXVIII9_qt). The process produces ALTO XML files (compatible with DFG Viewer) and a TEI file (suitable for OpenDigi).
The following shell script includes all steps. It runs single-threaded, and the OCR part finishes in less than 20 minutes on a MacBook Pro.
#!/usr/bin/env bash
# Name of book from Universitätsbibliothek Tübingen (part of the URL).
name=FoXXVIII9_qt
# Get IIIF manifest.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest
# Get METS file.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/mets
# Get images. Be patient, that takes some time (1:30 min with a good connection).
mkdir -p jpg
for url in $(tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/'); do
  # Derive the local file name from the .jp2 part of the image URL.
  jpg=$(echo "$url" | perl -pe 's:^.*/(.*)\.jp2.*:$1.jpg:')
  echo "$url"
  curl --silent -o "jpg/$jpg" "$url"
done
# Create ALTO and text files with Tesseract.
MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
for jpg in *.jpg; do
  echo "$jpg" && tesseract "$jpg" "../tess_$MODEL/${jpg%.jpg}" -l "ubma/$MODEL" alto txt
done 2>&1 | tee "../tess_$MODEL/ocr.log"
)
# Create TEI file from ALTO files.
tei alto2tei --manifest=https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest tess_$MODEL/*.xml > $name.tei
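The download loop in the script derives each local file name from the `.jp2` part of the image URL. The following is a minimal sketch of that mapping using only shell parameter expansion instead of Perl. The function name `url_to_jpg` and the example URL (including its IIIF parameters) are assumptions for illustration, not taken from the OpenDigi METS file.

```shell
#!/usr/bin/env bash
# Derive the local JPEG file name from an image URL.
# Assumes the URL contains "<page>.jp2" followed by further path segments.
url_to_jpg() {
  stem=${1%%.jp2*}   # drop ".jp2" and everything after it
  stem=${stem##*/}   # drop everything up to the last "/"
  printf '%s.jpg\n' "$stem"
}

# Hypothetical URL, for illustration only.
url_to_jpg "https://example.org/iiif/FoXXVIII9_qt_0001.jp2/full/full/0/default.jpg"
# prints FoXXVIII9_qt_0001.jpg
```

Unlike the Perl expression, this variant cuts at the first `.jp2` in the URL rather than the last, which only matters if a URL contains `.jp2` more than once.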
That example uses the latest model, german_print, but older models such as GT4HistOCR or frak2021 can be used, too.
Instead of running single-threaded, the OCR can use all available CPU cores. This reduces the OCR time to less than a minute on a MacBook Pro:
MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
ls *.jpg | /usr/bin/time -p parallel --keep-order tesseract {} ../tess_$MODEL/{.} -l ubma/$MODEL alto txt 2>&1 | tee ../tess_$MODEL/ocr.log
)
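With many Tesseract processes running at once, a failure on a single page is easy to miss in the log. The following sketch lists images for which no non-empty ALTO file was written. It assumes the directory layout used above (`jpg` for images, `tess_$MODEL` for results, ALTO files ending in `.xml`); the helper name `missing_ocr` is hypothetical.

```shell
#!/usr/bin/env bash
# List images that have no ALTO file in the output directory.
# Usage: missing_ocr IMAGE_DIR OUTPUT_DIR
missing_ocr() (
  for img in "$1"/*.jpg; do
    [ -e "$img" ] || continue              # glob matched nothing
    base=$(basename "$img" .jpg)
    # Report the page if its ALTO file is missing or empty.
    if [ ! -s "$2/$base.xml" ]; then
      echo "$base"
    fi
  done
)

# Prints nothing when every page has a result (or when the directories do not exist).
missing_ocr jpg tess_german_print
```

The pages it reports could then be passed to Tesseract again one by one.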
Tesseract can process not only local image files but also image URLs, so the image files need not be stored locally. The whole process takes only 12 seconds on a server.
MODEL=german_print
mkdir -p tess_$MODEL
tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/' | /usr/bin/time -p parallel --keep-order tesseract {} tess_$MODEL/'{= $_ =~ s,.*/([^/]*).jp2.*,$1, =}' -l ubma/$MODEL alto txt 2>&1 | tee tess_$MODEL/ocr.log
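Because `/usr/bin/time -p` writes its `real`, `user` and `sys` lines to standard error, the `2>&1 | tee` in the commands above also stores them in `ocr.log`. The following sketch reads the wall-clock time back from such a log. The helper name `ocr_seconds` is hypothetical, and the sketch assumes that no Tesseract message in the log consists of the word `real` followed by a single value.

```shell
#!/usr/bin/env bash
# Print the wall-clock seconds that "/usr/bin/time -p" wrote to a log file.
# Usage: ocr_seconds LOGFILE
ocr_seconds() {
  # Keep the last two-field line starting with "real"; print it if one was found.
  awk '$1 == "real" && NF == 2 { t = $2 } END { if (t != "") print t }' "$1"
}

log=tess_german_print/ocr.log
if [ -f "$log" ]; then
  ocr_seconds "$log"
fi
```

Collecting this value for each model or machine gives a simple way to compare runs.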
Two OCR results, or an OCR result and a perfect transcription (ground truth), can be compared using dinglehopper. The following commands create an HTML report for each page.
# Compare different ALTO files.
dir1=tess_german_print_11
dir2=tess_german_print_12
mkdir -p reports/${dir1}_${dir2}
for xml in $dir1/*.xml; do
  dinglehopper "$xml" "$dir2/$(basename "$xml")" "$(basename "$xml" .xml)" "reports/${dir1}_${dir2}"
done
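The comparison loop expects every ALTO file in the first directory to have a counterpart with the same name in the second. The following sketch selects only the pages present in both directories, so that an incomplete OCR run does not produce failed comparisons. The helper name `common_pages` is hypothetical, and the loop only prints what would be compared instead of calling dinglehopper.

```shell
#!/usr/bin/env bash
# Print the ALTO file names that exist in both result directories.
# Usage: common_pages DIR1 DIR2
common_pages() (
  for xml in "$1"/*.xml; do
    name=$(basename "$xml")
    if [ -f "$xml" ] && [ -f "$2/$name" ]; then
      echo "$name"
    fi
  done
)

dir1=tess_german_print_11
dir2=tess_german_print_12
# Assumes file names without whitespace, as in the scripts above.
for name in $(common_pages "$dir1" "$dir2"); do
  echo "would compare $dir1/$name with $dir2/$name"
done
```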