OCR with Tesseract
This is a simple and very fast method to create OCR for the digital collections of Universitätsbibliothek Tübingen. Processing the 167 pages of the example below takes less than 20 minutes on a MacBook.
The example creates OCR for a German book from 1556, Wahrhafftig Historia unnd beschreibung einer Landtschafft der Wilden, Nacketen, Grimmigen Menschfresser Leuthen in der Newen welt America gelegen (https://opendigi.ub.uni-tuebingen.de/opendigi/FoXXVIII9_qt). The script produces ALTO XML files (compatible with DFG viewer) and a TEI file (suitable for OpenDigi).
The following shell script includes all steps:
#!/usr/bin/env bash
# Name of book from Universitätsbibliothek Tübingen (part of the URL).
name=FoXXVIII9_qt
# Get IIIF manifest.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest
# Get METS file.
curl -O https://opendigi.ub.uni-tuebingen.de/opendigi/$name/mets
# Get images.
mkdir -p jpg
for url in $(tei format --pretty mets | grep default.jpg | perl -pe 's/.*(https.*jpg).*/$1/'); do
  # Derive the local file name from the JP2 name in the image URL.
  jpg=$(echo "$url" | perl -pe 's:^.*/(.*).jp2.*:$1.jpg:')
  echo "$url"
  curl --silent -o "jpg/$jpg" "$url"
done
# Create ALTO and text files with Tesseract.
MODEL=german_print
mkdir -p tess_$MODEL
(
cd jpg
for jpg in *.jpg; do
  echo "$jpg" && tesseract "$jpg" "../tess_$MODEL/${jpg%.jpg}" -l "ubma/$MODEL" alto txt
done 2>&1 | tee "../tess_$MODEL/ocr.log"
)
# Create TEI file from ALTO files.
tei alto2tei --manifest=https://opendigi.ub.uni-tuebingen.de/opendigi/$name/manifest tess_$MODEL/*.xml > $name.tei
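Before the TEI file is built, it can be worth checking that every downloaded page actually produced an ALTO file. The following is a minimal sketch, not part of the original script: the function name `missing_pages` is hypothetical, and it assumes the `jpg/` and `tess_$MODEL/` layout used above, with one `.xml` file per `.jpg` file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list pages in a JPEG directory that have no
# non-empty ALTO file (<page>.xml) in the Tesseract output directory.
missing_pages() {
  local jpgdir=$1 outdir=$2 jpg base
  for jpg in "$jpgdir"/*.jpg; do
    [ -e "$jpg" ] || continue            # directory contains no JPEG files
    base=$(basename "$jpg" .jpg)
    [ -s "$outdir/$base.xml" ] || echo "$base"
  done
}
```

Called as `missing_pages jpg tess_german_print`, it prints one page name per line and prints nothing when the output is complete.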
Instead of german_print, other models such as GT4HistOCR or frak2021 can be used.
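To compare models, the OCR step can be repeated once per model, each with its own output directory. This is a sketch under assumptions: the function name `ocr_all_models` and the `TESSERACT` override variable are hypothetical (the override only exists so the loop can be dry-run), and the models are assumed to be installed under `ubma/` in the tessdata directory, as in the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run Tesseract on every page in jpg/ once per model,
# writing ALTO and text files to tess_<model>/ and a log to tess_<model>/ocr.log.
# Set TESSERACT=echo to print the commands instead of running them.
ocr_all_models() {
  local model jpg
  for model in "$@"; do
    mkdir -p "tess_$model"
    for jpg in jpg/*.jpg; do
      [ -e "$jpg" ] || continue          # jpg/ contains no JPEG files
      "${TESSERACT:-tesseract}" "$jpg" "tess_$model/$(basename "$jpg" .jpg)" \
        -l "ubma/$model" alto txt
    done 2>&1 | tee "tess_$model/ocr.log"
  done
}
```

For example, `ocr_all_models german_print GT4HistOCR frak2021` would fill three directories that can then be compared page by page.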
Two OCR results, or an OCR result and a perfect transcription (ground truth), can be compared using DingleHopper. The following commands create one HTML report per page.
# Compare different ALTO files.
dir1=tess_german_print_11
dir2=tess_german_print_12
mkdir -p reports/${dir1}_${dir2}
for xml in "$dir1"/*.xml; do
  dinglehopper "$xml" "$dir2/$(basename "$xml")" "$(basename "$xml" .xml)" "reports/${dir1}_${dir2}"
done
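The per-page reports can also be condensed into a short overview. The sketch below assumes that a JSON report containing a numeric `"cer"` field is written next to each HTML report; check the files in the reports directory before relying on this. The function name `summarize_cer` is hypothetical, and the mean is a simple unweighted average over pages.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<page> <cer>" for every JSON report in a
# directory, followed by the unweighted mean character error rate.
# Assumes each report contains a field like "cer": 0.0123.
summarize_cer() {
  local dir=$1 json cer
  for json in "$dir"/*.json; do
    [ -e "$json" ] || continue           # directory contains no JSON reports
    cer=$(sed -n 's/.*"cer"[[:space:]]*:[[:space:]]*\([0-9.eE+-]*\).*/\1/p' "$json" | head -n 1)
    if [ -n "$cer" ]; then
      printf '%s %s\n' "$(basename "$json" .json)" "$cer"
    fi
  done | awk '{ print; sum += $2; n++ } END { if (n) printf "mean %.4f\n", sum / n }'
}
```

It would be called as `summarize_cer "reports/${dir1}_${dir2}"`. Reports without a numeric `"cer"` value are skipped and do not count towards the mean.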