Stefan Weil edited this page Nov 13, 2023 · 54 revisions

Welcome to the digitue-gt wiki!

Here we describe the OCR process developed for the DigiTheo collection of the University Library of Tübingen. Some examples of the generated full texts for books and journals from that collection are shown using the UB Mannheim instance of the DFG Viewer.

In addition, selected prints from other OpenDigi collections were processed with the same OCR tool and model. They are shown here, too.

TODO

Open tasks / issues:

  • Not all image filenames use the pattern *_nnnn.jp2 or *_nnnnn.jp2 with nnnn or nnnnn being the page number (4 or 5 digits with leading zeros). Some append an a, a b, or possibly other characters after the page number. Those cases, which occur in 6 books and 227 journal issues, were initially not handled, so OCR was missing for such pages. Fixed.
  • Text recognition for the journals of DigiTheo still unfinished. Finished 2022-08-28.
  • Integrate ALTO results into environment at UB Tübingen and show the OCR at https://opendigi.ub.uni-tuebingen.de/digitue/.
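The filename rule from the first task can be checked with a small shell test. A minimal sketch (the sample names are made up for illustration) that flags names not matching *_nnnn.jp2 or *_nnnnn.jp2:

```shell
# Flag image names whose page-number part is not exactly 4 or 5 digits
# (e.g. names with an extra "a" or "b" after the number).
for f in example_0001.jp2 example_0001a.jp2 example_00012.jp2; do
  if [[ "$f" =~ _[0-9]{4,5}\.jp2$ ]]; then
    echo "ok:     $f"
  else
    echo "handle: $f"
  fi
done
```

Only the name with the trailing "a" is reported as needing special handling.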

Preparing the environment for text recognition

The following instructions assume a Linux or similar environment (macOS also works). WSL on Windows 10/11 can be used as well.

Some basic programs such as curl and python3 should be installed first. The OCR software kraken is installed in a Python 3 virtual environment, which must be activated before running text recognition:

python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools
pip install kraken

Getting the data for text recognition (OCR)

These command lines can be used to get the lists of books and journals in a collection (here: theo) of UB Tübingen:

# Get list of books.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep a.href=..opendigi | sed s/.*opendigi.// | sed 's/.>//' | tee monographien
# Get list of journals.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep a.href...digitue/theo.... | grep -v html | sed s/.*theo.// | sed 's/.>//' | tee zeitschriften
# Get list of journal issues.
for zeitschrift in $(cat zeitschriften); do curl -sS http://idb.ub.uni-tuebingen.de/opendigi/$zeitschrift|grep a.href=..opendigi.*span|grep -v pdf|sed s/.*opendigi.//|sed 's/.>.*//'; done|tee ausgaben

With those lists it is easy to get all METS files:

# Get METS files for books.
for i in $(cat monographien); do dir=Monographien/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done
# Get METS files for journals.
for i in $(cat ausgaben); do dir=Zeitschriften/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done

The METS files can be used to get the list of images (here: only for books) and to download them as JPEG files:

# Get images which have 4 digits for the image number.
for i in $(cat monographien); do
  dir=Monographien/$i; mkdir -pv $dir; echo $i
  (cd $dir; n=1
   while test $n -le 999; do
     file=$(printf %s_%04u.jpg $i $n)
     url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%04u.jp2/full/full/0/default.jpg $i $i $n)
     (test -f $file && echo $file exists) ||
       (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) ||
       break
     n=$(expr $n + 1)
   done)
done
# Get images which have 5 digits for the image number.
for i in $(cat monographien); do
  dir=Monographien/$i; mkdir -pv $dir; echo $i
  (cd $dir; n=1
   while test $n -le 999; do
     file=$(printf %s_%05u.jpg $i $n)
     url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%05u.jp2/full/full/0/default.jpg $i $i $n)
     (test -f $file && echo $file exists) ||
       (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) ||
       break
     n=$(expr $n + 1)
   done)
done

Running text recognition

The text recognition initially used kraken with the reichsanzeiger model from UB Mannheim. Get and install that model once before running OCR.

# Get and install model file. Up to now, model 23 is the best one, so get that.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/reichsanzeiger.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/reichsanzeiger-gt/reichsanzeiger_23.mlmodel

As some OCR results for italic text showed high error rates, the model was later replaced by a new model that had been improved by additional training.

# Get and install improved model file.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/digitue_best.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/digitue-gt/digitue_best.mlmodel

Then run the OCR on selected images:

# Don't use multithreading. It wastes a lot of CPU resources and does not make the OCR much faster.
export OMP_NUM_THREADS=1

# Create ALTO files for all page images.
kraken -a -I "*/*.jpg" -o .xml segment -bl ocr -m digitue_best.mlmodel

The created ALTO files contain not only the detected words with coordinates and recognition confidence values, but also the glyphs with their corresponding information. The word and glyph confidences can be used to detect problematic OCR results. The DFG Viewer does not need the glyph level, so that rather large part of the data can be removed in an additional step.
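That additional step is not spelled out above. A minimal sketch using perl (assuming each <Glyph> element, whether self-closing or with nested children such as <Variant>, should be dropped wholesale) could look like this:

```shell
# Remove all <Glyph> elements from the ALTO files in place.
# -0777 reads each file as a single string so multi-line elements match, too.
find * -name "*.xml" | xargs perl -0777 -pi -e \
  's/<Glyph\b[^>]*\/>\s*//g; s/<Glyph\b.*?<\/Glyph>\s*//gs'
```

The word-level <String> elements and their confidence attributes remain untouched.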

For maximum throughput it is possible to run one kraken process per CPU core (or two per core if hyperthreading is supported). OCR production for DigiTheo used up to 32 parallel processes on one server and up to 44 on another server.
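One way to drive such a pool of single-threaded processes is to let xargs distribute the per-book directories over a fixed number of workers. A sketch, assuming the Monographien/ layout from above and 8 workers (adjust -P to the available core count):

```shell
# Run up to 8 single-threaded kraken processes in parallel,
# one per book directory under Monographien/.
export OMP_NUM_THREADS=1
ls -d Monographien/*/ | xargs -P 8 -I {} sh -c \
  'cd "{}" && kraken -a -I "*.jpg" -o .xml segment -bl ocr -m digitue_best.mlmodel'
```

Each worker changes into one book directory and runs the same kraken command as shown above, so the ALTO files end up next to their page images.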

OCR progress over time

Parts of the production also used GPU resources, but it was not possible to run more than 5 kraken processes on a GPU with 24 GB RAM. The GPU is only used for small parts of the OCR process and did not improve the overall performance. Therefore GPU usage was stopped after some days, and the best performance was achieved using single-threaded, CPU-based kraken processes only.

kraken process view

Cleaning the results

The model which was used for kraken OCR is able to recognize the Unicode character COMBINING LATIN SMALL LETTER E (U+0364). Normally that combining character should only be used for German umlauts, which means it should only occur after A, O, U, a, o or u. To find other occurrences in the ALTO results, this command can be executed:

find * -name "*.xml"|xargs grep  ͤ|grep -v -E "Aͤ|Oͤ|Uͤ|aͤ|oͤ|uͤ"|grep -v 'CONTENT="ͤ"'

Especially the combination of a normal umlaut and a COMBINING LATIN SMALL LETTER E occurs rather often and can be fixed with a simple replacement:

find * -name "*.xml"|xargs grep -l -E "Äͤ|Öͤ|Üͤ|äͤ|öͤ|üͤ"|xargs perl -pi -e 's/Äͤ/Aͤ/g; s/Öͤ/Oͤ/g; s/Üͤ/Uͤ/g; s/äͤ/aͤ/g; s/öͤ/oͤ/g; s/üͤ/uͤ/g'

After that, some unexpected occurrences of COMBINING LATIN SMALL LETTER E still remain, for example in "Biſchdͤffe", "Auflbͤſungen" or "hebrdͤiſch". These can be fixed separately or simply be ignored, as their number is relatively small.

Results (examples)

DigiAltori

DigiKrimdok

  • pic_01 – historical Antiqua, partly italic, Italian, black and white, additional training advisable

DigiTheo

Monographs

Journals (selection)

DigiTue