Stefan Weil edited this page Nov 13, 2023 · 54 revisions

Welcome to the digitue-gt wiki!

Here we describe the OCR process developed for the DigiTheo collection of the University Library of Tübingen. Some examples of the generated full texts for books and journals from that collection are shown using the UB Mannheim instance of the DFG Viewer.

In addition, selected prints from other OpenDigi collections were processed with the same OCR tool and model. They are shown here, too.

TODO

Open tasks / issues:

  • Not all image filenames use the pattern *_nnnn.jp2 or *_nnnnn.jp2 with nnnn or nnnnn being the page number (4 or 5 digits with leading zeros). Some append an a, a b, or possibly other characters after the page number. Those cases, which occur in 6 books and 227 journal issues, were initially not handled, so OCR was missing for such pages. Fixed.
  • Text recognition for the journals of DigiTheo still unfinished. Finished 2022-08-28.
  • Integrate ALTO results into environment at UB Tübingen and show the OCR at https://opendigi.ub.uni-tuebingen.de/digitue/.
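The filename rule from the first task can be checked with a small shell test. A minimal sketch (the sample names are made up for illustration) that flags names not matching *_nnnn.jp2 or *_nnnnn.jp2:

```shell
# Flag image names whose page-number part is not exactly 4 or 5 digits
# (e.g. names with an extra "a" or "b" after the number).
for f in example_0001.jp2 example_0001a.jp2 example_00012.jp2; do
  if [[ "$f" =~ _[0-9]{4,5}\.jp2$ ]]; then
    echo "ok:     $f"
  else
    echo "handle: $f"
  fi
done
```

Only the name with the trailing "a" is reported as needing special handling.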

Preparing the environment for text recognition

The following instructions assume a Linux or similar environment (macOS also works). WSL on Windows 10/11 can be used as well.

Some basic programs such as curl and python3 should be installed first. The OCR software kraken is installed in a Python 3 virtual environment, which must be activated before running text recognition:

python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools
pip install kraken

Getting the data for text recognition (OCR)

These command lines can be used to get the lists of books and journals in a collection (here: theo) of UB Tübingen:

# Get list of books.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep a.href=..opendigi | sed s/.*opendigi.// | sed 's/.>//' | tee monographien
# Get list of journals.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep a.href...digitue/theo.... | grep -v html | sed s/.*theo.// | sed 's/.>//' | tee zeitschriften
# Get list of journal issues.
for zeitschrift in $(cat zeitschriften); do curl -sS http://idb.ub.uni-tuebingen.de/opendigi/$zeitschrift|grep a.href=..opendigi.*span|grep -v pdf|sed s/.*opendigi.//|sed 's/.>.*//'; done|tee ausgaben

With those lists it is easy to get all METS files:

# Get METS files for books.
for i in $(cat monographien); do dir=Monographien/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done
# Get METS files for journals.
for i in $(cat ausgaben); do dir=Zeitschriften/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done

The METS files can be used to get the list of images (here: only for books) and to download them as JPEG files:

# Get images which have 4 digits for the image number.
for i in $(cat monographien); do
  dir=Monographien/$i; mkdir -pv $dir; echo $i
  (cd $dir; n=1
   while test $n -le 999; do
     file=$(printf %s_%04u.jpg $i $n)
     url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%04u.jp2/full/full/0/default.jpg $i $i $n)
     (test -f $file && echo $file exists) ||
       (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) ||
       break
     n=$(expr $n + 1)
   done)
done
# Get images which have 5 digits for the image number.
for i in $(cat monographien); do
  dir=Monographien/$i; mkdir -pv $dir; echo $i
  (cd $dir; n=1
   while test $n -le 999; do
     file=$(printf %s_%05u.jpg $i $n)
     url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%05u.jp2/full/full/0/default.jpg $i $i $n)
     (test -f $file && echo $file exists) ||
       (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) ||
       break
     n=$(expr $n + 1)
   done)
done

Running text recognition

The text recognition initially used kraken with the reichsanzeiger model from UB Mannheim. Get and install that model once before running OCR.

# Get and install model file. Up to now, model 23 is the best one, so get that.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/reichsanzeiger.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/reichsanzeiger-gt/reichsanzeiger_23.mlmodel

As some OCR results for italic text showed high error rates, the model was later replaced by a new model that had been improved by additional training.

# Get and install improved model file.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/digitue_best.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/digitue-gt/digitue_best.mlmodel

Then run the OCR on selected images:

# Don't use multithreading. It wastes a lot of CPU resources and does not make the OCR much faster.
export OMP_NUM_THREADS=1

# Create ALTO files for all page images.
kraken -a -I "*/*.jpg" -o .xml segment -bl ocr -m digitue_best.mlmodel

The created ALTO files contain not only the detected words with coordinates and recognition confidence values, but also the glyphs with their corresponding information. The word and glyph confidences can be used to detect problematic OCR results. The DFG Viewer does not need the glyph level, so that rather large part of the data can be removed in an additional step.
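That additional step is not spelled out above. A minimal sketch using perl (assuming each <Glyph> element, whether self-closing or with nested children such as <Variant>, should be dropped wholesale) could look like this:

```shell
# Remove all <Glyph> elements from the ALTO files in place.
# -0777 reads each file as a single string so multi-line elements match, too.
find * -name "*.xml" | xargs perl -0777 -pi -e \
  's/<Glyph\b[^>]*\/>\s*//g; s/<Glyph\b.*?<\/Glyph>\s*//gs'
```

The word-level <String> elements and their confidence attributes remain untouched.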

For maximum throughput it is possible to run one kraken process per CPU core (or two per core if hyperthreading is supported). OCR production for DigiTheo used up to 32 parallel processes on one server and up to 44 on another server.
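One way to drive such a pool of single-threaded processes is to let xargs distribute the per-book directories over a fixed number of workers. A sketch, assuming the Monographien/ layout from above and 8 workers (adjust -P to the available core count):

```shell
# Run up to 8 single-threaded kraken processes in parallel,
# one per book directory under Monographien/.
export OMP_NUM_THREADS=1
ls -d Monographien/*/ | xargs -P 8 -I {} sh -c \
  'cd "{}" && kraken -a -I "*.jpg" -o .xml segment -bl ocr -m digitue_best.mlmodel'
```

Each worker changes into one book directory and runs the same kraken command as shown above, so the ALTO files end up next to their page images.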

OCR progress over time

Parts of the production also used GPU resources, but it was not possible to run more than 5 kraken processes on a GPU with 24 GB RAM. The GPU is only used for small parts of the OCR process and did not improve the overall performance. Therefore GPU usage was stopped after some days, and the best performance was achieved using single-threaded, CPU-based kraken processes only.

kraken process view

Cleaning the results

The model which was used for kraken OCR is able to recognize the Unicode character COMBINING LATIN SMALL LETTER E (U+0364). Normally that combining character should only be used for German umlauts, which means it should only occur after A, O, U, a, o or u. To find other occurrences in the ALTO results, this command can be executed:

find * -name "*.xml"|xargs grep  ͤ|grep -v -E "Aͤ|Oͤ|Uͤ|aͤ|oͤ|uͤ"|grep -v 'CONTENT="ͤ"'

Especially the combination of a normal umlaut and a COMBINING LATIN SMALL LETTER E occurs rather often and can be fixed with a simple replacement:

find * -name "*.xml"|xargs grep -l -E "Äͤ|Öͤ|Üͤ|äͤ|öͤ|üͤ"|xargs perl -pi -e 's/Äͤ/Aͤ/g; s/Öͤ/Oͤ/g; s/Üͤ/Uͤ/g; s/äͤ/aͤ/g; s/öͤ/oͤ/g; s/üͤ/uͤ/g'

After that, some unexpected occurrences of COMBINING LATIN SMALL LETTER E still remain, for example in "Biſchdͤffe", "Auflbͤſungen" or "hebrdͤiſch". These can be fixed separately or simply be ignored, as their number is relatively small.

Results (examples)

DigiAltori

DigiKrimdok

  • pic_01 – historical Antiqua, partly italic, Italian, black and white, additional training advisable

DigiTheo

Monographs

Journals (selection)

DigiTue