Home
Welcome to the digitue-gt wiki!
Here we describe the OCR process used for the DigiTheo collection of the University Library of Tübingen. Some examples of the generated full texts for books and journals from that collection are shown using the UB Mannheim instance of the DFG Viewer.
In addition, selected prints of other OpenDigi collections were also processed with the same OCR tool and model. They are shown here, too.
Open tasks / issues:
- Not all image filenames use the pattern *_nnnn.jp2 or *_nnnnn.jp2, with nnnn or nnnnn being the page number (4 or 5 digits with leading zeros). Some add a, b and maybe other characters after the page number. Those cases, which occur in 6 books and 227 journal issues, were originally not handled, so OCR was missing for such pages. Status: fixed.
- Text recognition for journals of DigiTheo was still unfinished. Status: finished 2022-08-28.
- Integrate ALTO results into the environment at UB Tübingen and show the OCR at https://opendigi.ub.uni-tuebingen.de/digitue/.
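The filename patterns described in the first task can be told apart with a small check. The following is only a sketch: the function name classify_image_name is hypothetical, and it assumes that the extra characters after the page number are lowercase letters.

```shell
# classify_image_name is a hypothetical helper, not part of the original workflow.
# It prints "plain" for names like *_0001.jp2 or *_00001.jp2,
# "suffixed" for names with lowercase letters after the page number
# (assumption: the extra characters are lowercase letters),
# and "other" for everything else.
classify_image_name() {
    if printf '%s\n' "$1" | grep -Eq '_[0-9]{4,5}\.jp2$'; then
        echo plain
    elif printf '%s\n' "$1" | grep -Eq '_[0-9]{4,5}[a-z]+\.jp2$'; then
        echo suffixed
    else
        echo other
    fi
}
```

Calling classify_image_name for every JP2 name found in a METS file gives a quick overview of how many pages need the special handling.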
The following instructions assume that a Linux or similar (macOS is also fine) environment is available. WSL on Windows 10/11 can also be used.
Some basic programs like curl and python3 should be installed first.
The OCR software kraken is installed in a virtual Python 3 environment which must be activated for running text recognition:
python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools
# Install kraken into the activated environment.
pip install kraken
These commands can be used to get the list of books and journals in a collection (here: theo) of UB Tübingen:
# Get list of books. The patterns are quoted so that the shell cannot expand them as globs.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep 'a.href=..opendigi' | sed 's/.*opendigi.//' | sed 's/.>//' | tee monographien
# Get list of journals.
curl -sS http://idb.ub.uni-tuebingen.de/digitue/theo | grep 'a.href...digitue/theo....' | grep -v html | sed 's/.*theo.//' | sed 's/.>//' | tee zeitschriften
# Get list of journal issues.
for zeitschrift in $(cat zeitschriften); do curl -sS "http://idb.ub.uni-tuebingen.de/opendigi/$zeitschrift" | grep 'a.href=..opendigi.*span' | grep -v pdf | sed 's/.*opendigi.//' | sed 's/.>.*//'; done | tee ausgaben
With those lists it is easy to get all METS files:
# Get METS files for books.
for i in $(cat monographien); do dir=Monographien/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done
# Get METS files for journals.
for i in $(cat ausgaben); do dir=Zeitschriften/$i; mkdir -pv $dir; (cd $dir; wget http://idb.ub.uni-tuebingen.de/opendigi/$i/mets); done
The METS files can be used to get the list of images (here: only for books) and to download them as JPEG files:
# Get images which have 4 digits for the image number.
for i in $(cat monographien); do dir=Monographien/$i; mkdir -pv $dir; echo $i; (cd $dir; n=1; while test $n -le 999; do file=$(printf %s_%04u.jpg $i $n); url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%04u.jp2/full/full/0/default.jpg $i $i $n); (test -f $file && echo $file exists) || (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) || break; n=$(expr $n + 1); done); done
# Get images which have 5 digits for the image number.
for i in $(cat monographien); do dir=Monographien/$i; mkdir -pv $dir; echo $i; (cd $dir; n=1; while test $n -le 999; do file=$(printf %s_%05u.jpg $i $n); url=$(printf http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s_%05u.jp2/full/full/0/default.jpg $i $i $n); (test -f $file && echo $file exists) || (grep -q ${file/jpg/jp2} mets && echo $file && curl -sS -o $file $url) || break; n=$(expr $n + 1); done); done
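Instead of counting page numbers, the image names can also be read directly from the METS file, which covers 4-digit and 5-digit names in one pass. This is a sketch under assumptions: the helper names mets_image_names, image_url and fetch_images are hypothetical, names with letter suffixes are assumed to use lowercase letters, and such names are assumed to follow the same image URL pattern as the plain ones.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the distinct JP2 names of book or issue $2 that occur in METS file $1.
# Assumption: suffixes after the page number are lowercase letters.
mets_image_names() {
    local mets=$1 id=$2
    grep -o "${id}_[0-9]\{4,5\}[a-z]*\.jp2" "$mets" | sort -u
}

# Build the image URL for one JP2 name, following the URL pattern used above.
image_url() {
    local id=$1 jp2=$2
    printf 'http://idb.ub.uni-tuebingen.de/opendigi/image/%s/%s/full/full/0/default.jpg\n' "$id" "$jp2"
}

# Download all images of $1 which are still missing in the current directory.
# Expects the METS file as "mets" in the current directory.
fetch_images() {
    local id=$1 jp2 file
    mets_image_names mets "$id" | while read -r jp2; do
        file=${jp2%.jp2}.jpg
        test -f "$file" || curl -sS -o "$file" "$(image_url "$id" "$jp2")"
    done
}
```

A call like `(cd Monographien/baur1827 && fetch_images baur1827)` would then fetch the missing images of one book.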
The text recognition initially used kraken with the reichsanzeiger model from UB Mannheim.
Get and install that model once before running OCR.
# Get and install model file. Up to now, model 23 is the best one, so get that.
# Make sure that the model directory exists, otherwise curl cannot write the file.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/reichsanzeiger.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/reichsanzeiger-gt/reichsanzeiger_23.mlmodel
As some OCR results with italic text showed high error rates, the model was later replaced by a new model which had been improved by additional training.
# Get and install improved model file.
mkdir -p ~/.config/kraken
curl -o ~/.config/kraken/digitue_best.mlmodel https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/digitue-gt/digitue_best.mlmodel
Then run the OCR on selected images:
# Don't use multithreading. It wastes a lot of CPU resources and does not produce the OCR much faster.
export OMP_NUM_THREADS=1
# Create ALTO files for all page images.
kraken -a -I "*/*.jpg" -o .xml segment -bl ocr -m digitue_best.mlmodel
The created ALTO files contain not only the detected words with coordinates and recognition confidence values, but also the glyphs with their corresponding information. The word and glyph confidences can be used to detect problematic OCR results. The DFG Viewer does not need the glyph level, so that rather large part of the data can be removed in an additional step.
For maximum throughput it is possible to run one kraken process per CPU core (or two if hyperthreading is supported).
OCR production for DigiTheo used up to 32 threads on one server and up to 44 threads on another server.
Parts of the production also used GPU resources, but it was not possible to run more than 5 kraken processes on a GPU with 24 GB RAM.
The GPU is only used for small parts of the OCR process and does not improve the performance.
Therefore using the GPU was stopped after some days, and the best performance was achieved using single-threaded, CPU-based kraken processes only.
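Running several single-threaded kraken processes side by side can be sketched with xargs. This is not necessarily how the production runs were started: the function name ocr_parallel is hypothetical, and the sketch assumes one kraken process per book or issue directory and an xargs which supports -P (GNU and BSD xargs do).

```shell
# Hypothetical helper: OCR each given directory in its own kraken process,
# running at most $1 processes at the same time.
ocr_parallel() {
    local jobs=$1
    shift
    # Nothing to do without directories.
    [ "$#" -gt 0 ] || return 0
    # Keep every kraken process single threaded.
    export OMP_NUM_THREADS=1
    # xargs replaces {} by one directory per kraken call;
    # kraken itself expands the *.jpg pattern given with -I.
    find "$@" -maxdepth 0 -type d -print0 |
        xargs -0 -P "$jobs" -I{} \
            kraken -a -I '{}/*.jpg' -o .xml segment -bl ocr -m digitue_best.mlmodel
}

# Example (not run here): one process per CPU core for all book directories.
# ocr_parallel "$(nproc)" Monographien/*
```

Splitting the work by directory keeps the processes independent, so a failed run only affects one book or issue.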
The model which was used for kraken OCR is able to recognize the Unicode character COMBINING LATIN SMALL LETTER E. Normally that combining character should only be used for German umlauts, which means it should only occur after A, O, U, a, o, u. To find other occurrences in the ALTO results, this command can be executed:
find * -name "*.xml"|xargs grep ͤ|grep -v -E "Aͤ|Oͤ|Uͤ|aͤ|oͤ|uͤ"|grep -v 'CONTENT="ͤ"'
Especially the combination of a normal umlaut and COMBINING LATIN SMALL LETTER E occurs rather often and can be fixed with a simple replacement:
find * -name "*.xml"|xargs grep -l -E "Äͤ|Öͤ|Üͤ|äͤ|öͤ|üͤ"|xargs perl -pi -e 's/Äͤ/Aͤ/g; s/Öͤ/Oͤ/g; s/Üͤ/Uͤ/g; s/äͤ/aͤ/g; s/öͤ/oͤ/g; s/üͤ/uͤ/g'
After that there still remain unexpected COMBINING LATIN SMALL LETTER E, for example "Biſchdͤffe", "Auflbͤſungen" or "hebrdͤiſch". These can be fixed separately or simply be ignored as their number is relatively small.
- MDOG_1898_000 – Antiqua
- MDOG_1968_100 – Antiqua
- pic_01 – historical Antiqua, partly italic, Italian, b/w, additional training advisable
- andlaw1863
- baur1827
- baur1834
- baur1836
- baur1855
- baur1860
- baur1861
- beck1836
- beck1838
- beck1843
- beck1848
- beck1879
- bengel1831
- dieringer1849
- drey1819
- drey1834
- evthfaktue1830
- ewald1842
- ewald1848
- feilmoser1830
- flatt1806
- gottschick1891
- gottschick1893
- gottschick1901
- gratz_1817
- harless1834
- hase1855
- hefele1870
- hirscher1843
- hirscher1849
- kautysch1883
- kautysch1900
- keppler1883
- knobel1844
- koestlin1815
- kuhn1838
- linsenmann1877 – italic, additional training done, new result with digitue_9.mlmodel
- mack1836
- marheineke1833 – historical Antiqua
- moehler1834 – Fraktur
- neander1836 – Fraktur
- oehler1845 – Fraktur
- oehler1872 – Fraktur
- osiander1837 – Fraktur
- palmer1839 – Fraktur
- sack1836 – Fraktur
- schlatter1896 – Fraktur
- schmid1852 – Fraktur
- steudel1814 – Fraktur
- steudel1823 – historical Antiqua
- steudel1831 – Fraktur
- steudel1832 – Fraktur
- steudel1837 – Fraktur
- storr1803 – Fraktur
- strauss1839 – Fraktur
- strauss1848 – Fraktur
- walz_1976 – typewriter, additional training advisable
- weiysaecker1877 – italic, additional training done, new result with digitue_15.mlmodel
- zeller1871 – Fraktur
- zeller1880 – Fraktur
- agtck_1833_04 – Fraktur
- akzs_1862 – Fraktur
- akzs_1871
- artl_005 – Fraktur
- artl_009
- artl_018
- artl_040
- artl_062
- artl_064
- artl_066
- dzcw_1850 – Fraktur
- dzcw_1851
- dzcw_1852
- dzcw_1853
- dzcw_1854
- dzcw_1855
- dzcw_1856
- dzcw_1857
- dzcw_1858
- dzcw_1859
- dzcw_1860
- dzcw_1861
- jdth_1856 – Fraktur
- jpth_1875 – historical Antiqua
- jthkr_1806_01_01 – historical Antiqua
- kath_1823_009 – Fraktur
- kath_1826_022
- kath_1832_046
- kath_1833_049
- kath_1841_080
- kath_1851_004
- kath_1864_011
- kath_1877_038
- kath_1880_043
- kath_1907_035
- litrdsch_1884 – Fraktur
- stml_1890_039 – Fraktur
- thjbt_1842 – Antiqua
- thlb_001_1880 – Fraktur
- thlblb_1866 – Fraktur
- thlz_007_1882 – historical Antiqua
- thlz_008_1883
- thlz_009_1884
- thlz_010_1885
- thlz_011_1886
- thlz_012_1887
- thq_1819 – Fraktur
- thstkr_1828 – Fraktur
- tzth_1828_01 – Antiqua
- tzw_1813_01 – Fraktur
- zlthk_1840 – Antiqua
- zpk_1844_07 – Fraktur
- zpk_1844_08
- zpk_1851_22
- zpk_1852_24 – Fraktur
- zpk_1855_29
- zpk_1857_34
- zpk_1869_58
- zpk_1871_62
- zpk_1872_64
- zpk_1874_68
- zpk_1875_70
- zpkt_1832_01 – Fraktur
- zth_1843_10 – Fraktur, b/w
- zth_1849_21
- AhI8_qt – incunable, Latin and German (1498)
- Bd231_qt_1
- Eg1596
- Fc271
- FoIII926_02
- FoXXIX53
- JfI155
- Jk2_qt
- KeXXI21 – Fraktur (1791)
- LXVI119_qt_1