Text Completion, OCR Correction, and Spell Checking: Three Views on the Same Methods. Workshop "Erfahrungen aus der Digitalisierungspraxis: OCR, Volltexte und Präsentationsformen" (Experiences from Digitization Practice: OCR, Full Texts, and Presentation Forms), Munich, 2011/10/12. Marco Büchler, Gerhard Heyer, Natural Language Processing Group, Department of Computer Science, University of Leipzig
2. Agenda
Gap correction, OCR correction, and spell checking: two common tasks
How to mine structures that can be used for correction
Two tasks of reconstruction:
  Identification of damaged words
  Generation of suggestions
Some problems / further steps
3. Correcting words in two steps
Finding words that contain an error (detection of candidates):
  Ancient texts: Leiden conventions
  Modern texts: trust of correctness by likelihood ratio
  Redundancy: negative information for misspelled or fragmentary words
Correcting candidate words (features are pre-computed on different corpora):
  By semantic approaches
  By syntactical approaches
  Word properties
  String similarity
  Named entities
  Morphological knowledge
4. Classes of mining tools
Unsupervised, Supervised, Bootstrapping, Pattern, Manual
5. Past and recent projects
Spell checking: car repair orders of a famous German car builder
  Challenge: written in bullet points
Gap correction: eAQUA, correcting ancient Greek documentary papyri
  Challenge: how to deal with different dialects and spelling variations
  Plans for integration into papyri.info
OCR correction: texts on US-Iran relations
  Challenge: how to deal with Islamic person names?
8. Some error classes (German examples)
Inter-word error: "Das am 19. Oktober erscheinende Album firmiertnämlich wieder unter EAV." (missing space in "firmiert nämlich")
Real-word error: "Er viel auf den Boden und verletzt sich dabei." ("viel" instead of "fiel")
Non-word error: "Was solche Berufspolitiker wie Herr Brüdele mit ihrer Praxisferne sagen, das ist doch nur zum Lachen." (misspelled surname)
Abbreviations: "Auf zwei Sätze mit einem Zeitlimit von max. 40 Minuten wurde in der Vorrunde am Samstag sowie am Sonntagvormittag gespielt." ("max.")
9. Text source
10. Text correction by (text) mining
12. Pre-processing of text and training of data
Currently processed corpora: TLG, PHI7, PHI7_INS, PHI_DDP, epiDuke
Pre-processing:
  All texts are segmented into sentences and paragraphs.
  Meta information such as dating or classification is extracted.
  Tokenisation
Training:
  Features (e.g. signal words) for every word are pre-computed in the background (up to hundreds of millions of records).
  Features are classified by different approaches.
Scoring the overall list:
  Main idea/assumption: every known word in a corpus is a potential candidate for text completion. That means: TLG about 1.7M words, epiDuke about 550K words.
  Every approach delivers an independent list of candidates, each with a score between 0 and 1.
  The overall candidate list is scored by summing each word's individual scores from the selected algorithms.
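The scoring rule on this slide (independent candidate lists with scores between 0 and 1, summed per word) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `combine_candidate_lists`, the approach names, and all scores are hypothetical.

```python
from collections import defaultdict


def combine_candidate_lists(candidate_lists):
    """Sum per-approach scores (each in [0, 1]) into one ranked list.

    candidate_lists maps an approach name to a {word: score} dict.
    Both the names and the data layout are assumptions for this sketch.
    """
    totals = defaultdict(float)
    for approach, scores in candidate_lists.items():
        for word, score in scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{approach}: score for {word!r} outside [0, 1]")
            totals[word] += score
    # Highest total first; ties are broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from three selected approaches for one damaged word.
lists = {
    "semantic": {"alpha": 0.5, "beta": 0.25},
    "syntactic": {"alpha": 0.25, "gamma": 0.75},
    "similarity": {"beta": 0.5, "gamma": 0.125},
}
ranking = combine_candidate_lists(lists)
print(ranking)
```

A word missing from one approach's list simply contributes nothing from that approach, so words supported by several approaches tend to rise to the top.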
13. Task 1: Detection of critical words
Detection of words by Leiden Conventions (source: Wikipedia):
  [abc]: letters missing from the original text due to lacuna, but restored by the editor
  <abc>: characters erroneously omitted by the ancient scribe, restored by the editor
  [[abc]]: deleted letters
  ...
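Detecting critical words from these three sigla can be sketched with a regular expression. The group names, the function `find_critical_words`, and the sample sentence are assumptions made for illustration; real editions use many more sigla, and gaps that contain whitespace would need a tokenizer that does not split on spaces.

```python
import re

# One alternative per siglum. [[abc]] must come before [abc],
# otherwise the single-bracket pattern would never let it match.
SIGLA = re.compile(
    r"\[\[(?P<deleted>[^\[\]]*)\]\]"   # [[abc]]: deleted letters
    r"|\[(?P<lacuna>[^\[\]]*)\]"       # [abc]: lacuna, restored by the editor
    r"|<(?P<omitted>[^<>]*)>"          # <abc>: omitted by the scribe
)


def find_critical_words(text):
    """Return (token, [siglum kinds]) for every token carrying a siglum."""
    hits = []
    for token in text.split():
        kinds = [match.lastgroup for match in SIGLA.finditer(token)]
        if kinds:
            hits.append((token, kinds))
    return hits


# Hypothetical English sample; real input would be Greek or Latin text.
sample = "the ed[itor] rest<o>red a [[bad]] word"
critical = find_critical_words(sample)
print(critical)
```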
14. Finding candidates
V.3   Erased and lost: [...] ... [ ]
V.3   Erased and lost: [ c.5 ]
V.3   Erased and lost: [---]
VI.1  Text struck over erasure: abc
VI.1  Overstruck text, incomprehensible: ABC
VI.1  Overstruck text, ambiguous: abc
VI.2  Overstruck text, lost but restored: [abc]
VI.3  Overstruck text, completely lost: [...]
VI.3  Overstruck text, lost, extent
Source: Gabriel Bodard (et al.), (2006-2009), _EpiDoc Cheat Sheet: Krummrey-Panciera sigla & EpiDoc tags_, version 1085, accessed: 2010-07-04.
15. Task 2: Correction of critical words
Semantically best word (co-occurrences)
Syntactically best word (n-gram)
Best word by string similarity (Levenshtein, FastSS)
Word length (stoichedon texts)
Best word by domain classification, for example by:
  Mathematics and mechanics
  Centuries
  Cities
  Jurisdiction
  Slave trading
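Two of the criteria above, string similarity and word length, can be combined in a small sketch. It uses plain Levenshtein distance only; FastSS, which the slide also names, is an indexing technique for speeding up such lookups and is not shown here. The function `suggest`, its parameters, and the vocabulary are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(damaged, vocabulary, length=None, max_distance=2):
    """Known words within max_distance edits, closest first.

    If length is given (e.g. a gap width known from a stoichedon grid),
    only words of exactly that length are considered.
    """
    hits = []
    for word in vocabulary:
        if length is not None and len(word) != length:
            continue
        distance = levenshtein(damaged, word)
        if distance <= max_distance:
            hits.append((distance, word))
    return [word for _, word in sorted(hits)]


# Hypothetical vocabulary; in the setting above it would be every known word.
vocabulary = ["album", "alarm", "alba", "albums"]
print(suggest("albmu", vocabulary, length=5))
```

Scanning every known word is quadratic per comparison and linear in the vocabulary, which is why an index such as FastSS matters at corpus sizes in the millions.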
16. Finding best fitting word
Toy sample: A b C d G h.
Semantic approach:
  Features: sentence-based co-occurrences (function words filtered)
  Toy sample: A, C, G are selected as the semantic profile
  Looking for words that have the best overlap with the semantic profile (all permutations are possible)
  Real-world example: [Greek text not preserved]
Syntactical approach:
  Method: looking for immediately neighbouring words (bi-gram level)
  Toy sample: d, G are selected as features
Word similarity:
  Method: letter bi-gram overlap (word)
  Real-world examples: [Greek text not preserved]
Named entity list and word length
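The three feature types on this slide can be illustrated on a toy corpus. Everything below is an assumption for the sketch: the corpus, the convention that lowercase tokens are function words, the position of the damaged word between d and G, the normalisation of scores, and the use of the Dice coefficient for letter bi-gram overlap (the slide does not name a specific measure).

```python
from collections import Counter, defaultdict


def cooccurrences(sentences, stopwords):
    """Sentence-based co-occurrence sets with function words filtered."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        content = {word for word in sentence if word not in stopwords}
        for word in content:
            neighbours[word] |= content - {word}
    return neighbours


def semantic_candidates(profile, neighbours):
    """Rank known words by the share of the profile they co-occur with."""
    scores = {
        word: len(profile & context) / len(profile)
        for word, context in neighbours.items()
        if word not in profile
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def neighbour_candidates(left, right, sentences):
    """Count words seen directly after `left` or directly before `right`."""
    counts = Counter()
    for sentence in sentences:
        for first, second in zip(sentence, sentence[1:]):
            if first == left:
                counts[second] += 1
            if second == right:
                counts[first] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def bigram_similarity(a, b):
    """Dice coefficient over the sets of letter bi-grams of two words."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Toy corpus: uppercase = content words, lowercase = function words.
corpus = [
    ["A", "b", "C", "d", "E", "G", "h"],
    ["A", "b", "F", "d", "G"],
    ["X", "b", "F", "h"],
]
neighbours = cooccurrences(corpus, stopwords={"b", "d", "h"})
semantic = semantic_candidates({"A", "C", "G"}, neighbours)
syntactic = neighbour_candidates("d", "G", corpus)
print(semantic[0], syntactic[0], bigram_similarity("night", "nacht"))
```

In this toy setting both the semantic profile {A, C, G} and the neighbours d and G point to the same word, E, which is the kind of agreement the summed score rewards.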
17. An example: What is the ORIGINAL missing word?
18. The text
19. Step 1: Take text and copy it to the web page
20. Step 2: Choosing damaged word
21. Strategy 1: use only information about the damaged word
22. Step 3: Choosing set of algorithms
23. Strategy 2: use only context information of any kind
24. Step 3: Choosing set of algorithms
25. The real Strategy 2 (remove everything that you know about the damaged word): use only context information of any kind
26. Step 1&2: Changing text and choosing damaged word
27. Step 3: Suggesting words
28. Strategy 3, the full strategy: choosing whatever makes sense
29. Step 3: Choosing set of algorithms
30. Problems and summary
More approaches (up to 30 automatic approaches are planned/possible)
Separation of left-hand-side and right-hand-side signal words
Semantics:
  Semantics by Wittgenstein: co-occurrences
  Semantics by Firth (here: HGV classification): "You shall know a word by the company it keeps."
  Different semantic spaces
Does any cluster of algorithms fit specific problems?
What are good training data for identifying and suggesting damaged words?
Big challenge: how can we CORRECT OCR without CORRECTING SPELLING ERRORS? (These are often of interest to humanists.)