Digitalisierungspraxis - Büchler - Textvervollständigung

1. Text completion, OCR correction, and spell checking: three views on the same methods
   Erfahrungen aus der Digitalisierungspraxis: OCR, Volltexte und Präsentationsformen (Experiences from digitisation practice: OCR, full texts, and forms of presentation)
   Munich, 2011/10/12
   Marco Büchler, Gerhard Heyer
   Natural Language Processing Group, Department of Computer Science, University of Leipzig

2. Agenda
   Gap correction, OCR correction, and spell checking: two common tasks
   How to mine structures that can be used for correction
   The two tasks of reconstruction:
      Identification of damaged words
      Generation of suggestions
   Some problems and further steps

3. Correcting words in two steps
   Finding words that contain an error (detection of candidates)
      Ancient texts: Leiden conventions
      Modern texts: trust in correctness, estimated by a likelihood ratio
      Redundancy: negative information for misspelled or fragmentary words
   Correcting candidate words (features are pre-computed on different corpora)
      By semantic approaches
      By syntactic approaches
      Word properties
      String similarity
      Named entities
      Morphological knowledge

4. Classes of mining tools
   Unsupervised, Supervised, Bootstrapping, Pattern, Manual

5. Past and recent projects
   Spell checking: car repair orders of a well-known German car manufacturer
      Challenge: written in bullet points
   Gap correction: eAQUA, correcting ancient Greek documentary papyri
      Challenge: how to deal with different dialects and spelling variations
      Plans for integration into papyri.info
   OCR correction: texts on US-Iran relations
      Challenge: how to deal with Islamic personal names?

6. Spell checking today

7. Spell checking: some results (Source: c't, issue 23, 2007)

8. Some error classes
   Some German examples
   Inter-word error: Das am 19. Oktober erscheinende Album firmiertnämlich wieder unter EAV.
   Real-word error: Er viel auf den Boden und verletzt sich dabei.
   Non-word error: Was solche Berufspolitiker wie Herr Brüdele mit ihrer Praxisferne sagen, das ist doch nur zum Lachen.
   Abbreviations: Auf zwei Sätze mit einem Zeitlimit von max. 40 Minuten wurde in der Vorrunde am Samstag sowie am Sonntagvormittag gespielt.
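The difference between the non-word and real-word classes can be illustrated with a plain lexicon lookup. This is a minimal sketch, not the method used in the talk: the function name `flag_non_words` and the tiny lexicon are hypothetical, and the sentences are modelled on the slide's examples. A lookup flags the non-word, but the real-word error ("viel" for "fiel") passes unnoticed because "viel" is itself a valid word.

```python
import re

def flag_non_words(sentence, lexicon):
    """Return the tokens of `sentence` that are not in `lexicon` (case-insensitive)."""
    # [^\W\d_]+ matches runs of letters, including umlauts, but no digits or underscores.
    tokens = re.findall(r"[^\W\d_]+", sentence)
    return [token for token in tokens if token.lower() not in lexicon]

# Hypothetical toy lexicon; a real system would use a large corpus vocabulary.
lexicon = {"er", "viel", "fiel", "auf", "den", "boden", "und",
           "verletzt", "sich", "dabei", "herr"}

# Real-word error: every token is a known word, so nothing is flagged.
real_word_hits = flag_non_words("Er viel auf den Boden und verletzt sich dabei.", lexicon)

# Non-word error: the misspelled name is not in the lexicon and is flagged.
non_word_hits = flag_non_words("Herr Brüdele", lexicon)

print(real_word_hits, non_word_hits)
```

Catching the real-word case needs context information, which is what the semantic and syntactic approaches later in the deck provide.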

9. Text source

10. Text correction by (text) mining

11. Correcting words in two steps
   Finding words that contain an error (detection of candidates)
      Ancient texts: Leiden conventions
      Modern texts: trust in correctness, estimated by a likelihood ratio
      Redundancy: negative information for misspelled or fragmentary words
   Correcting candidate words (features are pre-computed on different corpora)
      By semantic approaches
      By syntactic approaches
      Word properties
      String similarity
      Named entities
      Morphological knowledge

12. Pre-processing of text and training of data
   Currently processed corpora: TLG, PHI7, PHI7_INS, PHI_DDP, epiDuke
   Pre-processing:
      All texts are segmented into paragraphs and sentences.
      Meta information such as dating or classification is extracted.
      Tokenisation
   Training:
      Features (e.g. signal words) for every word are pre-computed in the background (up to hundreds of millions of data records).
      Features are classified by different approaches.
   Scoring the overall list:
      Main idea/assumption: every known word in a corpus is a potential candidate for text completion. That means: TLG about 1.7M words, epiDuke about 550K words.
      Every approach delivers an independent list of candidates, each with a score between 0 and 1.
      The overall candidate list is scored by summing each word's individual scores from the selected algorithms.
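The scoring rule described on this slide (independent candidate lists with scores between 0 and 1, combined by summation) could look roughly like the sketch below. The function name `combine_candidates`, the tie-breaking by alphabetical order, and the toy scores are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

def combine_candidates(score_lists):
    """Sum per-approach scores (each in [0, 1]) into one ranked candidate list."""
    totals = defaultdict(float)
    for scores in score_lists:
        for word, score in scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"score for {word!r} is outside [0, 1]: {score}")
            totals[word] += score
    # Highest total first; ties are broken alphabetically (an assumption).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical output of three selected approaches for one damaged word.
semantic = {"alpha": 0.75, "beta": 0.5}
syntactic = {"alpha": 0.5, "gamma": 0.75}
similarity = {"beta": 0.5, "gamma": 0.25}

ranked = combine_candidates([semantic, syntactic, similarity])
print(ranked)
```

A word missing from one approach's list simply contributes nothing from that approach, so candidates supported by several approaches rise to the top.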

13. Task 1: Detection of critical words
   Detection of words by Leiden conventions (Source: Wikipedia):
      [abc]: letters missing from the original text due to a lacuna, but restored by the editor
      <abc>: characters erroneously omitted by the ancient scribe, restored by the editor
      [[abc]]: deleted letters
      ...
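Detecting critical words from Leiden sigla can be sketched with a few regular expressions. This is a simplified illustration under stated assumptions: it covers only square brackets, double square brackets, and the angle brackets that the Leiden conventions use for scribal omissions; it treats whitespace-separated tokens as words; and the names `LEIDEN_PATTERNS` and `find_critical_words` are hypothetical. Lacunae that span several words are not handled.

```python
import re

# Order matters: double brackets must be tried before single brackets.
LEIDEN_PATTERNS = [
    ("deleted", re.compile(r"\[\[([^\[\]]*)\]\]")),
    ("restored_lacuna", re.compile(r"\[([^\[\]]*)\]")),
    ("scribal_omission", re.compile(r"<([^<>]*)>")),
]

def find_critical_words(text):
    """Return (token, kinds) for every whitespace token carrying a Leiden siglum."""
    hits = []
    for token in text.split():
        kinds = []
        rest = token
        for kind, pattern in LEIDEN_PATTERNS:
            if pattern.search(rest):
                kinds.append(kind)
                # Remove the matched siglum so it is not counted a second time.
                rest = pattern.sub("", rest)
        if kinds:
            hits.append((token, kinds))
    return hits

# Toy transliteration using Latin letters instead of Greek.
sample = "ab[cd]e fgh i<j>k [[lm]]n"
hits = find_critical_words(sample)
print(hits)
```

Each flagged token can then be passed on to the correction step as a candidate for completion.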

14. Finding candidates
   V.3 Erased and lost: [...] ... [ ]
   V.3 Erased and lost: [ c.5 ]
   V.3 Erased and lost: [---]
   VI.1 Text struck over erasure: abc
   VI.1 Overstruck text, incomprehensible: ABC
   VI.1 Overstruck text, ambiguous: abc
   VI.2 Overstruck text, lost but restored: [abc]
   VI.3 Overstruck text, completely lost: [...]
   VI.3 Overstruck text, lost, extent
   Source: Gabriel Bodard (et al.), (2006-2009), _EpiDoc Cheat Sheet: Krummrey-Panciera sigla & EpiDoc tags_, version 1085, accessed: 2010-07-04.

15. Task 2: Correction of critical words
   Semantically best word (co-occurrences)
   Syntactically best word (n-gram)
   Best word by string similarity (Levenshtein, FastSS)
   Word length (stoichedon texts)
   Best word by domain classification by:
      Mathematics and mechanics
      Centuries
      Cities
      Jurisdiction
      Slave trading
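Two of the criteria above, string similarity and word length, can be combined in a small sketch. It uses plain Levenshtein distance (FastSS is not implemented here), normalises the distance to a score between 0 and 1 by dividing by the longer word's length (an assumption), and optionally filters by an exact length, as a stoichedon grid would suggest. The names `similarity_score` and `suggest` and the toy lexicon are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity_score(damaged, candidate):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(damaged), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(damaged, candidate) / longest

def suggest(damaged, lexicon, required_length=None):
    """Rank lexicon words by similarity, optionally keeping only one word length."""
    pool = [w for w in lexicon if required_length is None or len(w) == required_length]
    return sorted(pool, key=lambda w: (-similarity_score(damaged, w), w))

toy_lexicon = ["lexicon", "lemon", "saxon"]
all_suggestions = suggest("lexcon", toy_lexicon)
five_letter_suggestions = suggest("lexcon", toy_lexicon, required_length=5)
print(all_suggestions, five_letter_suggestions)
```

The length filter is a hard constraint here; a softer variant could instead penalise candidates whose length deviates from the expected gap.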

16. Finding the best-fitting word
   Toy sample: A b C d G h.
   Semantic approach:
      Features: sentence-based co-occurrences (function words filtered)
      Toy sample: A, C, G are selected as the semantic profile
      Looking for words that have the best overlap with the semantic profile (all permutations are possible)
      Real-world example: [Greek example not preserved]
   Syntactic approach:
      Method: looking at immediately neighbouring words (bigram level)
      Toy sample: d, G are selected as features
   Word similarity:
      Method: letter bigram overlap (word)
      Real-world examples: [Greek examples not preserved]
   Named entity list and word length
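The semantic approach and the letter-bigram similarity from this slide can be sketched as follows. The toy sentence follows the slide (content words A, C, G form the semantic profile), but the co-occurrence table, the candidate names X, Y, Z, the normalisation by profile size, and the use of the Dice coefficient for bigram overlap are all assumptions made for illustration.

```python
def profile_overlap(profile, cooccurrences):
    """Share of profile words that co-occur with the candidate (0.0 to 1.0)."""
    if not profile:
        return 0.0
    return len(profile & cooccurrences) / len(profile)

def letter_bigrams(word):
    """Set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over letter bigrams (an assumed overlap measure)."""
    bigrams_a, bigrams_b = letter_bigrams(a), letter_bigrams(b)
    if not bigrams_a and not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))

# Toy sentence from the slide; lower-case tokens stand for function words.
tokens = ["A", "b", "C", "d", "G", "h"]
function_words = {"b", "d", "h"}
profile = {t for t in tokens if t not in function_words}

# Hypothetical pre-computed sentence co-occurrences for three candidate words.
cooccurrence_table = {
    "X": {"A", "C", "G", "Q"},
    "Y": {"A", "Q"},
    "Z": {"R"},
}

semantic_scores = {word: profile_overlap(profile, cooc)
                   for word, cooc in cooccurrence_table.items()}
best_candidate = max(semantic_scores, key=semantic_scores.get)
print(best_candidate, semantic_scores)
print(bigram_similarity("night", "nacht"))
```

Both functions return values between 0 and 1, so their results can be added to the scores of other approaches when the overall candidate list is built.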

17. An example: what is the ORIGINAL missing word?

18. The text

19. Step 1: Take the text and copy it to the web page

20. Step 2: Choosing the damaged word

21. Strategy 1: use only information about the damaged word

22. Step 3: Choosing the set of algorithms

23. Strategy 2: use only context information of any kind

25. The real Strategy 2 (remove everything that you know about the damaged word): use only context information of any kind

26. Steps 1 & 2: Changing the text and choosing the damaged word

27. Step 3: Suggesting words

28. Strategy 3: the full strategy, choosing whatever makes sense

30. Problems and summary
   More approaches (up to 30 automatic approaches are planned/possible)
   Separation of left-hand-side and right-hand-side signal words
   Semantics:
      Semantics by Wittgenstein: co-occurrences
      Semantics by Firth (here: HGV classification): "You shall know a word by the company it keeps."
      Different semantic spaces
   Does any cluster of algorithms fit specific problems?
   What are good training data for identifying damaged words and suggesting corrections?
   Big challenge: how can we CORRECT OCR without CORRECTING SPELLING ERRORS? (These are often of interest to humanists.)

31. Questions?
