Text curation for the Deutsches Textarchiv www.deutschestextarchiv.de, www.dwds.de Alexander Geyken, BBAW Digitales Wörterbuch (DWDS), Deutsches Textarchiv (DTA), CLARIN-D




Berlin-Brandenburgische Akademie der Wissenschaften - BBAW


Corpora at BBAW

DWDS (1900 onward)

• DWDS extended: 2.6 G tokens, 6 million documents, including a CMC subcorpus

• DWDS base: 254 M tokens, 272,000 documents

• DWDS core: 100 M tokens, 80,000 documents

DTA (1650-1900)

• DTA extended: 150 M

• DTA core: 80 M


Deutsches Textarchiv in a nutshell

• Select important historical German prints (1650-1900, 1,300 vol.) – Planck, Hilbert, Boltzmann, Euler; Goethe, Lessing; Marx, Wundt, Forster, …

• Digitize (first editions, high-accuracy transcription), TEI/P5 (DTA base format), linguistic annotation

• Interoperable, e.g. CLARIN-D

• Funded by DFG (2007-2014); staff 5 FTE


www.deutschestextarchiv.de (new beta version)


charakterisirt → charakterisiert
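
The pair above links a historical spelling to its modern form. A minimal sketch of how such pairs could be used to expand a search term, assuming a small hand-written lookup table: the table `HISTORICAL_TO_MODERN` and the function `expand_query` are hypothetical illustrations, not the DTA's actual normalization component.

```python
# Hypothetical lookup table: historical spelling -> modern spelling.
# Both pairs are taken from the slides; a real system would not rely
# on a hand-written table like this one.
HISTORICAL_TO_MODERN = {
    "charakterisirt": "charakterisiert",
    "ſchreyn": "schreien",
}


def expand_query(term: str) -> set[str]:
    """Return the term plus every spelling linked to it in the table."""
    variants = {term}
    for historical, modern in HISTORICAL_TO_MODERN.items():
        if term in (historical, modern):
            variants.update((historical, modern))
    return variants


if __name__ == "__main__":
    # A modern query term also picks up the historical spelling.
    print(sorted(expand_query("charakterisiert")))
```

With this kind of expansion the transcription itself can stay true to the source, because the mapping is applied only at query time.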


DTAQ – a collaborative platform for QA


Some detail on transcriptions …

(only) documented emendations are welcome, e.g. "Ednard" (chamisso_schlemihl_1814?p=13):

<choice><sic>Ednard</sic><corr>Eduard</corr></choice>

no modernizations, no ‘normalizations’, e.g. “Ich laſſe mich nicht irre ſchreyn” (goethe_faust01_1808?p=293):

• ſchreyn → ſchreyn (kept true to the source)

• ſchreyn → schreyn (not applied)

• ſchreyn → schreien (not applied)

Transcription: UTF-8; transcribed true to the source; high accuracy in keying (>99%)
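
Because an emendation keeps both the printed form and the correction, either reading can be derived from the same file. A minimal sketch using Python's standard `xml.etree.ElementTree`, under stated assumptions: the wrapping `<p>` element and the placeholder words "vor" and "nach" are made up for illustration, the TEI namespace is omitted, and the helper `reading` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <choice> element is the one from the slide. The <p> wrapper and the
# placeholder words around it are invented so the fragment parses on its own.
# Real TEI files declare the TEI namespace, which is left out here.
FRAGMENT = (
    "<p>vor <choice> <sic>Ednard</sic> <corr>Eduard</corr> </choice> nach</p>"
)


def reading(elem: ET.Element, prefer: str) -> str:
    """Flatten elem to text, keeping only <sic> or <corr> inside each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            picked = child.find(prefer)
            if picked is not None:
                parts.append(reading(picked, prefer))
        else:
            parts.append(reading(child, prefer))
        # Text following the child element belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
as_printed = reading(root, "sic")    # text as it stands in the source
corrected = reading(root, "corr")    # text with the documented emendation

if __name__ == "__main__":
    print(as_printed)
    print(corrected)
```

Whitespace inside `<choice>` is ignored by this helper, so the spaced-out and the compact notation of the element give the same result.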


… and structuring

• “formative quality assurance”:




Extensions of DTA core corpus

Cooperations with other partners

Technical support

Total: 53,870 pages (excluding the Polytechnisches Journal)

Web form gathering text, images and metadata

• preliminary DTA-id

• metadata on the source

• metadata on the transcription

• licence/legal

• img source(s)
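
The callouts on the web form (preliminary DTA-id, metadata on the source, metadata on the transcription, licence/legal, image sources) suggest the shape of a submission record. A minimal sketch of such a record, assuming one field per section: the class `Submission`, its field names and the `missing_sections` check are hypothetical and do not describe the actual form or its backend.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical record with one field per section of the web form."""

    preliminary_dta_id: str = ""
    source_metadata: dict[str, str] = field(default_factory=dict)
    transcription_metadata: dict[str, str] = field(default_factory=dict)
    licence: str = ""
    image_sources: list[str] = field(default_factory=list)
    text: str = ""

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    # Placeholder values only; a real id would be assigned by the form.
    draft = Submission(preliminary_dta_id="example_id", text="sample text")
    print(draft.missing_sections())
```

Checking for empty sections before conversion is one simple way to keep text, images and metadata together from the start.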


conversion


DTAE – Key features

Available for all DTAE texts (as for all DTA texts):

Parallel view: img | HTML; img | XML; img | CAB; …

+ full bibliographic record and metadata

+ info on transcription & encoding


Benefits

DTAE offers an established infrastructure supporting every stage in an electronic document's life cycle

+ well-documented, consistent encoding of the DTA core corpus and sub-corpora (DTA 'base format')

+ present and explore text in the context of DTA's corpora

+ linguistic analysis & tools to explore a high-quality corpus

+ integration in CLARIN-D (via BBAW as service centre)


DTAE: Extensions via cooperations


Cooperation with 12 partners, including:

– HAB Wolfenbüttel (DFG project AEdit)

– Dinglers Polytechnisches Journal (HU Berlin)

– Forschungsstelle für Personalschriften, Marburg (AdW Mainz)

– CLARIN-D curation project

– MPI für Bildungsforschung

www.deutschestextarchiv.de/dtae-dlc

Total: 53,870 pages (excluding the Polytechnisches Journal)




Thanks for your attention!

Questions?

Have a look at DTAE, DTAQ: www.deutschestextarchiv.de/dtae

www.deutschestextarchiv.de/dtaq