Text curation for the Deutsches Textarchiv

www.deutschestextarchiv.de, www.dwds.de

Alexander Geyken,

BBAW Digitales Wörterbuch (DWDS), Deutsches Textarchiv (DTA), CLARIN-D

Berlin-Brandenburgische Akademie der Wissenschaften - BBAW

Corpora at BBAW

DWDS: 1900 to the present

• DWDS extended: 2.6 G tokens, 6 million documents, including a CMC subset

• DWDS base: 254 M tokens, 272,000 documents

• DWDS core: 100 M tokens, 80,000 documents

DTA: 1650-1900

• DTA extended: 150 M tokens

• DTA core: 80 M tokens

Deutsches Textarchiv in a nutshell

• Select important historical German prints (1650-1900, 1,300 volumes) – Planck, Hilbert, Boltzmann, Euler; Goethe, Lessing; Marx, Wundt, Forster, …

• Digitize (first editions, high-accuracy transcription), TEI/P5 (DTA base format), linguistic annotation

• Interoperable, e.g. CLARIN-D

• Funded by DFG (2007-2014); staff 5 FTE

www.deutschestextarchiv.de (new beta version)

charakterisirt → charakterisiert

DTAQ – a collaborative platform for QA

Some detail on transcriptions …

(only) documented emendations are welcome, e.g. "Ednard": <choice> <sic>Ednard</sic> <corr>Eduard</corr> </choice>

chamisso_schlemihl_1814?p=13
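A minimal sketch of how such a documented emendation could be read back out, assuming Python's standard `xml.etree.ElementTree` and a snippet without the TEI namespace (real TEI/P5 files declare one). The function name `reading` is hypothetical and not part of any DTA tool.

```python
import xml.etree.ElementTree as ET

# The emendation from the slide, wrapped in a <p> so it parses as one element.
SNIPPET = "<p><choice><sic>Ednard</sic><corr>Eduard</corr></choice></p>"


def reading(elem, prefer):
    """Collect the text of elem, taking the `prefer` child ("sic" or "corr")
    inside every <choice> and ignoring the alternative."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(reading(chosen, prefer))
        else:
            parts.append(reading(child, prefer))
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SNIPPET)
print(reading(root, "sic"))   # form as printed in the source
print(reading(root, "corr"))  # documented correction
```

Because both readings stay in the file, the printed form and the corrected form can each be produced on demand.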

no modernizations, no ‘normalizations’, e.g. “Ich laſſe mich nicht irre ſchreyn” (goethe_faust01_1808?p=293):

• ſchreyn → ſchreyn (kept true to the source)

• not ſchreyn → schreyn

• not ſchreyn → schreien
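One way to picture this rule: the transcription itself is never altered, and any modernized or normalized form is derived from it as a separate view. The sketch below is an illustration under that assumption only. The function names and the one-entry lexicon are made up for the example and do not describe the DTA's actual tooling.

```python
LONG_S = "\u017f"  # the character ſ

# Toy lookup table, purely illustrative; a real normalizer needs far more.
TOY_LEXICON = {"schreyn": "schreien"}


def modernize_glyphs(text):
    """Derived view 1: replace historical glyphs, keep the spelling."""
    return text.replace(LONG_S, "s")


def normalize(text, lexicon=TOY_LEXICON):
    """Derived view 2: map historical spellings to modern ones, token by token."""
    tokens = modernize_glyphs(text).split()
    return " ".join(lexicon.get(token, token) for token in tokens)


transcription = "Ich laſſe mich nicht irre ſchreyn"

print(transcription)                    # stored form, true to the source
print(modernize_glyphs(transcription))  # computed on demand
print(normalize(transcription))         # computed on demand
```

Keeping the derived forms out of the transcription means a search layer can match modern spellings without any loss of the historical text.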

Transcription: UTF-8; transcribed true to the source; high accuracy in keying (>99%)
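The slide does not say how the accuracy figure is measured. A common definition is character accuracy: one minus the edit distance between a checked reference and the keyed text, divided by the length of the reference. The sketch below assumes that definition; the keyed line with an f in place of ſ is an invented example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, keyed):
    """Character accuracy of `keyed` measured against `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1 - levenshtein(reference, keyed) / len(reference)


reference = "Ich laſſe mich nicht irre ſchreyn"
keyed = "Ich laſſe mich nicht irre fchreyn"  # invented keying error: f for ſ

print(f"{char_accuracy(reference, keyed):.4f}")
```

On a single short line one wrong character already pushes the figure below 99%, so a threshold like this is only meaningful over whole pages or volumes.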

… and structuring

• „formative quality assurance“:

Extensions of DTA core corpus

Cooperations with other partners

Technical support

Total: 53,870 pages (excluding Polytechnisches Journal)

Web form gathering text, images and metadata

• Preliminary DTA-id

• Metadata on the source

• Metadata on the transcription

• Licence/legal information

• Image source(s)

• Conversion
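The sections of the web form can be pictured as one record per submitted text. The sketch below is only an illustration of that idea: the class, its field names and the sample id are hypothetical and do not reflect the actual DTAE form or its data model.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical record mirroring the sections of the web form."""
    preliminary_dta_id: str = ""
    source_metadata: dict = field(default_factory=dict)
    transcription_metadata: dict = field(default_factory=dict)
    licence: str = ""
    image_sources: list = field(default_factory=list)
    text: str = ""

    def missing_sections(self):
        """Return the names of all sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


draft = Submission(
    preliminary_dta_id="example_title_1800",  # made-up id, format is an assumption
    source_metadata={"title": "Example title", "year": 1800},
    text="transcribed text goes here",
)

print(draft.missing_sections())
```

A check of this kind could run before the conversion step, so that incomplete submissions are caught early.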

DTAE – Key features

Available for all DTAE texts (as for all DTA texts):

Parallel view: img | HTML; img | XML; img | CAB; …

+ full bibliographic record and metadata

+ info on transcription & encoding

Benefits

DTAE offers an established infrastructure supporting every stage in an electronic document's life cycle

+ well-documented, consistent encoding of DTA core corpus and sub-corpora (DTA 'base format')

+ present and explore text in the context of DTA's corpora

+ linguistic analysis & tools to explore a high-quality corpus

+ integration in CLARIN-D (via BBAW as service centre)

DTAE: Extensions via cooperations

Cooperation with 12 partners, including:

– HAB Wolfenbüttel (DFG project AEdit)

– Dinglers Polytechnisches Journal (HU Berlin)

– Forschungsstelle für Personalschriften, Marburg (AdW Mainz)

– CLARIN-D curation project

– MPI für Bildungsforschung

www.deutschestextarchiv.de/dtae-dlc

Thanks for your attention!

Questions? geyken@bbaw.de

Have a look at DTAE, DTAQ: www.deutschestextarchiv.de/dtae

www.deutschestextarchiv.de/dtaq
