Querying Spoken Language Corpora
Thomas Schmidt, IDS Mannheim, member of the Leibniz Association (Mitglied der Leibniz-Gemeinschaft)


Page 2

Outline

1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions

Page 3

Background

• EXMARaLDA: System for building and querying spoken language corpora
• Used in many individual projects and at the HZSK CLARIN Centre
• Transcription editor, corpus management tool, query tool EXAKT
• FOLKER: Transcription tool with the same technical basis, optimised for the Research and Teaching Corpus of Spoken German (FOLK)

Page 4

Background

• Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim
• Dialect corpora, conversation corpora
• Database for Spoken German (DGD2): access (browsing and query) for AGD data

Page 5

Model: Single timeline, multiple tiers

• Annotation tuples: text label + timeline reference
• Timeline: fully ordered, reference to a recording
• Tiers: collections of annotations of a specific category and a specific speaker; annotations in a tier do not overlap

Annotation Graph Framework (Bird/Liberman 2001)
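
The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Annotation`, `Tier` and `Transcription` and the method `tier_is_valid` are hypothetical and are not taken from EXMARaLDA or the Annotation Graph toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    label: str  # text label
    start: str  # id of the timeline point where the annotation starts
    end: str    # id of the timeline point where the annotation ends


@dataclass
class Tier:
    speaker: Optional[str]  # one speaker per tier (None for speaker-less tiers)
    category: str           # one annotation category per tier
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Transcription:
    timeline: List[str]  # fully ordered timeline point ids
    tiers: List[Tier]

    def position(self, point_id: str) -> int:
        """Rank of a timeline point in the total order."""
        return self.timeline.index(point_id)

    def tier_is_valid(self, tier: Tier) -> bool:
        """True if every annotation has positive extent and none overlap."""
        spans = sorted(
            (self.position(a.start), self.position(a.end))
            for a in tier.annotations
        )
        if any(start >= end for start, end in spans):
            return False
        return all(
            prev_end <= next_start
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
        )


# Hypothetical example data
ok_tier = Tier("SPK0", "v", [Annotation("hello ", "T0", "T1"),
                             Annotation("world", "T1", "T3")])
bad_tier = Tier("SPK0", "v", [Annotation("a", "T0", "T2"),
                              Annotation("b", "T1", "T3")])
transcription = Transcription(["T0", "T1", "T2", "T3"], [ok_tier, bad_tier])
```

Annotations refer to the timeline only by point id, so the same timeline can be shared by any number of tiers.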

Page 6

EXMARaLDA Basic Transcription:

• (Flat) hierarchy of events in tiers
• Use of ID and IDREFS to encode temporal relations
• No additional markup, no „deep“ semantics
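
A rough illustration of the ID/IDREF mechanism, using Python's standard `xml.etree.ElementTree`. The XML fragment is hand-written and heavily simplified; its element and attribute names are modelled on the EXMARaLDA basic transcription format as an assumption and should be checked against the actual DTD.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment (assumed structure, not a complete file)
SAMPLE = """
<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.2"/>
      <tli id="T2" time="2.0"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="v" type="t">
      <event start="T0" end="T1">so yeah </event>
      <event start="T1" end="T2">okay </event>
    </tier>
  </basic-body>
</basic-transcription>
"""

root = ET.fromstring(SAMPLE)

# Timeline points carry the IDs ...
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}

# ... and events point to them through their start/end attributes (IDREFs).
events = [
    (times[event.get("start")], times[event.get("end")], event.text)
    for tier in root.iter("tier")
    for event in tier.iter("event")
]
```

The events carry no markup of their own: all temporal information is recovered by resolving the references against the timeline.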

Page 7

• EXMARaLDA

• ELAN

Page 8

• EXMARaLDA
• ELAN
• Praat

Page 9

Data formats

• Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
– XML format for data exchange between seven tools with STMT (single timeline, multiple tiers) data models
– improves interoperability for data creation
• Drawbacks
– no document order (non-linear, non-hierarchical)
– what is the „full text“ / the „primary data“ / the „character data“?
– no explicit representation of dependencies
– temporal structure, not linguistic structure
– bad for querying?

Page 10

STMT to OHCO transformation

Page 11

STMT to OHCO transformation

• Segment chain = any temporally connected chain of annotations within one tier

• Assumption: all other hierarchical structure beneath the level of segment chains

• Correspondence: segment chain ↔ <u>
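
The segment chain definition can be sketched as a small grouping function. This is an illustration under assumptions: events are represented as hypothetical `(start, end, text)` tuples with integer timeline positions, and "temporally connected" is read as "the end point of one event is the start point of the next".

```python
from typing import List, Tuple

Event = Tuple[int, int, str]  # (start position, end position, text)


def segment_chains(events: List[Event]) -> List[List[Event]]:
    """Group the events of one tier into temporally connected chains."""
    chains: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if chains and chains[-1][-1][1] == event[0]:
            # The previous event ends where this one starts: same chain.
            chains[-1].append(event)
        else:
            # Gap on the timeline: a new chain (a new <u>) begins.
            chains.append([event])
    return chains


# Hypothetical tier with a gap between positions 2 and 4
tier_events = [(0, 1, "so "), (1, 2, "yeah "), (4, 5, "okay")]
chains = segment_chains(tier_events)
utterance_texts = ["".join(text for _, _, text in chain) for chain in chains]
```

Each resulting chain would then be serialised as one `<u>` element, with any further hierarchical structure placed beneath it.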


Page 13

Unparsed (EXAKT) vs. parsed (DGD2)

Page 14

Free annotation (EXAKT)

Token annotation (DGD2)

Page 15

• Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)

• Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech

Page 16

Transcripts, recordings, metadata

• Interaction metadata
– date, „genre“, place, degree of formality, etc.
– pertains to a (set of) transcription(s)
• Speaker metadata
– age, sex, language biography, speech impediments, etc.
– pertains to (a) part(s) of a transcription
• Audio and video recordings
– for checking transcription quality
– for obtaining information not encoded in transcripts
• Transcripts
– not (the) primary data!
– a „convenient index into the recording“?
– selective, theory-dependent, …

Page 17

Corpora

Page 18

Corpora

• AGD Corpora: 8 mill. tokens
• CGN Corpus: 9 mill. tokens
• BNC Spoken: 10 mill. tokens
• MICASE: 2 mill. tokens
• Most other corpora: < 1 mill. tokens

(At least) one order of magnitude smaller than written corpora
Query speed is (not that) important

Page 19

• „In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“
– Situational context → Interaction metadata
– Speaker metadata
– Text data / Surface form → Transcript text
– Interactional context → Temporal transcript structure
– Prosodic properties → Recording

Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results

Page 20

• „After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“
– special word tokens (incomplete words, semi-lexical material, …)
– non-word tokens (pauses, non-verbal articulations, …)
– temporal measurements (pause length)

Requirement #3: Queries for „special“ tokens
Requirement #4: Queries with special properties (numerical values, repetition)
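
To make requirements #3 and #4 concrete, here is a minimal sketch of the cut-off query over a token sequence. The `Token` class, its `kind` values and the reading of "repeated" as "the next word starts with the cut-off fragment" are assumptions made for illustration; they do not describe how EXAKT or DGD2 implement such queries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str              # "word", "cutoff" (incomplete word) or "pause"
    form: str = ""         # surface form, empty for pauses
    duration: float = 0.0  # length in seconds, only meaningful for pauses


def repeated_cutoffs(tokens: List[Token], min_pause: float = 0.3) -> List[int]:
    """Indices of cut-off words followed by a long pause and a repetition."""
    hits = []
    for i in range(len(tokens) - 2):
        first, pause, following = tokens[i:i + 3]
        if (
            first.kind == "cutoff"
            and pause.kind == "pause"
            and pause.duration > min_pause          # numerical condition
            and following.kind in ("word", "cutoff")
            and following.form.lower().startswith(first.form.lower())
        ):
            hits.append(i)
    return hits


# Hypothetical token sequence
tokens = [
    Token("word", "ich"),
    Token("cutoff", "geh"),
    Token("pause", duration=0.5),
    Token("word", "gehe"),
    Token("word", "dann"),
    Token("cutoff", "na"),
    Token("pause", duration=0.2),   # too short to count
    Token("word", "nach"),
]
hits = repeated_cutoffs(tokens)
```

The query needs all three ingredients named above: a special word token, a non-word token and a numerical comparison on the pause length.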

Page 21

• „Filled pauses are less frequent in overlapping speech than at the beginning of turns“

• „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker‘s backchannel“

Requirement #5: Queries for position in temporal structure
Requirement #6: Multiple distance measures, query scopes
[…]
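
Requirements #5 and #6 can be illustrated with time-aligned intervals. The sketch below assumes a hypothetical `Interval` record with absolute start and end times in seconds. It shows one test for overlapping speech and one time-based distance measure; a token-based distance within an utterance would be a second, different measure.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds
    label: str


def overlaps_other_speaker(target: Interval, intervals: List[Interval]) -> bool:
    """True if the target shares time with any interval of another speaker."""
    return any(
        other.speaker != target.speaker
        and other.start < target.end
        and target.start < other.end
        for other in intervals
    )


def time_distance(a: Interval, b: Interval) -> float:
    """Silent gap between two intervals in seconds (0.0 if they overlap)."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))


# Hypothetical data: speaker B's backchannel overlaps A's first filled pause
intervals = [
    Interval("A", 0.0, 1.0, "also"),
    Interval("A", 1.0, 1.4, "äh"),
    Interval("B", 1.2, 1.6, "mhm"),
    Interval("A", 3.0, 3.3, "ähm"),
]
first_pause_overlaps = overlaps_other_speaker(intervals[1], intervals)
second_pause_overlaps = overlaps_other_speaker(intervals[3], intervals)
gap_to_backchannel = time_distance(intervals[3], intervals[2])
```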

Page 22

• Requirements
– Access to all types of context
– Manual post-processing of query results
– Queries for special tokens
– Queries with special properties
– Queries for position in temporal structure
– Multiple distance measures, query scopes
– …

Page 23

[Diagram labels: Recordings, Metadata, Transcripts, Corpus, Query, Query result, Context, Postprocessing]

Page 24

• EXAKT
– Regular expression on „full text“ of <u>
– (XPath on <u> with markup)
– (XSL on transcripts)
• DGD2
– Oracle full text on documents
– SQL on <w> with attributes
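
The two query styles can be contrasted in a short sketch. Everything here is an assumption for illustration: the sample utterances, the table layout and the tag values are invented, and SQLite (from Python's standard library) stands in for the Oracle database that DGD2 actually uses.

```python
import re
import sqlite3

# Style 1: regular expression on the "full text" of each <u>
utterances = {
    "u1": "ja ähm ich gehe dann nach links",
    "u2": "ähm okay",
}
filled_pause = re.compile(r"\bähm?\b")
regex_hits = [
    (uid, match.start(), match.group())
    for uid, text in utterances.items()
    for match in filled_pause.finditer(text)
]

# Style 2: SQL on one row per <w>, with annotations stored as attributes
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE w (utterance TEXT, position INTEGER,"
    " form TEXT, lemma TEXT, pos TEXT)"
)
connection.executemany(
    "INSERT INTO w VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", 2, "ich", "ich", "PPER"),
        ("u1", 3, "gehe", "gehen", "VVFIN"),
        ("u1", 4, "dann", "dann", "ADV"),
        ("u3", 0, "geht", "gehen", "VVFIN"),
    ],
)
sql_hits = connection.execute(
    "SELECT utterance, position, form FROM w"
    " WHERE lemma = ? AND pos = ? ORDER BY utterance, position",
    ("gehen", "VVFIN"),
).fetchall()
connection.close()
```

The regular expression only sees surface text, whereas the SQL query can combine any token attributes (here lemma and part of speech) but requires the transcript to be parsed into `<w>` tokens first.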

Page 25

• Demo 1: EXAKT with HaMaTaC corpus
• HaMaTaC: Hamburg Map Task Corpus
– advanced L2 learners of German
– solving a map task
– Orthographic transcription with lemma, POS, disfluency annotation

Page 26

• Demo 2: DGD2 with FOLK Corpus
• FOLK: Research & Teaching Corpus of Spoken German

Page 27

• Future directions:
– Support a „real“ query language: CQL
– CQPWeb as a test case
– User survey DGD2 (approaching 2000 users!)
– …
– …
– TEI as common ground
• for different spoken language corpora query platforms?
• for querying spoken and written data side-by-side?