Querying Spoken Language Corpora
Thomas Schmidt, IDS Mannheim
Member of the Leibniz Association



Outline
1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions


Background

• EXMARaLDA: System for building and querying spoken language corpora
– Used in many individual projects, at the HZSK CLARIN Centre
– Transcription editor, corpus management tool, query tool EXAKT
• FOLKER: Transcription tool – same technical basis, optimised for Research and Teaching Corpus of Spoken German (FOLK)
• Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim
– Dialect corpora, conversation corpora
• Database for Spoken German (DGD2): access (browsing and query) for AGD data



Model: Single timeline, multiple tiers (STMT)

• Annotation tuples: text label + timeline reference
• Timeline: fully ordered, reference to a recording
• Tiers: collections of annotations of a specific category, a specific speaker; annotations in a tier do not overlap

Annotation Graph Framework (Bird/Liberman 2001)
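
The model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Annotation`, `Tier`, `Transcription`), the speaker labels and the sample data are hypothetical and are not taken from EXMARaLDA, ELAN or Praat.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """An annotation tuple: a text label plus start/end references into the timeline."""
    label: str
    start: int  # index of a timeline point
    end: int    # index of a later timeline point


@dataclass
class Tier:
    """Annotations of one category for one speaker; they must not overlap."""
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        if annotation.start >= annotation.end:
            raise ValueError("annotation must span at least one timeline interval")
        for other in self.annotations:
            # Two intervals overlap if each one starts before the other ends.
            if annotation.start < other.end and other.start < annotation.end:
                raise ValueError("annotations in a tier must not overlap")
        self.annotations.append(annotation)


@dataclass
class Transcription:
    """A single, fully ordered timeline (offsets in seconds) shared by all tiers."""
    timeline: list
    tiers: list = field(default_factory=list)

    def time_span(self, annotation: Annotation) -> tuple:
        """Resolve an annotation's timeline references to offsets in the recording."""
        return (self.timeline[annotation.start], self.timeline[annotation.end])


transcription = Transcription(timeline=[0.0, 0.8, 1.5, 2.1])
tom = Tier(speaker="TOM", category="verbal")
tom.add(Annotation("so you go ", 0, 1))
tom.add(Annotation("left ", 1, 2))
ann = Tier(speaker="ANN", category="verbal")
ann.add(Annotation("okay ", 1, 3))  # overlaps TOM in time, but sits in another tier
transcription.tiers += [tom, ann]
```

Overlap between speakers is expressed by two tiers referring to the same timeline points; only annotations within one tier are prevented from overlapping.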


EXMARaLDA Basic Transcription:
• (Flat) hierarchy of events in tiers
• Use of ID and IDREFS to encode temporal relations
• No additional markup, no „deep“ semantics
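
A minimal sketch of how ID references encode temporal relations. The XML fragment is a simplified illustration loosely modelled on a basic transcription; its element and attribute names should be read as assumptions, not as the exact schema, and `read_events` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment: timeline points carry IDs,
# events point at them through their start/end attributes.
SAMPLE = """
<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="0.8"/>
    <tli id="T2" time="1.5"/>
  </common-timeline>
  <tier id="TIE0" speaker="TOM" category="v">
    <event start="T0" end="T1">so you go </event>
    <event start="T1" end="T2">left </event>
  </tier>
  <tier id="TIE1" speaker="ANN" category="v">
    <event start="T1" end="T2">okay </event>
  </tier>
</basic-transcription>
""".strip()


def read_events(xml_text):
    """Return (speaker, start_seconds, end_seconds, text) for every event."""
    root = ET.fromstring(xml_text)
    # Resolve timeline IDs to offsets once, then look them up per event.
    times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}
    events = []
    for tier in root.iter("tier"):
        for event in tier.iter("event"):
            events.append((
                tier.get("speaker"),
                times[event.get("start")],
                times[event.get("end")],
                event.text,
            ))
    return events


events = read_events(SAMPLE)
```

Because ANN's event refers to the same timeline points as TOM's second event, the overlap is visible only after the references are resolved, not from document order.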


• EXMARaLDA

• ELAN

• Praat


Data formats
• Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
– XML format for data exchange between seven tools with STMT data models
→ improves interoperability for data creation
• Drawbacks
– no document order (non-linear, non-hierarchical)
– what is the „full text“ / the „primary data“ / the „character data“?
– no explicit representation of dependencies
– temporal structure, not linguistic structure
→ bad for querying?


STMT to OHCO (Ordered Hierarchy of Content Objects) transformation


• Segment chain = any temporally connected chain of annotations within one tier

• Assumption: all other hierarchical structure beneath the level of segment chains

• Correspondence: segment chain ↔ <u>
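
The grouping step can be sketched as follows. This is an illustration under assumptions, not the actual conversion used by EXAKT or DGD2: events are given as `(start, end, text)` tuples with integer timeline indices, and the `start`/`end` attributes on `<u>` are a simplification chosen for this example.

```python
import xml.etree.ElementTree as ET


def segment_chains(events):
    """Group (start, end, text) events of ONE tier into temporally connected chains.

    Two consecutive events belong to the same chain when the second starts
    at the timeline point where the first one ends.
    """
    chains = []
    for start, end, text in sorted(events):
        if chains and chains[-1][-1][1] == start:
            chains[-1].append((start, end, text))
        else:
            chains.append([(start, end, text)])
    return chains


def chains_to_u(speaker, events):
    """Emit one <u> element per segment chain (attribute names are illustrative)."""
    elements = []
    for chain in segment_chains(events):
        u = ET.Element("u", {
            "who": speaker,
            "start": f"T{chain[0][0]}",
            "end": f"T{chain[-1][1]}",
        })
        u.text = "".join(text for _, _, text in chain)
        elements.append(u)
    return elements


tom_events = [(0, 1, "so you go "), (1, 2, "left "), (4, 5, "yes ")]
utterances = chains_to_u("TOM", tom_events)
```

The first two events are connected and become one `<u>`; the third starts after a gap in the tier and therefore opens a new one.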



• Unparsed (EXAKT) vs. parsed (DGD2)
• Free annotation (EXAKT) vs. token annotation (DGD2)


• Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)

• Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech


Transcripts, recordings, metadata
• Interaction metadata
– date, „genre“, place, degree of formality, etc.
– pertains to a (set of) transcription(s)
• Speaker metadata
– age, sex, language biography, speech impediments, etc.
– pertains to (a) part(s) of a transcription
• Audio and video recordings
– for checking transcription quality
– for obtaining information not encoded in transcripts
• Transcripts
– not (the) primary data!
– a „convenient index into the recording“?
– selective, theory-dependent, …




Corpora
• AGD Corpora: 8 mill. tokens
• CGN Corpus: 9 mill. tokens
• BNC Spoken: 10 mill. tokens
• MICASE: 2 mill. tokens
• Most other corpora: < 1 mill. tokens
→ (at least) one order of magnitude smaller than written corpora
→ Query speed is (not that) important


• „In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“
– Situational context → Interaction metadata
– Speaker metadata
– Text data / Surface form → Transcript text
– Interactional context → Temporal transcript structure
– Prosodic properties → Recording

Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results
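
A rough sketch of what the two requirements imply for a query result. Everything here is hypothetical toy data: the `Hit` class, the metadata fields and the age threshold are assumptions made for the example, not the data model of any of the tools discussed.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One match in a transcript, with pointers to everything around it."""
    transcript_id: str
    speaker_id: str
    token: str
    start: float  # offset into the recording, in seconds
    end: float


# Hypothetical metadata, keyed by the IDs used in the hits.
INTERACTIONS = {
    "T1": {"place": "Northern Scotland", "formality": "informal"},
    "T2": {"place": "London", "formality": "formal"},
}
SPEAKERS = {
    "S1": {"sex": "female", "age": 71},
    "S2": {"sex": "male", "age": 30},
}


def add_context(hit):
    """Attach interaction metadata, speaker metadata and the audio span to a hit."""
    return {
        "hit": hit,
        "interaction": INTERACTIONS[hit.transcript_id],
        "speaker": SPEAKERS[hit.speaker_id],
        "audio_span": (hit.start, hit.end),
    }


hits = [Hit("T1", "S1", "aye", 12.4, 12.7), Hit("T2", "S2", "aye", 3.0, 3.2)]
rows = [add_context(h) for h in hits]

# Automatic part: filter on metadata (age >= 60 is an arbitrary stand-in for "older").
candidates = [
    r for r in rows
    if r["interaction"]["formality"] == "informal"
    and r["speaker"]["sex"] == "female"
    and r["speaker"]["age"] >= 60
]

# Manual part: intonation is not in the transcript, so a human listens to
# audio_span and fills in this field afterwards.
for r in candidates:
    r["rising_intonation"] = None
```

The filter narrows the hits down through metadata, while the empty `rising_intonation` field marks the judgement that only a listener working from the recording can supply.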


• „After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“
– special word tokens (incomplete words, semi-lexical material, …)
– non-word tokens (pauses, non-verbal articulations, …)
– temporal measurements (pause length)

Requirement #3: Queries for „special“ tokens
Requirement #4: Queries with special properties (numerical values, repetition)
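
The cut-off example could be queried over a token sequence roughly as follows. This is a sketch under assumptions: the token dictionaries, the `cutoff` flag and the `startswith` test for "repetition" are illustrative choices, not an existing query language.

```python
def cutoff_pause_hits(tokens, min_pause=0.3):
    """Find cut-off words followed by a pause longer than min_pause seconds.

    Returns (index, cut_off_form, next_form, repeated) for each match, where
    `repeated` is True if the next word starts with the cut-off fragment
    (an assumed, simplistic definition of repetition).
    """
    hits = []
    for i in range(len(tokens) - 2):
        first, pause, nxt = tokens[i:i + 3]
        if (first["type"] == "w" and first.get("cutoff")
                and pause["type"] == "pause" and pause["dur"] > min_pause
                and nxt["type"] == "w"):
            repeated = nxt["form"].startswith(first["form"])
            hits.append((i, first["form"], nxt["form"], repeated))
    return hits


tokens = [
    {"type": "w", "form": "then"},
    {"type": "w", "form": "le", "cutoff": True},   # incomplete word
    {"type": "pause", "dur": 0.5},                  # non-word token with a numerical value
    {"type": "w", "form": "left"},
    {"type": "w", "form": "ri", "cutoff": True},
    {"type": "pause", "dur": 0.2},                  # too short to match
    {"type": "w", "form": "right"},
]
hits = cutoff_pause_hits(tokens)
```

The query needs a token type (word vs. pause), a flag on special word tokens, a numerical comparison and a relation between two tokens, which a plain regular expression over the surface text does not express directly.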


• „Filled pauses are less frequent in overlapping speech than at the beginning of turns“

• „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker's backchannel“

Requirement #5: Queries for position in temporal structure
Requirement #6: Multiple distance measures, query scopes
[…]
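
The two notions of "near" can be sketched with time-aligned tokens. The `Token` class and the three helper functions are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    speaker: str
    form: str
    start: float  # seconds
    end: float


def in_overlap(token, tokens):
    """True if any token of ANOTHER speaker overlaps this token in time."""
    return any(
        other.speaker != token.speaker
        and other.start < token.end and token.start < other.end
        for other in tokens
    )


def token_distance(tokens, a, b):
    """Distance in token positions within one speaker's own token sequence."""
    own = [t for t in tokens if t.speaker == a.speaker]
    return abs(own.index(a) - own.index(b))


def time_distance(a, b):
    """Gap in seconds between two tokens of any speakers (0.0 if they overlap)."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))


tokens = [
    Token("A", "uh", 0.0, 0.3),
    Token("A", "well", 0.3, 0.6),
    Token("A", "maybe", 0.6, 1.0),
    Token("B", "mhm", 0.9, 1.2),   # backchannel overlapping "maybe"
    Token("A", "uh", 1.5, 1.8),
]
```

"Near one another in an utterance" is a distance in tokens within one speaker's contribution, whereas "near another speaker's backchannel" is a distance in time across tiers, so a query system needs both measures.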


• Requirements
– Access to all types of context
– Manual post-processing of query results
– Queries for special tokens
– Queries with special properties
– Queries for position in temporal structure
– Multiple distance measures, query scopes
– …


Diagram labels: Corpus (Recordings, Metadata, Transcripts); Query; Query result; Context; Postprocessing


• EXAKT
– Regular expression on „full text“ of <u>
– (XPath on <u> with markup)
– (XSL on transcripts)
• DGD2
– Oracle full text on documents
– SQL on <w> with attributes
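
The contrast between the two approaches can be sketched with Python's standard library. Note the assumptions: SQLite stands in for Oracle purely so the example is self-contained, and the table `w` with its columns (`form`, `lemma`, `pos`) is a hypothetical schema, not the DGD2 database layout.

```python
import re
import sqlite3

# Approach 1: regular expression on the "full text" of each utterance.
utterance_texts = {
    "u1": "so you go left",
    "u2": "aye",
    "u3": "aye that is right",
}
regex_hits = [
    (u_id, match.start())
    for u_id, text in utterance_texts.items()
    for match in re.finditer(r"\baye\b", text)
]

# Approach 2: SQL on word tokens, one row per <w> with its attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE w (id INTEGER PRIMARY KEY, transcript TEXT, "
    "speaker TEXT, form TEXT, lemma TEXT, pos TEXT)"
)
conn.executemany(
    "INSERT INTO w (transcript, speaker, form, lemma, pos) VALUES (?, ?, ?, ?, ?)",
    [
        ("T1", "TOM", "goes", "go", "VERB"),
        ("T1", "TOM", "left", "left", "ADV"),
        ("T1", "ANN", "went", "go", "VERB"),
    ],
)
sql_hits = conn.execute(
    "SELECT speaker, form FROM w WHERE lemma = ? AND pos = ? ORDER BY id",
    ("go", "VERB"),
).fetchall()
conn.close()
```

The regular expression only sees surface strings, while the token table can be filtered on annotation attributes such as lemma and part of speech, at the cost of requiring parsed, tokenised transcripts.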


• Demo 1: EXAKT with HaMaTaC corpus
• HaMaTaC: Hamburg Map Task Corpus
– advanced L2 learners of German
– solving a map task
– Orthographic transcription with lemma, POS, disfluency annotation


• Demo 2: DGD2 with FOLK Corpus
• FOLK: Research & Teaching Corpus of Spoken German


• Future directions:
– Support a „real“ query language: CQL
– CQPWeb as a test case
– User survey DGD2 (approaching 2000 users!)
– …
– …
– TEI as common ground
  • for different spoken language corpora query platforms?
  • for querying spoken and written data side-by-side?
