Types and Tokens Distribution in TITUS
Distribution of Word Forms in the TITUS Corpus (Распределение словоформ в корпусе TITUS)
Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft
Universität Frankfurt am Main
E-Mail: [email protected]
26.06.2013
Outline
• TITUS Resource Data
• Peculiarities of TITUS texts
• Tokens and Types calculation in TITUS Resources
• Metadata for Tokens and Types distribution
Корпусная лингвистика 2013 (Corpus Linguistics 2013)
TITUS Resource Data
• TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien): http://titus.uni-frankfurt.de
• A token is a single concrete occurrence of a linguistic unit in a text; a type bundles all tokens that belong together (for example, all occurrences of the same word form).
• TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
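The token/type distinction can be illustrated with a minimal sketch. It assumes that tokens are whitespace-separated word forms and that forms differing only in case belong to the same type; both are simplifying assumptions for illustration, not the TITUS procedure. The function name `count_tokens_and_types` is hypothetical.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (tokens, types) for whitespace-separated word forms.

    Assumption: case differences are ignored, so "The" and "the"
    count as one type.
    """
    tokens = text.lower().split()
    types = Counter(tokens)  # one key per distinct word form
    return len(tokens), len(types)


sample = "in the beginning was the word and the word was with god"
print(count_tokens_and_types(sample))  # (12, 8): 12 tokens, 8 types
```

Here "the" occurs three times and "was" and "word" twice each, so 12 tokens collapse into 8 types.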
TITUS Data
http://www.clarin.eu/node/1512
Added by J. Gippert, R. Mittmann
TITUS Search Engine
• The TITUS Search Engine does not determine the number of tokens in a given text, but the number of quotations of a word.
Peculiarities of TITUS texts: Gothic
• Biblia Gothica contains additional parallel passages in Latin and Greek.
Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).
Peculiarities of TITUS texts: Old Church Slavonic
• Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in Cyrillic.
Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).
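If both renderings of such a text were counted together, every word form would presumably be counted twice. A minimal sketch of separating tokens by script before counting is given below. It assumes that the first word of a character's Unicode name identifies its script, which is a heuristic; the function names and the sample line are hypothetical and not taken from Codex Marianus.

```python
import unicodedata
from collections import Counter


def script_of(token):
    """Guess the script of a token from its first alphabetic character.

    Heuristic (assumption): the first word of the Unicode character name,
    e.g. "GLAGOLITIC SMALL LETTER ..." or "CYRILLIC SMALL LETTER ...".
    """
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0].title() if name else "Other"
    return "Other"


def count_by_script(line):
    """Count whitespace-separated tokens per script."""
    return Counter(script_of(token) for token in line.split())


# Made-up sample: two Glagolitic tokens (given as escapes) and two Cyrillic tokens.
glagolitic_1 = "\u2c33\u2c3e\u2c30\u2c33\u2c41\u2c3e\u2c4f"
glagolitic_2 = "\u2c31\u2c41\u2c33\u2c4f"
line = f"{glagolitic_1} глаголъ {glagolitic_2} богъ"
print(count_by_script(line))
```

With counts kept per script, one rendering can be chosen for the token total instead of adding both.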
Peculiarities of TITUS texts: Old Polish
• Old Polish texts display editions from different periods side by side.
Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).
Peculiarities of TITUS texts: Ossetian
• The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.
Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).
Peculiarities of TITUS texts: Russian-Low German
• Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language varieties.
Peculiarities of TITUS texts: Old Prussian
• The Old Prussian corpus contains at least 21 different languages or language variants (e.g. Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).
Creation
• A digitized source consists not only of words in the source language; it also contains various information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.
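# Remove optional digits and whitespace followed by the marker <†‡„> (bytes \x86\x87\x84)
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;
# Remove optional digits and whitespace followed by the note <?ConvertCheck: LevelNameTooLong>
$zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g;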
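The same kind of cleaning can be sketched in Python. This is only an illustration under stated assumptions: the three patterns below (anything in angle brackets, digit runs, punctuation) are hypothetical stand-ins for the categories named on the slide, not the rules actually used for TITUS, and the sample line is made up.

```python
import re

# Hypothetical patterns for material that is not part of the source text.
MARKUP = re.compile(r"<[^>]*>")        # tags and conversion notes in angle brackets
NUMBERS = re.compile(r"\d+")           # verse, line, or page numbers
PUNCTUATION = re.compile(r"[^\w\s]")   # anything that is neither a word character nor whitespace


def clean_line(line):
    """Strip markup, numbers, and punctuation; return the remaining word forms."""
    line = MARKUP.sub(" ", line)
    line = NUMBERS.sub(" ", line)
    line = PUNCTUATION.sub(" ", line)
    return line.split()


raw = "12 <?ConvertCheck: LevelNameTooLong> jah qaþ: guþ, 3"
print(clean_line(raw))  # ['jah', 'qaþ', 'guþ']
```

Removing all punctuation is a blunt choice: it would also split word forms that contain an apostrophe or hyphen, so a real procedure would need language-specific exceptions.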
Examples: Gothic
Gothic Bible. Old Testament Fragments. Total: 1629 tokens and 893 types

Language   Tokens   Types
Gothic        420     240
Latin         572     325
Greek         627     319
Examples: Gothic
Gothic Bible. New Testament Books. Total: 170215 tokens and 28876 types

Language   Tokens   Types
Gothic      61167    9121
Latin       52648    9036
Greek       56400   10719
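The totals in both Gothic examples are the sums of the per-language figures, for types as well as for tokens. The short check below makes this explicit. The figures are taken from the slides; the "General" row (10 tokens, 9 types) does not appear in the Old Testament table and comes from the CMDI metadata example later in the talk. The dictionary layout and the function name are illustrative assumptions.

```python
# Per-language (tokens, types) as given in the talk.
counts = {
    "Old Testament Fragments": {
        "total": (1629, 893),
        "languages": {
            "General": (10, 9),   # from the CMDI metadata example
            "Gothic": (420, 240),
            "Latin": (572, 325),
            "Greek": (627, 319),
        },
    },
    "New Testament Books": {
        "total": (170215, 28876),
        "languages": {
            "Gothic": (61167, 9121),
            "Latin": (52648, 9036),
            "Greek": (56400, 10719),
        },
    },
}


def summed(languages):
    """Add up tokens and types over all languages of one text."""
    tokens = sum(t for t, _ in languages.values())
    types = sum(ty for _, ty in languages.values())
    return tokens, types


for title, entry in counts.items():
    print(title, summed(entry["languages"]), entry["total"])
```

Since the type totals are plain sums, a word form attested in two languages is evidently counted once per language rather than once per text.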
Examples:
Tönnies Fenne's Manual (17th century)
The language of the textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.
Examples: further application
Metadata
• DC – Dublin Core
• TEI – Text Encoding Initiative
• CEI – Corpus Encoding Initiative
• IMDI – ISLE Meta Data Initiative
• OLAC – Open Language Archives Community
• CMDI – Component MetaData Infrastructure
CMDI - Component MetaData Infrastructure
http://www.clarin.eu/cmdi
TITUS Metadata: HTML Format
<HEAD>
  <TITLE>TITUS Texts: Biblia gothica: Frame</TITLE>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
  <META NAME="Author" CONTENT="Jost Gippert">
  <META NAME="Description" CONTENT="TITUS: Texts: Biblia gothica: Frame">
  <META NAME="KeyWords" CONTENT="TITUS Texte Texts Biblia gothica">
</HEAD>
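As a starting point for a richer metadata record, the existing HTML metadata could be read out programmatically. The sketch below uses Python's standard `html.parser` module; the class name `MetaCollector` is hypothetical, and the input is the header shown above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/HTTP-EQUIV and CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag == "meta":
            attributes = dict(attrs)
            key = attributes.get("name") or attributes.get("http-equiv")
            if key:
                self.meta[key] = attributes.get("content", "")


head = """<HEAD> <TITLE>TITUS Texts: Biblia gothica: Frame</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="Author" CONTENT="Jost Gippert">
<META NAME="Description" CONTENT="TITUS: Texts: Biblia gothica: Frame">
<META NAME="KeyWords" CONTENT="TITUS Texte Texts Biblia gothica"></HEAD>"""

collector = MetaCollector()
collector.feed(head)
print(collector.meta["Author"])  # Jost Gippert
```

The header carries only author, description, and keywords; it has no fields for language or word counts.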
New Metadata Set for TITUS
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContactEmail: existing
* ProjectContactOranisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new
Metadata Example for TITUS – XML CMDI
<ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
<ResourceWordcountGeneral>
  <Tokens>1629 Tokens</Tokens>
  <Types>893 Types</Types>
</ResourceWordcountGeneral>
<ResourceWordcountTT>
  <Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 1_General</Language>
  <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 5_Greek</Language>
  <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
</ResourceWordcountTT>
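Word-count elements of this shape could be generated from the counts rather than typed by hand. The following is a minimal sketch with Python's standard `xml.etree.ElementTree`. The element names follow the example above; the wrapping `Resource` element and the function name are assumptions made only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET


def wordcount_elements(general, per_language):
    """Build word-count elements from (tokens, types) pairs.

    The <Resource> wrapper is a hypothetical root, not part of the example above.
    """
    root = ET.Element("Resource")
    general_element = ET.SubElement(root, "ResourceWordcountGeneral")
    ET.SubElement(general_element, "Tokens").text = f"{general[0]} Tokens"
    ET.SubElement(general_element, "Types").text = f"{general[1]} Types"
    for language, (tokens, types) in per_language.items():
        entry = ET.SubElement(root, "ResourceWordcountTT")
        ET.SubElement(entry, "Language").text = language
        ET.SubElement(entry, "LanguageTokensTypes").text = f"{tokens} Tokens | {types} Types"
    return root


root = wordcount_elements(
    (1629, 893),
    {
        "Language 1_General": (10, 9),
        "Language 2_Gothic": (420, 240),
        "Language 4_Latin": (572, 325),
        "Language 5_Greek": (627, 319),
    },
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the elements from one data structure keeps the general total and the per-language entries from drifting apart.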
Metadata for TITUS – Browser
Thank you for your attention!
Links
• ARBIL (metadata editor): http://tla.mpi.nl/tools/tla-tools/arbil/
• CLARIN: http://www.clarin.eu
• CMDI: http://www.clarin.eu/cmdi
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
• IMDI: http://www.mpi.nl/IMDI/
• OLAC: http://www.language-archives.org/
• TEI: http://www.tei-c.org/index.xml
• TITUS: http://titus.uni-frankfurt.de