Multilinguality of Metadata: Measuring the Multilingual Degree of Europeana's Metadata
Juliane Stiller (1), Péter Király (2)
(1) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
ISI 2017, March 14, 2017
Languages by eltpics
Agenda
1. Multilinguality in Europeana
2. Multilingual Score for Metadata
3. Implementation
4. Discussion & Future Work
Europeana - Facts
○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
○ Text, images, video, audio, sounds, 3D
○ Over 54 million objects
○ > 50 languages
http://statistics.europeana.eu/europeana
Thumbnail
Metadata
Link to Provider
Metadata Multilinguality
+ 40 other languages ...
The Multilingual Problem
○ Mona Lisa: 456 results
○ La Gioconda: 365 results
○ La Joconde: 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Metadata Enrichment
Quantify the Multilinguality of Data to
○ Take measures to improve multilinguality in data
○ Establish a sense of the multilingual reach of Europeana
○ Distribution of languages
○ Devise strategies for underrepresented languages
Multilingual Score for Metadata
Multilingual saturation of metadata
Text w/o language annotation (dc.subject: Germany)
Text w language annotation (dc.subject: Germany@en)
Text w several language annotations (dc.subject: Germany@en, Deutschland@de)
Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
Calculation
score  case
NA     Missing field
0      Text string without language tag (language not known)
1      Text string with language tag (language known)
2      Text string with 2-3 different language tags
2.3    Text string with 4-9 different language tags
2.6    Text string with more than 10 different language tags
3      Link to (multilingual) vocabulary
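The scoring rules on this slide can be sketched as a small Python function. This is only an illustration, not the project's actual implementation: the function name `saturation_score` and its parameters are hypothetical, `None` stands in for NA, and because the slide leaves the case of exactly 10 language tags open, the sketch assumes it falls into the top text bucket (2.6).

```python
from typing import Optional, Sequence


def saturation_score(
    field_present: bool,
    language_tags: Sequence[str],
    is_vocabulary_link: bool = False,
) -> Optional[float]:
    """Return the multilingual saturation score of one field instance.

    None stands for NA (missing field). All names here are hypothetical.
    """
    if not field_present:
        return None  # NA: missing field
    if is_vocabulary_link:
        return 3.0  # link to a (multilingual) vocabulary
    # Count distinct, non-empty language tags, ignoring case.
    distinct = len({tag.lower() for tag in language_tags if tag})
    if distinct == 0:
        return 0.0  # text string without language tag
    if distinct == 1:
        return 1.0  # text string with one language tag
    if distinct <= 3:
        return 2.0  # 2-3 different language tags
    if distinct <= 9:
        return 2.3  # 4-9 different language tags
    return 2.6  # assumption: 10 or more tags share the top text bucket


# Examples in the spirit of the slides: Germany, Germany@en, Germany@en + Deutschland@de
untagged = saturation_score(True, [])
tagged = saturation_score(True, ["en"])
translated = saturation_score(True, ["en", "de"])
linked = saturation_score(True, [], is_vocabulary_link=True)
```

With these inputs the four values follow the 0, 1, 2, 3 progression of the scoring table.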
Example score
Text w/o language annotation (dc.subject: Germany): 0
Text w language annotation (dc.subject: Germany@en): 1
Text w several language annotations (dc.subject: Germany@en, Deutschland@de): 2
Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3
Aggregation of property dc:subject
The Wittgenstein Archives at the University of Bergen: high saturation
National Library Portugal: low saturation
http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average
Good examples

Descriptive fields
dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en

Subject headings
Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en
"Reichstag"@de
"Reichstag building"@en
Implementation
source codes: http://pkiraly.github.io/about/#source-codes
data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)
Data processing workflow
stage                 technologies                            output
ingestion             OAI-PMH, Europeana API, Hadoop, NoSQL   json
measuring             Spark, Hadoop, Java, Apache Solr        csv
statistical analysis  Spark, R                                json, png
web interface         PHP, D3.js, highchart.js, NoSQL         html, svg
Visualization
APIs, abstraction, reusing
"Place/skos:altLabel": {
  "instances": [
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    ...
    {"TRANSLATION": 2.40},
    {"STRING": 0.0}
  ],
  "score": {
    "sum": 20.40,
    "average": 1.85454545,
    "normalized": 0.649681
  }
}
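The "score" object in this output can be reproduced with a short Python sketch. Two points are assumptions rather than facts from the slides: the elided instances are taken to be further TRANSLATION entries of 2.0 (giving 11 instances in total, which matches the reported sum and average), and the normalized value is assumed to be average / (1 + average), a formula that happens to agree with the reported number but is not stated in the deck. The function name `aggregate` is hypothetical.

```python
def aggregate(instance_scores):
    """Aggregate per-instance scores of one field into sum, average, normalized.

    The normalization formula average / (1 + average) is an assumption
    inferred from the numbers shown on the slide.
    """
    if not instance_scores:
        return None  # nothing to aggregate (field missing everywhere)
    total = sum(instance_scores)
    average = total / len(instance_scores)
    normalized = average / (1.0 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Assumed reconstruction of the elided list: nine 2.0 values, then 2.4 and 0.0.
scores = [2.0] * 9 + [2.4, 0.0]
result = aggregate(scores)
```

Under these assumptions the sum, average and normalized values come out close to the 20.40, 1.85454545 and 0.649681 shown in the JSON fragment.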
Discussion & Future Work
extension I. recalculation
The new metrics
★ Distinct languages per object
★ Language tags per object
★ Literals per language
★ Number of multilingual properties (a.k.a. fields)
★ Number of multilingual statements (a.k.a. field instances)
★ Average number of languages per property with language
★ Average number of languages per proxy
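A few of these metrics can be sketched in Python for a single object. The record layout (a dict mapping each property to a list of `(value, language tag or None)` pairs), the function name `language_metrics`, and the exact reading of each metric are assumptions made for illustration; the slides do not define them precisely.

```python
from collections import Counter


def language_metrics(record):
    """Compute simple language metrics for one object (hypothetical layout)."""
    literals_per_language = Counter()
    languages_per_property = {}
    for prop, literals in record.items():
        # Distinct language tags used by this property.
        langs = {lang for _, lang in literals if lang}
        if langs:
            languages_per_property[prop] = len(langs)
        # Count every language-tagged literal.
        for _, lang in literals:
            if lang:
                literals_per_language[lang] += 1
    tagged_properties = len(languages_per_property)
    return {
        "distinct_languages": len(literals_per_language),
        "language_tags": sum(literals_per_language.values()),
        "literals_per_language": dict(literals_per_language),
        "avg_languages_per_property_with_language": (
            sum(languages_per_property.values()) / tagged_properties
            if tagged_properties
            else 0.0
        ),
    }


# Sample values borrowed from the examples earlier in the deck.
record = {
    "dc:title": [
        ("Die Mauer muß weg!", "de"),
        ("Die Mauer muß weg! (The Wall must go!)", "en"),
    ],
    "dc:subject": [("Germany", None)],
    "Place/skos:prefLabel": [
        ("Brandenburger Tor", "de"),
        ("Brandenburg Gate", "en"),
        ("Reichstag", "de"),
    ],
}
metrics = language_metrics(record)
```

Properties without any language tag (here `dc:subject`) are left out of the per-property average, which is one possible reading of "per property with language".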
extension II. record views
ex:providerProxy
    dc:subject "special relativity"@en ;
    dc:creator <http://vocab.getty.edu/ulan/500240971> ;
    dc:type <http://udcdata.info/001684> .

ex:europeanaProxy
    dc:subject <http://dbpedia.org/resource/Physics> .

# standard vocabulary
<http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de .

# standard vocabulary
<http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en .

# non-standard vocabulary
<http://udcdata.info/001684> skos:prefLabel "Books in general"@en .
extension II. record views
source             field       link      value                     views
ex:providerProxy   dc:subject  literal   "special relativity"@en   ① ② ③ ④
ex:providerProxy   dc:creator  standard  "Einstein, Albert"@de     ① ② ③ ④
ex:providerProxy   dc:type     non-std   "Books in general"@en     ② ④
ex:europeanaProxy  dc:subject  standard  "Physics"@en              ③ ④

① data provider's proxy and standard enrichments
② data provider's proxy and enrichments
③ all proxies and standard enrichments
④ all proxies and enrichments
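The assignment of statements to the four record views can be sketched as a small Python function. The rule is inferred from the table rows rather than taken from the project's code: literals are treated like standard-vocabulary values, and the function name `record_views` and the string constants are hypothetical.

```python
def record_views(source: str, link: str) -> set:
    """Return the numbers (1-4) of the record views that include a statement.

    source: "ex:providerProxy" or "ex:europeanaProxy"
    link:   "literal", "standard" or "non-std"
    """
    from_provider = source == "ex:providerProxy"
    # Assumption: literals count alongside standard-vocabulary values.
    standard_or_literal = link in ("literal", "standard")

    views = {4}  # view 4: all proxies and all enrichments
    if standard_or_literal:
        views.add(3)  # view 3: all proxies, standard enrichments only
    if from_provider:
        views.add(2)  # view 2: provider proxy, all enrichments
    if from_provider and standard_or_literal:
        views.add(1)  # view 1: provider proxy, standard enrichments only
    return views


# The four statements from the table above.
subject_views = record_views("ex:providerProxy", "literal")
creator_views = record_views("ex:providerProxy", "standard")
type_views = record_views("ex:providerProxy", "non-std")
europeana_subject_views = record_views("ex:europeanaProxy", "standard")
```

Each result matches the view markers listed for the corresponding row of the table.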
Questions
○[email protected]@gwdg.de
○ Metadata Quality Assurance Framework: http://144.76.218.178/europeana-qa
○ Europeana Data Quality Committee: http://pro.europeana.eu/page/data-quality-committee
Appendix
Europeana data structure in 30 sec
provider proxy
Europeana proxy
Agent
Concept
Place
Timespan
descriptive fields
subject headings
semantic web