Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana’s Metadata

Juliane Stiller (1), Péter Király (2)

(1) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

ISI 2017, March 14, 2017

Languages by eltpics



Page 2

Agenda

1. Multilinguality in Europeana
2. Multilingual Score for Metadata
3. Implementation
4. Discussion & Future Work

Page 3

Platform for Cultural Heritage Material

www.europeana.eu

Page 4

Europeana - Facts

○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
○ Text, images, video, audio, sounds, 3D
○ Over 54 million objects
○ > 50 languages

http://statistics.europeana.eu/europeana

Page 5

Thumbnail

Metadata

Link to Provider

Page 6

Metadata Multilinguality

+ 40 other languages ...

Page 7

The Multilingual Problem

○ Mona Lisa: 456 results
○ La Gioconda: 365 results
○ La Joconde: 71 results

http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Page 8

Metadata Enrichment

Page 9

Quantify the Multilinguality of Data to

○ Take measures to improve multilinguality in data
○ Establish a sense of the multilingual reach of Europeana
○ Determine the distribution of languages
○ Devise strategies for underrepresented languages

Page 10

Multilingual Score for Metadata

Page 11

Multilingual saturation of metadata

○ Text without language annotation (dc.subject: Germany)
○ Text with language annotation (dc.subject: Germany@en)
○ Text with several language annotations (dc.subject: Germany@en, Deutschland@de)
○ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)

Page 12

Calculation

Field content                                            Score
Missing field                                            NA
Text string without language tag (language not known)    0
Text string with language tag (language known)           1
Text string with 2-3 different language tags             2
Text string with 4-9 different language tags             2.3
Text string with more than 10 different language tags    2.6
Link to (multilingual) vocabulary                        3
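The scoring scheme above can be sketched as a small function. This is only an illustration, not the authors' implementation: the function name `field_score`, its parameters, and the use of `None` for "NA" are hypothetical. The slide lists "4-9" and "more than 10" tags, so the treatment of exactly 10 tags is not specified; the sketch assumes it falls into the top bucket.

```python
from typing import List, Optional


def field_score(language_tags: Optional[List[str]],
                is_vocabulary_link: bool = False) -> Optional[float]:
    """Score one metadata field following the table on the slide.

    language_tags: language tags attached to the field's text strings,
        an empty list if the text carries no tag, or None if the field
        is missing (hypothetical representation).
    is_vocabulary_link: True if the field links to a (multilingual)
        vocabulary.
    """
    if is_vocabulary_link:
        return 3.0
    if language_tags is None:
        return None  # missing field, reported as NA on the slide
    distinct = {tag.lower() for tag in language_tags if tag}
    count = len(distinct)
    if count == 0:
        return 0.0  # text without language tag
    if count == 1:
        return 1.0  # text with one known language
    if count <= 3:
        return 2.0
    if count <= 9:
        return 2.3
    # Assumption: exactly 10 tags is treated like "more than 10".
    return 2.6


print(field_score([]))            # dc.subject: Germany
print(field_score(["en"]))        # dc.subject: Germany@en
print(field_score(["en", "de"]))  # Germany@en, Deutschland@de
print(field_score([], is_vocabulary_link=True))
```

Returning `None` for a missing field keeps "NA" distinguishable from a score of 0, which matters when averages are computed later.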

Page 13

Example score

○ Text without language annotation (dc.subject: Germany): 0
○ Text with language annotation (dc.subject: Germany@en): 1
○ Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2
○ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3

Page 14

Aggregation of property dc:subject

○ The Wittgenstein Archives at the University of Bergen: high saturation
○ National Library of Portugal: low saturation

http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average
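Aggregating per-instance scores into field-level figures might look like the sketch below. It reproduces the sum, average and normalized values shown in the JSON sample on Page 19 (20.40, 1.85454545, 0.649681). Two things are assumptions rather than facts from the slides: the elided instances are taken to score 2.0 each (nine of them, which is what the reported sum and average would imply), and the normalization is taken to be average / (1 + average), a formula inferred only because it matches the reported number. The function name `aggregate` is hypothetical.

```python
from typing import Dict, List, Optional


def aggregate(scores: List[float]) -> Optional[Dict[str, float]]:
    """Aggregate instance scores of one field into sum, average, normalized."""
    if not scores:
        return None  # nothing to aggregate
    total = sum(scores)
    average = total / len(scores)
    # Assumption: inferred from the slide's numbers, not stated in the deck.
    normalized = average / (1.0 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Assumption: nine instances at 2.0, plus the 2.40 and 0.0 shown on the slide.
instance_scores = [2.0] * 9 + [2.4, 0.0]
result = aggregate(instance_scores)
print(result)
```

Averaging the same figures over all records of a collection would give a per-collection value of the kind compared above.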

Page 15

Good examples

Descriptive fields

dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en

dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en

Subject headings

Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en

"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en

"Reichstag"@de
"Reichstag building"@en

Page 16

Implementation

source codes: http://pkiraly.github.io/about/#source-codes

data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)

Page 17

Data processing workflow

Stage                  Technologies                             Output
ingestion              OAI-PMH, Europeana API, Hadoop, NoSQL    json
measuring              Spark, Hadoop, Java, Apache Solr         csv
statistical analysis   Spark, R                                 json, png
web interface          PHP, D3.js, highchart.js, NoSQL          html, svg

Page 18

Visualization

Page 19

APIs, abstraction, reusing

"Place/skos:altLabel": {
  "instances": [
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    ...
    {"TRANSLATION": 2.40},
    {"STRING": 0.0}
  ],
  "score": {
    "sum": 20.40,
    "average": 1.85454545,
    "normalized": 0.649681
  }
}

Page 20

Discussion & Future Work

Page 21

extension I. recalculation

The new metrics

★ Distinct languages per object
★ Language tags per object
★ Literals per language
★ Number of multilingual properties (a.k.a. fields)
★ Number of multilingual statements (a.k.a. field instances)
★ Average number of languages per property with language
★ Average number of languages per proxy
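Several of these metrics can be illustrated with a short sketch. The slides do not define them precisely, so everything here is an assumption: the record layout (a dict mapping a property name to a list of `(value, language)` pairs, with `None` for an untagged literal), the function name `multilingual_metrics`, and the reading of "multilingual property" as a property carrying at least two distinct languages. The sample literals are taken from the examples earlier in the deck.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Dict[str, List[Tuple[str, Optional[str]]]]


def multilingual_metrics(record: Record) -> dict:
    """Compute a subset of the proposed metrics for one object."""
    # All language tags in the record, one per tagged literal.
    tagged = [lang for literals in record.values()
              for _, lang in literals if lang]
    # Distinct languages per property.
    per_property = {
        prop: {lang for _, lang in literals if lang}
        for prop, literals in record.items()
    }
    with_language = [langs for langs in per_property.values() if langs]
    return {
        "distinct_languages": len(set(tagged)),
        "language_tags": len(tagged),
        "literals_per_language": dict(Counter(tagged)),
        # Assumption: "multilingual" means two or more distinct languages.
        "multilingual_properties": sum(
            1 for langs in with_language if len(langs) >= 2),
        "avg_languages_per_property_with_language": (
            sum(len(langs) for langs in with_language) / len(with_language)
            if with_language else 0.0),
    }


record: Record = {
    "dc:title": [("Die Mauer muß weg!", "de"),
                 ("Die Mauer muß weg! (The Wall must go!)", "en")],
    "dc:description": [
        ("Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin", "de"),
        ("Annotated images from 1989-1990 in Berlin", "en")],
    "dc:subject": [("Germany", None)],
}
metrics = multilingual_metrics(record)
print(metrics)
```

The untagged `dc:subject` literal contributes to none of the counts, which is why properties without any language tag are excluded from the average.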

Page 22

extension II. record views

ex:providerProxy
    dc:subject "special relativity"@en ;
    dc:creator <http://vocab.getty.edu/ulan/500240971> ;
    dc:type <http://udcdata.info/001684> .

ex:europeanaProxy
    dc:subject <http://dbpedia.org/resource/Physics> .

# standard vocabulary
<http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de .

# standard vocabulary
<http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en .

# non-standard vocabulary
<http://udcdata.info/001684> skos:prefLabel "Books in general"@en .

Page 23

extension II. record views

source              field        link       value                      views
ex:providerProxy    dc:subject   literal    "special relativity"@en    ① ② ③ ④
ex:providerProxy    dc:creator   standard   "Einstein, Albert"@de      ① ② ③ ④
ex:providerProxy    dc:type      non-std    "Books in general"@en      ② ④
ex:europeanaProxy   dc:subject   standard   "Physics"@en               ③ ④

① data provider's proxy and standard enrichments
② data provider's proxy and enrichments
③ all proxies and standard enrichments
④ all proxies and enrichments
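The four record views can be read as two independent switches: which proxies are included, and whether non-standard vocabulary links count. The sketch below encodes that reading and reproduces the view membership listed in the table. The `Statement` class, the `VIEWS` mapping and the helper names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Statement:
    source: str  # "ex:providerProxy" or "ex:europeanaProxy"
    field: str
    link: str    # "literal", "standard" or "non-std"
    value: str


STATEMENTS = [
    Statement("ex:providerProxy", "dc:subject", "literal",
              '"special relativity"@en'),
    Statement("ex:providerProxy", "dc:creator", "standard",
              '"Einstein, Albert"@de'),
    Statement("ex:providerProxy", "dc:type", "non-std",
              '"Books in general"@en'),
    Statement("ex:europeanaProxy", "dc:subject", "standard",
              '"Physics"@en'),
]

# View number -> (include all proxies?, include non-standard enrichments?)
VIEWS = {
    1: (False, False),  # data provider's proxy and standard enrichments
    2: (False, True),   # data provider's proxy and enrichments
    3: (True, False),   # all proxies and standard enrichments
    4: (True, True),    # all proxies and enrichments
}


def in_view(stmt: Statement, view: int) -> bool:
    all_proxies, all_enrichments = VIEWS[view]
    proxy_ok = all_proxies or stmt.source == "ex:providerProxy"
    link_ok = all_enrichments or stmt.link in ("literal", "standard")
    return proxy_ok and link_ok


def views_for(stmt: Statement) -> List[int]:
    return [view for view in sorted(VIEWS) if in_view(stmt, view)]


for stmt in STATEMENTS:
    print(stmt.source, stmt.field, views_for(stmt))
```

Under this reading, each view is simply a filter over the same statements, so a multilingual score could be computed once per view from the filtered set.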

Page 24

Questions

[email protected]@gwdg.de

○ Metadata Quality Assurance Framework: http://144.76.218.178/europeana-qa
○ Europeana Data Quality Committee: http://pro.europeana.eu/page/data-quality-committee

Page 25

Appendix

Europeana data structure in 30 sec

[Diagram labels: provider proxy, Europeana proxy, Agent, Concept, Place, Timespan; descriptive fields, subject headings, semantic web]