Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's Metadata


Juliane Stiller (1), Péter Király (2)

(1) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin

(2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

ISI 2017, March 14, 2017


Languages by eltpics

Agenda

1. Multilinguality in Europeana
2. Multilingual Score for Metadata
3. Implementation
4. Discussion & Future Work


Platform for Cultural Heritage Material

www.europeana.eu


Europeana - Facts

○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
○ Text, images, video, audio, sounds, 3D
○ Over 54 million objects
○ > 50 languages

http://statistics.europeana.eu/europeana

Thumbnail

Metadata

Link to Provider

Metadata Multilinguality

+ 40 other languages ...

The Multilingual Problem

○ Mona Lisa: 456 results
○ La Gioconda: 365 results
○ La Joconde: 71 results

http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html

Metadata Enrichment


Quantify the Multilinguality of Data to

○ Take measures to improve multilinguality in data
○ Establish a sense of the multilingual reach of Europeana
○ Distribution of languages
○ Devise strategies for underrepresented languages

Multilingual Score for Metadata


Multilingual saturation of metadata


Text w/o language annotation (dc.subject: Germany)

Text w language annotation (dc.subject: Germany@en)

Text w several language annotations (dc.subject: Germany@en, Deutschland@de)

Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)

Calculation

Missing field: NA
Text string without language tag (language not known): 0
Text string with language tag (language known): 1
Text string with 2-3 different language tags: 2
Text string with 4-9 different language tags: 2.3
Text string with more than 10 different language tags: 2.6
Link to (multilingual) vocabulary: 3
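The scoring rules above can be sketched as a small Python function. This is only an illustration, not the authors' implementation: the function name `saturation_score` and its parameters are hypothetical, and the slide does not say which score applies to exactly 10 distinct language tags, so the sketch assumes that case falls into the highest text bucket (2.6).

```python
from typing import Optional, Sequence


def saturation_score(
    language_tags: Optional[Sequence[Optional[str]]],
    is_vocabulary_link: bool = False,
) -> Optional[float]:
    """Score one field according to the calculation slide.

    language_tags: None (or empty) if the field is missing; otherwise one
    entry per literal, with None for a literal without a language tag.
    is_vocabulary_link: True if the field links to a (multilingual) vocabulary.
    """
    if is_vocabulary_link:
        return 3.0
    if not language_tags:
        return None  # "NA": the field is missing
    # Count distinct language tags, ignoring untagged literals.
    distinct = {tag.lower() for tag in language_tags if tag}
    count = len(distinct)
    if count == 0:
        return 0.0  # text without language tag
    if count == 1:
        return 1.0  # text with one known language
    if count <= 3:
        return 2.0
    if count <= 9:
        return 2.3
    # Assumption: exactly 10 tags is treated like "more than 10".
    return 2.6
```

For the dc.subject examples, `saturation_score([None])` corresponds to "Germany", `saturation_score(["en"])` to "Germany@en", and `saturation_score(["en", "de"])` to "Germany@en, Deutschland@de".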

Example score


Text w/o language annotation (dc.subject: Germany): 0

Text w language annotation (dc.subject: Germany@en): 1

Text w several language annotations (dc.subject: Germany@en, Deutschland@de): 2

Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3

Aggregation of property dc:subject

The Wittgenstein Archives at the University of Bergen: high saturation

National Library Portugal: low saturation

http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average

Good examples

Descriptive fields

dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en

dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en

Subject headings

Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en

"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en

"Reichstag"@de
"Reichstag building"@en

Implementation

source code: http://pkiraly.github.io/about/#source-codes

data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)

Data processing workflow

ingestion → measuring → statistical analysis → web interface

ingestion (output: json)
★ OAI-PMH
★ Europeana API
★ Hadoop
★ NoSQL

measuring (output: csv)
★ Spark
★ Hadoop
★ Java
★ Apache Solr

statistical analysis (output: json, png)
★ Spark
★ R

web interface (output: html, svg)
★ PHP
★ D3.js
★ highchart.js
★ NoSQL

Visualization


APIs, abstraction, reusing

"Place/skos:altLabel": {
  "instances": [
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    ...
    {"TRANSLATION": 2.40},
    {"STRING": 0.0}
  ],
  "score": {
    "sum": 20.40,
    "average": 1.85454545,
    "normalized": 0.649681
  }
}

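The "score" block of the API response above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the slides do not give the normalization formula, but average / (1 + average) matches the numbers shown, so it is used here as an inferred formula. The instance list is also hypothetical, since the slide elides some entries; it is one list of 11 scores consistent with the displayed sum.

```python
def aggregate(instance_scores):
    """Aggregate per-instance scores into sum, average, and a normalized value."""
    total = sum(instance_scores)
    average = total / len(instance_scores)
    # Assumption: inferred from the numbers on the slide, not documented there.
    normalized = average / (1.0 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Hypothetical instance list: nine translations scored 2.0, one scored 2.40,
# and one plain string scored 0.0 (11 instances, sum 20.40).
scores = [2.0] * 9 + [2.40, 0.0]
result = aggregate(scores)
print(result)
```

With this list the average is about 1.8545 and the normalized value about 0.6497, in line with the figures in the response.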
Discussion & Future Work


extension I. recalculation

The new metrics
★ Distinct languages per object
★ Language tags per object
★ Literals per language
★ Number of multilingual properties (a.k.a. fields)
★ Number of multilingual statements (a.k.a. field instances)
★ Average number of languages per property with language
★ Average number of languages per proxy
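Several of these metrics can be illustrated on a toy record. The sketch below is one possible reading of the metric names, not the authors' definitions: the record layout (a dict from property name to a list of `(text, language_tag)` pairs), the function name `language_metrics`, and the result keys are all hypothetical. It treats a property as multilingual when it carries more than one distinct language tag.

```python
from collections import Counter


def language_metrics(record):
    """Compute a few language metrics for one object.

    record: dict mapping a property name to a list of (text, language_tag)
    pairs, where language_tag is None for untagged literals.
    """
    # All language tags found in the object, one entry per tagged literal.
    tags = [lang for literals in record.values() for _, lang in literals if lang]
    # Distinct languages per property, ignoring untagged literals.
    per_property = {
        prop: {lang for _, lang in literals if lang}
        for prop, literals in record.items()
    }
    tagged = [langs for langs in per_property.values() if langs]
    multilingual = [langs for langs in tagged if len(langs) > 1]
    return {
        "distinct_languages": len(set(tags)),
        "language_tags": len(tags),
        "literals_per_language": dict(Counter(tags)),
        "multilingual_properties": len(multilingual),
        "avg_languages_per_tagged_property": (
            sum(len(langs) for langs in tagged) / len(tagged) if tagged else 0.0
        ),
    }


# Toy record built from examples used elsewhere in these slides.
record = {
    "dc:title": [
        ("Die Mauer muß weg!", "de"),
        ("Die Mauer muß weg! (The Wall must go!)", "en"),
    ],
    "dc:subject": [("Germany", None)],
}
metrics = language_metrics(record)
print(metrics)
```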


extension II. record views

ex:providerProxy
    dc:subject "special relativity"@en ;
    dc:creator <http://vocab.getty.edu/ulan/500240971> ;
    dc:type <http://udcdata.info/001684> .

ex:europeanaProxy
    dc:subject <http://dbpedia.org/resource/Physics> .

standard vocabulary:
<http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de .

standard vocabulary:
<http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en .

non-standard vocabulary:
<http://udcdata.info/001684> skos:prefLabel "Books in general"@en .


extension II. record views

source             field       link      value                    ① ② ③ ④
ex:providerProxy   dc:subject  literal   "special relativity"@en  ① ② ③ ④
                   dc:creator  standard  "Einstein, Albert"@de    ① ② ③ ④
                   dc:type     non-std   "Books in general"@en      ②   ④
ex:europeanaProxy  dc:subject  standard  "Physics"@en                 ③ ④

① data provider's proxy and standard enrichments
② data provider's proxy and enrichments
③ all proxies and standard enrichments
④ all proxies and enrichments
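The four record views can be read as two independent filters: which proxies are included, and whether values resolved from non-standard vocabularies are included. The Python sketch below encodes that reading; the function name `views_for` and the string labels for source and link type are hypothetical, chosen only to mirror the table.

```python
def views_for(source, link):
    """Return the set of view numbers (1-4) that include a statement.

    source: "provider" for the data provider's proxy, "europeana" otherwise.
    link: "literal", "standard", or "non-std", as in the table's link column.
    """
    from_provider = source == "provider"
    # Literals and standard-vocabulary links pass the "standard" filter.
    standard_ok = link in ("literal", "standard")
    views = set()
    if from_provider and standard_ok:
        views.add(1)  # provider's proxy and standard enrichments
    if from_provider:
        views.add(2)  # provider's proxy and all enrichments
    if standard_ok:
        views.add(3)  # all proxies and standard enrichments
    views.add(4)  # all proxies and all enrichments
    return views


print(views_for("provider", "non-std"))
print(views_for("europeana", "standard"))
```

Applied to the four rows of the table, this yields the same view assignments as shown there.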


Questions

○ Contact
  juliane.stiller@ibi.hu-berlin.de
  peter.kiraly@gwdg.de

○ Metadata Quality Assurance Framework
  http://144.76.218.178/europeana-qa

○ Europeana Data Quality Committee
  http://pro.europeana.eu/page/data-quality-committee


Appendix

Europeana data structure in 30 sec

provider proxy

Europeana proxy

Agent

Concept

Place

Timespan

descriptive fields

subject headings

semantic web
