Metadata Quality Assurance

Péter Király [email protected]
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

Page 2: Metadata Quality Assurance

Metadata Quality Assurance Framework

What is metadata?

- Data about data
- Specifically: descriptive data about ...
  - digitized (or physical) objects such as paintings, books, photos
  - larger datasets such as research data
- Provides access points to the underlying data

Page 3: Metadata Quality Assurance

Why is data quality important?

„Fitness for purpose”

no metadata → no access to data → no data usage

More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015, http://www.w3.org/TR/2015/WD-dwbp-20151217/

Page 4: Metadata Quality Assurance

Symptoms of bad quality metadata

- Hard to identify („What is it?”)
- Hard to distinguish from other records
- Misleading descriptions
- Uninterpretable descriptions
- Missing fields
- Not reusable (lost original context)
- Hard to find

Page 5: Metadata Quality Assurance

Some typical issues

Title is not informative

Page 6: Metadata Quality Assurance

Mixing different data types

Numeric

RDF resource

Page 7: Metadata Quality Assurance

Field overuse

What is the meaning of the field?

identifier, relation, source

TextGrid OAI-PMH response

Page 8: Metadata Quality Assurance

Copy & paste cataloguing

Keeping placeholders / templates

Page 9: Metadata Quality Assurance

Same entity, differently recorded

- lucas cranach der ältere
- Cranach, Lucas (der Ältere) [Herstellung]
- Cranach, Lucas (I) (naar tekening van)
- Cranach, Lucas vanem (autor)

Result of entity detection:
- http://dbpedia.org/resource/Lucas_Cranach_the_Elder
- http://viaf.org/viaf/49268177/
- none

Page 10: Metadata Quality Assurance

Same entity recorded differently

Different displays and content:
- http://dbpedia.org/resource/Lucas_Cranach_the_Elder
- http://viaf.org/viaf/49268177/
- none

Page 11: Metadata Quality Assurance

What to measure?

[Figure: grid of records (rows: doc1, doc2, doc3, doc3) by fields (columns: field1, field2, field3, field4)]

An overall value for a record

Page 12: Metadata Quality Assurance

What to measure?

[Figure: grid of records (rows: doc1, doc2, doc3, doc3) by fields (columns: field1, field2, field3, field4)]

An overall value for a record set (e.g. a collection from the same source)

Page 13: Metadata Quality Assurance

What to measure?

[Figure: grid of records (rows: doc1, doc2, doc3, doc3) by fields (columns: field1, field2, field3, field4)]

An overall value for a field: how do users utilize the field?

Page 14: Metadata Quality Assurance

What to measure?

[Figure: grid of records (rows: doc1, doc2, doc3, doc3) by fields (columns: field1, field2, field3, field4)]

Field group: a group of fields that together support a given functionality, e.g. display, search, identify, re-use, multilinguality.
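The four levels of measurement on the "What to measure?" slides (record, record set, field, field group) can be sketched over a small record-by-field grid. This is only an illustration: the toy records, the `is_filled` rule, and the choice of "share of filled fields" as the score are assumptions, not the framework's actual implementation.

```python
from statistics import mean

# Hypothetical toy data: each record maps a field name to a value.
# None or an empty string counts as "missing" (an assumption for this sketch).
FIELDS = ["field1", "field2", "field3", "field4"]
RECORDS = {
    "doc1": {"field1": "a", "field2": "b", "field3": None, "field4": "d"},
    "doc2": {"field1": "a", "field2": "", "field3": None, "field4": None},
    "doc3": {"field1": "a", "field2": "b", "field3": "c", "field4": "d"},
}


def is_filled(value):
    """A value counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def record_score(record, fields=FIELDS):
    """Overall value for one record: share of the given fields that are filled."""
    return mean(1.0 if is_filled(record.get(f)) else 0.0 for f in fields)


def record_set_score(records, fields=FIELDS):
    """Overall value for a record set: mean of the record-level scores."""
    return mean(record_score(r, fields) for r in records.values())


def field_score(records, field):
    """Overall value for a field: share of records in which it is filled."""
    return mean(1.0 if is_filled(r.get(field)) else 0.0 for r in records.values())


def field_group_score(records, group):
    """Overall value for a field group: record set score restricted to the group."""
    return record_set_score(records, fields=group)
```

For example, `record_score(RECORDS["doc1"])` gives 0.75 because three of the four fields are filled, while `field_score(RECORDS, "field1")` gives 1.0 because every record has that field.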

Page 15: Metadata Quality Assurance

Grouping fields by functionalities

Functionalities (columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality

Fields (rows), each with its × marks (column positions are not preserved):
dc:title × × × × ×
dcterms:alternative × × × ×
dc:description × × × × × ×
dc:creator × × × ×
dc:publisher × ×
dc:contributor ×

Created by Valentine Charles, Europeana Research and Development team
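A per-functionality score could be derived from such a grouping roughly as follows. The mapping below is invented purely for illustration and is not the assignment from the table above; the sample record is likewise hypothetical.

```python
# ASSUMPTION: this field-to-functionality mapping is made up for the sketch.
# It does NOT reproduce the assignment in the slide's table.
FUNCTIONALITY_FIELDS = {
    "Descriptiveness": ["dc:title", "dc:description", "dc:creator"],
    "Searchability": ["dc:title", "dc:description"],
    "Identification": ["dc:title", "dcterms:alternative"],
}


def functionality_scores(record, groups=FUNCTIONALITY_FIELDS):
    """For each functionality, return the share of its fields present in the record."""
    scores = {}
    for name, fields in groups.items():
        present = sum(1 for f in fields if record.get(f))
        scores[name] = present / len(fields)
    return scores


# Hypothetical record with only a title and a creator.
example = {
    "dc:title": ["Portrait of a man"],
    "dc:creator": ["Cranach, Lucas (der Ältere)"],
}
scores = functionality_scores(example)
```

With this made-up mapping, the example record supports "Searchability" at 0.5 (title present, description missing), which shows how one missing field can weaken several functionalities at once.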

Page 16: Metadata Quality Assurance

Metrics

The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.
- Completeness
- Accuracy
- Conformance to expectations
- Logical consistency and coherence
- Accessibility
- Timeliness
- Provenance

Page 17: Metadata Quality Assurance

Data sources

Europeana – the European digital library, museum and archive: 48M+ metadata records in EDM (Europeana Data Model) schema

TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records

Research data from the Göttingen Campus

Library catalogue records in MARC (Machine-Readable Cataloging) schema

Other open data

Page 18: Metadata Quality Assurance

Method: collection – measuring – sharing

Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.

Issues:
- GWDG cloud: 160 GB, Europeana: 300 GB
- low I/O performance
- Europeana OAI-PMH is in a „beta” state
- OAI-PMH requires 10M+ HTTP requests
- REST API requires 50M+ HTTP requests
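The request counts follow from how OAI-PMH harvesting works: each `ListRecords` response carries one page of records plus a resumption token for the next page. A minimal sketch of that loop, assuming a hypothetical endpoint URL and metadata prefix supplied by the caller (this is not the harvester from the project repositories):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, fetch=http_get):
    """Yield <record> elements from ListRecords, following resumption tokens.

    `fetch` is injectable so the loop can be exercised without a network.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the last page
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Because every page costs one HTTP round trip, the total number of requests is roughly the number of records divided by the server's page size, which is why a collection of this size needs millions of requests.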

Page 19: Metadata Quality Assurance

Method: collection – measuring – sharing

Measuring records
- Big data, so it should be scalable
- Apache Hadoop: MapReduce and friends
- Pluggable architecture: „meters”
- UI: set parameters for meters
- input: records, schema, meters, config files
- output: identifier, projected metadata fields, metric1, metric2, metric3 ... metricN
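The pluggable "meters" idea can be pictured as a set of functions that each take a record plus configuration and return one number, with the framework emitting one output row per record. The sketch below is plain Python rather than Hadoop, and the meter names, record layout, and config keys are all hypothetical; the per-record function is the part a map step would run.

```python
import csv
import io


def completeness(record, config):
    """Meter: share of the configured fields that are non-empty."""
    fields = config["fields"]
    filled = sum(1 for f in fields if record.get(f))
    return filled / len(fields)


def title_length(record, config):
    """Meter: number of characters in the title (0 if missing)."""
    return float(len(record.get("title") or ""))


# Registry of plugged-in meters; adding a meter means adding an entry here.
METERS = {"completeness": completeness, "title_length": title_length}


def measure(records, meters, config, id_field="id"):
    """Yield one row per record: the identifier followed by one value per meter."""
    for record in records:
        yield [record[id_field]] + [meter(record, config) for meter in meters.values()]


def to_csv(rows, header):
    """Serialize the measurement rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Called with two records and `config = {"fields": ["title", "creator"]}`, `measure` yields rows of the form identifier, metric1, metric2, matching the output shape listed on the slide.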

Page 20: Metadata Quality Assurance

Method: collection – measuring – sharing

Statistical analysis
- Calculating descriptive statistics with R/Julia/other tool
- Derivation of numbers representing collections and fields from the record level measurements
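Deriving collection-level numbers from record-level measurements amounts to grouping the scores and summarizing each group. The slide names R and Julia; the sketch below uses Python's standard library instead, only to keep the examples in one language, and the input shape (pairs of collection identifier and score) is an assumption.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize(measurements):
    """Group (collection_id, score) pairs and compute descriptive statistics."""
    by_collection = defaultdict(list)
    for collection_id, score in measurements:
        by_collection[collection_id].append(score)
    return {
        cid: {
            "n": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
            "stdev": pstdev(scores),  # population standard deviation
        }
        for cid, scores in by_collection.items()
    }


# Hypothetical record-level completeness scores for two collections.
sample = [("c1", 0.2), ("c1", 0.4), ("c1", 0.9), ("c2", 0.5), ("c2", 0.5)]
summary = summarize(sample)
```

A collection with a high mean but a large standard deviation contains heterogeneous records, while a standard deviation near zero (as for `c2` here) indicates very similar records.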

Page 21: Metadata Quality Assurance

Method: collection – measuring – sharing

[Figure: Completeness of 3 collections, 2 response types. Annotations: best in collection, worst in collection, similar records, heterogeneous records, different manifestations. Scale: 0–1]

Page 22: Metadata Quality Assurance

Method: collection – measuring – sharing

Outputs:
- Display results in an interactive dashboard
- REST API to share the raw data

Images: i) European Data Portal Metadata Quality Dashboard, ii) Kibana promotional video

Page 23: Metadata Quality Assurance

Method: collection – measuring – sharing

Data Quality Vocabulary (W3C Working Draft)
http://w3c.github.io/dwbp/vocab-dqg.html

:myDatasetDistribution
    dqv:hasQualityMeasure :measure1, :measure2 .

:measure1
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvAvailabilityMetric ;
    dqv:value "1.0"^^xsd:double .

:measure2
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvConsistencyMetric ;
    dqv:value "0.5"^^xsd:double .
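Statements of this shape could be generated from computed measurements with a few lines of string formatting. The sketch below only mirrors the property names used on this slide (from the 2015 draft) and omits the `@prefix` declarations that a complete Turtle document would need; the function name and its argument layout are hypothetical.

```python
def measures_to_turtle(distribution, measures):
    """Serialize quality measures in the shape shown on the slide.

    `measures` is a list of (measure_name, metric_name, value) tuples.
    Prefix declarations are intentionally left out, as on the slide.
    """
    names = ", ".join(f":{name}" for name, _, _ in measures)
    blocks = [f":{distribution}\n    dqv:hasQualityMeasure {names} ."]
    for name, metric, value in measures:
        blocks.append(
            f":{name}\n"
            f"    a dqv:QualityMeasure ;\n"
            f"    dqv:computedOn :{distribution} ;\n"
            f"    dqv:hasMetric :{metric} ;\n"
            f'    dqv:value "{float(value)}"^^xsd:double .'
        )
    return "\n\n".join(blocks)


turtle = measures_to_turtle(
    "myDatasetDistribution",
    [
        ("measure1", "csvAvailabilityMetric", 1.0),
        ("measure2", "csvConsistencyMetric", 0.5),
    ],
)
```

For real use, an RDF library would be a safer choice than string formatting, since it handles escaping and prefixes; the point here is only to show how measurement output maps onto the vocabulary.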

Page 24: Metadata Quality Assurance

What is it good for?

- Improve the metadata
- Improve the metadata schema and its documentation
- Propagate „good practice”
- Improve services: „good” data is ranked higher in the search result list

Specifically for GWDG:
- Could be built into current and planned data management / data archiving tools

Page 25: Metadata Quality Assurance

Further steps

Define meters by Domain Specific Language

Pattern discovery, machine learning, clustering

Connectors for data sources

„Jenkins for data publication”

Problem catalogue

Data source

Schema

Metadata QA Report

Page 26: Metadata Quality Assurance

Follow me

Project plan and blog: http://pkiraly.github.io

Software development:

https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service

https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library

https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API

https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit

@kiru, https://www.linkedin.com/in/peterkiraly