Metadata Quality Assurance

Péter Király
peter.kiraly@gwdg.de
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure

Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen

Metadata Quality Assurance Framework

What is metadata?

Data about data. Specifically, descriptive data about:
  • digitized (or physical) objects, such as paintings, books, photos
  • larger datasets, such as research data

Provides access points to the underlying data

Why is data quality important?

"Fitness for purpose"

no metadata → no access to data → no data usage

More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/

Symptoms of bad quality metadata

  • Hard to identify ("What is it?")
  • Hard to distinguish from other records
  • Misleading descriptions
  • Uninterpretable descriptions
  • Missing fields
  • Not reusable (lost original context)
  • Hard to find

Some typical issues

Title is not informative

Mixing different data types

Numeric

RDF resource

Field overuse

What is the meaning of the field?

[Figure: TextGrid OAI-PMH response, highlighting the fields identifier, relation, source]

Copy & paste cataloguing

Keeping placeholders / templates

Same entity, differently recorded

Recorded forms:
  • lucas cranach der ältere
  • Cranach, Lucas (der Ältere) [Herstellung]
  • Cranach, Lucas (I) (naar tekening van)
  • Cranach, Lucas vanem (autor)

Result of entity detection:
  • http://dbpedia.org/resource/Lucas_Cranach_the_Elder
  • http://viaf.org/viaf/49268177/
  • none

Same entity recorded differently

Different displays, and content:
  • http://dbpedia.org/resource/Lucas_Cranach_the_Elder
  • http://viaf.org/viaf/49268177/
  • none

What to measure?

Think of the data as a matrix: records (doc1, doc2, doc3, ...) are the rows, fields (field1, field2, field3, field4) are the columns. Quality can be measured at four levels:

  • Record: an overall value for a record
  • Record set: an overall value for a record set (e.g. a collection from the same source)
  • Field: an overall value for a field (how do users utilize the field?)
  • Field group: a group of fields that together supports a given functionality, e.g. display, search, identify, re-use, multilinguality
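
The measurement levels described above (record, record set, field, field group) can be sketched on a tiny presence matrix. This is a minimal illustration, not the framework's implementation: the sample records, the field names and the `FIELD_GROUPS` mapping are invented for the example, and "value" here simply means the share of filled cells.

```python
# Hypothetical sample: rows are records, columns are fields.
# True means the field is filled in that record.
FIELDS = ["field1", "field2", "field3", "field4"]
RECORDS = {
    "doc1": [True, True, False, True],
    "doc2": [True, False, False, True],
    "doc3": [True, True, True, True],
}
# Assumed grouping of fields by functionality (illustrative only).
FIELD_GROUPS = {"search": ["field1", "field2"], "display": ["field1", "field4"]}


def share(values):
    """Share of True values in a non-empty sequence, between 0 and 1."""
    values = list(values)
    return sum(values) / len(values)


def record_score(doc_id):
    """Overall value for one record (one row)."""
    return share(RECORDS[doc_id])


def field_score(field):
    """Overall value for one field (one column) across all records."""
    col = FIELDS.index(field)
    return share(row[col] for row in RECORDS.values())


def record_set_score():
    """Overall value for the whole record set (all cells)."""
    return share(cell for row in RECORDS.values() for cell in row)


def field_group_score(group):
    """Value for the fields that together support one functionality."""
    cols = [FIELDS.index(f) for f in FIELD_GROUPS[group]]
    return share(row[c] for row in RECORDS.values() for c in cols)


if __name__ == "__main__":
    print(record_score("doc1"), field_score("field3"),
          record_set_score(), field_group_score("search"))
```

The same matrix answers all four questions; only the set of cells that is averaged changes.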

Grouping fields by functionalities

Functionalities (columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality

Fields (rows), one × per functionality the field supports:
dc:title × × × × ×
dcterms:alternative × × × ×
dc:description × × × × × ×
dc:creator × × × ×
dc:publisher × ×
dc:contributor ×

Created by Valentine Charles, Europeana Research and Development team

Metrics

The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.:
  • Completeness
  • Accuracy
  • Conformance to expectations
  • Logical consistency and coherence
  • Accessibility
  • Timeliness
  • Provenance
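
Completeness, the first metric in the list, is often read as the share of expected fields that are actually filled. A minimal sketch under that assumption follows. The function name, the rule for what counts as "empty" and the optional weights are illustrative choices, not definitions taken from the cited authors.

```python
def completeness(record, fields, weights=None):
    """Weighted share of `fields` that have a non-empty value in `record`.

    `record` maps field names to values. None, empty strings and empty
    containers count as missing. Without weights every field counts equally.
    """
    if weights is None:
        weights = {f: 1.0 for f in fields}
    total = sum(weights[f] for f in fields)
    filled = sum(
        weights[f] for f in fields
        if record.get(f) not in (None, "", [], {})
    )
    return filled / total


if __name__ == "__main__":
    # Hypothetical record: only the title is filled.
    record = {"dc:title": "Example title", "dc:creator": "", "dc:description": None}
    fields = ["dc:title", "dc:creator", "dc:description", "dc:publisher"]
    print(completeness(record, fields))
    # Weighting the title three times as heavily as the other fields.
    print(completeness(record, fields, {"dc:title": 3, "dc:creator": 1,
                                        "dc:description": 1, "dc:publisher": 1}))
```

Weights let a schema owner express that a missing title hurts more than a missing publisher; with equal weights the function reduces to a plain ratio.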

Data sources

Europeana – the European digital library, museum and archive: 48M+ metadata records in EDM (Europeana Data Model) schema

TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records

Research data from the Göttingen Campus

Library catalogue records in MARC (Machine-Readable Cataloging) schema

Other open data

Method: collection – measuring – sharing

Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.

Issues:
  • GWDG cloud: 160 GB, Europeana: 300 GB
  • low I/O performance
  • Europeana OAI-PMH is in a "beta" state
  • OAI-PMH requires 10M+ HTTP requests
  • REST API requires 50M+ HTTP requests
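
The large number of HTTP requests follows from how OAI-PMH pages its results: each response carries a resumption token that must be sent back to fetch the next page. The sketch below shows that loop step in isolation, using only the Python standard library. The sample XML, the `example.org` endpoint and the function names are assumptions made for illustration; this is not the harvester used in the project.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response with two records and a token for the next page.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty token element means this was the last page.
    text = token.text.strip() if token is not None and token.text else ""
    return ids, text or None


def next_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build the first request URL, or the follow-up URL for a given token."""
    if token:
        # A resumption token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    ids, token = parse_page(SAMPLE)
    print(ids, token)
    print(next_url("https://example.org/oai", token))
```

A full harvester would fetch `next_url(...)`, call `parse_page` on the body, and repeat until the token is `None`, which is one HTTP request per page.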

Method: collection – measuring – sharing

Measuring records
  • Big data, so it should be scalable
  • Apache Hadoop: MapReduce and friends
  • Pluggable architecture: "meters"
  • UI: set parameters for meters
  • input: records, schema, meters, config files
  • output: identifier, projected metadata fields, metric1, metric2, metric3 ... metricN
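
The pluggable "meters" idea and the output shape listed above can be sketched as follows. Every name here (`title_length`, `filled_ratio`, `measure`, the `id` and `title` fields) is hypothetical and chosen for the example; the actual framework may define meters quite differently.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence

Record = Dict[str, object]
# A meter is any function that turns one record into one number.
Meter = Callable[[Record], float]


def title_length(record: Record) -> float:
    """Hypothetical meter: number of characters in the title."""
    return float(len(str(record.get("title") or "")))


def filled_ratio(record: Record) -> float:
    """Hypothetical meter: share of fields with a non-empty value."""
    if not record:
        return 0.0
    return sum(1 for v in record.values() if v) / len(record)


def measure(records: Iterable[Record], meters: Dict[str, Meter],
            id_field: str = "id",
            projected: Sequence[str] = ()) -> Iterator[dict]:
    """Yield one output row per record: identifier, projected fields, metrics."""
    for record in records:
        row = {"identifier": record.get(id_field)}
        row.update({f: record.get(f) for f in projected})
        row.update({name: meter(record) for name, meter in meters.items()})
        yield row


if __name__ == "__main__":
    records = [{"id": "r1", "title": "Photo", "creator": ""}]
    meters = {"title_length": title_length, "filled_ratio": filled_ratio}
    for row in measure(records, meters, projected=["title"]):
        print(row)
```

Each record is measured independently of the others, which is what makes this step easy to distribute, for example as the map phase of a MapReduce job. Adding a metric means adding one entry to the `meters` dictionary.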

Method: collection – measuring – sharing

Statistical analysis
  • Calculating descriptive statistics with R/Julia/other tool
  • Derivation of numbers representing collections and fields from the record-level measurements
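
Deriving collection-level numbers from record-level measurements amounts to grouping the per-record scores and describing each group. The slides name R and Julia for this step; the sketch below uses Python's standard `statistics` module purely as an illustration. The row layout and the key names `collection` and `completeness` are assumptions.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize(rows, group_key="collection", score_key="completeness"):
    """Group record-level scores by collection and describe each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[score_key])
    return {
        name: {
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
            # Population standard deviation, defined even for one record.
            "stdev": pstdev(scores),
        }
        for name, scores in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical record-level output of the measuring step.
    rows = [
        {"collection": "A", "completeness": 0.5},
        {"collection": "A", "completeness": 1.0},
        {"collection": "B", "completeness": 0.25},
    ]
    for name, stats in summarize(rows).items():
        print(name, stats)
```

Grouping by a field name instead of a collection name gives the field-level numbers in the same way.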

Method: collection – measuring – sharing

[Figure: completeness of 3 collections, 2 response types, on a scale of 0 to 1. Annotations: best in collection, worst in collection, similar records, heterogeneous records, different manifestations]

Method: collection – measuring – sharing

Outputs
  • Display results in an interactive dashboard
  • REST API to share the raw data

Images: i) European Data Portal Metadata Quality Dashboard, ii) Kibana promotional video

Method: collection – measuring – sharing

Data Quality Vocabulary (W3C Working Draft)
http://w3c.github.io/dwbp/vocab-dqg.html

    :myDatasetDistribution
        dqv:hasQualityMeasure :measure1, :measure2 .

    :measure1
        a dqv:QualityMeasure ;
        dqv:computedOn :myDatasetDistribution ;
        dqv:hasMetric :csvAvailabilityMetric ;
        dqv:value "1.0"^^xsd:double .

    :measure2
        a dqv:QualityMeasure ;
        dqv:computedOn :myDatasetDistribution ;
        dqv:hasMetric :csvConsistencyMetric ;
        dqv:value "0.5"^^xsd:double .
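
To share computed scores in this vocabulary, each measure has to be rendered in the shape shown above. The following sketch does that with plain string formatting. It is an illustration only: the function names and the dictionary layout are assumptions, prefix declarations are left out, and the local names are inserted without any validation or escaping.

```python
def measure_to_turtle(measure, dataset, metric, value):
    """Render one quality measure using the terms from the example above."""
    return (
        f":{measure}\n"
        f"    a dqv:QualityMeasure ;\n"
        f"    dqv:computedOn :{dataset} ;\n"
        f"    dqv:hasMetric :{metric} ;\n"
        f'    dqv:value "{float(value)}"^^xsd:double .\n'
    )


def dataset_to_turtle(dataset, measures):
    """Link a dataset distribution to its measures, then list the measures."""
    names = ", ".join(f":{m['name']}" for m in measures)
    parts = [f":{dataset}\n    dqv:hasQualityMeasure {names} .\n"]
    parts += [
        measure_to_turtle(m["name"], dataset, m["metric"], m["value"])
        for m in measures
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    # Same values as in the example above.
    measures = [
        {"name": "measure1", "metric": "csvAvailabilityMetric", "value": 1.0},
        {"name": "measure2", "metric": "csvConsistencyMetric", "value": 0.5},
    ]
    print(dataset_to_turtle("myDatasetDistribution", measures))
```

For anything beyond a demonstration, an RDF library would be the safer route, since it handles prefixes, escaping and datatypes.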

What is it good for?
  • Improve the metadata
  • Improve the metadata schema and its documentation
  • Propagate "good practice"
  • Improve services: "good" data is ranked higher in the search result list

Specifically for GWDG:
  • Could be built into current and planned data management / data archiving tools

Further steps

  • Define meters with a Domain Specific Language
  • Pattern discovery, machine learning, clustering
  • Connectors for data sources
  • "Jenkins for data publication"
  • Problem catalogue

[Diagram labels: Data source, Schema, Metadata QA Report]

Follow me

Project plan and blog: http://pkiraly.github.io

Software development:

https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service

https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library

https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API

https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit

@kiru, https://www.linkedin.com/in/peterkiraly
