Metadata Quality Assurance
Péter Kirá[email protected] Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Metadata Quality Assurance Framework
What is metadata?
- Data about data
- Specifically: descriptive data about ...
  - digitized (or physical) objects such as paintings, books, photos
  - larger datasets such as research data
- Provides access points to the underlying data
Why is data quality important?
„Fitness for purpose”
no metadata → no access to data → no data usage
More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015, http://www.w3.org/TR/2015/WD-dwbp-20151217/
Symptoms of bad quality metadata
- Hard to identify („What is it?”)
- Hard to distinguish from other records
- Misleading descriptions
- Uninterpretable descriptions
- Missing fields
- Not reusable (lost original context)
- Hard to find
Some typical issues
Title is not informative
Mixing different data types
Numeric
RDF resource
Field overuse
What is the meaning of the field?
identifier, relation, source
TextGrid OAI-PMH response
Copy & paste cataloguing
Keeping placeholders / templates
Same entity, differently recorded
lucas cranach der ältere Cranach, Lucas (der Ältere) [Herstellung] Cranach, Lucas (I) (naar tekening van) Cranach, Lucas vanem (autor)
Result of entity detection: http://dbpedia.org/resource/Lucas_Cranach_the_Elder, http://viaf.org/viaf/49268177/, none
Same entity recorded differently
Different displays and content: http://dbpedia.org/resource/Lucas_Cranach_the_Elder, http://viaf.org/viaf/49268177/, none
What to measure?

        field1  field2  field3  field4
doc1
doc2
doc3
doc3

- Record: an overall value for a record
- Record set: an overall value for a record set (e.g. a collection from the same source)
- Field: an overall value for a field (how do users utilize the field?)
- Field group: a group of fields that together supports a given functionality, e.g. display, search, identify, re-use, multilinguality
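The four measurement levels (record, record set, field, field group) can be illustrated with completeness as the example metric. This is only a minimal sketch: it assumes records are plain dictionaries and that a field counts as present when it holds a non-empty value. The sample records and the `FIELD_GROUPS` mapping are hypothetical and not taken from the framework.

```python
from statistics import mean

FIELDS = ["field1", "field2", "field3", "field4"]

# Hypothetical field group: fields that together support one functionality.
FIELD_GROUPS = {"search": ["field1", "field3"]}

# Hypothetical sample records, keyed by document identifier.
records = {
    "doc1": {"field1": "a", "field2": "b", "field3": "c", "field4": "d"},
    "doc2": {"field1": "a", "field2": "", "field3": "c"},
    "doc3": {"field1": "a"},
}


def is_present(record, field):
    """A field counts as present when it exists and is non-empty."""
    return bool(record.get(field))


def record_score(record, fields=FIELDS):
    """Overall value for one record: share of the given fields that are filled."""
    return sum(is_present(record, f) for f in fields) / len(fields)


def record_set_score(records, fields=FIELDS):
    """Overall value for a record set: mean of the record-level scores."""
    return mean(record_score(r, fields) for r in records.values())


def field_score(records, field):
    """Overall value for a field: share of records in which it is filled."""
    return sum(is_present(r, field) for r in records.values()) / len(records)


def field_group_score(record, group):
    """Value for a field group: record score restricted to the group's fields."""
    return record_score(record, FIELD_GROUPS[group])
```

With these sample records, `record_score(records["doc2"])` is 0.5 and `field_score(records, "field1")` is 1.0. Other metrics could be plugged into the same four levels in the same way.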
Grouping fields by functionalities
Mandatory
Descriptiveness
Searchability
Contextualisation
Identification
Browsing
Viewing
Re-Usability
Multilinguality
dc:title × × × × ×
dcterms:alternative × × × ×
dc:description × × × × × ×
dc:creator × × × ×
dc:publisher × ×
dc:contributor ×
Created by Valentine Charles, Europeana Research and Development team
Metrics
The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.
- Completeness
- Accuracy
- Conformance to expectations
- Logical consistency and coherence
- Accessibility
- Timeliness
- Provenance
Data sources
Europeana, the European digital library, museum and archive: 48M+ metadata records in the EDM (Europeana Data Model) schema
TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records
Research data from the Göttingen Campus
Library catalogue records in the MARC (Machine-Readable Cataloging) schema
Other open data
Method: collection – measuring – sharing
Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
Issues:
- GWDG cloud: 160 GB, Europeana: 300 GB
- low I/O performance
- Europeana OAI-PMH is in a „beta” state
- OAI-PMH requires 10M+ HTTP requests
- REST API requires 50M+ HTTP requests
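The large number of HTTP requests follows from how OAI-PMH pages its results: each `ListRecords` response carries one batch of records plus a resumption token that has to be sent back to obtain the next batch. The loop can be sketched as follows. This is an illustration only, not the harvester used in the project; `fetch` is an assumed callable that takes a URL and returns the response body, and the base URL and metadata prefix are whatever the target service expects.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request; a resumption token replaces all other arguments."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


def harvest(base_url, metadata_prefix, fetch):
    """Yield <record> elements page by page until no resumption token is left.

    `fetch` is any callable that takes a URL and returns the response body,
    for example a thin wrapper around urllib.request.urlopen.
    """
    token = None
    while True:
        xml_text = fetch(build_url(base_url, metadata_prefix, token))
        root = ET.fromstring(xml_text)
        yield from root.iter(OAI_NS + "record")
        token_element = root.find(f".//{OAI_NS}resumptionToken")
        token = token_element.text if token_element is not None else None
        if not token:  # a missing or empty token marks the last page
            break
```

Because every page depends on the token returned by the previous one, the requests of a single harvest run strictly one after another, which is one reason a full harvest takes so long.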
Method: collection – measuring – sharing
Measuring records
- Big data, so it should be scalable
- Apache Hadoop: MapReduce and friends
- Pluggable architecture: „meters”
- UI: set parameters for meters
- input: records, schema, meters, config files
- output: identifier, projected metadata fields, metric1, metric2, metric3 ... metricN
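One way to picture the pluggable „meters” idea is as a set of interchangeable functions, each turning a record into a number, whose results are collected into one output row per record. The sketch below makes that concrete under stated assumptions: the `Meter` type, both example meters, the `id` key and the sample record are hypothetical and do not describe the framework's actual interfaces.

```python
from typing import Callable, Dict, List

# In this sketch a "meter" is any function mapping a record to a number.
Meter = Callable[[dict], float]


def completeness_meter(schema: List[str]) -> Meter:
    """Build a meter returning the share of schema fields filled in a record."""
    def measure(record: dict) -> float:
        return sum(bool(record.get(field)) for field in schema) / len(schema)
    return measure


def title_length_meter(record: dict) -> float:
    """Hypothetical second meter: length of the title in characters."""
    return float(len(record.get("dc:title", "")))


def measure_record(record: dict, meters: Dict[str, Meter]) -> dict:
    """Produce one output row: the identifier followed by one value per meter."""
    row = {"identifier": record["id"]}
    for name, meter in meters.items():
        row[name] = meter(record)
    return row


# Configuration: which meters to run, and with which parameters.
meters = {
    "completeness": completeness_meter(["dc:title", "dc:creator", "dc:description"]),
    "title_length": title_length_meter,
}
record = {"id": "rec-1", "dc:title": "Portrait", "dc:creator": ""}
row = measure_record(record, meters)
```

Since each record is measured independently of all others, a function like `measure_record` could serve as the body of a map step in a MapReduce job.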
Method: collection – measuring – sharing
Statistical analysis
- Calculating descriptive statistics with R/Julia/other tool
- Derivation of numbers representing collections and fields from the record level measurements
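Deriving collection-level numbers from record-level measurements amounts to grouping the scores and summarizing each group. A minimal sketch, with Python's standard library standing in for the „other tool”; the measurement tuples and collection names are hypothetical sample data.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical record-level output: (collection, identifier, completeness).
measurements = [
    ("collection-a", "rec-1", 0.75),
    ("collection-a", "rec-2", 0.25),
    ("collection-b", "rec-3", 0.50),
    ("collection-b", "rec-4", 0.50),
]


def summarize(rows):
    """Group record-level scores by collection and compute descriptive statistics."""
    by_collection = defaultdict(list)
    for collection, _identifier, score in rows:
        by_collection[collection].append(score)
    return {
        collection: {
            "n": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
            "stdev": pstdev(scores),  # population standard deviation
        }
        for collection, scores in by_collection.items()
    }


summary = summarize(measurements)
```

In this sample both collections have the same mean (0.5), but only the standard deviation shows that the records of `collection-a` differ from each other while those of `collection-b` are uniform.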
Method: collection – measuring – sharing
Figure: completeness of 3 collections, 2 response types
Annotations: best in collection, worst in collection, similar records, heterogeneous records, different manifestations
Scale: 0 to 1
Method: collection – measuring – sharing
Outputs
- Display results in an interactive dashboard
- REST API to share the raw data
Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
Method: collection – measuring – sharing
Data Quality Vocabulary (W3C Working Draft)
http://w3c.github.io/dwbp/vocab-dqg.html

    :myDatasetDistribution
        dqv:hasQualityMeasure :measure1, :measure2 .

    :measure1
        a dqv:QualityMeasure ;
        dqv:computedOn :myDatasetDistribution ;
        dqv:hasMetric :csvAvailabilityMetric ;
        dqv:value "1.0"^^xsd:double .

    :measure2
        a dqv:QualityMeasure ;
        dqv:computedOn :myDatasetDistribution ;
        dqv:hasMetric :csvConsistencyMetric ;
        dqv:value "0.5"^^xsd:double .
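Measurement results could be turned into statements of this shape with a few lines of string formatting. The sketch below is illustrative only: the `to_turtle` helper and the `http://example.org/` base namespace are hypothetical, and the term names (`dqv:hasQualityMeasure`, `dqv:QualityMeasure`, `dqv:hasMetric`) simply follow the draft example quoted here, so they may differ in other versions of the vocabulary.

```python
PREFIXES = (
    "@prefix : <http://example.org/> .\n"  # hypothetical base namespace
    "@prefix dqv: <http://www.w3.org/ns/dqv#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def to_turtle(distribution, metrics):
    """Serialize {metric_name: value} as Turtle, mirroring the draft's example."""
    names = [f":measure{i}" for i in range(1, len(metrics) + 1)]
    blocks = [PREFIXES]
    # Link the distribution to all of its measures.
    blocks.append(
        f":{distribution}\n    dqv:hasQualityMeasure {', '.join(names)} .\n"
    )
    # One block per measure: what it was computed on, which metric, which value.
    for name, (metric, value) in zip(names, metrics.items()):
        blocks.append(
            f"{name}\n"
            f"    a dqv:QualityMeasure ;\n"
            f"    dqv:computedOn :{distribution} ;\n"
            f"    dqv:hasMetric :{metric} ;\n"
            f'    dqv:value "{float(value)}"^^xsd:double .\n'
        )
    return "\n".join(blocks)


turtle = to_turtle(
    "myDatasetDistribution",
    {"csvAvailabilityMetric": 1.0, "csvConsistencyMetric": 0.5},
)
```

For anything beyond a demonstration, an RDF library would be a safer choice than string formatting, because it takes care of escaping and validation.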
What is it good for?
- Improve the metadata
- Improve the metadata schema and its documentation
- Propagate „good practice”
- Improve services: „good” data is ranked higher in the search result list
Specifically for GWDG:
- Could be built into current and planned data management / data archiving tools
Further steps
Define meters by Domain Specific Language
Pattern discovery, machine learning, clustering
Connectors for data sources
„Jenkins for data publication”
Problem catalogue
Data source
Schema
Metadata QA Report
Follow me
Project plan and blog: http://pkiraly.github.io
Software development:
https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service
https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API
https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
@kiru, https://www.linkedin.com/in/peterkiraly