Technische Universität Dresden, Fakultät Informatik
Wikidata
Markus Krötzsch, TU Dresden
August 2014
“Imagine a world in which every single person is given free access to the sum of all human knowledge. That’s our mission.”
Markus Krötzsch: Wikidata Toolkit Kickoff
21M+ articles
1.5B+ edits
280+ languages
about 500 Million views per day
~480M unique visitors per month
Problem 1: Content Quality
High cost of maintenance
– Fighting spam and vandalism
– Updating old content
– Fixing errors
“But we have an army of contributors with the Wisdom of the Crowds!”
Number of Articles (English)
Source: wikistatistics.net
Number of Active Users (English)
Number of Edits (English)
The Crowds are not Enough
Amount of content grows → Maintenance effort grows
Number of contributors stabilizes
Problem 2: Language Diversity
Language diversity
– 285 languages
– English, German, French, Dutch: 1 million+
– 40 languages: 100,000+
– 112 languages: 10,000+
Quality problem
– Even basic facts do not agree across languages
Coverage problem
[Slides showing Wikipedia language editions: English, French, Catalan, Italian, Greek, Russian, Chinese, English]
Problem 3: Information Access
Wikipedia has articles about …
… all cities
… their populations
… their mayors
“So can I ask for a list of the world’s ten largest cities with a female mayor?”
Wikipedia’s answer: Lists
“Imagine a world in which every single person is given free access to the sum of all human knowledge. That’s our mission.”
Wikidata
– Provide a database of the world’s knowledge that anyone can edit
– Collect references and quotes for millions of data items
– Engage a sustainable community that collects data from everywhere in a machine-readable way
– Increase the quality and lower the maintenance cost of Wikipedia and related projects
– Deliver software and community best practices enabling others to engage in projects of data collection and provisioning
Project Funding
– 1.5 million EUR
– 4 donors
– Wikimedia Foundation
Wikidata
Official “Wikipedia Database”
– For all 285 language editions
Very recent:
– Live since November 2012
– Enabled on all Wikipedia editions since March 2013
– Ongoing development led by Wikimedia Germany
The Content of Wikidata
Size as of 4th August 2014
Items: 15,792,256
Properties: 1,176
Statements: 43,189,145
… with references: 23,242,779
Labels: 52,811,608
Aliases: 8,765,542
Descriptions: 37,636,220
Site links: 39,356,543
Growth (up to Feb 2014)
Activity (Feb 2014)
54k contributors – 5k contributors with 5+ edits in Jun 2014
Over 150M edits so far – up to 500k per day
Wikidata, DBpedia, RDF, and all that
Wikidata and DBpedia: A Superficial Comparison
Wikidata
– Data related to Wikipedia
– Online since late 2012*
– Manual editing
– One multilingual dataset
– Based on statements
– About 1k properties
– Wikipedia integration
– Unique community
*) influenced by Semantic MediaWiki (started 2005)
DBpedia
– Data related to Wikipedia
– Started in 2006
– Automated extraction
– One dataset per language
– Based on triples (RDF)
– >10k properties
– Stand-alone dataset
– Unique community
Exporting Wikidata to RDF
Define URIs for items: http://www.wikidata.org/entity/<id>
Map MediaWiki languages to BCP 47 languages
Select suitable vocabulary for reuse: rdfs:label, schema.org description, skos:altLabel, prov:wasDerivedFrom
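The three export steps above can be illustrated with a small sketch. It is not the actual exporter: the function names are hypothetical, the language-code table is only a small illustrative subset chosen for this example, and the output is written as plain N-Triples strings rather than through an RDF library.

```python
# Minimal sketch of exporting item terms (labels, descriptions, aliases) to RDF.
# All helper names below are hypothetical and exist only for this example.

ENTITY_BASE = "http://www.wikidata.org/entity/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SCHEMA_DESCRIPTION = "http://schema.org/description"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"

# Assumption: an illustrative subset of MediaWiki codes that differ from
# BCP 47 tags. A real export needs a complete, curated table.
MEDIAWIKI_TO_BCP47 = {
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
    "be-x-old": "be-tarask",
    "als": "gsw",
}


def entity_uri(entity_id):
    """Build the item URI following the pattern given on the slide."""
    return ENTITY_BASE + entity_id


def bcp47(mediawiki_code):
    """Map a MediaWiki language code to a BCP 47 tag (identity by default)."""
    return MEDIAWIKI_TO_BCP47.get(mediawiki_code, mediawiki_code)


def escape_literal(text):
    """Escape the characters that N-Triples string literals require."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def term_triple(entity_id, predicate, text, mediawiki_code):
    """Serialize one label, description, or alias as an N-Triples line."""
    return '<{}> <{}> "{}"@{} .'.format(
        entity_uri(entity_id), predicate,
        escape_literal(text), bcp47(mediawiki_code))


if __name__ == "__main__":
    print(term_triple("Q42", RDFS_LABEL, "Douglas Adams", "en"))
```

Reusing existing vocabulary this way means generic RDF tools can display labels and descriptions without knowing anything specific about Wikidata.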
Exporting Wikidata Statements to RDF
Simpler Export of Statements
What makes RDF export complex:
– Qualifiers
– References
– Complex values
Idea:
– export only statements that have no qualifiers
– drop references
– simplify value encoding
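The simplification idea can be sketched as a filter over one entity document. The dictionary layout below is an assumption modelled on the JSON that Wikidata serves per item (`claims`, `mainsnak`, `qualifiers`, `references`); the function names are hypothetical, and only two value types are handled to keep the example short.

```python
# Sketch: reduce an entity's statements to simple (subject, property, value)
# triples. The input layout is an assumption, not a formal specification.

def simplify_value(datavalue):
    """Return a simple string for supported value types, else None."""
    kind = datavalue["type"]
    value = datavalue["value"]
    if kind == "wikibase-entityid":
        prefix = "P" if value.get("entity-type") == "property" else "Q"
        return "{}{}".format(prefix, value["numeric-id"])
    if kind == "string":
        return value
    return None  # complex values (time, quantity, ...) are skipped here


def simple_statements(entity):
    """Yield triples for statements that have no qualifiers.

    References are dropped simply by never reading them.
    """
    subject = entity["id"]
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            if statement.get("qualifiers"):
                continue  # export only statements without qualifiers
            snak = statement["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "no value" / "unknown value" snaks
            value = simplify_value(snak["datavalue"])
            if value is not None:
                yield (subject, prop, value)
```

The price of this simplification is that any statement whose meaning depends on a qualifier (for example a time restriction) is left out entirely rather than exported in a misleading form.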
Classification
Properties subclass of (P279) and instance of (P31)
– P31 is the most used property on Wikidata
– Often (but not always) used without qualifiers
Interesting class hierarchy:
– Entities used as classes: 41,868
– Subclass of: 40,192 (without qualifiers)
– Instance of: 6,169,821 (without qualifiers)
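To show how the two properties combine into a class hierarchy, here is a small sketch that works on simplified (subject, property, object) triples. The function names and the identifiers used in the usage note are hypothetical placeholders, not real Wikidata items.

```python
# Sketch: walking the class hierarchy formed by P279 (subclass of) and
# P31 (instance of), given triples without qualifiers.
from collections import defaultdict


def build_index(triples):
    """Split triples into subclass-of and instance-of adjacency maps."""
    subclass_of = defaultdict(set)
    instance_of = defaultdict(set)
    for subject, prop, obj in triples:
        if prop == "P279":
            subclass_of[subject].add(obj)
        elif prop == "P31":
            instance_of[subject].add(obj)
    return subclass_of, instance_of


def superclasses(cls, subclass_of):
    """All classes reachable from cls via subclass-of edges (cycle-safe)."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance_of(item, cls, subclass_of, instance_of):
    """True if item is a direct or indirect instance of cls."""
    for direct in instance_of.get(item, ()):
        if direct == cls or cls in superclasses(direct, subclass_of):
            return True
    return False
```

For example, with the triples `("item1", "P31", "classA")`, `("classA", "P279", "classB")` and `("classB", "P279", "classC")`, `item1` counts as an instance of `classC` even though no statement says so directly. The `seen` set matters because a community-edited hierarchy may contain cycles.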
Available RDF Exports
RDF/OWL file exports at: http://tools.wmflabs.org/wikidata-exports/rdf/
Results for April 20, 2014:
Usage & Applications
Application Areas
Labels and descriptions
Identifiers
Data access
Advanced analytics
Third-party applications
Getting the Data
See www.wikidata.org/wiki/Wikidata:Data_access
Direct access per item (Web API, RDF/JSON/...)
Database dumps (full dumps + daily changes)
Full dumps in more convenient formats planned
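As an illustration of direct per-item access, the following sketch fetches one item as JSON through the `Special:EntityData` page and reads a label from it. The helper names and the User-Agent string are hypothetical, and the sketch assumes the requested ID is not a redirect, so that the response is keyed by the same ID.

```python
# Sketch: per-item data access over HTTP, using only the standard library.
import json
from urllib.request import Request, urlopen


def entity_data_url(entity_id):
    """URL of the JSON document for a single entity."""
    return "https://www.wikidata.org/wiki/Special:EntityData/{}.json".format(
        entity_id)


def fetch_entity(entity_id):
    """Download and parse one entity document (performs a network request)."""
    request = Request(
        entity_data_url(entity_id),
        # Hypothetical client name; identify your own tool here.
        headers={"User-Agent": "wikidata-slides-example/0.1"})
    with urlopen(request) as response:
        document = json.loads(response.read().decode("utf-8"))
    # Assumption: entity_id is not a redirect, so it appears as a key.
    return document["entities"][entity_id]


def label(entity, language):
    """Return the label in the given language, or None if there is none."""
    entry = entity.get("labels", {}).get(language)
    return entry["value"] if entry else None


if __name__ == "__main__":
    print(label(fetch_entity("Q42"), "en"))
```

Per-item requests suit lookups of a few entities; for anything that touches a large share of the data, the database dumps are the more appropriate route.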
Conclusions
Wikidata is developing rapidly
– Data size
– Vocabulary size
– Technical features and community processes
A platform for data integration
– Including links to many other databases
Data access is easy, both legally and technically
– Further improvements planned for exports