ZBW is a member of the Leibniz Association

A Data Restore Model for Reproducibility in Computational Statistics

Daniel Bahls, ZBW, I-Know 2013, Graz, Austria

Outline

1. Motivation – Repeatability in Empirical Research

2. Our Approach – The Data Restore Model

3. Outlook – Status of this Work / Next Steps


Repeatability in Science

• Fundamental criterion – verification is the job of the community

• Experiments must lead to the same findings
  • different researchers
  • under certain constant parameters

• Further
  • Robustness (w.r.t. measuring errors, etc.)
  • Repeatability vs. Reproducibility vs. Verifiability


Repeatability in Economics and the infamous case of Rogoff and Reinhart


Improving Review Processes


- Justin Wolfers, Betsey Stevenson, economists at University of Michigan

If we try it all on our own and cannot reproduce the results, what does it mean?

... so we need access to the data

McCullough – Experiences & Recommendations


McCullough – Requirements & Experiences


Sweave – Literate Programming for Statistics


Data Publishing in Economics / Social Sciences

Different disciplines have different challenges

Characteristics of empirical research:

• sensitive / protected data

• distributed external data sources


Data Sharing

submit data bundles to 3rd-party repositories?

Data Management: The Black Box Approach

data review, curation, legal situation

re-use, transparency, repeatability

a data set copy (some resource bundle)

Statistical Data on the Semantic Web


2. Our Approach – The Data Restore Model

Data Restore Model

[Figure: Data Restore Model. Labels: Spreadsheet (obs data set); UserDataSet, of type DataSet; Data Items from own survey, of type Data Items; includesData (external dataset); buildScript. Annotations: No gaps, Trust, Incentive.]

Source: EuroStat
Dataset: Household XZ
Version: 0.2
Published: Jan 2009
[read more]
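
The idea behind the Data Restore Model, as far as the diagram labels (UserDataSet, includesData, buildScript) suggest, can be sketched in a few lines of Python: a user data set keeps its own data items, holds references to external data sets instead of copies, and is rebuilt by a build script. This is only an illustration under assumptions. The class and field names, the in-memory `archives` dictionary standing in for real data archives, and all toy values are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class ExternalDataRef:
    """Precise citation of an external data set (hypothetical fields)."""
    source: str
    dataset: str
    version: str


@dataclass
class UserDataSet:
    """Own data items plus references to external data and a build script."""
    own_items: List[Row]                   # data items from the author's own survey
    includes_data: List[ExternalDataRef]   # references, not copies
    build_script: Callable[[List[Row], Dict[ExternalDataRef, List[Row]]], List[Row]]


def restore(ds: UserDataSet, archives: Dict[ExternalDataRef, List[Row]]) -> List[Row]:
    """Fetch every referenced data set from its source, then run the build script."""
    fetched: Dict[ExternalDataRef, List[Row]] = {}
    for ref in ds.includes_data:
        if ref not in archives:
            raise LookupError(f"cannot restore: {ref} is not available")
        fetched[ref] = archives[ref]
    return ds.build_script(ds.own_items, fetched)


# Toy example; the values are made up for illustration only.
household = ExternalDataRef("EuroStat", "Household XZ", "0.2")
archives = {
    household: [
        {"region": "AT", "households": 100},
        {"region": "DE", "households": 400},
    ]
}


def join_on_region(own: List[Row], external: Dict[ExternalDataRef, List[Row]]) -> List[Row]:
    """Build script: merge own survey rows with the external rows by region."""
    lookup = {row["region"]: row for row in external[household]}
    return [{**row, **lookup[row["region"]]} for row in own if row["region"] in lookup]


study = UserDataSet(
    own_items=[{"region": "AT", "answer": 3}, {"region": "DE", "answer": 5}],
    includes_data=[household],
    build_script=join_on_region,
)
restored = restore(study, archives)
```

Because the external data are only referenced, the restored spreadsheet is rebuilt from the original source each time, and a missing or withdrawn source shows up as an explicit error rather than as a silent gap.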

Integration with Research Environments


Review and Re-use


[Figure: infrastructure diagram. Labels: Client; Source Code Repository; Archive A, Archive B, Archive C, Archive D; DOI; Code and Data Templates; Authenticate & Request Data.]

Data Infrastructure Concept

• One source per data set

transparency, curation by highest expertise

• Data protection

make data publishing possible for all scenarios

• Data and code integration

one-click-solution – no manual efforts for replication attempts

• Precise Citation

traceable data provenance
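
A minimal sketch of how a client might authenticate and request data from the one archive that holds each data set. It assumes a simple in-memory model; the `Archive` class, the `fetch` helper, the identifiers of the form `doi:example/...`, and the user name are all hypothetical placeholders, not part of any real archive interface.

```python
from typing import Dict, Iterable, Optional, Set


class Archive:
    """Stand-in for one data archive, the single source of its data sets."""

    def __init__(self, name: str, holdings: Dict[str, bytes],
                 protected: Set[str], authorized_users: Set[str]) -> None:
        self.name = name
        self.holdings = holdings
        self.protected = protected
        self.authorized_users = authorized_users

    def request(self, doi: str, user: Optional[str] = None) -> bytes:
        """Return the data for a DOI; protected data require an authorized user."""
        if doi not in self.holdings:
            raise KeyError(doi)
        if doi in self.protected and user not in self.authorized_users:
            raise PermissionError(f"{self.name}: access to {doi} requires authorization")
        return self.holdings[doi]


def fetch(doi: str, archives: Iterable[Archive], user: Optional[str] = None) -> bytes:
    """Route the request to the archive that holds the data set."""
    for archive in archives:
        if doi in archive.holdings:
            return archive.request(doi, user)
    raise LookupError(f"no archive holds {doi}")


# Hypothetical identifiers and toy content.
archive_a = Archive("Archive A", {"doi:example/public-use": b"region,value\nAT,1\n"},
                    protected=set(), authorized_users=set())
archive_b = Archive("Archive B", {"doi:example/protected": b"id,income\n1,100\n"},
                    protected={"doi:example/protected"}, authorized_users={"reviewer"})
all_archives = [archive_a, archive_b]

public_data = fetch("doi:example/public-use", all_archives)
protected_data = fetch("doi:example/protected", all_archives, user="reviewer")
```

In this sketch the published code only carries the identifiers, so protected data never leave the archive unless the requesting user is authorized there.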


Incentives for the Research Community

• Transparency increases trust:

no gaps – trust – incentive

• Easy re-use:

applied research models live longer

• More impact:

more citation


• Material for tutorials:

Students learn computational research in practice

• Research is more efficient:

Easier to understand and pick up the research of others

• Secured Knowledge:

Replication attempts in different research environments and contexts

discussion, inspiration, innovation

“Non-Findings” may get more recognition

3. Outlook – Status of this Work / Next Steps

What we are currently working on


The Rogoff and Reinhart / Herndon case

• apply Data Restore Model

• add semantic data documentation (partly available as RDF already)

• model by Data and Code ontology

Data and Code Ontology

[Figure: Data and Code Ontology. Labels: Data and Code; System Environment; Resources (HW, SW); Replication Attempts; Experiment Setup; Data References. Bullet annotations: Maven, Make, Build, Virtualisation, Emulation, Linked Science, Social Media, Semantic Coding?]


The Koenker Zeileis case

• Model relations between Data and Code instances

[Figure: relations between data and code instances. Labels: protected; public use file; data set; transformation by code; figures.]

Data Access and Retrieval

Next Steps


1. Challenge, Goals, Requirements

2. The Data Restore Model

3. Semantic Linkup / Data Annotation

4. Data Retrieval and Reuse

5. System Architecture

6. Validation / Evaluation

Thank you

Daniel Bahls, ZBW, d.bahls@zbw.eu

So there are still gaps

Examples:

• data set is titled “EU Unemployment statistics 2012, EuroStat”
  • age class? seasonal adjustments?

• Executing the code does not produce the results
  • wrong data? system environment? error?
  • cf. Herndon’s replication of Rogoff/Reinhart research

• DOI does not specify file format
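
One way to narrow down the "wrong data? system environment?" question is to record a fingerprint alongside the published results and compare it during a replication attempt. The following is a sketch under assumptions: the `fingerprint` and `diagnose` helpers and the recorded fields are hypothetical choices, and a real setup would likely record much more about the environment.

```python
import hashlib
import platform
import sys
from typing import Dict, List


def fingerprint(data: bytes) -> Dict[str, str]:
    """Record a checksum of the input data and basic facts about the environment."""
    return {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "python": platform.python_version(),
        "platform": sys.platform,
    }


def diagnose(recorded: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List the recorded fields that differ in the current replication attempt."""
    return [key for key in recorded if recorded[key] != current.get(key)]


# Toy bytes standing in for the original and the re-downloaded data file.
original = fingerprint(b"country,debt,growth\n")
replication = fingerprint(b"country,debt,growth\nextra row\n")
differences = diagnose(original, replication)
```

If `differences` names only the data checksum, the replicator is working with different data; if it is empty and the results still differ, the cause is more likely in the code itself.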


Data and Code Ontology


[Figure: triples (s p o). Labels: observation; string value; data ref; default value; for_stata; for_spss.]

Such a relationship can be stated within the semantic model
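
One possible reading of the labels `data ref`, `default value`, `for_stata`, and `for_spss` is that a data reference resolves to a tool-specific representation, with a default as fallback. The sketch below encodes that reading as plain subject-predicate-object tuples. The reference name, the file names, and the `resolve` helper are hypothetical, and the talk may intend a different meaning.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements about one data reference.
triples: List[Triple] = [
    ("ref:household", "default_value", "household.csv"),
    ("ref:household", "for_stata", "household.dta"),
    ("ref:household", "for_spss", "household.sav"),
]


def resolve(ref: str, tool: str, statements: List[Triple]) -> str:
    """Return the tool-specific object for a data reference, else its default."""
    by_predicate = {p: o for s, p, o in statements if s == ref}
    if not by_predicate:
        raise KeyError(ref)
    specific = by_predicate.get(f"for_{tool}")
    if specific is not None:
        return specific
    if "default_value" not in by_predicate:
        raise LookupError(f"{ref} has neither a value for {tool} nor a default")
    return by_predicate["default_value"]


stata_file = resolve("ref:household", "stata", triples)
r_file = resolve("ref:household", "r", triples)  # no for_r statement, falls back
```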

Proxy Relations

Dataset for economic growth (GDP or the like)

hasProxyRel

Dataset for Aluminium Price Index

Describes the proxy relation:
- details on correlation
- best practices
- frequency of use
- ...

Recommended
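
A small sketch of how such a proxy relation could be represented and queried, assuming that the Aluminium Price Index data set serves as a proxy for the economic growth data set. The `ProxyRelation` fields mirror the items listed on the slide, while the second relation, the notes, and the frequency counts are invented toy values for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyRelation:
    """Describes a hasProxyRel link between two data sets (hypothetical fields)."""
    target: str             # data set that is hard to obtain or measure
    proxy: str              # data set used in its place
    correlation_note: str   # details on correlation
    best_practice: str      # how the proxy is typically applied
    frequency_of_use: int   # how often the proxy appears in recorded studies


def recommend_proxies(target: str, relations: List[ProxyRelation]) -> List[str]:
    """Return proxies for a target data set, most frequently used first."""
    matches = [r for r in relations if r.target == target]
    matches.sort(key=lambda r: r.frequency_of_use, reverse=True)
    return [r.proxy for r in matches]


# Toy relations; notes and counts are made up.
relations = [
    ProxyRelation("economic growth (GDP)", "Aluminium Price Index",
                  "toy note on correlation", "toy note on usage", 12),
    ProxyRelation("economic growth (GDP)", "hypothetical other index",
                  "toy note on correlation", "toy note on usage", 3),
]
recommended = recommend_proxies("economic growth (GDP)", relations)
```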