The ZBW is a member of the Leibniz Association (Leibniz-Gemeinschaft). A Data Restore Model for Reproducibility in Computational Statistics. Daniel Bahls, ZBW, I-Know 2013, Graz, Austria


The ZBW is a member of the Leibniz Association

A Data Restore Model

for Reproducibility in Computational Statistics

Daniel Bahls, ZBW, I-Know 2013, Graz, Austria


Outline

1. Motivation – Repeatability in Empirical Research

2. Our Approach – The Data Restore Model

3. Outlook – Status of this Work / Next Steps


Repeatability in Science

• Fundamental criterion – verification is the job of the community

• Experiments must lead to the same findings
  • for different researchers
  • under certain constant parameters

• Further
  • Robustness (w.r.t. measuring errors, etc.)
  • Repeatability vs. Reproducibility vs. Verifiability


Repeatability in Economics and the infamous case of Rogoff and Reinhart


Improving Review Processes


- Justin Wolfers, Betsey Stevenson, economists at the University of Michigan

If we try it all on our own and cannot reproduce the results, what does it mean?

... so we need access to the data


McCullough – Experiences & Recommendations


McCullough – Requirements & Experiences




Sweave – Literate Programming for Statistics




Data Publishing in Economics / Social Sciences

Different disciplines have different challenges

Characteristics of empirical research:

• sensitive / protected data

• distributed external data sources


Data Sharing

submit data bundles to 3rd-party repositories?


Data Management: The Black Box Approach

a data set copy (some resource bundle)

?

• data review, curation, legal situation
• re-use, transparency, repeatability


Statistical Data on the Semantic Web


Outline

1. Motivation – Repeatability in Empirical Research

2. Our Approach – The Data Restore Model

3. Outlook – Status of this Work / Next Steps


Data Restore Model


Spreadsheet

obs data set




[Diagram: Data Restore Model. Nodes: DataSet, UserDataSet, Data Items, Data Items from own survey, external dataset. Relations: type, includesData, buildScript.]

No gaps – Trust – Incentive
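One way to read the diagram labels is sketched below in Python. This is an illustrative interpretation, not the author's implementation: it assumes a UserDataSet holds its own survey items, references external data sets via includesData instead of copying them, and rebuilds the full data set by running a buildScript. The class names `ExternalDataRef` and `UserDataSet`, the `restore` method, the `fetch` callback and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class ExternalDataRef:
    """Hypothetical pointer to a data set that stays with its provider."""
    source: str
    dataset: str
    version: str


@dataclass
class UserDataSet:
    """Own data items, references to external data, and a build script."""
    own_items: List[Row]
    includes_data: List[ExternalDataRef]
    build_script: Callable[[List[Row], Dict[ExternalDataRef, List[Row]]], List[Row]]

    def restore(self, fetch: Callable[[ExternalDataRef], List[Row]]) -> List[Row]:
        # Fetch every referenced data set from its source, then rebuild.
        external = {ref: fetch(ref) for ref in self.includes_data}
        return self.build_script(self.own_items, external)


# Illustration only: made-up values standing in for a provider's archive.
FAKE_ARCHIVE = {
    ExternalDataRef("EuroStat", "Household XZ", "0.2"): [
        {"country": "AT", "households": 100},
        {"country": "DE", "households": 200},
    ]
}


def fetch(ref: ExternalDataRef) -> List[Row]:
    # Stand-in for a real request to the data provider.
    return FAKE_ARCHIVE[ref]


def build(own_items, external):
    # Join own survey items with the external figures by country.
    households = {
        row["country"]: row["households"]
        for rows in external.values()
        for row in rows
    }
    return [{**item, "households": households[item["country"]]} for item in own_items]


user_data = UserDataSet(
    own_items=[{"country": "AT", "answer": 1}, {"country": "DE", "answer": 2}],
    includes_data=list(FAKE_ARCHIVE),
    build_script=build,
)
restored = user_data.restore(fetch)
```

In this reading only the own items, the references and the script need to be published; the external data is pulled from its source each time the data set is restored.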


Source: EuroStat
Dataset: Household XZ
Version: 0.2
Published: Jan 2009
[read more]


Integration with Research Environments


Review and Re-use

[Diagram: Review and Re-use. Components: Client; Source Code Repository; Archives A, B, C, D. Flows: DOI; Code and Data Templates; Authenticate & Request Data.]


Data Infrastructure Concept

• One source per data set

transparency, curation by highest expertise

• Data protection

make data publishing possible for all scenarios

• Data and code integration

one-click solution – no manual effort for replication attempts

• Precise Citation

traceable data provenance
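The four points above can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: each data set lives in exactly one archive, a citation names archive, data set and version, and the archive checks who is asking before it releases protected data. `DataCitation`, `Archive`, `collect_data` and the user names are hypothetical, and the stored row is a made-up value.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Precise citation: which archive, which data set, which version."""
    archive: str
    dataset: str
    version: str


class Archive:
    """Hypothetical single source for the data sets it curates."""

    def __init__(self, name: str, authorized: Set[str]):
        self.name = name
        self._authorized = authorized
        self._store: Dict[Tuple[str, str], List[dict]] = {}

    def publish(self, dataset: str, version: str, rows: List[dict]) -> None:
        self._store[(dataset, version)] = rows

    def request(self, user: str, citation: DataCitation) -> List[dict]:
        # Data protection: only authorized users receive the data.
        if user not in self._authorized:
            raise PermissionError(f"{user} may not access {citation.dataset}")
        return self._store[(citation.dataset, citation.version)]


def collect_data(user: str, citations: List[DataCitation],
                 archives: Dict[str, Archive]) -> Dict[DataCitation, List[dict]]:
    # One call gathers every cited data set from the archive that owns it.
    return {c: archives[c.archive].request(user, c) for c in citations}


archive_a = Archive("Archive A", authorized={"reviewer"})
archive_a.publish("Household XZ", "0.2", [{"value": 1}])  # made-up row
archives = {"Archive A": archive_a}
citation = DataCitation("Archive A", "Household XZ", "0.2")
data = collect_data("reviewer", [citation], archives)
```

A replication attempt would pass `data` to the published analysis code; a user without access gets a `PermissionError` from the archive rather than a copy of the protected data.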


Incentives for the Research Community

• Transparency increases trust:

no gaps – trust – incentive

• Easy re-use:

applied research models live longer

• More impact:

more citations


Incentives for the Research Community

• Material for tutorials:

Students learn computational research in practice

• Research is more efficient:

Easier to understand and pick up the research of others

• Secured Knowledge:

Replication attempts in different research environments and contexts

discussion, inspiration, innovation

“Non-Findings” may get more recognition


Outline

1. Motivation – Repeatability in Empirical Research

2. Our Approach – The Data Restore Model

3. Outlook – Status of this Work / Next Steps


What we are currently working on


The Rogoff and Reinhart / Herndon case

• apply the Data Restore Model

• add semantic data documentation (partly available as RDF already)

• model with the Data and Code ontology


Data and Code Ontology

[Diagram: Data and Code Ontology. Concepts: Data and Code, System Environment, Resources, HW, SW, Replication Attempts, Experiment Setup, Data References.]

• Maven
• Make
• Build
• Virtualisation
• Emulation
• Linked Science
• Social Media
• Semantic Coding?


What we are currently working on


The Koenker Zeileis case

• Model relations between Data and Code instances

[Diagram labels: protected, public use file, data set, transformation by code, figures]


Data Access and Retrieval


Next Steps


1. Challenge, Goals, Requirements

2. The Data Restore Model

3. Semantic Linkup / Data Annotation

4. Data Retrieval and Reuse

5. System Architecture

6. Validation / Evaluation


Thank you

Daniel Bahls, [email protected]


So there are still gaps

Examples:

• Data set is titled “EU Unemployment statistics 2012, EuroStat”
  • age class? seasonal adjustments?

• Executing the code does not produce the results
  • wrong data? system environment? error?
  • cf. Herndon’s replication of Rogoff/Reinhart research

• DOI does not specify file format


Data and Code Ontology

[Diagram labels: observation, string value; triple (s, p, o); data ref; default value; for_stata; for_spss]


Proxy Relations

Such a relationship can be stated within the semantic model.

Dataset for economic growth (GDP or the like)

hasProxyRel

Dataset for Aluminium Price Index

Describes the proxy relation:
- details on correlation
- best practices
- frequency of use
- ...
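As a rough illustration, a proxy relation could be stated as plain subject-predicate-object triples. The Python sketch below uses only the standard library. Apart from `hasProxyRel`, every identifier is hypothetical (the `ex:` prefix, `ex:proxyTarget`, `ex:frequencyOfUse`, the data set names), and the direction of `hasProxyRel` from the proxy data set to a relation node is an assumption, since the slide does not show it.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object


# Assumed modelling: the proxy data set points to a relation node, and
# the relation node points to the data set it stands in for.
triples = [
    Triple("ex:AluminiumPriceIndex", "ex:hasProxyRel", "ex:proxyRel1"),
    Triple("ex:proxyRel1", "ex:proxyTarget", "ex:EconomicGrowth"),
    # Descriptive details attach to the relation node (placeholder value).
    Triple("ex:proxyRel1", "ex:frequencyOfUse", "..."),
]


def proxies_for(graph: List[Triple], target: str) -> List[str]:
    """Return the data sets that are stated as proxies for `target`."""
    relations = {t.s for t in graph if t.p == "ex:proxyTarget" and t.o == target}
    return sorted(
        t.s for t in graph if t.p == "ex:hasProxyRel" and t.o in relations
    )


found = proxies_for(triples, "ex:EconomicGrowth")
```

Keeping the relation as its own node leaves room for the details listed above (correlation, best practices, frequency of use) without changing either data set's description.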