TDWI Award 2017

Abstracts of the submitted bachelor's and master's theses in the field of BI

tdwi.eu

Contents

Foreword

Silke Bandyszak: Text Mining in der Pharmaindustrie – Eine Meinungsanalyse zur Verträglichkeit von Medikamenten (Text Mining in the Pharmaceutical Industry: An Opinion Analysis of the Tolerability of Medications)

First place, Benjamin Barann: Quantifying the Potential of Taxi-Ridesharing Services: A Data-Driven Analysis in Metropolitan Areas

Monica Deutsch: Data Acquisition, Inspection and Adaption as Mandatory Prerequisites for Generating High Quality Predictions by Means of Business Analytics

Third place, Benjamin Gürtl: Road Traffic Optimization and Decision Support by Means of Prescriptive Analytics as a Service

Frank Heimerzheim: Statistische Analyse von Unternehmensdaten unter besonderer Berücksichtigung von Heavy-tailed-Verteilungen (Statistical Analysis of Corporate Data with Special Consideration of Heavy-Tailed Distributions)

Nelli Neigum: Was ist "Agile Business Intelligence"? (What Is "Agile Business Intelligence"?)

Simon Schausberger: The Semantic Data Warehouse for the AgriProKnow Project: A First Prototype

First place, Helena Sternkopf: Doing Good with Data – Development of a Maturity Model for Data Literacy in Non-Governmental Organizations

Aline Venckus (ex-Krampl): Auswirkungen von Big-Data-Technologien auf Data-Warehouse-Architekturen – Referenzmodell einer Lambda-Architektur (Effects of Big Data Technologies on Data Warehouse Architectures: Reference Model of a Lambda Architecture)

Further theses were submitted for the award but were not released for publication in this collection because of competing third-party rights or for other reasons.

These are the following theses:

Tobias Dieling: Aktuelle Trends im Bereich Business Intelligence – Erfolgsfaktoren und Einzelfallstudie bei der DATEV eG (Current Trends in Business Intelligence: Success Factors and a Single Case Study at DATEV eG)

Florian Dörr: Big Data Analytics – Eine Datenanalyse am Beispiel einer vernetzten Firmenwagenflotte (Big Data Analytics: A Data Analysis Using the Example of a Connected Company Car Fleet)

Oliver Wesely: How Do Naturally Inspired Search Algorithms Help to Find the Optimal Trading Strategy?

Foreword

Dear readers,

We are very pleased that, with the support of INFOMOTION GmbH, the TDWI Award, the prize for outstanding theses, was presented for the 11th time at the TDWI Conference 2017 in Munich. Once again, congratulations to the winners! This year brought a first: two first prizes were awarded, because the jury members, who kindly worked through all theses and reviews, could not decide between a first and a second place. Once again it was a challenge to select the three best theses from the submissions, since the overall standard was very high and only nuances made the difference. This is also reflected in the fact that the two first-placed theses simply could not be ranked against each other. The again high number of submissions shows the great interest in the TDWI Award. It also makes clear that topics around business intelligence and business analytics continue to yield exciting results and have a permanent presence in companies. This year's prize-winning theses are:

First prize: Benjamin Barann, Quantifying the Potential of Taxi-Ridesharing Services: A Data-Driven Analysis in Metropolitan Areas

First prize: Helena Sternkopf, Doing Good with Data – Development of a Maturity Model for Data Literacy in Non-Governmental Organizations

Third prize: Benjamin Gürtl, Road Traffic Optimization and Decision Support by Means of Prescriptive Analytics as a Service

In his master's thesis at the Westfälische Wilhelms-Universität Münster, Benjamin Barann addresses a topical subject that also makes economic sense. The thesis not only presents a solution in a detailed and comprehensible way but also derives the necessary foundations precisely, resulting in a well-rounded piece of work. Helena Sternkopf, the other first-place winner, wrote her thesis at the Hochschule für Wirtschaft und Recht in Berlin. Her interesting topic is rarely considered in a BI context, yet it is decisive for the value created by BI systems. The author not only focuses on the practical application of her approach but also derives it comprehensibly from the literature, which makes this thesis a success as well. Benjamin Gürtl wrote his prize-worthy thesis at FH Joanneum (Austria). Here, the combination of prescriptive analytics, machine learning, and cloud computing was convincing. It is a highly topical subject that is on the agenda of many companies and for which little experience exists so far, so the thesis delivers well-founded insights that can be used directly in practice.

Not only these topics but all submissions reflect current discussions; every year the submissions reveal trends and challenges, combined with innovative solutions. One criterion of the TDWI Award is that a thesis must not merely consider its topic theoretically but must also have practical relevance, so that a corresponding impulse can be given back to practice.

We are therefore pleased to offer you the abstracts of the theses in this collection, so that you can form your own picture of current trends and be inspired by the solutions developed. We hope that you enjoy reading the contributions as much as the reviewers did, and we already look forward to the coming year and to more exciting submissions for the TDWI thesis award 2018!

Carsten Felden
Chairman of TDWI e.V.

Text Mining in the Pharmaceutical Industry

Silke Bandyszak

Today, the pharmaceutical industry faces a multitude of challenges: growing and accelerating competition, the advancing digitalization of the sector, and not least the changing role of the patient. Development times and costs for new medications rise every year; the development of a single medication now takes up to 16 years and costs more than one billion US dollars. The pressure on newly developed medications to succeed in the market is correspondingly high.

Public media coverage and the benefit-risk perception of physicians and patients strongly influence the marketing success of pharmaceutical companies. Phenomena such as 'Web 2.0' and 'social media' in particular make it easier for patients to spread their opinions on specific pharmaceutical products over the internet and to voice praise as well as criticism. Analogous to portals such as Amazon or Ciao.de, the healthcare industry also has review platforms on which patients can submit both numerical ratings and fully written comments on medications.

The goal of the master's thesis "Text Mining in der Pharmaindustrie – Eine Meinungsanalyse zur Verträglichkeit von Medikamenten" (Text Mining in the Pharmaceutical Industry: An Opinion Analysis of the Tolerability of Medications) is to evaluate such online medication reviews in a way that yields useful insight. It investigates whether online patient feedback can be processed with automated methods and then interpreted meaningfully. Central questions are which implications the results have for the pharmaceutical industry and whether the methods used are suitable for the medical domain. Because the volume and structure of user-generated web content are too complex for manual evaluation, methods from data mining, web mining, and in particular text mining are applied. Opinion mining and sentiment analysis are subdisciplines of text mining that deal specifically with the analysis of opinionated texts and can be used to assign a positive, negative, or neutral sentiment to documents.

Using a selected group of medications as an example, the oral anticoagulants (that is, medications that inhibit blood clotting), reviews submitted by patients are subjected to a sentiment analysis, and the most frequently mentioned side effects of the anticoagulants considered are identified. The data come from the two largest online health portals in the English-speaking world: Drugs.com and WebMD. The analysis software used is the open-source product RapidMiner, which offers extensive data mining and text mining functionality.

In the text acquisition process, the 1875 relevant medication reviews are first extracted from the respective websites using web crawling and information extraction methods and stored in Excel tables. The data set is then cleansed of duplicate and incomplete reviews. In the subsequent text preprocessing, classic text mining methods are applied, including tokenization, stop-word filtering, and stemming. In this step, each review is broken down into its components and transferred into the vector space model; that is, a certain 'structure' is imposed on the previously unstructured natural-language raw texts. Texts prepared in this form can be analyzed with the classic methods of data mining and machine learning. The quality of the classification model varies with the methods and algorithms chosen (e.g., Naive Bayes, support vector machines, neural networks); a suitable weighting of the attributes can also improve the analysis results. For the medication reviews considered, the Naive Bayes algorithm combined with the Information Gain weighting method delivers the most accurate results (prediction accuracy of 75.2% in the classification model built). Nevertheless, the prediction accuracy when applying the model to new data sets is only 47%.
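The preprocessing and classification chain described above (tokenization, stop-word filtering, stemming, bag-of-words representation, Naive Bayes) can be illustrated with a minimal pure-Python sketch. It is not the RapidMiner process used in the thesis: the stop-word list, the crude suffix stemmer, and the four training reviews are invented for illustration, and the Information Gain attribute weighting is left out.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"the", "a", "an", "and", "i", "it", "is", "was", "my", "to", "of"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def train(documents):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)  # label -> term frequencies (bag of words)
    vocabulary = set()
    for text, label in documents:
        class_counts[label] += 1
        for term in preprocess(text):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in preprocess(text):
            if term not in vocabulary:
                continue  # ignore terms never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented example reviews, for illustration only.
TRAINING_REVIEWS = [
    ("Works well, no problems and easy dosing", "positive"),
    ("Great drug, I feel safe and happy", "positive"),
    ("Terrible bleeding and constant dizziness", "negative"),
    ("Severe nausea and pain, stopped taking it", "negative"),
]

model = train(TRAINING_REVIEWS)
print(predict(model, "Happy with it, works great"))
print(predict(model, "Bleeding and pain"))
```

With only four training reviews the model is a toy. The gap the abstract reports between 75.2% and 47% accuracy is a reminder that such a classifier needs to be evaluated on reviews it has not seen.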

As a possible alternative to the quantitative machine learning methods, the thesis also tests a qualitative analysis variant in the form of the lexicon-based approach. Methods of this kind can largely dispense with the text preprocessing and transformation steps, because complete syntactic and semantic sentence structures are taken into account in the analysis. In the chosen application example, the prediction accuracy of this method is likewise 47%.

The fact that the prediction accuracies of the methods used are relatively low at 47% suggests that sentiment analysis in the medical domain is more difficult than for the product categories previously studied in research, such as electronics or movies. What is special about medication reviews is above all the occurrence of numerous negatively connoted words, since patients describe, among other things, side effects and the course of their illness. As a result, a sentiment analysis classifies the majority of reviews as negative because of the words they contain, even though many patients are generally satisfied with the effectiveness of the medications. Because anticoagulants fall into the category of life-saving medications, patients are more inclined to accept negative side effects of a treatment as long as the active ingredient fulfills its primary purpose: preventing thrombosis. This explains the sometimes large discrepancy between patients' numerical ratings and the result of analyzing their accompanying text comments.
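The discrepancy between numerical ratings and text sentiment can be made concrete with a small sketch. It is a deliberately simplified word-counting scorer, not the lexicon-based method of the thesis (which takes sentence structure into account); the mini-lexicon, the rating thresholds, and the sample review are all invented for illustration.

```python
import re

# Hypothetical mini-lexicon; real opinion lexicons contain thousands of entries.
LEXICON = {
    "good": 1, "great": 1, "effective": 1, "happy": 1,
    "pain": -1, "bleeding": -1, "dizziness": -1, "nausea": -1, "bruising": -1,
}


def lexicon_sentiment(text):
    """Sum the polarity of all known words and map the total to a label."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def rating_sentiment(rating, scale_max=10):
    """Map a numerical rating to a label (thresholds are assumptions)."""
    share = rating / scale_max
    if share >= 0.7:
        return "positive"
    if share <= 0.4:
        return "negative"
    return "neutral"


# Invented review: side effects are mentioned, but the patient is satisfied.
REVIEW_TEXT = ("Some bruising and dizziness at first, but it is effective "
               "and I have had no clots since.")
REVIEW_RATING = 9

print("text:  ", lexicon_sentiment(REVIEW_TEXT))
print("rating:", rating_sentiment(REVIEW_RATING))
```

The two side-effect words outweigh the single positive word, so the text is labeled negative although the rating is high. This is the effect described above.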

The differences between the old and the new generation of oral anticoagulants are comparatively small, and the side effects frequently mentioned in the reviews are similar. This result underscores findings from earlier clinical studies in which the new oral anticoagulants were shown to be at least equivalent to the older preparations. According to the online reviews, patients do not appear to see a groundbreaking advance, since bleeding complications still occur with the new preparations, as do other common side effects of the older generation of medications (above all pain, dizziness, and nausea).

The master's thesis makes clear that the analysis of patient reviews using text mining and sentiment analysis offers numerous starting points for future research. The methods used need to be improved in their prediction accuracy in particular and adapted to the particularities of texts from the medical domain. For example, implicitly expressed opinions and the many medical technical terms require special handling. Aspect-based opinion mining may also offer potential for improving the analysis results (so far, the reviews have been considered only at the document level). In conclusion, monitoring social media platforms with text mining is a powerful instrument. Pharmaceutical companies can use it in a targeted way to track the success of their medications from the patient's perspective, to identify frequent (and even previously undiscovered) side effects, or to find suggestions for improving medications and for innovative new developments.

The master's thesis was submitted at the University of Duisburg-Essen, Mercator School of Management, Chair of Business Information Systems, in particular Business Intelligence, under Univ.-Prof. Dr. Peter Chamoni.

Quantifying the Potential of Taxi-Ridesharing Services: A Data-Driven Analysis in Metropolitan Areas

Benjamin Barann

First place

Uber and Lyft are disrupting the taxi industry, which has remained largely unimproved since its emergence. Alongside those new business models arises the need for more sustainable consumption and thus for more efficient transportation systems and sustainable mobility to counter the challenges posed by urban congestion, fuel waste, and air pollution. Therefore, taxi-ridesharing (TRS) approaches, which match at least two separate ride requests with similar spatio-temporal characteristics to a joint taxi trip, are gaining attention in practice and academia. Nevertheless, a simple approach that requires common start and end points as the basis for shared rides had not been considered. For this reason, my master's thesis explored such an approach using open data. The empirical analysis revealed that the proposed approach matches up to 48.34% of all taxi rides in New York and 6.78% in Porto. This shows that the approach works better in cities and at times with higher trip densities. The results from New York were published in the journal Decision Support Systems: Barann, Benjamin, Daniel Beverungen, and Oliver Müller. "An open-data approach for quantifying the potential of taxi ridesharing." Decision Support Systems (2017).

Introduction and Motivation

The basic idea of sharing is as old as humankind [2, p. 2; 3, p. 3]. However, with the recent development of information and communication technologies (ICT), many new forms of sharing have emerged [2, p. 2; 4, p. 5]. New business models related to those forms are often subsumed under terms such as the sharing economy or collaborative consumption [2, p. 2; 4, p. 1; 5, p. 149]. Two exemplary business models in the domain of transportation have been developed by Uber and Lyft. They provide mobile applications that allow individuals with private vehicles to act as taxi drivers. In doing so, they disrupted the taxi industry, which had remained largely unimproved since its emergence [2, p. 9; 6, p. 405]. Alongside those new business models arose the need for more sustainable consumption [2, p. 1; 7]. One example is the political ambition for more efficient transportation systems and sustainable mobility [8, p. 140] to counter the challenges posed by urban congestion, fuel waste, and air pollution [9, p. 85; 10, p. 1]. Especially in cities like New York, the taxi industry contributes considerably to those challenges [11, p. 13290].

Therefore, taxi-ridesharing (TRS) approaches, which match at least two separate ride requests with similar spatio-temporal characteristics to a joint taxi trip, gained attention in practice and academia [12, p. 590; 13]. The idea of taxi ridesharing is not completely new. In fact, it has been used in developing countries for several years [13, p. 303; 14, p. 476].

However, in recent years, the academic literature has proposed several technology-driven approaches to improve the feasibility of TRS in the modern world. Each presented approach has different characteristics and uses different methods to determine the potential of TRS (see chapter 3.2.2 of the thesis). Nevertheless, a simple TRS approach that requires common start and end points as the basis for shared rides (one-to-one) was not (re)considered in the literature. Instead, the authors focused on more dynamic approaches with multiple start and/or end points [e.g., 11; 15]. For this reason, the thesis dealt with the following research question:

Research Question: Under which operational circumstances can a simplified but technology-enabled approach that merges taxi trips with common start and end points be used to implement taxi-ridesharing in metropolitan areas?

To answer this research question, an inductive data-driven analysis was conducted. Like D'Orey et al. [8, p. 143] and Santi et al. [11, p. 13294], the thesis used historical taxi trajectory datasets from Porto and New York as direct input for this analysis. Yet, in contrast to those studies, data science methods were used to quantify the potential of the proposed TRS service. Based on the first five steps of the "Cross-Industry Standard Process for Data Mining" (CRISP-DM) [16, p. 12], a reusable and city-independent decision support system was established, which first generated shared trips by matching rides from the datasets. This framework was used to consider different scenarios of the one-to-one TRS service in a what-if analysis. Furthermore, it supported the data discovery activities to generate new knowledge and identify patterns from the matching results [17, p. 3; 18, p. 2].
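To make the one-to-one idea more tangible, the following is a minimal sketch of how two trips with common start and end points could be paired. It is not the matching procedure of the thesis: the greedy strategy, the `Trip` structure, the parameter names `max_distance_m` and `max_wait_s`, their default values, and the sample coordinates are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trip:
    trip_id: int
    pickup_time: float  # seconds since the start of the observation window
    start: tuple        # (latitude, longitude) in degrees
    end: tuple          # (latitude, longitude) in degrees


def haversine_m(a, b):
    """Great-circle distance between two (lat, lon) points in meters."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def match_one_to_one(trips, max_distance_m=500.0, max_wait_s=300.0):
    """Greedily pair trips whose start points and end points are both close
    and whose pickup times lie within the waiting window.

    Each trip is used in at most one pair (hypothetical simplification).
    """
    ordered = sorted(trips, key=lambda t: t.pickup_time)
    used = set()
    pairs = []
    for i, first in enumerate(ordered):
        if first.trip_id in used:
            continue
        for second in ordered[i + 1:]:
            if second.pickup_time - first.pickup_time > max_wait_s:
                break  # later trips are even further away in time
            if second.trip_id in used:
                continue
            if (haversine_m(first.start, second.start) <= max_distance_m
                    and haversine_m(first.end, second.end) <= max_distance_m):
                pairs.append((first.trip_id, second.trip_id))
                used.update((first.trip_id, second.trip_id))
                break
    return pairs


# Invented sample trips: 1 and 2 share start and end areas, 3 does not.
SAMPLE_TRIPS = [
    Trip(1, 0.0, (40.7580, -73.9855), (40.6413, -73.7781)),
    Trip(2, 120.0, (40.7585, -73.9850), (40.6415, -73.7785)),
    Trip(3, 60.0, (40.7128, -74.0060), (40.7306, -73.9352)),
]

pairs = match_one_to_one(SAMPLE_TRIPS)
print("shared rides:", pairs)
print("ride reduction:", len(pairs), "of", len(SAMPLE_TRIPS), "trips")
```

Varying `max_distance_m` or `max_wait_s` and re-running the matching is one simple way to think about a what-if analysis over different scenarios.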

Results and Conclusion

The results of the what-if analysis indicated that a simplified TRS approach can indeed generate benefits under certain circumstances. Not only the passengers can profit from the one-to-one TRS approach, but also the drivers and potentially the operator, through increased profitability per distance traveled. Yet, it was also shown that the feasibility of the TRS approach depends on the taxi trip density within the city.

In the basis case, the one-to-one approach in Porto would have achieved a ride reduction of only 3.39% in the ordinary and 2.4% in the extraordinary week. It was therefore assumed that the provider would have to tweak parameters such as the distance constraint or implement the service in a spatio-temporally restricted way. For example, the night hours were the most promising hours. What is more, transport hubs, tourist attractions, and hospitals seemed to be good starting locations for the TRS service. As an alternative, it was discussed that many-to-many TRS might deliver better results in such a situation.

In contrast, the basis case in New York already reached a ride reduction of 24.17% in the ordinary and 21.71% in the extraordinary week (see Tab. 1). Considering the other scenarios, it was suggested that the basis case of the simple one-to-one approach was sufficient to implement TRS in New York. Moreover, a higher seat capacity, optional multi-ridesharing, or relaxed spatio-temporal restrictions could have further improved the service outcomes. Besides that, a correlation was discovered between the city structure and the best-suited distance constraint.

Tab. 1: Operator result metrics of the basis case in New York

Operator Metric                  Ordinary      Extraordinary
Total Raw Trips                  3119254       2292934
Total Matched Rides              1507720       995468
% Matched System Rides           48.34         43.41
Total Ride Reduction             753860        497734
% System Ride Reduction          24.17         21.71
Ø Passenger Occupancy Old        1.71          1.8
Ø Passenger Occupancy Sharing    2.25          2.3
Total Distance Saved (km)        2892036.11    1864658.11
Ø Saved km per RS                3.84          3.75
Relative Distance Rate           81.02         83.53
Saved Gas (litre)                231362.89     149172.65
Saved CO2 (kg)                   532134.64     343097.09
Saved Ride Time (h)              156965        94628.98
% Saved Ride Time                22.42         20.14
% Revenue Reduction              12.9          10.68
Ø Inc. Trip Fare (%)             53.24         53.53
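Several rows of Tab. 1 can be recomputed from the raw counts, which helps when reading the table. The sketch below uses only figures from Tab. 1. It assumes that each shared ride in the basis case merges exactly two matched rides, which is consistent with the ride reduction being half the number of matched rides; the dictionary keys and function name are invented for illustration.

```python
# Raw figures copied from Tab. 1 (basis case, New York).
BASIS_CASE_NEW_YORK = {
    "ordinary": {
        "raw_trips": 3119254,
        "matched_rides": 1507720,
        "distance_saved_km": 2892036.11,
    },
    "extraordinary": {
        "raw_trips": 2292934,
        "matched_rides": 995468,
        "distance_saved_km": 1864658.11,
    },
}


def derived_metrics(week):
    """Recompute the percentage and per-ride rows from the raw counts."""
    raw = week["raw_trips"]
    matched = week["matched_rides"]
    # Assumption: two matched rides are merged into one shared ride.
    ride_reduction = matched // 2
    return {
        "pct_matched": round(100 * matched / raw, 2),
        "ride_reduction": ride_reduction,
        "pct_ride_reduction": round(100 * ride_reduction / raw, 2),
        "saved_km_per_shared_ride": round(
            week["distance_saved_km"] / ride_reduction, 2
        ),
    }


for name, week in BASIS_CASE_NEW_YORK.items():
    print(name, derived_metrics(week))
```

The recomputed values agree with the corresponding rows of the table for both weeks.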


However, it was also discussed that, to implement the TRS approach feasibly, at least one factor would be required that counteracts the reduction of individual trips for the operator or motivates the TRS approach in other ways. This could be either a situation where demand exceeds supply or a demand elasticity that increases the number of requested trips when a lower price is introduced. Besides that, it was assumed that TRS could help taxi operators stay competitive against companies such as Uber. Finally, it was argued that cities could subsidize the TRS approach to support the trend of sustainable consumption.

Finally, it is worth noting that the results were based on a few assumptions. First of all, while the data-driven approach tried to mimic a dynamic matching approach as closely as possible, it assumed that all trips that could have been shared were matched. In reality, this may look different. Some participants may be unaware of the possibility of sharing a taxi and of its potential. This was especially an issue in the past, when TRS approaches were not marketed aggressively [19, p. 32]. Furthermore, the data-driven approach used fixed parameters for all customers and did not consider their individual preferences. Not all customers may be willing to share a taxi with strangers due to psychological barriers (see chapter 3.2.3 of the thesis). Also, the additional inconveniences may result in a smaller number of actual participants. In fact, the first attempts at taxi ridesharing without technological support also encountered problems with small numbers of shared trips due to the participants' time sensitivity [19, p. 32]. What is more, the datasets only included the satisfied trips and did not provide any information about excess demand [20, p. 418].

While it was not part of the thesis, the last step of the data mining process would be the deployment of the results [see 16, pp. 25, 27-28]. In this context, the TRS approach could be tested in a field trial in cooperation with a taxi company. This would allow measuring the actual success of the proposed service under realistic circumstances. It would then be possible to address a few open research questions. First, the impact of the approach on the number of satisfied queries could be measured to test the assumption that price elasticity and/or excess demand do indeed counteract the reduction of individual trips [20, pp. 410-411; 21, p. 37]. Also, the field trial could be used to identify incentives that drive the early acceptance of the approach [8, p. 146]. As an alternative to the field trial, the presented data-driven implementation could be replaced by simulation software that mimics the real-world behavior of the different participants more closely. Yet, such a simulation might not be able to cover all real-world patterns of taxi mobility provided by the taxi trajectory datasets or a field trial [22, p. 6737].

To wrap things up, the sharing economy is still in its infancy [2, p. 9; 23, p. 17], and ridesharing approaches in particular have only recently become disruptive innovations for the taxi industry [2, p. 9]. This is why the taxi industry should adapt to this development and implement a sharing approach of its own, like the one presented in the thesis.

References

[1] B. Barann, D. Beverungen, O. Müller, An open-data approach for quantifying the potential of taxi ridesharing, Decis. Support Syst. (2017). doi:10.1016/j.dss.2017.05.008.

[2] B. Cohen, P. Muñoz, Sharing cities and sustainable consumption and production: towards an integrated framework, J. Clean. Prod. (2015) 1–11. doi:10.1016/j.jclepro.2015.07.133.

[3] J.A. Price, Sharing: The Integration of Intimate Economies, Anthropologica. 17 (1975) 3. doi:10.2307/25604933.

[4] J. Hamari, M. Sjöklint, A. Ukkonen, The sharing economy: Why people participate in collaborative consumption, J. Assoc. Inf. Sci. Technol. (2015) 1–13. doi:10.1002/asi.23552.

[5] C.J. Martin, The sharing economy: A pathway to sustainability or a nightmarish form of neoliberal capitalism?, Ecol. Econ. 121 (2015) 149–159. doi:10.1016/j.ecolecon.2015.11.027.

[6] H.A. Posen, Ridesharing in the Sharing Economy: Should Regulators Impose Über Regulations on Uber?, Iowa Law Rev. 101 (2015) 405–433. http://ilr.law.uiowa.edu/files/ilr.law.uiowa.edu/files/ILR_101-1_Posen.pdf.

[7] R. Botsman, The Sharing Economy Lacks A Shared Definition, Co.EXIST. (2013). http://www.fastcoexist.com/3022028/the-sharing-economy-lacks-a-shared-definition#20 (accessed March 18, 2016).

[8] P.M. D'Orey, R. Fernandes, M. Ferreira, Empirical evaluation of a dynamic and distributed taxi-sharing system, in: 2012 15th Int. IEEE Conf. Intell. Transp. Syst., IEEE, 2012: pp. 140–146. doi:10.1109/ITSC.2012.6338703.

[9] S.-F. Cheng, D.T. Nguyen, H.C. Lau, A Mechanism for Organizing Last-Mile Service Using Non-dedicated Fleet, in: 2012 IEEE/WIC/ACM Int. Conf. Web Intell. Intell. Agent Technol., IEEE, 2012: pp. 85–89. doi:10.1109/WI-IAT.2012.254.

[10] P.-Y. Chen, J.-W. Liu, W.-T. Chen, A Fuel-Saving and Pollution-Reducing Dynamic Taxi-Sharing Protocol in VANETs, in: 2010 IEEE 72nd Veh. Technol. Conf. - Fall, IEEE, 2010: pp. 1–5. doi:10.1109/VETECF.2010.5594422.

[11] P. Santi, G. Resta, M. Szell, S. Sobolevsky, S.H. Strogatz, C. Ratti, Quantifying the benefits of vehicle pooling with shareability networks, Proc. Natl. Acad. Sci. 111 (2014) 13290–13294. doi:10.1073/pnas.1403657111.

[12] C.-C. Tao, C.-Y. Chen, Heuristic Algorithms for the Dynamic Taxipooling Problem Based on Intelligent Transportation System Technologies, in: Fourth Int. Conf. Fuzzy Syst. Knowl. Discov. (FSKD 2007), IEEE, 2007: pp. 590–595. doi:10.1109/FSKD.2007.346.

[13] H. Hosni, J. Naoum-Sawaya, H. Artail, The shared-taxi problem: Formulation and solution methods, Transp. Res. Part B Methodol. 70 (2014) 303–318. doi:10.1016/j.trb.2014.09.011.

[14] L.M. Martinez, G.H.A. Correia, J.M. Viegas, An agent-based simulation model to assess the impacts of introducing a shared-taxi system: an application to Lisbon (Portugal), J. Adv. Transp. 49 (2015) 475–495. doi:10.1002/atr.1283.

[15] C. Tao, C. Wu, Behavioral responses to dynamic ridesharing services - The case of taxi-sharing project in Taipei, in: 2008 IEEE Int. Conf. Serv. Oper. Logist. Informatics, IEEE, 2008: pp. 1576–1581. doi:10.1109/SOLI.2008.4682777.

[16] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, R. Wirth, CRISP-DM 1.0 - Step-by-step data mining guide, 2000. https://www.the-modeling-agency.com/crisp-dm.pdf.

[17] V. Dhar, Data Science and Prediction, New York, New York, USA, 2012. http://archive.nyu.edu/bitstream/2451/31553/2/Dhar-DataScience.pdf.

[18] L.A. Kurgan, P. Musilek, A survey of Knowledge Discovery and Data Mining process models, Knowl. Eng. Rev. 21 (2006) 1–24. doi:10.1017/S0269888906000737.

[19] R. Cervero, Paratransit in America: Redefining Mass Transportation, Praeger, 1997. https://goo.gl/nogvDy.

[20] S. Ma, Y. Zheng, O. Wolfson, T-share: A large-scale dynamic taxi ridesharing service, in: 2013 IEEE 29th Int. Conf. Data Eng., IEEE, 2013: pp. 410–421. doi:10.1109/ICDE.2013.6544843.

[21] M. Stiglic, N. Agatz, M. Savelsbergh, M. Gradisar, The benefits of meeting points in ride-sharing systems, Transp. Res. Part B Methodol. 82 (2015) 36–53. doi:10.1016/j.trb.2015.07.025.

[22] D.O. Santos, E.C. Xavier, Taxi and Ride Sharing: A Dynamic Dial-a-Ride Problem with Money as an Incentive, Expert Syst. Appl. 42 (2015) 6728–6737. doi:10.1016/j.eswa.2015.04.060.

[23] D. Wosskow, Unlocking the sharing economy: An independent review, Department for Business, Innovation and Skills, London, 2014. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/378291/bis-14-1227-unlocking-the-sharing-economy-an-independent-review.pdf.

This master's thesis was submitted at the Westfälische Wilhelms-Universität Münster, Chair for Information Systems and Information Management of Prof. Dr. Dr. h.c. Dr. h.c. Jörg Becker. It was supervised by Prof. Dr. Daniel Beverungen, who was appointed to Paderborn University while the thesis was being written.

Data Acquisition, Inspection and Adaption as Mandatory Prerequisites for Generating High Quality Predictions by Means of Business Analytics

Monica Deutsch

Every Business Analytics project is individual and different from others. For example, data can appear in a structured, semi-structured, or unstructured form. Data sources might present themselves as data streams or as stored data. As a result, companies are uncertain when it comes to implementing such projects. It is not possible to find one single approach that solves every one of these projects.

The big challenge concerning big data is to figure out how the mass of different kinds of data produced can be utilized. Once the intended purpose of the data analysis is clear and suitable data sources for this purpose are available, the focus shifts to finding a system that is able to conduct the desired analysis.

However, the output quality of every system largely depends on the input quality. Data preparation is therefore a very important part of the process, with various aspects to consider. Additionally, different streams of data might not share the same data structure. It is thus necessary to identify the requirements for data sources, their structure, and their origin, and to find a way to handle a wide variety of use cases.


Goals

To handle data quality and meet the input data requirements of an analytics solution, certain preparation steps are needed. These necessary steps have to be identified, interrelated, and documented.

In order to identify as many different aspects of data preparation as possible, two different scenarios are described. In the first, the traffic situation in Graz is analyzed and predicted; in the second, Austria's employment market is examined in combination with university enrollment numbers. Moreover, the success of the identified data preparation approach needs to be verified, which is done by conducting Business Analytics with the prepared data.

For this, a flexible and adaptive system in the style of an "as a Service" product is used, since this best represents a real-world scenario. In the search for related work, no references could be found for such a Business Analytics system that provides the consumer with one complete solution. Therefore, a system that was developed in a previous student project is used. So far, it has only been tested with one stream of data in the previously mentioned traffic prediction scenario.

Using this cloud-based solution makes it possible to verify the success of the data preparation tasks conducted. In addition, the adaptability and flexibility of the solution are examined in an attempt to demonstrate its applicability to a wide range of diverse use cases.

The scientific questions, which derive from the underlying problem, the motivation and the goals of this thesis, are listed below. The first question can be categorized as data domain specific, while the other three questions focus on technical aspects.

• Is it reasonable to use machine learning for analyzing the employment market in connection with university enrollment numbers?

• Which criteria apply to selecting and extracting relevant features for Business Analytics from the given data sources?

• In general, which data preparation steps are necessary to prepare a data source for Business Analytics?

• Is the presented Prescriptive Analytics solution applicable to use cases from various domains of knowledge?


Approach

To develop a process model for data preparation, two use cases are described and used to infer implications for a general process model. In the first use case, which was conducted in a project prior to this thesis, road traffic predictions were the desired outcome. The main focus in this project was to develop a cloud architecture capable of performing Business Analytics with different data sets. Nevertheless, some revealing factors for data preparation could also be identified throughout the project.

Following that, the second use case about job market predictions focuses mainly on data preparation. This use case was chosen specifically because of some major differences in the characteristics of the two projects. In the first case, only one data source is used, whereas the second case makes use of three different sources. Additionally, in the traffic prediction scenario, real-time streaming data is handled, while the second scenario uses statically stored files that are updated in intervals of several months. As a result, the two use cases represent as many different aspects of data preparation as possible.

Figure 1: Process Model for Data Preparation


Results

Based on the experiences from the two use cases, a process model for data preparation is proposed. Figure 1 illustrates the model, which starts right after a business problem has been identified. At the end of the process model, the data is passed on to a Business Analytics system. Depending on the use case, though, some of the listed data preparation steps can take place after the transition into the Business Analytics system has occurred. Typically, the tasks starting with “Cleaning” can be performed within the Business Analytics system. It has to be decided for every project individually, though, which tasks are performed prior to sending the data to a Business Analytics system and which tasks are handled within this system.

The proposed model contains the tasks that can be required during data preparation. Not every task is necessary in every use case, though. Moreover, the arrows that connect the different steps might seem to constitute a chronological order of the tasks. Yet this is only a suggested flow that is not necessarily suitable for every use case. Especially within the big phases, which are represented by the horizontal lines and named on the left border of the model, some of the tasks might be exchanged with one another.

When multiple data sources are used and merging is necessary, there are some more constraints to consider. Unification has to take place before merging in order to make merging possible at all. If values are not standardized across all involved data sets, they cannot be used for matching data records while merging. Therefore, unification is particularly important when working with multiple data sets.
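As a minimal sketch of why unification has to precede merging, the following Python snippet matches records from two data sets on a district name. The data sets, the field names (`district`, `jobs`, `enrollments`) and the normalization rule are hypothetical and serve only as an illustration; they are not taken from the thesis.

```python
def unify_key(value):
    """Standardize a key value: trim whitespace and ignore letter case."""
    return value.strip().lower()


def merge_on_key(left, right, key, unify=None):
    """Inner-join two lists of dicts on `key`, optionally unifying key values first."""
    normalize = unify if unify is not None else (lambda v: v)
    index = {normalize(row[key]): row for row in right}
    merged = []
    for row in left:
        match = index.get(normalize(row[key]))
        if match is not None:
            combined = {**match, **row}
            combined[key] = normalize(row[key])
            merged.append(combined)
    return merged


# Hypothetical example data: the same districts, spelled inconsistently.
jobs = [{"district": "Graz ", "jobs": 1200}, {"district": "LINZ", "jobs": 800}]
enrollments = [
    {"district": "graz", "enrollments": 300},
    {"district": "Linz", "enrollments": 150},
]

without_unification = merge_on_key(jobs, enrollments, "district")
with_unification = merge_on_key(jobs, enrollments, "district", unify=unify_key)

print(len(without_unification))  # number of matches on the raw key values
print(len(with_unification))     # number of matches on the unified key values
```

With the raw values, none of the keys are equal, so no records can be matched; after unification, both districts are matched.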

In other cases, though, only one data source might be incorporated and no merging is necessary, hence the dotted border of the merging task in Figure 1. In such cases, unification might not be that relevant, since there is no other data set to unify it with. Nonetheless, the separate substeps of the unification task can also be important in those cases in order to meet the requirements of the business problem.

Furthermore, cleaning can be done after merging, although this is not advisable. When missing values, outliers and duplicate or misleading records are included while merging, this might result in even more inaccurate data records. In addition, the integrity of flawless records can be compromised when they are matched with faulty data records.
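A cleaning step of this kind, executed before merging, could look roughly like the following Python sketch. The function name, the record layout and the cleaning rules are assumptions made for illustration; only missing values and exact duplicates are handled here, while outlier detection is omitted.

```python
def clean(records, required_fields):
    """Drop records with missing required values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(field) is None for field in required_fields):
            continue  # skip records with a missing value
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue  # skip exact duplicates
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical sensor readings with one duplicate and one missing value.
readings = [
    {"sensor": "A", "count": 10},
    {"sensor": "A", "count": 10},
    {"sensor": "B", "count": None},
    {"sensor": "C", "count": 7},
]

cleaned_readings = clean(readings, ["sensor", "count"])
print(cleaned_readings)
```

Only the cleaned records would then be passed on to the merging task, so that faulty records cannot be matched with flawless ones.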

Additionally, removing unneeded data records as well as unifying values can significantly decrease processing time when executed early in the process model. This can be especially useful when working with large data sets.

Depending on the columns that are necessary to create a new value, feature engineering can also be located after merging. It is also possible to split the task and perform it before as well as after merging. First, features can be created separately for each data set if the required values are already present in the particular data set. Second, values from different data sets are sometimes necessary to create a feature. In these cases, the engineering of the feature can only take place after merging.
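The split between feature engineering before and after merging can be illustrated with a short Python sketch. All names and numbers (`enrolled`, `graduates`, `open_positions` and the derived features) are hypothetical and are not taken from the data sets used in the thesis.

```python
# Hypothetical data sets, both keyed by field of study.
studies = {"informatics": {"enrolled": 400, "graduates": 100}}
jobs = {"informatics": {"open_positions": 250}}


def add_graduation_rate(study_row):
    """Feature built from a single data set: can be engineered before merging."""
    rate = study_row["graduates"] / study_row["enrolled"]
    return {**study_row, "graduation_rate": rate}


def merge(left, right):
    """Combine the rows of all keys that occur in both data sets."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


def add_positions_per_graduate(merged_row):
    """Feature that needs values from both data sets: only possible after merging."""
    ratio = merged_row["open_positions"] / merged_row["graduates"]
    return {**merged_row, "positions_per_graduate": ratio}


studies_with_features = {key: add_graduation_rate(row) for key, row in studies.items()}
merged = {
    key: add_positions_per_graduate(row)
    for key, row in merge(studies_with_features, jobs).items()
}
print(merged["informatics"])
```

The first feature depends only on values from one data set, whereas the second one combines columns from both and therefore has to wait until the merged record exists.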

The task of feature selection is essential to the successful analysis of a data set. Even if only one misleading feature is included in the Machine Learning training data set, the possibility of finding valid patterns within the data set can be decreased. Therefore, when the ideal feature set is not obvious, it can also be advisable to execute several Machine Learning training runs with slightly different feature sets. In principle, the ideal feature set consists of those columns that best describe the underlying patterns. Deciding which features this rule applies to requires domain knowledge.

While looking for the ideal feature combination, some factors have to be kept in mind. First, the selected features need to be independent of each other and, especially, of the label. Second, it might be helpful to split a feature up into multiple values in order to support the identification of patterns. Third, features should be converted to the appropriate format, for example string or integer, before training, which also facilitates the discovery of patterns.
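The second and third factor can be sketched as follows, assuming a hypothetical traffic record with an ISO timestamp and a vehicle count stored as a string. The field names and the choice of derived features are illustrative assumptions only.

```python
from datetime import datetime


def split_timestamp(record):
    """Split a timestamp into separate features and convert values to integers."""
    moment = datetime.fromisoformat(record["timestamp"])
    return {
        "weekday": moment.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": moment.hour,
        "vehicles": int(record["vehicles"]),  # string converted to integer
    }


# Hypothetical raw record as it might arrive from a data source.
raw = {"timestamp": "2016-11-07T08:15:00", "vehicles": "42"}
features = split_timestamp(raw)
print(features)
```

Splitting the timestamp exposes the weekday and the hour as separate values, which can make recurring patterns such as rush hours easier for a learning algorithm to find than a single timestamp string.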

Most importantly, the whole data preparation process appears as a cyclic flow. The necessity to loop back to a previous task might arise at any point in the process. For example, the know-how of the data acquisition phase can be useful during data inspection, since some characteristics of the data are already known. Conversely, it can be necessary to loop back from data inspection to data acquisition if the selected data source turns out not to be exactly what is needed for the use case. Thus, due to new findings made while performing any task of the model, it might be required to loop back to a previous phase. Following this approach, the data preparation process works similarly to a continuous improvement process, thereby ensuring high quality outcomes.

Besides the results concerning data preparation, there was also the question of whether the Prescriptive Analytics as a Service solution used would be applicable to data from different domains of knowledge. This could be confirmed by using it for analyzing data from the field of street traffic as well as data concerning the employment market and study enrollment numbers. The main reason for this positive outcome is the modular structure of the solution, which makes it highly flexible and adaptive. While every architectural component was utilized in the first use case, the second use case did not require handling of real-time data. Therefore, the components that were not required were skipped without influencing the operability of the whole solution.

Another major implication of the implemented use cases is the importance of fundamental domain knowledge. In the first use case, the understanding of the field of street traffic was sufficient to understand, interpret and work with the acquired data. Establishing that amount of knowledge was facilitated by the fact that personal contact with the data provider was possible. Also, specialist literature concerning the relevant topics was available and could be consulted. Additionally, the real-time character of the data made interpretation easier: no statistical preprocessing takes place prior to receiving the data and sending it to the analytics architecture.

In contrast, in the second use case a lack of domain knowledge had to be handled. The fact that multiple data sources are used in this scenario further aggravates the obstacle of lacking data understanding. As opposed to the first use case, all three data sources used here are statistical data sets which were already subject to classification and aggregation. The web services of the data providers do not offer any further information concerning these preprocessing measures, and no face-to-face contact with the data providers was possible.

Therefore, the interpretation of the data sets was more difficult than in the first use case. As a result, a highly subjective component influences the outcome of the scenario, because correlations between jobs and studies were made based on personal experience.

Normally, in real-world scenarios, establishing the necessary amount of domain knowledge is not a big problem, though. Companies want to analyze their own data, for which they have the necessary expertise. This factor becomes relevant only in scenarios where external third-party data is incorporated in addition to the company's own data.

Conclusion

Concluding from the importance of data understanding and looking at the big picture of Business Analytics, this thesis deals with one of three equally important areas. In addition to the expertise from the field of computer science, which is necessary to properly handle and prepare data sets and implement analytics systems, two other areas of expertise must not be disregarded. First, it is absolutely necessary to understand the data one is working with during a Business Analytics project beyond its computer-science-specific characteristics. Only by establishing a profound understanding of the specific domain of knowledge is it possible to create high quality predictions or prescriptions. Second, the field of statistics, which was not examined in this thesis at all, is essential to building Machine Learning models with the highest possible predictive value. It is fundamental to see how crucial each of the three areas is for Business Analytics projects and how they are associated and interact with each other.

This diploma thesis was submitted at Fachhochschule Joanneum, Chair of Information Management, under the supervision of FH-Prof. Dipl.-Ing. Dr. techn. Erwin Zinser.


The amount of data being created is increasing exponentially. By 2020, 44 ZB of data will have been created, compared to four ZB in 2013. Meanwhile, analyzing data to extract information and value from it is becoming more important. Business Analytics can be greatly enhanced by means of Cloud Computing. Together, both trends provide possibilities to optimize the increasing amount of road traffic in order to decrease environmental pollution. This work documents the development of a cloud-based framework, resting upon a given Business Analytics Framework, which makes it possible to conduct Prescriptive Analytics on a branch-independent level for all businesses. With the presented innovative solution, data scientists in all businesses can apply Business Analytics in all knowledge and business domains in order to generate high value by extracting information from data. By developing an approach for a self-learning algorithm, project executives can considerably improve implementations based on the proposed framework. To conclude, this thesis not only provides a high-quality guideline for all businesses (know-how buildup, implementation and execution support); the developed “as a service” solution also addresses the regional economy of Graz, Styria, and demonstrates that Business Analytics enables the creation of value in basically every field of knowledge, which can be further enriched by implementing self-learning algorithms.

Benjamin Gürtl

Road Traffic Optimization and Decision Support by Means of Prescriptive Analytics as a Service

Index Terms— Big Data, Business Analytics, Predictive Analytics, Prescriptive Analytics, Cloud Computing, as a Service, Microsoft Azure, Machine Learning, Stream Analytics, Road Traffic Quality, Road Traffic Optimization, Business Analytics Framework, Self-Learning Algorithm, Guideline, Workflow Automation

THIRD PLACE


This summary outlines the most important aspects of this thesis. First, the underlying problem statement, the aims and the proposed approach are briefly summarized. Then, the scientific questions (defined in chapter 1.3.2 of the thesis) and results are discussed in a structured way. Finally, further research and development approaches are proposed.

In general, the amount of data being created is growing exponentially. The increasing amount of data brings challenges to businesses regarding the handling of this data. In addition to those challenges, data also brings opportunities with respect to gaining valuable insights. By analyzing data, information can be extracted in order to obtain insights, predictions and, as a result, value for businesses. The processes of analyzing data to predict future events and consequently provide decision support for authorities are called Predictive and Prescriptive Analytics (special types of Business Analytics, see Figure 1). Business Intelligence related technologies like Machine Learning [ML] and predictive algorithms are used to conduct these types of Analytics. Furthermore, the concept of Cloud Computing is trending and provides a rich set of possibilities to businesses regarding the implementation of Business Analytics projects.

Research in the application domain of road traffic shows not only that traffic rates are growing, which consequently brings a higher level of environmental pollution, but also that the amount of measured and stored data within the domain of road traffic is increasing. As a result, combining the aforementioned technologies of Predictive & Prescriptive Analytics and Cloud Computing (e.g. P*AaaS) with the domain of road traffic seems useful in order to optimize road traffic and provide decision support (regarding the optimal travel route) for traffic participants.

Based upon these underlying problems, this thesis provides insights in the form of theoretical principles on the topics of Big Data (chapter 2.1), Business Analytics (including technologies) (chapter 2.2), Cloud Computing (chapter 2.3) as well as conducted research in the field of Business Analytics as a Service (chapter 2.4). Furthermore, theoretical principles in the application domain of road traffic are explained (chapter 3). In order to prove that a comprehensive cloud implementation by means of Microsoft Azure Cloud can provide high value in the field of road traffic by predicting future road traffic qualities, the practical implementation of the Traffic Prescription Service [TPS] is documented in detail in chapter 4. In addition, chapter 4.4 highlights the development and implementation of an approach regarding self-learning algorithms, which improves the prediction accuracy by reusing previous results. Finally, the approach of Predictive and Prescriptive Analytics as a Service (P*AaaS) is evaluated regarding the proof of concept and automation possibilities (chapter 5 and chapter 6). As a result, a P*AaaS framework in the form of a guideline for practitioners, enriched with multiple theoretical fundamentals and references for practical implementations, is proposed (chapter 5.3 and Appendix).

Figure 1 shows the Analytics Spectrum according to Gartner. While Descriptive and Diagnostic Analytics deal with the past, Predictive and Prescriptive Analytics answer future questions and minimize the human input in order to provide Decision Support or even Decision Automation for executives.


As explained in the theoretical fundamentals (chapter 2 and chapter 3), research shows that a general framework for conducting Business Analytics by means of Cloud Computing can bring high value to businesses. As proposed by Lückl, the next step in the research of Business Analytics as a Service (especially Predictive and Prescriptive Analytics as a Service) is to implement a solution in a non-artificial environment. From these recommendations, the chosen technology (Microsoft Azure) and the application domain of road traffic, the first scientific question (“Does the present Business Analytics Framework provide a sufficient fundament to develop a cloud based service in a non-artificial environment in order to predict the quality of road traffic and provide the user with predicted values as well as prescriptive solutions?”) was derived and answered in chapter 4. In fact, the practical implementation of the Traffic Prescription Service (see Figure 2) yields a positive answer for all parts of this question. The foundation for the implementation was the previously introduced BA Framework by Lückl. Some implementation parts (e.g. data storage) are inherited from Lückl’s example implementation in the field of ITOA. In cooperation with the city of Graz and the data source of BLIDS, a use case with non-artificial data in a non-artificial environment was implemented. The TPS implementation shows that the quality of road traffic within Graz can be predicted with an accuracy of 83.3%. Furthermore, the results of the Predictive Analytics are used in order to conduct Prescriptive Analytics and provide the approach of traffic optimization and Decision Support.

After implementing a standard algorithm within Azure Machine Learning in order to conduct Predictive Analytics in the Traffic Prescription Service, the second scientific question (“Is Azure Machine Learning suitable to implement a self-learning algorithm in order to optimize the quality of predictions and the decision support by means of reusing previous predicted values and matching them with actual occurring values?”) is answered in chapter 4.4. At first, an approach of including previous predictions in order to improve the outcome of future predictions is proposed. For this purpose, the ML principle of duplicated data records as well as a developed formula to weigh up all factors is introduced (dataset of ΔQ, see Figure 3). The implementation of this approach within an observation window of one month in the Traffic Prescription Service shows an improvement of over 3.6% in the prediction accuracy of the predictive algorithm (see chapter 4.4.2). Hence, the second scientific question can be answered positively as well.

Figure 2 shows the service infrastructure of the TPS, which consists of multiple Azure services enriched with scripts and Power BI. Furthermore, the figure represents the basic workflow of the TPS, which is supported by .NET C# scripts.

Figure 3 shows the calculations which need to be done, in order to calculate the third dataset of ΔQ.

Parameters for ΔQ:

• Actual Quality … Qa (50 %)

• Predicted Quality … Qp (35 %)

• Predicted Possibilities … Pp (15 %)

Formula for Predicted Possibilities:

$$P_p = P_1 + 2\,P_2 + 3\,P_3 + \dots + n\,P_n$$

Formula for ΔQ:

$$\Delta Q = \operatorname{round}\left(0.5\,Q_a + 0.35\,Q_p + 0.15\,P_p\right)$$
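The weighting above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, P1 … Pn are taken to be the predicted probabilities of the quality classes 1 … n, and Python's built-in `round` stands in for the `Math.Round` call named in the formula. The rounding mode used in the thesis is not stated in this summary.

```python
def predicted_possibilities(probabilities):
    """Pp = P1 * 1 + P2 * 2 + ... + Pn * n, where probabilities[i - 1] holds Pi."""
    return sum(i * p for i, p in enumerate(probabilities, start=1))


def delta_q(actual_quality, predicted_quality, probabilities):
    """Weighted combination of Qa (50 %), Qp (35 %) and Pp (15 %), rounded to an integer."""
    pp = predicted_possibilities(probabilities)
    return round(0.5 * actual_quality + 0.35 * predicted_quality + 0.15 * pp)


# Hypothetical example: three quality classes, actual class 3, predicted class 2.
class_probabilities = [0.1, 0.2, 0.7]
pp_value = predicted_possibilities(class_probabilities)
dq_value = delta_q(3, 2, class_probabilities)
print(pp_value, dq_value)
```

In this example Pp is roughly 2.6, so the weighted sum is about 1.5 + 0.7 + 0.39 = 2.59, which rounds to a ΔQ of 3.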

The successful implementation of the non-artificial use case in the application domain of road traffic automatically leads to the next step within the research of Business Analytics as a Service (e.g. P*AaaS). This next step is to answer the scientific question regarding the complete “as a Service” approach (“Do current Microsoft Azure technologies enable the possibility of implementing a comprehensive cloud architecture which provides Prescriptive Analytics as a Service, usable in multiple knowledge and business domains?”). In order to answer this question, the TPS implementation is evaluated in detail regarding the necessary steps of implementation and all parameters and information needed for them (see chapter 5.1). The evaluation showed that, by means of 124 parameters and pieces of information as well as some knowledge within the fields of Azure Cloud Services and Computer Science, the implementation of the TPS and especially the provided supportive documents (see chapter 5 and Appendix) can be used as a P*AaaS framework by project teams in any business and field of knowledge. This thesis provides a detailed list of needed parameters and information (see chapter 5.1 and Appendix) and a high-quality, highly detailed project plan (see chapter 5.3 and Appendix). Both, enriched with the documents of the TPS implementation, provide a guideline for practitioners, which strongly supports the implementation of a P*AaaS solution in any domain of knowledge and business. Furthermore, Deutsch [12] implemented a use case based on the TPS framework to prove the correctness of predictions in the field of employment markets. Thus, the third scientific question can be answered positively as well. As can be seen in Figure 4, a BA project consists of several steps. The proposed project plan serves as a guideline for practitioners in all necessary steps of such a project.


In order to optimize the “as a Service” approach, the aspect of automation with regard to service creation needs to be considered (“Do current Microsoft Azure technologies provide necessary possibilities to automate the implementation of a Predictive or Prescriptive Analytics as a Service solution?”). The evaluation in chapter 5.2 as well as in chapter 6 with respect to automation capabilities of the Azure Cloud showed that the amount of automation required in order to benefit from this mechanization cannot be achieved. Additionally, automation brings further issues (e.g. security issues with tokens; problems regarding customization and adaption), which complicate the process of implementing a P*AaaS solution. Hence, the answer to this scientific question can be stated as follows: automation can be implemented, but instead of providing benefit, it introduces an additional layer of complexity and is therefore not advisable.

Figure 4 shows the detailed project plan, which represents the proposed P*AaaS framework.


In fact, the implementations and evaluations show that Microsoft Azure is a high-quality tool for implementing comprehensive cloud solutions for conducting Predictive or Prescriptive Analytics. Nevertheless, it is also shown that the Azure platform lacks possibilities for automation and brings security issues when introducing automation aspects (even in workflow automation). Hence, implementing such solutions in a comprehensive Azure environment brings benefits regarding interfaces and usability, but also results in downsides, such as a lack of automation possibilities and, at some point, a lack of thinking outside the box. In the end, it needs to be considered whether to further enrich the proposed solution by introducing third-party tools, services or engines that help to overcome those downsides.

It is proposed to conduct future research regarding alternative tools to the Azure Cloud. These alternatives need to be evaluated regarding the P*AaaS approach with respect to multiple implementations in different business and knowledge domains. If the P*AaaS approach can be confirmed by using alternative tools (or a hybrid solution), an evaluation regarding automation needs to be performed. If this evaluation confirms automation possibilities by using alternative frameworks, tools, services or engines, the P*AaaS framework proposed in this thesis needs to be adapted to those new insights. If, however, research shows that no alternatives exist, a whole new market niche has been identified, and further research should address the development of a service that fulfills the requirements of P*AaaS as well as an automated approach, using a central automation engine to support both service creation and workflow automation.


References

[1] N. Marz and J. Warren, Big Data: Principles and best practices of scalable real-time data systems, New York: Manning Publications Co., 2015.

[2] B. Vorhies, "4 Things you need to know about the growth of big data," Data Magnum, 31 10 2013. [Online]. Available: http://data-magnum.com/4-things-you-need-to-know-about-the-growth-of-big-data-where-and-what-types-of-data-are-growing-and-how-much-can-really-provide-value/. [Accessed 11 4 2016].

[3] M. Bansal, "Big Data: Creating the Power to Move Heaven and Earth," MIT Technology Review, 2 9 2014. [Online]. Available: https://www.technologyreview.com/s/530371/big-data-creating-the-power-to-move-heaven-and-earth/. [Accessed 11 4 2016].

[4] M. Walker, "Data Science Central," 12 2012. [Online]. Available: http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy. [Accessed 7 2 2016].

[5] R. Barga, V. Fontama and W.-H. Tok, Predictive Analytics with Microsoft Azure Machine Learning, Second Edition, New York: Apress Media, 2015.

[6] Capgemini, "Studie IT-Trends 2016," Capgemini, Berlin, Wien, Zürich, 2016.

[7] B. Lückl, Business Analytics as a Framework, Graz: FH JOANNEUM, 2015.

[8] G. Leduc, "Road Traffic Data: Collection Methods and Applications," European Commission, Joint Research Center, Sevilla, 2008.

[9] P. Wang, T. Hunter, A. Bayen, K. Schechtner and M. Gonzales, "Understanding Road Usage Patterns in Urban Areas," Scientific Reports, 2012.

[10] W. Schnabel and D. Lohse, Grundlagen der Straßenverkehrstechnik und der Verkehrsplanung Band 1 - Straßenverkehrstechnik, Berlin: Beuth Verlag GmbH, 2011.

[11] W. Schnabel and D. Lohse, Grundlagen der Straßenverkehrstechnik und der Verkehrsplanung Band 2 - Verkehrsplanung, Berlin: Beuth Verlag GmbH, 2011.

[12] M. Deutsch, "Data Acquisition, Inspection and Adaption as Mandatory Prerequisites for Generating High Quality Predictions by Means of Business Analytics," FH JOANNEUM, Graz, 2016.

[13] B. Gürtl, "Road Traffic Optimization and Decision Support by Means of Prescriptive Analytics," FH JOANNEUM, Graz, 2017.

[14] Gartner, Inc., "IT Glossary - Big Data," Gartner, Inc., 2016. [Online]. Available: http://www.gartner.com/it-glossary/big-data/. [Accessed 12 4 2016].

[15] E. McNulty, "Understanding Big Data: The seven V's," Dataconomy, 22 5 2014. [Online]. Available: http://dataconomy.com/seven-vs-big-data/. [Accessed 12 4 2016].

[16] K. Normandeau, "Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity," insideBIGDATA, 12 9 2013. [Online]. Available: http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/. [Accessed 12 4 2016].

[17] V. Kumar, Computational Business Analytics, Minneapolis, Minnesota, USA: CRC Press - Taylor & Francis Group, 2014.

[18] Gartner, "IT Glossary - Business Analytics," Gartner, 2016. [Online]. Available: http://www.gartner.com/it-glossary/business-analytics/. [Accessed 28 4 2016].

[19] L. Kart, G. Herschel, A. Linden and J. Hare, "Magic Quadrant for Advanced Analytics Platforms," Gartner, 9 2 2016. [Online]. Available: https://www.gartner.com/doc/reprints?id=1-2YEIILW&ct=160210&st=sb. [Accessed 27 4 2016].

[20] J. Evans and C. Lindner, "Business Analytics: The Next Frontier for Decision Sciences," College of Business, University of Cincinnati, 3 2012. [Online]. Available: http://www.cbpp.uaa.alaska.edu/afef/business_analytics.htm. [Accessed 29 4 2016].

[21] M. Minelli, M. Chambers and A. Dhiraj, Big Data. Big Analytics, New Jersey: Wiley, 2013.

[22] M. Gualtieri and R. Curran, "The Forrester Wave: Big Data Predictive Analytics Solutions, Q2 2015," Forrester, 2015.

[23] R. Wirth and J. Hipp, "CRISP-DM: Towards a Standard Process Model for Data Mining," Ulm, Tübingen.


[24] Microsoft, "Machine Learning Documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/services/machine-learning/. [Accessed 27 10 2016].

[25] Microsoft Azure, "Was ist Azure?," Microsoft, 2016. [Online]. Available: https://azure.microsoft.com/de-de/overview/what-is-azure/. [Accessed 27 4 2016].

[26] D. Chappell, "TechEd - Europa 2014," Microsoft, 2014.

[27] Microsoft, "Service Bus documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/services/service-bus/. [Accessed 27 10 2016].

[28] Microsoft, "Event Hubs documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/services/event-hubs/. [Accessed 27 10 2016].

[29] Microsoft, "Storage documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/services/storage/. [Accessed 27 10 2016].

[30] Microsoft, "Stream Analytics documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/services/stream-analytics/. [Accessed 27 10 2016].

[31] Microsoft, "Azure WebJobs documentation resources," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/articles/websites-webjobs-resources/. [Accessed 27 10 2016].

[32] Microsoft, "Web Apps Overview," Microsoft, 28 10 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/app-service-web/app-service-web-overview. [Accessed 15 11 2016].

[33] Microsoft, "Scheduler documentation," 2016. [Online]. Available: https://azure.microsoft.com/en-us/documentation/articles/websites-webjobs-resources/. [Accessed 27 10 2016].

[34] Microsoft, "Power BI documentation," 2016. [Online]. Available: https://powerbi.microsoft.com/en-us/documentation/powerbi-landing-page/. [Accessed 27 10 2016].

[35] B. Cagran, Interviewee, Business Analytics in Traffic Prediction. [Interview]. 4 11 2015.

[36] A. Tostes, F. Duarte-Figueiredo, R. Assuncao, J. Salles and A. Loureiro, "From Data to Knowledge: City-wide Traffic Flows Analysis and Prediction Using Bing Maps," UrbComp', Chicago, 2013.

[37] J. Fawcett and P. Robinson, "Adaptive Routing for Road Traffic," University of Cambridge, Cambridge, 2000.

[38] Microsoft Azure, "Machine learning tutorial: Create your first data science experiment in Azure Machine Learning Studio," 21 11 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-create-experiment. [Accessed 5 12 2016].

[39] M. Chang, R. Kruger and foobar, "What Accuracy Should I Be Aiming For?," kaggle, 2013. [Online]. Available: https://www.kaggle.com/c/titanic/forums/t/4894/what-accuracy-should-i-be-aiming-for. [Accessed 2 11 2016].

[40] Microsoft Azure, "Create Service Bus resources using Azure Resource Manager templates," 14 10 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-resource-manager-overview. [Accessed 5 12 2016].

[41] Microsoft Azure, "Create a Service Bus namespace using the Azure portal," 30 11 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-create-namespace-portal. [Accessed 5 12 2016].

[42] Microsoft Azure, "Get started with Azure Blob storage using .NET," 17 11 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-blobs. [Accessed 5 12 2016].

[43] Microsoft Azure, "Create and share an Azure Machine Learning workspace," 21 11 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-create-workspace. [Accessed 5 12 2016].

[44] Microsoft Azure, "Manage an Azure Machine Learning Workspace," 5 10 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-manage-workspace. [Accessed 5 12 2016].

[45] Microsoft Azure, "Deploy an ASP.NET web app to Azure App Service, using Visual Studio," 22 7 2016. [Online]. Available: https://docs.microsoft.com/en-us/azure/app-service-web/web-sites-dotnet-get-started#configure-a-new-web-project. [Accessed 5 12 2016].

This diploma thesis was submitted at Fachhochschule Joanneum, Graz, Institute for Information Management, supervised by DI (FH) Christopher Schwarz.

Statistische Analyse von Unternehmensdaten unter besonderer Berücksichtigung von Heavy-tailed-Verteilungen
(Statistical Analysis of Corporate Data with Special Consideration of Heavy-Tailed Distributions)

Frank Heimerzheim

Companies hold large volumes of data on customers, products, processes, and so on, and ongoing business operations continuously add more. At the latest since the term New Economy emerged around the year 2000, data has represented valuable capital for companies. This capital can be put to use by analysing the data and basing business decisions on the results. The task is demanding, because not only the volume of available data but also the degree to which different data sources are interlinked keeps growing.

This bachelor thesis contributes to carrying out this task of data analysis. It deals with one specific approach, namely statistical analysis using probability distributions. A probability distribution assigns to an event the probability with which it occurs. For example, a loan can be assigned a probability of default depending on the loan amount. The type and the parameters of a distribution can be determined from existing data. From the knowledge of how the data are distributed, a wide range of business-relevant information can be derived, in particular forecasts about the properties of newly added records and estimates of potential risks.

Distributions can be grouped by certain criteria into classes with similar properties. The heavy-tailed distributions form one such class. They are best described in contrast to familiar distributions such as the normal distribution. The normal distribution, with its characteristic bell curve, assigns the highest probability to values close to the mean of the data. With increasing distance from the mean the probability drops quickly, until it reaches a level at which the occurrence of such a value can be described as practically impossible. One example is the height of German women. On average they are 1.65 m tall. From everyday observation we know that most women are only slightly taller or shorter. Women with a height of, for example, 1.90 m are rare; women taller than 2.20 m probably do not exist in Germany.

For the Cauchy distribution, an example of a heavy-tailed distribution, the probability is likewise highest for values near the centre of the distribution and decreases with increasing distance, but, unlike for a normal distribution, only slowly. Only heavy-tailed distributions make it possible to describe relationships in which the majority of values behave "normally" but extreme values cannot be ruled out. One example is the stock market, where price movements are moderate on average. As past stock market crashes show, large price swings are rare but evidently to be expected. Heavy-tailed distributions allow models to be developed that can represent precisely such extreme events. Such models cannot predict the exact time and magnitude of a price swing, but they can estimate the probability that an event of a given magnitude occurs within a defined period. This knowledge in turn can serve as a basis for appropriate business action.
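The contrast between a quickly and a slowly decaying tail can be made concrete with a small calculation. The following sketch is an illustration only and is not taken from the thesis: it compares the probability of a value lying more than k scale units away from the centre for a normal distribution (scale: standard deviation) and for a Cauchy distribution (scale: its scale parameter). The function names are chosen here for illustration.

```python
import math


def normal_tail(k: float) -> float:
    """P(|X - mu| > k * sigma) for a normally distributed X."""
    return math.erfc(k / math.sqrt(2.0))


def cauchy_tail(k: float) -> float:
    """P(|X - x0| > k * gamma) for a Cauchy distributed X with scale gamma."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)


# Print both tail probabilities for increasing distances from the centre.
for k in (1, 3, 10):
    print(f"k={k:>2}  normal={normal_tail(k):.3e}  cauchy={cauchy_tail(k):.3e}")
```

At ten scale units the normal tail probability is practically zero, while the Cauchy tail still carries a probability of several percent. This is the behaviour the text describes as extreme values that "cannot be ruled out".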

In connection with heavy-tailed distributions, unusual events such as strong price swings are discussed regularly. A precise definition of the terms used in this thesis is therefore important. A distinction must be made between rare and extreme values. Rare values are data values that occur, for example, with a probability of only 1%. Extreme values are rare values that also differ markedly from typical values. Typical values, finally, are values that are not rare. This precise terminology matters because heavy-tailed distributions are characterised by the fact that their rare values can be extreme. In distributions that are not heavy-tailed, extreme values practically do not occur.

The aim of this thesis is to provide the tools needed to determine whether corporate data are heavy-tailed or not. To this end, the necessary theoretical foundations are examined and a software prototype is developed that helps to answer this question. This software enables companies to make well-founded decisions. It also allows business users without statistical expertise to carry out such analyses independently on ever new data.

This bachelor thesis was submitted to Prof. Dr.-Ing. Marlis von der Hude, Fachbereich 02/Informatik (Department of Computer Science), Hochschule Bonn-Rhein-Sieg.

Structure of the thesis

After the introduction of the topic in this first chapter, Chapter 2 covers different ways of representing data. Use cases for distributions are outlined. After a discussion of the term heavy-tailed, the terms extreme value and outlier are distinguished from each other. In addition to the inclusion of economic costs in the analysis of data by means of distributions, the chapter addresses the effects of misclassifying distributions and, in particular, worst-case scenarios. The focus of Chapter 2 is an illustrative presentation of the issues discussed, taking economic aspects into account.

Chapter 3 discusses distributions from a mathematical point of view. After introductory definitions, six distributions examined in detail in this thesis are presented. A mathematical definition of the term heavy-tailed follows. Methods are presented for determining, from given data, the distribution that best represents these data. Goodness-of-fit tests for such a distribution are discussed. The chapter closes with measures that describe the properties of a distribution. With the Pareto measure ("Pareto-Kennzahl"), a measure newly developed in this thesis is derived.
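The two steps mentioned for Chapter 3, fitting a distribution to data and checking the goodness of fit, can be sketched generically. The example below is an assumption-based illustration and does not reproduce the thesis's methods or its Pareto measure: it fits a Pareto distribution and, for comparison, a shifted exponential distribution to synthetic data by standard textbook estimators and compares both fits by their Kolmogorov-Smirnov distance. All names and the synthetic sample are hypothetical.

```python
import math
import random


def fit_pareto_alpha(data, x_min):
    """Maximum-likelihood estimate of the Pareto shape parameter alpha."""
    tail = [x for x in data if x >= x_min]
    return len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


# Synthetic heavy-tailed sample (assumption: Pareto with alpha = 2.5, x_min = 1).
random.seed(42)
x_min = 1.0
sample = [random.paretovariate(2.5) for _ in range(5000)]

# Fit a heavy-tailed candidate (Pareto) and a light-tailed one (shifted exponential).
alpha_hat = fit_pareto_alpha(sample, x_min)
rate_hat = 1.0 / (sum(sample) / len(sample) - x_min)


def pareto_cdf(x):
    return 1.0 - (x_min / x) ** alpha_hat


def exponential_cdf(x):
    return 1.0 - math.exp(-rate_hat * (x - x_min))


ks_pareto = ks_distance(sample, pareto_cdf)
ks_exponential = ks_distance(sample, exponential_cdf)
print(f"alpha_hat={alpha_hat:.2f}  KS(Pareto)={ks_pareto:.3f}  KS(exponential)={ks_exponential:.3f}")
```

The candidate with the smaller distance describes the sample better. A full goodness-of-fit test would additionally turn the distance into a p-value, which this sketch leaves out.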

For the data analysis platform Infonea, an extension module was developed that implements the methods presented in Chapter 3. The software is able to determine distributions from data and to present the results in detail. Chapter 4 explains the functional scope of the software. The programming tools used and the interfaces to Infonea are presented. Excerpts from the source code illustrate how the software works. Finally, Chapter 5 provides a summary and an outlook on further research questions.

Was ist „Agile Business Intelligence“?
(What Is "Agile Business Intelligence"?)

Nelli Neigum

The present era is strongly shaped by change. Changes in IT have reached an unprecedented pace. A company is competitive only if it can react quickly to changes as they occur. This ability is also described by the term "agility". Agility also plays a decisive role in the field of Business Intelligence. For this reason, some members of TDWI (The Data Warehouse Institute) coined the term "Agile Business Intelligence".

This master thesis analyses what lies behind "agile Business Intelligence". The first chapter describes the background and motivation, the objective of the thesis, and the approach. To set the right frame and introduce the reader to the topic, Chapter 2 explains the types of terms, Business Intelligence, BI projects, and agility. Chapter 3 presents "agile Business Intelligence" predominantly from the perspective of TDWI. Perspectives from outside TDWI, for example that of the market research firm Forrester, are also shown.

Chapter 4 describes agile process models, methods, architectures, and techniques independently of Business Intelligence. It deepens the presentation of the process models and methods already introduced in Chapter 3 and presents further agile process models and methods that could potentially be used within "agile Business Intelligence". Chapter 5 then presents several practical examples of agile BI projects.

Finally, Chapter 6 synthesises the aspects analysed and presented before and answers the question "What is 'Agile Business Intelligence'?".

This master thesis was submitted at Fachhochschule Vorarlberg, Dornbirn, supervised by Prof. (FH) Gunther Rothfuss, MSc.

The Semantic Data Warehouse for the AgriProKnow Project

Simon Schausberger

The vast amounts of data generated in modern dairy farming contain potentially valuable information that can be used to increase the efficiency of dairy farm operations. As of now this potential remains underused, which the AgriProKnow project intends to change. The AgriProKnow project develops a data analysis platform as a means to extract knowledge from the information contained in the data, which may facilitate the work of dairy farm managers, farming consultants, equipment vendors and veterinarians. The project aims at creating novel information-based methods to increase the efficiency of milk production in precision dairy farming. Contemporary dairy farming relies heavily on modern technology such as milking robots, feeding systems, and various sensors that track animal movement, microclimate, etc. All these systems produce vast amounts of data in different formats, typically existing in isolation. In order to leverage the information contained in the produced data, the AgriProKnow project taps into the various data sources that exist in modern dairy farms. To this end, it builds on the results of AgriOpenLink [5], a data integration platform that employs semantic technologies to overcome the differences in data format between the various sources. In AgriProKnow we transform these data into a form fit for analysis and provide intuitive query facilities for end users.

At the core of the AgriProKnow project is a semantic data warehouse (sDWH), a first prototype of which we present in this thesis. The sDWH contains integrated data from various sources described using semantic technologies. The data are organised following the multidimensional model, consisting of cubes and dimensions, realised in this first prototype using a relational database management system. The description of the semantics of the data in the sDWH allows the data to be linked to existing domain ontologies. The sDWH uses the farm operation data for analytical queries. Dairy farm managers, farming consultants, equipment vendors and veterinarians may run these analytical queries on the data in the sDWH. These users are hardly experts in database technology. Thus, the sDWH must include intuitive query facilities.

The AgriProKnow project's sDWH should provide historical data for statistical analysis at an adequate level of detail. In order to guarantee sensible results as far as possible, the sDWH must take into account the differences in quality of the underlying data and cope with incomplete data. Furthermore, the sDWH must be open to future additions to the data model. The sDWH should also provide facilities for intuitive querying and ease interpretation of the results.

Concerning the choice of database management system (DBMS), the first prototype opts for a relational DBMS. Since the sDWH should store historical data, a stream DBMS was discarded, although a stream DBMS may serve for data mining and preaggregation of the source data. With SQL as a powerful query language, relational DBMSs provide rich capabilities for statistical analysis, in particular when proprietary extensions are factored in. Relational DBMSs represent mature and reliable technology that has proven capable of handling large amounts of data. Although relational DBMSs often reach their limits in big data applications such as processing of sensor data, in the case of the AgriProKnow project the preprocessing of data reduces the size of the data in such a way that the data can be handled by a relational DBMS. Yet, the choice of a relational DBMS is not the rationale behind preaggregation. Rather, storing the full amount of sensor data at highly detailed levels of granularity is expensive and provides no gain for the analysis. The detailed data are noisy, and more abstract data are sufficient for generating knowledge.

The AgriProKnow project is an ongoing endeavour; its data model is subject to constant revision and extension. Moreover, once operative, the AgriProKnow data analysis platform should be open to agricultural domains apart from dairy farming, e.g., pig farming. Thus, the sDWH provides possibilities for schema creation and modification, which allow the data model to evolve during the project and beyond, without the need to apply tedious changes to the source code directly.
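One way to picture schema changes that do not touch source code is to treat the cube description as data and to derive the table statements from it. The sketch below is an assumption-laden illustration: the dictionary format, the function names, and the use of SQLite are hypothetical stand-ins. The thesis itself defines schemas in RDF and uploads them through RESTful web services, whose execution creates, alters, or drops tables in the relational database.

```python
import sqlite3

# Hypothetical, simplified cube description; the sDWH uses RDF vocabularies instead.
milking_cube = {
    "name": "milking",
    "dimensions": {"animal_id": "TEXT", "day": "TEXT"},
    "measures": {"milk_yield_kg": "REAL"},
}


def create_table_sql(cube):
    """Derive a CREATE TABLE statement from the cube description."""
    cols = [f"{name} {sql_type} NOT NULL" for name, sql_type in cube["dimensions"].items()]
    cols += [f"{name} {sql_type}" for name, sql_type in cube["measures"].items()]
    cols.append(f"PRIMARY KEY ({', '.join(cube['dimensions'])})")
    return f"CREATE TABLE {cube['name']} ({', '.join(cols)})"


def add_column_sql(cube, name, sql_type):
    """Extend the cube description and derive the matching ALTER TABLE statement."""
    cube["measures"][name] = sql_type
    return f"ALTER TABLE {cube['name']} ADD COLUMN {name} {sql_type}"


# Names are interpolated directly, so this sketch is only safe for trusted input.
conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(milking_cube))
conn.execute(add_column_sql(milking_cube, "fat_percent", "REAL"))

columns = [row[1] for row in conn.execute("PRAGMA table_info(milking)")]
print(columns)
```

The point of the design is that evolving the model means uploading a new description, not editing and redeploying code.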

The sDWH provides a concise representation of all data related to a specific subject in subject-oriented views. The subject-oriented views are created by combining sDWH cubes with the same subject. For example, cubes containing milking data from different sources such as milking robots or official authorities are combined in a subject-oriented view about milk. Since the source cubes may differ in quality and frequency of the recorded data, subject-oriented views fill gaps by continuation. For example, official authorities track milk quality only once a month, whereas milking robots constantly analyse milk contents. Still, the subject-oriented view may contain data from both robots and authorities on a daily basis. The same milk quality data from authorities, however, are used continuously for the days between points of measurement, with an indication of the freshness of the data. Continuation avoids null values, which are hard to handle during analysis.

Recurring patterns of analysis can be formalised using semOLAP patterns. Database and business intelligence experts in collaboration with domain experts formulate semOLAP patterns using SQL and the semOLAP pattern definition language. Possible patterns are, for example, kinds of comparison such as the comparison of measures for different groups. Analysts may then instantiate semOLAP patterns by providing concrete values for wildcard elements within the pattern definition. For example, an analyst may compare milk output from two different groups of animals.

Using an RDF vocabulary for data cubes, the sDWH provides a machine-readable representation of its cubes which can be linked to existing ontologies in the agricultural domain, as semantic technologies and ontologies are used within the agricultural domain [3, 4]. Future work will also use semantic technologies to describe query output.

Realization

Figure 1 illustrates the architecture of the sDWH. Already preprocessed instance data and schema information about the multidimensional model are uploaded to the sDWH using a REST interface. The sDWH consists of a relational database which contains the instance data used for analysis. A triple store serves as a data dictionary that describes the multidimensional model underlying the relational data. The triple store also serves as a staging area for the uploaded instance data, which are represented using RDF. Database and business intelligence experts in collaboration with domain experts upload semOLAP pattern definitions into the sDWH, which stores the semOLAP patterns in the triple store. Analysts formulate pattern-based queries which are then executed on the relational database. The sDWH then returns semantically enriched query results in RDF format.

Figure 1: System Architecture of the Semantic Data Warehouse

Schema Definition and Data Loading

Since the AgriProKnow project is a work in progress, the multidimensional model of the sDWH is constantly evolving. Database experts should be able to modify the multidimensional model without direct access to the source code. Therefore, database experts define the sDWH schema using the RDF Data Cube Vocabulary (qb), a W3C recommendation [1], and its extension qb for OLAP (qb4o) [2]. The qbgen vocabulary provides further extensions of qb and qb4o specifically designed to realise the sDWH features required for the AgriProKnow project. Although designed for the AgriProKnow project, the qbgen vocabulary may be used for other sDWH projects with similar requirements. Schema definitions and instance data are loaded into the sDWH through a set of RESTful web services. All schema and instance data are defined using the previously mentioned RDF vocabularies and uploaded into the sDWH through the web services. Using the RESTful web services for schema definition, it is possible to create new cubes, dimensions or complex attributes, add attributes to a dimension, or delete cubes. As the schema is represented in the relational database, executing the web services for schema definition results in the creation, alteration or dropping of tables. The RDF schema data remain as metadata that describe the tables. The web services for loading instance data insert, replace or delete data in the relational database.

In the AgriProKnow project, data originate from various sensors and databases, e.g., ear tags, temperature sensors, milking robots, official authorities or dairies. Typically, the data in their raw format are unsuitable for analysis and must be preprocessed accordingly before being loaded into the sDWH.

Data Analysis

Subject-oriented views provide a more integrated view on a particular subject. Subject-oriented views are created by combination and completion of the loaded instance data. Each base cube in the sDWH addresses a specific subject as recorded by a particular data source. Subject-oriented views combine data from different cubes about the same subject across different data sources. Furthermore, data sources track data at different frequencies. For example, official authorities track milk quality only once a month, whereas milking robots constantly analyse milk contents. A subject-oriented view that combines data from these sources completes the missing authority data for days without measurement in order to avoid null values. Completion is done by continuation of the last available measurement. Continuation extends for a specified number of days. If the available measurement is too old, no continuation is applied and a null value persists. The schema also indicates the freshness of a continued measurement. Another feature of subject-oriented views is the addition of information. In the specific case of AgriProKnow, some subject-oriented views contain calving information. To this end, a subject-oriented view is combined with data from the calving cube, thus indicating the time since the last calving (DayOfLactation) and the number of the last calving. Calving information is often used in queries.
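The continuation rule described above can be sketched in a few lines. The function name, the tuple layout, and the sample values below are illustrative assumptions; they only mirror the stated rule that the last available measurement is carried forward for a specified number of days, that a null value persists beyond that, and that the freshness of a continued value is indicated.

```python
from datetime import date, timedelta


def continue_measurements(measurements, days, max_age_days):
    """Fill days without a measurement with the last available value.

    `days` must be in ascending order. Returns {day: (value, age_in_days)};
    both entries are None when no sufficiently fresh measurement exists.
    """
    result = {}
    last_day, last_value = None, None
    for day in days:
        if day in measurements:
            last_day, last_value = day, measurements[day]
        if last_day is not None and (day - last_day).days <= max_age_days:
            result[day] = (last_value, (day - last_day).days)
        else:
            result[day] = (None, None)
    return result


# Hypothetical monthly authority measurements of milk fat content in percent.
authority = {date(2016, 1, 1): 4.1, date(2016, 2, 1): 3.9}
days = [date(2016, 1, 1) + timedelta(days=i) for i in range(40)]
view = continue_measurements(authority, days, max_age_days=20)

print(view[date(2016, 1, 10)])  # continued value, 9 days old
print(view[date(2016, 1, 25)])  # too old, null persists
print(view[date(2016, 2, 3)])   # continued from the February measurement
```

Returning the age next to the value corresponds to the freshness indication, so that an analyst can still distinguish measured from continued values.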

The subject-oriented view definition is an SQL query that is used to create a materialised view. The properties of subject-oriented views are specified using a RESTful web service. An analysis view generator creates the subject-oriented views according to the configuration. The analysis view generator is implemented as PL/SQL stored procedures in the database.

The definition of semOLAP patterns facilitates querying of the analysis view. A semOLAP pattern describes the structure of a query for a specific purpose, e.g., a certain kind of comparison. A semOLAP pattern consists of several elements with well-defined semantics. A pattern expression defines the translation of a semOLAP pattern into a target language; in the case of AgriProKnow, the target language is SQL. Pattern expressions are specified in a domain-specific language. The pattern expression language is based on the target language, e.g., SQL. A pattern expression can be seen as an SQL query which includes wildcards. Wildcards correspond to the pattern elements. The pattern definition also contains definitions of the range of the pattern elements. Pattern elements can be reused in multiple patterns. Finally, a semOLAP pattern defines its output in terms of pattern elements.
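The idea of a pattern expression as an SQL query with wildcards can be sketched as follows. The abstract does not show the syntax of the semOLAP pattern definition language, so Python's `string.Template` is used here purely as a stand-in; the pattern, the table, and all names are hypothetical.

```python
import sqlite3
from string import Template

# Hypothetical pattern expression: compare the average of a measure between two groups.
# The $-prefixed names play the role of the wildcards (pattern elements).
GROUP_COMPARISON = Template(
    "SELECT $group_column AS grp, AVG($measure) AS avg_measure "
    "FROM $fact_table "
    "WHERE $group_column IN (:group_a, :group_b) "
    "GROUP BY $group_column ORDER BY $group_column"
)


def instantiate(pattern, **elements):
    """Bind pattern elements to concrete table and column names."""
    return pattern.substitute(**elements)


# Small in-memory table standing in for a fact table of the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE milking (animal_group TEXT, milk_yield_kg REAL)")
conn.executemany(
    "INSERT INTO milking VALUES (?, ?)",
    [("A", 30.0), ("A", 34.0), ("B", 25.0), ("B", 27.0), ("C", 40.0)],
)

sql = instantiate(
    GROUP_COMPARISON,
    fact_table="milking",
    group_column="animal_group",
    measure="milk_yield_kg",
)
rows = conn.execute(sql, {"group_a": "A", "group_b": "B"}).fetchall()
print(rows)  # [('A', 32.0), ('B', 26.0)]
```

Schema names are filled in when the pattern is instantiated, while the compared groups are passed as query parameters at execution time. `substitute` raises a `KeyError` if a wildcard is left unbound, which loosely mirrors the requirement that every pattern element receives a concrete value.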

Analysts may instantiate semOLAP patterns by providing concrete values for the pattern elements. Patterns can be instantiated on different levels of abstraction, namely the ROLAP and the qb level. If instantiated at the ROLAP level, analysts provide names of tables and columns representing facts, dimensions, levels and attributes of the multidimensional model. If instantiated at the qb level, analysts provide qb, qb4o and qbgen elements. Furthermore, analysts may instantiate pattern elements with calculated measures and predicates. Calculated measures and predicates are defined by domain and database experts in order to unambiguously express the calculation of key performance indicators and the semantics of business terms for reuse in multiple queries. Both predicates and calculated measures are based on existing schema elements. The definitions of predicates and calculated measures consist of the schema elements used and include an expression in the target language that is used for query generation. For ROLAP applications the expression must be in SQL and the schema elements are columns.

The definitions of semOLAP patterns, calculated measures and predicates as well as pattern instances are represented in RDF. A RESTful web service allows users to upload these definitions into the sDWH. The definitions are stored in the triple store of the sDWH. An uploaded pattern instance may then be executed multiple times using a separate web service.

A ROLAP pattern instance translates into an SQL query. Translation is done using the pattern expression. The translation of pattern instances at the qb level requires a mapping from qb, qb4o and qbgen elements to relational model elements. Execution of a ROLAP pattern instance results in a CSV file, whereas execution of a qb pattern instance results in a qb data structure. The advantage of qb pattern instances lies in the semantic description of the query result in terms of the multidimensional model, which facilitates the interpretation of the result.
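The additional mapping step for qb-level instances, and the CSV result of a ROLAP-level execution, might look as follows in a minimal sketch. The IRIs, the mapping table, and the function names are hypothetical; in the sDWH the mapping information would come from the schema metadata in the triple store rather than from a hard-coded dictionary.

```python
import csv
import io

# Hypothetical mapping from qb-level element IRIs to relational names.
QB_TO_RELATIONAL = {
    "ex:MilkingCube": "milking",
    "ex:animalGroup": "animal_group",
    "ex:milkYield": "milk_yield_kg",
}


def to_rolap(qb_instance, mapping):
    """Translate a qb-level pattern instance into a ROLAP-level one."""
    missing = [iri for iri in qb_instance.values() if iri not in mapping]
    if missing:
        raise KeyError(f"no relational mapping for: {missing}")
    return {element: mapping[iri] for element, iri in qb_instance.items()}


def rows_to_csv(header, rows):
    """Serialise a query result as CSV text, as for a ROLAP-level execution."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


qb_instance = {
    "fact_table": "ex:MilkingCube",
    "group_column": "ex:animalGroup",
    "measure": "ex:milkYield",
}
rolap_instance = to_rolap(qb_instance, QB_TO_RELATIONAL)
print(rolap_instance)
print(rows_to_csv(["grp", "avg_measure"], [("A", 32.0), ("B", 26.0)]))
```

A qb-level result would instead describe each returned value by its dimension and measure, which is what makes it easier to interpret than plain CSV columns.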

References

[1] Richard Cyganiak, Dave Reynolds, and Jeni Tennison. The RDF Data Cube Vocabulary: W3C Recommendation. Ed. by World Wide Web Consortium. 2014. url: https://www.w3.org/TR/vocab-datacube/.

[2] Lorena Etcheverry, Alejandro Vaisman, and Esteban Zimányi. "Modeling and Querying Data Warehouses on the Semantic Web Using QB4OLAP". In: Data Warehousing and Knowledge Discovery. Ed. by Ladjel Bellatreche and Mukesh K. Mohania. Vol. 8646. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2014, pp. 45–56. isbn: 978-3-319-10159-0. doi: 10.1007/978-3-319-10160-6_5.

[3] E. Jahanshiri and S. Walker. "Agricultural Knowledge-Based Systems at the Age of Semantic Technologies". In: International Journal of Knowledge Engineering-IACSIT 1.1 (2015), pp. 64–67. issn: 23826185. doi: 10.7763/ijke.2015.v1.11.

[4] Catherine Roussey et al. "Ontologies in agriculture". In: AgEng 2010, International Conference on Agricultural Engineering. 2010.

[5] Slobodanka Dana Kathrin Tomic et al. "agriOpenLink: Semantic Services for Adaptive Processes in Livestock Farming". In: International Conference of Agricultural Engineering. 2014.

This diploma thesis was submitted at Johannes Kepler University Linz, Department for Business Informatics - Data & Knowledge Engineering, supervised by Univ.-Prof. Dipl.-Ing. Dr. techn. Michael Schrefl.

Doing Good with Data: Development of a Maturity Model for Data Literacy in Non-Governmental Organizations

Helena Sternkopf

First place

While literacy describes the ability of a group of people to read and write, data literacy can be defined analogously as the ability to handle data in a critical and productive way, that is, to analyse data, to draw conclusions from data, to communicate these conclusions, and to critically evaluate the data-based conclusions of others. Data literacy is becoming increasingly important not only for companies and NGOs but also for every individual citizen. This raises the questions of which competence dimensions data literacy comprises and which competence levels exist for these dimensions. How can these data competencies be measured for individuals and organisations? To answer these research questions, the thesis develops a maturity model for data literacy with a focus on non-governmental organisations (NGOs). The thesis thereby responds to the growing academic interest in data competencies by describing the identified competencies in a maturity model for the first time. Starting from a first draft of the model developed by means of a structured literature review, the model was refined through Action Design Research over several iterations. As case studies, the thesis examines the "Datenschule", a training programme of the Open Knowledge Foundation Germany, and one further cooperating NGO. A careful qualitative content analysis of five semi-structured expert interviews during the development phase of the artefact and of four rounds of test feedback led to a data literacy model that defines eleven competence areas, each of which can be considered on four levels. The thesis describes the evolution of the model across the iteration phases and provides a maturity grid, which is complemented by an Excel tool for self-assessment. The results of the study demonstrate that a maturity model describing data competencies is a helpful tool for raising awareness of data skills while at the same time educating about the topic.

This master thesis was submitted at Hochschule für Wirtschaft und Recht Berlin, Chair of Corporate Management and Organisation (Lehrstuhl Unternehmensführung und Organisation), supervised by Prof. Dr. Roland Müller.

Auswirkungen von Big-Data-Technologien auf Data-Warehouse-Architekturen – Referenzmodell einer Lambda-Architektur
(Effects of Big Data Technologies on Data Warehouse Architectures: Reference Model of a Lambda Architecture)

Aline Venckus (ex-Krampl)

In the context of Big Data, the Lambda architecture is treated as the architectural approach par excellence. What place does this newer architecture have alongside a long-established concept such as the data warehouse? This master thesis pursues this question and offers an approach for merging the two architectures into a reference model.

The most important insight into the two architectures was gained while working out the reference model. The basic idea of the Lambda architecture has long existed in the data warehouse, in the form of the layer model with an operational data store extension. The three layers of the Lambda architecture for data processing and provisioning (batch, serving, and speed layer), however, use the newer Big Data technologies.

In addition, it was found that a BI system can (and in fact must) be extended in future with Big-Data-capable components and technologies in order to secure companies' competitiveness through deeper insight into their own operations and through better and faster analysis methods. The reference model of the thesis offers an approach for a technology mix that includes, among other things, novel database systems, storage techniques, integration methods, and processing mechanisms.
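The interplay of the three layers named above can be illustrated with a toy example. The sketch below follows the general description of the Lambda architecture, not the reference model of the thesis; the class, its method names, and the page-view scenario are hypothetical, and a real system would use distributed storage and processing components for each layer.

```python
from collections import Counter


class LambdaSketch:
    """Toy page-view counter with batch, speed, and serving responsibilities."""

    def __init__(self):
        self.master_data = []        # append-only raw events (input of the batch layer)
        self.batch_view = Counter()  # recomputed from all raw events by the batch layer
        self.speed_view = Counter()  # incremental view of events not yet in a batch view

    def ingest(self, key):
        """New events go to the master data and to the speed layer."""
        self.master_data.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master_data)
        self.speed_view.clear()

    def query(self, key):
        """Serving side: merge the batch view with the recent speed view."""
        return self.batch_view[key] + self.speed_view[key]


views = LambdaSketch()
for page in ["home", "home", "about"]:
    views.ingest(page)
before_batch = views.query("home")  # answered from the speed view only

views.run_batch()
views.ingest("home")
after_batch = views.query("home")   # batch view plus one recent event
print(before_batch, after_batch)
```

The split resembles the data warehouse layer model with an operational data store mentioned in the text: periodically loaded, consolidated data on the one hand and a low-latency path for recent data on the other.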

This master thesis was submitted at the University of Duisburg-Essen, Mercator School of Management, Chair of Business Informatics, in particular Business Intelligence, supervised by Prof. Dr. Peter Chamoni.

tdwi.eu

TDWI e.V., Lindlaustr. 2c, 53842 Troisdorf, Tel. +49 (0) 2241 2341 212, Fax +49 (0) 2241 2341 199, Email [email protected]
