
TECHNISCHE UNIVERSITÄT MÜNCHEN
FAKULTÄT FÜR INFORMATIK
Lehrstuhl für Datenbanksysteme

Exploratory Knowledge-Mining from Complex Data Contexts in Linear Time

Samuel Joseph Maurus

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of

Doktor der Naturwissenschaften (Dr. rer. nat.)

Chair: Univ.-Prof. Dr. Alfons Kemper

Examiners of the dissertation: 1. Univ.-Prof. Dr. Claudia Plant, Universität Wien, Austria

2. Univ.-Prof. Dr. Hans-Joachim Bungartz

The dissertation was submitted to the Technische Universität München on 07.11.2016 and accepted by the Fakultät für Informatik on 27.01.2017.


Abstract

Automated measurements are being taken in many areas of society. Typically, the scale is large and the structure complex. Even within a single application, data are often collected in various forms (e.g. graph structures, relational tables and multivariate time-series) and over all the fundamental scales of measurement. Despite the heterogeneity, such stores can hold profound insights about patterns found in the domain. This thesis is concerned with exploratory data mining – the development of unsupervised and automatic algorithms to extract this knowledge.

Research Approach: Our research is primarily driven by the “top challenges” identified by the data-mining community. These challenges highlight five aspects of practically-observed complexity on which we focus: 1) heterogeneous data types and measurement scales, 2) missing information, 3) clutter and noise, 4) high dimensionality, and 5) high-bandwidth time-series data. To help address these concerns, we present novel methods for practical data-mining tasks having these complexities. Secondly, driven by the pressing need to develop algorithms that scale efficiently with the size of the data, we present linear-time algorithms for our problems and discuss their properties. We empirically evaluate each against the state-of-the-art with respect to 1) synthetically-generated data, 2) real-world data, and 3) run-time behavior.

Results: We present problems, frameworks, algorithms and statistical tests to extract knowledge in various forms. We extract latent patterns in the complex context of incomplete heterogeneous measurements using our Ternary Matrix Factorization (TMF) problem and our Matrix Factorizations over Discrete Finite Sets (MDFS) framework. We extract clusters in the complex context of high-dimensional data with “extreme” clutter using our algorithm SkinnyDip. Finally, we extract anomalous system events from the complex context of high-bandwidth time-series data using our approach BenFound.

Contributions: From a research perspective, we show how dimensionality-reduction through matrix factorization can be performed over heterogeneous data types and measurement scales, and in doing so complete the theoretical unification of a set of related data-mining techniques. We show how an elegant “mode-hunting” approach can help to cluster data with high dimensionality and extreme levels of clutter. Finally, we show how anomaly-detection can be performed on high-bandwidth time-series data without the need for a parameterized model to describe the underlying process.

From a practical perspective, our contributions are fourfold. Firstly, all of our techniques can outperform the state-of-the-art with respect to standard quality metrics. Secondly, each is highly scalable, with aggressive optimization and empirically-demonstrated linear run-time growth in the size of the data. Thirdly, our techniques have no “obscure” parameters and thus contribute to the holy grail of “parameter-free” data mining. Finally, we present numerous examples on real-world data, and provide all prototypes and source code in online, publicly-accessible repositories for direct use. All results are reproducible.

Limitations: None of our proposed algorithms can guarantee an optimal solution in the general case (the underlying problems are NP-hard). Indeed, we are only able to provide an approximation factor for one sub-problem under special conditions. Our algorithms are therefore based on heuristics and their evaluation is empirical. A number of additional assumptions and practical limitations exist and are discussed.


Zusammenfassung

Automated measurements are being taken in almost all areas of society. Typically, the volume of data is large and the structure complex. Even within a single use case, data are often collected in various forms (e.g. as graph structures, relational tables, multivariate time series) and across all the fundamental scales of measurement. Despite their heterogeneity, such data sets can enable deep insights into the application domain. This dissertation is concerned with exploratory data mining, that is, the development of unsupervised, automatic algorithms for the extraction of this knowledge.

Research approach: Our research is primarily driven by the problems regarded in the data-mining community as its “greatest challenges”. These challenges direct attention to five aspects of practically-observed complexity on which we focus: 1) heterogeneous data types and measurement scales, 2) missing information, 3) clutter and noise, 4) high dimensionality, and 5) high-bandwidth time-series data. To ease the approach to these problems, we present new methods for the practical handling of data-mining tasks of this complexity. Driven by the pressing need for methods that scale efficiently with the size of the data, we additionally develop linear-time algorithms for our problem settings and discuss their properties. We compare the developed algorithms empirically against scientifically-established algorithms with respect to 1) synthetically-generated data, 2) real-world data, and 3) their run-time behavior.

Results: We present problem definitions, frameworks, algorithms and statistical tests for extracting knowledge of various kinds. We extract latent patterns in the complex context of incomplete heterogeneous measurements by means of our Ternary Matrix Factorization (TMF) problem and our framework of Matrix Factorization over Discrete Finite Sets (MDFS). We extract clusters in the complex context of high-dimensional data with “extreme” clutter by means of our SkinnyDip algorithm. Finally, we extract anomalous system events in the complex context of high-bandwidth time-series data by means of our BenFound approach.

Contributions: From a research perspective, we show how matrix factorization can be performed across heterogeneous data types and measurement scales. We thereby complete the theoretical unification of a set of related data-mining techniques. We show how an elegant “mode-hunting” approach can help in clustering data of high dimensionality with extreme levels of clutter. Finally, we show how anomaly detection can be performed on high-bandwidth time-series data without the use of a parameterized model describing the underlying process.

From a practical perspective, we show empirically that our techniques can outperform the state-of-the-art with respect to standard quality criteria. Each technique is moreover highly scalable and has no “obscure” parameters. Finally, we present numerous examples based on real-world data and make all prototypes and source code available in online, publicly-accessible repositories for direct use. Furthermore, all results are reproducible.

Limitations: None of our proposed algorithms constitutes an optimal solution procedure in the general case. Indeed, we can only provide an approximation factor for one sub-problem under special conditions. Our algorithms are therefore based on heuristics, and their evaluation is empirical. A number of further assumptions and practical limitations exist and are discussed.


Acknowledgments

To Professor Claudia Plant, my supervisor. I am indebted to you for your fairness, honesty, flexibility, optimism, motivation, experience and time. Your words were thoughtful and inspiring during our numerous fruitful discussions. It was a pleasure to be a member of your “research start-up” at Helmholtz Zentrum München, attend the premier conferences with you, and to visit and meet your new team for a short research stay in Vienna. You are a successful and relatively young Professor, and I highly respect the spirit and skillfulness you continue to show in the highly-competitive research field of data mining. To Professor Christian Böhm, likewise thank you for all the priceless discussions, manuscript reviews, suggestions, optimism and experienced judgement. I particularly wish to thank you for the passion with which you engaged and “jumped straight in” to many of the thesis ideas.

To PD Dr. Wolfgang zu Castell at Helmholtz Zentrum München. I enjoyed our discussions on a wide range of topics, particularly the challenging research questions that you were pondering in your own projects. I thank you for providing feedback on much of my work from a mathema(x ∈ {t, g})ician’s perspective. Finally, I appreciate your efforts and support at the organizational level of the stack, and particularly for later taking up the role of my Helmholtz thesis-committee supervisor. To Dr. Jens Baumert at Helmholtz: thank you also for being part of my thesis committee, and for allowing me the chance to work with you and your team on one of your diabetes research projects. To Helmholtz Zentrum München in general: thank you for creating a highly-attractive environment in which to complete a PhD (quiet, spacious, green and focused).

To the others with whom I enjoyed my time at Helmholtz over the years: Dr. David Endesfelder, Dr. Marion Engel, Nina Hubig, Annika Tonch, Alexandra Derntl, Sebastian Goebl, Wei Ye, Linfei Zhou, Ruben Seyfried, Kristof Schröder, Hannah Schrenk, Bernhard Tandler, Renate Frieß, Walter Huss, Jürgen Grabow, and numerous interns/students. You all played various positive roles in my development and I thank you. To Professor Alfons Kemper and Professor Hans-Joachim Bungartz at the TUM: many thanks for agreeing to take part in my examination committee. Finally, the last thanks on the “professional” side go to the peers who took the time to respond to my requests for an implementation of their algorithm(s): Pauli Miettinen, Jilles Vreeken, Radim Belohlavek, Jason Rennie, Greg Hamerly, Junming Shao and Kohei Hayashi.

To Stefanie Ringelhan: thank you for your unwavering care and support through the “ups and downs”. You gave me context, taught me to “look over everything first” and take my time. Thank you particularly for the assistance in grasping the German language, and for your particularly motivating comments while I was learning to ski. To Stefanie’s parents, Franz and Regine: you are über-generous and always so positive and welcoming. May the Canasta sessions continue for many years to come! To Mark and Verena: you guys are great friends, neighbors and supporters. To your children, Noah and Lea, for the laughs during our hour-long sessions of “Verstecken” and “Pferdespielen”!

To Chopin, Schubert and Mozart: your Nocturnes, Etudes, Impromptus and Sonatas showed me the way to an enchanting artistic escape at the end of each technical day. To my teachers Yamile and Mangfred, pianists studying at the Musikhochschule München: thank you so much for your motivation and guidance as I continue to try and tame these works. To Jon, Ryan, and Derek, my childhood friends from Melbourne: it’s great to still be in such close contact after all these years. May our adventures continue for many years to come! To the Geschwister Scholl Wohnheim: you offered me the ideal Munich student accommodation during the majority of my studies, thank you all so much!

To my amazing family in the Allgäu, particularly Ehrentraud and Albert Freudling. Your love and caring is genuine and infectious, and it’s always “wunderschön” when we meet! Finally, to my loving family in Australia: our strong bonds render the distances between us insignificant. It’s always special when we get together. To my mother Rosemary in particular: I continue to appreciate and be in awe of the extra energy you invested in raising my three sisters and me. Our triumphs are yours as well.


Publication Preface

The contributions of this thesis are based on the following five first-author papers:

A Maurus, S. and Plant, C., 2014, December. Ternary matrix factorization. In Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM ’14). IEEE.

Note: This contribution received the single Best Paper Award from 727 submissions. The overall acceptance rate for full research-track papers was 9.7%. ICDM is a top-tier data-mining conference.

B Maurus, S. and Plant, C., 2016, January. Ternary Matrix Factorization: problem definitions and algorithms. Knowledge and Information Systems (KAIS), 46(1). Springer.

Note: This contribution is an extension of the 2014 ICDM paper above. It includes at least 30% new material. KAIS is a leading data-mining journal (2014 impact factor 1.782).

C Maurus, S. and Plant, C., 2016, December. Factorizing Complex Discrete Data “with Finesse”. In Proceedings of the 2016 IEEE International Conference on Data Mining (ICDM ’16). IEEE.

Note: The acceptance rate for short research-track papers was 11.1% (from 910 submissions).

D Maurus, S. and Plant, C., 2016, August. Skinny-dip: Clustering in a Sea of Noise. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM.

Note: KDD is a top-tier data-mining conference. The 2016 acceptance rate for full research-track papers was 8.9% (784 submissions). A poster and video were showcased in addition to an oral presentation.

E Maurus, S. and Plant, C. Let’s See your Digits: Anomalous-State Detection using Benford’s Law. Submitted to the Research Track of the 2017 SIAM International Conference on Data Mining.

Note: SDM is a top-tier data-mining conference. At the time of writing, this contribution had been submitted for peer review (pending acceptance).


The following paper, to which the author made major contributions as second author, was additionally published during the course of the doctoral studies. This publication does not, however, form part of the work at hand.

Ye, W., Maurus, S., Hubig, N. and Plant, C., 2016, December. Generalized Independent Subspace Clustering. In Proceedings of the 2016 IEEE International Conference on Data Mining (ICDM ’16). IEEE.

Note: The acceptance rate for regular research-track papers was 8.5% (from 910 submissions).


Contents

Abstract

Zusammenfassung

Acknowledgments

Publication Preface

1 Introduction
  1.1 Clarifying the Buzzwords: Data Mining, Knowledge Discovery in Databases and Data Science
  1.2 The Role of Exploratory Data Mining in Society
  1.3 Current Challenges in Data Mining
    1.3.1 Challenge 1: Developing a Unified Theory of Data Mining
    1.3.2 Challenge 2: Scaling Up for High-Dimensional Data and High-Speed Data Streams
    1.3.3 Challenge 3: Mining Time-Series Data
    1.3.4 Challenge 4: Mining Complex Knowledge from Complex Data
  1.4 Goals of this Thesis
  1.5 Remarks on the Document Structure

2 Preliminaries
  2.1 The Fundamental Scales of Measurement
  2.2 Blind Source Separation, Latent Patterns, and Finite Mixtures
    2.2.1 Independent Component Analysis
    2.2.2 Principal Component Analysis (via the Singular Value Decomposition)
    2.2.3 Non-negative Matrix Factorization
  2.3 Cluster Analysis
    2.3.1 Partition-Based Clustering
    2.3.2 Density- and Spectral-Based Clustering

3 Literature Review
  3.1 Matrix Factorizations over Discrete, Finite Sets
  3.2 Boolean Matrix Factorization
    3.2.1 Combinatorics Problems Related to Boolean Matrix Factorization
    3.2.2 Missing-Value Boolean Matrix Factorization
  3.3 Ordinal Matrix Factorization
  3.4 Clustering in the Context of Noise/Clutter
    3.4.1 Peer Article: Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering
    3.4.2 Peer Article: Efficient Algorithms for Non-Parametric Clustering with Clutter
  3.5 Anomaly and Change-Point Detection in Time-Series Data
    3.5.1 Numerical Techniques from Data-Mining and Signal-Processing
    3.5.2 Specialized Techniques for Social Media

4 Research Approach
  4.1 Literature Reviews
  4.2 Idea Synthesis, Problem Definitions and Algorithm Design
  4.3 Software Prototypes
  4.4 Data Sources
  4.5 Comparison Techniques
  4.6 Experiments
    4.6.1 Evaluation Metrics
    4.6.2 Controlled Experiments
    4.6.3 Real-World Experiments
    4.6.4 Run-Time (Scalability) Experiments
  4.7 Reproducibility

5 Results and Discussion
  5.1 Summary of Findings
    5.1.1 Reflecting on our Challenges and Goals
    5.1.2 Elaboration on the Specific Results from Papers A to E
  5.2 Implications for Research
  5.3 Implications for Practice
  5.4 Limitations
    5.4.1 Non-Optimality
    5.4.2 Limited Approximability Results
    5.4.3 Limited Empirical Evaluation
    5.4.4 Algorithm Parameters
    5.4.5 Non-Determinism
    5.4.6 Performance Trade-Offs
    5.4.7 Side-Effects of our Complexity-Reducing Assumptions
    5.4.8 The Multiple Comparisons Problem
  5.5 Future Research

6 Conclusion

Bibliography

List of Figures

List of Tables


1 Introduction

1.1 Clarifying the Buzzwords: Data Mining, Knowledge Discovery in Databases and Data Science

The terms “data mining”, “knowledge discovery in databases” (KDD) and “data science” are often used loosely, so it is useful to begin with some clarifications.

We follow the definitions as given in [FPS96]. Specifically, we understand “data mining”¹ as being the application of specific algorithms to prepared data for the purpose of either prediction or description. Prediction involves finding patterns that can assist in forecasting the behavior of a phenomenon (or some entities). Description involves finding useful explanatory patterns that can be presented to a user in a digestible, understandable form. Of course, the boundaries between these types need not be sharp. For example, a predictive data-mining algorithm that yields a decision-tree with readable branching rules is usually more descriptive and interpretable than, say, a feed-forward artificial neural network containing a cryptic set of numeric weights.

In this thesis we are primarily interested in descriptive data mining, that is, the extraction of interpretable and digestible patterns in a given data set. More specifically, this thesis focuses on learning these patterns in an unsupervised or exploratory way. We assume no a priori “labeled” data. This implies no training phase, in which a system might “learn” about the domain from a representative set of such labeled instances. The knowledge-discovery processes on which we focus are also not hypothesis driven, as is often the case in classical statistical analysis. In short, we focus on variants of the fundamental unsupervised and exploratory data-mining problems: finding associations, clustering objects and detecting anomalies.

The term “knowledge discovery in databases” refers to the high-level workflow that transforms raw data into proven domain insights. Data mining is a single step in this workflow. The other steps include selection (of a data subset for analysis), preprocessing, transformation and interpretation/evaluation. The broad workflow is depicted in Figure 1.1. Again, it is important to realize that “data mining” is one of many steps in the KDD process. As the primary contributions of this thesis are novel data-mining methods, we will usually only treat the other tasks of the KDD workflow to the extent necessary for demonstrating our ideas.

¹ We use the terms “data mining” and “knowledge mining” interchangeably.


[Figure 1.1: The Knowledge Discovery in Databases workflow, inspired by Figure 1 in [FPS96]: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge.]


Finally, the term “data science” is typically understood to be more abstract still. It is often defined as the field concerned with processes and systems which extract knowledge or insights from data in various forms. Many techniques from machine learning and classical fields such as statistics, mathematics and signal-processing fit this definition. Indeed, the term “data science” is often argued to be a “buzzword” for statistics. The interested reader can find a detailed discussion of the relationship between “data science” and “statistics” in [Dha13].

In summary, this thesis makes methodological contributions to the field of data mining, so to avoid confusion we will mostly refrain from using the terms “data science” and “knowledge discovery in databases”.


1.2 The Role of Exploratory Data Mining in Society

Data mining techniques play an increasingly important role in society [KB11; Kri+07; BY09]. In (e-)commerce, data mining is used as a basis for many purposes, including the generation of cross-selling recommendations [AIS93] and for summarizing customer sentiments based on large numbers of reviews [HL04]. In healthcare, applications range from the detection of health-insurance claims fraud to the evaluation of treatment effectiveness [K+11]. In science and engineering, data mining finds applications in bioinformatics [Wan+05], genetics [Kan+02], medicine [CM02], education [SVM06] and electrical power engineering [McG+02]. In online social networks, data mining is used for many tasks including link-prediction [LK07], community detection [TL10], fraud detection [YWB11] and spam detection [Ben+10].

The contributions of this thesis are application-agnostic, but we do note a number of common properties from the list just mentioned. In many of these applications, the Four V’s of big data [Buh+13; SS13] are evident. Volume refers to the scale of the data, now larger than terabytes and petabytes, which outstrips traditional storage and analysis techniques. Velocity refers to the bandwidth at which data is being streamed (e.g. 1TB of trade information is collected during each trading session on the New York Stock Exchange). Variety refers to the different forms of data, including measurements that are made over fundamentally different scales. Finally, veracity refers to the uncertainty and poor quality of data. Humans cannot be expected to manually analyze big data. “Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload” [FPS96, p. 38]. In this thesis, the scalability of our proposed methods to such big data applications in society is a key consideration.

1.3 Current Challenges in Data Mining

As illustrated by the DBLP² metrics in Figure 1.2, many fundamental problems in the field of data mining remain in an active state of research.

Take cluster analysis, for example. One commonly-found definition states that Clustering is the task of partitioning a set of objects such that objects in the same group are more “similar” to each other than those in other groups. With such an innocent-looking definition, why does this task continue to attract such a growing number of academic publications? That is, why is it so challenging? Why haven’t we solved it yet?

One answer is that it is difficult to find a more precise definition of the clustering problem that is universally valid. The notion and measurement of “similarity”, for example, may vary by application, as may the mechanism for evaluating a candidate solution.

² dblp.uni-trier.de, a computer-science bibliography.


[Figure 1.2: Counts of publications over time that include “clustering”, “anomaly” or “frequent itemset” in the title respectively (source: DBLP).]

Other questions soon become evident: What about the notion of “noise” and “outliers”? Must the partitioning be strict? How do we select the number of clusters?

To arrive at a well-defined problem for which an algorithm can be designed and deployed in practical situations, we must make assumptions that help us to answer such questions. The validity of those assumptions in turn depends on the nature of the application in question. Different arguments can be made for favoring different assumptions, thus increasing the possibilities for developing more specialized clustering techniques.

Other data mining problems are analogously difficult (anomaly detection, association-rule mining, graph mining). This leads us to perhaps the top challenge in modern data mining, namely the development of a unified theory of data mining. In the following subsection we discuss this challenge in more detail. In the subsequent subsections we list a further three “top” challenges on which we focus. Taken together, these four challenges correspond to the top four open challenges for data mining as enumerated by Yang and Wu [YW06].

1.3.1 Challenge 1: Developing a Unified Theory of Data Mining

This challenge was identified as “numero uno” in Yang and Wu’s highly-cited work on data-mining challenges [YW06]. Specifically, they note that the current state of data-mining research is often criticized as being too “ad-hoc”. Indeed, numerous techniques have been designed for individual problems. A theoretical framework that unifies the various data-mining tasks could therefore be considered as a “holy grail” of research in this field.


Needless to say, this is an ambitious goal that may prove very difficult, if not impossible, to attain in practice. Regardless, we can remain optimistic and be inspired by the various elements of what such a direction would entail. For example, we can take a step closer to this goal if we keep in mind the motto: induce, deduce and reduce. This is the motto particularly embraced in Papers A, B and C of this thesis. Specifically, we should:

  • Induce general frameworks for data-mining problems based on similarities found in specialized techniques. Ideally, all possible instances of the specialized problems would be provably reducible to an instance of the framework problem. If the algorithms for the new framework additionally outperformed their specialized counterparts with respect to both efficiency and effectiveness, then researchers and practitioners alike would welcome the retirement of a further set of redundant tools from the overwhelming population of data-mining algorithms. Additionally, researchers would be inspired to deduce additional useful applications from the framework. Problems from these applications could then be solved by the same algorithm.

  • Reduce the number of assumptions made by state-of-the-art data-mining approaches. It would be naïve to think that we can design a useful data-mining algorithm that is void of assumptions. However, we should take the initiative to review the need for some of the most fundamental assumptions made in data mining, and challenge ourselves to relax them when appropriate. For example, for a given relational data set, many matrix-factorization techniques make the implicit assumption that all features are measured over the same scale (typically the ratio scale). In the area of non-hierarchical, vector-based cluster analysis, all techniques known to this author assume a particular multivariate “distance” or “similarity” measure on the space (e.g. Euclidean in k-means, or the Gaussian kernel in spectral clustering). We should exercise our scientific curiosity by questioning these basic assumptions.

  • Reduce the need for obscure parameters and excessive “tuning”. Algorithm parameters can be a gift and a curse. When a parameter relates to a concept that a practitioner can readily comprehend (e.g. the number of desired clusters k), its variation can yield a set of potentially-useful results on the same data. At the other extreme, it can be a daunting task to set a parameter that is only used internally, has no obvious relationship to the result and is infinitely variable (we will see that the τ parameter required by the Asso algorithm for Boolean Matrix Factorization is such a parameter). We should take care to design algorithms that reduce the requirement for parameters in this latter category.


1.3.2 Challenge 2: Scaling Up for High-Dimensional Data and High-Speed Data Streams

Challenge number two on Yang and Wu’s list is first and foremost related to the curse of dimensionality [ZSK12; IM98; BC57; KKZ09]. We understand this term as referring to a number of difficulties encountered when analyzing data in high-dimensional spaces. One difficulty could be termed the “distance concentration effect”, and refers to the fact that, as the volume of the space increases with increasing dimensionality (making the available data more sparse), there tends to be little difference in the “similarity” between different pairs of objects [KKZ09]. Coupled with the fact that the presence of irrelevant features may conceal relevant information, the effectiveness of many data-mining algorithms fails to scale to high-dimensional data.
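The distance-concentration effect is easy to observe empirically. The following minimal sketch (our illustration, not taken from the thesis or its papers; it assumes Python with NumPy and SciPy) draws uniform points in increasingly high-dimensional unit cubes and reports the relative contrast between the farthest and nearest pair of points, which shrinks as the dimensionality d grows:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))              # 500 points uniform in [0, 1]^d
        dist = pdist(X)                       # all pairwise Euclidean distances
        contrast = (dist.max() - dist.min()) / dist.min()
        print(f"d={d:5d}  relative contrast={contrast:10.2f}")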

In addition to appropriately filtering attributes (e.g. those with zero entropy) in the preparation phase of the KDD workflow, data-mining algorithms should consider mechanisms through which the curse of dimensionality can be tamed. In clustering, for example, it is seldom useful to apply a full-space clustering method to a data set with a moderate-to-large number of dimensions (e.g. 10 or above) [KKZ09]. An algorithm designed to search for the most useful parts of the data, generally in the form of a low-dimensional subspace, can yield better results. Paper D makes contributions in this direction.

Yang and Wu also use the words “high-speed” in their naming of this challenge, which implies an additional need to focus on efficiency. It is an unfortunate truth that concrete formulations of many data mining problems imply that finding their optimal solution is not computationally tractable. Even k-means, which takes a rather simplified view of the abstract clustering problem, is provably NP-Hard [SI84]. Assuming that computer science will not be blessed any time soon with a favorable result akin to P = NP, we must resort to heuristics in order to enable the analysis of any data having non-trivial size.

Of course, the use of a heuristic alone does not automatically classify an algorithm as “scalable”. Many heuristic-based data mining techniques have super-linear time complexity in the size of the data set. For example, we will see that a number of non-parametric anomaly- and change-point-detection algorithms have quadratic time complexity in the number n of data objects. For high-speed data streams like Twitter, where over 6000 new objects (“Tweets”) arrive every second (see Paper E), super-linear time complexity may well be prohibitive.

To help address Challenge 2, we will subscribe to some further “guiding principles” when designing our data-mining algorithms:

  • Consider the curse of dimensionality, and work to include mechanisms for mitigating its effects.


  • Strive for algorithms that have a practically linear run-time complexity in the size of the data.

• Use algorithmic paradigms that lend themselves well to parallelization on high-performance computing infrastructure.

1.3.3 Challenge 3: Mining Time-Series Data

Temporal data with trends, seasonality and noise are commonplace in modern information systems [LAF15]. Particularly in the domains of intrusion detection, credit-card fraud, medical diagnoses and law enforcement, there is often a need to raise a “red flag” when the real-time data deviates from the expected distribution or patterns. The detection of the related change-points, anomalies and “events” in time-series data is an important element for modern information systems with large volumes of traffic and strict “uptime” requirements. At Yahoo!, for example, it is critical for the integrity of the business to perform real-time monitoring of millions of production-system metrics [LAF15].

Along with Challenge 2, Challenge 3 on Yang and Wu’s list relates to the increasing requirement for monitoring live data for patterns in a real-time manner. For applications like Yahoo!, it is not sufficient to periodically schedule offline processing alone.

Interestingly, many time-series mining techniques consider measures relating to the distribution of the absolute values of the metrics in question. In tune with our reduce motto from Section 1.3.1, it can sometimes be useful to question basic assumptions like these. More precisely, we ask: Is there a lens through which we can view time-series data that focuses specifically on what is “natural” and “unnatural”? This curious question will be investigated in Paper E.
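As a foretaste of the lens used in Paper E (which builds on Benford’s law), the following sketch is our own illustration, not the BenFound test itself; it assumes Python with NumPy and SciPy, and compares the leading-digit distribution of a window of positive measurements against the Benford expectation P(d) = log₁₀(1 + 1/d):

    import numpy as np
    from scipy.stats import chisquare

    def leading_digits(x):
        """Most-significant decimal digit of each strictly-positive value."""
        x = np.asarray(x, dtype=float)
        x = x[x > 0]
        return (x / 10 ** np.floor(np.log10(x))).astype(int)

    benford = np.log10(1 + 1 / np.arange(1, 10))   # P(1), ..., P(9)

    # A lognormal sample as a stand-in for one window of a metric stream.
    window = np.random.default_rng(0).lognormal(0.0, 2.0, 5000)
    observed = np.bincount(leading_digits(window), minlength=10)[1:]
    stat, p = chisquare(observed, f_exp=benford * observed.sum())
    print(f"chi-square={stat:.1f}, p={p:.3f}")     # a tiny p flags an "unnatural" window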

1.3.4 Challenge 4: Mining Complex Knowledge from Complex Data

Yang and Wu’s fourth problem relates to handling “complex” data. They note that graphs are one form of complex data that has become especially prevalent in social networks, and that more research needs to be performed in this direction. Paper C considers a certain kind of graph structure in this light.

Another form of complexity is found in relational data that has heterogeneous features measured over fundamentally different scales. This heterogeneity complicates matters because it rules out the application of techniques that assume completely real-valued measurements, or completely categorical measurements, for example. More research is needed into methods that can mine patterns from data sets containing measurements from a variety of the different fundamental scales (nominal, ordinal, interval, ratio). This is the topic of Papers A, B and C.


A further form of complexity relates to the amount by which the signal in the data is hidden by noise. The real world is noisy to varying extents. Noise is usually unstructured and contributes little to understanding the patterns or associations between the variables and phenomena in a domain. In the context of noise, data-mining techniques should remain robust. For example, a clustering technique that assigns cluster membership to large volumes of clutter or noise points is not particularly useful. Complexity in the form of clutter and noise is thus a key consideration in our work (particularly Paper D).

Finally, Yang and Wu note that we must make sure that we pay attention to the “interestingness” and interpretability of the patterns that we mine. In line with the definition of descriptive data mining (Section 1.1), a user must be able to comprehend the meaning of the discovered patterns in the context of their domain. This includes making sure that the set of patterns presented to the end-user has a digestible cardinality (e.g. “top 10”), and that each pattern contains information that can be directly translated to real-world domain concepts. If a data-mining algorithm learns a representation of a system, but that representation is in turn difficult to comprehend, then the algorithm’s usefulness is limited. Delivering interpretable patterns is thus a key consideration for our work (particularly in Papers A, B and C).

1.4 Goals of this Thesis

To summarize the challenges from the previous section, this thesis focuses on making data-mining contributions that address a number of aspects of complexity. Each helps to advance the state-of-the-art in data mining. Specifically, we aim to develop problems, frameworks, algorithms and statistical tests that

1. can subsume a number of existing approaches without the requirement for additional obscure parameters (Challenge 1),

2. can support input data with high dimensionality whilst scaling linearly in time (Challenge 2),

3. can be deployed in real-time for high-bandwidth time-series data (Challenge 3),

4. can yield interpretable results on heterogeneous data sets containing measurements from a number of scales and high levels of noise (Challenge 4), and

5. are demonstrably superior to the state-of-the-art in controlled and real-world settings.


1.5 Remarks on the Document Structure

This is a publication-based dissertation. Each individual publication is embedded as an appendix, accompanied by a short introduction concerning the topic, publication outlet, acceptance status, re-use license and author contributions.

The remainder of this dissertation is organized as follows. Chapter 2 reviews a number of the basic technical results and problems on which this work builds. This includes a brief review of the fundamental scales of measurement, a review of the basic algorithms for Blind-Source Separation and a review of the main paradigms for cluster analysis. Chapter 3 presents a review of the literature that is directly relevant to the techniques proposed in this thesis, including Boolean and Ordinal Matrix Factorization, robust clustering algorithms for high-clutter data contexts, and anomaly- and change-point-detection algorithms. Chapter 4 discusses our research approach. Chapter 5 states and discusses the results of our work and highlights their impact on research and industry. Limitations of our work, as well as directions for future research, are likewise covered in Chapter 5. Chapter 6 gives concluding remarks.


2 Preliminaries

2.1 The Fundamental Scales of Measurement

In this thesis, a number of contributions relate to knowledge-discovery from data that has been collected over heterogeneous scales of measurement. For this reason, and despite the risk of triviality, a short review of the fundamentals of these scales is warranted.

S. S. Stevens notes in his seminal 1946 article that “measurement exists in a variety of forms” and that “scales of measurement fall into certain definite classes” [Ste46, p. 677]. His proposed typology has since become the most widely-adopted classification for levels of measurement in the natural sciences. Importantly, the different scales exhibit different properties and permit different mathematical operations. The following subsections briefly review the necessary preliminaries for each type of measurement scale.

Ratio Scales

Physical quantities are often measured on a ratio scale. For example, measurements of electric charge, rates of flow, distance, temperature (in Kelvin) and mass are all made on the ratio scale. What do they all have in common?

Firstly, each scale has a meaningful zero point. A mass of zero corresponds to the absence of matter. A distance of zero corresponds to the separation of a location in space from itself. A temperature of zero (Kelvin) corresponds to the minimum possible thermal motion of atoms and molecules, and so on. This meaningful zero point is valuable because it allows us to make sensible statements about the ratio between two measurements on that scale. For example, it is reasonable to state that “20K is twice as hot as 10K”, or “200m is twice as long as 100m”.

We can also reasonably make statements about the “difference” between two measurements made on a ratio scale. “The difference between a mass of 10g and a mass of 6g is 4g” is a sensible statement, for example.

Said differently, the ratio scale is the scale with the greatest amount of “metadata”. Of the four fundamental scales of measurement, it permits the greatest number of mathematical operations that can be sensibly applied. These operations include numerical addition and multiplication.


Interval Scales

Many scales used in daily life are measured on an interval scale. We find two examples in temperature (Celsius) and time (measured after the AD 0 epoch). Compared to the examples from the previous section (ratio scale), we recognize that the zero point on these scales is arbitrary. That is, it is arbitrary to select water’s freezing point at atmospheric pressure as the zero point for temperature, and arbitrary to select the nativity date as the zero point for time (indeed, many civilizations have done otherwise).

For this reason, it makes little sense to talk about ratios between two interval-scale measurements. It is not useful to assert that “40 degrees Celsius is twice as hot as 20 degrees Celsius”, nor that “the year 2016 is half as old as the year 4032”. We should therefore refrain from using numerical multiplication on such measurements.
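A quick worked check (ours, using the standard Kelvin conversion) makes the fallacy concrete: 40 °C = 313.15 K and 20 °C = 293.15 K, so the true physical ratio is 313.15 / 293.15 ≈ 1.07, far from the factor of 2 suggested by the Celsius labels.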

We can, however, still talk about degrees of difference between values measured on interval scales. It is sensible to say that the difference in temperature between 40 and 30 degrees Celsius is the same as the difference in temperature between 20 and 10 degrees Celsius.

Ordinal Scales

Yet more restrictive is an ordinal scale, perhaps most often used for measurements made in survey research. In this thesis, for example, we work in part with data from survey research questionnaires that are designed for measuring opinions (Paper C). Items in such surveys involve a statement and a dichotomy (e.g. “Disagree” or “Agree”), and the participant is tasked to select a value¹ from a scale imposed over that dichotomy (commonly called Likert items). A typical five-level Likert item might be 1) Strongly disagree, 2) Disagree, 3) Neither agree nor disagree, 4) Agree, and 5) Strongly agree.

In contrast to interval scales, we cannot sensibly talk about precise degrees of difference on an ordinal scale. That is, it makes little sense to propose that the difference between “Strongly agree” and “Agree” is exactly the same as the difference between “Agree” and “Neither agree nor disagree”. Therefore, even if these scales have numerical labels, we should refrain from using numerical operations such as addition and multiplication. Ordinal scales do, however, impose a rank ordering of their elements.

Nominal (Categorical) Scales

The label for this scale derives from the Latin root nom, meaning “name”. A scale is hence termed “nominal” if it involves differentiating between items based only on a system of qualitative classification or categorization.

¹ For simplicity here we ignore the case where the participant is permitted to leave the answer blank, or select the “Don’t know” option.


Measuring the religion of a human being is an example of a measurement made over a nominal scale. Such a scale would likely include the set of labels {Christianity, Islam, Judaism}.

It makes little sense to “mix” or “add” two religions, compute a quantitative “difference” between them, or impose an objective “ordering”. For these reasons, neither addition nor multiplication may be performed on nominal measurements. Assigning a numerical value to each nominal category for purposes of identification should be done with caution; it can lead to a false belief that the measurements may be open to the same interpretation as given to one of the more powerful scales from the previous sections. In general, the only permissible statement regarding the relationship between two measurements made on a nominal scale is that of equality.

A curious observation, and one that is particularly relevant for this thesis, is that the values from two- and three-valued logic (Boolean and Ternary logic) are, by this definition, measured over a nominal scale (or a trivial dichotomous ordinal scale). The Boolean values of false and true are often given alternative labels like no and yes, failure and success, and so on. Perhaps the most common labels, however, are 0 and 1. This preference is most likely due to the benefits of compactness (a single character in each case), however we will see in this thesis that this representation can, in the context of data-mining, have disadvantages (Section 3.1).
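To tie the four scales together, the following sketch (our illustration, assuming Python with NumPy; the data are invented) shows a defensible summary statistic for each level of measurement: the mode for nominal data, the median for ordinal data, and the mean for interval/ratio data:

    import numpy as np

    # Nominal: only equality tests are meaningful, so summarize with the mode.
    religions = np.array(["Christianity", "Islam", "Judaism", "Islam"])
    labels, counts = np.unique(religions, return_counts=True)
    print("modal category:", labels[np.argmax(counts)])

    # Ordinal: rank order is meaningful, so the median is defensible;
    # the mean would wrongly assume equal spacing between Likert levels.
    likert = np.array([1, 2, 2, 4, 5])     # 1=Strongly disagree ... 5=Strongly agree
    print("median response:", np.median(likert))

    # Interval/ratio: differences (and, for ratio scales, ratios) are
    # meaningful, so sums and means are permitted.
    kelvin = np.array([280.0, 290.0, 300.0])
    print("mean temperature:", kelvin.mean(), "K")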

2.2 Blind Source Separation, Latent Patterns, and Finite Mixtures

Blind Source Separation (BSS) is the abstract task of extracting a set of source signals from a set of mixed observations [YHX14]. With this definition, a number of concrete techniques fall under the BSS banner. We explicitly note that we do not use the BSS term as a synonym for Independent Component Analysis (as is done in some technical communities). For our purposes, “sources” can be understood as the “patterns” we wish to find, and the common property held by each BSS technique is that observations are formed through the mixture of sources (patterns). In the definition of BSS as we consider it, no information is given about the form of the source signals or the mixing mechanism, so the problem is highly underdetermined.

Papers A, B and C of this thesis focus on the problem of BSS for non-ratio-scale data, so it is useful to briefly review the fundamental concepts and commonly-used BSS techniques for data that has been measured on the ratio-scale.


2.2.1 Independent Component Analysis

Various techniques exist to solve special cases of BSS. Each makes different assumptions regarding the nature of the source signals. The “Cocktail Party effect” is a frequently-used example for motivating the study of BSS in signal processing [Bro00]. It refers to the well-known phenomenon of being able to target one’s attention to a single auditory stimulus at a party (e.g. the monologue of the partner), despite the presence of numerous other significant auditory stimuli (music, speeches, other conversations). The sources at the party are the various stimuli in the room, including musical instruments and human speakers. A given observer (e.g. a microphone, or a human ear) measures sound resulting from the weighted numerical sum (mixture, or “superposition”) of these signals, where the weighting is influenced by factors such as each source’s intensity and the distances of the sources from the observer. Importantly, computing a weighted sum implies “weighting” and “summation”, which can only be done sensibly if the corresponding operators (“multiplication” and “addition”) exist for the associated level of measurement. In the Cocktail Party example, sound intensity is a physical quantity measured on the ratio scale, so a finite linear mixture of audio signals is a sensible concept.

With the Cocktail Party application in mind, Figure 2.1 shows a simple application of Independent Component Analysis (ICA), which assumes that the source signals are statistically independent from each other and non-Gaussian. The concrete mechanism by which statistical independence is measured can vary, which leads to different forms of ICA. Often, an information-theoretical approach is taken which seeks to minimize the mutual information between the sources. Other approaches are also possible, such as maximum likelihood estimation, maximization of non-Gaussianity, and tensorial methods [HKO04].

ICA has numerous other applications. The human brain often emits signals from different regions, which are typically understood to have been “mixed” in a linear way when measuring brain activity using sensors attached to the outside of the head. It is often medically sensible to assume that the sources behave independently, so an ICA can help to uncover the signals emitted by the different brain regions. In econometrics, performing an ICA on parallel time series can help to decompose them into independent components in order to gain an insight into the driving mechanisms of the data. For the task of feature-extraction in image processing, ICA can help to find features that are as independent as possible [HKO04].


[Figure: three panels of time series over samples 0–1000, titled "Original Signals", "Mixed Signals" and "ICA Source Estimates".]

Figure 2.1: An example of Blind-Source Separation using Independent Component Analysis (using the R package fastICA [MHR10]). Note the ambiguities: ICA is generally not able to reconstruct the amplitude and the sign of the original signals.

2.2.2 Principal Component Analysis (via the Singular Value Decomposition)

In Figure 2.2 we see an example of Principal Component Analysis (PCA), which can be understood as another kind of BSS technique (based on our definition). In this case (spatial data) a PCA helps us to identify the directions in the data that are most responsible for the variance. Usually, taking just a small subset of these directions gives us the "primary sources" for explaining much of the variability in the data. For visualization tasks, projecting the original data onto these directions can give a digestible (in terms of being low-dimensional and comprehensible by a human) and highly-informative (in terms of being able to visualize the "spread" or variance) view.
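As a small sketch of this idea (ours; the function name and the synthetic data are assumptions), such a projection can be computed directly from the Singular Value Decomposition of the centered data matrix:

```python
import numpy as np

def pca_project(D, k):
    """Project the rows of D onto the k directions of maximum variance."""
    Dc = D - D.mean(axis=0)              # center each feature
    # The rows of Vt are the principal directions, ordered by the
    # amount of variance they explain (the singular values s).
    U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
    return Dc @ Vt[:k].T                 # (n, k) low-dimensional view

# Example: reduce 100 correlated points in 5-D to a 2-D view.
rng = np.random.default_rng(1)
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Y = pca_project(D, 2)
```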


[Figure: 2D scatter plot of spatial data.]

Figure 2.2: PCA finds orthogonal directions in the data, ordered such that each successive direction explains the maximum possible remaining variance.

2.2.3 Non-negative Matrix Factorization

The title of the seminal article on Non-negative Matrix Factorization (NMF) is "Learning the parts of objects by non-negative matrix factorization" [LS99]. Indeed, NMF is a technique that aims to find a parts-based, not holistic, representation of objects in a data set. Given a data matrix D ∈ R^{n×m} where each of the n rows represents an object with m features, the form of the decomposition is:

D ≈ H ·W. (2.1)

The matrix W ∈ R^{k×m} is often named the basis matrix because it contains the k most fundamental "parts". The parts are mixed together in various ways to form each observed object in D. The recipe according to which the parts are combined is prescribed by the corresponding row of the "encoding" or "usage" matrix H ∈ R^{n×k}. The mixing mechanism is linear and respects the classical matrix product:

d_ij ≈ Σ_{a=1..k} h_ia · w_aj.    (2.2)

Performing Blind-Source Separation on faces data is often used to intuitively explain the difference between PCA and NMF. Consider the top 25 basis vectors found by two decompositions of the CBCL faces data [HPP00] in Figure 2.3.


Figure 2.3: Top 25 basis vectors found by performing PCA (left) and NMF (right) on the CBCL faces data [HPP00].

The basis vectors on the left of Figure 2.3 are found by PCA; those on the right by NMF. This result is a neat visual aid for comprehending the objective of NMF. That is, the NMF basis vectors are more in line with a parts-based representation of faces. For example, we can see NMF basis vectors that focus only on the eyes (row 1, column 4), the nose (row 2, column 1), the chin (row 1, column 3) and the cheeks (row 2, column 2). The PCA "eigenfaces" are more holistic in comparison.

Why do approaches like Principal Component Analysis not enable such a parts-based representation? The answer lies in the constraints that NMF enforces on the matrices H and W. As the name suggests, neither H nor W is allowed to contain negative entries. This implies that only additive sources and mixtures are allowed. Representations learned by approaches like PCA generally involve cancellations between positive and negative numbers. This complex mechanism lacks an intuitive meaning in such a "faces" example. By forbidding subtractions, NMF is compatible with the intuitive notion of "combining parts to form a whole".
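A minimal sketch of such a decomposition (ours, assuming scikit-learn's NMF implementation; note that scikit-learn's fit_transform returns what Equation (2.1) calls the usage matrix H, and its components_ attribute corresponds to the basis matrix W):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)

# Non-negative data built from k = 3 additive "parts" (ground truth).
usage = rng.random((50, 3))     # H: how much of each part every object uses
parts = rng.random((3, 20))     # W: the k fundamental parts themselves
D = usage @ parts               # D ≈ H · W, Equation (2.1)

model = NMF(n_components=3, init="nndsvd", max_iter=500)
H_est = model.fit_transform(D)  # (50, 3) usage estimate, entries >= 0
W_est = model.components_       # (3, 20) basis estimate, entries >= 0
print(np.linalg.norm(D - H_est @ W_est))  # small reconstruction error
```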

2.3 Cluster Analysis

Cluster analysis involves the abstract task of partitioning a set of objects into groups, or "clusters", such that objects in any given group are more "similar" to one another than they are to objects in other groups.

As discussed in Section 1.3, this definition of "clustering" is a high-level one that requires concrete elaboration. No "silver bullet" technique exists, because the interpretation of the problem typically depends on the application in question. In this thesis, we restrict our focus to non-hierarchical clustering of vector data. Figure 2.4 illustrates four different sets of 2D vector data in which the concept of a "grouping" can differ.

[Figure: four 2D scatter panels.]

Figure 2.4: Four different two-dimensional spatial data sets.

2.3.1 Partition-Based Clustering

On the left of Figure 2.4 we find perhaps the simplest of the four data sets. This data set was generated by sampling from four bivariate Gaussian distributions (each with a different mean and standard deviation). Although the groups have different sizes, they have comparable spatial extent and are well separated (in the Euclidean sense of the word).

It is clear to the naked eye that it would be useful in this case to group (or "partition") objects based on their absolute position in space. That is, we could assign a "prototype" vector for each group, and assign each object's cluster membership based on its closest prototype (using the Euclidean distance). This is one partitioning approach, also known as vector quantization. The prototype vectors might also be called "centroids", hence a further name for this kind of clustering paradigm. In Figure 2.5 we see an example of the popular k-means partition-based clustering algorithm [Mac+67] applied to the data on the left of Figure 2.4.

Vanilla k-means requires that we specify the number k of prototypes (clusters). In general this may not be known a priori. Importantly for the work in this thesis, centroid-based partitioning techniques like k-means typically also lack a noise concept (Challenge 4). That is, they often assume that each object in the data set has a sensible membership in exactly one cluster.
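A minimal sketch of this paradigm (ours; the synthetic blobs merely stand in for data like that on the left of Figure 2.4):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated Gaussian blobs of comparable spatial extent.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_               # hard membership: index of closest prototype
prototypes = km.cluster_centers_  # the "+" markers of Figure 2.5
```

Note that every point receives exactly one label: there is no way for the algorithm to flag an object as belonging to no cluster at all.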

The second data set in Figure 2.4 is not the best fit to the centroid-based partitioning model because the spatial extents of the two clusters vary considerably.


[Figure: 2D scatter plot of the four-group data with colored cluster assignments.]

Figure 2.5: An example k-means result. The locations of the cluster prototypes (also known as "centers" or "centroids") are given by the yellow + markers.

We observe in Figure 2.6 that this complicates matters somewhat for k-means, so we look for a different approach. One solution is to model each of the k clusters using a statistical distribution. One might call such an approach a "distribution-based" clustering paradigm. For our data we observe that each cluster can be well approximated by a Gaussian distribution (with different parameters). In particular, the top cluster could be described with a scalar variance in each axis direction, and the stretched elliptical cluster on the bottom with an appropriate 2 × 2 covariance matrix.

Each data object is then assigned a vector of length k, the elements of which represent the membership probabilities for each of the k cluster distributions. This implies a soft clustering (no concrete membership assignments), although a hard assignment can trivially be achieved by assigning each object to its most likely cluster.

This approach is clearly parametric. The user is required to specify the number of clusters k and the type of distribution in advance. To practically solve the problem, the popular Expectation-Maximization (EM) algorithm [DLR77] is often deployed. The distribution form of the clusters is typically selected as Gaussian. EM iterates to achieve the (local) maximum-likelihood parameters of these distributions.

Although classical distribution-based clustering techniques have no explicit notion of noise, one might interpret an object with consistently "low" membership probabilities as a noise object.
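A sketch of this distribution-based paradigm (ours, assuming scikit-learn's EM implementation of Gaussian mixtures; the 0.6 threshold for the noise interpretation is an arbitrary illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two Gaussian clusters with very different spatial extents.
X, _ = make_blobs(n_samples=400, centers=[[0.0, 3.0], [0.0, -3.0]],
                  cluster_std=[0.5, 1.5], random_state=0)

# EM fits one mean and one full 2x2 covariance matrix per cluster.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
probs = gmm.predict_proba(X)     # soft clustering: length-k membership vector
hard = probs.argmax(axis=1)      # hard assignment: the most likely cluster
noise = probs.max(axis=1) < 0.6  # crude noise notion: no confident membership
```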


[Figure: two panels, "K-Means" and "EM Clustering", showing the same data clustered by each method.]

Figure 2.6: k-means (left) and hard-EM (right) clustering algorithms applied to a common data set. EM models each cluster using a bivariate Gaussian with different distribution parameters.

2.3.2 Density- and Spectral-Based Clustering

A rather different clustering paradigm is to consider the localized object density of points, recognizing that such a measure is considerably higher inside a cluster than it is outside a cluster. The most highly-cited density-based clustering method is DBSCAN [Est+96]. It formalizes this notion using localized concepts like the number of points within a given point's spatial "neighborhood", and whether or not two points can be considered to be "reachable" and "connected" to each other through other local points.

Density-based approaches have a number of advantages. Firstly, operating at a local level, they do not require the user to specify the number of clusters to find. Secondly, they are not restricted to convex-shaped clusters: the notions of "reachability" and "connectivity" enable clusters to grow "naturally" in the direction of all points that satisfy the propagation conditions. Thirdly, they have a clear concept of noise: any point that does not meet the conditions for inclusion in a cluster is labeled as noise. In Figure 2.7 we see an example DBSCAN result on a noisy, synthetic data set with non-convex cluster shapes.
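A minimal sketch (ours; the two-moons data and the parameter values are assumptions, not those used for Figure 2.7):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Noisy, non-convex clusters in the spirit of Figure 2.7.
X, _ = make_moons(n_samples=1000, noise=0.08, random_state=0)

# The user supplies density thresholds: the neighborhood radius eps and
# the minimum number of neighbors for a point to be a "core" point.
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                # cluster ids; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```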

Despite its numerous advantages, DBSCAN is no silver bullet. DBSCAN is not a linear-time algorithm (Challenge 2), still requires some measure of distance on the space (the Euclidean distance is used in [Est+96]), needs the user to specify thresholds for minimum cluster sizes and density, and has difficulties extracting clusters of varying density.


[Figure: two panels, "Raw data" and "DBSCAN".]

Figure 2.7: A DBSCAN result on a data set (over the unit square) containing clusters of non-convex form. The parameters used were ε = 0.01, minPts = 5. Light gray is used to represent points labeled by DBSCAN as noise.

More recent variants of density-based clustering, like OPTICS [Ank+99], help to mitigate the latter two limitations.²

² In solving these problems, however, it could be argued that OPTICS introduces others. It produces a hierarchical clustering that typically requires manual interpretation.

The third data set in Figure 2.4, commonly known as the "spirals" data, shows another situation in which clusters need not necessarily have convex boundaries. In such cases it is not sensible to assign cluster membership based on prototype vectors or Gaussian distributions. Again, a more local "connectivity"-based approach can prove useful in such situations.

Spectral clustering is the name given to techniques that are based on graph-theoretical notions and that exploit the spectral decomposition of matrices derived from the data set. The symmetric matrix typically used is the so-called "affinity matrix" A ∈ R^{n×n}, which encapsulates the "similarity" information between all objects in the data. The similarity is typically measured using the Gaussian kernel. Specifically, the affinity between two data points with vectors x_i and x_j is

a_ij = exp( −‖x_i − x_j‖² / (2σ²) ).    (2.3)



σ is a scale parameter, typically interpreted as a measure of when two points are considered "similar". The selection of σ is commonly done manually; however, it can also be made to vary in space and can be determined automatically [ZP05].

Spectral clustering can be formulated in various ways, but the classical goal is to find the first k eigenvectors of the symmetric, normalized Graph Laplacian

L_norm := I − D^{−1/2} · A · D^{−1/2},    (2.4)

where D is the graph's degree matrix. The k eigenvectors are used as columns in constructing the matrix V ∈ R^{n×k}. The rows of V are then interpreted as the new data observations in k-dimensional space. In this space, it turns out that grouping by compact clusters corresponds to minimizing the inter-cluster affinity (that is, maximizing the total "separation" of the clusters) in the original graph context. The final step to identify the clusters is hence to apply a traditional compactness-based clustering method, like k-means or EM, on this k-dimensional space.
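The following sketch (ours, an illustration of Equations (2.3) and (2.4) rather than a reference implementation) strings these steps together:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_cluster(X, k, sigma):
    """Cluster the rows of X into k groups via the normalized Laplacian."""
    # Affinity matrix, Equation (2.3): Gaussian kernel on pairwise distances.
    A = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    # Normalized Laplacian, Equation (2.4): L = I - D^(-1/2) A D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # The k eigenvectors of the smallest eigenvalues give the embedding V.
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, :k]
    # Final step: compactness-based clustering in the embedded space.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
```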

[Figure: "Spectral Clustering": scatter plot of the spirals data with colored cluster assignments.]

Figure 2.8: The spectral clustering result on the spirals data (using the R kernlab package [Kar+04]). We note that density-based algorithms like DBSCAN can find a similar solution.

Although shown to be empirically successful for image segmentation and a variety of exotic spatial data sets (e.g. snakes, letters, and the spirals in Figure 2.8), spectral clustering in its basic form has a number of limitations. These include the difficulty in handling noise (Challenge 4, especially if e.g. k-means is used in the final clustering stage), the requirement for the user to specify the number of clusters k and the scale σ (Challenge 1), the difficulty in handling multi-scale data, and the high computational cost (Challenge 2). Indeed, spectral clustering experiences problems on the DBSCAN-example data in Figure 2.7 because of the noise (note that DBSCAN performs well on the spirals data). More recent approaches, including the highly-cited "self-tuning" (automated) method for spectral clustering [ZP05], have made progress on some of these limitations.


3 Literature Review

In this chapter we reflect on results from the specialized, state-of-the-art literature that is most directly relevant to the work in this thesis.

3.1 Matrix Factorizations over Discrete, Finite Sets

In Section 2.2 we discussed the concept of using matrix factorizations from linear algebra as a tool for Blind-Source Separation. We briefly reviewed Independent Component Analysis, Principal Component Analysis and Non-negative Matrix Factorization as three techniques commonly used to solve concrete realizations of the abstract BSS problem. All three impose restrictions on the nature of the factor matrices in order to be able to optimize an objective function that is assumed to coincide with what a practitioner would consider useful. In the Cocktail Party example, the assumption that the audio sources are statistically independent is reasonable, so ICA's objective function is a sensible choice. In contrast, if we wish to visualize a high-dimensional, real-valued spatial data set, we might obtain a useful view by following PCA's objective (selecting the two or three directions that explain the greatest variance). If a "parts-based" representation is sensible, we might select NMF.

Consider now the simple data matrix in Figure 3.1, which records the courses taken by four students (the courses are Programming, Electromagnetism, Mathematics and Molecular Dynamics). The first three students study the disciplines of Computer Science, Physics and Chemistry respectively. The last student studies Numerical Simulation, an interdisciplinary program involving a mix of the other three "pure" disciplines. To reflect the fact that these measurements are not made on the ratio scale, the entries of the matrix are denoted either f or t for false and true respectively (rather than 0 and 1).

As exploratory data miners, we hope to learn something from this data. We pose the question: can we perform Blind-Source Separation on this data in order to extract some useful and intuitive "sources" (latent patterns)?

Our first (naïve) step in this direction might be to map our nominal labels to numerical values and apply a numerical technique from linear algebra. We select the commonly-used mapping f ↦ 0 and t ↦ 1. Figure 3.2 shows the effective factors obtained after applying SVD to the mapped data and keeping only the k = 3 most significant singular values.


     Prog.  Elec.  Math.  Mol.Dyn.
1      t      f      t      f
2      f      t      t      f
3      f      f      t      t
4      t      t      t      t

Figure 3.1: A simple data matrix that records the participation of students in courses.

We note two key limitations of the SVD result. Firstly, it is approximate. For this data, there exists no exact rank-three decomposition when the classical matrix product is used. Secondly, the factors contain negative, zero and positive entries.

From a knowledge-discovery perspective, it is the latter limitation that is perhaps the most critical. It has a direct impact on the interpretability of the factors in the context of the data domain (Challenge 4). That is, how are we supposed to map the factors found by SVD back into meaningful insights in the context of students and courses? Given the SVD factors here, there is no trivial mapping scheme that will help us learn about our "source" disciplines of Computer Science, Physics and Chemistry.

The Non-negative Matrix Factorization result (also in Figure 3.2) is an improvement. Although it is still only an approximation, it constrains its factors to non-negative entries, which can often aid interpretability. In this case we see entries in the factors ranging from 0 to approximately 1.14. To get a clearer picture, we might use simple rounding to interpret the result in the context of the domain. We color a cell red if its entry is closer to zero than it is to 1, and green otherwise.

With this rounding, and even if we perform normalization, the factors still fail to reflect our "source" disciplines (errors relative to the upcoming BMF decomposition are labeled with an exclamation mark). Why is this? To answer this question, we reflect on the fundamental assumption that NMF makes about the data we provide. NMF assumes that the data is measured on a ratio scale. NMF, and indeed all techniques from linear algebra, rely on the classical matrix product, which performs numerical addition and multiplication. The crux is that performing numerical addition and multiplication makes little sense for logical data, because logical data is not measured on a ratio scale. Rather, two-valued logical data is more sensibly mixed and weighted using the semantics of Boolean disjunction and conjunction. Concretely, the Boolean axiom of 1 + 1 = 1 (where 1 represents logical truth and + represents logical disjunction) is in conflict with its arithmetic counterpart 1 + 1 = 2 (where 1 is a real value and + is numerical addition). When mixing is involved, this has the effect of "pushing down" the values in the NMF factors, which can lead to misleading results on final interpretation.


[Figure: the data matrix (columns Prog., Elec., Math., Mol.Dyn.) decomposed into "usage" and "basis" matrices by three methods: Singular Value Decomposition, Non-negative Matrix Factorization, and Boolean Matrix Factorization.]

Figure 3.2: Finding latent disciplines from students and courses.

We hence arrive at the seminal discrete matrix-factorization technique for non-ratio-scale data in the context of data mining: Boolean Matrix Factorization.

3.2 Boolean Matrix Factorization

Let B = {f, t} be the set of Boolean logic values. Let the binary function ⊖_B : B × B → [0, 1] be a contrast or "dissimilarity" measure over this set such that for all a, b ∈ B, ⊖_B(a, b) = 0 if a = b, and 1 otherwise (i.e. the Boolean XOR function). The Boolean matrix product A ⊙_B B of two matrices A ∈ B^{n×k}, B ∈ B^{k×m} is an analog to the classical matrix product, whereby arithmetic addition and multiplication are exchanged for logical disjunction ⊕_B and conjunction ⊗_B:

(A ⊙_B B)_ij = (a_i1 ⊗_B b_1j) ⊕_B · · · ⊕_B (a_ik ⊗_B b_kj).    (3.1)

The Boolean Matrix Factorization problem [MV14; Mie09] is stated as follows:


Problem (BMF: Boolean Matrix Factorization). Given a Boolean data matrix D ∈ B^{n×m} and a positive integer k, find "usage" and "basis" matrices U ∈ B^{n×k} and B ∈ B^{k×m} that minimize

‖D ⊖_B (U ⊙_B B)‖_1,    (3.2)

where ⊖_B is applied entry-wise to produce a "residual" matrix, and ‖·‖_1 is the entry-wise 1-norm (the simple sum of all matrix elements).
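A minimal sketch (ours) of the Boolean product of Equation (3.1) and the BMF objective of Equation (3.2), applied to the students-and-courses data of Figure 3.1 together with a factorization consistent with the description of Figure 3.2:

```python
import numpy as np

def boolean_product(U, B):
    """Boolean matrix product: OR replaces addition, AND replaces multiplication."""
    # prod[i, a, j] = U[i, a] AND B[a, j]; then OR-reduce over a.
    return (U[:, :, None] & B[None, :, :]).any(axis=1)

def bmf_cost(D, U, B):
    """Reconstruction error of Equation (3.2): entry-wise XOR, then sum."""
    return int(np.sum(D ^ boolean_product(U, B)))

# The students-and-courses data of Figure 3.1 (t = True, f = False).
D = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=bool)

# An exact rank-three factorization: the rows of B are the latent
# disciplines; the last row of U mixes all three of them.
U = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=bool)
B = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]], dtype=bool)
assert bmf_cost(D, U, B) == 0      # zero reconstruction error
```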

Returning to Figure 3.2, we see how the BMF treatment of the students-and-courses data offers numerous advantages. Firstly, we see how BMF's logic-based mixture model enables us to achieve an exact "rank"¹-three decomposition (zero reconstruction error with respect to Equation (3.2)). Secondly, and perhaps most importantly, the factor matrices are directly interpretable in the context of the domain (Challenge 4). The basis matrix exposes our latent Physics, Computer Science and Chemistry disciplines, and the usage matrix shows how the fourth student learns a mix of the core disciplines. Finally, we note that the pre- and post-processing steps (mapping and rounding) necessary for SVD and NMF are not required by BMF. Indeed, such mapping decisions are the root cause of the problems found when applying SVD and NMF to these data: these mappings introduce ratio-scale semantics on non-ratio-scale data.

Asso is a state-of-the-art algorithm specifically designed for solving BMF [Mie+08]. The algorithm is so named because it involves the computation of association accuracies. Specifically, an entry a_ij of the association matrix A represents the confidence of the association between the ith and jth column, as defined in association-rule mining [AIS93]. Initially, the entries of the non-symmetric matrix A are hence real values between zero and one. Heuristically, A can be converted to a binary matrix using a rounding threshold τ, and its rows then considered as candidate Boolean basis vectors. k of these candidate basis vectors are then selected in a greedy fashion to arrive at an approximate solution. In the best case, Asso computes a solution in time O(knm²). The computation of the association matrix is particularly expensive: the association score is calculated for each pairwise combination of columns, requiring time O(nm²). Asso requires the values of k and τ to be defined by the user, and depends on floating-point arithmetic for the creation and manipulation of the raw matrix A (despite the fact that the original data is two-valued discrete).

3.2.1 Combinatorics Problems Related to Boolean Matrix Factorization

The BMF problem is related to a number of problems from combinatorics. These combinatorics problems will help us in our work as well. For reference, we hence detail them in the following subsections.

¹ See [MV14] for a discussion on the Boolean rank of a matrix.


Set Cover

The Set Cover (SC) problem is one of the classic 21 NP-complete problems published by Karp [Kar72].

Problem (Set Cover). Given a "universe" set of elements U = {1, 2, . . . , m} and a collection S = {S_1, S_2, . . . , S_n} of sets such that their union is U, find the smallest sub-collection C ⊆ S such that the union of its sets is also U.

Example (Set Cover). Consider the problem instance with U = {1, 2, 3, 4, 5} and S = {{1}, {1, 2, 3}, {2, 5}, {3, 5}, {4, 5}, {3, 4}}. It is clear that the union of all sets in S is equal to U. The smallest sub-collection of S which still fully covers U, and hence the solution to the problem, is C = {{1, 2, 3}, {4, 5}}.

The optimization version of the set-cover problem is NP-hard [BJ08] (the decision version is NP-complete). The baseline greedy polynomial-time algorithm for approximating set cover simply involves selecting those successive sets which cover the largest number of remaining uncovered elements. The algorithm terminates when all elements are covered. In [Sla96] it is shown that the approximation ratio for this greedy algorithm is exactly ln n − ln ln n + O(1), making it essentially the best-possible polynomial-time approximation algorithm for set cover.
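A sketch of this greedy algorithm (ours), run on the example instance above:

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))
        if not best & uncovered:       # no progress possible: not a cover
            raise ValueError("the sets do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

U = {1, 2, 3, 4, 5}
S = [{1}, {1, 2, 3}, {2, 5}, {3, 5}, {4, 5}, {3, 4}]
print(greedy_set_cover(U, S))          # [{1, 2, 3}, {4, 5}]
```

On this instance the greedy choice happens to coincide with the optimal solution; in general it is only an approximation.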

Red-Blue Set Cover

The Red-Blue Set Cover (RBSC) problem [Car+00] is a generalization of the Set Cover problem. The objective of RBSC differs from that of SC in that the cost of a solution considers not the number of covering sets, but the number of covered "penalty" elements from a second "red" set:

Problem (Red-Blue Set Cover). Given two disjoint sets R = {r_1, r_2, . . . , r_ρ} and B = {b_1, b_2, . . . , b_β}, as well as a collection of sets S ⊆ 2^{R∪B}, find the sub-collection C ⊆ S that covers all elements in B but covers the minimum possible number of elements in R.

Example (Red-Blue Set Cover). Consider the problem instance with B = {1, 2, 3, 4, 5}, R = {6, 7, 8} and S = {{1}, {1, 2, 3, 7, 8}, {2, 5}, {3, 5, 6}, {4, 5, 7}, {3, 4, 6}}. The solution is C = {{1}, {2, 5}, {3, 4, 6}}, which has the minimum cost of 1.

In [Car+00] it is shown that the SC problem can be reduced to the RBSC problem, and that the RBSC problem is at least as hard as the SC problem.


Positive-Negative Partial Set Cover

The Positive-Negative Partial Set Cover (±PSC) problem is a more recent generalization of the RBSC problem (which in turn is a generalization of the SC problem). Its introduction was motivated by the BMF problem [Mie08a]. Compared to RBSC and SC, ±PSC relaxes the strict requirement of complete cover. It searches for a solution that represents the best balance between covering positive elements and not covering negative elements.

Problem (Positive-Negative Partial Set Cover). Given two disjoint sets P and N, as well as a collection of sets S ⊆ 2^{P∪N}, find the sub-collection C ⊆ S that minimizes the cost function

cost_±PSC(P, N, C) = |P \ (∪_{C∈C} C)| + |N ∩ (∪_{C∈C} C)|.    (3.3)

Example (Positive-Negative Partial Set Cover). Consider the problem instance with P = {2, 3, 5, 7}, N = {1, 4, 6} and S = {{1, 2, 6}, {2, 4}, {1, 4, 5, 6, 7}, {4, 5, 7}}. The sub-collection of S achieving the minimum cost is C = {{2, 4}, {4, 5, 7}}. The minimum cost is 2, because this solution covers one negative element (4) and fails to cover one positive element (3).

In [Mie08a] it is shown that RBSC can be reduced to ±PSC, that the ±PSC problem is at least as hard as RBSC, and that a polynomial-time approximation algorithm for ±PSC exists that achieves an approximation factor of 2√((|S| + |P|) log |P|).

3.2.2 Missing-Value Boolean Matrix Factorization

Missing-Value Boolean Matrix Factorization (MVBMF) is a variant of Problem BMF that considers the case where the data matrix is incomplete.

Problem (MVBMF: Missing-Value Boolean Matrix Factorization). Let B_MV = B ∪ {?}, where ? indicates a missing value. Given a Boolean data matrix D ∈ B_MV^{n×m} and a positive integer k, find "usage" and "basis" matrices U ∈ B^{n×k} and B ∈ B^{k×m} that minimize

‖D ⊖_B (U ⊙_B B)‖_1,    (3.4)

where ⊖_B and the norm ‖·‖_1 this time consider only the known elements in D.

This problem has applications in collaborative filtering [SK09] and the mining of "roles" from incomplete data during an organization's migration to role-based access control (RBAC) [Vav+].

Assomv is the name we use to denote the Asso extension that supports missing values [YM12]. The major contribution that this work makes to Asso is in the algorithm step where the association matrix is computed. To handle missing values, the original definition of association confidence is modified to include a probabilistic component. The resulting association matrix is then converted to binary (using the same τ parameter as Asso). The remaining part of the algorithm is similar to Asso, although certain additional data structures are needed to track the elements of the data matrix for which the cover need not be computed. Assomv also has quadratic run-time complexity with respect to the dimension m (again due to the need to compute the full association matrix). The method is evaluated on synthetic data with up to 99% missing values in the data matrix [YM12].

One key advantage of Assomv is that it addresses both the prediction and the description task simultaneously. That is, the factors U and B are 1) Boolean, which admits their interpretation in the context of the domain to help describe potentially-valuable latent patterns (description), and 2) multiplied using the Boolean matrix product, which helps to "fill in the blanks" in the data matrix (prediction).

Other techniques exist for the collaborative-filtering problem that focus primarily on prediction. Maximum-Margin Matrix Factorization (MMMF) [SRJ04] is such a technique. MMMF finds a factorization of the same basic structure ("usage" and "basis" matrices); however, the factors contain real values and are multiplied using the classical matrix product. The main contribution of this work is the formulation of the binary collaborative-filtering problem (in which the data matrix has missing values) in terms of standard optimization problems. To this end, the dimensionality k is kept unbounded and the factorization is regularized with a low-norm constraint. For the learning process it is shown that, when keeping one of the matrices fixed, each separate linear prediction problem decomposes into a standard support-vector-machine problem if the hinge-loss error (appropriate for binary data) is chosen. This technique has been shown to be highly successful on the prediction problem; however, it fails to carefully address the description problem for Boolean data. Like SVD and NMF, the use of real values and a linear mixture model ultimately renders the factors difficult to interpret in the context of the Boolean domain. Additionally, and unlike SVD and NMF, MMMF leaves the decomposition rank k unbounded, which may result in a large and indigestible set of "features" in the factorization.

3.3 Ordinal Matrix Factorization

Ordinal Matrix Factorization (OMF) was first presented in the context of data-mining research in [BK13]. As with BMF, the problem considered is that of performing Blind-Source Separation on data measured over a non-ratio scale; in this case, the data under consideration is measured over ordinal scales.

To motivate the need for OMF, consider the ordinal example in Figure 3.3.


[Figure: the dog-trait data matrix (columns Energy, Loud., Protect., Affec.; rows GER, SCH, SIB, BUL) decomposed into "usage" and "basis" matrices by Non-negative Matrix Factorization and by Ordinal Matrix Factorization.]

Figure 3.3: Finding latent canine temperaments from dogs and traits.

In Figure 3.3, the "frequency" {Never, . . . , A Great Deal} of dog traits (energy, loudness, protectiveness and affection) is measured for four breeds (GERman Shepherd, SCHipperke, SIBerian Husky and BULL Terrier). From the domain we are aware of the latent dog temperaments "hyperactive" (energetic and loud), "playful" (loud and affectionate) and "watchful" (loud and protective), which we might hope to uncover with a rank-three decomposition. By imposing a five-element numerical scale (0, 0.25, 0.5, 0.75, 1) on our ordinal data, we might again try NMF. If we again use simple rounding to get a clearer picture, we struggle to clearly see our latent concepts. The first basis vector, for example, fails to pair loudness with protectiveness to highlight "watchfulness". OMF's exact decomposition, on the other hand, clearly highlights the temperaments in the basis matrix, with the usage matrix confirming that Bull Terriers strongly exhibit all three. NMF's result is again attributable to its use of a classical linear mixture model (arithmetical addition and multiplication). A more in-depth OMF motivation on real-world canine data is given in [BK13].

SIBerian Husky and BULL Terrier). From the domain we are aware of the latent dogtemperaments “hyperactive” (energetic and loud), “playful” (loud and affectionate) and“watchful” (loud and protective), which we might hope to uncover with a rank-threedecomposition. By imposing a five-element numerical scale (0, 0.25, 0.5, 0.75, 1) on ourordinal data, we might again try NMF. If we again use simple rounding to get a clearerpicture, we struggle to clearly see our latent concepts. The first basis vector, for example,fails to pair loudness with protectiveness to highlight “watchfulness”. OMF’s exact de-composition, on the other hand, clearly highlights the temperaments in the basis matrix,with the usage matrix confirming that Bull Terriers strongly exhibit all three. NMF’sresult is again attributable to its use of a classical linear mixture model (arithmeticaladdition and multiplication). A more in-depth OMF motivation on real-world caninedata is given in [BK13].

Let L = {0, 1/(s+1), . . . , 1} represent an s-element partially-ordered set bounded by 0 and 1. Scales of this nature are often found in survey research. For example, we often find "Likert Items" inspired by Miller's Law² [Mil56] with the possible values:

strongly disagree · disagree · neutral · agree · strongly agree

When we map such choices to the numerical values L = {0, 1/4, 1/2, 3/4, 1} for the sake of convenience, we should take care not to think that the classical arithmetic operations can then be applied. For example, we cannot make a statement like "Neutral is twice as much agreement as Disagree" for this data, because we have no meaningful "zero" value. It is this mechanism that causes the NMF interpretation problems in Figure 3.3.

² The observation that the number of objects an average person can hold in working memory is approximately 7 ± 2.

Instead, values on such scales can be weighted and mixed using the Łukasiewicz operations [BK13]. The ordinal matrix product ⊙_L hence uses the operation a ⊕_L b = max(a, b) in place of addition, and the operation a ⊗_L b = max(0, a + b − 1) in place of multiplication. Let the binary function ⊖_L : L × L → [0, 1] be a contrast or "dissimilarity" measure over L (in [BK13] the function is defined as ⊖_L(a, b) = |a − b|). The formal definition of the OMF problem is then [BK13]:

Problem (OMF: Ordinal Matrix Factorization). Given an ordinal data matrix D ∈ L^{n×m} and a positive integer k, find "usage" and "basis" matrices U ∈ L^{n×k} and B ∈ L^{k×m} that minimize

‖D ⊖_L (U ⊙_L B)‖_1,    (3.5)

where, as for BMF, ⊖_L is applied entry-wise to produce a "residual" matrix, and ‖·‖_1 is the entry-wise 1-norm (the simple sum of all matrix elements).
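A sketch (ours) of the Łukasiewicz product and the objective of Equation (3.5); the tiny matrices are arbitrary illustrations over the five-element scale {0, 1/4, 1/2, 3/4, 1}:

```python
import numpy as np

def lukasiewicz_product(U, B):
    """Ordinal matrix product: max replaces addition, and the Lukasiewicz
    conjunction max(0, a + b - 1) replaces multiplication."""
    # conj[i, a, j] = max(0, U[i, a] + B[a, j] - 1); then max-reduce over a.
    conj = np.maximum(0.0, U[:, :, None] + B[None, :, :] - 1.0)
    return conj.max(axis=1)

def omf_cost(D, U, B):
    """Objective of Equation (3.5): entry-wise |a - b|, summed."""
    return np.abs(D - lukasiewicz_product(U, B)).sum()

D = np.array([[1.0, 0.75],
              [0.5, 0.25]])
U = D.copy()
B = np.eye(2)          # 1 is the multiplicative identity of the conjunction
print(omf_cost(D, U, B))   # 0.0: the identity reconstructs D exactly
```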

In [BK13] the algorithm GreEss is introduced to solve OMF. The algorithm is based on formal-concept theory. Importantly, we note that GreEss achieves the optimal from-below decomposition. That is, GreEss is able to compute the optimal solution from the subset of OMF solutions that give a reconstructed data matrix not exceeding the original matrix D in any entry. Although not mentioned in the original article, our empirical analysis suggests that this effectiveness comes at a price: the time complexity of GreEss is in O(sn²m³).

3.4 Clustering in the Context of Noise/Clutter

Although noise is considered to some extent by a number of clustering techniques (e.g. DBSCAN, or Robust Information-Theoretic Clustering [Böh+06]), there are far fewer that consider the case where global “clutter” heavily outweighs the number of clustered points (the right-most case in Figure 2.4). We are aware of only two techniques that specifically focus on this kind of scenario. We discuss these techniques in this section.

3.4.1 Peer Article: Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering

In [DR98] the authors consider the problem of detecting surface-laid minefields on the basis of many images from reconnaissance aircraft (Figure 3.4). Minefields are often laid in a structured way. The authors approach the problem as one of clustering in the presence of significant “clutter”. They propose a distribution-based method which sees minefields modeled as multivariate normal distributions, and “clutter” represented by a spatial Poisson process. The EM algorithm is used to find an approximate solution to the problem, with approximate Bayes factors employed to select the number of clusters. The resulting Mclust-EM algorithm is shown to produce good results for up to two such “minefield” clusters in the presence of considerable “clutter”. Examples are also given for detecting seismic faults based on an earthquake catalog, although this data is much less “cluttered”.
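This kind of model (Gaussian cluster components plus a uniform clutter component, fit via EM) can be sketched with the R package mclust listed in Table 4.3. The simulated data and the crude noise initialization below are illustrative assumptions, not the setup of [DR98]:

    library(mclust)  # provides Mclust()
    set.seed(1)
    mines   <- cbind(rnorm(100, 0.5, 0.05), rnorm(100, 0.5, 0.05))  # one "minefield"
    clutter <- cbind(runif(400), runif(400))                        # spatially-uniform clutter
    X <- rbind(mines, clutter)

    # Mclust supports a uniform noise component, but requires an initial guess
    # of which points are noise; a random guess is crude but suffices here.
    noise_init <- sample(c(TRUE, FALSE), nrow(X), replace = TRUE)
    fit <- Mclust(X, G = 1:3, initialization = list(noise = noise_init))
    summary(fit)  # classification 0 denotes points assigned to the clutter component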


Figure 3.4: A simulated minefield (left) in the presence of “clutter” (like metal objects or rocks). This data set was generated based on similar data presented in [DR98].

Primarily driven by the minefield application, Mclust-EM is unfortunately only usable for data in two dimensions, and is parametric (it assumes Gaussian clusters).

3.4.2 Peer Article: Efficient Algorithms for Non-Parametric Clustering with Clutter

Compared to [DR98], the development of the method in this article [WM02] is not driven by a particular application. The authors argue that situations with a combination of noisy background and clusters are found in a variety of applications, and that a new approach is needed. One application that the authors do use as motivation is the clustering of galaxies (Figure 3.5). Each “bright spot” in such a data set is a galaxy containing billions of stars. Astrophysicists are interested in clustering galaxies, but a noisy background of field galaxies interferes with traditional clustering techniques based on mixture models, vector quantization or graph theory.

Figure 3.5: A view of galaxy data from the Sloan Digital Sky Survey [Alb+16; Eis+11] as considered in [WM02].

The key contribution of [WM02] is a refined version of the Cuevas, Febrero and Fraiman (CFF) algorithm [CFF00]. The CFF algorithm first determines the subset of data points that lie in high-density regions using a non-parametric density estimator. This is followed by a clustering step, whereby such high-density points are agglomerated. The refinements made are aimed at partially addressing the computational problems of the CFF algorithm. However, a key limitation is that it still needs to be executed “hundreds of times” [WM02, p. 4] with different combinations of the three required parameters.
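The two-step idea can be sketched as follows in R; the bandwidth, density threshold and linkage radius below are illustrative parameter assumptions, and the burden of choosing such values is exactly the limitation noted above.

    library(MASS)  # kde2d() for a non-parametric (kernel) density estimate
    set.seed(2)
    X <- rbind(matrix(rnorm(200, 0, 0.1), ncol = 2),
               matrix(rnorm(200, 1, 0.1), ncol = 2),
               matrix(runif(600, -0.5, 1.5), ncol = 2))  # two clusters plus clutter

    # Step 1: retain only the points lying in high-density regions
    dens <- kde2d(X[, 1], X[, 2], n = 50)
    ix   <- cbind(findInterval(X[, 1], dens$x), findInterval(X[, 2], dens$y))
    d    <- dens$z[ix]
    high <- X[d > quantile(d, 0.7), ]  # density threshold (parameter 1)

    # Step 2: agglomerate the high-density points into clusters
    cl <- cutree(hclust(dist(high), method = "single"), h = 0.1)  # radius (parameter 2)
    table(cl)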

3.5 Anomaly and Change-Point Detection in Time-Series Data

The topics of “change-point detection”, “anomaly detection” and “event detection” for time-series data (Challenge 3) have seen numerous contributions in the data-mining, signal-processing and statistics literature. In this section we first review a range of generic techniques for numerical time series. Afterwards, we review text-based (Natural Language Processing) methods for topic and event detection in social-media and microblogging services like Facebook and Twitter.

3.5.1 Numerical Techniques from Data-Mining and Signal-Processing

A number of recent contributions to the data-mining anomaly-detection literature have been made by the research teams of internet-based companies. Twitter, for example, has released two popular techniques: Twitter Anomaly Detection [JKM14] and Twitter Breakout Detection [VHK14]. A result from the former technique is shown in Figure 3.6, where we can see that it identifies intuitive outliers (red points) in a signal that contains noise, short-term seasonality and long-term trend. The underlying algorithm uses statistics based on the Extreme Studentized Deviate (ESD) test, applied after adjusting for seasonality.
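As a usage illustration, the released R package can be invoked as in the following sketch; the synthetic series and the argument values are our own assumptions.

    # devtools::install_github("twitter/AnomalyDetection")
    library(AnomalyDetection)
    set.seed(3)
    ts <- seq(as.POSIXct("2016-09-27"), by = "hour", length.out = 24 * 8)
    x  <- 100 + 10 * sin(2 * pi * seq_along(ts) / 24) + rnorm(length(ts))  # daily seasonality
    x[c(50, 120)] <- x[c(50, 120)] + 60                                    # two injected outliers
    df <- data.frame(timestamp = ts, count = x)

    res <- AnomalyDetectionTs(df, max_anoms = 0.02, direction = "both", plot = FALSE)
    res$anoms  # timestamps and values of the detected anomalies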


Figure 3.6: Twitter Anomaly Detection finds significant deviations in signals containing noise, short-term seasonality and long-term trend.

Yahoo! is another company which has recently “open sourced” a framework for anomaly detection [LAF15]. It is named the Extensible Generic Anomaly Detection System (EGADS). Subscribing to the view that there is no “silver bullet” approach, it contains a suite of anomaly-detection techniques in a single package. The overall architecture consists of two primary components: the time-series modeling module (TMM) and the anomaly-detection module (ADM). The TMM predicts expected future values of a given time series. These predictions are consumed by the ADM, which in turn computes anomaly scores.
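The following is a minimal R sketch of this two-stage pattern (an illustration of the architecture only, not the EGADS Java API): a naive modeling step predicts expected values, and a detection step scores the residuals.

    set.seed(4)
    x <- 100 + 10 * sin(2 * pi * (1:500) / 50) + rnorm(500)
    x[250] <- x[250] + 40                               # injected anomaly
    tmm <- stats::filter(x, rep(1/25, 25), sides = 2)   # "TMM": moving-average prediction
    adm <- abs(x - tmm) / sd(x - tmm, na.rm = TRUE)     # "ADM": scaled residual as anomaly score
    which(adm > 4)                                      # should flag the spike at index 250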

The recent statistical literature contains further approaches. The eDiv approach presented in [MJ14] finds segments in time-series data, the boundaries of which represent change points. To this end it employs a binary bisection method and a permutation test. eDiv is primarily an offline method: its computational complexity is in O(kn²), where k is the number of estimated change points and n the number of observations. From the same work we find eAgglo, the bottom-up variant of eDiv, and another probabilistic pruning method cp3o. These techniques likewise have quadratic run-time complexity in n. The work presented in [Rig10] considers the same problem and presents the technique pDPA. pDPA is able to find the solution that globally minimizes the sum of the quadratic losses in each segment. pDPA likewise has a worst-case quadratic run-time in n.
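For illustration, eDiv is available as e.divisive in the R package ecp (Table 4.3); the following sketch runs it on a toy signal with a single change point (the parameter values are our own assumptions).

    library(ecp)
    set.seed(5)
    x <- matrix(c(rnorm(150, mean = 0), rnorm(150, mean = 2)), ncol = 1)
    res <- e.divisive(x, sig.lvl = 0.05, R = 199)  # permutation test with 199 replicates
    res$estimates  # estimated change-point locations (the series end points are included)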

3.5.2 Specialized Techniques for Social Media

Motivated by large-scale applications like Twitter and Facebook, the data-mining community has recently been focusing on the task of detecting “events” or “topics” in social media. In [Rit+11; Rit+12] the algorithm TwiCal is presented. TwiCal exploits part-of-speech tagging, named-entity recognition, temporal expression resolution (e.g. “next Friday”), event tagging and event classification in order to generate an open-domain calendar of significant events (see Figure 3.7). In [HTK15] we find an approach that is more focused on the real-time detection of topics in Twitter (as opposed to trying to generate a calendar) in the presence of noise. To handle the high bandwidth of Twitter (Challenge 2), the authors reformulate the Non-negative Matrix Factorization problem in a stochastic manner. The reformulation enables the use of stochastic gradient-descent updates in linear time with respect to the number of non-zero entries in the matrix.


Figure 3.7: Partial screenshot of the Twitter Status Calendar (statuscalendar.com), a demonstration of the system described in [Rit+11; Rit+12].


4 Research Approach

This thesis is based on the research and development of practical methods for exploratory, unsupervised data mining. Research questions were initially spawned based on the community-identified challenges enumerated in Section 1.3. Given a research question, literature was reviewed, existing algorithms were collected, and example data sets were synthetically generated or harvested from various sources. Given these resources, an iterative phase of method- and algorithm-design was performed to identify and overcome the limitations of existing approaches to the problems in question. Our work did not focus on individual data sets alone, nor on a hypothesis for explaining a phenomenon in a particular application domain. The following sections discuss the elements of our research approach in more detail.

4.1 Literature Reviews

Particularly in the accelerating field of data-mining research, it is important to frequently review the literature for a given research question or problem. As with all research, our contributions build on those of others. We adopted the following strategies during our literature reviews:

• Perhaps our primary sources of literature were the proceedings of the premier conferences in the field. These conferences are 1) the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2) the IEEE International Conference on Data Mining, and 3) SIAM Data Mining. Proceedings are published annually, with articles indexed by topic area (e.g. cluster analysis). These three established conferences are highly selective and tend to attract the greatest attention and highest-impact contributions from data-mining researchers and practitioners (more so than any data-mining journal, for example).

• “Bottom-up” search was performed by seeking concrete implementations of algorithms for the given research question in established data-mining software systems. These systems included 1) the Environment for Developing KDD-Applications Supported by Index-Structures (ELKI) [AKZ08], 2) the Waikato Environment for Knowledge Analysis (Weka) [Hal+09], and 3) the Comprehensive R Archive Network (CRAN) [Hor12]. Given a software implementation, the source code or documentation was consulted to extract the relevant publication(s).

• “Top-down” search was performed by seeking survey articles for a given research question or problem definition.

• Established researchers in the field were contacted, who often recommended related work. These researchers are listed in Table 4.3 and also in the acknowledgments section of this thesis.

In all cases, identified articles were additionally used as a starting point for backwards search (analyzing the references of each publication in a recursive manner). Taken together, these strategies yielded a representation of the state-of-the-art with respect to a given research question.

4.2 Idea Synthesis, Problem Definitions and Algorithm Design

Having performed literature reviews for a given research question, we began a creative process of idea generation and brainstorming in an attempt to address the weaknesses of the state-of-the-art. Early in this process, we would typically generate and focus on a “running-example” or “illustrative” problem that highlighted these weaknesses (we make use of such examples in our publications). This exercise helped to remove irrelevant complexity, assisted in refining the problem definition, and served as a springboard for spawning algorithmic ideas for overcoming the weaknesses. In line with our thesis goals, we only pursued ideas that led to linear-time algorithmic behavior. Finally, before investing time in writing a software implementation, we manually tested the idea on small (“toy”) datasets, and performed a further literature review to determine whether a peer had already proposed such an algorithm for the given problem.

4.3 Software Prototypes

Our research made extensive use of software prototypes for the practical implementation of the algorithms presented. Considering that none of the problems we investigated can be solved optimally in general, a practical implementation gave insight into the performance (both in terms of effectiveness and efficiency) of a proposed heuristic on non-trivial problem instances. Unsurprisingly, researching in this way was often iterative: the implementation was driven by changes to the algorithm, which in turn was driven by investigations into the implementation’s performance. Finally, a prototype helped to confirm theoretically-calculated approximation bounds for an algorithm.

The software prototypes developed for this dissertation made use of a variety of programming languages and paradigms. All development work was done on a Linux Ubuntu system. For the initial stages of algorithm development we often selected R, a high-level interpreted programming language and statistical computing environment for rapidly prototyping data-mining ideas [Tip+15]. Java, an object-oriented compiled language, was used for an implementation in Paper C. For Java programming we used the IntelliJ IDEA integrated development environment (IDE). For Papers A, B and C, further implementations were written in C++ in order to optimize performance. For C++ programming we used the NetBeans IDE. All software prototypes are included in the code repositories referenced from each paper.

4.4 Data Sources

We empirically evaluated our proposed methods on both synthetic and real-world data. Synthetic data was created by building data-generation tools that implemented a generative model. In each case, the generative model was designed to accept a number of parameters. These parameters controlled the properties of the generated problem instance, like the size, the ground-truth model order, the density of the patterns, and the amount of noise. The data-generation tools were intentionally non-deterministic, allowing multiple data sets to be generated at random for the same parameter combination. A simple random sample of several such data sets (typically 20) was then used to evaluate each algorithm in focus (more details in Section 4.6.2). We note that all data-generation tools were made available alongside our algorithm implementations in publicly-accessible repositories linked directly from footnotes in each paper.
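The following R sketch illustrates the kind of parameterized, non-deterministic generator described above, for the Boolean-factorization case; the parameter names and defaults are illustrative assumptions, not those of the published tools.

    generate_instance <- function(n, m, k, rho_t = 0.3, lambda = 0.3, eta = 0.05) {
      B <- matrix(runif(k * m) < rho_t, k, m)    # ground-truth basis patterns
      U <- matrix(runif(n * k) < lambda, n, k)   # ground-truth usage matrix
      D <- (U %*% B) > 0                         # Boolean product: OR of ANDs
      noise <- matrix(runif(n * m) < eta, n, m)  # randomly-injected noise
      list(D = xor(D, noise), U = U, B = B)
    }
    set.seed(6)
    instances <- replicate(20, generate_instance(500, 40, 6), simplify = FALSE)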

For our real-world experiments, we used data from various sources. Some of this data was from “benchmark” repositories for machine learning and data mining. Table 4.1 shows the data sets used from these repositories. In other cases, we harvested data using the application programming interfaces (APIs) provided by online services. These data sets were subsequently published (with permission) in the respective source repositories (supplementary material) in each case. Table 4.2 gives a description of each of these sources and provides remarks on the data-collection approach.


Repository and Data Sets | Relevant Papers

UCI Machine Learning Repository (archive.ics.uci.edu)
  Breast Cancer, Congressional Voting Records, Dermatology, Hayes-Roth, Image Segmentation, Lenses, Lymphography, Pen-Based Recognition of Handwritten Digits, Promoter Gene Sequences, Seeds, Soybean, SPECT Heart, Tic-tac-toe, Trains, 3D Road Network (North Jutland, Denmark)
  Relevant papers: A, B, D

Rdatasets (github.com/vincentarelbundock/Rdatasets)
  Whiteside (MASS), OldMaps (HistData), Coalition2 (Zelig), Motor (Boot), Prestige (Car)
  Relevant papers: D

Personality Tests (personality-testing.info/_rawdata/)
  Relationships, Feminism, Assertiveness, Personality 1, Occupation, Humor Styles, Mindfulness, Masculinity/Feminism, Personality 2, Sexual Self
  Relevant papers: C

Pew Research Center (pewforum.org)
  Tolerance and Tension: Islam and Christianity in Sub-Saharan Africa
  Relevant papers: A, B

Table 4.1: Existing real-world data sourced from public repositories

API and Remarks | Relevant Papers

Stack Overflow (api.stackexchange.com)
  Sourced were 1.02 × 10⁶ answers and their corresponding questions between the dates 2010-01-01 and 2012-12-31. The data set consists of 340 × 10³ answers to the most “useful” questions, 340 × 10³ answers to the most “non-useful” questions, and 340 × 10³ answers to questions where the usefulness is “unknown” (zero votes).
  Relevant papers: A, B

Internet Movie Database (imdb.com/interfaces)
  Sourced were 14,690 films, each having genre flags, rating information and filming locations in the form of raw text strings. The data-collection process explicitly excluded titles with the text “TV”. The shooting-locations ontology was generated using the Google Places API (developers.google.com/places).
  Relevant papers: C

Yummly (developer.yummly.com)
  Sourced were the 296 recipes returned from an API search for main-course recipes of American cuisine tagged with the holiday summer. As the official Yummly ingredients ontology is proprietary, we constructed an ontology by hand based on the tractable number (693) of unique ingredients in these 296 recipes.
  Relevant papers: C

Twitter (dev.twitter.com/rest/public)
  Sourced were 1) 14,698 randomly-sampled Twitter user profiles and their five count-based metrics (follower count, status count etc.), 2) the 7,701 tweets made against the hashtag #FathersDay between June 5, 2016 and June 20, 2016, and 3) the 10,901 tweets made against the hashtag #PokemonGO between July 13, 2016 and July 17, 2016.
  Relevant papers: E

Wikipedia (mediawiki.org/wiki/API:Recent_changes_stream)
  Sourced were n > 2 million edits to Wikipedia pages streamed through stream.wikimedia.org (namespace “/rc”) between July 6, 2016 and July 11, 2016. 404,365 of these edits were tagged as non-bot edits.
  Relevant papers: E

GitHub (developer.github.com/v3/)
  Sourced were 100,999 repositories created between January 1, 2008 and December 31, 2015, with their attributes full_name, stargazers_count and forks_count.
  Relevant papers: E

YouTube (developers.google.com/youtube/v3/)
  Sourced were 138,529 videos with region code “US”, type “video”, video type “movie”, published between January 1, 2012 and December 31, 2016.
  Relevant papers: E

Table 4.2: Real-world data sourced from public Application Programming Interfaces.

4.5 Comparison Techniques

The broad data-mining tasks of finding associations, clustering objects and detecting anomalies have seen a large number of publications (Figure 1.2). Many of these publications present novel data-mining methods and detail a corresponding new algorithm. There are hence many hundreds, if not thousands, of published data-mining algorithms in existence.

To appropriately select comparison techniques for each of our proposed algorithms, we began by clearly defining the problem and/or the kind of data sets in focus. In the case of FasTer and Finesse, we found comparable techniques in Asso, Asso^mv, PaNDa, PaNDa+ and GreEss. That is, each searches for k patterns from discrete data that are mixed using non-linear operations (for Asso, Asso^mv, PaNDa and PaNDa+ the mixing semantics are Boolean; for GreEss they are ordinal). In the case of SkinnyDip we found only a single implementation of an algorithm focused on the high-clutter case [DR98], so we extended our search to include six parameter-free clustering algorithms from different paradigms (SkinnyDip is likewise parameter-free, so we argued that this was a sensible property on which to base the selection of comparison techniques). Finally, in the case of BenFound, we selected ten techniques (classical benchmark techniques in addition to state-of-the-art approaches) for numerical anomaly detection, and a set of state-of-the-art techniques specifically for Twitter-based topic and event detection. The majority of the selected techniques were the result of work presented at one of the premier data-mining conferences (Section 4.1). The techniques were primarily identified through periodic reviews of the literature; however, there were also instances in which anonymous peer-reviewers suggested comparison techniques (this was the case for PaNDa+ and Mclust-EM). We additionally thank Pauli Miettinen for making the technique GreEss known to us.

We did not implement the comparison techniques ourselves, but rather obtained existing implementations from the original author or from publicly-accessible repositories. Table 4.3 shows the source for each algorithm implementation that we used in comparing to our work. We implemented “wrapper” scripts where necessary in order to integrate each comparison technique into our experimental testbed.

A subset of the comparison techniques were “parameter-free” in the sense that the only required input was the data set. The remaining techniques had hard parameter requirements. To be fair to the competition, we provided correct values for the parameters in each of these cases. For example, we provided the correct number k = 6 of clusters to the EM algorithm when evaluating it on our “running example” data in Paper D. In the cases where the correct value of a parameter was uncertain (an “obscure” parameter), we varied the parameter over a generous range and executed the technique once for each value. The concrete cases of this were: DBSCAN (Paper D, ε parameter) and Asso/Asso^mv (Papers A, B and C, τ parameter). We subsequently reported the best result from the obtained set.
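The sweep over an “obscure” parameter can be sketched as follows in R, using DBSCAN’s ε as the example; the grid, the iris stand-in data and the use of the Adjusted Rand Index (in place of our AMI metric) are illustrative assumptions.

    library(dbscan)
    library(mclust)  # adjustedRandIndex() as a stand-in external validity index
    X <- as.matrix(iris[, 1:4]); truth <- iris$Species
    grid <- seq(0.1, 1.0, by = 0.1)
    scores <- sapply(grid, function(eps)
      adjustedRandIndex(dbscan(X, eps = eps, minPts = 5)$cluster, truth))
    grid[which.max(scores)]  # report the best result over the generous range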

4.6 Experiments

The evaluation of our proposed methods was primarily empirical. In this section we discuss a number of aspects of the approach taken when performing these experiments.


4.6.1 Evaluation Metrics

In Papers A, B and C we adopted the evaluation metric established by the original work on Boolean and Ordinal Matrix Factorization [Mie09; BK13]. This metric can be considered a “residual”. It is the difference, or “error”, between the original data matrix and the reconstruction. It is measured as the sum of the entry-wise contrasts (in the Boolean case it is simply the sum of false positives and false negatives in the reconstruction, i.e. Equation (3.2)).
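For the Boolean case, this residual can be stated in two lines of R (the toy matrices are for illustration only):

    boolean_residual <- function(D, R) sum(xor(D, R))  # false positives + false negatives
    D <- matrix(c(TRUE, FALSE, TRUE, TRUE), 2, 2)
    R <- matrix(c(TRUE, TRUE, TRUE, FALSE), 2, 2)
    boolean_residual(D, R)  # 2 mismatched entries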

In Paper D we required a metric to compare a computed clustering result against the corresponding set of “ground-truth” cluster labels. We selected Adjusted Mutual Information (AMI_MAX) [VEB09], a state-of-the-art metric that is calculated based on the information-theoretic measures of entropy and mutual information. Although this metric is accepted by the data-mining community, a similar metric known as Normalized Mutual Information (NMI) [Yao03] is likewise used and considered acceptable. For this reason, we additionally presented a subset of our results using the NMI metric (more information in Section 5.1.2).

In Paper E, the results of the real-world experiments on Twitter and Wikipedia data were qualitatively evaluated using supplementary data. In the case of the #FathersDay and #PokemonGO hashtags, we investigated the text of the corresponding tweets. In the case of Wikipedia, we investigated the corresponding editor comments. For the synthetic experiments the evaluation metric was binary: whether the synthetic process was generating Benford or non-Benford data at a given time.

4.6.2 Controlled Experiments

The controlled experiments in Papers A, B, C and D all follow a similar approach. Firstly, a set of default generative-model parameters was selected for generating synthetic data sets. The selection of the default parameters was discussed and justified in each case. Given these default parameters, we systematically performed experiments by varying each parameter individually over a range (keeping the other parameters fixed). This approach of designing and implementing a generative model, based on which systematic experiments are performed, is commonly used in the data-mining research community (e.g. [MV14; YM12; Mie09; Mie+08]). We note that it was not tractable to run experiments for every combination of generative-model parameters (indeed, many parameters were continuous real numbers with theoretically-infinite variability).

For each algorithm under consideration, as well as each generative-model parameter configuration, we generated 20 random data sets using our generative-model tool. Each of the 20 data sets was provided to the algorithm under consideration along with any required algorithm parameters (discussed in Section 4.5). The algorithm was executed independently of all other experiments in each case. We collected 20 results corresponding to the algorithm’s output on each data set. We evaluated each of the 20 results using the selected evaluation metric (Section 4.6.1), generating a set of 20 evaluation-metric values. Finally, we calculated the mean and standard deviation of this set of 20 evaluation-metric values. The mean (first moment) and standard deviation (positive square root of the second central moment) were then used for the plots in the corresponding report. To reduce clutter, we note that we often omitted standard-deviation bars for every second data point in a plot. An example of such a plot is seen in Figure 5.5.
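One “cell” of this experimental grid can be sketched as follows in R; the trivial majority-value baseline stands in for the algorithm under test, and all numbers are illustrative assumptions.

    set.seed(7)
    run_once <- function(n = 500, m = 40, eta = 0.1) {
      D <- matrix(runif(n * m) < 0.3, n, m)          # toy Boolean data set
      D <- xor(D, matrix(runif(n * m) < eta, n, m))  # noise injection
      R <- matrix(mean(D) > 0.5, n, m)               # baseline "reconstruction"
      sum(xor(D, R))                                 # residual metric (Section 4.6.1)
    }
    errs <- replicate(20, run_once())
    c(mean = mean(errs), sd = sd(errs))  # one data point (with error bar) in a plot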

In the case of Paper B, we included for reference an indication of the statistical significance of the difference between our results and those of the comparison techniques. To this end, we used the conservative Wilcoxon signed-rank test [Wil45] to assess whether a significant difference existed between the mean evaluation-metric value of our proposed method and that of a selected comparison technique. The Wilcoxon signed-rank test is non-parametric (no assumption of normality is made). The resulting p-values for a comparison between FasTer and Asso^mv were made available in Figure 7 of Paper B. In all but two cases the results were significant at the α = 0.05 level. In the case of Paper E, our controlled experiments were evaluated visually by comparing the times at which BenFound and each comparison technique detected an anomaly or change point.
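In R, the test reduces to a single call; the per-data-set error values below are fabricated toy numbers purely to show the paired form of the call.

    errors_ours   <- c(12, 15, 11, 19, 14, 13, 17, 16, 12, 18)  # method A on 10 data sets
    errors_theirs <- c(14, 18, 12, 22, 17, 15, 19, 18, 15, 21)  # method B, same data sets
    wilcox.test(errors_ours, errors_theirs, paired = TRUE, exact = FALSE,
                alternative = "less")$p.value  # one-sided: is method A's error smaller?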

4.6.3 Real-World Experiments

A suitable decomposition rank k needed to be selected in order to perform experiments on the real-world data for Papers A and B. Not knowing the “ground truth” model order, we used a heuristic approach. For the Stack Overflow data we ran FasTer for a range of k values. We then selected k = 6 based on the “kink” in the error-k curve, and used this value of k to compare to Asso and PaNDa. For the Africa and Congressional Voting Records data we selected k based on the number of “classes” published with the original data.

For Paper C the situation was similar. The decomposition ranks k = 6 and k = 9 were selected for the Yummly and IMDB data based on the “kink” in the error-k curve (Figure 5.13). For the ten ordinal survey-research data sets, we repeated the experiments for both k = 10 and k = 20. For Paper D we note that the data sets used were “classification-style” data sets, that is, each object included a class label. We thus used the class labels as the “ground truth” reference clustering result during the evaluation of each technique (a common approach adopted by the clustering community). Finally, for BenFound we selected a constant window size of w = 2000 for our real-world experiments on Twitter data (based on a trade-off between maximizing statistical power and the ability to promptly capture system dynamics). For the “red-flag” threshold α we selected the commonly-used value of 0.05.

4.6.4 Run-Time (Scalability) Experiments

For all run-time experiments presented, we consistently used the default generative-model parameters and systematically varied only the scale of the problem (in terms of the number of objects n, the number of dimensions m and the number of ground-truth patterns k). We note that many of the resulting instances required a number of days to “solve”, so it was not tractable to perform 20 repetitions of each experiment. The metric recorded for each experiment was the algorithm’s response time in seconds. The run-time experiments were all performed in isolated “jobs” on the compute cluster at Helmholtz Zentrum München. The compute architecture is based on IBM x3650 M3 nodes, each having two Intel Xeon X5690 6-core processors (3.46 GHz) with hyper-threading enabled, giving 24 virtual cores in total. The serial experiments were performed on a single virtual core. The experiments on parallelization in Papers B and C used up to 24 virtual cores in strong- and weak-scaling scenarios.

4.7 Reproducibility

In the respective online repositories we provide detailed instructions for reproducing all of our published results. Repository links are provided as footnotes in each publication. All data sets that we sourced ourselves, or that are not publicly available, are also provided. In the case of Paper E, reproducibility is “built in” to the LaTeX report source itself using the Knitr approach [Xie15]. That is, compiling the report involves dynamically generating the plots, figures and tables using embedded R code. We advocate this transparent means of conducting research.
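As a small illustration of the Knitr pattern (the chunk below is a hypothetical fragment, not taken from Paper E), an .Rnw report embeds R chunks that regenerate each figure at compile time, e.g. via knitr::knit2pdf("report.Rnw"):

    % report.Rnw (LaTeX with embedded R)
    <<benford-figure, echo=FALSE, fig.height=3>>=
    plot(1:9, log10(1 + 1/(1:9)), type = "h",
         xlab = "leading digit", ylab = "Benford probability")
    @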


Algorithm name | Source | Relevant Papers
Asso | Pauli Miettinen, mpi-inf.mpg.de/~pmiettin/src/DBP-progs/ | A, B, C
Asso^mv | Pauli Miettinen (email request) | A, B
GreEss | Radim Belohlavek (email request) | A, B, C
PaNDa/PaNDa+ | Claudio Lucchese, hpc.isti.cnr.it/~claudio | A, B, C
NMF | Jean-Philippe Brunet, portals.broadinstitute.org | A, B, C
MMMF | Jason Rennie (email request) | A, B
NMI (evaluation metric) | Nguyen Xuan Vinh, sites.google.com/site/vinhnguyenx/publications | D
k-Means, Single Link, Complete Link Clustering | R base package | D
EM Clustering | R EMCluster package | D
Mclust-EM | R mclust package | D
DBSCAN | R dbscan package | D
STSC | L. Zelnik-Manor and P. Perona, vision.caltech.edu | D
DipMeans | Argyris Kalogeratos, kalogeratos.com/psite/material/dip-means/ | D
PgMeans | Greg Hamerly (email request) | D
RIC | Christian Böhm (email request) | D
Sync | Junming Shao (email request) | D
FOSSCLU | Sebastian Goebl (email request) | D
eAgglo, eDiv, cp3o | R ecp package | E
pDPA | R cghseg package | E
BinSeg | R changepoint package | E
Twitter Anomaly Detection | Twitter, github.com/twitter/AnomalyDetection | E
Twitter Breakout Detection | Twitter, github.com/twitter/BreakoutDetection | E
EGADS | Yahoo!, github.com/yahoo/egads | E
Extreme Values | R extremevalues package | E
Twitter NLP | Alan Ritter, github.com/aritter/twitter_nlp | E
Twitter Topic Detection (Streaming NMF) | Kohei Hayashi (email request) | E

Table 4.3: Sources for implementations of comparison algorithms


5 Results and Discussion

5.1 Summary of Findings

In the following two subsections we discuss our results from two viewpoints. First and foremost, we consider our overall results with respect to the challenges and goals that were laid out in Sections 1.3 and 1.4. Secondly, as some of the publications on which this dissertation is based were space-constrained to an extent, we elaborate for the sake of completeness on each publication’s results and arguments separately.

5.1.1 Reflecting on our Challenges and Goals

The goals of this thesis were driven by the community-identified “top challenges” in data mining [YW06] (Sections 1.3 and 1.4). In the following subsections we reflect on how we have addressed these challenges and goals through our work.

Challenge 1: Developing a Unified Theory of Data Mining

The first challenge was related to the arguably “ad-hoc” nature of data mining. Although conceding that the development of a unified theory of unsupervised, exploratory data mining would be difficult, we embraced the elements of what such a direction would entail with the motto induce, deduce and reduce. Our first goal in this thesis was thus to contribute to the induction of general frameworks, contribute to the deduction of further useful applications, contribute to the reduction of the number of assumptions made in data-mining approaches, and contribute to the reduction in the need for obscure parameters.

We have shown how this goal can be met for a practical set of unsupervised, exploratory data-mining tasks. Specifically, the induction of a general framework was successfully completed through the culmination of Papers A, B and C. The framework is named MDFS, and generalizes the problems of Boolean Matrix Factorization, Ternary Matrix Factorization and Ordinal Matrix Factorization. Each of these specialized problems is provably reducible to an instance of MDFS, and its algorithm Finesse was shown to outperform state-of-the-art techniques on many instances of the special cases in terms of both effectiveness and efficiency. Based on the general MDFS framework, we deduced additional applications of practical use (in particular, a feature based on tree objects or “itemsets over an ontology”).

The reduction of commonly-made assumptions in data-mining approaches was demonstrated in Papers D and E. Specifically, the assumption that a multivariate distance measure is required when clustering vector data was questioned by our contribution SkinnyDip. SkinnyDip is a unique approach to clustering that exploits the dip test of unimodality on systematically-selected univariate projections of the data set, thereby not requiring a multivariate distance measure for clustering objects in a multidimensional vector space. Furthermore, the assumption that anomaly-detection techniques require the absolute values of the sample as an input was questioned by BenFound, which showed that anomalous behavior can be detected with just leading-digit (or mantissa) information.
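The primitive that SkinnyDip builds on is readily illustrated in R with the diptest package; the one-dimensional sample below (two modes buried in uniform clutter) is an illustrative assumption.

    library(diptest)  # provides dip.test(), Hartigans' dip test of unimodality
    set.seed(8)
    x <- c(rnorm(300, -2, 0.3), rnorm(300, 2, 0.3), runif(1400, -6, 6))
    dip.test(x)$p.value            # tiny p-value: significant multimodality
    dip.test(runif(2000))$p.value  # unimodal (uniform) sample: no evidence of modes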

Finally, a reduction in the need for “obscure” parameters was demonstrated by all of our methods. The FasTer and Finesse algorithms require the parameter k, but do not require a threshold for association confidence as used by Asso. The SkinnyDip algorithm requires a threshold α for statistical significance, but requires neither the number k of clusters, nor a density threshold, nor a strict assumption about the form of the clusters. The BenFound algorithm for anomaly detection in time-series data requires a window width w and a threshold α for statistical significance, but requires no parameterized model from which the measurements are assumed to be taken (the Benford distribution involves no parameters).

Challenge 2: Scaling Up for High-Dimensional Data and High-Speed Data Streams

The second challenge was related to the curse of dimensionality, and additionally to the need for efficient and scalable algorithms. We subscribed to a set of “guiding principles” to address this challenge. Based on these guiding principles, our second goal in this thesis was to 1) consider mitigations for the curse of dimensionality in our work, 2) strive for algorithms that have a practically linear run-time complexity in the size of the data, and 3) use algorithmic paradigms that lend themselves well to parallelization.

Again we showed how this goal can be met for unsupervised, exploratory data-mining tasks. Specifically, we presented with SkinnyDip and SparseDip a novel mitigation for the curse of dimensionality in the context of clustering vector data. SparseDip searches for a subspace with coordinate directions that are maximally multimodal, which we argued to be a sensible choice from a clustering perspective. In this way, we are able to focus the actual SkinnyDip clustering on a low-dimensional view of the data, and help to mitigate the various negative effects otherwise faced when clustering in high-dimensional spaces (see Section 1.3.2).

The aim of developing linear-time algorithms was met in all of our work. FasTer, Finesse, SkinnyDip and BenFound each have a linear run-time complexity in the size of the input data (in the case of BenFound the input is the set of measurements made in one time window).

Finally, the ability to parallelize was demonstrated with our TMF algorithm FasTer and our MDFS algorithm Finesse. We showed that we can achieve near-ideal speedup in weak- and strong-scaling scenarios with up to 10 processors.

Challenge 3: Mining Time Series Data

The third challenge was related to the extraction of knowledge from temporal data with trends, seasonality and noise. Our third goal in this thesis was thus to develop approaches that could be deployed in real time for high-bandwidth time-series data. More precisely, we posed the question of whether or not it was possible to define an anomaly-detection approach that focused, in a non-parametric way, on what is “(un)natural”.

Again we showed how this goal can be met with a novel contribution. Specifically, we presented with BenFound an online, linear-time algorithm for detecting significant deviations from the “natural” state of a system. We identified in Benford’s Law an intriguing notion for measuring the “authenticity” of signals from many application domains. By monitoring the conformity of the measured signal to the law, BenFound can hence raise a red flag when the signal deviates significantly in an “unnatural” way.
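The reference distribution itself is a one-liner, which is what makes the approach parameter-free; the following R sketch checks one window of (synthetic) measurements against it using a chi-squared conformity test as a simple stand-in for BenFound’s actual test statistic.

    benford_p <- log10(1 + 1 / (1:9))  # P(leading digit = d), d = 1, ..., 9
    leading_digit <- function(x) floor(abs(x) / 10^floor(log10(abs(x))))
    set.seed(9)
    window <- abs(rnorm(2000)) * 10^sample(0:5, 2000, replace = TRUE)  # synthetic window
    observed <- tabulate(leading_digit(window), nbins = 9)
    chisq.test(observed, p = benford_p)$p.value  # red flag if below the threshold alpha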

Challenge 4: Mining Complex Knowledge from Complex Data

The final challenge was related to “complex” data in the form of high noise, and heterogeneous features measured over fundamentally different scales. The final goal of our thesis was thus to advance the data-mining state-of-the-art by extracting interpretable knowledge from such complex data sets.

Again we showed how this goal can be met for unsupervised, exploratory data-mining tasks. Specifically, we presented with the Matrix Factorizations over Discrete Finite Sets (MDFS) framework an approach which supports interpretable Blind-Source Separation for data with features measured over heterogeneous scales. We showed how complex knowledge can be extracted using this framework and its linear-time algorithm Finesse.

The aim of remaining robust to noise was well met by our contributions. We presented systematic experiments for FasTer, Finesse and SkinnyDip in which the amount of noise was varied over a non-trivial range. In each case, our empirical evaluation showed that our algorithms stand up well to the state-of-the-art with respect to varying levels of noise. SkinnyDip, in particular, was shown to be highly robust to noise. It is able to extract meaningful clusters in scenarios where the overwhelming majority (e.g. 80%) of data points belong to a global “clutter” distribution.


5.1.2 Elaboration on the Specific Results from Papers A to E

In the following subsections we summarize and elaborate on the results of Paper A to Paper E separately.

Paper A: Ternary Matrix Factorization

TMF Shown to be Widely Applicable: We argued that measurements made on the nominal scale of ternary logic are often found in information systems. We listed three practical applications in our work. The first practical application corresponds to Boolean data with missing (lost or indeterminable) values, which we termed the Missing-Value Boolean Matrix Factorization problem (MVBMF). Approaching this problem using TMF can serve two purposes: 1) finding latent descriptive patterns in the data, and 2) imputing the missing values (“filling in the blanks”).

Missing values may arise in various ways [VS08]. The Missing Completely at Random (MCAR) case that we considered can be found when data collection is done improperly, mistakes are made in data entry, or data is damaged or corrupted.

A concrete example of TMF’s application to MVBMF problems is found in an independent study by Philips Research Europe and the University of Technology in Eindhoven [Vav+]. The authors referenced our work, investigating the TMF problem (as published in this paper) in the context of organizational Role-Based Access Control (RBAC) with missing values. In this context, “roles” need to be mined to help organizations that want to migrate to RBAC from low-level permission-based access systems. The full permissions database is seldom available directly [Vav+], so the set of permissions assigned to users is often obtained by analyzing the system actions they perform. These logs are typically incomplete, hence MVBMF. Importantly, it is typical that a user can take on multiple roles, hence the RBAC problem is best treated with logical mixing semantics (TMF) rather than the classical linear mixing semantics used by techniques from linear algebra. We discuss this study further in Section 5.3.

The second practical application of TMF relates to data in which the logical proposition of unknown has a meaning other than “missing”. In Paper A we identified two such cases with wide applicability: explicit Don’t Know responses in questionnaires [Poe+88; FB75], and null values in application databases [Zan82; Bis81].

TMF Shown to Subsume BMF: The third practical application of TMF relates to the fact that TMF subsumes BMF. Each instance of BMF consists of a data matrix D ∈ B^{n×m} and a positive integer k. An instance of our introduced TMF problem consists of a data matrix D ∈ T^{n×m} and a positive integer k. We presented the reduction from BMF to TMF considering that 1) B ⊂ T, 2) the truth table for ternary logic subsumes that of Boolean logic, and 3) the set of possible TMF factor matrices is a superset of all possible BMF factor matrices. We showed therefore that the applications of TMF implicitly include all the applications of BMF (e.g. the students-courses example from Section 3.1, or the Role-Based Access Control problem without missing values [KSS03]).

FasTer Introduced to Solve TMF: We presented FasTer, a heuristic-based algorithm for approximating the solution to instances of the TMF problem in O(k²nm) time. At a high level the approach taken by FasTer is a “leapfrog” one: it solves first for U whilst keeping B fixed, then solves for B whilst keeping U fixed. This process is iterated for as long as the error continues to reduce. Termination is guaranteed because the TMF objective function has only a finite number of possible values.
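Structurally, the alternation can be sketched as the following R skeleton; the inner solvers are placeholders here, since FasTer’s actual update rules are detailed in Paper A.

    leapfrog <- function(D, U, B, solve_U, solve_B, error) {
      err <- error(D, U, B)
      repeat {
        U <- solve_U(D, B)         # refine the usage matrix with B fixed
        B <- solve_B(D, U)         # refine the basis matrix with U fixed
        new_err <- error(D, U, B)
        if (new_err >= err) break  # discrete objective: guaranteed to terminate
        err <- new_err
      }
      list(U = U, B = B, error = err)
    }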

FasTer is non-deterministic because it begins with a stochastic initialization of B. That is, a single row vector from D is selected at random to form the first basis vector in B, after which k − 1 of the most “diverse” observations with respect to it are selected from D for the remaining initial basis patterns in B. We noted that rigorous arguments supporting this kind of initialization approach are given in [ÇM09; BMD09; TKB12].

We presented the pseudocode for FasTer and made an optimized, documented C++ implementation available for download and re-use. The implementation has been successfully used in one independent study to date [Vav+].

FasTer Shown to be More Effective than the State-Of-The-Art: Our empirical analysis of FasTer involved comparing it to the state-of-the-art algorithms for discrete matrix factorizations (Asso, Asso^mv, PaNDa and GreEss) on TMF, BMF and MVBMF problems. For reference, we also compared to the linear-algebra (“real-valued”) techniques Singular Value Decomposition, Non-negative Matrix Factorization and Maximum-Margin Matrix Factorization (MMMF, see Section 3.2.2). Our generative model included parameters that enabled us to systematically vary the number of “ground truth” source patterns k, their density (ρt and ρu), the density α of the matrix U, and the percentage η of randomly-injected noise. Inspired by the original work on BMF [Mie09], we selected sensible defaults for each of these parameters.

The results showed that FasTer outperformed Asso, Asso^mv, PaNDa and GreEss nearly consistently on these data. In the case of TMF, the quantitative improvement that FasTer offered over Asso and PaNDa was substantial. One reason for this is that Asso and PaNDa cannot directly solve TMF problems. That is, to compare to Asso and PaNDa, we first needed to encode the ternary data matrix into binary format. To achieve this, we mapped each ternary value to a binary triple: f ↦ (0, 0, 1), u ↦ (0, 1, 0), t ↦ (1, 0, 0), thereby tripling the matrix dimension m. Although at first seeming to be a sensible choice, this encoding is in fact unfair to both Asso and PaNDa. Specifically, it results in non-equivalence between the BMF and TMF optimization goals (see Paper B).

The TMF empirical analysis in Paper A therefore evaluated Asso and PaNDa using an objective function that was different to the one that Asso and PaNDa believed they were solving. A fair encoding was later detailed in Papers B and C. The experiments in Papers B and C in turn made use of this fair encoding (the results still showed that FasTer is superior on such data).
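The (unfair) one-hot encoding itself is straightforward; the following R sketch maps a ternary character matrix to the tripled 0/1 matrix.

    encode_ternary <- function(D) {  # D: character matrix over {"f", "u", "t"}
      triples <- list(f = c(0, 0, 1), u = c(0, 1, 0), t = c(1, 0, 0))
      t(apply(D, 1, function(row) unlist(triples[row])))  # n x 3m 0/1 matrix
    }
    encode_ternary(matrix(c("t", "u", "f", "t"), nrow = 2))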

In the case of MVBMF, we observed that FasTer consistently outperformed Asso^mv on data sets with up to 99% missing values. Like Asso, the Asso^mv variant responds negatively to an increase in the density ρt of the “ground truth” basis patterns. As is generally the case, all algorithms produced poorer-quality results for an increasing decomposition rank k. Informally, the decompositions clearly become “more difficult” for increasing k. More formally, we see that the approximation factor increases with respect to k for the greedy algorithm that solves the ±PSC covering problem (see Section 3.2.1, noting that FasTer solves similar problems in a greedy way during each of its iterations).

In the case of BMF, we observed that Asso is the closest competitor; it produced FasTer-similar results for varying rank k, noise η and usage-matrix density λ. Asso and PaNDa, however, both yielded poorer results when the density of the “ground truth” patterns increased, whereas FasTer remained relatively stable in this respect. In the case of Asso, the reason for this sub-optimal behavior is given in [Mie+08, p. 1354]: “If for all i so that b_pi = 1 there is q so that also b_qi = 1, then we cannot find row b_p· from A”. In our context, as we increased the density of the “ground truth” patterns, the probability of basis vectors sharing t values for any given column increased. The Asso heuristic, based on computing the association confidences between columns (Section 3.2), is less effective in such cases.

The real-valued techniques SVD and NMF consistently outperformed all discrete techniques on the BMF problems with respect to the objective function (Equation (3.2)). This was expected, and echoes the results seen in [Mie09]. Even though SVD and NMF use the arithmetic mixing operators (normal addition and multiplication), they have a higher degree of freedom for selecting values in their factor matrices. In the case of SVD, the algorithm can also solve optimally with respect to its objective function. For the MVBMF experiments, the results from the MMMF method painted a similar picture: MMMF is likewise based on a real-valued decomposition and uses the classical matrix product. Of course, SVD, NMF and MMMF are not a fair comparison in our context because their factors have clear interpretation weaknesses in the context of the domain (see Section 3.1). The error with respect to the objective function therefore fails to tell the whole story.

To get a better idea of the whole story, we compared FasTer to Asso and PaNDa on three real-world ternary data sets. In each case, FasTer outperformed Asso and PaNDa in terms of the reproduction error. Additionally, we qualitatively analyzed FasTer’s results in the context of the domain, offering intuitive interpretations for the discovered knowledge. We also quantitatively compared FasTer to Asso^mv on ten real-world Boolean data sets (into which we injected up to 99% missing values). FasTer consistently outperformed Asso^mv on these data.

Finally, we note that FasTer often outperformed NMF and SVD on the TMF problems. In this case, two arbitrary schemes were selected to map the ternary values in the data to real values for use with NMF and SVD. Clearly it is not possible to represent the results of the ternary truth tables using classical arithmetic operations on these values, so the results for both NMF and SVD can be attributed to their difficulties in trying to separate the ternary sources using linear mixing mechanisms only.

FasTer Shown to be More Efficient than the State-Of-The-Art: Our theoretical time-complexity analysis of the FasTer algorithm showed that it has a worst-case run-time complexity in O(nmk²). This result was based on the assumption that the number of high-level “refine-and-alternate” iterations is independent of n, m and k. We verified this assumption empirically. We also noted that the theoretical run-time dependency on k would in fact have been cubic, had we not exploited bitwise operations to reduce a critical calculation from O(k) to O(1) operations.

We executed experiments to compare the run-time of FasTer against the run-times of Asso, PaNDa and GreEss. These experiments confirmed the run-time growth of FasTer to be in O(nmk²). Both Asso and PaNDa have linear run-time complexity in n, but quadratic time complexity in m (in agreement with the respective theoretical analyses). In summary, FasTer’s run-time grows linearly with the size of the data; that of Asso and PaNDa is super-linear. GreEss is the most expensive (quadratic growth in n, cubic growth in m). Finally, we presented results showing how the global (user-supplied) randomization-round count affects the quality of the results on problems of varying size (n, m and k).

Paper B: Ternary Matrix Factorization: Problem Definitions and Algorithms

Approximation Factor Proven for a TMF Sub-Problem Under Certain Conditions: A primary contribution of Paper B was the investigation of the ±PSC problem in the context of TMF. Specifically, we observed that the Ternary Usage Problem (TUP, the problem that solves for each row of U during any given FasTer iteration when B is fixed) exhibits similarities to the ±PSC problem. We posed the question: can we reduce TUP to ±PSC? If this could be shown to be possible, we could express each TUP problem as a ±PSC problem and use the algorithm from [Mie08a] with a known approximation factor (see Section 3.2.1).

We proved in this paper that the answer depends on the form of our contrast function ⊖T, which in turn depends on how the semantics of the ternary logical propositions are interpreted for a given application. For one concrete case, where the ternary propositions are understood to be ordinal in nature (like a Likert item) and the proposition u is “between” f and t, the reduction succeeds. For the other common case, where the ternary propositions are understood to be equally different from one another, we provided a proof that a reduction is not possible (Theorem 3 in Paper B).

FasTer±PSC Introduced for Solving TMF: Based on the successful reduction from TUP to ±PSC in one case, we introduced FasTer±PSC as a FasTer variant. Instead of the original greedy heuristic, FasTer±PSC uses the ±PSC algorithm from [Mie08a] to solve each TUP instance. The provable approximation factor suggested that the effectiveness of FasTer±PSC would be greater than that of vanilla FasTer.

The use of the ±PSC algorithm, however, came at the cost of scalability. Specifically, our theoretical analysis showed that FasTer±PSC has a worst-case run-time complexity of O(k²nm(m + k) log(m + k)), compared to the worst-case run-time complexity of vanilla FasTer of O(k²nm).

FasTer±PSC Shown to Outperform FasTer (Effectiveness) Under Certain Conditions: We showed that FasTer±PSC can give very competitive results, significantly outperforming vanilla FasTer, particularly for small-to-moderate values of k. For k = 12, 14, 16, for example, the reconstruction error was very close to the error attributable to the noise. Said differently, FasTer±PSC achieved near-optimal decompositions for the default parameter values.

Unfortunately, FasTer±PSC yielded to vanilla FasTer for large k and large ρt (the density of true values in the “ground-truth” basis matrix B). This behavior was attributed to the use of the ±PSC algorithm, the approximation factor of which is known to grow with both k and ρt (see Section 3.2.1).

Parallelizing FasTer Delivered Promising Speedup Results: Another primary contribution of Paper B was the implementation of a parallel version of FasTer. Using empirical evidence of “diminishing returns”, we first argued that the use of parallelization for improving the TMF solution accuracy (through more initialization rounds) is inefficient. We also found memory usage to be of secondary concern (the storage of the problem data structures can exploit sparsity, and FasTer requires neither floating-point arithmetic nor Asso-style large intermediate storage). It was thus decided to exploit the shared-memory multiprocessor programming API OpenMP [Boa08] and focus on strong- and weak-scaling scenarios.

After identifying the elements of FasTer that were good candidates for parallelization, we discussed the advantages and disadvantages of the various scheduling strategies: static, dynamic and guided. We presented speedup and efficiency results in strong- and weak-scaling scenarios. To this end, both the processor (virtual core) count and the scheduling strategy were varied. Speedup and efficiency were shown to be near-ideal for up to approximately eight processors. At 22 processors the speedup reached a maximum of approximately 13. Overall, we did not witness a large variation in speedup and efficiency between the scheduling strategies. We did, however, witness super-linear speedup for one case of static scheduling. This result was not particularly surprising – we subsequently clarified the low-level cache-based mechanisms that can lead to such an effect when static scheduling is in place.
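To make the role of the scheduling clause concrete, consider the following minimal, self-contained C++/OpenMP sketch. It is not the thesis implementation: the loop body is a hypothetical stand-in for a per-row TUP solve, and only the #pragma line (where static, dynamic or guided is chosen) is the point of the example.

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1 << 20;                  // number of usage rows (toy value)
    std::vector<double> cost(n, 0.0);

    // schedule(static) assigns equal contiguous chunks of rows up front;
    // schedule(dynamic) and schedule(guided) hand out chunks at run time,
    // which helps when the per-row work is uneven.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int j = 0; j < 64; ++j)        // stand-in for one TUP row solve
            acc += (i ^ j) % 7;
        cost[i] = acc;
    }

    std::printf("max threads: %d, cost[0] = %.1f\n",
                omp_get_max_threads(), cost[0]);
    return 0;
}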

Paper C: Factorizing Complex Discrete Data “with Finesse”

BMF, TMF, OMF and Beyond (MDFS as a Unifying Framework): Perhaps the primary contribution of Paper C was the development of a general matrix factorization problem that subsumes BMF, TMF and OMF, and provides a formal structure for additional decompositions of this type (helping to address Challenge 1).

After reviewing the problem definitions of BMF, TMF and OMF, we highlighted their similarities and began pursuing a generalization. One question in particular remained a hurdle: what should be the structure and purpose of the usage matrix? At this stage it is warranted to explain the intuition behind our eventual decision to recommend a Boolean usage matrix for the general case.

In Figure 3.2 (Section 3.1) the interpretation of the BMF usage matrix is clear – a row indicates which basis patterns are “mixed” (disjunction) to explain the corresponding data-set observation. The OMF example from Figure 3.3 is also readily understood – the largest scale value is the indicator value in this case.

In the seminal OMF article [BK13] it is implicitly argued that the OMF usage matrix (like that in Figure 3.3) should not restrict entries to the two extremes (in a binary fashion), but rather allow values from the full ordinal scale. The OMF multiplication operator $\otimes_L$ is used to this end. For completeness we intended to support this kind of relaxation in our generalization; however, it is worthwhile to briefly discuss it here from an interpretability perspective.

Consider a synthetic data set over the five-element ordinal scale $\{0, 1/4, 2/4, 3/4, 1\}$. Such a scale could correspond to the format of a typical Likert item in a questionnaire (e.g. 0 = strongly disagree, 1/4 = disagree, 2/4 = neutral, 3/4 = agree, 1 = strongly agree). For illustrative purposes we assume the data set has n observations and four attributes. The four elements in a row represent a respondent's answers to four Likert items. We assume that the data set has an exact OMF decomposition of rank k = 3. The ordinal usage row vectors u1·, . . . , un· contain entries from the same five-element ordinal scale and prescribe the recipe for “mixing” the ordinal basis row vectors b1·, b2· and b3· to form each data set observation d1·, . . . , dn·:

$$
\underbrace{\begin{pmatrix}
1/4 & 1/4 & 2/4 & 3/4 \\
\vdots & & & \\
0 & 0 & 0 & 0
\end{pmatrix}}_{d_{1\cdot},\,\ldots,\,d_{n\cdot}}
=
\underbrace{\begin{pmatrix}
1 & 3/4 & 1/4 \\
\vdots & & \\
1/4 & 2/4 & 3/4
\end{pmatrix}}_{u_{1\cdot},\,\ldots,\,u_{n\cdot}}
\otimes_L
\underbrace{\begin{pmatrix}
0 & 1/4 & 2/4 & 3/4 \\
2/4 & 1/4 & 2/4 & 0 \\
1/4 & 0 & 1/4 & 1/4
\end{pmatrix}}_{b_{1\cdot},\,b_{2\cdot},\,b_{3\cdot}}
$$

Our focus here is the last data observation: how does the decomposition explain dn·? From the last usage vector un· we see it explained by “a lower agreement with b1·, mixed with a moderate agreement with b2·, and mixed with a higher agreement with b3·”. Based on the multiplication operator suggested in [BK13], specifically Łukasiewicz's strong conjunction connective $(a \otimes b) = \max(0,\, a + b - 1)$, we mathematically understand the “strongly disagree” result (dn· = (0, 0, 0, 0)).
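The arithmetic can be verified mechanically. The following self-contained C++ sketch (our own illustration, encoding the five scale values as 0, 1/4, 2/4, 3/4, 1 as above) combines un· = (1/4, 2/4, 3/4) with the three basis rows using the Łukasiewicz conjunction and max-“mixing”, and indeed produces the all-zero row dn·:

#include <algorithm>
#include <array>
#include <cstdio>

int main() {
    // Usage row u_n = (1/4, 2/4, 3/4) and basis rows b_1, b_2, b_3 from above.
    const std::array<double, 3> u = {1.0/4, 2.0/4, 3.0/4};
    const double B[3][4] = {{0.0,   1.0/4, 2.0/4, 3.0/4},
                            {2.0/4, 1.0/4, 2.0/4, 0.0  },
                            {1.0/4, 0.0,   1.0/4, 1.0/4}};

    for (int j = 0; j < 4; ++j) {
        double d = 0.0;  // "mixing" via max starts at the scale minimum
        for (int l = 0; l < 3; ++l) {
            // Lukasiewicz strong conjunction: (a (x) b) = max(0, a + b - 1)
            d = std::max(d, std::max(0.0, u[l] + B[l][j] - 1.0));
        }
        std::printf("d_n[%d] = %g\n", j, d);  // prints 0 for every attribute
    }
    return 0;
}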

Although we stress that this is a pathological example, it does raise the question of whether or not this interplay of two non-linear operations (one for “mixing” and one for “weighting”) is perhaps too esoteric or unintuitive for practitioners to grasp. In this work we hence focused primarily on the OMF variant which restricts the usage matrix entries to be binary. The ternary case (TMF) also works along these lines: the problem definition stipulates that the usage matrix should be binary, based on the justification that “unknown” propositions in the usage matrix may be difficult to comprehend (see Paper A). Finally, such a restriction also offered us a bonus: it enabled us to handle heterogeneous data sets (Challenge 4), which is another key advantage of our framework.

With all this in mind, the pieces were in place to introduce our framework, named Matrix Factorizations over Discrete Finite Sets (MDFS). We presented the formal requirements for the types of data that are supported by the framework. Specifically, a feature is supported by our framework if it satisfies our definition of being a “mixable” feature. A “mixable” feature must be one that is measured over a finite set of values and one that admits a particular algebraic structure. This algebraic structure must have a binary, closed “mixing” operator (the analog to arithmetical addition) and a contrast function for measuring (dis)similarity between members of the set. Finally, each feature must include an indicator for whether or not it supports an analog to multiplication (e.g. Boolean conjunction), provide that binary function if so, and also provide a mechanism for generating new candidate values during the search for patterns (in the trivial case this could enumerate all items of the corresponding set).
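As a rough illustration only (the publicly-available implementation defines its own software interfaces, so every name below is hypothetical), these requirements could be captured in an interface of the following shape, shown here with the Boolean feature of BMF as a concrete instance:

#include <cstdio>
#include <vector>

// Hypothetical sketch of a "mixable" feature over a finite value set T.
template <typename T>
struct MixableFeature {
    virtual ~MixableFeature() = default;

    // Binary, closed "mixing" operator: the analog of arithmetical addition
    // (e.g. Boolean disjunction, ordinal max, subtree union).
    virtual T mix(const T& a, const T& b) const = 0;

    // Contrast function: (dis)similarity between two members of the set.
    virtual double contrast(const T& a, const T& b) const = 0;

    // Optional analog of multiplication; multiply() is only consulted
    // when supportsMultiplication() returns true.
    virtual bool supportsMultiplication() const { return false; }
    virtual T multiply(const T& a, const T&) const { return a; }

    // Generator of candidate values during the pattern search; in the
    // trivial case this enumerates the whole value set.
    virtual std::vector<T> candidates() const = 0;
};

// Concrete instance: the Boolean feature underlying BMF.
struct BooleanFeature : MixableFeature<bool> {
    bool mix(const bool& a, const bool& b) const override { return a || b; }
    double contrast(const bool& a, const bool& b) const override {
        return a == b ? 0.0 : 1.0;
    }
    bool supportsMultiplication() const override { return true; }
    bool multiply(const bool& a, const bool& b) const override { return a && b; }
    std::vector<bool> candidates() const override { return {false, true}; }
};

int main() {
    BooleanFeature f;
    std::printf("mix(t, f) = %d, contrast(t, f) = %.0f\n",
                f.mix(true, false), f.contrast(true, false));
    return 0;
}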

Additional Features Deduced for the MDFS Framework: MDFS is a unifying framework for Blind-Source Separation on “mixable” discrete data. Compared to BSS techniques from linear algebra, MDFS uses factorizations employing a specialized matrix product that respects the “mixing” semantics of each data type. MDFS subsumes BMF, TMF and OMF, which each use a specialized matrix product for their respective target data types.

Having induced a general framework, we were then able to ask the question: Can we deduce further data types that satisfy our requirements for a “mixable” set? A further contribution from this paper was the investigation of a new feature type that lends itself well to this kind of decomposition. The feature could be called “itemsets over an ontology”, or simply a “tree” feature.

We find such features in many modern-day information systems. In our paper we presented the real-world example of recipe data from yummly.com. For any given recipe, Yummly measures information over various measurement scales. The “flavor” flags, for example, are Boolean, and the “rating” measurement is an ordinal one. Additionally, each recipe includes a set of ingredients, and these ingredients are nodes in Yummly's ingredients ontology (Figure 5.1). This is a concrete example of an “itemsets over an ontology” or “tree” feature.

Figure 5.1: A basic culinary ingredients ontology (a tree rooted at “Ingredient”, with branches such as Vegetable → Bulb → Onion → {Shallot, Brown, Scallion}, Meat, and Dairy → Cheese → Brined → {Feta, Sirene, Svecia}). The bold sub-tree represents (the recipe with) the ingredients list scallions, feta, sirene and butter.
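One natural way to realize such a “tree” feature (an illustrative assumption on our part, not necessarily the encoding used in the thesis prototype) is to represent each value as the set of ontology nodes it covers, with set union as the closed “mixing” operator and the size of the symmetric difference as a simple contrast function:

#include <cstdio>
#include <set>
#include <string>

using Subtree = std::set<std::string>;  // ontology nodes covered by a value

// "Mixing" two subtrees of the ontology: set union (binary and closed).
Subtree mix(const Subtree& a, const Subtree& b) {
    Subtree out = a;
    out.insert(b.begin(), b.end());
    return out;
}

// A simple contrast function: size of the symmetric difference.
std::size_t contrast(const Subtree& a, const Subtree& b) {
    std::size_t diff = 0;
    for (const auto& n : a) diff += b.count(n) == 0;
    for (const auto& n : b) diff += a.count(n) == 0;
    return diff;
}

int main() {
    // The bold sub-tree of Figure 5.1 (each ingredient implies its ancestors).
    Subtree recipe = {"Ingredient", "Vegetable", "Bulb", "Onion", "Scallion",
                      "Dairy", "Cheese", "Brined", "Feta", "Sirene", "Butter"};
    Subtree basis  = {"Ingredient", "Dairy", "Butter"};
    std::printf("|mix| = %zu, contrast = %zu\n",
                mix(recipe, basis).size(), contrast(recipe, basis));
    return 0;
}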

By factorizing this heterogeneous data matrix using MDFS, we were able to expose latent sources that led to useful information in the domain context. For example, the basis matrix included associations of the form: “Recipes like pies and quiches that include meat and dough products are often associated with a high rating and a long preparation time”, or “recipes with chili are often associated with a piquant flavor”.

Finesse Introduced to Solve MDFS: We presented Finesse, a heuristic-based algorithm for approximating the solution to instances of the MDFS problem in $O(k^2 nm |F_l|^2)$ time (where $F_l$ is the discrete feature set having the largest cardinality). At a high level the algorithmic paradigm is similar to that taken by FasTer (“leapfrog” or “alternate and iterate”). However, Finesse no longer dictates a strict data type and strict mixing semantics. It handles features in an abstract way, enabling concrete extensions to additional data types that satisfy our formal requirements for a “mixable” set. It thus supports input data sets with features measured over a variety of discrete measurement scales. To this end, the relevant software interfaces can be implemented. Our publicly-available implementation includes out-of-the-box support for Boolean, ternary, ordinal and tree features.

Finesse Shown to Find Intuitive Patterns on Large-Scale, Heterogeneous Data: In addition to the aforementioned Yummly example, we further illustrate Finesse's real-world applicability, and in particular its scalability, through a novel investigation of the popular imdb.com (Internet Movie Database) movie data here. In particular, we are curious about film shooting locations, and ask how they associate with genres, ratings and a technical widescreen attribute.

Whilst geographical places (suburbs, political states, countries and so on) are related through a hierarchy, the IMDB data files only contain simple text strings for each film's shooting locations. The Shawshank Redemption (1994), for example, is listed as having had one scene shot in St. Croix, U.S. Virgin Islands. To prepare the data for MDFS, we used Google's Places API (developers.google.com/places) for extracting the hierarchical geographical components associated with these strings. We generated the 82646-node “filming locations ontology” from this information. Note that one could argue that the requirement of an ontology is a disadvantage of our technique; however, it is also clear that many applications have the ontology for their data ready-made (e.g. Yummly uses an ingredients ontology for improving the user's search experience on their website).

Each row vector in the IMDB data matrix represents a single movie over a number of features, the first being an ontology-itemset feature for the shooting locations. Analogously to the aforementioned ingredients example, a matrix value in the first column is a subtree of the 82646-node locations ontology (for example, the subtree encapsulating various neighborhood shooting locations in Paris and New York). A number of Boolean features follow, recording the presence or absence of genre tags like “Action” and “Romance”. Rating information is next – here IMDB measures on a scale of 1.0 to 10.0 with increments of 0.1, so this feature is ordinal with |L| = 91. Finally, a ternary feature is used for the widescreen attribute as it contains missing (unknown) values which we wish to impute. This heterogeneous data set (together with the filming locations ontology) is in the public repository for reuse.

The repository also contains the results for 3 ≤ k ≤ 16. Figure 5.2 shows the rank-nine (k = 9) decomposition, again selected for interpretation because it corresponds to the “elbow” point on the error-k curve (Figure 5.13). This decomposition of the n = 14690 titles (titles including “(TV)” were explicitly excluded) for which rating, location and genre information was available required approximately two hours running on a single virtual core (Intel Xeon X5690 3.46 GHz). Despite the data set being larger than the Yummly example, the reconstruction error was lower (6%) due to the relative sparsity of the genre features. This IMDB example also illustrates that it is not uncommon to find large real-world ontologies (82646 nodes here) where Finesse's scalable approach is more tractable than the use of binary encoding and a BMF technique: a simple binary encoding of the location feature alone would here give $n \cdot m = 14690 \times 82646 > 1 \times 10^9$ matrix entries, making techniques like Asso and PaNDa+ intractable for use.

Figure 5.2: The basis matrix corresponding to the rank nine (k = 9) decomposition of the IMDB data (columns: shooting locations, the Boolean genre flags Action, Romance, Family, Crime, Adventure, Fantasy, Sci-Fi, Drama, Comedy and Mystery, the rating, and the widescreen attribute; rows: the basis vectors b1· to b9·, with ratings ranging from 5.6/10 to 7.6/10).

We offer a brief interpretation. The American motion-picture industry is unsurprisingly very present, with movie shooting locations often found directly in Los Angeles as well as in California and nationally in general. Movies like Mob City (2013), having elements of crime, drama and mystery and shooting locations in Los Angeles (b5·), have slightly better overall ratings than comedies like Fired Up! (2009) from that area (b1·). These in turn are generally associated with better ratings than action-drama films (b8·) like Blue Thunder (1983). Perhaps owing to its panoply of geoclimatic regions, Canada and its cities (e.g. Vancouver) are identified in b7· as frequently being the set for action-adventure films like Journey to the Center of the Earth (2008). The intuitive latent association of Buenos Aires with romance is visible in b6·, mapped by the usage matrix to titles like Verónica: El rostro del amor (1982) and El amor tiene cara de mujer (1964). Interestingly, widescreen films having shooting locations in America and Europe frequently enjoy higher ratings (b3·). Earth Story (1998) is such an example – this film was shot in widescreen but this information is not included in the IMDB database (Finesse was able to impute it using the ternary feature). Of course, many titles are explained by a mixture of patterns. Napoléon (2002), for example, is an adventure-drama filmed in widescreen on the European and North-American continents with a rating of 7.3. It is explained in the usage matrix by b2· ⊕ b3· ⊕ b9·. Finally, we note that the rating distribution is in line with the average film rating of 6.9 on IMDB.


Finesse Shown to be More Effective Than the State-Of-The-Art: Our results showed that Finesse outperforms the state-of-the-art techniques Asso, GreEss and PaNDa+ (note that PaNDa+ [LOP14] is an improvement over the PaNDa algorithm investigated in Papers A and B) almost consistently.

Our first focus in this paper was on ordinal data sets, like those found in survey research. Again we used a generative model inspired by real-world data. Specifically, we harvested and analyzed the distribution of entries in ten studies with Likert items (see Table 4.1). Analysis of these questionnaires showed that they had an approximately-uniform distribution of response values, so we designed our generative model such that the default parameters generated data matrices with an approximately uniform distribution of response values.

Varying each parameter systematically, we found that Finesse outperformed Asso, GreEss and PaNDa+ in terms of reconstruction error. To compare to Asso and PaNDa+, we encoded the ordinal data in Boolean form. We again used an encoding scheme that ensured equivalence between the objective functions being solved by each algorithm. GreEss was the only other technique that could work with an ordinal data matrix directly, so no encoding was required. Unfortunately, however, it was only tractable to compare to the computationally-intensive GreEss on small data (e.g. n = m = 50).

Figure 5.3: GreEss achieves the optimal decomposition on this synthetic, no-noise ordinal data. The basis matrix is shown below, and the usage matrix to the right. It correctly identifies the three ground-truth patterns (k = 3).

Although GreEss yields near-optimal results for the case of zero noise (and thus outperformed Finesse), our results showed that its performance significantly degraded as realistic levels of noise were added (it yields to Finesse at less than 10% noise). To understand why, we consider the GreEss decomposition for a simplified ordinal data set with η = 0% (no noise) in Figure 5.3.

We see from Figure 5.3 that the decomposition is exact. Without the need for a parameter k, GreEss impressively finds the three latent patterns (bottom), various mixtures of which (right) precisely reproduce each row in the data set. We now make the problem more realistic by lightly “salting” our data matrix: one element from each of the 42 non-zero columns is set to zero (less than 2% noise). GreEss' decomposition after this change is in Figure 5.4 (this time displayed canonically).

Figure 5.4: GreEss requires 24 basis vectors to exactly reconstruct this synthetic, noisy data set.

We see from Figure 5.4 that GreEss now needs to generate 24 basis vectors to exactly reconstruct the data set. The three ground-truth basis vectors found in the zero-noise case (Figure 5.3) are now seen fragmented. The reason for this is that GreEss precisely models the signal and the noise, using a from-below approach where over-coverage is prohibited [BK13]. This practically means that small quantities of “salt” noise are sufficient to cause the fragmentation of the true signal in the decomposition. Asso, Finesse and PaNDa+ are more liberal in this respect, tolerating patterns which may somewhat over- or under-cover the data in the hope of separating the signal from the noise. Such examples show that this balanced approach, the basis of which is formulated in the Positive-Negative Partial Set-Cover problem [Mie08a], is more appropriate for realistic levels of noise.

We also compared Finesse to both Asso and PaNDa+ on the ten aforementioned real-world survey data sets. Again, we were unable to compare to GreEss here due to the size of the data. The “ground-truth” model-order k was not known for these data, so we ran experiments for both k = 10 and k = 20. In all cases (20 experiments in total), Finesse outperformed both Asso and PaNDa+.

Finally, we also repeated the BMF experiments with the new PaNDa+ algorithm. Compared to the experiments in Papers A and B, these experiments used a different generative model. Specifically, the default parameters generated a data matrix with an equal distribution of Boolean f and t values. The results are shown in Figure 5.5. Finesse outperformed both Asso and PaNDa+ in all cases.

Finesse More Efficient Than the State-Of-The-Art: The theoretical worst-case run-time complexity of Finesse is linear in the size of the data, whereas Asso and PaNDa+ (like PaNDa) are both quadratic in the number m of features. GreEss is yet more complex: its run-time is quadratic in the number n of objects and cubic in the number m of features. This result was confirmed in our practical experiments on run-time/scalability.

In addition to our Java version, we investigated two performance optimizations for Finesse. The first step was to rewrite the algorithm in C++ and utilize SSE4 intrinsics (Streaming SIMD Extensions 4) for vectorization. Specifically, Finesse can exploit bitwise AND, bitwise OR, bitwise XOR and POPCNT (“population count” – efficient calculation of the number of 1s in a bit string). Recent advances in SIMD instruction sets on general-purpose CPUs process up to 512 bits at a time in this fashion (e.g. AVX2 and AVX-512), significantly reducing Finesse's absolute response time on a single CPU. Our run-time experiments included “proof of concept” results for the SSE4 (128-bit) case, which showed that the absolute response time of the optimized C++ version was approximately 1/20 that of the Java version. This version outperformed the PaNDa+ algorithm, which has likewise been optimized in C++. The second optimization of Finesse – the use of OpenMP for parallelization of Finesse's inner loops on ten processors – yielded a further speedup factor of between 8 and 9 (analogous to the results seen for FasTer parallelization in Paper B).
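For intuition, here is a drastically simplified sketch (our illustration, not the thesis code) of why these instructions matter: if Boolean rows are packed into 64-bit words, the per-row reconstruction error reduces to one XOR and one population count per word. The __builtin_popcountll intrinsic below is the GCC/Clang spelling; C++20 also offers std::popcount.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hamming distance between a packed Boolean data row and its reconstruction.
// SSE4/AVX variants process 128-512 bits per step in the same spirit.
std::uint64_t row_error(const std::vector<std::uint64_t>& data,
                        const std::vector<std::uint64_t>& recon) {
    std::uint64_t err = 0;
    for (std::size_t w = 0; w < data.size(); ++w)
        err += __builtin_popcountll(data[w] ^ recon[w]);
    return err;
}

int main() {
    std::vector<std::uint64_t> d = {0xF0F0F0F0F0F0F0F0ULL};
    std::vector<std::uint64_t> r = {0xF0F0F0F0F0F0F000ULL};
    std::printf("differing bits: %llu\n",
                (unsigned long long)row_error(d, r));  // prints 4
    return 0;
}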

Paper D: Skinny-dip: Clustering in a Sea of Noise

Motivation: Data Sets with “Extreme” Clutter Expose Significant Limitations in Existing Clustering Paradigms: We considered as a case-study the synthetic two-dimensional data set in Figure 5.6. 80% of the objects are sampled from a global, uniform “clutter” distribution. In the figure, the raw data is accompanied by attempts to cluster it using the popular Expectation-Maximization (EM), k-Means and Mclust-EM algorithms. The quality of their results leaves much to be desired, particularly their treatment of the global “clutter” or “noise”. EM and k-Means are examples of well-known concrete algorithms in the partition-based clustering paradigm. In this paradigm, the separation of large volumes of clutter is often not a fundamental consideration. Centroid-based techniques often fail to consider noise whatsoever, dictating that each point be assigned to its nearest centroid.

Figure 5.5: Boolean Matrix Factorization for synthetic data with varying data-generation parameters k, λ, ρ and η (error of the MDFS objective function for Asso, Finesse and PaNDa+, plotted against the decomposition rank k, the count λ of t values in U, the distribution parameter ρ, and the noise η in %).


Figure 5.6: Results from further clustering techniques on the “running example” data in Paper D (panels: raw data, EM (hard) clustering, k-Means and MCLUST). MCLUST is the algorithm used in [DR98] (discussed in Section 3.4.1).

In Paper D we demonstrated analogous results using more specialized methods from other clustering paradigms, like the density- and spectral-based DBSCAN and Self-Tuning Spectral Clustering [ZP05] algorithms. In addition to our “running example”, we further motivated our work by showing real-world data sets which exhibit the characteristics on which we focus: high clutter and heterogeneous cluster shapes.

Formal “Dip Test” Identified as a Useful Tool for Data Mining: Given the sub-optimal results from existing clustering techniques on such data, we considered alternative approaches to the problem. We investigated the intuitive notion that equates clusters with the modes, or “modal regions”, of a multivariate distribution. Intuitively, this concept was a strong fit: modes are generally invariant to globally-added “clutter”, and can have a range of spatial extents. We noted that the idea of “hunting” for the modes of a multivariate distribution had already been studied in the statistical literature [BP09; Ooi12], but had received somewhat less attention in the data-mining community. Specifically, “mode-hunting” clustering techniques had often been based on computing non-parametric density estimates of the data using methods like Kernel Density-Estimation (KDE) [Ros+56; Par62]. In [LRL07] it is explained that it is computationally expensive to perform clustering based on KDE (quadratic-time in the data set size n) and that KDE additionally requires the tuning of a “bandwidth” parameter.

In Hartigan's dip test we identified existing statistical work that enabled the detection of modal regions in a non-parametric, linear-time and parameter-free way. We thus began our novel contributions by investigating the dip in the context of extracting heterogeneous clusters in cluttered environments. The dip is a statistic that measures the departure from unimodality of a given (univariate) empirical sample [HH85]. The dip test is a statistical hypothesis test that uses the dip to address the null hypothesis “the given data was sampled from a unimodal distribution”. In addition to helping reject or accept this null hypothesis, the dip test yields a further piece of useful information from its calculations, namely the interval corresponding to the most primary mode. We observed that the test has a list of advantages that mirror much of what we wish to see in a general clustering technique. These include 1) no strict assumptions about the form of a “cluster” (i.e. non-parametric), 2) no required algorithm parameters, 3) high robustness to global clutter and outliers, 4) determinism, 5) shift- and scale-invariance, and 6) linear run-time complexity in the size of the sample.

We also noted, however, that the test has two fundamental limitations: 1) it is applicable to univariate samples only, and 2) it identifies only one (primary) modal interval. Despite these limitations, we argued that its advantages make a dip-based clustering approach worth pursuing. The remainder of our contributions in this work were based on overcoming these limitations to produce a general clustering technique.

UniDip Introduced to Cluster Univariate Data: We took the first step in the direction of our goal by introducing UniDip, a recursive dip-based algorithm for clustering univariate data. The algorithm operates by exploiting the two fundamental questions answered “out-of-the-box” by the dip, namely: 1) Is the distribution from which the sample was taken multimodal?, and 2) Where is the interval corresponding to the most primary mode?

Based only on this information, UniDip follows a recursive “divide-and-conquer” approach to extract the locations of all modal intervals in a univariate sample. It is non-parametric and requires no algorithm parameters (other than a threshold α for statistical significance, which is set to 0.05 by default). In Figure 5.7 we see a demonstration of UniDip's ability to extract modal intervals formed from various distribution types. Its result on a busier sample is included in Paper D.

Figure 5.7: The UniDip result on a univariate sample (from left: a Beta distribution, a Gaussian distribution, a uniform distribution, and 50% “global” clutter over the interval [0, 1]). Panels: the raw data as a histogram, and the UniDip result, with frequency on the vertical axes.

UniDip inherits all of the core advantages of the dip mentioned in the previous section. It is a standalone technique, and can be used for finding all the modes of a continuous univariate distribution. To the best of our knowledge, this non-parametric, noise-robust, linear-time and parameter-free technique for finding such intervals is a novel contribution.

SkinnyDip Algorithm Presented and its Properties Investigated: UniDip solves our need to identify all modal intervals in a sample. However, it only does so for univariate data. The next step in the direction of our goal was to extend this approach to the multivariate case. The result, SkinnyDip, is a recursive dip-based algorithm for clustering continuous multivariate data. At a high level, one can describe SkinnyDip as a technique that recursively partitions the data, dimension by dimension, based only on the results of the univariate application of the dip. To do this, SkinnyDip applies UniDip on filtered univariate projections of the data in a systematic way. SkinnyDip yields “modal hyperintervals” (e.g. rectangles in two dimensions) corresponding to each detected cluster. Data falling within one of these areas is assigned the corresponding cluster label. Other data is labeled as noise/clutter.
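For intuition only, the final assignment step can be pictured as a point-in-hyperinterval test (a minimal sketch of ours, not the thesis code):

#include <cstdio>
#include <vector>

// A modal hyperinterval: per-dimension [lo, hi] bounds.
struct Hyperinterval { std::vector<double> lo, hi; };

// Assign a point the label of the first hyperinterval containing it,
// or -1 ("noise/clutter") if it falls in none of them.
int label(const std::vector<double>& x, const std::vector<Hyperinterval>& H) {
    for (std::size_t c = 0; c < H.size(); ++c) {
        bool inside = true;
        for (std::size_t d = 0; d < x.size() && inside; ++d)
            inside = (H[c].lo[d] <= x[d]) && (x[d] <= H[c].hi[d]);
        if (inside) return static_cast<int>(c);
    }
    return -1;
}

int main() {
    // One long, thin rectangle, as in the running example of Figure 5.6.
    std::vector<Hyperinterval> H = {{{0.0, 0.0}, {1.0, 0.2}}};
    std::printf("%d %d\n", label({0.5, 0.1}, H), label({0.5, 0.9}, H));  // 0 -1
    return 0;
}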

Compared to other clustering techniques, SkinnyDip has a unique combination of properties. Its practical run-time is linear in the number of objects and dimensions. It requires no parameters, other than a threshold for statistical significance (in our work we always use the standard value of α = 0.05). It is deterministic and highly robust to clutter and outliers. Finally, and perhaps most interestingly, it requires no multivariate measure of distance on the space. This allows SkinnyDip to extract clusters with different spatial extents, like the long, thin rectangle in Figure 5.6. To the best of our knowledge, SkinnyDip is the first general clustering technique for vector data that does not perform multivariate distance computations between objects in the data. That is, in line with our motto in Section 1.3.1, it curiously questions a fundamental assumption made by existing clustering approaches, namely that a multivariate measure for the distance between objects is required on the space.

SparseDip Algorithm Presented for Optimal-Subspace Pursuit: We presented SparseDip as an additional layer on top of SkinnyDip. The goal of SparseDip is to enable clustering on high-dimensional spatial data sets by searching for a subspace in which SkinnyDip can be applied. SparseDip does this by searching for directions in the data that are maximally multimodal with respect to the dip statistic. Considering that multimodality implies separation, we argued that focusing on such directions is a useful heuristic for the task of clustering.

By exploiting results relating to the continuity of the dip [KL05], we were able to make use of gradient ascent in the search for these maximum-dip directions. Unfortunately, the relevant function is non-convex, containing many local optima. For this reason, the choice of the starting point for gradient ascent became an important factor.

To select candidate starting points for gradient ascent, we evaluated two options. The first, a naïve approach, involved simple random-sampling of candidate starting points. The second involved using the nodes of a sparse grid. A sparse grid is an efficient data structure that can be used to represent functions in high-dimensional spaces. Sparse grids are often used in the field of numerical simulation in order to make tractable the computation of approximate solutions to partial differential equations (PDEs) in high dimensions [BG04]. In our case the end-goal was not to solve a PDE, nor to accurately represent or interpolate the function. We decided to evaluate sparse grids simply as a mechanism to find candidate starting points for gradient ascent. We empirically compared the sampling from sparse grids with simple random sampling and found that the sparse-grids approach worked more favorably on a number of optimization benchmark functions (one example is shown in Figure 5.8). Additionally, the use of sparse grids led to higher-quality results on all ten of our real-world data sets.

SparseDip builds an orthogonal basis using the maximum-dip directions that are found. The search is terminated when no more significant directions are found (with respect to a significance threshold α). The projection of the data onto this basis is then passed to SkinnyDip, which performs the clustering in that subspace.

Figure 5.8: Using a constant sample size of the Levy function, the curves show the minimum value found by varying dimensionality when using 1) simple random sampling and 2) sparse-grid-based sampling.

SkinnyDip Shown to be Robust Against “Extreme” Levels of Noise and Clutter: We presented results showing that SkinnyDip is able to extract intuitive clusters embedded in high levels of noise and clutter (see Figure 5.9). These experiments suggested that the performance of techniques from the remaining clustering paradigms degrades at this level of noise. In addition to experiments on synthetic data with varying amounts of clutter, we presented a real-world case study using the North Jutland (Denmark) Road Network data from the UCI Machine-Learning Repository. This data is highly noisy, with insignificant road-segments spaced more or less evenly through the countryside. On this data we showed that SkinnyDip is able to extract intuitive clusters corresponding to high-settlement areas (cities). These clusters have by no means a Gaussian spatial form. The spatial form of the city of Frederikshavn, for example, is constrained by its proximity to the coast. SkinnyDip's properties of being non-parametric and not relying on a particular measure of distance in the multivariate space were of noticeable advantage here.

SkinnyDip Shown to Outperform the State-Of-The-Art in Terms of Effectiveness: We empirically evaluated SkinnyDip against six state-of-the-art automated clustering algorithms from different paradigms (partition-based, density-based, spectral-based). The first part of our evaluation was completed in a controlled way on synthetically-generated data. The model we selected for synthetically generating data was based on the complexities we were attempting to address: high levels of clutter and heterogeneous (albeit still convex) cluster shapes (Challenge 4). The generative model included parameters that gave us control over 1) the number of clusters, 2) the number of dimensions, and 3) the level of clutter.

Using Adjusted Mutual Information as our evaluation metric, we found that SkinnyDip was able to outperform all comparison techniques consistently on our generated data when working with problems of dimensionality up to m = 8. Above this level, we found that the SkinnyDip result was not useful. Indeed, as the dimensionality of the problem increased, the limitations of SkinnyDip's “dimension by dimension” heuristic became significant. However, we argued that it is uncommon to find a real-world clustering that exists in a non-redundant way in this many dimensions. We suggested the use of SparseDip as a pre-processing option for practical data sets of non-trivial dimensionality. As discussed in Section 4.6.1, we additionally evaluated a portion of the clustering results using the Normalized Mutual Information metric, which suggested that there is little deviation between the two measurements (Figure 5.9).

We further evaluated SkinnyDip on ten real-world data sets from public repositories (see Section 4.4). The data varied in size and dimensionality (up to n = 7494 and m = 16). In seven cases out of ten, SkinnyDip outperformed the suite of comparison techniques. In two of the remaining cases, it ranked second. Additionally, the use of SparseDip identified a number of interpretable and useful “maximally-modal” directions in the data that yielded intuitive insights into the domain.

SkinnyDip Shown to Outperform the State-Of-The-Art in Terms of Efficiency: As a full-space clustering technique, we showed that SkinnyDip's theoretical worst-case run-time complexity is $O(n \log(n) \cdot m + n \cdot m \cdot k)$, where n is the number of data objects, m is the dimension of the space, and k is the maximum number of clusters found from a call to UniDip (by definition not greater than the number of clusters found in total). The logarithmic factor is related only to the requirement of sorting the data (computation of the dip requires a sorted sample). Using aggressively-optimized R sorting routines, we were unable to detect this logarithmic factor. Practically, our run-time experiments even showed that SkinnyDip scales sub-linearly with the size of the data, which is attributable to the fact that data objects cease to be considered by SkinnyDip's recursive calls after they have been labeled as noise. Practically then, linear complexity is a conservative statement about SkinnyDip's run-time. As a comparison, the standard implementation of DBSCAN is quadratic in the number of objects n (optimized implementations can reduce this factor to n · log(n)).

Figure 5.9: A reproduction of Figure 2 in Paper D, with the addition of SkinnyDip's Normalized Mutual Information values (the NMI is the top-most dotted red line series, seen slightly above the blue AMI measurement for SkinnyDip).

In terms of absolute response time, SkinnyDip outperformed all six comparison techniques. The closest competitor was DBSCAN. The slowest of the competition was Self-Tuning Spectral Clustering; the computation of its locally-scaled affinity matrix is very expensive and we were not able to complete all of its run-time experiments within a reasonable time frame.


Paper E: Let’s see your Digits: Anomalous-State Detection using Benford’s Law

Benford's Law Holds for Data From Many Online Services: We showed empirically that Benford's Law (BL) holds for data from the online services Twitter, Wikipedia, YouTube and GitHub. Benford's original manuscript [Ben38] had already shown empirically that many real-life sets of numerical data obey the law. These data included population numbers, lengths of rivers and a list of physical and mathematical constants. Since the original manuscript, it has also become well-known that BL holds for economic data [Nig99; DHP04]. To the best of our knowledge, however, our work is the first to investigate conformance for Wikipedia, YouTube and GitHub data (conformity of Twitter data was already shown in [Gol15]). We argued that this is a convenient property, particularly because data from such online services is often considered “noisy”, and because the Benford distribution requires no parameterization.

Benford's Law is a “Natural” Tool for Anomaly-Detection in Time-Series Data: We reflected on Benford's original manuscript [Ben38], which argued that, despite mathematicians having named the sequence 1, 2, 3, . . . the “natural” numbers, Nature is found to count $e^0, e^x, e^{2x}, e^{3x}, \ldots$ much more often. Numerous examples of natural and man-made processes that follow geometric or logarithmic progressions are given in Benford's manuscript. Given a set of numbers that was generated by such an underlying process and that spans a large range, random samples over that set will yield a collection of numbers that follow Benford's law. We argued that conformity to BL is therefore a sensible measure for a system's “naturalness”, and that deviations from this conformity can be understood as resulting from some “unnatural” influence. We further argued that a significant deviation from the Benfordness property in a window of data that is streamed with high bandwidth in real-time (Challenge 2) is an intuitive indication of “unnatural” or “anomalous” system behavior, and that this mechanism may prove useful as a “red-flagging” approach for many applications.
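This claim is easy to reproduce numerically. The following self-contained sketch (our illustration) tallies the leading digits of a geometric progression and compares the empirical frequencies with the Benford probabilities P(d) = log10(1 + 1/d); the two agree closely:

#include <cmath>
#include <cstdio>

// Leading decimal digit of a positive number.
int leading_digit(double x) {
    while (x >= 10.0) x /= 10.0;
    while (x < 1.0)  x *= 10.0;
    return static_cast<int>(x);
}

int main() {
    // A geometric progression x_k = 1.01^k, in the spirit of "Nature counts
    // e^0, e^x, e^2x, ...".
    int count[10] = {0};
    const int n = 10000;
    double x = 1.0;
    for (int k = 0; k < n; ++k) {
        count[leading_digit(x)]++;
        x *= 1.01;
    }
    for (int d = 1; d <= 9; ++d)
        std::printf("d=%d empirical=%.3f benford=%.3f\n", d,
                    count[d] / static_cast<double>(n),
                    std::log10(1.0 + 1.0 / d));
    return 0;
}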

Statistical Hypothesis Test Presented for Benford's-Law Conformity: In order to objectively evaluate the BL-conformity of arbitrary univariate numerical samples, we presented a statistical hypothesis test based on the formal Kolmogorov-Smirnov one-sample test. The test exploits the fact that the mantissae of a Benford set are uniformly distributed in [0, 1) [NW12].
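A condensed sketch of this idea follows. It is our own approximation, not the thesis' exact test: it assumes base-10 mantissae (the fractional part of log10 |x|), uses the asymptotic Kolmogorov distribution for the p-value, and omits the small-sample refinements a production implementation would require.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Asymptotic Kolmogorov complementary CDF:
// Q(t) = 2 * sum_{k>=1} (-1)^{k-1} * exp(-2 k^2 t^2), truncated and clamped.
double kolmogorov_q(double t) {
    double q = 0.0;
    for (int k = 1; k <= 100; ++k)
        q += 2.0 * ((k % 2 == 1) ? 1.0 : -1.0) * std::exp(-2.0 * k * k * t * t);
    return std::min(1.0, std::max(0.0, q));
}

// One-sample KS test of the mantissae against Uniform[0, 1): for a Benford
// set the mantissae are uniform, so a small p-value rejects Benfordness.
double benford_p_value(const std::vector<double>& xs) {
    std::vector<double> m;
    for (double x : xs)
        if (x != 0.0) {
            double lg = std::log10(std::fabs(x));
            m.push_back(lg - std::floor(lg));  // fractional part in [0, 1)
        }
    std::sort(m.begin(), m.end());
    const double n = static_cast<double>(m.size());
    double D = 0.0;  // KS statistic: max deviation from the uniform CDF
    for (std::size_t i = 0; i < m.size(); ++i) {
        D = std::max(D, (i + 1) / n - m[i]);
        D = std::max(D, m[i] - i / n);
    }
    return kolmogorov_q(std::sqrt(n) * D);
}

int main() {
    std::vector<double> geometric;
    double x = 1.0;
    for (int k = 0; k < 5000; ++k) { geometric.push_back(x); x *= 1.01; }
    std::printf("p-value for a geometric (Benford) sample: %.3f\n",
                benford_p_value(geometric));
    return 0;
}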

We compared our hypothesis test with seven other state-of-the-art BL-conformity tests. These tests were based on either testing single digits at once, testing all digits at once, or investigating the mantissae. We used two ideal Benford sequences of the same length for evaluating the tests. Six of the seven tests, based on considering the discrete digit distributions, failed to conclude that the second sequence was Benford due to identified discretization errors. The seventh test, the so-called Mantissa Arc test, correctly identified both sequences as Benford. We showed using a simple example, however, that the Mantissa Arc test is not sufficient: it concludes Benfordness on a numerical sample that is clearly not Benford. Our test was thus used to form the basis for our anomaly-detection technique BenFound.

BenFound Introduced as a Novel Anomaly-Detection Technique for Time-Series Data: We presented BenFound as a technically-uncomplicated anomaly-detection technique that can be used to monitor “Benford-style” data, that is, data generated by multiplicative or exponential growth phenomena. We explained how BenFound only depends on the choice of a window size w and a statistical threshold α, and discussed how these values can sensibly be set. To the best of our knowledge, this is a novel approach to the general task of anomaly-detection in time-series data. All of the other approaches that we know of consider the behavior and evolution of the absolute values of the measurements, or statistics or moments based directly thereon. We are not aware of a general anomaly-detection technique for time-series data that considers only the leading digits and/or the mantissae.
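Schematically, the monitoring mechanism is compact (again a sketch of ours, not the thesis implementation; the conformity test, e.g. the KS mantissa test above, is injected as a callback, and the stand-in used in main exists only to make the example self-contained):

#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

// BenFound-style monitor: keep a sliding window of the w most recent values
// and raise a flag whenever the Benfordness test rejects at level alpha.
class BenfordMonitor {
public:
    BenfordMonitor(std::size_t w, double alpha,
                   std::function<double(const std::vector<double>&)> pValue)
        : w_(w), alpha_(alpha), pValue_(std::move(pValue)) {}

    // Feed one measurement; returns true if the window is full and the
    // null hypothesis "the window is Benford" is currently rejected.
    bool push(double x) {
        window_.push_back(x);
        if (window_.size() > w_) window_.pop_front();
        if (window_.size() < w_) return false;
        std::vector<double> snapshot(window_.begin(), window_.end());
        return pValue_(snapshot) < alpha_;
    }

private:
    std::size_t w_;
    double alpha_;
    std::function<double(const std::vector<double>&)> pValue_;
    std::deque<double> window_;
};

int main() {
    // Toy stand-in for the conformity test, for demonstration only.
    auto fakeTest = [](const std::vector<double>&) { return 0.5; };
    BenfordMonitor monitor(100, 0.05, fakeTest);
    for (int i = 1; i <= 200; ++i)
        if (monitor.push(i * 1.1)) std::printf("anomaly at step %d\n", i);
    return 0;
}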

BenFound Shown to Detect “Unnatural” Behavior in Online Services: We presented three case studies showing how BenFound can detect “unnatural” behavior in online services. The first case-study was related to Wikipedia data. We investigated a large set of change-size deltas (in bytes) for page-edit events that were labeled by Wikipedia as “non-bot”. Analyzing the leading two-digit distribution exposed significant “spikes” which, upon further analysis, were found to correspond to bot behavior (autonomous agents that were mass-editing Wikipedia content). Wikipedia's labeling system had failed to label the edits as “bot” edits.

The second case-study involved listening to the Twitter tweets against the #FathersDay hashtag. Approximately one week before Father's Day, the Benfordness of the signal was significantly rejected. It turns out that the hashtag had been “hijacked” by a number of spam and advertising accounts “unnaturally” trying to generate a profit from the event. Finally, our third case-study involved listening to the Twitter tweets against the #PokemonGO hashtag. A sharp drop was seen on July 16 and corresponded to unnatural tweet behavior because of a confirmed denial-of-service attack on the Pokemon Go game servers.

BenFound Shown to Outperform Ten State-Of-The-Art Change-Point and Anomaly-Detection Techniques on Synthetic Data: We generated synthetic time-series data and performed an experiment comparing BenFound to ten state-of-the-art anomaly-detection techniques from the statistics and data-mining literature. The synthetic data simulated a change in the underlying process from Benford to non-Benford, and then its regression from non-Benford to Benford thereafter. BenFound was the only technique able to identify both changes to the underlying generative process. Many of the other techniques generated an indigestible number of false positives.

5.2 Implications for Research

This thesis has three main implications for research:

Implication 1: Knowledge can be extracted in the complex context of incomplete, heterogeneous and discrete data measured over non-ratio scales.

Our contributions imply that Blind-Source Separation can be performed over heterogeneous data sets containing certain kinds of discrete data types (Challenge 4). Through the introduction of the Ternary Matrix Factorization (TMF) problem, for example, we have provided a basis for performing Blind-Source Separation on data measured on the scale of ternary logic. This logical structure additionally implies a useful perspective on the problem of incomplete Boolean measurements (MVBMF) and has applications in collaborative filtering and the mining of access roles.

With the Matrix Factorizations over Discrete Finite Sets (MDFS) framework, we extend this concept to arbitrary discrete finite sets having a sensible notion of “mixing”. The MDFS framework implies progress towards a unified theory of data mining (Challenge 1) because it subsumes a number of existing data mining techniques (BMF, TMF and OMF). The MDFS framework also implies that researchers can deduce additional applications by identifying further features to which it applies.

Implication 2: Knowledge can be found despite the complexities of “extreme” global clutter and high dimensionality.

Our contributions imply that object-groupings (clusters) can be found despite their presence in a sea of up to 90% noise (Challenge 4) and despite their location in an arbitrarily-oriented, low-dimensional subspace (Challenge 2). Through the introduction of our clustering algorithm SkinnyDip and the subspace-search algorithm SparseDip, we have introduced a novel take on the problem of extracting knowledge through clustering. The SkinnyDip result implies, for example, that a multivariate distance or “similarity” measure is not a strict requirement for a vector-based clustering method. It also implies that the relatively-unknown dip test is an effective tool for data mining – it can efficiently and non-parametrically quantify a notion of “separation” in continuous univariate data, and locate the primary “modal” region in such data. It does this in linear time without parameters.

Implication 3: Knowledge can be extracted from the complex context of high-bandwidth time-series data without needing a parameterized model.

Our contributions imply an intriguing new approach for analyzing high-bandwidth numerical time-series data (Challenge 3). The approach does not involve the use of a parameterized model for explaining the distribution of the absolute measurement values. Specifically, our anomaly-detection approach BenFound implies that the availability of the absolute values of the time-series data measurements is not strictly necessary. By monitoring the conformance of the leading digits or mantissae to Benford's Law, we obtain an elegant mechanism for determining whether a system has deviated from its natural state. Our experiments on real-world data imply that this approach is by no means limited to data from social-media streams like Twitter. Indeed, our results on Wikipedia, GitHub and YouTube data suggest that many systems measure phenomena that obey Benford's Law. To the best of our knowledge, Benford's Law has not yet been investigated or exploited by the data-mining community.

5.3 Implications for Practice

Practitioners will benefit from the results of this thesis in the following four ways.

Implication 1: Our proposed techniques outperform the state-of-the-art with respect to solution quality on many practical problems.

The effectiveness of a given algorithm (with respect to a sensible objective function) is a key consideration for practitioners looking for high-quality solutions. For the complex problems we investigate, it is of course difficult to compare the quantitative performance between algorithms over the complete set of problem instances. For this reason, we took care in the design of our empirical evaluation to ensure that our generative models reflected the kind of data seen in practice. On such data, we showed that our algorithms outperformed the state-of-the-art in the majority of cases. For our work on discrete matrix factorizations, for example, one generative model was inspired by patterns found in survey research. For our work on clustering, the generative model was inspired by the levels of clutter often seen in minefield-detection and the clustering of galaxies. For our work on anomaly-detection in time-series data, our generative model was inspired by dynamic observations of the “Benfordness” of system metrics, and how non-conformance often relates to events of interest in the system under investigation.

In addition to synthetic data, we showed the effectiveness of our techniques on a number of real-world examples. Practitioners almost always deal with real-world data, and such data is seldom “well-behaved”. In many cases, our quantitative evaluations were complemented by qualitative ones that showed our techniques yield results which are more interpretable and intuitive than the state-of-the-art (helping to address Challenge 4).

Perhaps the most concrete evidence of the practical impact of our work is found in a recent independent study of TMF [Vav+]. In this study, the authors cite estimates that the adoption of role-based access control (RBAC) saved American organizations USD 1.8 billion in the year 2009 [OL10]. The most significant expense in the transition from permission-based access control to RBAC was noted to be in the role engineering and mapping stage. Our TMF work assists in this migration process by effectively solving the role-based mining with missing-values problem at scale. In our work, we showed empirically that our algorithm FasTer outperforms existing techniques that are able to solve this problem. This conclusion was independently verified in this study [Vav+].

Implication 2: Our proposed techniques exhibit practically linear run-time behavior in the size of the data.

An algorithm that yields the globally-optimal result to a problem that needed to be solved yesterday is of little use. Often more so than researchers, practitioners and the “production” applications on which they work need to produce results on short time scales. To directly address this challenge, the approaches we have presented yield highly-competitive results at linear run-time complexity in the size of the data. In many cases, our algorithms outperform the state-of-the-art in terms of both effectiveness and efficiency, helping to address Challenge 2 and simplifying the practitioner's job of algorithm selection.

Implication 3: Our proposed techniques require no “obscure” parameters.

A Utopian system of unsupervised learning would promptly address the request Here is my data, tell me what I want to know without requiring any further information. Practically, algorithms need some information about the nature of the patterns which should be sought and presented. In many cases, this information is embedded as an assumption in the technique itself (e.g. Gaussian clusters); in others, it may be provided in the form of an algorithm parameter (e.g. the number of patterns to find). Some algorithm parameters are not intuitively connected with the form of the solution and are instead used “internally” for managing a heuristic (e.g. a rounding threshold for an intermediate matrix). We might call such parameters “obscure”, because a practitioner may find it difficult to know how the nature of the solution varies when varying the parameter. The iterative nature of data mining and KDD becomes more difficult when such parameters are involved, because the user may be less certain about how to proceed. Obscure parameters are hence best avoided where possible.

Fortunately, the algorithms we present in this thesis do not require parameters of this kind. FasTer requires only the number k of source patterns to find, as does Finesse. SkinnyDip is even leaner: its default value for the statistical-significance threshold is the typically-used threshold in most of statistics (α = 0.05). This value seldom needs to be changed in practice (indeed, we did not change it in any of our work). BenFound requires an analogous significance threshold α and the width of the sliding window, which the user can appreciate as representing a trade-off between statistical power and the ability to capture system dynamics. From the perspective of algorithm parameters, our algorithms are therefore an attractive choice for practitioners.

Implication 4: Our proposed techniques have documented implementations available in publicly-accessible locations online. All results are reproducible.

Replication is an important criterion by which scientific claims are judged [Pen11]. The ability to replicate results can also be of use for practitioners who wish to get started on applying our techniques to their own problems. In each of our research articles, we have provided references (URLs) to supplementary material that includes a downloadable research prototype of the proposed method. Each repository includes a detailed description and verbose code examples for getting started with the algorithm.

In the most recent contribution (Paper E) we also adopted and advocated the use of Knitr [Xie15]. Knitr is a LaTeX preprocessor that enables transparent and reproducible research by embedding the source code required for reproducing all results, figures and tables directly alongside the source for the report.
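For readers unfamiliar with the tool, the following is a minimal sketch of a knitr source file (an .Rnw document); the chunk label, data and figure are purely illustrative:

\documentclass{article}
\begin{document}
% A knitr chunk: the enclosed R code is executed when the report is
% compiled, and its output (here, a figure) is embedded in the PDF.
<<example-figure, fig.height=3>>=
x <- rnorm(100)     # illustrative data
plot(x, type = "l") # the figure is regenerated on every build
@
\end{document}

Calling knitr::knit() on such a file weaves the code, its results and the surrounding LaTeX into a single document, so every figure and table can be regenerated from scratch at any time.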

5.4 Limitations

Like all research, the contributions of this thesis have limitations. In the following sections we discuss these limitations in detail.

5.4.1 Non-Optimality

Perhaps the most obvious (albeit important) limitation of the methods presented in this thesis is the non-optimality of their solutions.


The field of data mining is rife with NP-Hard problems, including those treated in this thesis. For example, applying our FasTer algorithm to the Boolean Matrix Factorization problem involves an iterative heuristic that, at each step, needs to solve a set of optimization instances of the ±PSC problem. The ±PSC problem is itself NP-Hard, which currently means that globally-optimal solutions are generally not attainable for it in a computationally-tractable manner. Assuming we want “efficient” algorithms, the limitation of non-optimal solutions will therefore remain for some time (at least until the field sees revolutionary and constructive contributions in the direction of the P = NP question and similar conjectures, or alternative computing machines like practical quantum computers become available).

Another important limitation is that “optimality” in the sense of a problem’s objective function may not coincide with the “optimal” solution from the perspective of the practitioner or domain expert. SkinnyDip, for example, involves a hard partitioning of the data points based on which points are determined to belong to the “modes” of the distribution. Practically, it uses mathematical and statistical notions of where modes terminate. A practitioner may well subjectively argue that these boundaries are not optimal. To some extent, then, the notion of “optimality” extends to the choice of the objective function, which in the context of exploratory and unsupervised learning may be a subjective one.

5.4.2 Limited Approximability Results

Although our work includes theoretical results regarding the worst-case run-time complexity of our algorithmic contributions, it offers comparatively few theoretical results regarding effectiveness.

For the FasTer algorithm, we were able to show that we can successfully reduce ±PSC to one of the sub-problems (Ternary Usage Row) under certain conditions. For this case it was possible to present an approximation ratio (at the cost of increased run-time complexity). We were not, however, able to present approximation ratios for the higher-level TMF and MDFS problems.

Good approximation algorithms have escaped researchers for a wide variety of NP-Hard problems. In [Aro98] it is argued that achieving certain reasonable approximation ratios is no easier than computing optimal solutions. That is, a number of negative results suggest that approximation itself can be NP-Hard for many problems.

For some algorithms, convergence to a local minimum can be shown (e.g. k-Means [SI84]). We were, however, unable to prove such convergence for the FasTer and Finesse algorithms. Indeed, for the TMF and MDFS problems addressed in this thesis, discussions of the mathematical notions of “convexity”, “differentiability”, “intervals” and “locality” are not sensible, because the domain of the objective functions is finite and non-continuous. Without proofs for approximation ratios and “local” convergence, we resorted to 1) proofs that show algorithm termination upon reaching a minimum during the search strategy, and 2) empirical evaluation. These approaches, particularly empirical evaluation, are common in the data-mining community. Strictly speaking, however, we concede that such evaluations cannot be used as the basis for conclusive statements about an algorithm’s general performance.

5.4.3 Limited Empirical Evaluation

All of the methods presented in this thesis are evaluated empirically on real-world and synthetically-generated data sets.

Experiments on synthetically-generated data involve selecting a parameterized generative model. Such a generative model should ideally be motivated by real-world applications. It is important to realize, however, that the set of problem instances generated by all parameter combinations of such a model is typically only a small subset of the complete population of problem instances. In our case, it was not even possible to evaluate all parameter combinations of our generative models due to the curse of dimensionality. Our empirical evaluation was hence limited to the selection of sensible “default” generative parameters, from which individual parameters were varied over a range, as the following sketch illustrates.
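Schematically, the evaluation pattern looks as follows (the generative parameters named here are hypothetical, not those of our actual models):

# Hold all generative parameters at sensible defaults, then vary one at a time.
defaults <- list(k = 5, noise = 0.1, density = 0.2)
for (noise in seq(0, 0.5, by = 0.1)) {
  params <- defaults
  params$noise <- noise  # only the parameter under study deviates from its default
  # data <- generate(params); evaluate(data)  # hypothetical generation/evaluation
}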

In many cases the proposed algorithms outperformed the state-of-the-art; however, this was not always the case. Our empirical evaluation identified a number of generative parameters to which our algorithms reacted negatively. In the case of FasTer and Finesse, increasing the density of the usage matrix (i.e. the number of mixed patterns) in some cases caused poorer performance compared to the Asso algorithm. In the case of SkinnyDip being used as a full-space clustering approach, the clustering results degraded quickly with increasing spatial dimensionality.

5.4.4 Algorithm Parameters

Although we have noted in this thesis that our algorithms do not rely on “obscure” parameters, we cannot claim that they are entirely parameter-free. The following parameters are required by our methods. The most appropriate value for each parameter may not be known a priori, potentially causing confusion for practitioners.

1. TMF and MDFS (and their algorithms FasTer and Finesse) require the specification of the parameter k (the number of latent/source patterns to find). Additionally, these methods require the user to choose a number of “randomization rounds” (defaulting to 20). At the cost of longer running time, we showed that a larger number of randomization rounds yields a higher-quality end result.

2. Our proposed “itemsets over an ontology” (“tree”) feature for MDFS assumes that the tree structure is known a priori and provided as an input. Although we concede that this makes it more difficult to prepare the data for analysis, we argue that such a tree structure would likely already exist in many applications. The website yummly.com, for example, exploits its existing proprietary ingredient ontology to optimize the search experience on its website.

3. Our clustering technique SkinnyDip relies on a threshold for statistical significance α, which can optionally be varied as a parameter.

4. SparseDip, our subspace-search technique, requires the specification of the number of grid points to evaluate in the search for a gradient-ascent starting point. Here there is likewise a trade-off between run-time and result quality.

5. BenFound requires the parameters w and α, representing the width of the sliding time-window and the threshold for statistical significance respectively. We discuss the setting of these parameters in Paper E.

5.4.5 Non-Determinism

Only a subset of our algorithms are deterministic (SkinnyDip and BenFound). The algorithms FasTer and Finesse are non-deterministic due to the use of a non-deterministic initialization heuristic. Specifically, the initialization procedure of both techniques involves a random selection. No other initialization strategies were evaluated, and we did not investigate mechanisms to make FasTer and Finesse deterministic. We note that Asso, a competitor to these methods, is deterministic.

5.4.6 Performance Trade-Offs

We introduced practical limitations to some of our methods in order to improve their run-time performance. For example, the FasTer algorithm was presented for use with k ≤ 32. This decision was made in order to reduce the run-time complexity from O(nmk³) to O(nmk²) through the exploitation of bitwise instructions on 64-bit architectures. As each ternary value requires two bits, the maximum permissible length of a row vector in the usage matrix or a column vector in the basis matrix is 32. This limitation could be mitigated by exploiting larger-width SSE instructions (as in Finesse), or by accepting the higher run-time complexity O(nmk³). We also note a limitation of our work on evaluating FasTer in high-performance environments: there we only considered a shared-memory environment, so we are unable to draw conclusions about the performance of FasTer on distributed-memory architectures or GPUs.
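To make the bit-packing scheme above concrete, the following toy sketch packs ternary values at two bits each into a single machine word. It is only an illustration: R’s bitwise operators work on 32-bit signed integers (so at most 15 values here), whereas FasTer packs 32 values into each 64-bit word.

# Pack a vector of ternary values (0, 1 or 2) into one integer,
# using two bits per value (up to 15 values in a 32-bit integer).
pack_ternary <- function(t) {
  word <- 0L
  for (i in seq_along(t)) {
    word <- bitwOr(word, bitwShiftL(as.integer(t[i]), 2L * (i - 1L)))
  }
  word
}
# Recover the i-th ternary value by shifting and masking its two bits.
unpack_ternary <- function(word, i) {
  bitwAnd(bitwShiftR(word, 2L * (i - 1L)), 3L)
}
w <- pack_ternary(c(2, 0, 1))  # 2 + 0*4 + 1*16 = 18
unpack_ternary(w, 3)           # returns 1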

5.4.7 Side-Effects of our Complexity-Reducing Assumptions

As with most data-mining techniques, our methods rely on a number of assumptions for reducing complexity. Our techniques may yield sub-optimal results when these assumptions are violated. In the case of TMF and FasTer, we note that our work on Missing-Value Boolean Matrix Factorization (MVBMF) assumed the missing-completely-at-random (MCAR) case. We did not perform synthetic experiments for the missing-at-random (MAR) case, nor for the missing-not-at-random (MNAR) case, so we are unable to draw conclusions about the performance of FasTer in such cases.

Our framework MDFS demands that each admissible “mixable” feature fulfill a number of formal requirements. In particular, we assume the existence of a sensible contrast function, which may be difficult to formulate objectively.

One of SkinnyDip’s most intriguing strengths (no multivariate distance computations) is related to one of its main complexity-reducing assumptions: SkinnyDip assumes that the cluster structure can still be reconstructed by taking systematic univariate projections. We can construct pathological examples which violate this assumption. Consider the relatively simple clustering problem in Figure 5.10. Two rectangular clusters exist in a “sea” of approximately 70% noise. SkinnyDip is able to detect the full clusters, but in each case also includes a number of obvious noise points off to the side. The problem here is that SkinnyDip first projects onto the horizontal axis. The “shadows” of the two-dimensional clusters appear as one when looking up from this axis, so SkinnyDip can only tell them apart after it has recursed into the second dimension.
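The general effect is easily reproduced. In the following sketch (an illustration only, not the data of Figure 5.10), two clusters are separated in the second dimension but share the same horizontal “shadow”, so an analysis of the first projection alone must see a single mode:

# Two uniform clusters that overlap completely in the first dimension.
set.seed(1)
cluster1 <- cbind(runif(200, 0, 1), runif(200, 0, 1))
cluster2 <- cbind(runif(200, 0, 1), runif(200, 2, 3))
x <- rbind(cluster1, cluster2)
hist(x[, 1])  # unimodal: this projection hides the cluster structure
hist(x[, 2])  # bimodal: only the second dimension separates the clusters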

A second effect of this complexity-reducing SkinnyDip assumption is that it models clusters as “hyper-intervals”, which may be sub-optimal. In two dimensions, for example, the clusters are modeled using rectangles. Rectangles cannot precisely model “rounded” cluster structures, like the Gaussian clusters in Figure 5.11.

Our method BenFound relies on the assumption that “unnatural” or “fraudulent” behavior causes a violation of Benford’s Law (BL). If this assumption is violated, BenFound is not immune to false negatives. That is, if true anomalies are created by agents that are aware of Benford’s Law, those agents may tune their anomalies so as to remain undetected by the law. However, the success of BL to date (particularly in forensic accounting [NW12]) suggests that many real-life fraudsters are not aware of the law.

Finally, BenFound relies on the assumption that the system metrics in focus follow BL in their “natural” state. We concede that many metrics do not follow BL. Consider, for example, the signal obtained when monitoring the server response time of a popular website over time (Figure 5.12). Such data are normally not associated with any natural growth process, so the digit distribution is by no means Benford, and BenFound would be in a constant “red flag” state on such data (i.e. it would not be useful).


[Figure omitted in this text version: two scatter-plot panels, “Raw data” (left) and “Skinny-dip clustering result” (right).]

Figure 5.10: An example data set on which SkinnyDip achieves a less-than-optimal result (univariate projections limitation).

[Figure omitted in this text version: two scatter-plot panels, “Raw Data” (left) and “Skinny-Dip Clustering Result” (right).]

Figure 5.11: An example data set on which SkinnyDip achieves a less-than-optimal result (“boxed” or “rectangular” cluster-models limitation).



[Figure omitted in this text version: a time-series plot titled “Website Response Time (pingdom.com)”; x-axis: time, y-axis: website response time (ms), ranging from roughly 600 to 1600 ms.]

Figure 5.12: An example of non-Benford data: server response times for a popular website.
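A practitioner can cheaply check whether a metric is a plausible candidate for BenFound before deploying it. The following sketch (our own illustration, not part of the BenFound implementation) compares the leading-digit distribution of a positive sample against Benford’s Law, P(d) = log10(1 + 1/d), via a chi-squared goodness-of-fit test:

# Leading (most-significant) digit of each positive value.
first_digit <- function(x) {
  x <- x[x > 0]
  floor(x / 10^floor(log10(x)))
}
benford <- log10(1 + 1 / (1:9))     # Benford probabilities for digits 1..9
values <- rlnorm(10000, sdlog = 4)  # a wide log-normal is close to Benford
observed <- tabulate(first_digit(values), nbins = 9)
chisq.test(observed, p = benford)   # a large p-value gives no evidence against BL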

5.4.8 The Multiple Comparisons Problem

A subset of our proposed methods apply multiple statistical hypothesis tests in an automated fashion. This raises the question as to whether the multiple-comparisons problem is applicable [Sal10]. SkinnyDip, for example, may apply the dip test multiple times during its execution. In statistics, the application of multiple simultaneous hypothesis tests to a given sample should be accompanied by changing the significance level α to a more conservative (lower) value. Controlling procedures like Bonferroni correction [Dun61] can be used to this end. Bonferroni correction compensates for the fact that, as the number of tests increases, it becomes more likely that at least one test will report a significant result simply by chance.
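As a brief reminder of the mechanics: with m simultaneous tests at family-wise level α, Bonferroni correction holds each individual test to the threshold α/m. A small sketch using base R’s p.adjust (the p-values are hypothetical):

alpha <- 0.05
p_values <- c(0.004, 0.020, 0.049)  # hypothetical p-values from m = 3 tests
m <- length(p_values)
p_values < alpha / m                       # which tests survive the corrected threshold
p.adjust(p_values, method = "bonferroni")  # equivalently, inflate the p-values by m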

In the case of SkinnyDip, we note the following with respect to the multiple comparisons problem:

• The dip test is used in a heuristic manner, and it is not known a priori how many tests will be applied. That is, the UniDip algorithm on which SkinnyDip is based will perform k tests, where k is the number of modal intervals (clusters) found at “runtime”.

• UniDip ceases to apply the dip as soon as it finds a non-significant result.

• The size of the sample changes for each application of the dip (each is a subset of the full data set), meaning that the test theoretically loses power as UniDip progresses.

These observations complicate the situation, making it difficult to determine whether or not the multiple comparisons problem is a valid concern in SkinnyDip, and how correction should be performed if so. Although we believe that SkinnyDip has more important limitations that should be addressed first (e.g. the reliance on univariate projections), we concede that there may be a limitation due to this multiple-comparisons effect as well. A future investigation into the multiple comparisons problem in the context of SkinnyDip may yield a more robust algorithm.

A statistical hypothesis test is also applied in an automated fashion in our method BenFound. In this case, the sample changes for each application of the test, so again the situation is more complex and not directly equivalent to the classical multiple-comparisons scenario.

5.5 Future Research

Our MDFS framework is one which leaves room for additional research. Perhaps the most interesting question is: what other data types are admissible to MDFS? We have identified a particular kind of graph structure in which it makes sense to mix instances (“itemsets over an ontology”). Other kinds of graph-based structures could also be investigated. A multitree, for example, can be used to represent multiple overlapping taxonomies over the same ground set. Multitrees are a class of directed acyclic graph with easily-identifiable substructures that are trees. Such structures are found in various applications, like a family tree which contains multiple inter-family marriages but no blood-relative marriages, or a corpus of academic material that is structured into a syllabus in different hierarchical ways by different university professors [FZ94]. List- or set-based objects may also be appropriate. For example, we can imagine a classical “itemsets” feature for which the measurement is a set of objects from some catalog (as in market-basket analysis). The “mixing” mechanism for such a feature would be the set union operator.
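A tiny sketch makes the idea concrete (the basis itemsets are hypothetical; MDFS would additionally require a suitable contrast function for the feature):

# Two "basis" itemsets; an object that uses both patterns is
# reconstructed as the set union of their items.
basis <- list(p1 = c("milk", "bread"),
              p2 = c("bread", "eggs", "butter"))
usage <- c(p1 = TRUE, p2 = TRUE)  # usage row of one object
Reduce(union, basis[usage])       # "milk" "bread" "eggs" "butter"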

Perhaps the next most obvious improvement for TMF and the MDFS framework would be automated model-order selection. That is, the algorithm could be completely automated by removing the need for the user to provide the parameter k. Automatic model-order selection is a difficult problem, particularly considering that algorithms like FasTer can only approximately calculate the solution for any given k. In our work on MDFS, we used the “elbow” heuristic to estimate a sensible model order k before moving on to interpret the data (Figure 5.13). Some approaches for model-order selection try to automate the detection of such a point. More sophisticated approaches are based on various ways of interpreting Occam’s Razor, which suggests that “among competing hypotheses, the one with the fewest assumptions should be selected”. The Minimum Description Length principle [Ris78; Grü07] or the Bayesian Information Criterion [Sch+78] are often used in this light, helping to find a balance between a model’s complexity and its performance. An investigation into automatic model-order selection for TMF and MDFS would ideally compare a number of such approaches on a variety of synthetic and real-world data, similar to the approach used in [MV14].
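A naive automation of the elbow heuristic, for instance, could pick the rank at which the error curve bends most sharply. A sketch with hypothetical error values:

errors <- c(620, 510, 430, 370, 355, 348, 344, 342)  # error for k = 1..8
bend <- diff(diff(errors))      # second differences, aligned with k = 2..7
k_elbow <- which.max(bend) + 1  # the k with the sharpest bend
k_elbow                         # 4 for these illustrative values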

[Figure omitted in this text version: two line plots of reconstruction error against the decomposition rank k (k = 2, …, 16), for the Yummly data (left) and the IMDB data (right).]

Figure 5.13: Model-order selection for Yummly (left) and IMDB (right) data. The “elbow” points are approximately k = 6 and k = 9 respectively.

Another direction for further research on our proposed matrix decompositions relates to the investigation of additional constraints. Like BMF and OMF, our formulations of the TMF and MDFS problems impose no strict constraints on the nature of the factors in the basis matrix (other than that they should contain entries from the finite set in question, of course). The objective function is based solely on the reconstruction error. In techniques from linear algebra, the factors are typically constrained. In PCA we see the constraint of orthogonality; in ICA we see the constraint of statistical independence; in NMF we see the constraint of non-negativity. Interestingly, there already exist BMF variants that dictate similar constraints. For example, the Boolean CX (BCX) Decomposition problem [Mie08b] is similar to BMF; however, the set of basis vectors in BCX must be a subset of the original (observed) vectors. The Boolean CUR problem is a further variant with similar constraints [Mie08b]. Unfortunately, the practical usefulness of these decompositions is not entirely clear, as the original article (and, to the best of our knowledge, subsequent work) does not show real-world applications which serve to motivate the choice of these constrained techniques over vanilla BMF. Despite this, and even if only for theoretical purposes, one could investigate analogous constrained versions of TMF and MDFS. Future applications may find such variants useful.

Our clustering algorithm SkinnyDip is the first to consider the use of the dip test. Although elegant, the dip is only usable on univariate data, so important information is sometimes lost when SkinnyDip performs the necessary univariate projections from multivariate data. To improve on this, SkinnyDip could be modified to perform multiple “projection rounds” on the data (perhaps heuristic-based), with the results aggregated in an ensemble-like manner.
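For reference, the dip test itself is readily available: the CRAN package diptest implements Hartigan’s dip statistic and test. A minimal example of the significance decision that drives UniDip’s recursion (the sample here is synthetic, assuming diptest is installed):

library(diptest)
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 5))  # two well-separated modes
dip.test(x)  # p-value well below alpha = 0.05: unimodality is rejected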

Overcoming the “box clusters” limitation of SkinnyDip may be a comparatively easier problem to solve. One approach might work as follows. Each detected SkinnyDip cluster could have its mean density computed. This density, along with the detected cluster points, could serve as a “seed” for “elaborating” on the cluster in a local, density-based fashion. The propagation algorithm might be similar to that of DBSCAN, for example [Est+96]. Another approach might be to create multiple overlapping “boxes” for the same cluster from different projection angles, again using the results to “fine-tune” the shape of each cluster.

With respect to our work on exploiting Benford’s Law (BL), it is difficult not to be intrigued by this “gem of statistical folklore” [BH11]. Indeed, it remains somewhat of a mystery, having not yet been fully explained [BH11]. To the best of our knowledge, our contribution is the first that exploits the law as the basis of a data-mining technique. We anticipate that BenFound, our non-parametric anomaly-detection technique based on BL, will consequently generate some interest in the data-mining community. Instead of being used as the basis for a “red-flagging” anomaly-detection technique, BL might be used as the basis for ranking the “authenticity” of a set of phenomena. Ranking Twitter hashtags, YouTube channels, or sets of networks or graphs are some concrete ideas that could be investigated in this respect.

Finally, there is room in all our work for even more aggressive hardware optimization. Although we have optimized some of our techniques for shared-memory parallel environments, we did not consider the performance gains that are possible through graphics cards. The performance of FasTer, for example, depends on the ability to efficiently evaluate bitwise instructions. A modern CPU core can typically execute four 32-bit instructions per clock (using a 128-bit SSE instruction), whereas many modern GPUs can execute thousands of such instructions per clock (having many more ALUs or “shaders”). Although other factors would of course need to be considered (“executive work” like branching logic), we expect that a graphics-card implementation of FasTer would yield significant performance gains.


6 Conclusion

Data mining is the application of specific algorithms to prepared data for the purpose of either prediction or description. The scope of this thesis was restricted to exploratory, unsupervised, descriptive data mining. We have advanced the state-of-the-art by presenting novel methods which address a number of complexities in modern information systems: 1) heterogeneous data types and measurement scales, 2) missing information, 3) clutter and noise, 4) high dimensionality, and 5) high-bandwidth time-series data. We designed linear-time algorithms for our problems, discussed their properties, and empirically evaluated them with respect to 1) synthetically-generated data, 2) real-world data, and 3) run-time behavior.

From a research perspective, this thesis offers a number of results. To help address the complex challenges of heterogeneous data types, heterogeneous measurement scales and missing values, we presented Ternary Matrix Factorization and the Matrix Factorizations over Discrete Finite Sets framework. To help address the complex challenges of clutter and high dimensionality in clustering, we presented the novel clustering paradigm SkinnyDip. To help address the complex challenge of anomaly detection in high-bandwidth time-series data, we presented the method BenFound, which exploits Benford’s Law. We thus achieved our goals by demonstrating how these kinds of complexities can be addressed in variants of the most fundamental kinds of data-mining problems (finding associations, clustering objects and detecting anomalies).

The results of this thesis are beneficial to practitioners from a number of perspectives. From the perspective of effectiveness, we showed empirically that each algorithm can outperform the state-of-the-art on synthetic and real-world data with respect to appropriate quality measures. From the perspective of efficiency, we showed that each algorithm exhibits practically-linear run-time growth in the size of the data. From the perspective of usability, we showed that none of our techniques require “obscure” parameters. From the perspective of interpretability, we showed that the majority of our algorithms yield directly-interpretable results in the context of the domain (only BenFound requires some level of post-processing). Finally, we discussed how all implementations are provided in publicly-accessible repositories online, and how all results are reproducible.

We identified a number of limitations. Our methods are mostly based on heuristics and cannot, in the general case, solve the underlying (“hard”) problems optimally. We provide only a single approximation guarantee, for a sub-problem of one method. Our evaluation was mainly empirical and did not cover all possible problem instances. FasTer and Finesse require the provision of the parameter k, the number of patterns to find. SkinnyDip may yield sub-optimal results as a consequence of its univariate projections and “blocky” cluster bounding. BenFound is not immune to false negatives, and can only be applied to systems with metrics that follow Benford’s Law in their “natural” state.

We identified a number of future research directions. Further concrete types of “mixable” finite sets could be identified and implemented for MDFS, thereby widening its applicability. Enhancements to SkinnyDip could be made with the aim of overcoming some of its limitations, particularly its tendency to produce “rectangular” clusters and its dependency on univariate projections. Finally, the real-time concepts in BenFound might be applied to data contexts other than univariate time series, like monitoring the flow of information between nodes of a network.


Bibliography

[AIS93] R. Agrawal, T. Imielinski, and A. Swami. “Mining association rules between sets of items in large databases.” In: ACM SIGMOD Record. Vol. 22. 2. ACM, 1993, pp. 207–216.

[AKZ08] E. Achtert, H.-P. Kriegel, and A. Zimek. “ELKI: a software system for evaluation of subspace clustering algorithms.” In: International Conference on Scientific and Statistical Database Management. Springer, 2008, pp. 580–585.

[Alb+16] F. D. Albareti, C. A. Prieto, A. Almeida, F. Anders, S. Anderson, B. H. Andrews, A. Aragon-Salamanca, M. Argudo-Fernandez, E. Armengaud, E. Aubourg, et al. “The Thirteenth Data Release of the Sloan Digital Sky Survey: First Spectroscopic Data from the SDSS-IV Survey MApping Nearby Galaxies at Apache Point Observatory.” In: arXiv preprint arXiv:1608.02013 (2016).

[Ank+99] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. “OPTICS: ordering points to identify the clustering structure.” In: ACM SIGMOD Record. Vol. 28. 2. ACM, 1999, pp. 49–60.

[Aro98] S. Arora. “The approximability of NP-hard problems.” In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 337–348.

[BC57] R. Bellman and R. Corporation. Dynamic Programming. Rand Corporation research study. Princeton University Press, 1957. ISBN: 9780691079516.

[Ben+10] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. “Detecting spammers on twitter.” In: Collaboration, electronic messaging, anti-abuse and spam conference (CEAS). Vol. 6. 2010, p. 12.

[Ben38] F. Benford. “The law of anomalous numbers.” In: Proceedings of the American Philosophical Society (1938), pp. 551–572.

[BG04] H.-J. Bungartz and M. Griebel. “Sparse grids.” In: Acta Numerica 13 (2004), pp. 147–269.

[BH11] A. Berger and T. P. Hill. “Benford’s law strikes back: no simple explanation in sight for mathematical gem.” In: The Mathematical Intelligencer 33.1 (2011), pp. 85–91.


[Bis81] J. Biskup. “A formal approach to null values in database relations.” In: Advances in Data Base Theory. Springer, 1981, pp. 299–341.

[BJ08] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. 2008.

[BK13] R. Belohlavek and M. Krmelova. “Beyond Boolean matrix decompositions: toward factor analysis and dimensionality reduction of ordinal data.” In: 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 961–966.

[BMD09] C. Boutsidis, M. W. Mahoney, and P. Drineas. “An improved approximation algorithm for the column subset selection problem.” In: Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2009, pp. 968–977.

[Boa08] OpenMP Architecture Review Board. “OpenMP application program interface version 3.0.” In: The OpenMP Forum, Tech. Rep. 2008.

[Böh+06] C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. “Robust information-theoretic clustering.” In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 65–75.

[BP09] P. Burman and W. Polonik. “Multivariate mode hunting: Data analytic tools with measures of significance.” In: Journal of Multivariate Analysis 100.6 (2009), pp. 1198–1218.

[Bro00] A. W. Bronkhorst. “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions.” In: Acta Acustica united with Acustica 86.1 (2000), pp. 117–128.

[Buh+13] H. U. Buhl, M. Röglinger, F. Moser, J. Heidemann, et al. “Big data.” In: Business & Information Systems Engineering 5.2 (2013), pp. 65–69.

[BY09] R. S. Baker and K. Yacef. “The state of educational data mining in 2009: A review and future visions.” In: JEDM – Journal of Educational Data Mining 1.1 (2009), pp. 3–17.

[Car+00] R. D. Carr, S. Doddi, G. Konjevod, and M. V. Marathe. “On the red-blue set cover problem.” In: Symposium on Discrete Algorithms: Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms. Vol. 9. 11. Citeseer, 2000, pp. 345–353.

[CFF00] A. Cuevas, M. Febrero, and R. Fraiman. “Estimating the number of clusters.” In: Canadian Journal of Statistics 28.2 (2000), pp. 367–382.

[CM02] K. J. Cios and G. W. Moore. “Uniqueness of medical data mining.” In: Artificial Intelligence in Medicine 26.1 (2002), pp. 1–24.


[ÇM09] A. Çivril and M. Magdon-Ismail. “On selecting a maximum volume sub-matrix of a matrix and related problems.” In: Theoretical Computer Science 410.47 (2009), pp. 4801–4811.

[Dha13] V. Dhar. “Data science and prediction.” In: Communications of the ACM 56.12 (2013), pp. 64–73.

[DHP04] C. Durtschi, W. Hillison, and C. Pacini. “The effective use of Benford’s law to assist in detecting fraud in accounting data.” In: Journal of Forensic Accounting 5.1 (2004), pp. 17–34.

[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. “Maximum likelihood from incomplete data via the EM algorithm.” In: Journal of the Royal Statistical Society. Series B (Methodological) (1977), pp. 1–38.

[DR98] A. Dasgupta and A. E. Raftery. “Detecting features in spatial point processes with clutter via model-based clustering.” In: Journal of the American Statistical Association 93.441 (1998), pp. 294–302.

[Dun61] O. J. Dunn. “Multiple comparisons among means.” In: Journal of the American Statistical Association 56.293 (1961), pp. 52–64.

[Eis+11] D. J. Eisenstein, D. H. Weinberg, E. Agol, H. Aihara, C. A. Prieto, S. F. Anderson, J. A. Arns, É. Aubourg, S. Bailey, E. Balbinot, et al. “SDSS-III: Massive spectroscopic surveys of the distant universe, the Milky Way, and extra-solar planetary systems.” In: The Astronomical Journal 142.3 (2011), p. 72.

[Est+96] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” In: KDD. Vol. 96. 34. 1996, pp. 226–231.

[FB75] J. D. Francis and L. Busch. “What We Now Know about ‘I Don’t Knows’.” In: The Public Opinion Quarterly 39.2 (1975), pp. 207–218.

[FPS96] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. “From data mining to knowledge discovery in databases.” In: AI Magazine 17.3 (1996), p. 37.

[FZ94] G. W. Furnas and J. Zacks. “Multitrees: enriching and reusing hierarchical structure.” In: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 1994, pp. 330–336.

[Gol15] J. Golbeck. “Benford’s Law Applies to Online Social Networks.” In: PLoS ONE 10.8 (2015), e0135169.

[Grü07] P. D. Grünwald. The minimum description length principle. MIT Press, 2007.


[Hal+09] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. “The WEKA data mining software: an update.” In: ACM SIGKDD Explorations Newsletter 11.1 (2009), pp. 10–18.

[HH85] J. A. Hartigan and P. Hartigan. “The dip test of unimodality.” In: The Annals of Statistics (1985), pp. 70–84.

[HKO04] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control. Wiley, 2004. ISBN: 9780471464198.

[HL04] M. Hu and B. Liu. “Mining and summarizing customer reviews.” In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 168–177.

[Hor12] K. Hornik. “The comprehensive R archive network.” In: Wiley Interdisciplinary Reviews: Computational Statistics 4.4 (2012), pp. 394–398.

[HPP00] B. Heisele, T. Poggio, and M. Pontil. Face Detection in Still Gray Images. A.I. memo 1687. Cambridge, MA: Center for Biological and Computational Learning, MIT, 2000.

[HTK15] K. Hayashi, M. Takanori, and K.-i. Kawarabayashi. “Real-time topic detection on twitter with topic hijack filtering.” In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 417–426.

[IM98] P. Indyk and R. Motwani. “Approximate nearest neighbors: towards removing the curse of dimensionality.” In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 604–613.

[JKM14] N. A. James, A. Kejariwal, and D. S. Matteson. “Leveraging Cloud Data to Mitigate User Experience from ‘Breaking Bad’.” In: arXiv preprint arXiv:1411.7955 (2014).

[K+11] H. C. Koh, G. Tan, et al. “Data mining applications in healthcare.” In: Journal of Healthcare Information Management 19.2 (2011), p. 65.

[Kan+02] R. V. Kantety, M. La Rota, D. E. Matthews, and M. E. Sorrells. “Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat.” In: Plant Molecular Biology 48.5-6 (2002), pp. 501–510.

[Kar+04] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. “kernlab – an S4 package for kernel methods in R.” In: (2004).

[Kar72] R. M. Karp. “Reducibility among combinatorial problems.” In: Complexity of Computer Computations. Springer, 1972, pp. 85–103.


[KB11] D. Kumar and D. Bhardwaj. “Rise of data mining: current and future application areas.” In: IJCSI International Journal of Computer Science Issues 8.5 (2011).

[KKZ09] H.-P. Kriegel, P. Kröger, and A. Zimek. “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering.” In: ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009), p. 1.

[KL05] A. Krause and V. Liebscher. “Multimodal projection pursuit using the dip statistic.” In: Preprint-Reihe Mathematik 13 (2005).

[Kri+07] H.-P. Kriegel, K. M. Borgwardt, P. Kröger, A. Pryakhin, M. Schubert, and A. Zimek. “Future trends in data mining.” In: Data Mining and Knowledge Discovery 15.1 (2007), pp. 87–97.

[KSS03] M. Kuhlmann, D. Shohat, and G. Schimpf. “Role mining – revealing business roles for security administration using data mining technology.” In: Proceedings of the eighth ACM symposium on Access control models and technologies. ACM, 2003, pp. 179–186.

[LAF15] N. Laptev, S. Amizadeh, and I. Flint. “Generic and scalable framework for automated time-series anomaly detection.” In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1939–1947.

[LK07] D. Liben-Nowell and J. Kleinberg. “The link-prediction problem for social networks.” In: Journal of the American Society for Information Science and Technology 58.7 (2007), pp. 1019–1031.

[LOP14] C. Lucchese, S. Orlando, and R. Perego. “A Unifying Framework for Mining Approximate Top-k Binary Patterns.” In: IEEE Transactions on Knowledge and Data Engineering 26.12 (2014), pp. 2900–2913.

[LRL07] J. Li, S. Ray, and B. G. Lindsay. “A nonparametric statistical approach to clustering via mode identification.” In: Journal of Machine Learning Research 8.Aug (2007), pp. 1687–1723.

[LS99] D. D. Lee and H. S. Seung. “Learning the parts of objects by non-negative matrix factorization.” In: Nature 401.6755 (1999), pp. 788–791.

[Mac+67] J. MacQueen et al. “Some methods for classification and analysis of multivariate observations.” In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. 14. Oakland, CA, USA, 1967, pp. 281–297.


[McG+02] A. McGrail, E. Gulski, E. Groot, D. Allan, D. Birtwhistle, and T. Blackburn. “Datamining techniques to assess the condition of high voltage electrical plant.” In: CIGRE Paris WG15 11 (2002).

[MHR10] J. Marchini, C. Heaton, and B. Ripley. fastICA: FastICA Algorithms to perform ICA and Projection Pursuit. R package version 1.1-13. 2010.

[Mie+08] P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. “The discrete basis problem.” In: IEEE Transactions on Knowledge and Data Engineering 20.10 (2008), pp. 1348–1362.

[Mie08a] P. Miettinen. “On the positive–negative partial set cover problem.” In: Information Processing Letters 108.4 (2008), pp. 219–221.

[Mie08b] P. Miettinen. “The boolean column and column-row matrix decompositions.” In: Data Mining and Knowledge Discovery 17.1 (2008), pp. 39–56.

[Mie09] P. Miettinen. “Matrix decomposition methods for data mining: Computational complexity and algorithms.” PhD thesis. University of Helsinki, 2009.

[Mil56] G. A. Miller. “The magical number seven, plus or minus two: Some limits on our capacity for processing information.” In: Psychological Review 63.2 (1956), p. 81.

[MJ14] D. S. Matteson and N. A. James. “A nonparametric approach for multiple change point analysis of multivariate data.” In: Journal of the American Statistical Association 109.505 (2014), pp. 334–345.

[MP14] S. Maurus and C. Plant. “Ternary matrix factorization.” In: 2014 IEEE International Conference on Data Mining. ©2014 IEEE, reprinted with permission. 2014, pp. 400–409.

[MP16a] S. Maurus and C. Plant. “Factorizing Complex Discrete Data with Finesse.” In: 2016 IEEE International Conference on Data Mining. ©2016 IEEE, reprinted with permission. 2016.

[MP16b] S. Maurus and C. Plant. “Skinny-dip: Clustering in a Sea of Noise.” In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (2016), pp. 1055–1064.

[MP16c] S. Maurus and C. Plant. “Ternary Matrix Factorization: problem definitions and algorithms.” In: Knowledge and Information Systems 46.1 (2016), pp. 1–31.


[MP17] S. Maurus and C. Plant. “Let’s see your Digits: Anomalous-State Detection using Benford’s Law.” In: Submitted to the Research Track of the 2017 SIAM International Conference on Data Mining (2017).

[MV14] P. Miettinen and J. Vreeken. “mdl4bmf: Minimum description length for Boolean matrix factorization.” In: ACM Transactions on Knowledge Discovery from Data (TKDD) 8.4 (2014), p. 18.

[Nig99] M. J. Nigrini. “I’ve got your number: How a mathematical phenomenon can help CPAs uncover fraud and other irregularities.” In: Journal of Accountancy 187.5 (1999).

[NW12] M. Nigrini and J. Wells. Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection. Wiley Corporate F&A. Wiley, 2012. ISBN: 9781118152850.

[OL10] A. C. O’Connor and R. J. Loomis. “2010 Economic Analysis of Role-Based Access Control.” In: NIST, Gaithersburg, MD 20899 (2010).

[Ooi12] H. Ooi. “Density visualization and mode hunting using trees.” In: Journal of Computational and Graphical Statistics (2012).

[Par62] E. Parzen. “On estimation of a probability density function and mode.” In: The Annals of Mathematical Statistics 33.3 (1962), pp. 1065–1076.

[Pen11] R. D. Peng. “Reproducible research in computational science.” In: Science 334.6060 (2011), pp. 1226–1227.

[Poe+88] G. S. Poe, I. Seeman, J. McLaughlin, E. Mehl, and M. Dietz. “‘Don’t Know’ Boxes in Factual Questions in a Mail Questionnaire: Effects on Level and Quality of Response.” In: Public Opinion Quarterly 52.2 (1988), pp. 212–222.

[Rig10] G. Rigaill. Pruned dynamic programming for optimal multiple change-point detection. 2010.

[Ris78] J. Rissanen. “Modeling by shortest data description.” In: Automatica 14.5 (1978), pp. 465–471.

[Rit+11] A. Ritter, S. Clark, Mausam, and O. Etzioni. “Named Entity Recognition in Tweets: An Experimental Study.” In: EMNLP. 2011.

[Rit+12] A. Ritter, Mausam, O. Etzioni, and S. Clark. “Open Domain Event Extraction from Twitter.” In: KDD. 2012.

[Ros+56] M. Rosenblatt et al. “Remarks on some nonparametric estimates of a density function.” In: The Annals of Mathematical Statistics 27.3 (1956), pp. 832–837.

[Sal10] N. Salkind. Encyclopedia of Research Design. SAGE Publications, 2010. ISBN: 9781506319315.


[Sch+78] G. Schwarz et al. “Estimating the dimension of a model.” In: The Annals of Statistics 6.2 (1978), pp. 461–464.

[SI84] S. Z. Selim and M. A. Ismail. “K-means-type algorithms: a generalized convergence theorem and characterization of local optimality.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1984), pp. 81–87.

[SK09] X. Su and T. M. Khoshgoftaar. “A survey of collaborative filtering techniques.” In: Advances in Artificial Intelligence 2009 (2009), p. 4.

[Sla96] P. Slavík. “A tight analysis of the greedy algorithm for set cover.” In: Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. ACM, 1996, pp. 435–441.

[SRJ04] N. Srebro, J. Rennie, and T. S. Jaakkola. “Maximum-margin matrix factorization.” In: Advances in Neural Information Processing Systems. 2004, pp. 1329–1336.

[SS13] S. Sagiroglu and D. Sinanc. “Big data: A review.” In: Collaboration Technologies and Systems (CTS), 2013 International Conference on. IEEE, 2013, pp. 42–47.

[Ste46] S. S. Stevens. On the theory of scales of measurement. 1946.

[SVM06] J.-F. Superby, J. Vandamme, and N. Meskens. “Determination of factors influencing the achievement of the first-year university students using data mining methods.” In: Workshop on Educational Data Mining. Vol. 32. Citeseer, 2006, p. 234.

[Tip+15] S. Tippmann et al. “Programming tools: Adventures with R.” In: Nature 517.7532 (2015), pp. 109–110.

[TKB12] C. Thurau, K. Kersting, and C. Bauckhage. “Deterministic CUR for Improved Large-Scale Data Analysis: An Empirical Study.” In: SDM. SIAM, 2012, pp. 684–695.

[TL10] L. Tang and H. Liu. “Community detection and mining in social media.” In: Synthesis Lectures on Data Mining and Knowledge Discovery 2.1 (2010), pp. 1–137.

[Vav+] S. Vavilis, A. I. Egner, M. Petkovic, and N. Zannone. Role Mining with Missing Values. Tech. rep. Eindhoven University of Technology (Security Group) and Philips Research Europe.

[VEB09] N. X. Vinh, J. Epps, and J. Bailey. “Information theoretic measures for clusterings comparison: is a correction for chance necessary?” In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1073–1080.


[VHK14] O. Vallis, J. Hochenbaum, and A. Kejariwal. “A novel technique for long-term anomaly detection in the cloud.” In: 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14). 2014.

[VS08] J. Vreeken and A. Siebes. “Filling in the blanks – Krimp minimisation for missing data.” In: 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 1067–1072.

[Wan+05] J. T. Wang, M. J. Zaki, H. T. Toivonen, and D. Shasha. “Introduction to Data Mining in Bioinformatics.” In: Data Mining in Bioinformatics. Springer, 2005, pp. 3–8.

[Wil45] F. Wilcoxon. “Individual comparisons by ranking methods.” In: Biometrics Bulletin 1.6 (1945), pp. 80–83.

[WM02] W.-K. Wong and A. Moore. “Efficient algorithms for non-parametric clustering with clutter.” In: Proceedings of the 34th Interface Symposium. 2002.

[Xie15] Y. Xie. Dynamic Documents with R and knitr. Vol. 29. CRC Press, 2015.

[Yao03] Y. Yao. “Information-theoretic measures for knowledge discovery and data mining.” In: Entropy Measures, Maximum Entropy Principle and Emerging Applications. Springer, 2003, pp. 115–136.

[YHX14] X. Yu, D. Hu, and J. Xu. Blind Source Separation: Theory and Applications. Wiley, 2014. ISBN: 9781118679845.

[YM12] P. Yadava and P. Miettinen. “BMF with missing values.” In: Master’s Thesis, University of Saarland (2012).

[YW06] Q. Yang and X. Wu. “10 challenging problems in data mining research.” In: International Journal of Information Technology & Decision Making 5.04 (2006), pp. 597–604.

[YWB11] X. Ying, X. Wu, and D. Barbará. “Spectrum based fraud detection in social networks.” In: 2011 IEEE 27th International Conference on Data Engineering. IEEE, 2011, pp. 912–923.

[Zan82] C. Zaniolo. “Database relations with null values.” In: Proceedings of the 1st ACM SIGACT-SIGMOD symposium on Principles of database systems. ACM, 1982, pp. 27–33.

[ZP05] L. Zelnik-Manor and P. Perona. “Self-tuning spectral clustering.” In: Advances in Neural Information Processing Systems 17 (2005).

[ZSK12] A. Zimek, E. Schubert, and H.-P. Kriegel. “A survey on unsupervised outlier detection in high-dimensional numerical data.” In: Statistical Analysis and Data Mining 5.5 (2012), pp. 363–387.


List of Figures

1.1 The Knowledge Discovery in Databases workflow, inspired by Figure 1 in [FPS96].
1.2 Counts of publications over time that include “clustering”, “anomaly” or “frequent itemset” in the title respectively (source: DBLP).
2.1 An example of Blind-Source Separation using Independent Component Analysis (using the R package fastICA [MHR10]). Note the ambiguities: ICA is generally not able to reconstruct the amplitude and the sign of the original signals.
2.2 PCA finds orthogonal directions in the data, ordered such that each successive direction explains the maximum possible remaining variance.
2.3 Top 25 basis vectors found by performing PCA (left) and NMF (right) on the CBCL faces data [HPP00].
2.4 Four different two-dimensional spatial data sets.
2.5 An example k-means result. The locations of the cluster prototypes (also known as “centers” or “centroids”) are given by the yellow + markers.
2.6 k-Means (left) and hard-EM (right) clustering algorithms applied to a common data set. EM models each cluster using a bivariate Gaussian with different distribution parameters.
2.7 A DBSCAN result on a data set (over the unit square) containing clusters of non-convex form. The parameters used were ε = 0.01, minPts = 5. Light gray is used to represent points labeled by DBSCAN as noise.
2.8 The spectral clustering result on the spirals data (using the R kernlab package [Kar+04]). We note that density-based algorithms like DBSCAN can find a similar solution.
3.1 A simple data matrix that records the participation of students in courses.
3.2 Finding latent disciplines from students and courses.
3.3 Finding latent canine temperaments from dogs and traits.
3.4 A simulated minefield (left) in the presence of “clutter” (like metal objects or rocks). This data set was generated based on similar data presented in [DR98].
3.5 A view of galaxy data from the Sloan Digital Sky Survey [Alb+16; Eis+11] as considered in [WM02].
3.6 Twitter Anomaly Detection finds significant deviations in signals containing noise, short-term seasonality and long-term trend.
3.7 Partial screenshot of the Twitter Status Calendar (statuscalendar.com) – a demonstration of the system described in [Rit+11; Rit+12].
5.1 A basic culinary ingredients ontology. The bold sub-tree represents (the recipe with) the ingredients list scallions, feta, sirene and butter.
5.2 A matrix factorization of discrete movie data.
5.3 GreEss achieves the optimal decomposition on this synthetic, no-noise ordinal data. The basis matrix is shown below, and the usage matrix to the right. It correctly identifies the three ground-truth patterns (k = 3).
5.4 GreEss requires 24 basis vectors to exactly reconstruct this synthetic, noisy data set.
5.5 Boolean Matrix Factorization for synthetic data with varying data-generation parameters k, λ, ρ and η.
5.6 Results from further clustering techniques on the “running example” data in Paper D. MCLUST is the algorithm used in [DR98] (discussed in Section 3.4.1).
5.7 The UniDip result on a univariate sample (from left: a Beta distribution, a Gaussian distribution, a uniform distribution, and 50% “global” clutter over the interval [0, 1]).
5.8 Using a constant sample size of the Levy function, the curves show the minimum value found by varying dimensionality when using 1) simple random sampling and 2) sparse-grid-based sampling.
5.9 A reproduction of Figure 2 in Paper D, with the addition of SkinnyDip’s Normalized Mutual Information values (the NMI is the top-most dotted red line series, seen slightly above the blue AMI measurement for SkinnyDip).
5.10 An example data set on which SkinnyDip achieves a less-than-optimal result (univariate projections limitation).
5.11 An example data set on which SkinnyDip achieves a less-than-optimal result (“boxed” or “rectangular” cluster-models limitation).
5.12 An example of non-Benford data: server response times for a popular website.
5.13 Model-order selection for Yummly (left) and IMDB (right) data. The “elbow” points are approximately k = 6 and k = 9 respectively.


List of Tables

4.1 Existing real-world data sourced from public repositories
4.2 Real-world data sourced from public Application Programming Interfaces
4.3 Sources for implementations of comparison algorithms
