Download pdf - Einführung in die Informatik: Systeme und Anwendungendmm.dbs.ifi.lmu.de/content/lehre/SS19/infonf/skript/InfoNF_04_Teil01.pdf · Die Methode findet Muster, die einem bestimmten Modell

LUDWIG-MAXIMILIANS-UNIVERSITYMUNICH

DATABASESYSTEMSGROUP

DEPARTMENTINSTITUTE FORINFORMATICS

Kapitel 4: Data Mining

Skript zur Vorlesung:

Einführung in die Informatik: Systeme und AnwendungenSommersemester 2019

Vorlesung: Prof. Dr. Christian Böhm

Übungen: Dominik Mautz

Skript © Christian Böhm

http://dmm.dbs.ifi.lmu.de/infonf


Einführung in die Informatik: Systeme und Anwendungen – SoSe 2019

Überblick

4.1 Einleitung

4.2 Clustering

4.3 Klassifikation


2



Motivation


Telefongesellschaft

Astronomie

Kreditkarten Scanner-Kassen

• Riesige Datenmengen werden in Datenbanken gesammelt

• Analysen können nicht mehr manuell durchgeführt werden

Datenbanken

3



Definition KDD

• Knowledge Discovery in Databases (KDD) ist der Prozess der

(semi-) automatischen Extraktion von Wissen aus Datenbanken,

das

• gültig

• bisher unbekannt

• und potentiell nützlich ist.

• Bemerkungen:

• (semi-) automatisch: im Unterschied zu manueller Analyse.

Häufig ist trotzdem Interaktion mit dem Benutzer nötig.

• gültig: im statistischen Sinn.

• bisher unbekannt: bisher nicht explizit, kein „Allgemeinwissen“.

• potentiell nützlich: für eine gegebene Anwendung.


4



Der KDD-Prozess (Modell)


Vorverar-

beitung

Trans-

formation

Datenbank

Fokussieren Data

MiningEvaluation

Muster Wissen

Prozessmodell nach Fayyad, Piatetsky-Shapiro & Smyth

Fok

uss

iere

n:

•B

esch

affu

ng d

er D

aten

•V

erw

altu

ng

(F

ile/

DB

)

•S

elek

tion r

elev

ante

r D

aten

Vorv

erar

bei

tun

g:

•In

teg

rati

on v

on D

aten

aus

unte

rsch

iedli

chen

Quel

len

•V

erv

oll

stän

dig

ung

•K

onsi

sten

zprü

fung

Tra

nsf

orm

atio

n•

Dis

kre

tisi

erung

num

eri-

sch

er M

erkm

ale

•A

ble

itung n

euer

Mer

km

ale

•S

elek

tion r

elev

ante

r M

erkm

.

Dat

a M

inin

g•

Gen

erie

run

g d

er M

ust

er

bzw

. M

od

elle

Eval

uat

ion

•B

ewer

tung d

er I

nte

ress

ant-

hei

tdurc

h d

en B

enutz

er

•V

alid

ieru

ng:

Sta

tist

isch

e

Prü

fun

g d

er M

odel

le

5



Objekt-Merkmale (Feature)


• Oft sind die betrachteten Objekte komplex

• Eine Aufgabe des KDD-Experten ist dann, geeignete Merkmale

(Features) zu definieren bzw. auszuwählen, die für die Unterscheidung

(Klassifikation, Ähnlichkeit) der Objekte relevant sind.

Beispiel: CAD-Zeichnungen:

Mögliche Merkmale:

• Höhe h

• Breite w

• Kurvatur-Parameter

(a,b,c)ax2+bx+c

6



Feature-Vektoren


(h, w, a, b, c)

ax2+bx+c

h

wh

wa

bc

Objekt-Raum Merkmals-Raum

• Im Kontext von statistischen Betrachtungen werden die Merkmale

häufig auch als Variablen bezeichnet

• Die ausgewählten Merkmale werden zu Merkmals-Vektoren (Feature

Vector) zusammengefasst

• Der Merkmalsraum ist häufig hochdimensional (im Beispiel 5-dim.)

7



Feature-Vektoren (weitere Beispiele)


Bilddatenbanken:

Farbhistogramme

Farbe

Häu

figk

eit

Gen-Datenbanken:

Expressionslevel

Text-Datenbanken:

Begriffs-Häufigkeiten

Der Feature-Ansatz ermöglicht einheitliche Behandlung von Objekten

verschiedenster Anwendungsklassen

Data 25

Mining 15

Feature 12

Object 7

...

8

http://images.google.de/imgres?imgurl=www.npl.co.uk/biotech/images/microarray.gif&imgrefurl=http://www.npl.co.uk/biotech/validfluo.html&h=348&w=360&prev=/images?q=microarray&svnum=10&hl=de&lr=&ie=UTF-8&oe=UTF-8

http://hmgc.mcw.edu/images/microarray.jpg



Feature: verschiedene Kategorien


Nominal (kategorisch)

Charakteristik:

Nur feststellbar, ob der Wert gleich oder verschieden ist. Keine Richtung (besser, schlechter) und kein Abstand.Merkmale mit nur zwei Werten nennt man dichotom.

Beispiele:

Geschlecht (dichotom)AugenfarbeGesund/krank (dichotom)

Ordinal

Charakteristik:

Es existiert eine Ordnungsrelation (besser/schlechter) zwischen den Kategorien, aber kein einheitlicher Abstand.

Beispiele:

Schulnote (metrisch?)GüteklasseAltersklasse

Metrisch

Charakteristik:

Sowohl Differenzen als auch Verhältnisse zwischen den Werten sind aussagekräftig. Die Werte können diskret oder stetig sein.

Beispiele:

Gewicht (stetig)Verkaufszahl (diskret)Alter (stetig oder diskret)

9



Ähnlichkeit von Objekten


• Spezifiziere Anfrage-Objekt qDB und…

– … suche ähnliche Objekte – Range-Query (Radius e)

RQ(q,e) = {o DB | (q,o) e }

– … suche die k ähnlichsten Objekte – Nearest Neighbor Query

NN(q,k) DB mit k Objekten, sodass

oNN(q,k), pDB\NN(q,k) : (q,o) (q,p)

10

alternative Schreibweise

für Mengendifferenz:

A\B = A – B



Ähnlichkeitsmaße im Feature-Raum


Euklidische Norm (L2):

1(x,y) = ((x1-y1)2+(x2-y2)

2+...)1/2

yx

Manhattan-Norm (L1):

2(x,y) = |x1-y1|+|x2-y2|+...

yx

Maximums-Norm (L):

(x,y) = max{|x1-y1|, |x2-y2|,...}

xy

Die Unähnlichkeitender einzelnen Merkmalewerden direkt addiert

Die Unähnlichkeit desam wenigsten ähnlichenMerkmals zählt

Abstand in Euklidischen Raum(natürliche Distanz)

Verallgemeinerung Lp-Abstandsmaß:

p(x,y) = (|x1-y1|p + |x2-y2|

p + ...)1/p

11



Gewichtete Ähnlichkeitsmaße

• Viele Varianten gewichten verschiedene Merkmale unterschiedlich

stark.


p

d

i

p

iiiwpyxwyx

=

-=

1

,),(

y

x

)()(),(1

yxyxyxT

--=-

y

x

Gewichtete Euklidische Distanz Mahalanobis Distanz

= Kovarianz-Matrix

12



Kategorien von Data Mining

• Wichtigste Data-Mining-Verfahren auf Merkmals-Vektoren:

– Clustering

– Outlier Detection

– Klassifikation

– Regression

• Supervised: In Trainingsphase wird eine Funktion gelernt, die in der Testphase

angewandt wird.

• Unsupervised: Es gibt keine Trainingsphase. Die Methode findet Muster, die

einem bestimmten Modell entsprechen.

• Darüber hinaus gibt es zahlreiche Verfahren, die nicht auf

Merkmalsvektoren, sondern direkt auf Texten, Mengen, Graphen

usw. arbeiten.


normalerweise unsupervised

normalerweise supervised

13



Clustering


Cluster 1: KlammernCluster 2: Nägel

Ein Grundmodell des Clustering ist:

Zerlegung (Partitionierung) einer Menge von Objekten (bzw. Feature-

Vektoren) in Teilmengen (Cluster), so dass• die Ähnlichkeit der Objekte innerhalb eines Clusters maximiert

• die Ähnlichkeit der Objekte verschiedener Cluster minimiert wird

Idee: Die verschiedenen Cluster repräsentieren meist unterschiedliche Klassen von

Objekten; bei evtl. unbek. Anzahl und Bedeutung der Klassen

14



Anwendung: Thematische Karten


Aufnahme der Erdoberflächein 5 verschiedenen Spektren

Pixel (x1,y1)

Pixel (x2,y2)

Wert in Band 1

Wer

t in

Ban

d 2

Wert in Band 1

Wer

t in

Ban

d 2

Cluster-Analyse

Rücktransformation

in xy-Koordinaten

Farbcodierung nach

Cluster-Zugehörigkeit

15

http://www.angelfire.com/stars2/farid/ast01/imagenes/satelite.jpg



Outlier Detection


Datenfehler?

Betrug?

Outlier Detection bedeutet:

Ermittlung von untypischen Daten

Anwendungen:

• Entdeckung von Missbrauch etwa bei • Kreditkarten

• Telekommunikation

• Datenbereinigung (Messfehler)

16



Klassifikation


SchraubenNägelKlammern

Aufgabe:

Lerne aus den bereits klassifizierten Trainingsdaten die Regeln, um neue

Objekte nur aufgrund der Merkmale zu klassifizieren

Das Ergebnismerkmal (Klassenvariable) ist nominal (kategorisch)

Trainings-daten

Neue Objekte

17



Anwendung: Neugeborenen-Screening


Blutprobe des

Neugeborenen

Massenspektrometrie Metabolitenspektrum

Datenbank

14 analysierte Aminosäuren:

alanine phenylalanine

arginine pyroglutamate

argininosuccinate serine

citrulline tyrosine

glutamate valine

glycine leuzine+isoleuzine

methionine ornitine

18

http://www.comiccompany.co.uk/icons_banners_navbars/baby_blue_100_1k.gif



Anwendung: Neugeborenen-Screening


Ergebnis:• Neuer diagnostischer Test

• Glutamin als bisher

unbekannter Marker

19



Regression


0

5

Grad der Erkrankung

Neue Objekte

Aufgabe:

Ähnlich zur Klassifikation, aber das Ergebnis-Merkmal, das gelernt bzw.

geschätzt werden soll, ist metrisch.

20



The Apriori Algorithm

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of VLDB conference (Int. Conf. On Very Large Data Bases).

Originally published at one to the top conferences of databaseresearch, this is one of the most influential algorithms for datamining:

• Recieved the 10-year Best Paper Award of VLDB 2004

• Among the top-10 data mining algorithms identified by IEEE ICDM 2006

• 24,238 citations in Google Scholar (May, 2019, December2016: 20,714)


21



Problem setting

Given a large database of transactions,

Each transaction consists of a set of (discrete) items.

Find non-trivial, valid, expressive and interesting rulesrepesenting associations between the items, e.g.,


22

tid Ti

1 {beer, chips, wine}

2 {beer, chips}

3 {pizza, wine}

4 {chips, pizza}

{beer, wine} {chips}



Applications

• Classical application: market basket analysis

𝑜𝑛𝑖𝑜𝑛𝑠, 𝑝𝑜𝑡𝑎𝑡𝑜𝑒𝑠 ⇒ 𝑚𝑒𝑎𝑡

Numerous applications beyond that, e.g.:

– Web usage mining,

– Recommender systems,

– Intrusion or plagiarism detection,

– Bioinformatics.

Apriori algorithm and its improvements is an efficient search algorithm for many related problem settings, e.g.,

– Feature selection for clustering and classification,

– Graph mining.


23



Definition

• Let 𝐼 = {𝑖1, 𝑖2, … , 𝑖𝑛} be a set of n attributes called items.

• Let 𝐷 = {𝑡1, 𝑡2, … , 𝑡𝑚} be a set of m transactions called the database. In our example the table at the bottom-right.

• Each transaction in D has a unique transaction ID and contains a subset of the items in I (𝑡𝑖 ⊆ 𝐼).

• A rule is defined as an implication of the form:𝑋 ⇒ 𝑌

• Where 𝑋, 𝑌 ⊆ 𝐼 and 𝑋 ∩ 𝑌 = ∅

Example:

• 𝐼 = {𝑏𝑒𝑒𝑟, 𝑐ℎ𝑖𝑝𝑠, 𝑝𝑖𝑧𝑧𝑎, 𝑤𝑖𝑛𝑒}.

• An example rule could be: 𝑏𝑢𝑡𝑡𝑒𝑟, 𝑏𝑟𝑒𝑎𝑑 ⇒ {𝑚𝑖𝑙𝑘}


24

ID beer chips pizza wine

1 1 1 0 1

2 1 1 1 0

3 0 0 1 1

4 0 1 1 0



Concepts – Support


25

• Support is a measure of how frequently the itemset appears in the database.

• The support of X 𝑠𝑢𝑝𝑝(𝑋) with respect to D is defined as the number of transactions in the database which contains itemset X; often also expressed as relative frequency

• In our example the itemset 𝑏𝑒𝑒𝑟, 𝑐ℎ𝑖𝑝𝑠 has a support of 2 and a relative frequency of Τ2 4 = 0.5, since it occurs in 50% of all transactions.

• 𝑠𝑢𝑝𝑝 𝑏𝑒𝑒𝑟, 𝑐ℎ𝑖𝑝𝑠 = 2

• 𝑜𝑟 𝑖𝑛 𝑡𝑒𝑟𝑚𝑠 𝑜𝑓 𝑟𝑒𝑙𝑎𝑡𝑖𝑣𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑦:

• 𝑠𝑢𝑝𝑝 𝑏𝑒𝑒𝑟, 𝑐ℎ𝑖𝑝𝑠 = 0.5


1 1 1 0 1

2 1 1 1 0

3 0 0 1 1

4 0 1 1 0



Concepts - Confidence

• The confidence 𝑐𝑜𝑛𝑓(𝑋 ⇒ 𝑌) of the rule 𝑋 ⇒ 𝑌 with respect to the transactions D determines how frequently items in Y appear in transactions that contain X

• 𝑐𝑜𝑛𝑓 𝑋 ⇒ 𝑌 =𝑠𝑢𝑝𝑝(𝑋∪𝑌)

𝑠𝑢𝑝𝑝(𝑋)

• For example the rule 𝑏𝑒𝑒𝑟, 𝑤𝑖𝑛𝑒 ⇒ {𝑐ℎ𝑖𝑝𝑠} in our example has a confidence of Τ0.25

0.25 = 100%. This means that for 100% of the transactions containing {𝑏𝑒𝑒𝑟, 𝑤𝑖𝑛𝑒} the rule is correct (i.e. 100% of the times a customer buys 𝑏𝑒𝑒𝑟, 𝑤𝑖𝑛𝑒 , 𝑐ℎ𝑖𝑝𝑠 is

bought as well.)

• 𝑐𝑜𝑛𝑓( 𝑏𝑒𝑒𝑟, 𝑤𝑖𝑛𝑒 ⇒ 𝑐ℎ𝑖𝑝𝑠 = 1.0

• Confidence is the same as the conditionalprobability 𝑃(𝑌|𝑋).


26


1 1 1 0 1

2 1 1 1 0

3 0 0 1 1

4 0 1 1 0



Process description

Often we are interested in two kinds of problems:

1. Frequent Itemset Mining

– Find all frequent item sets in D with a support of s:{𝑋 ⊆ 𝐼 | 𝑠𝑢𝑝𝑝 𝑋 ≥ 𝑠}

• “naive algorithm” is infeasible because a data set with k items, i.e. |I| = k has itemsets 2k-1itemsets ⇒ therefore we need efficient algorithms

2. Association Rule Mining

– Find all association rules A with a support s and a confidence of c {𝐴 ≡ 𝑋 ⇒ 𝑌 | 𝑠𝑢𝑝𝑝 𝐴 ≥ 𝑠, 𝑐𝑜𝑛𝑓 𝐴 ≥ 𝑐}

– A subproblem of this is Problem 1.


27



A closer look on Itemset Mining

• The size of the power-set grows exponentially in the number of items k in I.

• Nevertheless efficient search is possible using the downward-closure property of support (also called anti-monotonicity).

– It guarantees that for a frequent itemset, all its subsets are also frequent

– and thus for an infrequent itemset, all its super-sets must also be infrequent.

– Efficient algorithms (e.g. Apriori, Eclat, FP-Growth) exploit this property.

– Therefore these algorithms can find all frequent itemsets in reasonable time.


28



Running example for Apriori algorithm


29

tid Ti

1 {beer, chips, wine}

2 {beer, chips}

3 {pizza, wine}

4 {chips, pizza}

I = {beer, chips, pizza, wine}

Database D

Itemset Cover Sup. Freq.

{} {1,2,3,4} 4 100 %

{beer} {1,2} 2 50 %

{chips} {1,2,4} 3 75 %

{pizza} {3,4} 2 50 %

{wine} {1,3} 2 50 %

{beer, chips} {1,2} 2 50 %

{beer, wine} {1} 1 25 %

{chips, pizza} {4} 1 25 %

{chips, wine} {1} 1 25 %

{pizza, wine} {3} 1 25 %

{beer, chips, wine} {1} 1 25 %

Rule Sup. Freq. Conf.

{beer} {chips} 2 50 % 100 %

{beer} {wine} 1 25 % 50 %

{chips} {beer} 2 50 % 66 %

{pizza} {chips} 1 25 % 50 %

{pizza} {wine} 1 25 % 50 %

{wine} {beer} 1 25 % 50 %

{wine} {chips} 1 25 % 50 %

{wine} {pizza} 1 25 % 50 %

{beer, chips} {wine} 1 25 % 50 %

{beer, wine} {chips} 1 25 % 100 %

{chips, wine} {beer} 1 25 % 100 %

{beer} {chips, wine} 1 25 % 50 %

{wine} {beer, chips} 1 25 % 50 %



Illustration of the problem

• A frequent itemset lattice is a good visualization technique for our problem of Itemset Mining.

• Lower levels can contain at most the minimum number of their parents.

• Example: | 𝑏𝑒𝑒𝑟, 𝑐ℎ𝑖𝑝𝑠 | can have only at most min(|{𝑏𝑒𝑒𝑟}|, |{𝑐ℎ𝑖𝑝𝑠}|) support.

• This is an example for the downward-closure property

Foundations of Data Analysis – Association Rules 30




Frequent Itemset Lattice


31

{beer}:2 {chips}:3 {pizza}:2 {wine}:2

{beer,chips}:2 {beer,pizza}:0 {beer,wine}:1 {chips,wine}:1{chips,pizza}:1 {pizza,wine}:1

{beer,chips,pizza}:0 {beer,chips,wine}:1 {beer,pizza,wine}:0 {chips,pizza,wine}:0

{beer,chips,pizza,wine}:0

{}:4

subse

ts

super

sets



Key ideas of Apriori

• Apriori uses a breadth-first search strategy,

• a hash tree to count the support of itemsets,

• and uses a candidate generation function which exploits the downward-closure property.


32



Apriori principle


33





{}:4

Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent



Algorithm

1. Find frequent 1-itemsets and put them into 𝐿𝑘 (𝑘 = 1)

2. Determine frequent k-itemsets from 𝑘 − 1 item subsets

1. Create candidate itemsets 𝐶𝑘 (𝑘 = 2,3, … ) from 𝑘 − 1 itemsets using the apriori-gen function

2. Database D is scanned while counting the support for each element in 𝐶𝑘.Fast counting is achieved by efficiently determine the candidates in 𝐶𝑘 that are contained in a given D using the subset function.


34



Apriori Algorithm

𝐿1 = {𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑡 1 − 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠}

𝒇𝒐𝒓 (𝑘 = 2 ; 𝐿𝑘−1 ≠ ∅ ; 𝑘 + + ) 𝒅𝒐𝐶𝑘 = 𝑎𝑝𝑟𝑖𝑜𝑟𝑖 − 𝑔𝑒𝑛 𝐿𝑘−1 ;𝒇𝒐𝒓𝒂𝒍𝒍 𝑡𝑟𝑎𝑛𝑠𝑎𝑐𝑡𝑖𝑜𝑛𝑠 𝑡 𝜖 𝒟 𝒅𝒐𝐶𝑡= 𝑠𝑢𝑏𝑠𝑒𝑡 𝐶𝑘 , 𝑡 ;𝒇𝒐𝒓𝒂𝒍𝒍 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 𝑐 𝜖 𝐶𝑡 𝒅𝒐

𝑐. 𝑐𝑜𝑢𝑛𝑡 + +;𝒆𝒏𝒅

𝒆𝒏𝒅𝐿𝑘 = 𝑐 𝜖 𝐶𝑘 𝑐. 𝑐𝑜𝑢𝑛𝑡 ≥ 𝑚𝑖𝑛𝑠𝑢𝑝 }

𝒆𝒏𝒅

𝐴𝑛𝑠𝑤𝑒𝑟 = ራ

𝑘

𝐿𝑘

𝒂𝒑𝒓𝒊𝒐𝒓𝒊 − 𝒈𝒆𝒏 𝑳𝒌−𝟏 :𝑖𝑛𝑠𝑒𝑟𝑡 𝑖𝑛𝑡𝑜 𝐶𝑘 𝑠𝑒𝑙𝑒𝑐𝑡 ∗

𝑓𝑟𝑜𝑚 𝐿𝑘−1 𝑝, 𝐿𝑘−1 𝑞

𝑤ℎ𝑒𝑟𝑒 𝑝. 𝑖1 = 𝑞. 𝑖1 𝑎𝑛𝑑… 𝑎𝑛𝑑

𝑝. 𝑖𝑘−2 = 𝑞. 𝑖𝑘−2 𝑎𝑛𝑑 𝑝. 𝑖𝑘−1 <𝑞. 𝑖𝑘−1

// pruning:𝑑𝑒𝑙𝑒𝑡𝑒 𝑎𝑙𝑙 𝑒𝑛𝑡𝑟𝑖𝑒𝑠 𝑤ℎ𝑒𝑟𝑒 𝑛𝑜𝑡 𝑎𝑙𝑙𝑝𝑜𝑠𝑠𝑖𝑏𝑙𝑒 𝑠𝑢𝑏𝑠𝑒𝑡𝑠 𝑎𝑟𝑒 𝑝𝑟𝑒𝑠𝑒𝑛𝑡 𝑖𝑛 𝐶𝑘

𝒆𝒏𝒅

𝒔𝒖𝒃𝒔𝒆𝒕 𝑪𝒌, 𝒕 :

store 𝐶𝑘 𝑖𝑛 𝑎 ℎ𝑎𝑠ℎ𝑇𝑟𝑒𝑒

𝑓𝑖𝑛𝑑 𝑎𝑙𝑙 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑒𝑑 𝑖𝑛 𝑒𝑎𝑐ℎ𝑡𝑟𝑎𝑛𝑠𝑎𝑐𝑡𝑖𝑜𝑛

𝒆𝒏𝒅


35



Concrete example

Itemset Sup. count

{beer} 2

{chips} 3

{pizza} 2

{wine} 2


36

Itemset Sup. count

{beer} 2

{chips} 3

{pizza} 2

{wine} 2Scan

D f

or

cou

nt

of

each

can

did

ate

s Compare candidatesupport with minimumsupport count

Itemset

{beer, chips}

{beer, pizza}

{beer, wine}

{chips, pizza}

{chips, wine}

{pizza, wine}

Itemset Sup. count

{beer, chips} 2

{beer, pizza} 0

{beer, wine} 1

{chips, pizza} 1

{chips, wine} 1

{pizza, wine} 1

Gen

era

te c

an

did

ate

s

Scan D for count of each candidate and Compare with minsupport

Min support = 1



Pruning


37





Pruning, If {ab} is infrequent, then all

supersets are infrequent too

{}:4

Min support = 1



acd ace ade cde

Example of Candidate Generation (𝑳𝟒)

• 𝐿3 = {𝑎𝑏𝑐, 𝑎𝑏𝑑, 𝑎𝑐𝑑, 𝑎𝑐𝑒, 𝑏𝑐𝑑}

• Self-joining: 𝐿3 × 𝐿3– abcd from abc and abd

– acde from acd and ace

• Pruning:

– All subsets from acde need to be in 𝐿3– acde is removed because ade is not in 𝐿3– 𝐶4 = {𝑎𝑏𝑐𝑑}


38

{acd} {ace}

{acde}

X

X



Rule Generation


39

Given a frequent itemset L, find all non-empty subsets f L such that f L – f satisfies the minimum confidence requirement

If {A,B,C,D} is a frequent itemset, candidate rules:

ABC D, ABD C, ACD B, BCD A, A BCD, B ACD, C ABD, D ABCAB CD, AC BD, AD BC, BC AD, BD AC, CD AB,

If |L| = k, then there are 2k – 2 candidate association rules (ignoring L and L)



Rule Generation


40

In general, confidence does not have an anti-monotone property

c(ABC D) can be larger or smaller than c(AB D)

But confidence of rules generated from the same itemsethas an anti-monotone property

– E.g., Suppose {A,B,C,D} is a frequent 4-itemset:

c(ABC D) c(AB CD) c(A BCD)

– Confidence is anti-monotone w.r.t. number of items on the right hand side of the rule



Rule Pruning


41

ABCD=>{ }

BCD=>A ACD=>B ABD=>C ABC=>D

BC=>ADBD=>ACCD=>AB AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD

Pruned

Rules



Discussion of the Apriori algorithm

• Much faster than the Brute-force algorithm

– It avoids checking all elements in the lattice

• The running time is in the worst case 𝒪(2|𝐼|)– Pruning really prunes in practice

• It makes multiple passes over the dataset

– One pass for every level k

– Improvement AprioriTID avoids multiple runs, but needs additional data structures which require more space.

• Average transaction wid

– Transaction width increases with denser data sets.

– Affects traversals of hash tree and may increase max length of frequent itemsets.


42Apriori Algorithm



Further reading

• Tan; Steinbach; Kumar, Mass. : Introduction to Data Mining Pearson / Addison-Wesley ; 2006 – Book and free sample chapter http://www-users.cs.umn.edu/~kumar/dmbook/index.php


43