Streaming Algorithmen · 2012. 7. 9. · Juni 2012 Streaming Algorithmen - Dynamische Zuordnung von Objekte mit größten Häufigkeit 4. Der Algorithmus Satz 1: Wenn ein Objekt mit

Seminar über Algorithmen - 06. Juni 2012 Streaming Algorithmen - Dynamische Zuordnung von Objekte mit größten Häufigkeit

Streaming Algorithmen

Streaming AlgorithmenDynamische Zuordnung von Objekten mit größter Häufigkeit

Seminar über Algorithmen, SS 2012

Simon Tippenhauer

Freitag, 06. Juni 2012

1


Gliederung

Gliederung

1. Motivation

2. Data Streams und Data Stream Algorithm

3. Formalisierung der Aufgabe: Dynamische Zuordnung von Objekten mit größter Häufigkeit.

4. Der Algorithmus

(i) Methoden und Idee

(ii) Bestimmung der absoluten Mehrheit

(iii) Bestimmung von k Hot Items

5. Evaluation

2


1. Motivation

1. Motivation

• Internet Service Provider (ISP) führt Statistik über den Datenverkehr.

• Welche Netzwerkadressen

• verursachen den meisten Datenverkehr oder

• unterliegen den häufigsten Anfragen?

• Der ISP kann daraufhin den Datenfluss steuern und einer Überlastung der Server entgegenwirken.

• Anhand der Statistik können auch Anomalien festgestellt werden.

3


2. Data Streams und Data Stream Algorithms


• Ein einführendes Beispiel: “Finde die fehlende Zahl!”

• Sei 𝞹 eine Permutation der Zahlen 1, 2, ..., n und 𝞹-1 die selbe Permutation mit einer fehlenden Zahl (n ist dabei sehr groß!).

• Nun werden die Zahlen von 𝞹-1 nacheinander gezeigt und mit O(log n) Speicherplatz, soll die fehlende Zahl bestimmt werden.

• Die fehlende Zahl ist dann:

• Beispiel: n = 8, 𝞹-1 = 4,2,7,6,1,5,8

•

4

s = n(n +1)2

− π−1[i]i=1

n−1

∑

s = 8(8 +1)2

− π−1[i]i=1

8−1

∑ = 36 − 33 = 3



Data Streams

• Definition 1: “Ein Data Stream stellt Eingabedaten ai dar, die in einer sehr hohen Rate eintreffen.”

• Eine hohe Rate liegt vor, wenn die Kommunikations- und Berechnungsinfrastruktur stark beansprucht werden.

• Es ist also schwierig

• die gesamte Eingabe dem Programm zu übermitteln - transmit (T)

• anspruchsvolle Funktionen zu berechnen - compute (C)

• die Eingabe temporär oder auf Dauer zu speichern - space (S)

• Die Aufgabe ist es

• ständige Updates zu bearbeiten und zeitnahe Statistiken zu liefern.

5



Data Stream Models

• Eine Eingabefolge a1, a2, ... beschreibt ein zugrunde liegendes Signal A.

• Das Signal A ist eine Funktion A:[1...N] → R.

• Die Modelle unterscheiden sich dadurch, wie die ai’s das Signal A beschreiben.

• Time Series Model

• Cash Register Model

• Turnstile Model

6



Time Series Model

• Jedes ai entspricht A[i] und sie treten in aufsteigender Reihenfolge auf.

7

0

25

50

75

100

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43

Time Series Model

Signal A



Cash Register Model

• Jedes ai = (j, Ii) entspricht einer Erhöhung von A[j] um den Wert Ii > 0. Also, Ai = Ai-1[j] + Ii

• Ai ist der Zustand des Signals A, nachdem die i-te Eingabe erfolgt ist.

• Mehrere ai’s können ein gegebenes A[j] mit der Zeit erhöhen.

• Dies ist das wohl am häufigsten auftretende Model.

8



Turnstile Model

• Jedes ai = (j, Ui) entspricht einer Erhöhung von A[j] um den Wert Ui. Also, Ai = Ai-1[j] + Ui

• Ai ist der Zustand des Signals A, nachdem die i-te Eingabe erfolgt ist.

• Die Werte Ui können positiv oder negativ sein.

• Dies ist das wohl allgemeinste Model.

• Das Turnstile Model ist strikt, wenn immer alle A[j]’s ≥ 0 sind.

9



Data Stream Algorithms

• Auf dem Signal A sollen unterschiedliche Funktionen berechnet werden.

• Die Performance hängt von drei wesentlichen Messverfahren ab:

1. Verarbeitungszeit pro Eingabe ai. (Proc. Time)

2. Speicherbedarf der Datenstruktur für A zum Zeitpunkt t. (Storage)

3. Laufzeit zur Berechnung von Funktionen auf A. (Compute Time)

• Daraus ergeben sich folgende Anforderung an die Algorithmen:

‣ Zu jedem Zeitpunkt t soll Proc. Time, Storage und Compute Time gleichzeitig o(N, t) sein, vorzugsweise polylog(N, t). Dabei entspricht N die Größe der Domäne.

‣ Abgeschwächt kann Compute Time auch nicht sublinear sein.

10


3. Formalisierung der Aufgabe

Gliederung

1. Motivation



4. Der Algorithmus




5. Evaluation

11



3. Formalisierung der Aufgabe:Dynamische Zuordnung von Objekten mit größter Häufigkeit

• Datenbanksystem mit Lösch- und Einfügeoperationen.

➡ Aufgabe: Bestimmung der k häufigsten Objekte.

I. Folge von Transaktionen auf Objekten,

II. o.B.d.A. sind die Objekte mit Zahlen von 1 bin m identifizierbar.

III.Das Nettovorkommen von Objekt i zum Zeitpunkt t wird bezeichnet mit:

IV. Aktuelle relative Häufigkeit von Objekt i:

V. Die k häufigsten Objekte haben die k größten .

VI. Es liegt ein strict Turnstile Model vor.

12

fi (t) = ni (t) / nj (t)j=1

m

∑fi (t)

ni (t)



Anforderungen an den Algorithmus

I. Berechnung der k häufigsten Objekte zu jedem Zeitpunkt.

II. Sublinear zu jedem Zeitpunkt t bezüglich

(i) der Verarbeitungszeit pro Transaktion ai,

(ii) dem Speicherbedarf der Datenstruktur und

(iii) der Laufzeit für die Berechnung der k häufigsten Objekte.

‣ Die Anforderungen für Streaming Algorithmen können für I. nicht eingehalten werden.

‣ Daher werden statt der k häufigsten Objekt, die Hot Items gefordert.

13


Definition:Ein Objekt i ist ein Hot Item zum Zeitpunkt t, wenn gilt:

Es kann maximal k Hot Items geben, aber auch weniger oder keins.


Hot Item

14

fi (t) >1k +1

c = nj (t)j=1

m

∑ = 20

Beispiel: k = 3, m = 4

f1(t) =220

f2 (t) =720

f3(t) =520

f4 (t) =620

1k +1

=14=520

⇒ f2 (t) >1k +1

∧ f4 (t) >1k +1

Somit sind die Hot Items 2 und 4.



➡ Mit o(m) Speicherplatz folgt, dass

1. nicht alle Hot Items bestimmt werden können oder

2. Objekte mit niedrigerer relativen Häufigkeit als Hot Item klassifiziert werden.

• Wenn die Small Tail Eigenschaft vorausgesetzt wird, gilt das Lemma 1 nicht!

Lemma 1: Jeder Algorithmus, der garantiert alle und nur Objekte zu finden, die eine größeren Häufigkeit als 1/(k + 1) haben,

benötigt Ω(m) viele Bits.

15

Definition:Sei F eine Menge von relativen Häufigkeiten ƒi mit ƒ1 ≥ ƒ2 ≥ ... ≥ fk ≥ fk+1 ≥ ... ≥ ƒm. Dann hat F einen Small Tail, wenn gilt:

fii>k

m

∑ ≤1k +1



Gliederung

1. Motivation



4. Der Algorithmus




5. Evaluation

16


4. Der Algorithmus

4. Der AlgorithmusMethoden

• Summary Datenstruktur mittels Random Sampling

• Summary Datenstruktur repräsentiert die zugrundeliegende Relation

• mit deutlich weniger Speicherplatzbedarf und

• bietet ausreichend viele Informationen, um

• Anfragen zufriedenstellend zu beantworten.

• Random Sampling beschreibt ein Auswahlverfahren für Stichproben, indem

• die Stichproben zufällig ausgewählt werden.

• Benötigt viele Samples und

• Resampling ist von Zeit zu Zeit notwendig.

17


4. Der Algorithmus

Methoden

• (non-adaptive) Group Testing

• Group Testing dient zur effektiven Bestimmung von ausgezeichneten Elementen einer Menge.

• non-adaptive bedeutet, dass alle Test unabhängig von den Ergebnissen, der anderen Test durchgeführt werden.

18


4. Der Algorithmus

Idee zum Algorithmus

• Es werden Gruppen für eine Teilmenge von Objekten gebildet.

• Jede Gruppe hat einen Zähler für das Nettovorkommen innerhalb dieser.

• Zusätzlich wird ein Zähler für das gesamte Nettovorkommen geführt.

• Die Zähler bilden die Summary Datenstruktur.

• Der Gruppentest informiert über die Beinhaltung eines Hot Items in der Gruppe.

• Aus den Ergebnissen lassen sich die Hot Items ermitteln.

19


4. Der Algorithmus

Herausforderungen für diesen Ansatz

1. Begrenzung der Anzahl notwendiger Teilmengen (Gruppen).

2. Geeignete (platzsparende) Repräsentation der Teilmengen.

3. Effiziente Bestimmung der Hot Items anhand von Gruppentests.

20

Starten wir mit dem einfachen Fall!


4. Der Algorithmus

k = 1Deterministische Bestimmung der absoluten Mehrheit

• Gegeben ein Stream a1, ..., at zum Zeitpunkt t mit ai ∈ {1, ..., m}.

• Objekt i ist ein Hot Item, wenn gilt:

• Auswahl der Teilmengen (Gruppen)

• Gj ist die Teilmenge von Objekten, dessen j-te Ziffer in Binärdarstellung eine 1 ist.

• Die zugehörigen Zähler werden mit c0, ..., clog m bezeichnet.

• Die Datenstruktur umfasst die Zähler

• c0, ..., clog m der Gruppen und

• einen Zähler c für das Nettovorkommen aller Objekt.

21

fi >11+ k

>12


4. Der Algorithmus

Verarbeitung des Streams und Bestimmung des Hot Items

• Einfügen von Objekt i:

• Inkrementiere den Zähler c.

• Erhöhe jeden Zähler cj um 1, wenn das j-te Bit von i gesetzt ist.

• Löschen von Objekt i:

• Dekrementiere den Zähler c.

• Verringere jeden Zähler cj um 1, wenn das j-te Bit von i gesetzt ist.

• Bestimmung des Hot Items:

• Wenn es existiert, so ist

22

hot = 2 j

j=0

logm⎢⎣ ⎥⎦

∑ ⋅ greater(cj ,c2)


4. Der Algorithmus

Beispiel (nur Inserts):k = 1, m = 8, t = 17, Signal A = 3,1,2,3,8,5,3,3,8,7,7,7,3,3,3,3,3

23

0

3

6

9

12

15

- 3 1 2 3 8 5 3 3 8 7 7 7 3 3 3 3 3

Netto

vork

omm

en

Signal

Hot Item Bit 0 Bit 1 Bit 2 Bit 3

1


4. Der Algorithmus

Satz 1: Wenn ein Objekt mit der absoluten Mehrheit vorhanden ist, so findet der Algorithmus dieses in O(log m) Zeit pro Operation.

1. Der Zustand der Datenstruktur ist der selbe nach k Inserts und l Deletes, wie wenn nur n = k - l Inserts vorliegen.

2. Der Algorithmus arbeitet nur mit Inserts korrekt.

Beweis zu 2.: Sei i das Hot Item mit relativer Häufigkeit ƒi > 0.5 zum Zeitpunkt t.

➡ ni(t) > 0.5 c

➡ Für die Zähler von i gilt: cj > 0.5 c

➡

➡ Für die anderen Zähler gilt: cj < 0.5

➡ Die Ziffern der Binärdarstellung kann somit abgelesen werden.

➡ Laufzeit abhängig von der Anzahl der Zähler, d.h. O(log m)

24

f j =1− fi < 0.5j≠i

n

∑


4. Der Algorithmus

Erfüllte Herausforderungen

Begrenzung der Anzahl notwendiger Teilmengen (Gruppen).

Geeignete (platzsparende) Repräsentation der Teilmengen.

Effiziente Bestimmung der Hot Items anhand von Gruppentests.

25

Weiter geht es mit dem allgemeinen Fall!


4. Der Algorithmus

k > 1Randomisierte Bestimmung der Hot Items

• Idee für k = 1 kann mit folgenden Beobachtungen verwendet werden.

• In einer Teilmenge mit nur einem Hot Item, besitzt dieses die Mehrheit.

• Mit der Small Tail Eigenschaft sogar die absolute Mehrheit.

• Durch geschickte Wahl der Teilmengen kann sichergestellt werden, dass jedes Hot Item in solch einer Gruppe vorkommt.

• Für jede Teilmenge werden O(log m) Zähler geführt.

26

Definition: Sei F ⊆ {1, ..., m} die Menge der Hot Items mit |F| ≤ k. Dann ist S ⊆ {1, ..., m} eine gute Teilmenge, wenn |S ∩ F| = 1 ist.


4. Der Algorithmus

Wiederholung:Herausforderungen für diesen Ansatz mit k > 1




27

Starten mit den Begrenzung der notwendigen Teilmengen.


4. Der Algorithmus

Satz 2: Durch wählen von O(k ln k) Teilmengen, indem zufällig (m/k) Objekte von 1 bis m gleichverteilt gewählt werden, sind mit konstanter Wahrscheinlichkeit k Teilmengen S1, ..., Sk enthalten, sodass gilt:

• Pr[i ist ein Hot Item] =

• Pr[Ein Hot Item bei Objekten] = p =

• Für Relevante Fälle mit 1 ≤ k ≤ gilt: 0 ≤ p ≤ 0.25

• Aus dem Coupon Collector Problem folgt die Wahl von O(k ln k) Teilmengen

28

(F∩ Si ) = F

i

km

mk

km(1− k

m)mk−1

m2


4. Der Algorithmus

Herausforderungen für diesen Ansatz mit k > 1




29

Weiter mit der geeigneten Repräsentation der Teilmengen.


4. Der Algorithmus

• Primzahl P ≫ 2k, a und b gleichverteilte Zufallszahlen aus {0, 1, ..., P-1}

• Definieren

• ha,b(x) = ((ax + b mod P) mod 2k) und

• Sa,b,i = {x | ha,b(x) = i} zugehörige Teilmenge.

➡ Mit Wahrscheinlichkeit sind zwei Objekte in der selben Teilmenge.

30

Universelles Hashing zur Repräsentation von Teilmengen

12k


4. Der Algorithmus

• Brauchen O(k ln k) Teilmengen, damit jedes Hot Item in einer guten Teilmenge ist.

• Dazu werden Paare von (a,b) gewählt.

➡ Teilmengen.

• Jedes Objekt gehört T Teilmengen an.

• Anhand der Bitrepräsentation des Objekts werden die entsprechende Zähler der Teilmengen erhöht bzw. verringert.

31

Universelles Hashing Genaue Anzahl der Teilmengen

T =O(log kδ)

2Tk = 2k log kδ

Lemma 2 (ohne Beweis): Für jedes Hot Item ist die Wahrscheinlichkeit in mindestens einer guten Menge zu sein mindestens 1 - δ.


4. Der Algorithmus

32

Die Teilmengen im Bild

Sa1 ,b1 ,0

Sa1 ,b1 ,1

Sa1 ,b1 ,2k−1

Sa2 ,b2 ,2k

SaT ,bT ,2Tk

c0 c00 c01 c0 logmc1 c10 c11 c1logm

c2k−1 c2k−10 c2k−11 c2k−1logmc2k c2k0 c2k1 c2k logm

c2Tk−1 c2Tk−10 c2Tk−11 c2Tk−1logm

⎛

⎝

⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜

⎞

⎠

⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟

Gruppenzähler

c

0-te Bit

1-te Bit

(log m)-te

Bit


4. Der Algorithmus





33

Es müssen nur noch die Hot Items bestimmt werden.


4. Der Algorithmus

Effiziente Bestimmung der Hot Items

34

1. Für jede Teilmenge Sa,b,i teste,

1.1. ob der zugehörige Gruppenzähler und

1.2. es eine gute Teilmenge ist.

2. Wenn ja,

2.1. Nutze den Algorithmus von k = 1 und bestimme das Hot Item der Gruppe.

3. Wenn nein,

3.1. Teste die nächste Teilmenge.

4. Gebe alle Hot Items aus.

ci >c

k +1


4. Der Algorithmus

Lemma 3: Sei Sa,b,i eine Teilmenge mit den assoziierten Zählern ca,b,i, c0, c1, ..., clog m. Mit der Small Tail Eigenschaft vorausgesetzt, lässt sich deterministisch bestimmen, ob Sa,b,i eine gute Teilmenge ist.

35

1. Fall: Kein Hot Item in der Menge, dann gilt:

2. Fall: Zwei oder mehr Hot Items in der Menge, dann ex. ein j, sodass gilt:

ca,b,i <c

k +1

ca,b,i − cj >c

k +1


4. Der Algorithmus

Satz 3a: Mit einer Wahrscheinlichkeit von mindestens 1 - δ können alle Hot Items mit O(k(log k + log 1/δ)) Speicherplatz zu jedem Zeitpunkt t bestimmt werden.

36

• Speicherplatz für die Hashfunktionen

• Primzahl P in Abhängigkeit von m, also ,

• sowie Werte von Paaren (a,b) für alle Teilmengen.

• Also .

• Speicherplatz für die Zähler

• Für jede Gruppe Zähler.

• Bei Teilmenge also .

➡ Speicherplatz für die Datenstruktur.

O(log kδ)

O(logm)+O(log kδ)

O(logm)

O(logm)

2Tk = 2k log kδ

O(logm ⋅ k log kδ)

O(logm ⋅ k log kδ)


4. Der Algorithmus

Satz 3a: Mit einer Wahrscheinlichkeit von mindestens 1 - δ können alle Hot Items mit O(k(log k + log 1/δ)) Speicherplatz zu jedem Zeitpunkt t bestimmt werden.

37

• Lemma 2

➡ Wahrscheinlichkeit 1 - δ ist jedes Hot Item in einer guten Menge.

• Lemma 3 und Satz 1

➡ Kann deterministische bestimmt werden, ob ein Hot Item in der Teilmenge ist.

➡ Wenn ein Hot Item vorhanden ist, so kann es bestimmt werden.


4. Der Algorithmus

Satz 3b: Dabei wird O(log k/δ log m) Zeit für eine Aktualisierung und O(k log k/δ log m) Zeit für die Auflistung aller Hot Items benötigt.

38

• Bei einer Aktualisierung werden

• Hashfunktionen in berechnet und

• jeweils Zähler verändert.

➡ Zeit pro Aktualisierung.

• Für die Auflistung der Hot Items gilt:

• Hot Item kann in einer Gruppe in Zeit bestimmt werden und

• dies für Gruppen.

➡ Zeit für alle Hot Items.

O(logm)

T =O(log kδ)

O(log kδ⋅ logm)

O(logm)

O(k log kδ)

O(k log kδ⋅ logm)


4. Der Algorithmus





39

Fertig!


5. Evaluation

Gliederung

1. Motivation



4. Der Algorithmus




5. Evaluation

40


5. Evaluation

Vergleich mit vorherigen Ansätzen

41

Recall on Synthetic Data

0.00.10.20.30.40.50.60.70.80.91.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Zipf parameter

Reca

ll

Group Testing Lossy Counting Frequent

Precision on Synthetic Data

0.00.10.20.30.40.50.60.70.80.91.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Zipf parameter

Prec

isio

n


Figure 5: Experiments on synthetic data consisting of 106

transactions. Left: testing recall (proportion of

the hot items reported). Right: testing precision (proportion of the output items which are hot)

Proof. During any insertion or deletion, the algorithmtakes the same action and does not inspect the contents ofthe memory. It just adds or subtracts values from the coun-ters, as a function solely of the item value. Since additionand subtraction commute, the lemma follows.

4. EXPERIMENTSTo evaluate our approach, we implemented our Group

Testing algorithm in C. We also implemented two algorithmswhich operate on non-dynamic data, the algorithm LossyCounting [23] and Frequent [9]. Neither algorithm is able tocope with the case of the deletion of an item, and there isno obvious modification to accommodate deletions and stillguarantee the quality of the output. We instead performed a“best e↵ort” modification: since both algorithms keep coun-ters for certain items, which are incremented when that itemis inserted, we modified the algorithms to decrement thecounter whenever the corresponding item is deleted. Whenan item without a counter is deleted, then we take no ac-tion.4 This modification ensures that when the algorithmsencounter an inserts-only dataset, then their action is thesame as the original algorithms. We ran tests on both syn-thetic and real data, and measure the two standard qualitiesof the output: the recall and the precision.

Definition 3. The recall of an experiment to find hot itemsis the proportion of the hot items that are found by themethod. The precision is the proportion of items identifiedby the algorithm which are hot items.

It will be interesting to see how these properties interact.For example, if an algorithm outputs every item in the rangethen it clearly has perfect recall (every hot item is indeedincluded in the output), but its precision is very poor. At theother extreme, an algorithm which is able to identify onlythe most frequent item will have perfect precision, but mayhave low recall if there are many hot items. For example,

4Many variations of this theme are possible. Our experi-mental results here that compare our algorithms to modifi-cations of Lossy Counting [23] and Frequent [9] should beconsidered proof-of-concept only.

the Frequent algorithm gives guarantees on the recall of itsoutput, but does not bound the precision, whereas for LossyCounting, the parameter ✏ a↵ects the precision indirectly(depending on the properties of the sequence). Meanwhile,given a sequence with the small tail property, Group Testingguarantees that no infrequent items will be output and forgeneral sequences it is unlikely to output infrequent items.

4.1 Synthetic DataWe created synthetic datasets designed to test the behav-

ior when confronted with a sequence including deletes. Thedatasets were created in three equal parts: first, a sequenceof insertions distributed uniformly over a small range; next,a sequence of inserts was drawn from a zipf distributionwith varying parameter; lastly, a sequence of deletes wasdistributed uniformly over the same range as the startingsequence. The net e↵ect of this sequence is that first andlast groups of transactions should (mostly) cancel out, leav-ing the “true” signal from the zipf distribution. The datasetwas designed to test whether the algorithms could find thissignal from the added noise. We generated a datasets of1,000,000 items so it was possible to compute the exact an-swers in order to compare, and searched for the k = 50 hotitems while varying the zipf parameter of the signal from0 (uniform) to around 3 (highly skewed). The results areshown in Figure 5, with the recall plotted on the left, andthe precision on the right.

We immediately see that existing algorithms perform verybadly on this data set including deletions. Lossy countingperforms worst on both recall and precision, while Frequent,which does reasonably well when zipf parameter of the signalis low (all hot items have about the same frequency) doesless well as the distribution becomes more skewed (when thelower ranked frequent items are pushed closer to the thresh-old. Group Testing finds all hot items, and only hot itemsin every case, even though the zipf distribution does notsatisfy the small tail assumptions: this property is slightlystronger than is needed, and so Group Testing is able tosucceed on data sets on which it does not hold. Even whenthe recall of the other algorithms is reasonably good (find-ing around three-quarters of the hot items), their precision

Recall on Synthetic Data

0.00.10.20.30.40.50.60.70.80.91.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Zipf parameter

Reca

ll


Precision on Synthetic Data

0.00.10.20.30.40.50.60.70.80.91.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Zipf parameter

Prec

isio

nGroup Testing Lossy Counting Frequent

Figure 5: Experiments on synthetic data consisting of 106

transactions. Left: testing recall (proportion of

the hot items reported). Right: testing precision (proportion of the output items which are hot)

Proof. During any insertion or deletion, the algorithmtakes the same action and does not inspect the contents ofthe memory. It just adds or subtracts values from the coun-ters, as a function solely of the item value. Since additionand subtraction commute, the lemma follows.

4. EXPERIMENTSTo evaluate our approach, we implemented our Group

Testing algorithm in C. We also implemented two algorithmswhich operate on non-dynamic data, the algorithm LossyCounting [23] and Frequent [9]. Neither algorithm is able tocope with the case of the deletion of an item, and there isno obvious modification to accommodate deletions and stillguarantee the quality of the output. We instead performed a“best e↵ort” modification: since both algorithms keep coun-ters for certain items, which are incremented when that itemis inserted, we modified the algorithms to decrement thecounter whenever the corresponding item is deleted. Whenan item without a counter is deleted, then we take no ac-tion.4 This modification ensures that when the algorithmsencounter an inserts-only dataset, then their action is thesame as the original algorithms. We ran tests on both syn-thetic and real data, and measure the two standard qualitiesof the output: the recall and the precision.

Definition 3. The recall of an experiment to find hot itemsis the proportion of the hot items that are found by themethod. The precision is the proportion of items identifiedby the algorithm which are hot items.

It will be interesting to see how these properties interact.For example, if an algorithm outputs every item in the rangethen it clearly has perfect recall (every hot item is indeedincluded in the output), but its precision is very poor. At theother extreme, an algorithm which is able to identify onlythe most frequent item will have perfect precision, but mayhave low recall if there are many hot items. For example,

4Many variations of this theme are possible. Our experi-mental results here that compare our algorithms to modifi-cations of Lossy Counting [23] and Frequent [9] should beconsidered proof-of-concept only.

the Frequent algorithm gives guarantees on the recall of itsoutput, but does not bound the precision, whereas for LossyCounting, the parameter ✏ a↵ects the precision indirectly(depending on the properties of the sequence). Meanwhile,given a sequence with the small tail property, Group Testingguarantees that no infrequent items will be output and forgeneral sequences it is unlikely to output infrequent items.

4.1 Synthetic DataWe created synthetic datasets designed to test the behav-

ior when confronted with a sequence including deletes. Thedatasets were created in three equal parts: first, a sequenceof insertions distributed uniformly over a small range; next,a sequence of inserts was drawn from a zipf distributionwith varying parameter; lastly, a sequence of deletes wasdistributed uniformly over the same range as the startingsequence. The net e↵ect of this sequence is that first andlast groups of transactions should (mostly) cancel out, leav-ing the “true” signal from the zipf distribution. The datasetwas designed to test whether the algorithms could find thissignal from the added noise. We generated a datasets of1,000,000 items so it was possible to compute the exact an-swers in order to compare, and searched for the k = 50 hotitems while varying the zipf parameter of the signal from0 (uniform) to around 3 (highly skewed). The results areshown in Figure 5, with the recall plotted on the left, andthe precision on the right.

We immediately see that existing algorithms perform verybadly on this data set including deletions. Lossy countingperforms worst on both recall and precision, while Frequent,which does reasonably well when zipf parameter of the signalis low (all hot items have about the same frequency) doesless well as the distribution becomes more skewed (when thelower ranked frequent items are pushed closer to the thresh-old. Group Testing finds all hot items, and only hot itemsin every case, even though the zipf distribution does notsatisfy the small tail assumptions: this property is slightlystronger than is needed, and so Group Testing is able tosucceed on data sets on which it does not hold. Even whenthe recall of the other algorithms is reasonably good (find-ing around three-quarters of the hot items), their precision

recall = truePositivetruePositive+ falseNegative

precision = truePositivetruePositive+ falsePositive


5. Evaluation

Vergleich mit vorherigen Ansätzen

42

recall = truePositivetruePositive+ falseNegative

precision = truePositivetruePositive+ falsePositive

Recall on Real Data

0.00.10.20.30.40.50.60.70.80.91.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5Number of Transactions / 10^6

Rec

all


Precision on Real Data

0.00.10.20.30.40.50.60.70.80.91.0


Prec

isio

n


Figure 6: Performance results on real data

is very poor: for every hot item that is reported, aroundten infrequent items are also included in the output, and wecannot distinguish between these two types.

There is a price to pay for the extra power of the GroupTesting algorithm: it takes longer to process each item underour implementation, and requires more memory. However,these memory requirements are all very small compared tothe size of the dataset: Group Testing used 17Kb, LossyCounting used 6Kb on average, and Frequent needed only3Kb. Additional speed could be achieved by a more opti-mized implementation, with more attention paid to makingthe hash function fast to evaluate. It is also trivial to par-allelize the Group Testing algorithm, since each test can bedone separately.

4.2 Real DataWe obtained data from one of AT&Ts networks for part

of a day, totaling around 100Mb. This consisted of a se-quence of new telephone connections being initiated, andsubsequently closed. The duration of the connections variedconsiderably, meaning that at any one time there were manytens of thousands of connections in place. In total, therewere 3.5 million transactions. We ran the algorithms on thisdynamic sequence in order to test their ability to operate onnaturally occurring sequences. After every 100,000 transac-tions we posed the query to find all (source,destination) pairswith current frequency greater than 1%. We were groupingconnections by their regional code, giving many millions ofpossible pairs, m, although we discovered that neighboringareas generated the most communication. This meant thatthere were significant numbers of pairings achieving the tar-get frequency. Again, we computed recall and precision forthe three algorithms, with the results shown in Figure 6.

The Group Testing approach is shown to be justified hereon real data, which has no guarantee of having the smalltail property. In terms of both recall and precision, it isnear perfect. On one occasion, it overlooked a hot item,and a few times it includes items which are not hot. Undercertain circumstances this may be acceptable if the items in-cluded are “nearly hot”, that is, are just under the thresholdfor being considered hot. However, we did not pursue thisline. Lossy Counting performs generally poorly on this dy-namic dataset, its quality of results swinging wildly betweenreadings but on average finding only half the hot items. The

recall of the Frequent algorithm looks reasonably good es-pecially as time progresses, but its precision, which beginspoorly, appears to degrade further. One possible explana-tion is that the algorithm is collecting all items which areare ever hot, and outputting these whether they are hot ornot. Certainly, it outputs between two to three times asmany items as are currently hot, meaning that its outputwill necessarily contain many infrequent items.

Lastly, we ran tests which demonstrated the flexibility ofour approach. As noted in Section 3.3, if we have created aset of counters for a particular frequency level f = 1/(k +1), then we can use these counters to answer a query for ahigher frequency level without any need for re-computation.To test this, we computed the data structure for the firstmillion items of the real data set based on a frequency levelof 0.5%. We then asked for all hot items for a variety offrequencies between 10% and 0.5%. The results are shownin Figure 7. As predicted, the recall level was the same(100% throughout), and precision was high, with a few non-hot items included at various points. We then examinedhow much below the designed capability we could push thegroup testing algorithm, and ran queries asking for hot itemswith progressively lower frequencies. Results maintained an

Varying Frequency at Query Time

0.0

0.2

0.4

0.6

0.8

1.0

0.01%0.10%1.00%10.00%Query Frequency

Recall Precision

Above frequency parameter

Below frequency parameter

Figure 7: Choosing the frequency level at query

time: the data structure was built for queries at the

0.5% level, but was then tested with queries ranging

from 10% to 0.02%

Recall on Real Data

0.00.10.20.30.40.50.60.70.80.91.0


Rec

all


Precision on Real Data

0.00.10.20.30.40.50.60.70.80.91.0


Prec

isio

n


Figure 6: Performance results on real data

is very poor: for every hot item that is reported, aroundten infrequent items are also included in the output, and wecannot distinguish between these two types.

There is a price to pay for the extra power of the GroupTesting algorithm: it takes longer to process each item underour implementation, and requires more memory. However,these memory requirements are all very small compared tothe size of the dataset: Group Testing used 17Kb, LossyCounting used 6Kb on average, and Frequent needed only3Kb. Additional speed could be achieved by a more opti-mized implementation, with more attention paid to makingthe hash function fast to evaluate. It is also trivial to par-allelize the Group Testing algorithm, since each test can bedone separately.

4.2 Real DataWe obtained data from one of AT&Ts networks for part

of a day, totaling around 100Mb. This consisted of a se-quence of new telephone connections being initiated, andsubsequently closed. The duration of the connections variedconsiderably, meaning that at any one time there were manytens of thousands of connections in place. In total, therewere 3.5 million transactions. We ran the algorithms on thisdynamic sequence in order to test their ability to operate onnaturally occurring sequences. After every 100,000 transac-tions we posed the query to find all (source,destination) pairswith current frequency greater than 1%. We were groupingconnections by their regional code, giving many millions ofpossible pairs, m, although we discovered that neighboringareas generated the most communication. This meant thatthere were significant numbers of pairings achieving the tar-get frequency. Again, we computed recall and precision forthe three algorithms, with the results shown in Figure 6.

The Group Testing approach is shown to be justified hereon real data, which has no guarantee of having the smalltail property. In terms of both recall and precision, it isnear perfect. On one occasion, it overlooked a hot item,and a few times it includes items which are not hot. Undercertain circumstances this may be acceptable if the items in-cluded are “nearly hot”, that is, are just under the thresholdfor being considered hot. However, we did not pursue thisline. Lossy Counting performs generally poorly on this dy-namic dataset, its quality of results swinging wildly betweenreadings but on average finding only half the hot items. The

recall of the Frequent algorithm looks reasonably good es-pecially as time progresses, but its precision, which beginspoorly, appears to degrade further. One possible explana-tion is that the algorithm is collecting all items which areare ever hot, and outputting these whether they are hot ornot. Certainly, it outputs between two to three times asmany items as are currently hot, meaning that its outputwill necessarily contain many infrequent items.

Lastly, we ran tests which demonstrated the flexibility ofour approach. As noted in Section 3.3, if we have created aset of counters for a particular frequency level f = 1/(k +1), then we can use these counters to answer a query for ahigher frequency level without any need for re-computation.To test this, we computed the data structure for the firstmillion items of the real data set based on a frequency levelof 0.5%. We then asked for all hot items for a variety offrequencies between 10% and 0.5%. The results are shownin Figure 7. As predicted, the recall level was the same(100% throughout), and precision was high, with a few non-hot items included at various points. We then examinedhow much below the designed capability we could push thegroup testing algorithm, and ran queries asking for hot itemswith progressively lower frequencies. Results maintained an

Varying Frequency at Query Time

0.0

0.2

0.4

0.6

0.8

1.0

0.01%0.10%1.00%10.00%Query Frequency

Recall Precision

Above frequency parameter

Below frequency parameter

Figure 7: Choosing the frequency level at query

time: the data structure was built for queries at the

0.5% level, but was then tested with queries ranging

from 10% to 0.02%


Literatur

Literatur

S. Muthukrishnan. Data Streams: Algorithms and Applications.Foundations and Trends® in Theoretical Computer Science. Volume 1, Issue 2, 2005.

G. Cormode and S. Muthukrishnan.What’s Hot and What’s Not: Tracking most frequent items dynamically. ACM PODS, 2003.

43

Vielen Dank für eure Aufmerksamkeit!

Documents

Streaming Algorithmen · 2012. 7. 9. · Juni 2012 Streaming Algorithmen - Dynamische Zuordnung von Objekte mit größten Häufigkeit 4. Der Algorithmus Satz 1: Wenn ein Objekt mit