56
Komplexitätstheorie und effiziente Algorithme Christian Sohler, TU Dortmund Algorithms for geometric data streams

Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

Embed Size (px)

DESCRIPTION

Komplexitätstheorie und effiziente Algorithmen 3 Geometric data streams Massive sets of geometric objects arriving sequentially Objects are typically points Different form of arrival: - sequence of points - sequence of updates Questions Find ways to analyze the geometric structure of the input data using small space Introduction

Citation preview

Page 1: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

Komplexitätstheorie und effiziente Algorithmen

Christian  Sohler, TU Dortmund

Algorithms for geometric data streams 

Page 2: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

2

Komplexitätstheorie und effiziente Algorithmen

Data streams• Massive data set arriving sequentially• Different ways of „arriving“

Examples• Network traffic• Query logs• …

Approach• Find algorithms that make a single (a few) pass(es) and process

data sequentially

Introduction

Page 3: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

3

Komplexitätstheorie und effiziente Algorithmen

Geometric data streams• Massive sets of geometric objects arriving sequentially• Objects are typically points• Different form of arrival:

- sequence of points- sequence of updates

Questions• Find ways to analyze the geometric structure of the input data using

small space

Introduction

Page 4: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

4

Komplexitätstheorie und effiziente Algorithmen

Motivation• Many computational tasks can be interpreted geometrically• Geometric features may be useful in learning and classification• Geometry plays an important role in the application

Examples• Learning • Clustering• How ‚clusterable‘ is a data set?• Road traffic prediction

Introduction

Page 5: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

5

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects

Introduction

Page 6: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

6

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects

Introduction

Page 7: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

7

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes

Introduction

Page 8: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

8

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes

Introduction

Page 9: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

9

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong

Introduction

?

Page 10: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

10

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong• Map object‘s description to Euclidean space

Introduction

?

Page 11: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

11

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong• Map object‘s description to Euclidean space

SVM approach • Compute maximum margin hyperplane• Classifiy points according to their side

Introduction

?

Page 12: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

12

Komplexitätstheorie und effiziente Algorithmen

SVM and SEB (smallest enclosing balls)• Dual of certain SVM formulation is SEB

[Tax, Duin, Pattern Recognition Letters, ‘99]• Geometric streaming SEB can be used as SVM heuristic

[Rai, Daume III, Venkatasubramanian, IJCAI‘09]• Also: Coresets have been used

to construct CSVMs[Tsang, Kwok, Cheung, Journal of Machine Learning Research, ’05]

Introduction

?

Page 13: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

13

Komplexitätstheorie und effiziente Algorithmen

Outline • Merge & Reduce• Embeddings into tree metrics• Estimation of distribution of local neighborhoods• Balanced partitions• Approximating properties of balanced partitions

Introduction

Page 14: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

14

Komplexitätstheorie und effiziente Algorithmen

Insertion-only streams• Sequence of points p ,…, p from R

Merge & Reduce

1 nd

Page 15: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

15

Komplexitätstheorie und effiziente Algorithmen

Definition [k-median clustering]Given a weighted set P of points in R the k-median problem is to find

a set CR of k points (centers) such that cost(P,C) = wmin ||p-c||

is minimized, where w >0 is the weight of point p.

Merge & Reduce

d

pP cC

d

p

p

Page 16: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

16

Komplexitätstheorie und effiziente Algorithmen

Coreset [Har-Peled, Mazumdar, STOC’04]A weighted point set S is a (k,)-coreset of a weighted point set P, if for

every set C of k centers | cost(P,C) – cost(S,C) | cost(P,C).

Merge & Reduce

3

3

3

33

4

4

Page 17: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

17

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 18: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

18

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset

Page 19: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

19

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 20: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

20

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset

Page 21: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

21

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset of Union of Coreset

Page 22: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

22

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 23: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

23

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 24: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

24

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 25: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

25

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 26: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

26

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 27: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

27

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 28: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

28

Komplexitätstheorie und effiziente Algorithmen

Coresets by pre-clustering [Guha, Mishra, Motwani, O‘Callaghan, FOCS’00; Har-Peled, Mazumdar, STOC’04; Frahling, S., STOC‘05]

• Compute a pre-clustering S with >k centers and cost(P,S) Opt• Size exponential in d

Merge & Reduce

3

3

3

33

4

4

k

Page 29: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

29

Komplexitätstheorie und effiziente Algorithmen

Coresets by sampling [Chen, SICOMP’09; Feldman, Monemizadeh, S., SoCG‘07]

• Compute a random non-uniform sample• Show that sample approximates all solutions from a net• Size polynomial in d

Merge & Reduce

MM M/4

Page 30: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

30

Komplexitätstheorie und effiziente Algorithmen

Coresets by reduction to 1D [Har-Peled, Kushal, DCG’07, Feldman, Fiat, Sharir, FOCS‘06]

• Uses geometric arguments to solve 1D• Combine with preclusting using line centers• For k-median: Size independent of n (but exponential in d)

Merge & Reduce

Page 31: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

31

Komplexitätstheorie und effiziente Algorithmen

Open problems • Coresets for k-median of size independent of n and d ?

(Partial result in [Feldman, Monemizadeh, S., SoCG’07])• Coresets for k-median of size O(d/²)• Coresets for k-median of size poly(d, log n)/for

constant c=c(d)>0• Coresets for j-subspace 1-median of size poly(, d, j, log n) ?• Same questions for k-means objective function

Remark: Open questions refer to the definition of coresets from this talk.

Merge & Reduce

2-c

Page 32: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

32

Komplexitätstheorie und effiziente Algorithmen

Insertion/deletion model• Stream consists of Insert(p), Delete(p) operations• Points are from {1,…, }• Stream is consistent, i.e. no Delete(p), if p is not present and no

Insert(p), if p is already present in the current set

Geometric update streams

d

Page 33: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

33

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st

Page 34: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

34

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st tsr

p q

Page 35: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

35

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st tsr

p q

p q sr t

2 i

2 i2 i

2 i

Page 36: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

36

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st tsr

p q

p q

2 i-1

2 i2 i

2 i

qp sr

s tr

2 i-1 2 i-1 2 i-1

Page 37: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

37

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st tsr

p q

p q

2 i2 i

2 i

qp sr

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 38: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

38

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

st tsr

p q

p q

2 i2 i

2 i

qp sr

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 39: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

39

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics D(.,.)• ||p-q|| D(p,q)• E[D(p,q)] = O(log )||p-q||

[Bartal, FOCS’96;Charikar, Chekuri, Goel, Guha,Plotkin, FOCS’98]

tsrp q

p q

2 i2 i

2 i

qp sr

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 40: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

40

Komplexitätstheorie und effiziente Algorithmen

Estimator for cost of Euclidean minimum spanning tree (EMST) [Indyk, STOC’04]

• Write EMST for cost of EMST• Write MST for cost of minimum spanning tree of tree metric D• E[MST ] = O(log ) EMST (linearity of expectation)• Use cost of MST of D as estimator

Streaming algorithms viaembeddings into tree metrics

D

D

Page 41: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

41

Komplexitätstheorie und effiziente Algorithmen

Observation [Indyk, STOC’04]• The MST of D(.,.) is given by the tree defining the tree metric • #edges of length 2 = #non-empty cells in corresponding grid

Streaming algorithms via embeddings into tree metrics

p

q

r

st

tsrp q

p q sr t

2 i2 i

2 i

i

2 i

Page 42: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

42

Komplexitätstheorie und effiziente Algorithmen

Euclidean minimum spanning tree1. Use O(log nested grids G(i) with side length 22. for each grid3. approximate |G(i)| := #nonempty cells in G(i) using F sketch

4. return 2 |G(i)|

Theorem [Indyk, STOC’04]The above algorithm computes a O(log )-approximation to the cost of

the minimum spanning tree.

Streaming algorithms viaembeddings into tree metrics

i

i0

Page 43: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

43

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaembeddings into tree metrics

Results using a similar approach [Indyk, STOC’04]

Earth mover‘s distance O(log )Facility location O(log² )Matching O(log )

k-Median O(1)1+ with huge extraction time

Problem Approx. factor

Page 44: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

44

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

Distribution of neighborhoods • Grids G(i) as before• R-neighborhood of C: cells within distance at most R from C• m (i) is number of points in i-th cell of the R-neighborhood of C

1 2 3

4 5 6 7 8

9 10 11 12 13

14 15 16 17 18

19 20 21

C,R

A cell and its 2-neighborhood

Page 45: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

45

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

EMST estimator• Define Z (i) = ( m (i) > 0 )• EMST can be approximated from the Z (i)• Approx. ratio goes to 1 as R goes to

C,R C,R

C,R

Page 46: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

46

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

EMST estimator• K: Size of R-neighborhood• Z are functions from {1,…,K} to {0,1}• Random (nonempty) C defines distribution over neighborhoods,

i.e. over functions Z:{1,…,K} {0,1}• Can still estimate EMST from this distribution

C,R

Page 47: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

47

Komplexitätstheorie und effiziente Algorithmen

Algorithm• Sample a certain number of nonempty grid cells and maintain

number of points for each cell in their neighborhood• Sample gives estimation of the distribution of the Z (.)• Obtain estimation for EMST from estimated distribution

Theorem [Frahling, Indyk, S., IJCGA’07]Let >0, d be constants.The cost of a Euclidean minimum spanning

tree of a point set in R given as an update stream can be estimated with a factor of 1 using polylog() space.

Streaming algorithms viaestimating the distribution of local neighborhoods

C,R

d

Page 48: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

48

Komplexitätstheorie und effiziente Algorithmen

Open Problems• (1+)-approximation for matching and/or earth mover‘s distance• Other problems? Approach is not very well understood• General characterization of problems solvable via approximation of

the distribution of local neighborhoods

Streaming algorithms viaestimating the distribution of local neighborhoods

Page 49: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

49

Komplexitätstheorie und effiziente Algorithmen

Estimating the distribution [Frahling, S., STOC’05]• Divide space into regions• For each region maintain #points inside• Balance „error“ among regions• Notion of error depends on problem

Example• 1-Median in 1D• Error cell width #points in cell

Streaming algorithms viabalanced partitions

Page 50: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

50

Komplexitätstheorie und effiziente Algorithmen

Small space?• Problem dependent • Need to show that decomposition in few regions with sufficiently

small error exists

Streaming algorithms viabalanced partitions

Page 51: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

51

Komplexitätstheorie und effiziente Algorithmen

One approach [Frahling, S., STOC’05]• Nested grids G(i)• For each grid maintain cells intersected by random sample

(sample sizes differ for different grids)• #sample points inside cell -> #points inside cell• Combine cells from different grids to space decomposition

Streaming algorithms viabalanced partitions

Page 52: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

52

Komplexitätstheorie und effiziente Algorithmen

Works for• k-median• k-means• MaxTSP, MaxMatching, Maximum spanning tree, Average distance,

MaxCut

Why?• Require proof for k-median and k-means• Last 5 problems can be reduced to 1-median

Streaming algorithms viabalanced partitions

Page 53: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

53

Komplexitätstheorie und effiziente Algorithmen

Approximating properties of balanced partitions[Lammersen, S., ESA‘08]• Previous approach may lead to many regions• Example: facility location• Can approximate properties of balanced partitions, e.g. #regions• Only gives approximation of cost of solution• More details in Christiane‘s talk

Streaming algorithms viaapproximation of balanced partitions

Page 54: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

54

Komplexitätstheorie und effiziente Algorithmen

Open problems• Min-sum-k-clustering• Other problems?

Streaming algorithms viabalanced partitions

Page 55: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

55

Komplexitätstheorie und effiziente Algorithmen

(Some) Techniques in geometric streaming:• Merge & Reduce• Embeddings into tree metrics• Estimation of distribution of local neighborhoods• Balanced partitions• Approximating properties of balanced partitions

And lots of open problems to work on…

Summary

Page 56: Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams

56

Komplexitätstheorie und effiziente Algorithmen

Thank you!