Mixture Models & EM algorithm, Lecture 21. David Sontag, New York University. Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore.



Page 1

Mixture Models & EM algorithm
Lecture 21

David Sontag
New York University

Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

Page 2

The Evils of “Hard Assignments”?

• Clusters may overlap

• Some clusters may be “wider” than others

• Distances can be deceiving!

Page 3

Probabilistic Clustering

• Try a probabilistic model!
  – allows overlaps, clusters of different size, etc.

• Can tell a generative story for data: P(X|Y) P(Y)

• Challenge: we need to estimate model parameters without labeled Ys

Y     X1     X2
??    0.1    2.1
??    0.5    -1.1
??    0.0    3.0
??    -0.1   -2.0
??    0.2    1.5
…     …      …

Page 4

The General GMM assumption

[figure: three Gaussian components with means µ1, µ2, µ3]

• P(Y): There are k components

• P(X|Y): Each component generates data from a multivariate Gaussian with mean µi and covariance matrix Σi

Each data point is sampled from a generative process:

1. Choose component i with probability P(y=i)

2. Generate a datapoint x ~ N(µi, Σi)

Gaussian mixture model (GMM)
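A minimal NumPy sketch of this two-step generative process; the mixing probabilities, means, and covariances below are made-up illustrative values, not parameters from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with k=3 components (illustrative parameters only)
pi = np.array([0.5, 0.3, 0.2])                          # P(y = i)
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])   # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])

def sample_gmm(n):
    """Sample n points: pick a component i ~ P(y), then x ~ N(mu_i, Sigma_i)."""
    ys = rng.choice(len(pi), size=n, p=pi)               # step 1: choose component
    xs = np.array([rng.multivariate_normal(mus[i], Sigmas[i]) for i in ys])  # step 2
    return xs, ys

X, Y = sample_gmm(500)
print(X.shape, np.bincount(Y) / len(Y))                  # empirical mixing proportions
```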

Page 5

What Model Should We Use?

• Depends on X!

• Here, maybe Gaussian Naïve Bayes?
  – Multinomial over clusters Y
  – (Independent) Gaussian for each Xi given Y

Y     X1     X2
??    0.1    2.1
??    0.5    -1.1
??    0.0    3.0
??    -0.1   -2.0
??    0.2    1.5
…     …      …

For Gaussian Naïve Bayes, the per-feature log-odds are linear in $x_i$:

$$\ln \frac{\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu_{i0})^2}{2\sigma_i^2}}}{\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu_{i1})^2}{2\sigma_i^2}}}
= -\frac{(x_i-\mu_{i0})^2}{2\sigma_i^2} + \frac{(x_i-\mu_{i1})^2}{2\sigma_i^2}
= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\, x_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}$$

so the decision rule has the linear form $w_0 + \sum_i w_i x_i$ with

$$w_0 = \ln\frac{1-\theta}{\theta} + \sum_i \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}, \qquad w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}$$


$$p(Y_i = y_k) = \theta_k$$

Page 6

Could we make fewer assumptions?

• What if the Xi co-vary?

• What if there are multiple peaks?

• Gaussian Mixture Models!
  – P(Y) still multinomial
  – P(X|Y) is a multivariate Gaussian distribution:

$$P\!\left(X = x_j \mid Y = i\right) = \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_i\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_i\right)^T \Sigma_i^{-1} \left(x_j - \mu_i\right)\right]$$
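A short NumPy sketch that evaluates this density directly from the formula; the test point, mean, and covariance are made up for illustration:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """P(X = x | Y = i) for a multivariate Gaussian with mean mu and covariance Sigma."""
    m = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (m / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

x = np.array([0.1, 2.1])                       # illustrative point
mu = np.array([0.0, 2.0])                      # illustrative mean
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # illustrative covariance
print(gaussian_density(x, mu, Sigma))
```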

Page 7

Multivariate Gaussians

Σ ∝ identity matrix

$$P\!\left(X = x_j \mid Y = i\right) = \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_i\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_i\right)^T \Sigma_i^{-1} \left(x_j - \mu_i\right)\right]$$

Page 8

Multivariate Gaussians

Σ = diagonal matrix: the Xi are independent, as in Gaussian NB

$$P\!\left(X = x_j \mid Y = i\right) = \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_i\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_i\right)^T \Sigma_i^{-1} \left(x_j - \mu_i\right)\right]$$

Page 9

Multivariate Gaussians

Σ = arbitrary (positive semidefinite) matrix:
  – eigenvectors specify rotation (change of basis)
  – eigenvalues specify relative elongation

$$P\!\left(X = x_j \mid Y = i\right) = \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_i\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_i\right)^T \Sigma_i^{-1} \left(x_j - \mu_i\right)\right]$$

Page 10

Multivariate Gaussians

$$P\!\left(X = x_j \mid Y = i\right) = \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_i\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_i\right)^T \Sigma_i^{-1} \left(x_j - \mu_i\right)\right]$$

Covariance matrix Σ = degree to which the xi vary together

Eigenvalue λ of Σ
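A brief sketch of this geometric reading using a made-up covariance matrix: its eigenvectors give the rotation (change of basis) and its eigenvalues the relative elongation of the Gaussian's contours:

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])             # illustrative covariance: x1 and x2 vary together

eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = V diag(lambda) V^T
print("eigenvalues (relative elongation of contours):", eigvals)
print("eigenvectors (rotation / change of basis):")
print(eigvecs)
```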

Page 11

Mixtures of Gaussians (1)

Old Faithful Data Set

[figure: scatter plot of time to eruption vs. duration of last eruption]

Page 12

Mixtures of Gaussians (1)

Old Faithful Data Set

[figure panels: Single Gaussian | Mixture of two Gaussians]

Page 13

Mixtures of Gaussians (2)

Combine simple models into a complex model:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right) \qquad (K = 3)$$

where each $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient.
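A small sketch of this mixture density in Python, summing the K component Gaussians weighted by their mixing coefficients; all parameters are made-up placeholders, and `scipy.stats.multivariate_normal` supplies the component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative K=3 mixture (made-up parameters)
pis = [0.5, 0.3, 0.2]                                   # mixing coefficients, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])]

def mixture_density(x):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pis, mus, Sigmas))

print(mixture_density(np.array([0.0, 0.0])))
```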

Page 14

Mixtures of Gaussians (3)

Page 15

Eliminating Hard Assignments to Clusters

Model data as a mixture of multivariate Gaussians

Page 16

Eliminating Hard Assignments to Clusters

Model data as a mixture of multivariate Gaussians

Page 17

Eliminating Hard Assignments to Clusters

Model data as a mixture of multivariate Gaussians

Shown is the posterior probability that a point was generated from the ith Gaussian: Pr(Y = i | x)

Page 18

ML estimation in the supervised setting

• Univariate Gaussian

• Mixture of Multivariate Gaussians

ML estimate for each of the Multivariate Gaussians is given by:

Just sums over the x's generated from the k'th Gaussian

$$\mu_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j \qquad\qquad \Sigma_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} \left(x_j - \mu_k^{ML}\right)\left(x_j - \mu_k^{ML}\right)^T$$

where each sum runs over the $n_k$ points generated from the k'th Gaussian.
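With the labels observed, these estimates are just per-component sample means and covariances; a hedged sketch, where the tiny X and Y arrays are illustrative placeholders:

```python
import numpy as np

def fit_supervised_gmm(X, Y, K):
    """ML estimates when labels Y are observed: per-component mean, covariance, prior."""
    params = []
    for k in range(K):
        Xk = X[Y == k]                        # points generated from the k'th Gaussian
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        Sigma_k = diff.T @ diff / len(Xk)     # (1/n_k) sum (x - mu)(x - mu)^T
        params.append((mu_k, Sigma_k, len(Xk) / len(X)))
    return params

# tiny illustrative example with made-up labels
X = np.array([[0.1, 2.1], [0.5, -1.1], [0.0, 3.0], [-0.1, -2.0], [0.2, 1.5]])
Y = np.array([0, 1, 0, 1, 0])
for mu, Sigma, prior in fit_supervised_gmm(X, Y, K=2):
    print(mu, prior)
```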

Page 19

That was easy! But what about unobserved data?

• MLE:
  – argmaxθ ∏j P(yj, xj)
  – θ: all model parameters (e.g., class probabilities, means, and variances)

• But we don't know the yj's!!!

• Maximize the marginal likelihood:
  – argmaxθ ∏j P(xj) = argmaxθ ∏j ∑k=1..K P(Yj=k, xj)

Page 20

How do we optimize? Closed form?

• Maximize the marginal likelihood:
  – argmaxθ ∏j P(xj) = argmaxθ ∏j ∑k=1..K P(Yj=k, xj)

• Almost always a hard problem!
  – Usually no closed form solution
  – Even when lg P(X,Y) is convex, lg P(X) generally isn't…
  – For all but the simplest P(X), we will have to do gradient ascent, in a big messy space with lots of local optima…

Page 21

Learning general mixtures of Gaussians

• Need to differentiate and solve for μk, Σk, and P(Y=k) for k=1..K
• There will be no closed form solution, the gradient is complex, and there are lots of local optima
• Wouldn't it be nice if there were a better way!?!

• Marginal likelihood:

$$\prod_{j=1}^{m} P(x_j) = \prod_{j=1}^{m} \sum_{k=1}^{K} P\!\left(x_j,\, y=k\right)
= \prod_{j=1}^{m} \sum_{k=1}^{K} \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_k\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_k\right)^T \Sigma_k^{-1} \left(x_j - \mu_k\right)\right] P(y=k)$$

$$P\!\left(y=k \mid x_j\right) \propto \frac{1}{(2\pi)^{m/2}\,\left\|\Sigma_k\right\|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x_j - \mu_k\right)^T \Sigma_k^{-1} \left(x_j - \mu_k\right)\right] P(y=k)$$
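A short sketch that evaluates this marginal (log-)likelihood by brute force; all parameters are placeholders, and a practical implementation would work in log space (e.g., log-sum-exp) for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, pis, mus, Sigmas):
    """sum_j log sum_k N(x_j | mu_k, Sigma_k) P(y=k)."""
    total = 0.0
    for x in X:
        p_x = sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                  for pi, mu, S in zip(pis, mus, Sigmas))
        total += np.log(p_x)
    return total

# illustrative data and parameters
X = np.array([[0.1, 2.1], [0.5, -1.1], [0.0, 3.0]])
pis = [0.5, 0.5]
mus = [np.array([0.0, 2.0]), np.array([0.0, -1.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(log_marginal_likelihood(X, pis, mus, Sigmas))
```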

Page 22

Expectation Maximization

Page 23

The EM Algorithm

• A clever method for maximizing the marginal likelihood:
  – argmaxθ ∏j P(xj) = argmaxθ ∏j ∑k=1..K P(Yj=k, xj)
  – A type of gradient ascent that can be easy to implement (e.g., no line search, no learning rates, etc.)

• Alternate between two steps:
  – Compute an expectation
  – Compute a maximization

• Not magic: still optimizing a non-convex function with lots of local optima
  – The computations are just easier (often, significantly so!)

Page 24

EM: Two Easy Steps

Objective: argmaxθ lg ∏j ∑k=1..K P(Yj=k, xj | θ) = ∑j lg ∑k=1..K P(Yj=k, xj | θ)

Data: {xj | j = 1 .. n}

• E-step: Compute expectations to “fill in” missing y values according to the current parameters, θ
  – For all examples j and values k for Yj, compute: P(Yj=k | xj, θ)

• M-step: Re-estimate the parameters with “weighted” MLE estimates
  – Set θ = argmaxθ ∑j ∑k P(Yj=k | xj, θ) log P(Yj=k, xj | θ)

Especially useful when the E and M steps have closed form solutions!!!

(Notation is a bit inconsistent: the parameters are written both as θ and as λ.)
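A sketch of the quantity the M-step maximizes (the expected complete-data log-likelihood, with the E-step responsibilities held fixed); the data, responsibilities, and parameters below are all illustrative placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_ll(X, resp, pis, mus, Sigmas):
    """sum_j sum_k P(Y_j=k | x_j, theta) * log P(Y_j=k, x_j | theta)."""
    total = 0.0
    for j, x in enumerate(X):
        for k in range(len(pis)):
            log_joint = np.log(pis[k]) + multivariate_normal.logpdf(x, mean=mus[k], cov=Sigmas[k])
            total += resp[j, k] * log_joint
    return total

X = np.array([[0.1, 2.1], [0.5, -1.1], [0.0, 3.0]])
resp = np.full((3, 2), 0.5)                 # E-step output: P(Y_j = k | x_j, theta)
pis = [0.5, 0.5]
mus = [np.array([0.0, 2.0]), np.array([0.0, -1.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(expected_complete_ll(X, resp, pis, mus, Sigmas))
```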

Page 25

EM algorithm: Pictorial View

Given a set of parameters and training data:

– Estimate the class of each training example using the parameters, yielding new (weighted) training data

– Relearn the parameters based on the new training data (this step is a supervised learning problem)

Class assignment is probabilistic or weighted (soft EM)

Class assignment is hard (hard EM)

Page 26

Simple example: learn means only!

Consider:
• 1D data
• Mixture of k=2 Gaussians
• Variances fixed to σ=1
• Distribution over classes is uniform
• Just need to estimate μ1 and μ2

$$\prod_{j=1}^{m} \sum_{k=1}^{K} P\!\left(x_j, Y_j = k\right) \propto \prod_{j=1}^{m} \sum_{k=1}^{K} \exp\!\left[-\frac{1}{2\sigma^2}\,\left\|x_j - \mu_k\right\|^2\right] P\!\left(Y_j = k\right)$$


Page 27

EM for GMMs: only learning means

Iterate: On the t'th iteration let our estimates be λt = { μ1(t), μ2(t) … μK(t) }

E-step: Compute “expected” classes of all datapoints

M-step: Compute most likely new μs given the class expectations

$$P\!\left(Y_j = k \mid x_j, \mu_1 \ldots \mu_K\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\,\left\|x_j - \mu_k\right\|^2\right) P\!\left(Y_j = k\right)$$

$$\mu_k = \frac{\sum_{j=1}^{m} P\!\left(Y_j = k \mid x_j\right) x_j}{\sum_{j=1}^{m} P\!\left(Y_j = k \mid x_j\right)}$$
Page 28

E.M. for General GMMs

Iterate: On the t'th iteration let our estimates be λt = { μ1(t), μ2(t) … μK(t), Σ1(t), Σ2(t) … ΣK(t), p1(t), p2(t) … pK(t) }

E-step: Compute “expected” classes of all datapoints for each class

$$P\!\left(Y_j = k \mid x_j, \lambda_t\right) \propto p_k^{(t)}\, p\!\left(x_j \mid \mu_k^{(t)}, \Sigma_k^{(t)}\right)$$

pk(t) is shorthand for the estimate of P(y=k) on the t'th iteration; the second factor just evaluates a Gaussian at xj.

M-step: Compute weighted MLEs given the expected classes above:

$$\mu_k^{(t+1)} = \frac{\sum_j P\!\left(Y_j = k \mid x_j, \lambda_t\right) x_j}{\sum_j P\!\left(Y_j = k \mid x_j, \lambda_t\right)} \qquad
\Sigma_k^{(t+1)} = \frac{\sum_j P\!\left(Y_j = k \mid x_j, \lambda_t\right) \left[x_j - \mu_k^{(t+1)}\right]\left[x_j - \mu_k^{(t+1)}\right]^T}{\sum_j P\!\left(Y_j = k \mid x_j, \lambda_t\right)}$$

$$p_k^{(t+1)} = \frac{\sum_j P\!\left(Y_j = k \mid x_j, \lambda_t\right)}{m} \qquad (m = \#\text{training examples})$$
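A compact NumPy sketch of one such EM iteration, following the updates above; the data and initialization are made up, and no numerical-stability tricks are included:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a general GMM: E-step responsibilities, then weighted MLEs."""
    m, K = len(X), len(pis)                  # m = #training examples
    # E-step: P(Y_j = k | x_j, lambda_t) proportional to p_k^(t) N(x_j | mu_k^(t), Sigma_k^(t))
    resp = np.array([[pis[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
                      for k in range(K)] for x in X])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted MLE for each component
    Nk = resp.sum(axis=0)                    # effective counts per component
    new_pis = Nk / m                         # p_k^(t+1)
    new_mus, new_Sigmas = [], []
    for k in range(K):
        mu_k = resp[:, k] @ X / Nk[k]        # weighted mean
        diff = X - mu_k
        Sigma_k = (resp[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
        new_mus.append(mu_k)
        new_Sigmas.append(Sigma_k)
    return new_pis, new_mus, new_Sigmas

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 1.0, (100, 2)), rng.normal([2, 1], 0.5, (100, 2))])
pis, mus, Sigmas = np.array([0.5, 0.5]), [np.zeros(2) - 1, np.zeros(2) + 1], [np.eye(2), np.eye(2)]
for _ in range(20):
    pis, mus, Sigmas = em_step(X, pis, mus, Sigmas)
print(pis, mus)
```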

Page 29

Gaussian Mixture Example: Start

Page 30

After first iteration

Page 31

After 2nd iteration

Page 32

After 3rd iteration

Page 33

After 4th iteration

Page 34

After 5th iteration

Page 35

After 6th iteration

Page 36

After 20th iteration

Page 37

What if we do hard assignments?

Iterate: On the t'th iteration let our estimates be λt = { μ1(t), μ2(t) … μK(t) }

E-step: Compute “expected” classes of all datapoints

M-step: Compute most likely new μs given the class expectations

$$P\!\left(Y_j = k \mid x_j, \mu_1 \ldots \mu_K\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\,\left\|x_j - \mu_k\right\|^2\right) P\!\left(Y_j = k\right)$$

$$\mu_k = \frac{\sum_{j=1}^{m} \delta\!\left(Y_j = k, x_j\right) x_j}{\sum_{j=1}^{m} \delta\!\left(Y_j = k, x_j\right)}$$

δ represents a hard assignment to the “most likely” or nearest cluster.

Equivalent to the k-means clustering algorithm!!!
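For contrast, a sketch of the hard-assignment variant; with δ picking the nearest mean, each iteration is exactly one k-means update (the data and initialization are made up):

```python
import numpy as np

def hard_em_step(X, mus):
    """Hard E-step: assign each point to its nearest mean; M-step: recompute means."""
    # delta(Y_j = k): 1 for the most likely (nearest) cluster, 0 otherwise
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    # mu_k = sum_j delta(Y_j = k, x_j) x_j / sum_j delta(Y_j = k, x_j)
    return np.array([X[assign == k].mean(axis=0) for k in range(len(mus))])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 1.0, (50, 2)), rng.normal([2, 1], 1.0, (50, 2))])
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])
for _ in range(10):
    mus = hard_em_step(X, mus)               # this is the k-means update
print(mus)
```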

Page 38

Let's look at the math behind the magic!

Next lecture, we will argue that EM:
• Optimizes a bound on the likelihood
• Is a type of coordinate ascent
• Is guaranteed to converge to an (often local) optimum