Master Thesis - lms.tf.fau.de · DoA Direction of Arrival DTFT Discrete Time Fourier Transform DP Dirichlet Prior EM Expectation Maximization MAP Maximum A Posteriori ML Maximum Likelihood

Friedrich-Alexander-Universitat Erlangen-Nurnberg

Chair of Multimedia Communications and Signal

Processing

Prof. Dr.-Ing. Walter Kellermann

Master Thesis

Blind Source Separation Methods forAcoustic Sources

Chengyu Huang

November 2017

Supervisor: M.Sc. Andreas Brendel

Erklarung

Ich versichere, dass ich die vorliegende Arbeit ohne fremde Hilfe und

ohne Benutzung anderer als der angegebenen Quellen angefertigt habe,

und dass die Arbeit in gleicher oder ahnlicher Form noch keiner an-

deren Prufungsbehorde vorgelegen hat und von dieser als Teil einer

Prufungsleistung angenommen wurde. Alle Ausfuhrungen, die wortlich

oder sinngemaß ubernommen wurden, sind als solche gekennzeichnet.

————————————

Ort, Datum

————————————

Unterschrift

CONTENTS I

Contents

Abstract V

List of Abbreviations VII

List of Formula Symbols IX

1 Introduction 1

1.1 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Fundamentals 1

2.1 Short Time Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . 1

2.2 Signal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Observation Model for Speech . . . . . . . . . . . . . . . . . . . 3

2.2.2 DoAs Of Source Signals . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 GMM And Wrapped Phase Model . . . . . . . . . . . . . . . . . . . . . 8

2.4.1 GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4.2 Wrapped Phase Model . . . . . . . . . . . . . . . . . . . . . . . 9

2.5 K-means and EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5.1 K-means Clustering[5] . . . . . . . . . . . . . . . . . . . . . . . 12

2.5.2 EM Algorithm[5] . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 GMM Clustering With Dirichlet Prior 16

II

3.1 Dirichlet Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.1 Problem of GMM Fitting . . . . . . . . . . . . . . . . . . . . . 17

3.1.2 Dirichlet Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 MAP Estimator for GMM Parameters . . . . . . . . . . . . . . . . . . 19

3.3 EM Algorithm For Counting Acoustic Sources . . . . . . . . . . . . . . 21

3.3.1 Method Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.2 EM Formula Derivation . . . . . . . . . . . . . . . . . . . . . . 22

4 Split and Merge Method 27

4.1 Tolerance Region of a Gaussian Distribution . . . . . . . . . . . . . . . 27

4.2 The Proposed Split and Merge Method . . . . . . . . . . . . . . . . . . 29

4.2.1 Removal of Gaussian Components . . . . . . . . . . . . . . . . . 29

4.2.2 Merging Gaussian Components . . . . . . . . . . . . . . . . . . 30

4.2.3 Pruning Training Data . . . . . . . . . . . . . . . . . . . . . . . 31

4.2.4 Final Merging Step After Convergence . . . . . . . . . . . . . . 32

5 Infinite Gaussian Mixture Model 34

5.1 GMM and IGMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.2 Component Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Adaptive Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 Chinese Restaurant Process . . . . . . . . . . . . . . . . . . . . . . . . 41

5.5 Post processing for the IGMM Method . . . . . . . . . . . . . . . . . . 43

6 Experimental Results 46

6.1 Results with Artificial Data . . . . . . . . . . . . . . . . . . . . . . . . 46

6.1.1 Test Scenarios and Artificial Training Data . . . . . . . . . . . . 47

6.1.2 Convergence Condition for EM algorithm in the GMM Clustering

Method with DP . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.1.3 Parameters for Clustering Methods . . . . . . . . . . . . . . . . 51

6.1.4 The Effect of Varying Angular Difference . . . . . . . . . . . . . 51

CONTENTS III

6.1.5 Source Counting Accuracy with Varying SNR Values . . . . . . 54

6.1.6 Source Counting Accuracy with Varying Data Variance . . . . . 54

6.1.7 Comparison and Conclusions . . . . . . . . . . . . . . . . . . . . 57

6.2 Experimental Results with Observed DoAs . . . . . . . . . . . . . . . . 57

6.2.1 Test with Varying Source Numbers and Sensor Numbers . . . . 59

6.2.2 Test with Varying Reverberation Time . . . . . . . . . . . . . . 60

6.2.3 Test with Varying Sensor Distance to Circle Center . . . . . . . 61

6.2.4 Test with Varying Source Distance to Circle Center . . . . . . . 62

6.2.5 Comparison and Conclusions . . . . . . . . . . . . . . . . . . . . 63

7 Summary and Outlook 65

List of Figures 66

List of Tables 70

References 73

ABSTRACT V

Abstract

This thesis deals with the BSS-based source counting algorithms to estimate the num-

ber of speech sources from an observed mixture of DoAs. Three methods based on the

assumption of source sparsity are implemented and evaluated under different acoustic

reverberant scenarios. The implemented methods include the GMM clustering method

with Dirichlet prior, which models the mixture of DoAs with a GMM and estimates

the model parameters using the MAP-EM algorithm; the Split and Merge method,

which removes the redundant clusters in the GMM based on ML-EM algorithm; a

non-parametric IGMM clustering method, which determines the number of clusters

in the model automatically by using a CRP representation of Dirichlet process. The

accuracy rate of sourcing estimation in different scenarios, which represents the perfor-

mance of the clustering methods, will be calculated and compared in the experiments.

Keywords : BSS; GMM; Split and Merge; IGMM; speech source counting; EM algo-

rithm; MAP; ARS; CRP.

LIST OF ABBREVIATIONS VII

List of Abbreviations

ARS Adaptive Rejection Sampling

BSS Blind Source Separation

CRP Chinese Restaurant Process

DoA Direction of Arrival

DTFT Discrete Time Fourier Transform

DP Dirichlet Prior

EM Expectation−Maximization

MAP Maximum A Posteriori

ML Maximum Likelihood

PDF Probability Density Function

TDoA Time Difference of Arrival

TR Tolerance Region

GMM Gaussian Mixture Model

IGMM Infinite Gaussian Mixture Model

ICA Independent Component Analysis

STFT Short Time Fourier Transform

SNR Signal-to-Noise Ratio

w.r.t with respect to

VIII

obs. observed

art. artificial

FORMULA SYMBOLS IX

List of Formula Symbols

Notations

non-bold letter a scalar variable

bold letter in lowercase a vector

bold letter in uppercase A matrix

Operators

(·)T Transpose of (·)

| · | Absolute value of (·)

(·)+ Moore-Penrose pseudo-inverse of (·)

arg min {·}/ arg max {·} Argument of minimum/maximum of (·)

∝ Proportional to

arg(·) Argument of (·)

exp(·) Exponential of (·)∑(·) Sum of (·)∏(·) Production of (·)

ln(·) Natural logarithm of (·)

E(·) Expectation of (·)

X

∼ Sampling from

‖ (·) ‖2 Euclidean distance of (·)

Symbols

si source signal qi source location

xj sensor signals dj,dJ sensor location

dn observed DoA h the impulse response

fs sampling frequency ı imaginary unit

v sound velocity φi source elevation angle

θi source azimuth ∆ TDoA vector

D location matrix of sensor pairs dn observed DoA

N total number N Gaussian distribution

µ mean σ standard derivation

σ2 variance Σ covariance

s precision α weight of Gaussian components.

β shape of the Gamma distribution K number of clusters

G Gamma distribution ν parameter for wrapped phase model

rnk indicator variable in k-means r(znk) responsibility in EM algorithm

φ parameter of DP θ parameter set of GMM

a power value c parameter for the power value

λL parameter of Lagrange multiplier rm, rs radius of the sensor and source

σbig upper limit of σ in GMM zp standard score Gaussian distribution

C criterion for the Split and Merge cvar upper limit of σ

clm lower limit of the σ Ψp Mahalanobis distance

FORMULA SYMBOLS XI

z indicator for observed data αlm lower limit of the cluster weight

lk Number of data in component k P power

gl(z) squeezing function ge(z) envelope function

a evaluation points in ARS Ct convergence condition for EM

CHAPTER 1. INTRODUCTION 1

Chapter 1

Introduction

Recently, many source separation algorithms have been developed to separate the sig-

nals of multiple point sources from each other. However, many source separation

methods require the knowledge of the number of sources in advance [3]. This is dif-

ficult because in general only mixture signals are available for the further separation

process [11].

Blind Source Separation (BSS) is an approach to separate signals where no informa-

tion about the source nor the mixing process is available [11]. In the last two decades,

many BSS methods have been proposed. Some BSS methods use Independent Com-

ponent Analysis (ICA) but also many BSS algorithms are based on the assumption

of sparsity of the sources, i.e., that in a speech mixture only one source is active at a

time-frequency bin such that the corresponding short time Fourier spectra of the single

sources do not overlap [3]. Using this assumption, the Directions of Arrival (DoAs)

can be estimated in each time-frequency bin and clustered afterwards. In the end, the

number of clusters yields an estimate of the number of sources.

The aim of this thesis is the implementation and the evaluation of algorithms for the

estimation of the number of speech signals from an observed mixture. The source

counting results of these algorithms can be used in many audio processing studies such

as speech recognition and automatic meeting transcription. In this thesis, all the source

counting algorithms are based on the sparsity assumption of speech sources. The DoAs

2

of the observed speech signal mixture of the is estimated using the source separation

method in [3, 2]. Three clustering methods for speech source counting are discussed

and implemented in this work including:

• The Gaussian Mixture Model (GMM) clustering method with Dirichlet prior

proposed by [3]. The Expectation-Maximization (EM) algorithm is used to solve

the Maximum A Posteriori (MAP) estimation problem of the model parameters

of the GMM.

• The Split and Merge method for the EM algorithm described in [15]. This method

also models the observed DoAs using a GMM and estimates the model parameters

using the EM algorithm. Instead of using the Dirichlet prior this method includes

split and merge steps n the M-step of EM algorithm to reduce the number of

redundant Gaussian components. The underlying optimization problem is a ML

estimation problem.

• The Infinite Gaussian Mixture Model (IGMM)[12]. This method models the

observed DoAs by setting the number of Gaussian components in GMM to be

infinite, which successfully avoids the problem of defining the correct number of

Gaussian components in advance under the circumstances that the source number

is unknown. In the IGMM method the model parameters are estimated using

Gibbs sampling. To limit the number of Gaussian components Rasmussen used

the Chinese Restaurant Process (CRP) representation of the Dirichlet process.

After the implementation, different acoustic reverberate scenarios are designed to in-

vestigate the performances and the robustness of these three methods for counting

speech sources.

1.1 Structure of the Thesis

The structure of the thesis is summarized as:

Chapter 2 - Fundamentals introduces the fundamentals needed for the thesis, such

CHAPTER 1. STRUCTURE OF THE THESIS 3

as the principles of the STFT, the GMM and the EM algorithm. The BSS-based

method for estimating DoAs from speech sources is also described in this chapter.

Chapter 3 - GMM Clustering with Dirichlet Prior describes the GMM method

of [3] for estimating the number of speech sources using a Dirichlet Prior, the derivation

for the MAP-EM algorithm is presented at the end of this chapter.

Chapter 4 - Split and Merge Method introduces principles of the split and merge

method for reducing the redundant Gaussian components in GMM clustering based on

the ML-EM algorithm.

Chapter 5 - Infinite Gaussian Mixture Model introduces the IGMM method of

[12] for data clustering using Gibbs sampling and the CRP.

Chapter 6 - Experimental Results discusses and compares the source counting

results using the three clustering methods.

Chapter 7 - Summary and Outlook reviews the results and conclusions obtained

in this thesis and suggests the directions for the future work.

CHAPTER 2. FUNDAMENTALS 1

Chapter 2

Fundamentals

This chapter introduces the fundamental knowledge which helps readers to understand

the following chapters describing theoretical principles and experiment results. In this

chapter, Section 2.1 introduces the basic idea of the STFT, which is an essential tool

for speech signal processing. Section 2.2 presents a BSS-based DoA estimation method

for multiple sparse sources proposed in [2][3]. The Section 2.3 and 2.4 describe the

definitions of the Gaussian distribution, the GMM as well as the wrapped phase model

proposed by Smaragdis et al. in [14]. In the end, Section 2.5 introduces the basic idea

of k-means and EM algorithm for data clustering.

2.1 Short Time Fourier Transform

The Fourier transform is well known for revealing the frequency contents of a signal.

The Fourier transform of a continuous signal xc(t) is written as

Xc(ω) =

∫ ∞−∞

xc(t)e−ıωtdt. (2.1)

The DTFT of a sequence x[n] can be derived as

X(ω) =∞∑−∞

x[n]e−2πıωn. (2.2)

2

The DFT is obtained by sampling the DTFT at Nft discrete frequencies. The DFT of

time-domain sequence x[n] is written as

X[κ] = x(κ/Nft) =

Nft−1∑n=0

x[n]e−ı 2πκn

Nft . (2.3)

The Fourier transform yields frequency information that is averaged over the entire

time domain. However, the temporal information is hidden in the transform [9]. Fig.

2.1 illustrates an example for a signal with varying time sequences. It can be seen that

although the two signals are very different in time domain, their Fourier transforms are

similar. This is because these two signals have the same frequency content but with

different temporal evaluation.

0 2 4 6 8 10−1

−0.5

0

0.5

1

(a)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

0 1 2 3 4 5−2

−1

0

1

2

Time(seconds)

(b)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

Frequency(Hz)

Figure 2.1: Missing time information of the Fourier transform illustrated by two dif-ferent signals and their magnitude Fourier transforms. (a) Two subsequent sinusoidsof frequency 1 Hz and 5 Hz. (b) Superposition of the same sinusoids.[9]

For non-stationary signals, it is also important to obtain the temporal structure. In

CHAPTER 2. SIGNAL MODELS 3

this case, the DFT over short periods of time can be utilized because speeches can

be considered stationary for short time intervals. This method is named STFT and

commonly used for analysis, modification, and synthesis of speech and audio signals.

Let x : Z → R be a real-valued DT-signal obtained by equidistant sampling with

respect to a fixed sampling rate Fs. w : [0 : Nft−1]→ R is a sampled window function

of length Nft ∈ N. The discrete STFT X of the signal x is given by

X(l, κ) :=

Nft−1∑n=0

x(n+ lH)w(n)exp(−2πıκn/Nft) (2.4)

with l ∈ Z and κ is the frequency index. The complex number X(l, κ) denotes the κth

Fourier coefficient for the lth time frame. Fig. 2.2 shows how the STFT works with a

shifted rectangular window across the signal.

2.2 Signal Models

This Section introduces a signal model as well as an BSS method of DoA estimation

for speech sources proposed by Araki et al in [3, 2]. The observation model built for

speech sources will be presented in Section 2.2.1 and the proposed DoA estimation

method will be introduced in Section 2.2.2.

2.2.1 Observation Model for Speech

The observation model includes Ns ≥ 1 speech sources and Nm sensors. The signal

observed at sensors can be written as

xj(t) =Ns∑i=1

∑l

hji(l)si(t− l), j = 1, ..., Nm. (2.5)

Where hji(l) is the impulse response from source i to sensor j.

If the assumption that speech signals are sufficiently sparse holds, we can assume only

one speech source is dominant at each time-frequency bin [3]. With an F-point STFT,

4

0 2 4 6 8 10−1−0.5

00.5

1

(a)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

0 2 4 6 8 10−1

0

1

(b)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

0 2 4 6 8 10−1

0

1

(c)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

0 2 4 6 8 10−1

0

1

Time(seconds)

(d)

0 2 4 6 8

·10−3

0

1,000

2,000

3,000

Frequency(Hz)

Figure 2.2: Signal and Fourier transform consisting of two subsequent sinusoids offrequency 1 Hz and 5 Hz [9].

Eq. (1.5) is converted into

Xj(f, τ) ≈N∑k=1

Hji(f)Si(f, τ), (2.6)

whereHji is the frequency response from source i to sensor j. f ∈{

0,1

Ffs, ...,

F − 1

Ffs

}is a discrete frequency, fs is the sampling frequency and τ ∈ {0, ..., T − 1} is a time-

frame index. When the assumption of source spareness holds and si(f, τ) is a dominant

CHAPTER 2. SIGNAL MODELS 5

source at the time-frequency bin (f, τ), the observation model can be written as

Xj(f, τ) ≈Hji(f)Si(f, τ). (2.7)

2.2.2 DoAs Of Source Signals

Figure 2.3: Definition of the DoA. A speech source i locates in a spherical coordinate.The DoA of i is represented with a unit norm qi

‖qi‖ marked by the red arrow [2].

In spherical coordinates, Fig. 2.3 illustrates the definition of DoA of a speech source. qi

is a 3-dimensional vector representing the location of source si and the DoA is defined

by the unit-norm qi‖qi‖ as

qi‖qi‖

= [cosθi cosφi, sinθi cosφi, sinφi]T , (2.8)

where θi represents the source azimuth angle and φi is the source elevation angle.

Let dj represent the known location of sensor j. The frequency response Hji(f) is

approximated by a far-field model (see Fig 2.4).

Hji(f) ≈ exp

[ı2πfv−1dT

j

qi‖qi‖

], (2.9)

6

Figure 2.4: Far-field model of a speech source i and two sensors j and J . The TDoAbetween sensors is (dj − dJ)T qi

‖qi‖ marked by the red arrow [2].

where v is the sound velocity. The expression ofHji(f) is derived under the assumption

that the frequency response is only determined by the distance dTjqi‖qi‖ between the

source i to origin O and the source i to a sensor j in direction of sound propagation.

For two sensors j and J , the following expression is obtained

Hji(f)

HJi(f)≈ exp

[ı2πfv−1 (dj − dJ)T

qi‖qi‖

], (2.10)

Because Xj(f, τ)/XJ(f, τ) = Hji(f, τ)/HJi(f, τ) and

arg[Xj(f, τ)X∗J(f, τ)] = arg[Xj(f, τ)/XJ(f, τ)], (2.11)

the phase difference between sensor signals of sensor j and sensor J can be written as

arg[Xj(f, τ)X∗J(f, τ)] = 2πfv−1 (dj − dJ)Tqi‖qi‖

, (2.12)

CHAPTER 2. GAUSSIAN DISTRIBUTION 7

and the DoAs can be calculated with

qi‖qi‖

= D+∆jJc, (2.13)

where ∆jJ =1

2πfarg[Xj(f, τ)X∗J(f, τ)] is the TDoA between sensor j and J , D+ is

the Moore-Penrose pseudo-inverse of (dj − dJ)T. If more than two sensors exist, just

define a location matrix D = [d1 − dJ , ...,dM − dJ ]T and the TDoA vector ∆J =

(∆1J , ...,∆MJ) of all sensor pairs. The DoAs can be calculated by

qi‖qi‖

= D+∆v. (2.14)

2.3 Gaussian Distribution

The Gaussian distribution is one of the most important probability distributions for

continuous variables. For the case of a single real-valued variable x, the probability

density function (PDF) of the Gaussian distribution is

N (x|µ, σ2) =1

(2πσ2)1/2exp

{− 1

2σ2(x− µ2)

}, (2.15)

where the parameter µ is the mean, σ2 the variance. The square root of the variance

σ is called the standard deviation. The precision of a Gaussian distribution is the

reciprocal of the variance 1σ2 .

As a probability distribution, the normal distribution satisfies the requirement

N (x|µ, σ2) > 0. (2.16)

If x is a D-dimensional vector of continuous variables, the Gaussian distribution of x

is written as

N (x|µ,Σ) =1

(2π)D/21

(|Σ|)1/2exp

{− 1

2(x− µ)TΣ−1(x− µ)

}, (2.17)

in this PDF the mean µ is a D-dimensional vector, the DxD matrix Σ is called the

covariance, and |Σ| is the determinant of Σ.

8

2σ

µx

N (x|µ,σ)

Figure 2.5: Plot of the univariate Gaussian showing the mean µ and the standarddeviation σ [5].

2.4 GMM And Wrapped Phase Model

After the DoA estimation, we can plot the PDF of the observed DoAs in a histogram.

Under the assumption that speech sources are sufficiently sparse, the histogram of

DoAs is supposed to have clusters which correspond to the directions of speech sources.

Fig. 2.6 shows the PDF histogram of observed DoAS of three speech sources. Since

the observed mixture consists DoAs of usually more than one sources from different

directions, e.g. there are 3 sources in Fig 2.6. The observed DoAs dn is can be modeled

as a Gaussian mixture model (GMM).

This Section introduces the definition of GMM and the Wrapped Phase Model proposed

by Smaragdis in [14].

2.4.1 GMM

A GMM is a density model formed by the linear superposition of a sufficient number

K of Gaussian components. The GMM for the observed DoAs dn can be expressed as

p(dn) =K∑k=1

αkN (dn|µk, σ2k), (2.18)

where each Gaussian density N (dn|µk, σ2k) is called a component of the mixture and

models the DoAs of one speech source. The parameter αk is the mixing coefficient,

CHAPTER 2. GMM AND WRAPPED PHASE MODEL 9

−180 −120 −60 0 60 120 1800

0.5

1

1.5

2

2.5·10−2

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoA

Figure 2.6: Histogram of observed DoAs of three speech sources. The white columnsrepresent the observed DoAs, the red stems indicate the accurate direction of speechsources.

µk is the mean and σ2k is called the variance of the k-th Gaussian component. If both

p(dn) and the individual Gaussian components are normalized, the mixing coefficient

is restricted by the condition

K∑k=1

αk = 1, (2.19)

this is obtained by integrating both sides of Eq. (2.18) with respect to dn. Fig. shows

an example of GMM.

2.4.2 Wrapped Phase Model

When modeling the data, the problem of phase wrapping should be taken into consid-

eration. According to the signal model of Araki et al. described in Section 2.2, the

observed mixture dn is distributed in the interval [−π, π). Each source is supposed to

be modeled by a Gaussian distribution. If the mean of the observed data is close to

10

x

N (x|µ,σ)

Figure 2.7: Plot of mixtures of three Gaussian components, the linear superposition ofthese Gaussian components builds a GMM.

−π or π, the distribution wraps and becomes bimodal. In this case a Gaussian model

might misrepresent the data [14]. Fig. 2.8 illustrates this phenomenon.

−180−120 −60 0 60 120 1800

0.2

0.4

0.6

0.8

1·10−2

DoA [deg.]

P(D

oA)

−180−120 −60 0 60 120 1800

0.2

0.4

0.6

0.8

1·10−2

DoA [deg.]

P(D

oA)

Figure 2.8: Histogram of observed DoAs. On the left, little phase wrapping happenswhich results into data that can be well modeled by a Gaussian distribution. On theright severe phase wrapping occurs which results in a poor fit of the data by a singleGaussian distribution [14].

To solve this problem the observed DoAs can be modeled by replicating and adding

Gaussian distributions at intervals of 2π

p(dn) =

+∞∑

ν=−∞

1√2πσ2

e−

(dn + ν2π − µ)2

2σ2 if dn ∈ [−π, π)

0 otherwise.

(2.20)

CHAPTER 2. GMM AND WRAPPED PHASE MODEL 11

For example see Fig. 2.9, Gaussian distributed with mean µ = 170◦. The dotted

lines represent the replication of this distribution while the solid line is the sum of

the replications in the interval [−π, π) and results in the wrapped distribution. The

figure shows that the negative left part of the central Gaussian distribution is added

to right Gaussian distribution, and the right part of the right Gaussian which beyond

π is represented by the central Gaussian. So the wrapped phase GMM of

-2π -π 0 π 2π 3π 4π0

5 · 10−2

0.1

0.15

0.2

Wrapped phase model

Figure 2.9: Illustration of the wrapped phase Gaussian distribution model. Replicatinga Gaussian distribution with of 2π (dashed lines) and summing the result in the intervalbetween [−π, π) (solid blue line) results into accurate modeling of a wrapped Gaussiandistribution [14].

p(dn) =K∑k=1

+∞∑ν=−∞

αkN (dn + 2πν|µk, σk) (2.21)

will be used instead of Eq. (2.18), where ν is a integer parameter and is determined

from the observed DoAs.

12

2.5 K-means and EM Algorithm

After building the structure GMM for the observed DoAs. The clustering algorithms

are applied to classify data and estimate the parameters of GMM. This section intro-

duces the basic idea of the k-means algorithm and EM algorithm, which are well known

data clustering algorithms.

2.5.1 K-means Clustering[5]

As described in the previous sections, the observed DoAs are represented by a vector

d = [d1, ..., dN ], the goal of k-means clustering is to partition the data set d into a

given number of K clusters. If µk represents the center of the k-th cluster, we define

a binary indicator variable rnk = [0, 1] to describe which cluster the data point dn is

assigned to,

rnk =

1 if dn is assigned to the k−th cluster.

0 otherwise.

(2.22)

We also define the sum of the squares of the distance of each data point to its assigned

vector µk as an objective function J as

J =N∑n=1

K∑k=1

rnk ‖ dn − µk ‖2 (2.23)

The minimization of J means that each observed DoA is assigned to its nearest cluster.

In oder to minimize J , the process of k-means clustering includes three steps,

1) Initialize values for µk.

2) J is minimized with respect to rnk, keeping the µk fixed. To achieve this goal, the

n-th data point is assign to the closest cluster center, this can be expressed as

rnk =

1 if k = arg min j ‖ dn − µk ‖2

0 otherwise.

(2.24)

CHAPTER 2. K-MEANS AND EM ALGORITHM 13

3) J is minimized with respect to µk, keeping the rnk fixed. The function J is a

quadratic function of µk, it can be minimized by setting the derivative with respect

to µk to zero giving

2N∑n=1

rnk(xn − µk) = 0 (2.25)

which the µk is easily given by

µk =

∑n rnkxn∑n rnk

. (2.26)

The last two steps will be repeated until convergence.

2.5.2 EM Algorithm[5]

This section introduces the general and simplest process of EM algorithm, which aims

to find ML solutions for a GMM without latent variables. The wrapped phase GMM

for observed DoAs is more complicated because it has latent variables. The detailed

formula derivation of the EM algorithm for finding the maximum a posteriori (MAP)

solution for the wrapped is presented in Chapter 3.

The log of the likelihood function of GMM with data set d is written as

ln p(d|α,µ,σ2) =N∑n=1

ln

{ K∑k=1

αkN (dn|µk, σ2k)

}, (2.27)

where α, µ and σ are the model parameters representing the mixing proportions,

means and the variance of the Gaussian components.

To calculate the maximum likelihood solution for µk, we set the derivation of the

likelihood function with respect to the µk and obtain

0 = −N∑n=1

αkN (dn|µk, σ2k)∑K

j=1 αjN (dn|µj, σ2j )σ2k(dn − µk), (2.28)

here we define a parameter γ(znk) called responsibility with

γ(znk) =αkN (dn|µk, σ2

k)∑Kj=1 αjN (dn|µj, σ2

j ). (2.29)

14

µk is calculated with

µk =1

Nk

N∑n=1

γ(znk)dn (2.30)

where Nk , which represents the number of data points assigned to cluster k, is defined

as

Nk =N∑n=1

γ(znk) (2.31)

To find the the maximum likelihood solution for σ2k, we set the derivate of ln p(d|α,µ,σ2)

with respect to σ2 to 0, we obtain

σ2k =

1

Nk

N∑n=1

γ(znk)(dn − µk)2 (2.32)

for a single Gaussian component k. According to the constraint that∑K

k=1 αk = 1, we

use a Lagrange multiplier to maximize the quantity

ln p(d|α,µ,σ2) + λ

( K∑k=1

αk − 1

), (2.33)

so that the maximum likelihood solution for αk can be calculated with

0 =N∑n=1

αkN (dn|µk, σ2k)∑K

j=1 αjN (dn|µj, σ2j )

+ λ (2.34)

If we multiply both sides of (2.34) with αk and we sum both sides with k, we find that

λ = −N and

αk =Nk

N, (2.35)

which means that the mixing coefficient for the k−th Gaussian component is the ratio

of the population of data points assigned to the k−th Gaussian component to the

population of all data points. The EM algorithm for GMM is summarized as

1. Choose an initial setting for the parameter µk, σk and the mixing proportion αk.

Evaluate the initial value of the log likelihood.

CHAPTER 2. K-MEANS AND EM ALGORITHM 15

2. E step Evaluate the responsibilities using the current parameters

γ(znk) =αkN (dn|µk, σ2

k)∑Kj=1 αjN (dn|µj, σ2

j ). (2.36)

3. M step Re-estimate the parameters using the current responsibilities

µnewk =

1

Nk

N∑n=1

γ(znk)dn

(σ2k)

new =1

Nk

N∑n=1

γ(znk)(dn − µnewk )2)

αnewk =

Nk

N

(2.37)

where

Nk =N∑n=1

γ(znk) (2.38)

4. Evaluate the log likelihood

ln p(d|α,µ,σ) =N∑n=1

ln

{ K∑k=1

αkN (dn|µk, σ2k)

}(2.39)

and check for convergence of either the log likelihood or the parameter values. If

the convergence criterion is not satisfied, return to step 2.

In summary, this Chapter introduces the fundamental knowledge of the thesis. Section

2.1 introduces the basic idea of STFT. Section 2.2 presents the DoA estimation method

for speech sources in [4]. To count the number of speech sources, a wrapped phase GMM

is utilized to model the observed DoAs ( introduced in Section 2.3 and 2.4) and the

clustering algorithms such as k-means and EM algorithm can be used to estimate the

parameters of GMM.

16 CHAPTER 3. GMM CLUSTERING WITH DIRICHLET PRIOR

Chapter 3

GMM Clustering With Dirichlet

Prior

In the previous chapter, the DoA estimation method proposed by Araki et al. in [3, 2]

computes multiple DoAs using the information of sensor signals and sensor locations.

The remaining task is to estimate the source number combining all single measure-

ments, this can be accomplished by a clustering algorithm. In their previous work [4],

they applied the k-means algorithm for the observed DoAs by assuming the source

number Ns to be given. The authors of [8] modeled the data with a GMM and derived

an EM algorithm for fitting the GMM to the data. In the context of BSS, however, the

source number is usually unknown. Furthermore, In the under-determined case, i.e.,

the source number exceeds the number of sensor, the estimation can be very difficult [3].

This chapter describes the GMM fitting method with Dirichlet prior[3]. The method

does not require the source number as prior knowledge and works well also in the

under-determined case. A MAP estimator with the EM algorithm is applied to esti-

mate the model parameters.

In this chapter, the derivation and implementation of the method will be presented

CHAPTER 3. DIRICHLET PRIOR 17

and discussed step by step in each section. Section 3.1 defines the Dirichlet prior and

the reason for using it. In Section 3.2, the MAP estimator based on the GMM is

given for estimation of the parameters. In the end, Section 3.3 presents the derivation

of the corresponding EM algorithm. The experiment results of this method will be

documented and compared in Chapter 6.

3.1 Dirichlet Prior

3.1.1 Problem of GMM Fitting

Due to the fact that the setting number of components of EM algorithm (e.g. 10) must

exceeds the actual maximum source number (e.g. 5). Araki et al. describe an often

occurring problem when applying a GMM fitting method to estimate the number of

sources within the DoA histogram. Fig. 3.1 illustrates this problem, the white columns

represent the observed DoAs from three speech sources, the red stems show the correct

position of each source and the GMM fitting result is plotted by black lines. From Fig.

3.1 it can be seen that the source locates at 140◦

is fitted by two Gaussian components,

the same problem also happens on the third source which locates at 110◦. However,

one cluster, which corresponds to one speech source, should be modeled by only one

Gaussian component. This Problem is called overfitting.

3.1.2 Dirichlet Prior

To avoid the overfitting for the data set. Araki et al. proposed to apply the Dirichlet

distribution to the GMM component weight parameter α

p(α) =1

B(φ)

K∏k=1

αφ−1k (3.1)

where α = [α1, ..., αk, ..., αK ] is the vector of component weights,∑K

k=1 αk = 1 because

the distribution is normalized, 0 ≤ αk ≤ 1. Here, φ is the parameter of the Dirichlet

distribution and B(φ) is the beta distribution [5]. Due to the summation constraint

18

−150 −100 −50 0 50 100 1500

0.5

1

1.5

2

2.5·10−2

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoAGMM fit

Figure 3.1: Example of GMM fitting result without a Dirichlet prior. The whitecolumns represent the observed DoAs, the red stems are the correct position of eachsource and the black lines plot the GMM fitting results.

∑Kk=1 αk = 1, the Dirichlet distribution over α is confined to a probability simplex,

Fig. 2.2 shows an example for α = [α1, α2, α3]. Fig. 3.3 illustrates the shape of

Dirichlet distribution looks like by varying the value of the hyper parameter φ. Different

colors indicate the trend of the density Dirichlet distribution p(α). When φ = 1, the

Dirichlet distribution over the simplex becomes a uniform distribution. When φ < 1

the distribution concentrates in the corners and along the boundaries of the simplex.

When φ > 1 the distribution tends to concentrate in the center of the simplex.In

this case the density p(α) reaches its maximum when α = [1, 0, 0], α = [0, 1, 0] or

α = [0, 0, 1]. This is desirable to represent the assumption of source sparsity. Araki et

al. proposed to set the Dirichlet prior φ < 1 to fix the problem of overfitting described

in Section 3.1.1, from Fig 3.4 we can see that each cluster is adequately fitted by one

Gaussian component by using Dirichlet prior φ = 0.01.

CHAPTER 3. MAP ESTIMATOR FOR GMM PARAMETERS 19

Figure 3.2: The Dirichlet distribution over three variables α1, α2, α3 is confined to aprobability simlex (the triangle area) [5].

(a) φ > 1 (b) φ = 1 (c) φ < 1

Figure 3.3: Shapes of the Dirichlet distribution over the simplex with different valuesof φ.

3.2 MAP Estimator for GMM Parameters

As described in Section 2.4.2, the observed DoAs are modeled by a wrapped phase

GMM, written as

p(dn) =K∑k=1

αk

+∞∑ν=−∞

N (dn + 2πν;µk, σk), (3.2)

where the mean µk, the standard deviation σk and the weight αk for each Gaussian

component has to be estimated. In equation (3.2), the number of the Gaussian compo-

nents m and the index ν for the wrapped phase model are handled as latent variables.

20

−150 −100 −50 0 50 100 1500

0.5

1

1.5

2

2.5·10−2

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoAGMM fit

Figure 3.4: GMM fitting example using the Dirichlet prior φ = 0.01.

Let θ = [αk, µk, σk]Kk=1 = [α,µ,σ] be the parameter set of the model. The observa-

tions are the DoA values d = [d1, d2, ..., dn, ..., dN ], where N = TF is the number of

observations, T is the time frames and F is the frequency bins of the STFT. In order to

emphasize the observations of the active speakers with high power by power weighting,

Araki et al. defined the power value a(f, t) = |x(f, t)|2 and the vector of the power

values over all frequency bins and time frames is a = [a1, a2, ..., an, ..., aN ].

The parameter set can be estimated via a MAP estimator by maximizing the posterior

probability

θ = arg maxθ

p(θ|d) (3.3)

According to Bayes Theorem,

arg maxθ

p(θ|d) = arg maxθ

p(d|θ)p(θ)

p(d)

= arg maxθ

p(d|θ)p(θ)

= arg maxθ

p(d,θ).

(3.4)

CHAPTER 3. EM ALGORITHM FOR COUNTING ACOUSTIC SOURCES 21

The log of the joint probability density function p(d,θ) is

L(θ) = log [ p(d|θ) p(α) p(µ) p(σ) ]

= log p(d|θ) + log p(α) + const.(3.5)

The power weight function in relation to a is defined by

f(an) =can∑Nn=1 an

(3.6)

We emphasize the log of probability of observations with this weight function, because

the observations are independent from each other

log p(d|θ) =N∑n=1

f(an)log p(dn|θ), (3.7)

so L(θ) can be written as

L(θ) =N∑n=1

f(an)log p(dn|θ) + log p(α) + const,

=N∑n=1

f(an) log p

( K∑k=1

∞∑ν=−∞

p(k, ν, dn|θ))

+ log p(α) + const,

(3.8)

where p(dn|θ) = p

(∑Kk=1

∑∞ν=−∞ p(k, ν, dn|θ)

)by marginalizing out the variable k

and ν, here the conditional probability p(k, ν, dn|θ) is written as

p(k, ν, dn|θ) =αk√2πσ2

k

exp

(−(dn + 2πν − µk)2

2σ2k

). (3.9)

3.3 EM Algorithm For Counting Acoustic Sources

3.3.1 Method Flow

To find the MAP solution and estimate parameters for GMM, Araki et al. use the EM

algorithm summarized in Tab. 3.1

22

EM Algorithm [3]

Initialization

1. Select a number of initial Gaussian components for the observed DoA mixture,

this number should be bigger than the maximum possible number of sources.

2. Initialize the means of the GMM by k-means clustering.

Repeat

1. E step: calculate the posterior probability p(k, ν|dn,θt) using equation (3.15),

here θt is the parameters calculated in the previous iteration.

2. M step: calculate the parameters θ with equation (3.22), (3.23) and (3.36).

Until the convergence of the GM parameters is arrived, i.e. the parameter difference

between two iterations is smaller than a given threshold.

Table 3.1: Summery of the EM Algorithm[3].

3.3.2 EM Formula Derivation

As described in Section 3.2, the log of the joint probability is given as

log p(d,θ) =N∑n=1

f(an) log p

( K∑k=1

∞∑ν=−∞

p(k, ν, dn|θ)

)+ log p(α) + const. (3.10)

where

p(k, ν, dn|θ) =αk√2πσ2

k

exp

(−(dn + 2πν − µk)2

2σ2k

), (3.11)

and


. (3.12)

The auxiliary function Q [5] is given as the conditional expectation of log p (d,θ) given

dn and θt as

Q(θ|θt) = E[log p (d,θ)|dn,θt], (3.13)


where θt is the estimate of the parameters after the t-th iteration. With Jensen’s

formula (log is concave):

log

( K∑k=1

ν=+∞∑ν=−∞

p(k, ν, dn|θ)

)≥

K∑k=1

ν=+∞∑ν=−∞

log p(k, ν, dn|θ), (3.14)

where we define

p(k, ν|dn,θt) =p(k, ν, dn|θt)∑K

k=1

∑ν=+∞ν=−∞ p(k, ν, dn|θt)

. (3.15)

With (3.10) to (3.15), the auxiliary function Q can be extended to

Q(θ|θt) =∑n

∑k

∑ν

[p(k, ν|dn,θ)f(an)log p(k, ν, dn|θ)] + log (α). (3.16)

The partial derivation of Q(θ|θt) by µk is written as

∂Q(θ|θt)∂µk

=∑n

∑ν

f(an)∂

∂µkp(k, ν|dn,θt)log p(k, ν, dn|θ)

=∑n

∑ν

f(an) p(k, ν|dn,θt)∂

∂µklog p(k, ν, dn|θ),

(3.17)

where the last term of (3.17) can be written based on (3.11) by

∂

∂µklog p(k, ν, dn|θ) =

∂

∂µk

(log

αk√2πσ2

k

+

(−(dn + 2πν − µk)2

2σ2k

))=dn + 2πν − µk

σ2k

(3.18)

Setting ∂Q(θ|θt)∂µk

= 0 and inserting (3.18) in (3.17) we have

∑n

∑ν

f(an) p(k, ν|dn,θt)(dn + 2πν − µk

σ2k

)!

= 0. (3.19)

Now we solve (3.19) for µk

µkσ2k

∑n

∑ν

f(an) p(k, ν|dn,θt) =∑n

∑ν

f(an) p(k, ν|dn,θt)dn + 2πν

σ2k

, (3.20)

using the definition (3.15),

24

µk∑n

∑ν

f(an)p(k, ν|dn,θt) =∑n

∑ν

f(an)p(k, ν|dn,θt)(dn + 2πν) (3.21)

Finally, we obtain

µt+1k =

∑n

∑ν f(an)p(k, ν|dn,θt)(dn + 2πν)∑n

∑ν f(an)p(k, ν|dn,θt)

. (3.22)

By setting ∂Q(θ|θt)∂σ2k

= 0, we obtain the parameter (σ2k)

(t+1) by

(σ2k)

(t+1) =

∑n

∑ν f(an)p(k, ν|dn,θt)(dn + 2πν)∑n

∑ν f(an)p(k, ν|dn,θt)

− (µt+1k )2. (3.23)

In addition, by using the Lagrange multiplier method with the constraint∑K

k=1 αk = 1,

the partial derivative of Q(θ|θt) w.r.t αk is calculated by

∂

∂αk

(Q(θ|θt)− λL(

∑k

αk − 1))

=∂

∂αkQ(θ|θt)− λL (3.24)

∂

∂αkQ(θ|θt) =

∑n

∑ν

f(an)p(k, ν|dn,θt)∂

∂αklog p(k, ν, dn|θ)︸︷︷︸

(I)

+∂

∂αklog p(α)︸︷︷︸(II)

(3.25)

where the derivative in (I) of (3.25) is computed by

∂

∂αklog p(k, ν, dn|θ) =

∂

∂αk

(log

αk√2πσ2

k

− (dn + 2πν − µk)2

2σ2k

)=

∂

∂αk

(logαk − log

√2πσ2

k

)=

1

αk,

(3.26)


and (II) is calculated by

∂

∂αklog p(α) =

∂

∂αklog

1

B(φ)

K∏k=1

αφ−1k

=∂

∂αk

(− logB(φ) +

K∑k=1

logαφ−1k

)

=∂

∂αk

K∑k=1

(φ− 1)logαk

= (φ− 1)1

αk.

(3.27)

Inserting the results for (I) and (II) in (3.25) yields

∂

∂αkQ(θ|θt) =

∑n

∑ν

f(an)p(k, ν|dn,θt)1

αk+ (φ− 1)

1

αk. (3.28)

Setting (3.28) to zero and include λL in the both sides

∂

∂αkQ(θ|θt)− λL =

1

αk

((φ− 1) +

∑n

∑ν

f(an)p(k, ν|dn,θt))− λL

!= 0. (3.29)

αk is dependent on λL

αk =1

λL

((φ− 1) +

∑n

∑ν

f(an) p(k, ν|dn,θt)). (3.30)

Inserting (3.30) in (3.29) we have

∂

∂λL

(Q(θ|θt)− λL

(∑k

αk − 1))

=∑k

αk − 1!

= 0, (3.31)

the derivative of λL is

∑k

1

λL

((φ− 1) +

∑n

∑ν

f(an) p(k, ν|dn,θt))− 1 = 0. (3.32)

Inserting (3.31) in (3.32) we have

λL =∑k

((φ− 1) +

∑n

∑k

f(an) p(k, ν|dn,θt))

(3.33)

26

the weight parameter α is finally calculated by

αk =(φ− 1) +

∑n

∑ν f(an) p(k, ν|dn,θt)

K(φ− 1) +∑

k

∑n


=(φ− 1) +

∑n


K(φ− 1) +∑

n f(an)∑

k

∑ν p(k, ν|dn,θt)

,

(3.34)

where∑

k

∑ν p(k, ν|dn,θt) = 1, so

αk =1

K(φ− 1) +∑

n f(an)

((φ− 1) +

∑n

∑ν


(3.35)

because of∑

n f(an) =

∑n can∑n an

= c,

αk =1

K(φ− 1) + c

((φ− 1) +

∑n

∑ν


(3.36)

CHAPTER 4. SPLIT AND MERGE METHOD 27

Chapter 4

Split and Merge Method

This chapter introduces the Split and Merge method[15]. Similar to the GMM clus-

tering method described in Chapter 3, the goal of the Split and Merge method is also

to estimate the number of sources and Gaussian mixture parameters from an observed

set of DoA mixture using the EM algorithm. The main difference of them is how the

overfitting problem is handled. Araki et al. applied a Dirichlet prior to the weight

parameter vector of the Gaussian components of the GMM, whereas the Split and

Merge method introduces conditions to remove and merge extra Gaussian components

as well as to prune data in the fitting process. The advantage of this Split and Merge

method is its fast convergence speed compared with the standard EM algorithm in the

GMM clustering method with a Dirichlet prior. The pseudo code of the method flow

is summarized in Table 4.1. We start with the Split and Merge method proposed in

[15], which fits the position estimates in the 2D. Since the observed data in thesis is

1D, the author of this thesis made some necessary modifications to the constraints and

parameters in the following.

4.1 Tolerance Region of a Gaussian Distribution

For a Gaussian distribution with mean µ and variance σ2, the probability mass pmass

is a cumulative distribution function, which gives the probability that value x is in a

28

Split and Merge Method

Initialization

1. Select a number of initial Gaussian components to represent the observed DoA

data by a GMM,this number corresponds to the maximum number of sources.

2. Initialize the GMM by k-means clustering.

Repeat

1. Perform the standard EM algorithm for 100 iterations.

2. Estimate number of components.

2 (a). Removing a Gaussian component.

2 (b). Merging Gaussian components.

3. Prune training data.

Until The difference of parameter values of the GMM between two iterations

is sufficiently small. Run a final Mahalanobis distance-based merger.

Table 4.1: Pseudo code of the Split and Merge method[15].

certain interval [µ − zpσ, µ − zpσ][1]. Such an interval with a certain value of pmass

is called a tolerance interval[15]. The blue area of Fig. 4.3 illustrates this tolerance

interval. TR is the abbreviation of Tolerance Region. The cumulative probability pmass

can be written as

pmass = p(− zp ≤

x− µσ≤ zp

)= p

((x− µ)2

σ2≤ z2p

),

(4.1)

where zp is called the standard score of the Gaussian distribution. Let Ψpmass = z2p ,

pmass can be represented as

p

((x− µ)2

σ2≤ z2p

)= p

((x− µ)2

σ2≤ Ψpmass

). (4.2)

CHAPTER 4. THE PROPOSED SPLIT AND MERGE METHOD 29

Given the value of pmass, zp as well as Ψpmass can be calculated with the z-table of

Gaussian distribution in [1].

pmassµ− zpσ µ µ+ zpσ

x

p

Figure 4.1: Figure of a Gaussian distribution, the blue area represents the toleranceregion [µ−zpσ, µ−zpσ], the cumulative probability pmass is calculated by the integrationof the distribution PDF within the tolerance region[1].

4.2 The Proposed Split and Merge Method

Given the vector of observed DoAs d = [d1, ...dN ], the GMM parameter θ can be

estimated with the EM algorithm[5]. Taseska et al. implemented the Split and Merge

method to update the number of sources by removing redundant components and

merging components that model the same source.

4.2.1 Removal of Gaussian Components

The first step in the Split and Merge approach is a component removing step, in which

Taseska et al. defined three empirical criteria C1, C2 and C3 to decide if the k-th

Gaussian component in the current iteration models a source.

The first criterion C1 is a variance criterion, which aims to remove the Gaussian compo-

nents with significantly larger variance among all components in the mixture. Because

30

a component with large variance is assumed to model noise [15]. This criterion C1 is

written as

C1 : σk < cvar · σk∗ ,

k∗ = arg mink′

(σk′), (4.3)

where cvar is a constant which represents the upper limit of the standard deviation of

the Gaussian components, and k∗ is the Gaussian component with minimal variance

value in the current iteration.

The second criterion C2 aims at removing the components with variance value which

is under the lower limit clm for the standard deviation of the component

C2 : σk < clm, (4.4)

where clm is a pre-defined constant. The third criterion C3 aims to remove the compo-

nents, whose tolerance region includes at least two means from other Gaussian com-

ponents. For example in Fig. 3.2, the blue area is the tolerance region of the middle

Gaussian component with mean µ2. It can be clearly seen that the means µ1 and µ3 of

other two components are within this blue area, so the component in the middle will

be removed. Formally this criterion is written as

C3 : (µk′ − µk)2/σ2k ≤ Ψpµ , (4.5)

where Ψpµ is the Mahalanobis distance for the tolerance region defined and calculated

by a probability pµ = 0.9. If the Mahalanobis distance between the µk and µ′k (here

two values of k′) is smaller than Ψpµ , component k will be removed.

4.2.2 Merging Gaussian Components

If the means of two Gaussian components are located close to each other, these two

components are likely to model the same source and should be merged. This component


TR2µ1 µ2 µ3

x

p

Figure 4.2: Figure of three Gaussian components with their means µ1, µ2 and µ3, TR2represents the tolerance region of the Gaussian component in the middle. Accordingto the criterion C3, this middle component will be removed.

merging condition can be defined as

‖ µk′ − µk ‖2< cµ, (4.6)

where ‖ · ‖2 is the Euclidean norm and cµ is a user-defined constant. If the condition

(4.6) is satisfied, component k′ and k are merged into one component with

µmerge =µk′ + µk

2, σmerge =

σ(k′) + σ(k)

2and αmerge = αk′ + αk. (4.7)

4.2.3 Pruning Training Data

After the removing and merging step for components, the redundant data points, which

do not belong to any remaining components will be removed. More clearly, if a DoA

estimate dn dose not belong to the tolerance region of any remaining Gaussian compo-

nents, it will be removed for the next iteration. This is called data pruning. Fig. 3.3

illustrates this with an example for the case that two Gaussian components remained

after the removing and merging step, and data points included by an orange area,

which will be removed in this pruning step since they do not belong to the tolerance

region of any Gaussian component. the pruning is written by

(dn − µk)2

σ2k

≥ Ψpmass (4.8)

32

where Ψpmass is calculated with the given cumulative probability pmass = 0.95.

TR1 TR2µ1 µ2

x

p

Figure 4.3: Figure of two Gaussian components, where TR1 and TR2 represent thetolerance regions of the Gaussian components. The data in the blue areas ate includedin the tolerance regions, while the data in orange areas ate outside the tolerance regions.

4.2.4 Final Merging Step After Convergence

After repeating the following steps: Em algorithm, removing of extra components,

merging multiple components and pruning data until convergence, a final merging step

need to be done to make sure that each cluster is modeled bu only one Gaussian

component. Two Gaussian components k and k′ are merged to one component if one

of these following conditions holds:

(µk − µk′)2

σ2k

≥ Ψpmerge , or (4.9)

(µk − µk′)2

σ2k′

≥ Ψpmerge . (4.10)

Similar to Ψpmass mentioned before, Ψpmerge is calculated in the same way with a given

probability pmerge = 0.95

In summary, this chapter introduces the Split and Merge method[15] for data clustering.

It reduces the iteration steps of the EM algorithm by removing redundant components

and data points in the M-step. The performance of the Split and Merge method will


be evaluated in the experimental chapter.

34 CHAPTER 5. INFINITE GAUSSIAN MIXTURE MODEL

Chapter 5

Infinite Gaussian Mixture Model

This chapter introduces the Infinite Gaussian Mixture Model (IGMM) clustering method

proposed in [12]. As described in the Chapter 3 and Chapter 4, Araki et al. and Taseska

et al. fitted the observed DoAs with a finite number of Gaussian components. Ras-

mussen considered that it is not necessary a priori to limit the number of components

to be finite [12] and pointed out the main disadvantage of the finite GMM clustering

method is that the correct number of mixture components needs to be defined before

clustering. Unfortunately, this is not possible in many situations, because the number

of clusters is unknown.

From the experience on implementing the GMM clustering method proposed by Araki

et al., we know that the in advance specified number of Gaussian components K must

exceeds the maximum possible number of speech sources, which is 5 in our experi-

ment. However, if this component number is too large, it brings the side effects of

”overfitting”. The hierarchical nonparametric IGMM method by Rasmussen success-

fully avoids this problem. In the proposed method, Rasmussen derived a hierarchical

IGMM and sampled model parameters with Gibbs sampling. For parameters whose

density function very difficult to sample from, he used the adaptive rejection sampling

in [6]. The Chinese restaurant process representation of the Dirichlet process is used

in the proposed method to limit the number of components and assign observations to

them.

CHAPTER 5. GMM AND IGMM 35

5.1 GMM and IGMM

The finite GMM with K components is written as

p(d|µ1, ..., µK , s1, ..., sK , α1, ..., αK) =K∑k=1

αkN (µk, s−1k ), (5.1)

where d = [d1, ..., dN ] is the training data and N is the data length, µk is the mean,

sk is the precision and αk is the mixing proportions of the k-th Gaussian component.

The IGMM where the component number k →∞ can be written as

p(d|µ1, ..., µK , s1, ..., sK , α1, ..., αK) =N∑i=1

p(zn = k)∞∑k=1

αkN (µk, s−1k ), (5.2)

Rasmussen defined here a indicator zn for each observation, which takes value 1...k.

The indicator zn = k indicates the k-th component has generated the observation di

and p(zn = k) is the conditional prior for components.

5.2 Component Parameters

This section specifies the priors on component parameters and hyper parameters of the

IGMM. The conditional distributions for the parameters will also be derived for the

Gibbs sampling. So the component means µk are given Gaussian priors,

p(µk|λ, r) ∼ N (λ, r−1), (5.3)

where the mean λ and precision r are hyper parameters common to all components.

The hyper parameters themselves are given vague Normal and Gamma Priors,

p(λ) ∼ N (µd, σ2d), p(r) ∼ G(1, σ2

d) ∝ r−1/2exp(−rσ2d/2), (5.4)

here the µd and σ2d are the mean and variance of the observed data. The shape of the

the Gamma prior is set to unity, corresponding to a very broad distribution.

The component precision sk are given Gamma priors with

p(sk|β, ω) ∼ G(β, ω−1), (5.5)

36

its shape β and mean ω−1 are given inverse Gamma form

p(β−1) ∼ G(1, 1)⇒ p(β) ∝ β−3/2 exp(−1/(2β)), p(ω) ∼ G(1, σ2d). (5.6)

By multiplying the (5.2) and the (5.3) we obtain the conditional posterior distribution

for the means

p(µk|z,d, sk, λ, r) ∼ N(dklksk + λr

lksk + r,

1

lksk + r

), dk =

1

lk

∑n:zn=k

dn, (5.7)

here the lk is the number of observations belonging to component k and dk is the

mean of these observations. Meanwhile, the conditional posterior distribution for the

precision are obtained by multiplying the Eq. (5.5) and Eq. (5.2) conditioned on the

indicators zn,

p(sk|z,d, µk, β, ω) ∼ G(β + lk, [

1

β + lk(ωβ +

∑n:zn=k

(dn − µk)2)]−1). (5.8)

Combining the (5.3) and (5.7) we obtain the conditional posteriors for the hyper pa-

rameters λ and r

p(λ|µ1, ..., µK , r) ∼ N(µd σ

−2d + r

∑Kk=1 µk

σ−2d +Kr,

1

σ−2d +Kr

), (5.9)

p(r|µ1, ..., µK , r) ∼ G(K + 1, [

1

K + 1(σ2

d +K∑k=1

(µk − λ)2)]−1). (5.10)

Similarly, we obtain the conditional posteriors for the hyper parameters ω and β by

combining the Eq. (5.5), Eq. (5.6) and Eq. (5.8).

p(ω|s1, ..., sK , β) ∼ G(Kβ + 1, [

1

Kβ + 1(σ−2d + β

K∑m=1

sk)]−1), (5.11)

p(β|s1, ..., sK , ω) ∝ Γ(β

2)−Kexp(

−1

2β)(β

2)(Kβ−3)/2

K∏k=1

(skω)β/2exp(− βskω

2

). (5.12)

It is very difficult to sample β directly from the density function of Eq. (5.12), but

p(log(β)|s1, ..., sK , ω) is log-concave, so we can generate the independent samples from

CHAPTER 5. ADAPTIVE REJECTION SAMPLING 37

the distribution for log(β) utilizing the adaptive rejection sampling technique, and

transform these to get values for β.

The mixing weight for the components αk are given symmetric Dirichlet prior with the

concentration parameter φ/K,

p(α1, ..., αK |φ) ∼ Dirichlet(φ/K, ..., φ/K) =Γ(φ)

Γ(φ/K)K

K∏k=1

αφ/K−1k , (5.13)

and the concentration parameter φ is given the inverse Gamma shape

p(φ−1) ∼ G(1, 1)⇒ p(φ) ∝ φ−3/2exp(− 1/(2φ)

). (5.14)

Given the mixing proportions the joint distribution of the indicators zn is written as

p(z1, ..., zK |α1, ..., αK) =K∏k=1

αlkk , lk =N∑i=1

δKronecker(zn, k). (5.15)

We can integrate out the mixing proportions using the standard Dirichlet integral

p(z1, ..., zK |φ) =

∫p(z1, ..., zK |α1, ..., αK)p(α1, ..., αK)dα1...dαK

=Γ(φ)

Γ(φ/K)K

∫ K∏k=1

αlk+φ/K−1k dαk

=Γ(φ)

Γ(N + φ)

K∏k=1

Γ(lk + φ/K)

Γ(φ/K).

(5.16)

Combing the (5.14) and (5.16), we obtain

p(l1, ..., lK |φ) =φKΓ(φ)

Γ(N + φ), p(φ|K,N) ∝ φK−3/2 exp(−1/(2φ))Γ(φ)

Γ(N + φ). (5.17)

Because the distribution p(log(φ)|K,N) is log-concave, the samples of this distribution

can be easily generated using the adaptive rejection sampling.

5.3 Adaptive Rejection Sampling

The Gibbs sampling is a famous Markov chain Monte Carlo (MCMC) technique to

generate samples from complicated multivariate probability distributions. It is com-

38

monly used to update the value of variables from its conditional distribution given all

other variables in the system[5].

For example, suppose we want to sample from the indicator distribution p(z) =

p(z1, ..., zN). In each step of the Gibbs sampling, we draw a value from the distri-

bution p(zi|zri) to replace the original value zi, where zi denotes the i−th component

of z, zri denotes [z1, ..., zi−1, zi−1, ..., zN ] and p(zi|zri) is the distribution of zi condi-

tioned on other variables. The procedure of Gibbs sampling to get T samples for p(z)

can be summarized as in the table 5.1.

Gibbs Sampling

1. Initialize zi : i = 1, ..., N

2. For τ = 1, ..., T :

Sample zτ+11 ∼ p(z1|zτ2 , zτ3 , ..., zτN).

Sample zτ+12 ∼ p(z2|zτ+1

1 , zτ3 , ..., zτN).

:

Sample zτ+1k ∼ p(zk|zτ+1

1 , ..., zτ+1j−1 , z

τj−1, ..., z

τN).

:

Sample zτ+1N ∼ p(zj|zτ+1

1 , ..., zτ+1N−1).

Table 5.1: Summary of steps in Gibbs sampling [5].

However for a hierarchical model, the Gibbs sampling for a conditional distribution can

be very computational expensive, because the conditional distribution might includes

the product of many terms [6]. To improve the computational efficiency of Gibbs sam-

pling, Gilks and Wild proposed the Adaptive rejection sampling (ARS) technique for

Gibbs sampling in [6].

To introduce ARS we first introduce the idea of rejection sampling. Suppose we want

to sample from the complex density distribution p(z) in its domain D, because the

CHAPTER 5. ADAPTIVE REJECTION SAMPLING 39

distribution p(z) is complicated, which means it is very difficult to sample directly

from. However, another density function g(z) is considered from which we can sample

directly under the restriction that g(z) = bp(z) for some possibly unknown value of b,

i.e., g(z) ∝ p(z). To sample T points from p(z), we define an envelope function ge(z)

that ge(z) ≥ g(z) and a squeezing function gl(x) that gl(z) ≤ g(z) for all z in D. The

steps of rejection sampling algorithm is described in the Table. 5.2.

Non-adaptive Rejection Sampling

1. Define an envelope function ge(z) that ge(z) ≥ g(z) for all x in D.

2. Define a squeezing function gl(z) that gl(z) ≤ g(z) for all x in D.

Repeat

1. Sample a value z∗ from ge(z) and gl(z).

2. Sample a value u independently from uniform distribution [0, 1].

3. Perform the rejection test

if u ≤ gl(z∗)/ge(z

∗)

z∗ accepted

else if u ≤ g(z∗)/ge(z∗)

z∗ accepted

else z∗ rejected

Until T values have been accepted.

Table 5.2: Summary of steps in non-adaptive rejection sampling.[6]

However, it is always very difficult to find a suitable ge(z) in D. Alternatively, the

envelope function can be constructed on the fly based on some measured values of

the distribution p(z) [6]. The basic idea of ARS is to form a piece-wise linear upper

envelope function g′(z) from a set of tangents to the log of the target density p(z)

40

and improve this envelope function g′(z) adaptively during the sampling process. The

piece-wise linear upper envelope function g′(z) is written as

g′(z) ≥ log p(z) (5.18)

We evaluate log p(z) and g′(z) at τ points a1 < a2... < aτ , the vector of this points is a.

The envelope function g′(z) is composed of the tangents of log p(z) at these points. Fig.

5.1 plots the construction of the envelope function. Each time a point z = a is rejected,

it is added to the evaluation points to improve the quality of g′(z). Because g′(z) is

piecewise-linear, we can easily get samples from exp g′(z), where exp g′(z) ≥ p(z). Like

in the rejection sampling, we also define a squeeze function g′l(z) to apply the squeeze

test. The steps of ARS is summarized in Tab. 4.2.

a1

a2a3

a4

z

log

p(z

)

Figure 5.1: Plot of ARS, the envelope function is built using the tangent lines of a logconcave function, the tangent lines are computed at a set of points a. [5]

Under the assumption that p(z) is log-concave, the ARS is a computational efficient

sampling method because it reduces the number of evaluations. Each sample is more

CHAPTER 5. CHINESE RESTAURANT PROCESS 41

Adaptive Rejection Sampling

1. Select the evaluation points a.

2. Form the piecewise-exponential upper bound exp g′(z)

and lower bound g′l(z) using points a.

Repeat

1. Sample a value z∗ from exp g′(z)/∑

exp g′(z).

2. sample a value u independently from uniform distribution (0, 1).

3. Perform the rejection test:

if u ≤ g′l(z∗)/exp g′(z∗)

z∗ accepted

else if u ≤ p(z∗)/exp g′(z∗)

z∗ accepted

else Add the corresponding z∗ to a and recompute upper

and lower bounds exp g′(z) and g′l(z).

Until number of points required have been accepted.

Table 5.3: Summary of steps in ARS.[13]

likely to be accepted with the improving quality of the envelope function g′(z).

5.4 Chinese Restaurant Process

To generate the indicator variable zn, Rasmussen uses the Chinese restaurant process.

The Chinese restaurant process can be described briefly according to [16]. Suppose

there is a Chinese Restaurant with an infinite number of tables, each of which can seat

an infinite number of customers, see Fig. 5.2.

42

Figure 5.2: Simple diagram of Chinese restaurant process, the tables are marked bythe big circles, the customers are represented by the small squares.

The tables are chosen according the following random process:

1. The first customer sits at the first table.

2. The second customer chooses to sit with the first customer in the first table or

sit by himself in a new empty table.

3. ...

4. The N -th customer chooses to sit by himself with the probabilityφ

N − 1 + φor

sit in the k-th table with other customers with the probabilitylk

N − 1 + φ, where

lk denotes the number of customers already sitting at the k-th table.

Note that it is more likely that the new customer chooses the table with more customers

(the rich-gets richer phenomenon), such that the growth speed of the number of tables

is much slower than the growth speed of the number of customers. In our case of

data clustering, the customers are the observed data, and the tables in the restaurant

correspond to the mixture components. The rich-gets-richer phenomenon means that

we expect large clusters, i.e., the number of components K is smaller than the number

of the observations N [16]. Rasmussen uses this characteristic of the Chinese restaurant

process to realize the infinite limit of the component number, so we can work with the

finite number of clusters. The limits of conditional prior for the indicators can be

written as

CHAPTER 5. POST PROCESSING FOR THE IGMM METHOD 43

• components where l−n,k > 0 :

p(zn = k|z−n, φ) =l−n,k

N − 1 + φ(5.19)

• new component:

p(zn = knew|z−n, φ) =φ

N − 1 + φ. (5.20)

By combining the density function Eq. (5.2) and the prior limit Eq. (5.4) we obtain

the conditional posteriors for the indicators

• components where l−n,k > 0 :

p(zn = k|z−n, µk, sk, φ) ∝

p(zn = k|z−n, φ)p(dn|µk, sk, z−n) ∝ l−n,kN − 1 + φ

s1/2k exp(−sk(dn − µk)2/2) (5.21)

• new component:

p(zn = knew|z−n, λ, r, β, ω, φ) ∝φ

N − 1 + φ

∫p(dn|µk, sk) p(µk, sk|λ, r, β, ω)dµkdsk. (5.22)

The integral part∫p(dn|µk, sk) p(µk, sk|λ, r, β, ω)dµkdsk is not analytically tractable,

Rasmussen used the method subscribed in [10], which suggests that this integral can

be sampled from a normal-gamma distribution. After we obtained the conditional

posteriors for the indicators, we re-assign all the observations to the components, the

components with no data will be removed after the re-assignment.

5.5 Post processing for the IGMM Method

According to the experimental results, the IGMM clustering method also has the prob-

lem of ”overfitting”, see Fig. 5.3. In order to obtain the final number of sources, some

44

−180 −120 −60 0 60 120 1800

1

2

3

4·10−2

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoAIGMM fit

Figure 5.3: The fitting result of the IGMM for 4 sources. The observed DoAs areplotted by white bars, the red stems show the accurate position of each source and thefitting result is represented by the blue lines.

further steps similar to the split and merge method from [17] are applied after the

Gibbs sampling. First of all, we define a threshold αlm = 0.1 as the lower limit of

component weight αk. The component k′ will be removed if this condition holds

αk′ < αlm. (5.23)

This criterion aims to remove the components with small weight values. Next the com-

−180 −120 −60 0 60 120 1800

2 · 10−2

4 · 10−2

6 · 10−2

8 · 10−2

0.1

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoAIGMM fit

Figure 5.4: The fitting result of IGMM for 4 sources with the extra split and mergestep. The observed DoAs are plotted by white bars, the red stems show the trueposition of each source, the fitting result is represented by the blue lines.

ponents with big variance (higher than 10 times the minimal variance of all mixture

CHAPTER 5. POST PROCESSING FOR THE IGMM METHOD 45

components) will be removed. Finally the components with closely located means will

be merged into one component. The results of applying this post processing step after

the iterations are illustrated in Fig. 5.4. We can see that the problem of overfitting is

solved.

In summary, this chapter introduces the IGMM method[12]. The observed DoA mix-

ture are modeled by infinite number of Gaussians and the model parameters are esti-

mated using Gibbs sampling. The final number of components is finite by using the

CRP. To avoid the ”overfitting” problem, a post processing step[17] is applied to the

IGMM after the iterations.

46 CHAPTER 6. EXPERIMENTAL RESULTS

Chapter 6

Experimental Results

This chapter presents the experimental results for the three clustering methods for

counting acoustic sources, including the GMM clustering method with Dirichlet prior,

the Split and Merge method and the IGMM method. The source counting accuracy

of these three methods are tested with both artificially generated data and observed

DoAs under different experimental environments. The test results and conclusions for

artificial data will be presented in Section 6.1. In Section 6.2, the observed DoAs

under different acoustic scenarios are used to evaluate the performance of these three

algorithms.

6.1 Results with Artificial Data

This section presents the accuracy test results of three clustering methods with artificial

data. The test results with artificial data help us to prove the viability of the clustering

methods due to the fact that the artificial data is actually a ideal presentation of the

observed DoAs in reality, i.e. the artificial data is more Gaussian like. Also, the

simulation test with artificial is much more time saving than the test with observed

DoAs, which allows us to run the test for hundreds of times in order to obtain better

statistics. Moreover, some features of the artificial data can be easily modified, e.g.

the data variance, which might influences the performance of the clustering algorithms.

CHAPTER 6. RESULTS WITH ARTIFICIAL DATA 47

In summary, the simulation test results of artificial data helps to explain the fitting

results of the clustering methods with the observed DoAs.

6.1.1 Test Scenarios and Artificial Training Data

First we introduce the test scenarios and the generating process of the artificial data.

Artificial data of one artificial speech source are generated according to a Gaussian

distribution N (µs, σ2s), where µs is the mean, σ2

s is the variance of the data. The

population of the data points in one artificial source is set to be 1000.

We test the clustering methods with up to five artificial sources. Similar to the observed

DoAs, the artificial data also locates within the interval [−180◦, 180◦). Each artificial

source location is randomly generated from a uniform distribution of angles within this

interval in the test. Fig. 6.1 shows an example of a histogram of one artificial source

with µs = 7◦.

The accuracy of source counting is calculated by running each test 100 times and

−180 −120 −60 0 60 120 1800

1

2

3

4·10−2

DoA [deg.]

P(D

oA)

Histogram of art. DoAsTrue DoA

Figure 6.1: Histogram of artificial data for one artificial speech source. The columnsrepresent the normalized probability of the artificial data and the red stem shows thetrue source location at µs = 7◦.

48

averaging the amount of correct estimates. We also define a criterion for an accurate

estimation: if the estimated source number K equals to the true source number Ns and

the estimated mean µ locates near the setting mean µs with ‖ µ − µs ‖< 15◦, we say

that the applied method has accurately estimated the source number from the data

mixture. Fig. 6.2 shows an example of an accurate source counting using the Split and

Merge method: there is only one Gaussian component and the mean µ = 6.56◦, which

satisfied the condition that ‖ µ− µs ‖< 15◦.

[tbh!]

−180 −120 −60 0 60 120 1800

1

2

3

4

5·10−2

DoA [deg.]

P(D

oA)

Histogram of art. DoAsTrue DoAS&M fit

Figure 6.2: Histogram of artificial data of one source and the data fitting result of theSplit and Merge method. The columns represent the PDF of the artificial data andthe red stem shows the true source location µs = 7◦. The fitting result is plotted bythe blue Gaussian component whose mean is µk = 6.56◦.

In the following sections, the source counting results for the three methods will be

presented. The variables of test scenario consist of different source numbers, SNRs,

data variances σ2s and angular differences between neighboring sources.


6.1.2 Convergence Condition for EM algorithm in the GMM

Clustering Method with DP

To start the evaluation of the GMM clustering with Dirichlet prior method, the initial-

ization and the convergence condition of the EM algorithm need to be specified. In this

test the EM algorithm is initialized with K = 8 Gaussian components, where the means

of the components µ are initialized by applying the k-means[5] clustering method to

the artificial data. The parameter αk is uniformly initialized as αk = 1/K = 0.125

and the overall variance of the data is set as the initial variance σ2k for each Gaussian

component.

Araki et al. [3] did not define a convergence condition of EM algorithm. The experi-

mental results for the case of one source Ns = 1 are also not given in [3]. During the

implementation we found out that the different convergence conditions do effects the

experiment results. We tested three convergence conditions at the end of the M-step

to see if it influences the experimental results. The conditions are

ct1 : µt+1 − µt < 10(−20)

ct2 : µt+1 − µt < 10(−30)

ct3 : µt+1 = µt,

(6.1)

where µt are the estimated means of Gaussian components from the previous EM

iteration and µt+1 are the means of Gaussian components calculated in the current

iteration. By implementing the evaluation for 100 trials, we calculate the accuracy of

source counting on artificial data for each variables of the scenario. Given the exper-

iment settings presented in Tab. 6.1, the source counting accuracy in relation to the

convergence conditions is presented in Fig. 6.3.

We can see that the convergence conditions do effect the source counting results, es-

pecially for cases when source number Ns = 1 and Ns = 2. This is because the EM

algorithm reaches the convergence too early (with only 500 iterations) with the conver-

gence condition c11 and ct2, which leads to the problem of ”overfitting”. A more strict

50

convergence condition ct3 increases the iteration steps and improves the accuracy of

source counting. We found out that with more than 2000 iterations, the results of the

GMM clustering method with Dirichlet prior are always good. So we perform in the

following tests for the GMM clustering method by using the convergence condition of

ct3 and the maximum number of iteration steps is 2500.

Source Number SNR in dB Standard Deviation Minimal Angular Distance

1,2,3,4,5 30 15◦ 70◦

Table 6.1: Experimental variables for the evaluation of the influence of the convergencecondition.

1 2 3 4 50

20

40

60

80

100

Source Number

Acc

ura

cyofNs

esti

mat

ion

in%

ct1ct2ct3

Figure 6.3: Source counting accuracy for artificial data in % using different convergenceconditions over different numbers of speech sources Ns = [1, 2, 3, 4, 5].

For the Split and Merge method we use the convergence condition ct1 with µt+1−µt <

10(−20) for the EM algorithm. This is because the fast convergence speed(in 300 steps)

is a main advantage of the Split and Merge method.


K Initial number of components 8

ν Parameter for wrapped phase model -1,0,1

φ Parameter of the Dirichlet prior 0.001

c Parameter for the power weight function Data length

σbig Parameter for removing components with big σ 20◦

αlm Parameter for removing components with small α 0.1

Table 6.2: Parameters of the GMM clustering method with Dirichlet prior

6.1.3 Parameters for Clustering Methods

This section gives the values of the parameter settings in each clustering method for the

evaluation of the source counting accuracy. Note that for artificial data we don’t have

the STFT values, so the parameter c for the power weight function, which is written

as


, (6.2)

equals to the artificial data length (the value of c is recalculated from the derivation

of EM algorithm in Chapter 3). As the post processing step of the GMM method and

the IGMM method, the components with big variance σbig or with very small weight

αlm are removed. The parameters of three methods are given in Tab. 6.2, Tab. 6.3

and Tab. 6.4.

6.1.4 The Effect of Varying Angular Difference

The angular difference between the neighboring artificial sources is also a factor of

interest. With the experiment settings in Tab. 6.5, the estimation accuracy of the

three clustering methods in relation to the minimal angular difference between the

neighboring sources are presented in Fig. 6.4.

The performance of the Split and Merge method and the IGMM method increases with

52

K Initial number of components 8

cvar Parameter for upper limit of standard derivation 2

clm Parameter for lower limit of standard derivation 2.5◦

pµ Probability mass for condition C3 in Chapter 4 0.9

cµ Parameter for merging components 23◦

pmass Probability mass for data pruning 0.95

pmerge Probability mass for the final merging step after convergence 0.95

αlm Parameter for removing components with very small weight 0.1

Table 6.3: Parameters of the Split and Merge method.

cvar Parameter for upper limit of standard deviation 10

cµ Parameter for merging components 30◦

αlm Parameter for removing components with very small weight 0.1

Table 6.4: Parameters of the IGMM method.

increasing minimal angular distance between the neighboring sources. For example,

we can see that the angular difference between two neighboring sources affects the

estimation accuracy for the case which Ns = 5 the most. The source counting accuracy

for Ns = 5 improves sharply with the increasing angular difference between sources

after a threshold (for the Split and Merge method and the IGMM method 50◦and

for the GMM clustering method 30◦. Since we estimate up to five speech sources in

the experiment, the minimum angular difference between the neighboring sources is

supposed to be more than 65◦.



1,2,3,4,5 30 15 20◦, 30◦, 40◦, 50◦, 60◦, 70◦

Table 6.5: Experimental variables for the evaluation of the influence of the minimalangular distance between neighboring sources.

20 25 30 35 40 45 50 55 60 65 700

20

40

60

80

100GMM with Dirichlet Prior

Ns=1Ns=2Ns=3Ns=4Ns=5

20 25 30 35 40 45 50 55 60 65 700

20

40

60

80

100

Acc

ura

cyof

sourc

eco

unti

ng

in%

Split and Merge


20 25 30 35 40 45 50 55 60 65 700

20

40

60

80

100

Minimal Angular difference (in degree)

IGMM


Figure 6.4: Accuracy of source number estimation in relation to the angular differencebetween the neighboring artificial sources.

54

6.1.5 Source Counting Accuracy with Varying SNR Values

Speech signals in reality always contain noise, therefore, we also test the clustering

methods with data of varying values of the signal-to-noise ratio (SNR). The SNR is a

measure that compares the level of a desired signal to the level of background noise

and is defined as

SNR = 10 log10

(Psignal

Pnoise

), (6.3)

wherePsignal

Pnoise

is the ratio of the signal power and the background noise power.

With the experiment settings in Tab. 6.6, the estimation accuracy in relation to the

SNR values of the artificial data are presented in Fig. 6.5. From the results we can

see that with increasing values of the SNR, the performance of the clustering methods

improves. For the cases where the number of artificial sources equals Ns = 4 and

Ns = 5, none of the methods are able to estimate the accurate number of sources for

SNR = 10dB.


1,2,3,4,5 10,20,30 15◦ 70◦

Table 6.6: Experimental variables for the evaluation of the influence of the SNR values.

6.1.6 Source Counting Accuracy with Varying Data Variance

The variance σ2s is a measure of spread of the data around the mean of the data set.

We tested the clustering methods with different data variances varying from 10 to 25,

Tab. 6.7 gives the experimental settings for the investigation of the influence of varying

variances. From the corresponding source counting results in Fig. 6.6, we can see that

the performance of the clustering methods decrease as the data variance increases.

For the GMM clustering method with Dirichlet prior, the source counting accuracy

for the cases Ns = 1 and Ns = 2 also decreases sharply with the increasing standard


10 20 300

20

40

60

80



10 20 300

20

40

60

80

100

Acc

ura

cyof

sourc

eco

unti

ng

in%

Split and Merge


10 20 300

20

40

60

80

100

Input SNR in dB

IGMM


Figure 6.5: Source counting accuracy of clustering methods in relation to SNR valuesof artificial data.

deviation from σ = 15◦, while other two methods are less affected. This is because the

”overfitting” problem of the GMM clustering method happens more frequently. For the

case Ns = 5 with the standard deviation values from 15◦to 20◦, the Split and Merge

method has much better performance than the other two methods. After σ = 20◦,

none of the methods are able to estimate the accurate source number when Ns = 5.

Overall, the GMM method with Dirichlet prior is more sensitive to the data variance

and other two methods have better performance in this case.

56


1,2,3,4,5 30 10,15,20,25 70

Table 6.7: Experiment variables for the evaluation of the influence of the varying datavariance.

10 12 14 16 18 20 22 240

20

40

60

80



10 12 14 16 18 20 22 240

20

40

60

80

100

Acc

ura

cyof

sourc

eco

unti

ng

in%

Split and Merge


10 12 14 16 18 20 22 240

20

40

60

80

100

The data variance value of each artificial source

IGMM


Figure 6.6: Source counting accuracy of clustering methods in relation to the varianceof artificial data.

CHAPTER 6. EXPERIMENTAL RESULTS WITH OBSERVED DOAS 57

6.1.7 Comparison and Conclusions

In conclusion, all three clustering methods show better performance of source counting

for data of smaller variance, large SNR values and larger minimal angular distance

between artificial sources. The number of iterations of the EM algorithm and the

convergence condition affects the performance of the GMM clustering method with

Dirichlet prior. This is because the problem of ”overfitting” happens for small number

of iterations in EM algorithm. Moreover, the GMM clustering method with Dirichlet

prior is more sensitive to varying data variance than the other two methods.

The performance of the methods become better with the increasing minimal angular

distance between neighboring sources. To estimate 5 sources, the minimal angular dis-

tance between neighboring sources should be more than 60 degree.

6.2 Experimental Results with Observed DoAs

This section presents the experimental results of the three clustering methods for source

counting based on observed DoAs. The performances of the three methods will be

tested and evaluated under different acoustic scenarios. In the simulation environment,

up to 5 speech sources are located in a room of dimension 8.8m×3.75m ×2.4m[4]. The

experimental setup is illustrated in Fig. 6.7. Both speech sources and sensors are

arranged on circles around the center point. We definite the center of the circle to be

located at [2.82m, 2m, 1.2m]. The radius of the sources to the center is denoted as rs

and the radius of the microphone array to the center is rm.

In the experiment, the source locations are generated randomly from a uniform distri-

bution of the angles within [-180,180). For the speech sources we choose five English

speech signals sampled at 16kHz and of duration of 5 seconds. The frame size for the

STFT is 1024, and the frame shift is 256.

The room impulse response h is generated using the room impulse response generator

[7]. The experiment is performed in a reverberant condition. Fig. 6.8 plots observed

58

Figure 6.7: Simulated experiment setup for locations of sensors and sources. The roomsize is 8.8m × 3.75m × 2m Speech sources and sensors locate in circles with radius rsand rm. All sources and sensors are in the same height of 1.2m.

DoAs and artificial DoAs. We can see that compared with the artificial DoAs, the

variance of the observed DoAs is larger.

In the experiment, the problem of ”overfitting” always occurs when we apply the GMM

clustering method to the observed DoAs, which results in the lower accuracy of source

counting. We attempted to increase the number of iteration steps to 20000, but the

severe ”overfitting” problem still exists even with the Dirichlet prior. In order to im-

prove the performance of the GMM clustering method with Dirichlet prior, we applied

the same post processing step of the IGMM method as described in Section 5.5 to the

GMM clustering method. The post processing steps include a component merging step

with the same cµ in Tab. 6.4. Other parameters are the same as the parameter settings


−180−120 −60 0 60 120 1800

0.5

1

1.5

2·10−2

DoA [deg.]

P(D

oA)

Histogram of obs. DoAsTrue DoA

−180−120 −60 0 60 120 1800

2

4

6

8·10−3

DoA [deg.]

P(D

oA)

Histogram of art. dataTrue DoA

Figure 6.8: Histogram of artificial data (left) and artificial data (right).

in Section 6.1.3.

The experiment variables include varying source numbers, sensor numbers, reverbera-

tion times, sensor distances and the source distances to the center. Since the number of

data points in the observed DoAs is very large, each single test is very time consuming,

especially for the EM algorithm, we decrease the trial times from 100 to 20 for each

test to calculate the source counting accuracy.

6.2.1 Test with Varying Source Numbers and Sensor Numbers

1 Source number 1,2,3,4,5

2 Sensor number 3,4,5,6

3 Reverberation time 0.12s

4 Sensor distance to circle center 0.017m

5 Source distance to circle center 0.8m

6 Minimal angular distance between neighboring sources 70◦

7 SNR 50dB

Table 6.8: Experiment variables for the evaluation of the influence of the differentsensor numbers

60

From the test results in we can not find a relation between the source counting accuracy

and the number of sensors. However, the accuracy of the three methods tends to

decrease with the increasing source number. For the cases where Ns = 1 and Ns = 2,

the IGMM method has better performance (with accuracy over 90%) in comparison

with other two methods. All three methods are not able to estimate the correct source

number when Ns = 5. This might result from the large data variance of the observed

DoAs.

GMM with DP Split & Merge IGMM

Nm\Ns 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

3 85 65 40 55 0 80 55 70 45 0 100 90 50 30 0

4 90 55 70 65 5 65 45 45 50 10 100 100 35 35 0

5 90 50 55 55 0 80 55 50 50 5 95 95 35 20 0

6 90 65 45 0 5 85 55 65 40 5 100 90 70 5 0

Table 6.9: Accuracy of source counting on observed DoAs mixture with varying sensornumbers Nm from 3 to 6 and source numbers Ns from 1 to 5.

6.2.2 Test with Varying Reverberation Time

In this test the mixture of DoAs calculated under varying reverberation time is used to

evaluate the three clustering methods. From Fig. 6.9 we can see that the performance

of the three clustering methods tends to degrade with increasing reverberation time.

It means that the three clustering methods work only for the DoAs that are observed

with short reverberation time in small rooms.



2 Sensor number 5

3 Reverberation time in s 0.12,0.15,0.20,0.25,0.3


5 Source Distance to Circle Center 0.8m

6 Minimal Angular Distance between Neighboring Sources 70◦

7 SNR 50dB

Table 6.10: Experiment variables for the evaluation of the influence of the varyingreverberation time.

6.2.3 Test with Varying Sensor Distance to Circle Center

The varying sensor distances also influences the performances of the clustering methods.

Based on the values of the experimental variables described in Tab.6.11, the source

counting accuracy of all three clustering methods tends to decrease with the increasing

sensor distance to the center from 0.05m.

1 Source Number 1,2,3,4,5

2 Sensor Number 5

3 Reverberation Time in s 0.12

4 Sensor Distance to Circle Center in m [0.017 0.05 0.1 0.15 0.2]

5 Source Distance to Circle Center in m 0.5


7 SNR 50dB

Table 6.11: Experiment variation settings for the evaluation of the influence of thevarying sensor distance to the circle center

62

0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.30

20

40

60

80



0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.30

20

40

60

80

100

Acc

ura

cyof

sourc

eco

unti

ng

in%

Split and Merge


0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.30

20

40

60

80

100

The Reverberation Time(in Seconds)

IGMM


Figure 6.9: Source counting accuracy in relation to the reverberation time

6.2.4 Test with Varying Source Distance to Circle Center

To investigate the influence of varying source distance to the center on the performance

of source counting. We evaluate the three methods with different source distance from

0.5m to 1.5m. The experiment variables are given in the Tab 6.12. Here we denote

the source distance to the center as rs. The experiment results is presented in Tab

6.13. We don’t find a strong relation between the performance of the methods and the

source distance. However, we can see that the performance of the GMM method with

Dirichlet prior is not stable for Ns = 4 with varying source distance.


0 5 · 10−2 0.1 0.15 0.2 0.250

20

40

60

80



0 5 · 10−2 0.1 0.15 0.2 0.250

20

40

60

80

100

Acc

ura

cyof

sourc

eco

unti

ng

in%

Split and Merge


0 5 · 10−2 0.1 0.15 0.2 0.250

20

40

60

80

100

Sensor distance to the circle center in m

IGMM


Figure 6.10: Source counting accuracy in relation to the sensor distance to the circlecenter in m

6.2.5 Comparison and Conclusions

The accuracy of source counting decreases with the the increasing source number. The

varying sensor number has no obvious influence on the performance of three methods.

We also don’t find a reasonable explanation for the results with varying sensor distance

to the center. For a better performance of the clustering methods on observed DoAs,

the sensors should locate near with the center.

64


2 Sensor number 5

3 Reverberation time 0.12s


5 Source distance to circle center 0.5,0.8,1.0,1.2,1.5


7 SNR 50dB

Table 6.12: Experiment variation settings for the evaluation of the influence of thevarying source distance to the circle center.

GMM with DP Split & Merge IGMM

rs\Ns 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

0.5m 75 65 40 70 0 65 60 70 50 10 100 90 45 15 0

0.8m 95 80 45 0 0 70 60 60 80 10 100 90 60 20 0

1m 75 70 55 85 5 50 50 80 50 15 90 85 55 20 0

1.2 75 65 50 0 0 50 55 50 35 15 80 85 50 25 5

1.5m 60 60 45 100 0 50 50 70 55 5 90 75 35 20 15

Table 6.13: Accuracy of source counting on observed DoAs mixture with varying sensordistance

CHAPTER 7. SUMMARY AND OUTLOOK 65

Chapter 7

Summary and Outlook

In summary, in this thesis three clustering methods for estimating the number of acous-

tic sources are implemented and evaluated. The three BSS methods include the GMM

clustering with Dirichlet prior[3], the Split and Merge method[15] and the IGMM clus-

tering method [12].

Because of the large number of iteration steps of the EM algorithm, the GMM cluster-

ing method with Dirichlet prior is the most time consuming method among the three

methods. Compared with other two methods, the performance of the GMM method

is more affected by the varying data variance.

The Split and Merge method is the most easiest method to understand. It reduces

the iteration steps of traditional EM algorithm. Compared with the GMM method

with Dirichlet prior, it presents a better performance for handling the ”overfitting”

problem. However, this method contains 6 empirical parameters, which results in more

simulations to find the suitable parameter values.

The IGMM method is competitive compared with other two methods. It has the min-

imal number of empirical parameters in three methods. For the cases where source

number equals to 1 and 2, the IGMM method has better performance than the other

two methods.

Improvements of the IGMM method can be done in the future work. For example,

Walter et al. built a wrapped phase IGMM for the weighted observed DoAs in [17],

66

which could improve the source counting accuracy of the IGMM method on observed

DoAs. Processing methods for the observed DoAs can be investigated to improve the

accuracy by reducing the data variance.

LIST OF FIGURES 67

List of Figures

2.1 Missing time information of the Fourier transform illustrated by two

different signals and their magnitude Fourier transforms. (a) Two sub-

sequent sinusoids of frequency 1 Hz and 5 Hz. (b) Superposition of the

same sinusoids.[9] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.2 Signal and Fourier transform consisting of two subsequent sinusoids of

frequency 1 Hz and 5 Hz [9]. . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Definition of the DoA. A speech source i locates in a spherical coordinate.

The DoA of i is represented with a unit norm qi‖qi‖ marked by the red

arrow [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.4 Far-field model of a speech source i and two sensors j and J . The TDoA

between sensors is (dj − dJ)T qi‖qi‖ marked by the red arrow [2]. . . . . . 6

2.5 Plot of the univariate Gaussian showing the mean µ and the standard

deviation σ [5]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.6 Histogram of observed DoAs of three speech sources. The white columns

represent the observed DoAs, the red stems indicate the accurate direc-

tion of speech sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.7 Plot of mixtures of three Gaussian components, the linear superposition

of these Gaussian components builds a GMM. . . . . . . . . . . . . . . 10

68

2.8 Histogram of observed DoAs. On the left, little phase wrapping hap-

pens which results into data that can be well modeled by a Gaussian

distribution. On the right severe phase wrapping occurs which results

in a poor fit of the data by a single Gaussian distribution [14]. . . . . . 10

2.9 Illustration of the wrapped phase Gaussian distribution model. Repli-

cating a Gaussian distribution with of 2π (dashed lines) and summing

the result in the interval between [−π, π) (solid blue line) results into

accurate modeling of a wrapped Gaussian distribution [14]. . . . . . . . 11

3.1 Example of GMM fitting result without a Dirichlet prior. The white

columns represent the observed DoAs, the red stems are the correct

position of each source and the black lines plot the GMM fitting results. 18

3.2 The Dirichlet distribution over three variables α1, α2, α3 is confined to

a probability simlex (the triangle area) [5]. . . . . . . . . . . . . . . . . 19

3.3 Shapes of the Dirichlet distribution over the simplex with different values

of φ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.4 GMM fitting example using the Dirichlet prior φ = 0.01. . . . . . . . . 20

4.1 Figure of a Gaussian distribution, the blue area represents the tolerance

region [µ − zpσ, µ − zpσ], the cumulative probability pmass is calculated

by the integration of the distribution PDF within the tolerance region[1]. 29

4.2 Figure of three Gaussian components with their means µ1, µ2 and µ3,

TR2 represents the tolerance region of the Gaussian component in the

middle. According to the criterion C3, this middle component will be

removed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3 Figure of two Gaussian components, where TR1 and TR2 represent the

tolerance regions of the Gaussian components. The data in the blue

areas ate included in the tolerance regions, while the data in orange

areas ate outside the tolerance regions. . . . . . . . . . . . . . . . . . . 32

LIST OF FIGURES 69

5.1 Plot of ARS, the envelope function is built using the tangent lines of a

log concave function, the tangent lines are computed at a set of points

a. [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.2 CRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.3 The fitting result of the IGMM for 4 sources. The observed DoAs are

plotted by white bars, the red stems show the accurate position of each

source and the fitting result is represented by the blue lines. . . . . . . 44

5.4 The fitting result of IGMM for 4 sources with the extra split and merge

step. The observed DoAs are plotted by white bars, the red stems show

the true position of each source, the fitting result is represented by the

blue lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6.1 Histogram of artificial data for one artificial speech source. The columns

represent the normalized probability of the artificial data and the red

stem shows the true source location at µs = 7◦. . . . . . . . . . . . . . 47

6.2 Histogram of artificial data of one source and the data fitting result of

the Split and Merge method. The columns represent the PDF of the

artificial data and the red stem shows the true source location µs = 7◦.

The fitting result is plotted by the blue Gaussian component whose mean

is µk = 6.56◦. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.3 Source counting accuracy for artificial data in % using different conver-

gence conditions over different numbers of speech sourcesNs = [1, 2, 3, 4, 5]. 50

6.4 Accuracy of source number estimation in relation to the angular differ-

ence between the neighboring artificial sources. . . . . . . . . . . . . . 53

6.5 Source counting accuracy of clustering methods in relation to SNR values

of artificial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.6 Source counting accuracy of clustering methods in relation to the vari-

ance of artificial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.7 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

70

6.8 Artificial data and observed DoAs . . . . . . . . . . . . . . . . . . . . . 59

6.9 Source counting accuracy in relation to the reverberation time . . . . . 62

6.10 Source counting accuracy in relation to the sensor distance to the circle

center in m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

LIST OF TABLES 71

List of Tables

3.1 Summery of the EM Algorithm[3]. . . . . . . . . . . . . . . . . . . . . . 22

4.1 Pseudo code of the Split and Merge method[15]. . . . . . . . . . . . . . 28

5.1 Summary of steps in Gibbs sampling [5]. . . . . . . . . . . . . . . . . . 38

5.2 Summary of steps in non-adaptive rejection sampling.[6] . . . . . . . . 39

5.3 Summary of steps in ARS.[13] . . . . . . . . . . . . . . . . . . . . . . . 41

6.1 Experimental variables for the evaluation of the influence of the conver-

gence condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.2 Parameters of the GMM clustering method with Dirichlet prior . . . . 51

6.3 Parameters for Split and Merge . . . . . . . . . . . . . . . . . . . . . . 52

6.4 Parameters for IGMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.5 Experimental variables for the evaluation of the influence of the minimal

angular distance between neighboring sources. . . . . . . . . . . . . . . 53

6.6 Experimental variables for the evaluation of the influence of the SNR

values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.7 Experiment variables for the evaluation of the influence of the varying

data variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.8 Experiment variables for the evaluation of the influence of the different

sensor numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

6.9 Accuracy of source counting on observed DoAs mixture with varying

sensor numbers Nm from 3 to 6 and source numbers Ns from 1 to 5. . 60

72

6.10 Experiment variables for the evaluation of the influence of the varying

reverberation time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.11 Experiment variation settings for the evaluation of the influence of the

varying sensor distance to the circle center . . . . . . . . . . . . . . . . 61

6.12 Experiment variation settings for the evaluation of the influence of the

varying source distance to the circle center. . . . . . . . . . . . . . . . . 64

6.13 Accuracy of source counting on observed DoAs mixture with varying

sensor distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

BIBLIOGRAPHY 73

Bibliography

[1] Z-table. http://www.statisticshowto.com/tables/z-table/. Accessed: 20.10.2017.

[2] S. Araki, Sawada H., R. Mukai, and S. Makino. DoA estimation for multiple sparse

sources with normalized observation vector clustering. In 2006 IEEE International

Conference on Acoustics Speech and Signal Processing Proceedings, volume 5, May

2006.

[3] S. Araki, T. Nakatani, H. Sawada, and S. Makino. Blind sparse source separation

for unknown number of sources using Gaussian mixture model fitting with Dirich-

let prior. In 2009 IEEE International Conference on Acoustics, Speech and Signal

Processing, pages 33–36, April 2009.

[4] S. Araki, H. Sawada, R. Mukai, and S. Makino. Underdetermined blind sparse

source separation for arbitrarily arranged multiple sensors. Signal Processing,

87(8):1833 – 1847, 2007. Independent Component Analysis and Blind Source

Separation.

[5] C. M. Bishop. Pattern recognition and machine learning. Springer, New York,

2006.

[6] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Journal

of the Royal Statistical Society. Series C (Applied Statistics), 41(2):337–348, 1992.

[7] E. A. P. Habets. RIR generator from audiolabs-erlangen. https://www.audiolabs-

erlangen.de/fau/professor/habets/software/rir-generator. Accessed: 20.05.2017.

74

[8] M. I. Mandel, D. P. Ellis, and J. Tony. An EM algorithm for localizing multiple

sound sources in reverberant environments. In B. Scholkopf, J. Platt, and T. Hoff-

man, editors, Advances in Neural Information Processing Systems, pages 953–960,

Cambridge, MA, 2007. MIT Press.

[9] M. Muller. Fundamentals of Music Processing - Audio, Analysis, Algorithms,

Applications. Springer, Berlin, Heidelberg, 2015.

[10] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models.

Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

[11] M. Pal, R Roy, J Basu, and M. S. Bepari. Blind source separation: A review

and analysis. In 2013 International Conference Oriental COCOSDA held jointly

with 2013 Conference on Asian Spoken Language Research and Evaluation (O-

COCOSDA/CASLRE), pages 1–5, Nov 2013.

[12] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural

Information Processing Systems, pages 554–560. MIT Press, 2000.

[13] D. Sheldon. Discrete adaptive rejection sampling. School of Computer Science,

University of Mssachusetts, 2013.

[14] P. Smaragdis and P. Boufounos. Learning source trajectories using wrapped-phase

hidden Markov models. In IEEE Workshop on Applications of Signal Processing

to Audio and Acoustics (WASPAA), pages 114–117, oct 2005.

[15] M. Taseska, A. H. Khan, and E. A. P. Habets. Speech enhancement with a low-

complexity online source number estimator using distributed arrays. In 2014 22nd

European Signal Processing Conference (EUSIPCO), pages 929–933, Sept 2014.

[16] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning. Springer,

2010.

BIBLIOGRAPHY 75

[17] O. Walter, L. Drude, and R. Haeb-Umbach. Source counting in speech mixtures

by nonparametric Bayesian estimation of an infinite Gaussian mixture model. In

2015 IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP), pages 459–463, April 2015.

Documents

Master Thesis - lms.tf.fau.de · DoA Direction of Arrival DTFT Discrete Time Fourier Transform DP Dirichlet Prior EM Expectation Maximization MAP Maximum A Posteriori ML Maximum Likelihood