Audio-visual automatic speech recognition (AV-ASR)

Rainer Stiefelhagen

Lecture "Visuelle Perzeption für Mensch-Maschine Schnittstellen" (Visual Perception for Human-Machine Interfaces), WS 2009/2010

Interactive Systems Laboratories, Universität Karlsruhe (TH)

February 8, 2010

Overview

• Introduction
 - Motivation, McGurk effect
• Visual feature extraction
 - Appearance-based features
 - Model-based features
• AV-Speech recognition
 - Basic building blocks of ASR systems
 - Visemes vs. phonemes
 - AV-Fusion approaches
• Recent work at ISL
 - AV-ASR from multiple views

(cv:hci, Computer Vision for Human-Computer Interaction Research Group, Universität Karlsruhe (TH))

McGurk Experiment


McGurk Effect

• People fuse visual and acoustic information
 - Visual information complements acoustic information
 - The effect works in almost all languages
 - Weaker in some (Japanese, Chinese)
 - Appears to work particularly well for visible phones
• Bateson experiments
 - In conversation, random eye-gaze is reduced under noise
 - Visual information becomes more important in noise

What is automatic audio-visual speech recognition (AV-ASR)?

• Conventional ASR systems use only audio (speech) data as input.
• Audio-visual ASR systems use audio and visual (video) data.
• Images of the lip area are mainly used as visual data.
• Audio-visual speech recognition is also called bi-modal speech recognition.

What is the motivation?

• Humans use both audio and visual information to communicate smoothly with each other.
• People can compensate for insufficient speech information with visual information.
• Visual cues are often complementary to audio cues:
 - "ma" vs. "na" (easier from vision)
 - "pa" vs. "ba" (easier from audio)
• Can we improve the performance of ASR systems by using both audio and visual information?

Are visual cues useful for human perception?
(Potamianos et al., Proc. Eurospeech, Sep. 2001)

[Figure]

Humans improve their performance by using visual information!

Basic processing blocks

• Audio data → audio feature extraction → audio vector
• Visual data → face detection → lip detection → visual feature extraction → visual vector
• Audio vector + visual vector → audio-visual ASR

Mouth Localization Approaches

• Early work: manual/semi-automatic approaches
 - Use a fixed window / no head movement
 - Use lipstick with easy-to-extract colors
• Automatic approaches
 - Simple templates (very problematic)
 - Integral images (see lecture 6 on head pose)
 - Haar-filter cascades (lecture 3)
 - Deformable models: snakes, active contours, active shape models, active appearance models

Visual feature extraction

• Appearance (image) based features
 - Pixel values of a region of interest (ROI), such as a lip image, are used directly.
 - Easier, more robust extraction
 - High dimensionality (→ PCA, LDA, FFT, DCT, differences between adjacent frame images)
• Model-based features
 - Assumes that most information is in the shape of the lips
 - Model parameters are used for recognition
 - Lower dimensionality
 - More difficult to obtain
 - Example: Active Shape Model (ASM)
• Hybrid approaches
 - Active Appearance Model

Appearance-based features

• Pixel values of a region of interest (ROI), such as a lip image, are used directly: ROI → feature vector
• Advantage: easier, more robust extraction
• Disadvantages:
 - Illumination variations → histogram normalization, etc.
 - High dimensionality of the feature vector → PCA, LDA

Use Normalized Greyscale Image of Mouth

Gray-value modification (example: histogram):

$$f'(p) = T(f(p))$$

 - $f(p)$: original gray value
 - $T$: modification function
 - $f'(p)$: new gray value
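The slide does not say which modification function T is used. As a minimal sketch, assuming plain histogram equalization as one possible choice of T, the mapping from f(p) to f'(p) could look like the following. The function name `equalize` and the toy image are illustrative and not from the lecture.

```python
def equalize(gray, levels=256):
    """Histogram equalization as one possible modification function T (assumption)."""
    pixels = [p for row in gray for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative histogram of the gray values.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = min(c for c in cdf if c > 0)
    span = len(pixels) - cdf_min
    if span == 0:  # constant image: nothing to stretch
        return [list(row) for row in gray]
    # Lookup table T: original gray value f(p) -> new gray value f'(p).
    T = [max(0, round((c - cdf_min) / span * (levels - 1))) for c in cdf]
    return [[T[p] for p in row] for row in gray]


# Toy low-contrast "mouth" image with gray values between 100 and 120.
mouth = [[100, 100, 110], [110, 120, 120]]
normalized = equalize(mouth)
print(normalized)
```

After the mapping, the darkest pixel becomes 0 and the brightest becomes 255, so images taken under different illumination end up on a comparable gray-value scale.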

FFT

• Transform the image of the mouth region using the FFT
 - Transformation to the frequency domain
 - Invariant to translation (in the magnitude spectrum)
 - Frequency-based features are known to be helpful for ASR
• Lower-frequency components contain the most relevant information for visual speech recognition
 - Too many high-frequency components in the feature vector are not useful (they contain information about wrinkles etc.)

FFT-based feature

Normalization of the illumination condition → FFT → (smoothing) → feature vector
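The FFT-based feature can be illustrated with a small sketch. It assumes that the feature vector holds the magnitudes of the k x k lowest frequencies; the slides do not specify the exact selection or the smoothing step. A naive DFT stands in for a real FFT routine so that the example needs only the standard library. The names `dft2_magnitude` and `low_freq_features` are illustrative.

```python
import cmath


def dft2_magnitude(img):
    """Naive 2-D DFT magnitude; acceptable for tiny ROIs, O(M^2 N^2)."""
    M, N = len(img), len(img[0])
    mag = [[0.0] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            s = 0j
            for m in range(M):
                for n in range(N):
                    angle = -2j * cmath.pi * (u * m / M + v * n / N)
                    s += img[m][n] * cmath.exp(angle)
            mag[u][v] = abs(s)
    return mag


def low_freq_features(img, k=2):
    """Keep the magnitudes of the k x k lowest frequencies (assumed selection)."""
    mag = dft2_magnitude(img)
    return [mag[u][v] for u in range(k) for v in range(k)]


roi = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
# Circularly shift the ROI by one pixel to the right.
shifted = [row[-1:] + row[:-1] for row in roi]

features = low_freq_features(roi)
features_shifted = low_freq_features(shifted)
print(features)
print(features_shifted)
```

Both feature vectors agree up to floating-point error, which is the translation invariance mentioned on the FFT slide. Strictly, it holds for circular shifts of the magnitude spectrum; a real shift of the mouth inside a fixed window only approximates it.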

Discrete Cosine Transform (DCT)

• Transform the mouth image by DCT
• Easy and fast implementation
• Compact representation
• The number of DCT coefficients is too high
 - only coefficients with high energy are used as elements of the feature vector
 - the extracted coefficients are usually in the low-frequency range

$$D_{u,v} = C_u \, C_v \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{m,n} \cos\left[\frac{(2m+1)\,u\,\pi}{2M}\right] \cos\left[\frac{(2n+1)\,v\,\pi}{2N}\right]$$

where $I_{m,n}$ is the pixel value at position $(m, n)$ of the $M \times N$ mouth image and $C_u$, $C_v$ are normalization factors.
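The DCT formula and the "keep only high-energy coefficients" rule can be sketched as follows. The orthonormal choice of the factors C_u and C_v is an assumption, since the slide does not define them, and the helper names `dct2` and `top_energy_coeffs` are illustrative.

```python
import math


def dct2(img):
    """2-D DCT-II of an M x N image, written directly from the formula."""
    M, N = len(img), len(img[0])

    def c(k, size):
        # Orthonormal scaling (assumed; not stated on the slide).
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    D = [[0.0] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            s = 0.0
            for m in range(M):
                for n in range(N):
                    s += (img[m][n]
                          * math.cos((2 * m + 1) * u * math.pi / (2 * M))
                          * math.cos((2 * n + 1) * v * math.pi / (2 * N)))
            D[u][v] = c(u, M) * c(v, N) * s
    return D


def top_energy_coeffs(D, count):
    """Return (u, v, value) for the `count` coefficients with the highest energy."""
    ranked = sorted(
        ((D[u][v] ** 2, u, v) for u in range(len(D)) for v in range(len(D[0]))),
        reverse=True,
    )
    return [(u, v, D[u][v]) for _, u, v in ranked[:count]]


# Smooth toy image: brightness increases slowly from left to right.
image = [[10, 12, 14, 16] for _ in range(4)]
coeffs = dct2(image)
for u, v, value in top_energy_coeffs(coeffs, 3):
    print(u, v, round(value, 3))
```

For a smooth image like this one, the selected coefficients sit at low frequencies (small u and v), which matches the observation on the slide.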

Model-based approaches

• Deformable templates
 - Use a-priori knowledge about the shape and appearance of the object
 - Hand-tuned parametric model and energy function
 - Fitting by minimizing the energy function
 - Model parameters can be used for audio-visual speech recognition

Model-based approaches (2)

• Active Shape Models
 - Statistical model
 - Trained on sample data
 - Fitting mainly based on shape
• Shape and intensity parameters can be used for visual speech recognition

Hybrid Approaches

• Active Appearance Model (AAM)
 - Statistical model
 - An AAM learns the correlation between shape and appearance
 - Parameters are optimized so as to minimize the difference between a synthesized image and the target image
 - Fitting is based on the whole appearance of the face
• Model parameters are used for visual speech recognition
 - The parameters model shape and texture

Summary of visual feature extraction

• In experiments on small databases, shape-based methods outperform appearance-based ones.
 - Relies on good lip tracking
• In experiments on large databases, appearance-based methods seem to be superior.
 - More robust than shape-based features

Joint audio-visual speech recognition

Basic Processing Blocks

• Audio data → audio feature extraction
• Visual data → face detection → lip tracking → visual feature extraction
• Audio and visual features → audio-visual ASR

The fundamentals of ASR

1. Make HMMs of all phonemes from feature vectors (training)
 - Speech /a/ → feature extraction → training → HMM /a/
 - Each state has an output probability for feature vectors
2. Recognize input speech with the trained HMMs (test)
 - Input speech → feature extraction → recognition with the trained HMMs → recognition result (text)

Speech Recognition (System Components)

Recognizer components:

• Analog speech → front end → observation sequence O1 O2 ... OT
• Observation sequence → decoder → best word sequence W1 W2 ... WT
• The decoder uses the acoustic model, the dictionary, and the language model

Continuous Speech Recognition

• Goal:
 - Given observed features O = o1, o2, ..., ok
 - Find the word sequence W = w1, w2, ..., wn
 - Such that P(W | O) is maximized
• Bayes rule:

$$P(W \mid O) = \frac{P(O \mid W) \cdot P(W)}{P(O)}$$

 - $P(O \mid W)$: acoustic model (HMMs)
 - $P(W)$: language model
 - $P(O)$ is a constant for a complete sentence
• In the case of audio-visual speech recognition:
 - maximize $P(W \mid O_a, O_v)$
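Because P(O) is the same for every candidate word sequence, the decision rule reduces to maximizing P(O | W) · P(W), usually in the log domain. A minimal sketch follows; the hypotheses and all scores are made-up illustration values, and `best_hypothesis` is a hypothetical helper name.

```python
import math


def best_hypothesis(acoustic_loglik, lm_prob):
    """Return the W maximizing log P(O|W) + log P(W); P(O) is ignored as a constant."""
    scores = {
        w: acoustic_loglik[w] + math.log(lm_prob[w])
        for w in acoustic_loglik
    }
    return max(scores, key=scores.get), scores


# Hypothetical acoustic log-likelihoods log P(O|W) and language model priors P(W).
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
language = {"recognize speech": 0.02, "wreck a nice beach": 0.0001}

best, scores = best_hypothesis(acoustic, language)
print(best)
```

In this toy case the acoustic model slightly prefers the second hypothesis, but the language model prior outweighs that difference, so the first hypothesis is selected.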

Phoneme and viseme

• A phoneme is the basic linguistic unit and is acoustically distinguishable.
 - The English language can be classified into about 35-70 phonemes. ASR usually uses about 40 to 50.
• A viseme is a visually distinguishable speech unit.
 - Several phonemes can correspond to the same viseme.
 - The number of visemes is much smaller than the number of phonemes, typically around 15.
 - There is no universal agreement about the exact mapping between phonemes and visemes.
 - It highly depends on speakers and speaking style.
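The many-to-one relation between phonemes and visemes can be written as a lookup table. The mapping below is a small illustrative fragment based on place of articulation, not the mapping used in the lecture or in any cited system; as noted above, no universally agreed mapping exists.

```python
# Illustrative fragment only: class names and grouping are assumptions.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "a": "open_vowel",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


print(to_visemes(["p", "a"]))  # same viseme sequence as "ba"
print(to_visemes(["b", "a"]))
print(to_visemes(["m", "a"]))  # differs from "na" in the first viseme
print(to_visemes(["n", "a"]))
```

Under this mapping "pa" and "ba" look identical and must be separated acoustically, while "ma" and "na" differ visually, which matches the examples on the motivation slide.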

An example of visemes in ASR
(Neti et al., Final Workshop 2000 at The Johns Hopkins Univ.)

[Table]

The phonemes on each line belong to the same viseme.

Audio-Visual Speech Modeling for ASR

• How should we model audio and visual features for ASR?
• What is the relation between audio and visual features?

Characteristics between audio and visual features

• Audio and visual phonetic events happen synchronously, but with a time lag

Example: speech "aida"

[Figure: time lag between the lip movement and the audio signal]

 - After the lips open, the voice is uttered.
 - After the utterance ends, the lips close.

Techniques integrating audio and visual information

• Feature fusion
 - combines audio and visual information at the feature vector level.
 - One classifier is used.
• Decision fusion
 - integrates audio and visual information at the classifier level.
 - Two classifiers, an audio classifier and a visual classifier, are used.

Feature fusion

Feature fusion uses a single classifier to model the concatenated vector of time-synchronous audio and visual features.

1. Simple concatenation
2. Hierarchical LDA feature fusion
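Simple concatenation can be sketched in a few lines. The sketch assumes that both streams have already been brought to the same frame rate, so that frame t of the audio stream and frame t of the visual stream belong together; how that alignment is done is not covered here. The name `fuse_features` and the feature values are illustrative.

```python
def fuse_features(audio_frames, visual_frames):
    """Concatenate time-synchronous audio and visual vectors frame by frame."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must be time-synchronous (same number of frames)")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


# Two frames with hypothetical 3-dimensional audio and 2-dimensional visual features.
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
visual = [[1.0, 2.0], [3.0, 4.0]]

fused = fuse_features(audio, visual)
print(fused)
```

Each fused vector has the combined dimensionality of both streams, which is one reason a dimensionality reduction such as LDA is applied afterwards in the hierarchical variant.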

Hierarchical LDA feature fusion
(Potamianos et al., ICASSP, 2001)

• Audio feature vector → concatenation of adjacent frame vectors → LDA
• Visual feature vector → LDA → concatenation of adjacent frame vectors → LDA
• Concatenation of audio & visual vectors → LDA → audio-visual feature vector

Overview of IBM's system
(Potamianos et al., 2004)

[Figure]

Decision fusion

• Classifier integration at a hidden state level
 - Synchronous multi-stream HMMs
 - Intermediate integration
• Classifier integration at a phone or word level
 - Asynchronous product HMM
 - Intermediate integration
• Classifier integration at an utterance level
 - Late integration

A scheme of classifier integration at a hidden state level

[Figure: recognition (audio-visual)]

Synchronous multi-stream HMMs

[Figure: audio HMM with states 1, 2, 3 and visual HMM with states 1, 2, 3]

Output probability of an audio-visual feature at a state j:

$$P_j(O(t)) = P_{a,j}(O_a(t))^{\lambda_a} \times P_{v,j}(O_v(t))^{\lambda_v}$$

 - $P_{a,j}(O_a(t))$: output probability of an audio feature
 - $P_{v,j}(O_v(t))$: output probability of a visual feature
 - $\lambda_a$, $\lambda_v$: stream weights which represent the reliabilities of audio and visual information.
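In practice the weighted product is evaluated in the log domain, where the stream weights become linear factors. A minimal sketch, assuming the weights sum to one (a common convention that the slide does not state) and using made-up probabilities; `stream_log_prob` is a hypothetical helper name.

```python
import math


def stream_log_prob(p_audio, p_visual, lam_a, lam_v):
    """Log of P_a^lam_a * P_v^lam_v for one state and one time frame."""
    return lam_a * math.log(p_audio) + lam_v * math.log(p_visual)


# Hypothetical per-state output probabilities for one frame: (audio, visual).
states = {
    "state_ma": (0.30, 0.80),
    "state_na": (0.35, 0.10),
}

for lam_a in (0.9, 0.5):
    lam_v = 1.0 - lam_a
    scores = {s: stream_log_prob(pa, pv, lam_a, lam_v) for s, (pa, pv) in states.items()}
    print(lam_a, max(scores, key=scores.get))
```

With these values the preferred state is already "state_ma" at an audio weight of 0.9, and its margin grows as more weight is moved to the visual stream; this is the effect the stream weights are meant to control.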

A scheme of classifier integration at a phone or word level (Intermediate integration)

[Figure]

Asynchronous Product HMM

[Figure]

Output probability at state ij:

$$b_{ij}(O) = b_i^{(a)}(O^{(a)})^{\lambda_a} \times b_j^{(v)}(O^{(v)})^{\lambda_v}$$
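A product HMM has one composite state ij for each pair of an audio state i and a visual state j. The sketch below enumerates these states and evaluates the weighted output score in the log domain. The `max_async` limit on how far the two streams may drift apart is an assumption added for illustration and is not taken from the slides; all names and values are hypothetical.

```python
from itertools import product


def product_states(n_audio, n_visual, max_async=1):
    """Composite states (i, j); max_async bounds the allowed state offset (assumption)."""
    return [
        (i, j)
        for i, j in product(range(n_audio), range(n_visual))
        if abs(i - j) <= max_async
    ]


def composite_log_output(log_b_audio, log_b_visual, state, lam_a, lam_v):
    """log b_ij(O) = lam_a * log b_i^(a)(O^(a)) + lam_v * log b_j^(v)(O^(v))."""
    i, j = state
    return lam_a * log_b_audio[i] + lam_v * log_b_visual[j]


states = product_states(3, 3, max_async=1)
print(states)

# Hypothetical per-state log output probabilities for one frame.
log_b_audio = [-1.0, -2.0, -4.0]
log_b_visual = [-3.0, -0.5, -2.5]
best = max(states, key=lambda s: composite_log_output(log_b_audio, log_b_visual, s, 0.6, 0.4))
print(best)
```

Three states per stream give nine composite states without a limit and seven with `max_async=1`. The growth of this state space is in line with the memory cost noted as a disadvantage of intermediate integration.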

Re-training Product HMM (asynchronous event)

[Figure]

A typical scheme of classifier integration at utterance level (Late integration)

[Figure]

Late integration (LI)

Integration at an utterance level:

$$L_{result} = (L_{audio})^{\lambda_a} \times (L_{visual})^{\lambda_v}$$

Output probability of an audio utterance:

$$L_{audio} = \prod_{t=t_{start}}^{t_{end}} P_{a,j(t)}(O_a(t))$$

Output probability of a visual utterance:

$$L_{visual} = \prod_{t=t_{start}}^{t_{end}} P_{v,j(t)}(O_v(t))$$
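Late integration can be sketched as rescoring a list of utterance hypotheses. In the log domain the per-frame products become sums, and the weighted product of the two utterance likelihoods becomes a weighted sum. The hypotheses, frame probabilities, and helper names below are illustrative assumptions.

```python
import math


def utterance_log_likelihood(frame_probs):
    """log L = sum over t of log P(O(t)) along the aligned state sequence."""
    return sum(math.log(p) for p in frame_probs)


def late_integration(audio_logL, visual_logL, lam_a, lam_v):
    """Combine per-hypothesis scores: log L_result = lam_a*log L_audio + lam_v*log L_visual."""
    combined = {
        h: lam_a * audio_logL[h] + lam_v * visual_logL[h]
        for h in audio_logL
    }
    return max(combined, key=combined.get), combined


# Hypothetical per-frame output probabilities for two competing hypotheses.
audio_logL = {
    "ma": utterance_log_likelihood([0.30, 0.40, 0.50]),
    "na": utterance_log_likelihood([0.35, 0.40, 0.50]),
}
visual_logL = {
    "ma": utterance_log_likelihood([0.80, 0.70, 0.60]),
    "na": utterance_log_likelihood([0.10, 0.70, 0.60]),
}

print(late_integration(audio_logL, visual_logL, 1.0, 0.0)[0])  # audio only
print(late_integration(audio_logL, visual_logL, 0.5, 0.5)[0])  # audio-visual
```

With audio alone the acoustically slightly better "na" wins; with equal stream weights the visual evidence tips the decision to "ma". Both recognizers have to run over the whole utterance before the scores can be combined.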

Summary

• Synchronous multi-stream HMMs
 - decide phoneme durations based on audio labels → cannot sufficiently represent visual features.
• Late integration
 - processes audio and visual data independently → ignores the synchronization between audio and visual features
 - runs two processes when recognizing speech → increases computation (a serious problem in large-vocabulary speech recognition)

Discussions

Advantages of intermediate integration

• Asynchronous (AS) vs. synchronous multi-stream HMMs
 - AS HMMs allow audio and visual events to occur asynchronously. → They can represent the relationship between audio and visual features.
• Intermediate integration (II) vs. late integration
 - A one-pass (Viterbi) algorithm is available. → II doesn't need to run two processes.

A disadvantage of intermediate integration

• It uses a lot of memory.

Results of a word recognition experiment (Audio SNR -5dB)

[Figure; one label reads: Synchronous multi-stream HMMs]

How can we decide which information is reliable?

 - Estimating which information is more reliable → improves the recognition performance
 - Big acoustic noise → visual information is more reliable
 - Big image noise → audio information is more reliable

Estimating Stream Weights

Output probability at a state ij:

$$b_{ij}(O) = b_i^{(a)}(O^{(a)})^{\lambda_a} \times b_j^{(v)}(O^{(v)})^{\lambda_v}$$

Purpose: automatically optimize the audio stream weight $\lambda_a$ and the visual stream weight $\lambda_v$.

Which measure is appropriate for estimating them?

What is the confidence measure to estimate stream exponents?

• Based on the minimum classification error criterion
 - adjusts the weights during the training phase
 - keeps the weights fixed during testing!
• Use stream entropy
 - a strong peak in the log-likelihood of the HMMs (entropy close to zero) indicates strong confidence
• N-best output score dispersion
 - If the competing candidates are close to the best one, that modality is considered unreliable.
• Based on the audio signal-to-noise ratio (SNR)
 - The worse the audio SNR gets, the higher the weight of the video stream (does not consider any video noise!)
• Train something (e.g. ANNs) to learn the best weights
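The stream-entropy idea can be sketched as follows: a peaked class posterior (low entropy) signals a confident stream, a flat posterior (high entropy) an unreliable one. The rule that turns the two entropies into weights is a simple heuristic chosen for illustration; it is an assumption, not the method of the cited work, and the posteriors are made up.

```python
import math


def entropy(posterior):
    """Shannon entropy (in nats) of a normalized class posterior."""
    return -sum(p * math.log(p) for p in posterior if p > 0)


def entropy_weights(audio_post, visual_post):
    """Give the lower-entropy stream the higher weight (illustrative heuristic)."""
    h_a, h_v = entropy(audio_post), entropy(visual_post)
    if h_a + h_v == 0:
        return 0.5, 0.5
    lam_a = h_v / (h_a + h_v)
    return lam_a, 1.0 - lam_a


# Noisy audio: flat posterior. Clean video: peaked posterior.
audio_post = [0.25, 0.25, 0.25, 0.25]
visual_post = [0.97, 0.01, 0.01, 0.01]

lam_a, lam_v = entropy_weights(audio_post, visual_post)
print(round(lam_a, 3), round(lam_v, 3))
```

Here the flat audio posterior receives a small weight and the peaked visual posterior a large one. Unlike an SNR-based rule, this measure also reacts to noise in the video stream.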

Comparison of the confidence measures
(Potamianos et al., ICSLP 2000)

System                            Word accuracy
Audio-only                        50.38%
Visual-only                       28.34%
Stream entropy                    54.44%
N-best output score dispersion    55.19%
Average of N-best output scores   55.05%
Minimum classification error      59.88%

Experimental conditions: context-independent GMM with 5 mixtures

Word Error Rate for Audio SNR
(Potamianos et al., MIT Press)

[Figure]

Summary

• Humans implicitly and unconsciously use both modalities, speech and visual appearance
• Using both modalities improves speech recognition
 - both for humans and for automatic computer systems
 - in particular under noisy audio conditions
• Video features
 - appearance-based: a transformed image of the lip region is used for recognition (normalized greyscale image, FFT, DCT, plus LDA, PCA)
 - model-based: a lip model is extracted, and recognition is based on (transformed) model parameters (active shape models, active contours, snakes)
 - hybrid approach: active appearance models

Summary (2)

• Phonemes and visemes
 - Visemes are classes of visually distinguishable sounds
• Classification typically with HMMs
• Fusion is possible on various levels
 - Early feature integration
 - Late integration (word or phoneme/viseme level)
 - Intermediate integration seems to work best at sub-phone/viseme level (HMM states): synchronous and asynchronous multi-stream HMMs

References

Gerasimos Potamianos, Chalapathy Neti, Juergen Luettin, Iain Matthews, Audio-Visual Automatic Speech Recognition: An Overview, in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds., MIT Press, 2004
