Neural Information Processing Group and
Bernstein Center for Computational Neuroscience,
Eberhard Karls Universität Tübingen
Max Planck Institute for Intelligent Systems, Tübingen
Felix Wichmann
Machine Learning:
Methods, Algorithms, Potentials and
Societal Challenges
http://www.appblogger.de/wp-content/uploads/2013/03/pb-130314-pope-2005.photoblog900.jpg
http://msnbcmedia.msn.com/j/MSNBC/Components/Photo/_new/pb-130314-pope-2013.photoblog900.jpg
❶
One way to think about vision: inverse optics
Laws of physics “generate” 2D images on
our retinae from 3D scenes
(forward optics / rendering)
Starting point to think about visual
perception: we want to infer the 3D scene
from the 2D retinal images:
inverse optics!
But: Inverse optics is mathematically
impossible.
[Diagram: the amount of light entering the eye is the product of light-source intensity (e.g. sunlight) and object reflectance]
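The product relation on the slide also makes the ill-posedness easy to state concretely. The following minimal sketch (my own illustration in Python with made-up numbers, not part of the slides) shows that a single measured luminance cannot be factored uniquely into illumination and reflectance:

# The retinal measurement is a product: luminance = illumination x reflectance.
# Many different (illumination, reflectance) pairs explain the same measurement,
# which is why "inverse optics" has no unique solution.
luminance = 50.0
candidate_explanations = [(100.0, 0.5), (500.0, 0.1), (1000.0, 0.05)]
for illumination, reflectance in candidate_explanations:
    assert abs(illumination * reflectance - luminance) < 1e-9  # all equally consistent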
[Rendered images of the same scene with increasing numbers of light bounces: N = 0, 1, 2, 5, 9, and N = 24 (considered fully rendered)]
modified from Matthias Bethge
[Diagram: illumination ("light field") and objects & surfaces (geometry, materials) are combined by (non-linear) "information entanglement" into the resulting image; visual inference means "untangling" this information again]
modified from Matthias Bethge
❷
Machine learning (ML) and statistics
Statistics is the science of learning from data. … [ML] is the science of
learning from data. These fields are identical in intent although they differ in
their history, conventions, emphasis and culture. (Wasserman, 2014)
ML is a comparatively new sub-branch of computational statistics jointly
developed in computer science and statistics.
ML is inference performed by computers based on past observations and
learning algorithms: ML algorithms are mainly concerned with discovering
hidden structure in data in order to predict novel data—exploratory
methods, to get things done!
“Classical” statistics is typically concerned with making precise probabilistic
statements about known data coming from known distributions, i.e. the interest is
in accurate models of the data!
What is the difference between statistics and machine learning?
Machine Learning is AI people doing data analysis.
Data Mining is database people doing data analysis.
Applied Statistics is statisticians doing data analysis.
Infographics is Graphic Designers doing data analysis.
Data Journalism is Journalists doing data analysis.
Econometrics is Economists doing data analysis
(and here you can win a Nobel Prize).
Psychometrics is Psychologists doing data analysis.
Chemometrics and Cheminformatics are Chemists doing data analysis.
Bioinformatics is Biologists doing data analysis.
Aleks Jakulin, https://www.quora.com/What-is-the-difference-between-statistics-and-machine-learning
What is the difference between statistics and machine learning? (cont’d)
… if you look at the goals both fields are trying to achieve, you see that
there is actually quite a big difference:
Statistics is interested in learning something about data, for example, which
have been measured as part of some biological experiment. … . But the
overall goal is to arrive at new scientific insight based on the data.
In Machine Learning, the goal is to solve some complex computational task by
“letting the machine learn”. Instead of trying to understand the problem well
enough to be able to write a program which is able to perform the task (for
example, handwritten character recognition), you instead collect a huge
amount of examples of what the program should do, and then run an
algorithm which is able to perform the task by learning from the examples.
Often, the learning algorithms are statistical in nature. But as long as the
prediction works well, any kind of statistical insight into the data is not
necessary.
Mikio Braun, https://www.quora.com/What-is-the-difference-between-statistics-and-machine-learning
What is the difference between statistics and machine learning? (cont’d)
The primary differences are perhaps the types of the problems attacked, and
the goal of learning.
At the risk of oversimplification, one could say that in statistics a prime
focus is often in understanding the data and relationships in terms of models
giving approximate summaries such as linear relations or independencies. In
contrast, the goals in machine learning are primarily to make predictions as
accurately as possible and to understand the behaviour of learning algorithms.
These differing objectives have led to different developments in the two
fields: for example, neural network algorithms have been used extensively as
black-box function approximators in machine learning, but to many
statisticians they are less than satisfactory, because of the difficulties in
interpreting such models.
Franck Dernoncourt, https://www.quora.com/What-is-the-difference-between-statistics-and-machine-learning
Terminology: types of learning
Supervised learning is the ML task of inferring a function from labeled training data. In supervised
learning, each example is a pair consisting of an input object (typically a vector) and a desired
output value (also called the supervisory signal). A supervised learning algorithm analyzes the
training data and produces an inferred function, which can be used for prediction.
Reinforcement learning is an area of ML inspired by behaviorist psychology, concerned with how
software agents ought to take actions in an environment so as to maximize some notion of
cumulative reward. Unlike in supervised ML, correct input/output pairs are never presented, nor are
sub-optimal actions explicitly corrected; the learner only receives a global reward for an action.
Unsupervised learning is the ML task of inferring a function to describe hidden structure from
unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward
signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised
learning and reinforcement learning. A good example is identifying close-knit groups of friends in
social network data (clustering algorithms such as k-means; see the sketch below).
Semi-supervised learning is a class of algorithms making use of unlabeled data for training—
typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised
learning falls between unsupervised learning (without any labeled training data) and supervised
learning (with completely labeled training data).
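To make the unsupervised case concrete, here is a minimal clustering sketch in Python. It uses scikit-learn's k-means on made-up 2D data; the library choice and the toy data are my own assumptions, not part of the slides:

# k-means finds "close-knit groups" in unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled groups of points, centred at (0, 0) and (5, 5).
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(5.0, 1.0, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)  # approximately the two group centres
print(kmeans.labels_[:5])       # cluster index assigned to the first few points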
Terminology: types of problems in supervised ML
Classification: Problems where we seek a yes-or-no prediction, such as “Is
this tumour cancerous?”, “Does this cookie meet our quality standards?”, and
so on.
Regression: Problems where the value being predicted falls somewhere on a
continuous spectrum. These systems help us with questions of “How much?”
or “How many?”
The support vector machine (SVM) is a supervised classification algorithm.
Neural networks, including the now so popular convolutional deep neural
networks (DNNs), are typically also supervised algorithms, usually used for
multi-class classification.
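A minimal Python sketch of the two problem types, using scikit-learn on made-up data (both the library and the toy data are my assumptions, not part of the slides):

# Classification ("yes or no?") versus regression ("how much?").
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)

# Classification: the label is 1 if the feature exceeds 5, else 0.
y_class = (X.ravel() > 5).astype(int)
classifier = SVC(kernel="linear").fit(X, y_class)
print(classifier.predict([[2.0], [8.0]]))  # -> [0 1]

# Regression: predict a continuous value, here y = 3x + 1.
y_reg = 3.0 * X.ravel() + 1.0
regressor = LinearRegression().fit(X, y_reg)
print(regressor.predict([[4.0]]))          # -> [13.]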
Success of supervised classification in ML
ML—and in particular kernel methods as well as, very recently, so-called deep
neural networks (DNNs)—has proven successful whenever there is an
abundance of empirical data but a lack of explicit knowledge of how the data
were generated:
• Predict credit card fraud from patterns of money withdrawals.
• Predict toxicity of novel substances (biomedical research).
• Predict engine failure in airplanes.
• Predict what people will google next.
• Predict what people want to buy next at Amazon.
The Function Learning Problem
[Figure: a scatter of training points (each marked "x") and a new location at which the unknown function value "y" must be predicted]
Learning Problem in General
Training examples (x1,y1),…,(xm,ym)
Task: given a new x, find the new y
strong emphasis on prediction, that is, generalization!
Idea: (x,y) should look “similar” to the training examples
Required: similarity measure for (x,y)
Much of the creativity and difficulty in kernel-based ML: find suitable similarity
measures for all the practical problems discussed before, e.g. credit card
fraud, toxicity of novel molecules, gene sequences, … .
When are two molecules, with different atoms, structure, configuration etc.
the same? When are two strings of letters or sentences similar? What would
be the mean, or the variance of strings? Of molecules?
Very recent deep neural network success:
The network learns the right similarity measure from the data!
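As one concrete example of a similarity measure for ordinary vectors, here is the Gaussian (RBF) kernel in Python; the slides do not single out this particular kernel, so treat it purely as my own illustration:

# A kernel is a similarity measure: k(x, x') is large for similar inputs
# and decays towards zero as the inputs move apart.
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.5):
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0   (identical inputs)
print(rbf_kernel([1.0, 2.0], [1.5, 2.5]))  # ~0.78 (nearby inputs)
print(rbf_kernel([1.0, 2.0], [5.0, 6.0]))  # ~1e-7 (distant inputs)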
The Support Vector Machine
Computer algorithm that learns by example to assign labels to objects
Successful in handwritten digit recognition, credit card fraud detection,
classification of gene expression profiles etc.
Essence of the SVM algorithm requires understanding of:
i. the separating hyperplane
ii. the maximum-margin hyperplane
iii. the soft margin
iv. the kernel function
For SVMs and machine learning in general:
i. regularisation
ii. cross-validation
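A minimal sketch of points i.–iii. in Python, assuming scikit-learn and toy data of my own: a (soft-margin) linear SVM finds the maximum-margin hyperplane, and the training points that determine it are the support vectors:

# Maximum-margin linear SVM on toy 2D data; C controls the soft margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)                   # the points that define the margin
print(svm.coef_, svm.intercept_)              # normal vector and offset of the hyperplane
print(svm.predict([[2.0, 2.0], [6.0, 6.0]]))  # -> [0 1]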
Two Genes and Two Forms of Leukemia
(microarrays deliver thousands of genes, but hard to draw ...)
[Figure: expression levels of two genes (MARCKSL1 vs. HOXA9) for two forms of leukemia; ©2006 Nature Publishing Group, http://www.nature.com/naturebiotechnology]
Separating Hyperplane
[Figure: the same two-gene expression data (MARCKSL1 vs. HOXA9) with a straight line separating the two classes; ©2006 Nature Publishing Group]
Separating Hyperplane in 1D — a Point
[Figure: one-dimensional expression data (MARCKSL1) separated by a single point; ©2006 Nature Publishing Group]
... and in 3D: a plane
[Figure: three-dimensional expression data separated by a two-dimensional plane; ©2006 Nature Publishing Group]
Many Potential Separating Hyperplanes ...
(all “optimal” w.r.t. some loss function)
[Figure: several different straight lines, each of which separates the two classes; ©2006 Nature Publishing Group]
The Maximum-Margin Hyperplane
[Figure: the separating line that maximises the distance (margin) to the nearest data points of both classes; ©2006 Nature Publishing Group]
What to Do With Outliers?
[Figure: the same data with an outlier that no sensible separating line can classify correctly; ©2006 Nature Publishing Group]
The Soft-Margin Hyperplane
[Figure: a separating line that tolerates a few misclassified training points in exchange for a wider margin; ©2006 Nature Publishing Group]
The Kernel Function in 1D
[Figure: one-dimensional expression data in which the two classes cannot be separated by a single point; ©2006 Nature Publishing Group]
Mapping the 1D data to 2D (here: squaring)
[Figure: the same 1D expression data plotted against their squared values (expression × expression); in this 2D space the two classes become linearly separable; ©2006 Nature Publishing Group]
Not linearly separable in input space ...
[Embedded page from a paper on kernel methods and categorisation; the text is garbled in this extraction. Recoverable: Figure 3 caption: "The crosses and the circles cannot be separated by a linear perceptron in the plane."]
Map from 2D to 3D ...
[Embedded paper text, garbled in this extraction. Recoverable: the linearisation (feature) map Φ: R² → R³,
Φ(x) = (φ₁(x), φ₂(x), φ₃(x))ᵀ = (x₁², √2·x₁x₂, x₂²)ᵀ,
which maps the 2D examples onto a two-dimensional manifold in 3D space, where the two classes become separable by a plane.]
... linear separability in 3D
(actually: the data are still 2D, they “live” on a manifold of the original dimensionality!)
[Embedded paper text, garbled in this extraction. Recoverable: Figure 4 caption: "The crosses and circles from Figure 3 can be mapped to a three-dimensional space in which they can be separated by a linear perceptron."]
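The feature map above is exactly the map induced by the degree-2 polynomial kernel: the inner product of two mapped points in 3D equals the squared inner product of the original 2D points. A minimal numerical check in Python (my own sketch, not from the embedded paper):

# Verify phi(w) . phi(v) == (w . v)**2 for phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

w = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

lhs = phi(w) @ phi(v)   # inner product computed in the 3D feature space
rhs = (w @ v) ** 2      # the same number computed directly in the 2D input space
print(lhs, rhs)         # both 1.0
assert np.isclose(lhs, rhs)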
Projecting the 4D Hyperplane Back into 2D Input Space
[Figure: the linear decision boundary found in the higher-dimensional feature space corresponds to a curved decision boundary in the original 2D expression space; ©2006 Nature Publishing Group]
SVM magic?
For any consistent dataset there is a kernel that allows perfect
separation of the data
Why bother with soft-margins?
The so-called curse of dimensionality: as the number of variables
considered increases, the number of possible solutions increases
exponentially … overfitting looms large!
Overfitting
[Figure: a highly wiggly decision boundary that separates the training data perfectly but is unlikely to generalise to new data; ©2006 Nature Publishing Group]
Regularisation & Cross-validation
Find a compromise between complexity and classification performance,
i.e. kernel function and soft-margin
Penalise complex functions via a regularisation term or regulariser
Cross-validate the results (leave-one-out or 10-fold typically used)
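A minimal sketch of this compromise in Python, assuming scikit-learn (the slides do not prescribe any software): the soft-margin parameter C and the RBF kernel width gamma are chosen by 10-fold cross-validation:

# 10-fold cross-validated grid search over the soft margin C and kernel width gamma.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print(search.best_params_)  # the (C, gamma) pair with the best cross-validated accuracy
print(search.best_score_)   # its mean accuracy across the 10 folds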
SVM Summary
Kernel essential—best kernel typically found by trial-and-error and
experience with similar problems etc.
Inverting is not always easy; approximations etc. are needed (i.e. science is hard,
engineering is easy as they don’t care as long as it works!)
Theoretically sound and a convex optimisation (no local minima)
Choose between:
• complicated decision functions and training (neural networks)
• clear theoretical foundation (best possible generalisation) and convex
optimisation, but the need to trade off complexity versus soft margin and the skilful
selection of the “right” kernel
(= the “correct” non-linear similarity measure for the data!)
Regularisation, Cross-Validation and Kernels
Much of the success of modern machine learning methods can be attributed to three ideas:
1. Regularisation. Given are N “datapoints” (xi, yi) with x = x1, ..., xN and y = y1, ..., yN,
and a model f. Then the “error” between data and model is E(y, f(x)).
In machine learning we not only take the “error” between model and data into account but
in addition a measure of the complexity of the model f:
E(y, f(x)) + λR(f)
2. Cross-Validation. Regularisation is related to the prior in Bayesian statistics. Unlike
in Bayesian statistics, the trade-off between small error and low complexity of the
model is controlled by a parameter λ—this is optimized using cross-validation.
3. Non-linear mapping with linear separation.
True for kernels as well as DNNs.
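Ridge regression is perhaps the simplest concrete instance of the recipe E(y, f(x)) + λR(f): squared error plus λ times the squared norm of the weights, minimised in closed form. A minimal Python sketch (my own illustration with made-up data):

# Regularised least squares: minimise ||y - X w||^2 + lambda * ||w||^2,
# with closed-form solution w = (X^T X + lambda I)^(-1) X^T y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

lam = 1.0  # the trade-off parameter lambda; in practice chosen by cross-validation
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_ridge)  # close to true_w, shrunk slightly towards zero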
❸
What changed vision research in 2012?
ImageNet challenge: 1000 categories, 1.2 million training images.
AlexNet by Krizhevsky, Sutskever & Hinton (2012) appears on the stage, and
basically reduces the prediction error by nearly 50%:
[Figure: "Vision: Deep CNN" feeding into "Language: Generating RNN", with automatically generated image captions such as "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.", "A woman is throwing a frisbee in a park.", "A little girl sitting on a bed with a teddy bear.", "A group of people sitting on a boat in the water.", "A giraffe standing in a forest with trees in the background.", "A dog is standing on a hardwood floor.", "A stop sign is on a road with a mountain in the background."]
Problem of finding a sharp image from a blurry photo:
Blind Image Deconvolution
[Figure: blurry photograph → ? → sharp photograph]
modified from Michael Hirsch
Sequence of Blurry Photos (Image Burst)
Result of Proposed Image Burst Deblurring Method
from Michael Hirsch
EnhanceNet: Photo-realistic Super-resolution
from Michael Hirsch
Autonomous cars
Fundamentals of Neural Networks
Interest in shallow, 2-layer artificial neural networks (ANNs)—so-called
perceptrons—began in the late 1950s and early 60s (Frank Rosenblatt),
based on Warren McCulloch and Walter Pitts’s as well as Donald Hebb’s ideas
of computation by neurons from the 1940s.
https://kimschmidtsbrain.files.wordpress.com/2015/10/perceptron.jpg
http://cambridgemedicine.org/sites/default/files/styles/large/public/field/image/DonaldOldingHebb.jpg?itok=py9Uh4D5
Second wave of ANN research and interest in psychology—often termed
connectionism—came after the publication of the parallel distributed processing
(PDP) books by David Rumelhart and James McClelland (1986), using the
backpropagation algorithm as a learning rule for multi-layer networks.
A three-layer network with (potentially infinitely many) hidden units in the
intermediate layer is a universal function approximator (Kurt Hornik, 1991).
Non-convex optimization problems during backpropagation training, and a lack
of data and computing power, limited the usefulness of ANNs:
a universal function approximator in theory, but in practice three-layer ANNs
could often not successfully solve complex problems.
Fundamentals of Neural Networks (cont’d)
Breakthrough again with so-called deep neural networks or DNNs, widely
known since the 2012 NIPS paper by Alex Krizhevsky, Ilya Sutskever &
Geoffrey Hinton.
https://www.wired.com/wp-content/uploads/blogs/wiredenterprise/wp-content/uploads/2013/03/hinton1.jpg
DNN: loose terminology for networks with at least two hidden or
intermediate layers, typically at least five to ten (or up to dozens):
1. massive increase in labelled training data (“the internet”),
2. computing power (GPUs),
3. a simple non-linearity (ReLU) instead of the sigmoid,
4. convolutional rather than fully connected layers, and
5. weight sharing across deep layers
appear to be the critical ingredients for the current success of DNNs, and
make them the current method of choice in ML, particularly in applications
(a minimal sketch of ingredients 3 and 4 follows below).
At least superficially DNNs appear to be similar to the human object
recognition system: convolutions (“filters”, “receptive fields”) followed by
non-linearities and pooling is thought to be the canonical computation of
cortex, at least within sensory areas.
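A minimal pure-NumPy sketch of ingredients 3 and 4 (my own illustration, not from the slides): the ReLU non-linearity and a naive "valid" convolution of an image with a small filter:

# The two building blocks of a convolutional layer: ReLU and 2D convolution.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[-1.0, 1.0]])            # responds to rightward increases
feature_map = relu(conv2d_valid(image, edge_filter))
print(feature_map.shape)                         # (5, 4)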
Fundamentals of Neural Networks
[Figure: a model neuron computes z = b + Σi xi·wi and passes z through an activation function: linear, threshold, sigmoid, or rectified linear; Kriegeskorte (2015), Annu. Rev. Vis. Sci. 1:417–446]
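A minimal Python sketch of the model neuron in the figure (my own illustration): a weighted sum plus bias, passed through one of the activation functions shown:

# A single model neuron: z = b + sum_i x_i * w_i, followed by an activation function.
import numpy as np

def neuron(x, w, b, activation):
    z = b + np.dot(x, w)
    return activation(z)

linear    = lambda z: z
threshold = lambda z: 1.0 if z > 0 else 0.0
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))
rectified = lambda z: max(z, 0.0)

x, w, b = np.array([0.5, -1.0]), np.array([2.0, 1.0]), 0.5
for f in (linear, threshold, sigmoid, rectified):
    print(neuron(x, w, b, f))  # same z = 0.5, different outputs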
Fundamentals of Neural Networks
[Figure: two-layer networks; with a non-linearity f between the layers the output is y2 = f(f(x·W1)·W2), a universal function approximator; without it the two layers collapse into a single linear map, y2 = x·W1·W2 = x·W'; Kriegeskorte (2015), Annu. Rev. Vis. Sci. 1:417–446]
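A minimal numerical check of the figure's point (my own sketch): without a non-linearity, two weight matrices collapse into a single linear map; with a non-linearity between the layers they generally do not:

# Two linear layers collapse to one: (x W1) W2 == x (W1 W2) = x W'.
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))

y_linear   = (x @ W1) @ W2
y_combined = x @ (W1 @ W2)                   # one equivalent linear layer W'
print(np.allclose(y_linear, y_combined))     # True

relu = lambda z: np.maximum(z, 0.0)
y_nonlinear = relu(relu(x @ W1) @ W2)        # not reproducible by any single linear layer
print(np.allclose(y_nonlinear, y_combined))  # False (in general)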
Example: VGG-16
VGG16 by Simonyan & Zisserman (2014); 92.7% top-5 test accuracy on ImageNet
https://www.cs.toronto.edu/~frossard/post/vgg16/#architecture
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
Deep Neural Networks (DNNs)
[Figure: a minimal network with 2 input units, 2 sigmoid hidden units and 1 sigmoid output unit; the Δ terms illustrate how error derivatives are passed backwards through the layers via the chain rule (backpropagation)]
❹
Adversarial attacks?
Szegedy et al. (2014)
Adversarial examples? (cont’d)
[Figure: with adversarially patterned eyeglass frames, a face-recognition DNN misidentifies Reese Witherspoon as Russell Crowe]
Sharif et al. (2016)
DARPA Challenge 2015
Boston Dynamics 2017
Human versus artificial intelligence
We learn unsupervised or semi-supervised, sometimes by reinforcement, and very
rarely supervised (school, university) – all currently successful AI is
supervised, i.e. it only works when the correct answer is already known!
We can do lots of things using the same network (or a set of closely coupled
networks) — DNNs are typically good at only one or a few tasks.
❺
Societal Challenges
Working conditions and the labour market:
The use of technology makes work “simpler” – typically, the need for an
apprenticeship or formal training disappears.
The consequence is falling wages … after all, “anyone” can do the work.
Unemployment?
Autonomous vehicles – possibly soon after such vehicles are permitted on public roads there may be an obligation to drive only with them.
540,000 professional drivers in Germany (as of 2013), 250,000 taxi licences (as of 2017), 25,000 train drivers (as of 2017): 815,000 jobs at risk (unemployment rate rising from 5.8% to 8.1%).
Robots in postal services? Waste management? Logistics? Deutsche Post DHL has 211,000 employees in Germany (as of 2016); in 2014 roughly 155,000 people worked in supply and waste disposal and officially almost 760,000 as cleaning staff; Amazon alone employs 23,000 people in its German logistics centres: 1,150,000 jobs!
Humanoid robots in care? In 2014 more than 900,000 people worked in elderly and nursing care in Germany … .
Societal Challenges
Politics and society:
Do we all still live in the same reality? Personalised information in social media
and the loss of sources that inform broadly and controversially – widespread
consumption of propaganda.
Propaganda
Propaganda is the attempt to deliberately influence people’s thinking,
acting and feeling. Whoever engages in propaganda always pursues
a particular interest. … It is characteristic of propaganda
that it does not present the different sides of a topic and that it
mixes opinion and information. Whoever engages in propaganda does
not want to discuss and convince with arguments, but to influence
people’s emotions and behaviour with every trick available, for example
by frightening them, making them angry or holding out promises to them.
Propaganda takes the thinking off people’s hands and instead gives them
the feeling that the adopted opinion is the right one.
Source: Bundeszentrale für politische Bildung, www.bpb.de
Societal Challenges
Privacy? Changes in (interpersonal) communication?
Weapons of Mass Destruction (WMDs)
https://www.wired.com/images_blogs/dangerroom/2011/03/powell_un_anthrax.jpg
Societal Challenges
Naïve belief in the objectivity of algorithms
… and in rankings, the measurement and quantification of life:
China, for example, plans to introduce the Social Credit System.
https://de.wikipedia.org/wiki/Nick_Bostrom
Societal Challenges
Doomsday scenarios
Is the singularity coming? If so: Garden of Eden or hell?
Doomsday videos to watch
Google's Geoffrey Hinton - "There's no reason to think computers won't get much
smarter than us” (10 mins): https://www.youtube.com/watch?v=p6lM3bh-npg
Demis Hassabis, CEO, DeepMind Technologies - The Theory of Everything
(16 mins): https://www.youtube.com/watch?v=rbsqaJwpu6A
Nick Bostrom, What happens when our computers get smarter than we are?
(17 mins): https://www.ted.com/talks/nick_bostrom_what_happens_when_our_computers_get_smarter_than_we_are
Why Elon Musk is worried about artificial intelligence (3 mins): https://www.youtube.com/watch?v=US95slMMQis
Neural Information Processing Group and
Bernstein Center for Computational Neuroscience,
Eberhard Karls Universität Tübingen
Max Planck Institute for Intelligent Systems, Tübingen
Felix Wichmann
Thanks