Deep Neural Networks and Partial Differential Equations:

Approximation Theory and Structural Properties

Philipp Christian Petersen


Joint work

Joint work with:

- Helmut Bölcskei (ETH Zürich)
- Philipp Grohs (University of Vienna)
- Joost Opschoor (ETH Zürich)
- Gitta Kutyniok (TU Berlin)
- Mones Raslan (TU Berlin)
- Christoph Schwab (ETH Zürich)
- Felix Voigtlaender (KU Eichstätt-Ingolstadt)


Today’s Goal

Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs.

Two threads:

Approximation theory:

- universal approximation
- optimal approximation rates for all classical function spaces
- reduced curse of dimension


Structural properties:

- non-convex, non-closed ansatz spaces
- parametrization not stable
- very hard to optimize over


Outline

Neural networks
  Introduction to neural networks
  Approaches to solve PDEs

Approximation theory of neural networks
  Classical results
  Optimality
  High-dimensional approximation

Structural results
  Convexity
  Closedness
  Stable parametrization


Neural networks

We consider neural networks as a special kind of function:

- d = N_0 ∈ N: input dimension,
- L: number of layers,
- ρ : R → R: activation function,
- T_ℓ : R^{N_{ℓ−1}} → R^{N_ℓ}, ℓ = 1, …, L: affine-linear maps.

Then Φ_ρ : R^d → R^{N_L}, given by

  Φ_ρ(x) = T_L(ρ(T_{L−1}(ρ(… ρ(T_1(x)))))),   x ∈ R^d,

is called a neural network (NN). The sequence (d, N_1, …, N_L) is called the architecture of Φ_ρ.
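A minimal Python sketch of this definition (an illustration for this writeup, not code from the talk), assuming the ReLU activation ρ(x) = max(x, 0):

```python
# Minimal sketch of the definition above: a NN is the alternating composition of
# affine-linear maps T_l(x) = A_l x + b_l and the activation rho, applied
# componentwise; rho is not applied after the last layer.
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def neural_network(affine_maps, x, rho=relu):
    """affine_maps: list of (A_l, b_l) with A_l of shape (N_l, N_{l-1}); x in R^d."""
    for A, b in affine_maps[:-1]:
        x = rho(A @ x + b)
    A, b = affine_maps[-1]
    return A @ x + b                 # output layer: affine only

# Example with architecture (2, 3, 1):
rng = np.random.default_rng(0)
maps = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
        (rng.standard_normal((1, 3)), rng.standard_normal(1))]
print(neural_network(maps, np.array([0.5, -1.0])))
```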


Why are neural networks interesting? - I

Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network.

Overwhelming success:

- Image classification
- Text understanding
- Game intelligence

Hardware design of the future!

(Image: Ren, He, Girshick, Sun; 2015)


Why are neural networks interesting? - II

Expressibility: Neural networks constitute a very powerful architecture.

Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)

Let d ∈ N, K ⊂ R^d compact, f : K → R continuous, and ρ : R → R continuous and not a polynomial. Then for every ε > 0 there exists a two-layer NN Φ_ρ with ‖f − Φ_ρ‖_∞ ≤ ε.

Efficient expressibility: R^M ∋ θ ↦ (T_1, …, T_L) ↦ Φ_ρ^θ yields a parametrized system of functions. In a sense, this parametrization is optimally efficient (more on this below).
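As an illustration only (not part of the talk), the following sketch fits a two-layer tanh network to a continuous function on the compact set [0, 1]. The theorem guarantees that an ε-approximation exists; gradient-based training is merely a heuristic for finding it.

```python
# Illustration of the universal-approximation statement (existence only): fit a
# two-layer tanh network to a continuous target f on [0, 1] and report a proxy
# for the sup-norm error on a fine grid.
import torch

f = lambda x: torch.abs(x - 0.5) + torch.sin(6.0 * x)    # continuous target on [0, 1]
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(0.0, 1.0, 512).unsqueeze(1)
for step in range(3000):
    loss = (net(x) - f(x)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.max(torch.abs(net(x) - f(x))).item())        # proxy for the sup-norm error
```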


How can we apply NNs to solve PDEs?

PDE problem: For D ⊂ R^d, d ∈ N, find u such that

  G(x, u(x), ∇u(x), ∇²u(x)) = 0   for all x ∈ D.

Approach of [Lagaris, Likas, Fotiadis; 1998]: Let (x_i)_{i∈I} ⊂ D and find a NN Φ_ρ^θ such that

  G(x_i, Φ_ρ^θ(x_i), ∇Φ_ρ^θ(x_i), ∇²Φ_ρ^θ(x_i)) = 0   for all i ∈ I.

Standard methods can be used to find parameters θ.
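A minimal sketch of this collocation idea (illustration only, with a hypothetical model problem; not the authors' code), assuming PyTorch and the 1-D Poisson equation −u'' = π² sin(πx) on (0, 1) with zero boundary values:

```python
# Collocation sketch in the spirit of Lagaris et al.: minimize the squared PDE
# residual G(x_i, ...) over random collocation points x_i; the boundary
# conditions are hard-wired through the factor x*(1-x).
import math
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    x = torch.rand(128, 1, requires_grad=True)           # collocation points x_i in D
    u = x * (1.0 - x) * net(x)                           # u(0) = u(1) = 0 by construction
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = -d2u - math.pi**2 * torch.sin(math.pi * x)
    loss = residual.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```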


Approaches to solve PDEs - Examples

General Framework: Deep Ritz Method [E, Yu; 2017]: NNs as trial functions, SGD naturally replaces quadrature (a minimal sketch follows below).

High-dimensional PDEs: [Sirignano, Spiliopoulos; 2017]: Let D ⊂ R^d, d ≥ 100, and find u such that

  ∂u/∂t(t, x) + H(u)(t, x) = 0,   (t, x) ∈ [0, T] × Ω,   + BC + IC.

As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!

[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as an empirical risk minimization gives provably no curse of dimension, neither in the approximation problem nor in the number of samples.
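The Deep Ritz sketch referenced above (illustrative assumptions: D = (0,1)², zero Dirichlet data enforced by a cutoff, right-hand side f ≡ 1; not the authors' code). Each SGD step evaluates the Ritz energy on a fresh Monte Carlo batch, so the stochastic gradient step doubles as a quadrature rule:

```python
# Minimal sketch of the Deep Ritz idea: minimize the Ritz energy
#   E(u) = integral over D of ( 0.5*|grad u|^2 - f*u ) dx
# with the integral replaced by a Monte Carlo average over random points.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    x = torch.rand(256, 2, requires_grad=True)            # uniform samples in D = (0,1)^2
    cutoff = x[:, :1] * (1 - x[:, :1]) * x[:, 1:] * (1 - x[:, 1:])
    u = cutoff * net(x)                                    # u = 0 on the boundary
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    energy = (0.5 * grad_u.pow(2).sum(dim=1) - u.squeeze(1)).mean()   # f = 1
    opt.zero_grad(); energy.backward(); opt.step()
```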


How can we apply NNs to solve PDEs?

Deep learning and PDEs: Both approaches above are based on two ideas:

- Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
- There exist black-box methods from machine learning that solve the optimization problem.

This talk:

- We will show exactly how efficient the representations are.
- We will raise doubts that the black box can produce reliable results in general.


Approximation theory of neural networks


Complexity of neural networks

Recall:

  Φ_ρ(x) = T_L(ρ(T_{L−1}(ρ(… ρ(T_1(x)))))),   x ∈ R^d.

Each affine-linear map T_ℓ is defined by a matrix A_ℓ ∈ R^{N_ℓ × N_{ℓ−1}} and a translation b_ℓ ∈ R^{N_ℓ} via T_ℓ(x) = A_ℓ x + b_ℓ.

The number of weights W(Φ_ρ) and the number of neurons N(Φ_ρ) are

  W(Φ_ρ) = Σ_{j≤L} (‖A_j‖_{ℓ^0} + ‖b_j‖_{ℓ^0})   and   N(Φ_ρ) = Σ_{j=0}^{L} N_j.
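These two complexity measures are straightforward to compute; a minimal sketch (illustration only, with a hypothetical helper name):

```python
# Sketch of the complexity measures above: W counts the non-zero entries of all
# matrices A_l and vectors b_l, N counts all neurons (including the d input
# neurons and the N_L output neurons).
import numpy as np

def count_weights_and_neurons(layers):
    """layers: list of (A_l, b_l) with A_l of shape (N_l, N_{l-1})."""
    W = sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in layers)
    N = layers[0][0].shape[1] + sum(A.shape[0] for A, _ in layers)
    return W, N

# Example: architecture (2, 3, 1)
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.ones(1))]
print(count_weights_and_neurons(layers))   # -> (10, 6): 6+0+3+1 non-zeros, 2+3+1 neurons
```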


Power of the architecture — Exemplary results

Given f from some class of functions, how many weights/neurons does an ε-approximating NN need to have?

Not so many...

Theorem (Maiorov, Pinkus; 1999)

There exists an activation function ρ_weird : R → R that

- is analytic and strictly increasing,
- satisfies lim_{x→−∞} ρ_weird(x) = 0 and lim_{x→∞} ρ_weird(x) = 1,

such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there is a 3-layer ρ_weird-network Φ_ε^{ρ_weird} with ‖f − Φ_ε^{ρ_weird}‖_{L∞} ≤ ε and N(Φ_ε^{ρ_weird}) = 9d + 3.


Power of the architecture — Exemplary results

- Barron; 1993: Approximation rate for functions with one finite Fourier moment, using shallow networks with an activation function ρ that is sigmoidal of order zero.
- Mhaskar; 1993: Let ρ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ_n^ρ‖_{L∞} ≲ N(Φ_n^ρ)^{−s/d} and L(Φ_n^ρ) = L(d, s, k).
- Yarotsky; 2017: For f ∈ C^s([0, 1]^d) and ρ(x) = x_+ (called ReLU), we have ‖f − Φ_n^ρ‖_{L∞} ≲ W(Φ_n^ρ)^{−s/d} and L(Φ_n^ρ) ≲ log(n). (A numerical sketch of the underlying construction follows after this list.)
- Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
- He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce approximation rates of h-, p- and hp-FEM.
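The sketch referenced in the Yarotsky item (an illustration under the standard assumptions of that construction, not code from the paper): composing a ReLU-expressible hat function m times gives sawtooth functions, and x ↦ x² is approximated with error 2^{−2m−2} by a network whose size grows only linearly in m.

```python
# Sketch of the sawtooth construction behind Yarotsky-type ReLU rates.
# g is a ReLU-expressible hat function; composing it m times and forming
#   f_m(x) = x - sum_{k=1..m} g^(k)(x) / 4^k
# gives the piecewise-linear interpolant of x^2 on a grid of width 2^(-m).
import numpy as np

def hat(x):
    # g(x) = 2x on [0,1/2], 2(1-x) on [1/2,1]; one ReLU layer:
    # g(x) = 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1)  for x in [0, 1]
    r = np.maximum
    return 2 * r(x, 0) - 4 * r(x - 0.5, 0) + 2 * r(x - 1.0, 0)

def f_m(x, m):
    approx, g = x.copy(), x.copy()
    for k in range(1, m + 1):
        g = hat(g)                   # k-fold composition: a sawtooth with 2^(k-1) teeth
        approx -= g / 4**k
    return approx

x = np.linspace(0, 1, 1001)
for m in (2, 4, 6):
    print(m, np.max(np.abs(f_m(x, m) - x**2)))   # error roughly 2^(-2m-2)
```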


Lower bounds

Optimal approximation rates: Lower bounds on the required network size only exist under additional assumptions (recall the networks based on ρ_weird).

Options:

(A) Place restrictions on the activation function (e.g., only consider the ReLU), thereby excluding pathological examples like ρ_weird. (→ VC-dimension bounds)

(B) Place restrictions on the weights. (→ Information-theoretic bounds, entropy arguments)

(C) Use still other concepts such as continuous N-widths.


Asymptotic min-max rate distortion

Encoders: Let C ⊂ L²(R^d) and ℓ ∈ N:

  E_ℓ := { E : C → {0, 1}^ℓ },   D_ℓ := { D : {0, 1}^ℓ → L²(R^d) }.

Min-max code length:

  L(ε, C) := min { ℓ ∈ N : ∃ D ∈ D_ℓ, E ∈ E_ℓ : sup_{f∈C} ‖D(E(f)) − f‖_2 < ε }.

Optimal exponent:

  γ*(C) := inf { γ > 0 : L(ε, C) = O(ε^{−γ}) }.


Asymptotic min-max rate distortion

Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017)

Let C ⊂ L²(R^d) and ρ : R → R. Then for all ε > 0:

  sup_{f∈C} inf { W(Φ_ρ) : Φ_ρ NN with quantised weights, ‖Φ_ρ − f‖_2 ≤ ε } ≳ ε^{−γ*(C)}.   (1)

Optimal approximation/parametrization: If for C ⊂ L²(R^d) one also has ≲ in (1), then NNs approximate the function class optimally.

Versatility: It turns out that NNs achieve optimal approximation rates for many practically used function classes.


Some instances of optimal approximation

- Mhaskar; 1993: Let ρ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ_n^ρ‖_{L∞} ≲ N(Φ_n^ρ)^{−s/d}. We have γ*({f ∈ C^s([0, 1]^d) : ‖f‖ ≤ 1}) = d/s.
- Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer ReLU NNs. Optimal when wavelets are optimal.
- Bölcskei, Grohs, Kutyniok, P.; 2017: Networks yield optimal rates if any affine system does. Example: shearlets for cartoon-like functions.


ReLU Approximation

Piecewise smooth functions: E_{β,d} denotes the d-dimensional C^β-piecewise smooth functions on [0, 1]^d with interfaces in C^β.


Theorem (P., Voigtlaender; 2018)

Let d ∈ N, β ≥ 0, and ρ : R → R, ρ(x) = x_+. Then

  sup_{f∈E_{β,d}} inf { W(Φ_ρ) : Φ_ρ NN with quantised weights, ‖Φ_ρ − f‖_2 ≤ ε } ∼ ε^{−γ*(E_{β,d})} = ε^{−2(d−1)/β}.

The optimal depth of the networks is ∼ β/d.


High-dimensional approximation

Curse of dimension: To guarantee approximation with error ≤ ε of functions in E_{β,d}, one requires networks with O(ε^{−2(d−1)/β}) weights.

Symmetries and invariances: Image classifiers are often

- translation, dilation, and rotation invariant,
- invariant to small deformations,
- invariant to small changes in brightness, contrast, color.


Curse of dimension

Two-step setup: f = χ ∘ τ

- τ : R^D → R^d is a smooth, dimension-reducing feature map.
- χ ∈ E_{β,d} performs the classification on the low-dimensional space.

Theorem (P., Voigtlaender; 2017)

Let ρ(x) = x_+. There are constants c > 0, L ∈ N such that for any f = χ ∘ τ and any ε ∈ (0, 1/2), there is a NN Φ_ε^ρ with at most L layers and at most c · ε^{−2(d−1)/β} non-zero weights such that

  ‖Φ_ε^ρ − f‖_{L²} < ε.

The asymptotic approximation rate depends only on d, not on D.


Compositional functions

Compositional functions: [Mhaskar, Poggio; 2016] High-dimensional functions as dyadic compositions of 2-dimensional functions:

  R^8 ∋ x ↦ h^3_1( h^2_1( h^1_1(x_1, x_2), h^1_2(x_3, x_4) ), h^2_2( h^1_3(x_5, x_6), h^1_4(x_7, x_8) ) )

[Figure: binary tree of this composition with leaves x_1, …, x_8]
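A minimal sketch of such a compositional function (with an arbitrary, illustrative choice of the 2-dimensional constituents h; not from the paper). A deep network can mirror this binary tree layer by layer, so only 2-dimensional sub-problems have to be approximated.

```python
# A compositional function on R^8 in the sense above: a binary tree of
# bivariate maps, evaluated level by level.
import numpy as np

def h(a, b):                      # one 2-D constituent function (illustrative choice)
    return np.tanh(a + 2.0 * b)

def compositional(x):             # x has length 8, as in the formula above
    level1 = [h(x[0], x[1]), h(x[2], x[3]), h(x[4], x[5]), h(x[6], x[7])]
    level2 = [h(level1[0], level1[1]), h(level1[2], level1[3])]
    return h(level2[0], level2[1])

print(compositional(np.linspace(0.0, 1.0, 8)))
```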


Extensions

Approximation with respect to Sobolev norms: ReLU NNs Φ are Lipschitz continuous. Hence, for s ∈ [0, 1], p ≥ 1 and f ∈ W^{s,p}(Ω), we can measure

  ‖f − Φ‖_{W^{s,p}(Ω)}.

ReLU networks achieve the same approximation rates as h-, p- and hp-FEM [Opschoor, P., Schwab; 2019].

Convolutional neural networks: Direct correspondence between approximation by CNNs (without pooling) and approximation by fully-connected networks [P., Voigtlaender; 2018].


Optimal parametrization

Optimal parametrization:

- Neural networks yield optimal representations of many function classes relevant in PDE applications.
- Approximation is flexible, and quality is improved if low-dimensional structure is present.

PDE discretization:

- Problem complexity drastically reduced.
- No design of an ansatz system necessary, since NNs approximate almost every function class well.

Can neural networks really be this good?


The inconvenient structure of neural networks


Fixed architecture networks

Goal: Fix a space of networks with prescribed shape and understand the associated set of functions.

Fixed architecture networks: Let d, L ∈ N, N_1, …, N_{L−1} ∈ N, and ρ : R → R. Then we denote by

  NN_ρ(d, N_1, …, N_{L−1}, 1)

the set of NNs with architecture (d, N_1, …, N_{L−1}, 1).

[Figure: network diagram with d = 8, N_1 = 12, N_2 = 12, N_3 = 12, N_4 = 8]


Back to the basics

Topological properties: Is NN_ρ(d, N_1, …, N_{L−1}, 1)

- star-shaped?
- convex? approximately convex?
- closed?

Is the map (T_1, …, T_L) ↦ Φ open?

Implications for optimization: If we do not have the properties above, then we can have

- terrible local minima,
- exploding weights,
- very slow convergence.


Star-shapedness

Star-shapedness: NN_ρ(d, N_1, …, N_{L−1}, 1) is trivially star-shaped with center 0.

...but...

Proposition (P., Raslan, Voigtlaender; 2018)

Let d, L, N_1, …, N_{L−1} ∈ N and let ρ : R → R be locally Lipschitz continuous. Then the number of linearly independent centers of NN_ρ(d, N_1, …, N_{L−1}, 1) is at most

  Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ,

where N_0 = d.


Convexity?

Corollary (P., Raslan, Voigtlaender; 2018)

Let d, L, N_1, …, N_{L−1} ∈ N, N_0 = d, and let ρ : R → R be locally Lipschitz continuous. If NN_ρ(d, N_1, …, N_{L−1}, 1) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions, then NN_ρ(d, N_1, …, N_{L−1}, 1) is not convex.

From translation invariance: If NN_ρ(d, N_1, …, N_{L−1}, 1) contains only finitely many linearly independent functions, then ρ is a finite sum of complex exponentials multiplied by polynomials.


Weak Convexity?

Weak convexity: NN_ρ(d, N_1, …, N_{L−1}, 1) is almost never convex, but what about NN_ρ(d, N_1, …, N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) for a hopefully small ε > 0?

Theorem (P., Raslan, Voigtlaender; 2018)

Let d, L, N_1, …, N_{L−1} ∈ N, N_0 = d. For all commonly used activation functions there does not exist an ε > 0 such that NN_ρ(d, N_1, …, N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) is convex.

As a corollary, we also get that NN_ρ(d, N_1, …, N_{L−1}, 1) is usually nowhere dense.


Illustration

Illustration: The set NN_ρ(d, N_1, …, N_{L−1}, 1) has very few centers, it is scaling invariant, not approximately convex, and nowhere dense.
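A numerical check of the scaling invariance just mentioned (illustration only, ReLU case): rescaling one layer up and the next layer down leaves the realized function unchanged, so the parameter-to-function map is far from injective.

```python
# relu(c*t) = c*relu(t) for c > 0, so multiplying the first layer by c and
# dividing the last layer by c realizes the same function.
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(1)
A1, b1 = rng.standard_normal((12, 8)), rng.standard_normal(12)
A2, b2 = rng.standard_normal((1, 12)), rng.standard_normal(1)
x, c = rng.standard_normal(8), 1000.0

f  = A2 @ relu(A1 @ x + b1) + b2
fc = (A2 / c) @ relu(c * A1 @ x + c * b1) + b2
print(f, fc)            # identical up to floating-point rounding
```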


Closedness in Lp

Compact weights: If the activation function ρ is continuous, then a compactness argument shows that the set of networks with parameters from a compact set is closed.

Theorem (P., Raslan, Voigtlaender; 2018)

Let d, L, N_1, …, N_{L−1} ∈ N, N_0 = d. If ρ has one of the properties below, then NN_ρ(d, N_1, …, N_{L−1}, 1) is not closed in L^p, p ∈ (0, ∞):

- analytic, bounded, and not constant,
- C^1 but not C^∞,
- continuous, monotone, bounded, and ρ′(x_0) exists and is non-zero in at least one point x_0 ∈ R,
- continuous, monotone, continuously differentiable outside a compact set, and lim_{x→∞} ρ′(x), lim_{x→−∞} ρ′(x) exist and do not coincide.
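A numerical illustration of this failure of closedness (an illustrative example chosen here, not taken from the paper): with ρ = tanh, the difference quotients n(tanh(x + 1/n) − tanh(x)) all lie in NN_tanh(1, 2, 1), converge uniformly on compacta to tanh′, yet the outer weights blow up like n; the limit is in general not realizable by a network of this fixed size.

```python
# Fixed-size two-layer tanh networks approaching tanh'(x) while the weights explode.
import numpy as np

x = np.linspace(-3, 3, 601)
target = 1.0 - np.tanh(x) ** 2            # tanh'(x), the limit function

for n in (10, 100, 1000, 10000):
    # two neurons, outer weights +n and -n: an element of NN_tanh(1, 2, 1)
    approx = n * (np.tanh(x + 1.0 / n) - np.tanh(x))
    print(n, np.max(np.abs(approx - target)))   # error -> 0 while weights -> infinity
```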


Closedness in L∞

Theorem (P., Raslan, Voigtlaender; 2018)

Let d, L, N_1, …, N_{L−1} ∈ N, N_0 = d. If ρ has one of the properties below, then NN_ρ(d, N_1, …, N_{L−1}, 1) is not closed in L^∞:

- analytic, bounded, and not constant,
- C^1 but not C^∞,
- ρ ∈ C^p and |ρ(x) − x_+^p| bounded, for p ≥ 1.

ReLU: The set of two-layer ReLU NNs is closed in L∞!


Illustration

Illustration: For most activation functions ρ (except the ReLU), the set NN_ρ(d, N_1, …, N_{L−1}, 1) is star-shaped with center 0, not approximately convex, and not closed.


Stable parametrization

Continuous parametrization: It is not hard to see that if ρ is continuous, then so is the map R_ρ : (T_1, …, T_L) ↦ Φ.

Quotient map: We can also ask if R_ρ is a quotient map, i.e., if Φ_1, Φ_2 are NNs which are close (w.r.t. ‖·‖_sup), then there should exist (T^1_1, …, T^1_L) and (T^2_1, …, T^2_L) which are close in some norm and satisfy R_ρ((T^1_1, …, T^1_L)) = Φ_1 and R_ρ((T^2_1, …, T^2_L)) = Φ_2.

Proposition (P., Raslan, Voigtlaender; 2018)

Let ρ be Lipschitz continuous and not affine-linear. Then R_ρ is not a quotient map.


Consequences

No convexity:

- Want to solve ∇J(Φ) = 0 for an energy J and a NN Φ.
- Not only J could be non-convex, but also the set we optimize over.
- Similar to N-term approximation by dictionaries.

No closedness:

- Exploding coefficients (if P_NN(f) ∉ NN).
- No low-neuron approximation.

No inverse-stable parametrization:

- Error term very small, while the parametrization is far from optimal.
- Potentially very slow convergence.


Where to go from here?

Different networks:

- Special types of networks could be more robust.
- Convolutional neural networks are probably still too large a class [P., Voigtlaender; 2018].

Stronger norms:

- Stronger norms naturally help with closedness and inverse stability.
- An example is Sobolev training [Czarnecki, Osindero, Jaderberg, Swirszcz, Pascanu; 2017]; a minimal sketch follows below.
- Many arguments of our results break down if the W^{1,∞} norm is used.
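The Sobolev-training sketch referenced above (illustrative setup with a hypothetical 1-D target; not the cited authors' code): the loss penalizes the mismatch of both values and first derivatives, i.e. a W^{1,2}-type error instead of a plain L²-error.

```python
# Sobolev-style training: fit values and first derivatives simultaneously.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def target(x):                      # example target and its derivative
    return torch.sin(3.0 * x), 3.0 * torch.cos(3.0 * x)

for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    f, df = target(x)
    loss = (u - f).pow(2).mean() + (du - df).pow(2).mean()   # value term + derivative term
    opt.zero_grad(); loss.backward(); opt.step()
```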


Conclusion

Approximation: NNs are a very powerful approximation tool:

- often optimally efficient parametrization,
- overcome the curse of dimension,
- surprisingly efficient black-box optimization.

Topological structure: NNs form an impractical set:

- non-convex,
- non-closed,
- no inverse-stable parametrization.


Thank you for your attention!

References:

H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks, arXiv:1901.01388.

H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714.

J. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU networks and high-order finite element methods, SAM, ETH Zürich, 2019.

P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2018.

P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, arXiv:1806.08459.

P. Petersen, F. Voigtlaender, Equivalence of approximation by convolutional neural networks and fully-connected networks, arXiv:1809.00973.
