29
Jenny Seidenschwarz Technische Universität München Seminar Course AutoML Munich, 4 th of July 2019 Searching for Activation Functios

Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Jenny Seidenschwarz

Technische Universität München

Seminar Course AutoML

Munich, 4th of July 2019

Searching for Activation Functios

Page 2: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Lehrstuhl für MusterverfahrenFakultät für MustertechnikTechnische Universität München

2Jenny Seidenschwarz (TUM) | Seminar Course AutoML

Activation Functions

• Gradient preserving property

• More easy to optimize

x1

xn

w1j

Linear

Trafo

w2j

wnj

x2 Activation

Function… … …

State of the art default scalar activation function: ReLU 𝒎𝒂𝒙(𝟎, 𝒙)

Figure: ReLU activation function

Figure: Basic structure neural network

Page 3: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

3Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Research Goal

Find new scalar activation functions …

… using automated search technique …

… compare them systematically to existing activation functions …

… across multiple different challenging datasets!

Page 4: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Automated Search

Page 5: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Challenge: balance size and expressivity of search space

→ Simple binary expression tree [1]

→ Selection of unary and binary functions

5Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Search Space

x

x Unary

Unary

Binary

Binary

x

Unary: 𝑥, −𝑥, 𝑥 , 𝑥2, 𝑥3, 𝑥 , 𝛽𝑥, 𝑥 + 𝛽, log 𝑥 + 𝜖 ,

exp 𝑥 , sin 𝑥 , cos 𝑥 , sinh 𝑥 , cosh 𝑥 , tanh 𝑥 ,

𝑡𝑎𝑛−1 𝑥 , 𝑠𝑖𝑛ℎ−1 𝑥 , 𝑠𝑖𝑛𝑐 𝑥 ,max 0, 𝑥 ,

min 0, 𝑥 , 𝜎 𝑥 , log 1 + exp 𝑥 , exp −𝑥2 , erf x , β

BInary: 𝑥1 + 𝑥2, 𝑥1𝑥2, 𝑥1 − 𝑥2,𝑥1

𝑥2+𝜖, max(𝑥1, 𝑥2),

m𝑖𝑛 𝑥1, 𝑥2 , 𝜎 𝑥1 𝑥2, exp −𝛽(𝑥1 − 𝑥2)2 , exp(−𝛽|𝑥1 −

𝑥2|), β𝑥1 + (1 − 𝛽)𝑥2

core unit

Unary

Unary

Figure: Core Unit, adapted from [5]

Page 6: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

6Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Search approach

small search space

big search space

exhaustive

search

reinforcement

learning-based

search

train child

network with

found

activation

functions

get list of best

performing

update search

algorithm and

find best

empirical

evaluation and

experiments

Page 7: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

RNN-controller [2] with domain specific language [1]

Train batch of generated activation functions

• ResNet-20

• Image classification on CIFAR-10

• 10k steps

7Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

RNN-Controller

core unit

Figure: RNN-Controller architecture [5]

Page 8: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

8Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

RNN-Controller update

RNN Controller with PPO:

→ Clipping ensures updates in „trust region“

→ Sample efficient

• Objective function

𝐿𝐶𝐿𝐼𝑃 𝜃 = 𝔼𝑡 min 𝜎𝑡𝐺𝑡 , 𝑐𝑙𝑖𝑝 𝜎𝑡 , 1 − 𝜀, 1 + 𝜀 𝐺𝑡 , 𝑤𝑖𝑡ℎ 𝜎𝑡 =𝜋𝜃 𝑎𝑡 𝑠𝑡)

𝜋𝜃𝑜𝑙𝑑 𝑎𝑡 𝑠𝑡)

RNN Controller with REINFORCE:

• Objective function ℒ𝜃𝑐 = 𝔼𝜋𝜃,𝜏𝐺𝑡 , where 𝐺𝑡 = σ𝑘=𝑡

𝑇−1 𝛾𝑘−𝑡𝑟𝑘+1

Policy gradient methods: 𝜋 𝑎𝑡 𝑠𝑡 , 𝜃𝑐 → ∆𝜃 ⟸ 𝛼 ∇ℒ𝜃𝑐

Figure: clipping function,

adapted from [3]

Page 9: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

RNN Controller with PPO:

• Objective gradient

→ 𝐺𝑘 = accuracy of child network

→ 𝑏 = exponential moving average of rewards

RNN Controller with REINFORCE:

• Objective gradient

9Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

RNN-Controller update

Policy gradient methods: 𝜋 𝑎𝑡 𝑠𝑡 , 𝜃𝑐 → ∆𝜃 ⟸ 𝛼 ∇ℒ𝜃𝑐

One child network: ∇ℒ𝜃𝑐 = σ𝑡=1𝑇 ∇𝜃𝑐 𝑙𝑜𝑔 𝜋 𝑎𝑡 𝑠𝑡 , 𝜃𝑐 (𝐺 − 𝑏)

Figure: clipping function, adapted from [3]

Page 10: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Findings on Activation Functions

Page 11: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

1. 1-2 core units perform best

2. Top activation functions always take raw preactivation x as input to final binary function

3. Periodic functions (sin, cos, etc. ) used by some top performing activation functions

4. Activation functions that use division perform poorly

11Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Findings on Activation Functions

𝑥𝜎 𝛽𝑥

𝑥(𝑠𝑖𝑛ℎ−1 𝑥 )2

min 𝑥, sin 𝑥

(𝑡𝑎𝑛ℎ−1(𝑥))2−𝑥

max(𝑥, 𝜎 𝑥 )

cos 𝑥 − 𝑥

m𝑎𝑥 𝑥, tanh(𝑥)

𝑠𝑖𝑛𝑐 𝑥 + 𝑥

Figure: best 8 activation functions [5]

Page 12: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Experiments to ensure generalization to deeper networks

12Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Validation of Performance

Swish

Figure: Generalization to deeper architectures of 8 best activation functions found [5]

(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy

Page 13: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

• Nonlinearly interpolation between ReLU and linear function

• Smooth function

• Non-monotinoc function

• Unbounded above and bounded below (like ReLU)

13Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Swish𝑓 𝑥 = 𝑥 𝜎(𝛽𝑥)

Figure: Swish activation function for different 𝛽 values

Page 14: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Benchmark of Swish

Page 15: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Benchmarked Swish to ReLU and other baseline activation functions

• Different models

• Different challenging real world datasets

• Test with fixed β = 1 and trainable β

15Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Further Experiments with Swish

Page 16: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

CIFAR 10 and 100: ResNet-164, Wide ResNet 28-10 and DenseNet 100-12

• Median of 5 runs for comparison

ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3, MobileNet and Mobile NASNet-A

• Fixed number of steps, 3 learning rates with RMSProp

• Epsecially good performance on mobile sized modelm slightly underperform Inception-v4

English-German-translation: 12 layer Base Transformer

• Two different learning rates, 300K steps with Adam optimizer

16Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Further Experiments with Swish

Figure: Overview performace in experiments [5]

Page 17: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Performance of Swish

Page 18: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Learnable 𝛽:

18Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Swish – learnable parameter β

Figure: distribution of trained 𝛽 on Mobile NASNet-A [5]

Page 19: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Non-monotinic bump for 𝑥 < 0

19Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Swich – Challenging current Belief

Figure: Preactivations for 𝛽 = 1 on

ResNet-32 [5]

Figure: Swish function for different 𝛽

values with non-monotonic bump

Page 20: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

No gradient preserving characteristics of derivative:

20Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Swich – Challenging current Belief

Figure: Derivative of swish function for

different 𝛽 values

Figure: Derivative of ReLU

Page 21: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Main contributions:

• Used a search space as in [1] to find activation functions with a RNN controller [2], that was

updated with PPO [3]

• Systematically compared activation functions

• Found new activation function that constantly outperform or is on par with ReLU

Critical aspects:

• Search space restricts results

• Search space designed after human intuition

• Restriction of training steps and training on small architectures might suppress even better

activation functions

Future research:

• Only two core units, but more unary and binary functions

• Also take non-scalar activation functions into account

21Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Conclusion

Page 22: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

[1] Bello, I. & Zoph, B. & Vasudevan, V. & Le, Quoc V. (2017). Neural Optimizer Search with

Reinforcement Learning.

[2] Zoph, B., & Le, Quoc V. (2017). Neural Architecture Search with Reinforcement Learning.

ArXiv, abs/1611.01578.

[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy

Optimization Algorithms. ArXiv, abs/1707.06347.

[4] Elfwing, S. & Uchibe, E. & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural

Network Function Approximation in Reinforcement Learning. Neural Networks. 107.

10.1016/j.neunet.2017.12.012.

[5] Ramachandran, P., Zoph, B., & Le, Q.V. (2018). Searching for Activation Functions. ArXiv,

abs/1710.05941.

22Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

References

Page 23: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

Back-up

Page 24: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

If you want to use Swish:

• Already implemented in tensorflow as tf.nn.swish(x)

• When using batch norm: set scale parameter

• Derivative of swish: 𝑓′ 𝑥 = 𝛽𝑓 𝑥 + 𝜎(𝛽𝑥)(1 − 𝛽𝑓 𝑥 )

24Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Things to note

Page 25: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

25Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Experiment Results - CIFAR

Figure: Benchmark experiments of Swish function to baseline functions on CIFAR [5]

(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy

Page 26: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

26Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Experiments on ImageNet

Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]

(a) Training curves of Mobile NASNet-Aon

ImageNet. Best viewed in color

(b) Mobile NASNet-A on ImageNet, with3 different

runs ordered by top-1 accuracy. Theadditional 2 GELU

experiments are still trainingat the time of submission.

Page 27: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

27Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Experiments on ImageNet

Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]

(a) Inception-ResNet-v2 on ImageNetwith 3 different

runs. Note that the ELUsometimes has instabilities at

the start oftraining, which accounts for the first result

(b) MobileNet on ImageNet.

Page 28: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

28Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Experiments on ImageNet

Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]

(a) Inception-v3 on ImageNet (b) Inception-v4 on ImageNet

Page 29: Searching for Activation Functios · 2020. 1. 11. · Searching for Activation Functios. Lehrstuhl für Musterverfahren Fakultät für Mustertechnik Technische Universität München

29Jenny Seidenschwarz (TUM) | Practical Course AutoML | Searching for activation functions

Experiments on Machine Translation

Figure: Benchmark experiments of Swish function to baseline functions on WTM

English→German (BLEU score) [5]