
• Published 150+ papers in top academic journals and conferences, cited more than 10,000 times.

• Served as organizing committee chair or area chair of many top academic conferences (SIGIR/ICML/NIPS/KDD/WWW/AAAI/WINE/ICTIR, etc.), and as associate editor of top academic journals (TOIS/TWEB/Neurocomputing, etc.).

• Released or co-released well-known open-source projects:

• Microsoft Cognitive Toolkit (CNTK)

• Microsoft Graph Engine

• Microsoft Distributed Machine Learning Toolkit (DMTK)

ImageNet classification top-5 error (%):

ILSVRC2010  NEC America   28.2
ILSVRC2011  Xerox         25.8
ILSVRC2012  AlexNet       16.4
ILSVRC2013  Clarifai      11.7
ILSVRC2014  VGG            7.3
ILSVRC2014  GoogleNet      6.7
ILSVRC2015  ResNet         3.5

Microsoft took first place in all five tracks this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

https://arxiv.org/abs/1610.05256

AlphaGo vs. Lee Sedol

Atari Games

Machine learning, as the driving force, is entering a new era of big models and big data.

DNN: Deep Neural Networks
CNN: Convolutional Neural Networks
RNN: Recurrent Neural Networks

Model                                                  Hardware   Time cost
ResNet on ImageNet (~1M image samples for 1K classes)  K40 x 8    ~130 hours
GoogleNet on ImageNet                                  K40        ~570 hours
2000h Speech LSTM model training                       K40        ~1,100 hours
Neural Translation model                               K40        ~2,000 hours

How can we make good use of computation resources to speed up the training of big models over big data?

• Mini-tutorial on distributed machine learning

• Details on CNTK’s parallel training

• CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)
h2 = σ(W2 h1 + b2)
P = softmax(Wout h2 + bout)

with input x ∈ R^M and one-hot label y ∈ R^J, and cross-entropy training criterion

ce = y^T log P,   maximized over the corpus: Σ_corpus ce → max

In CNTK's description this becomes:


h1 = sigmoid (x @ W1 + b1)

h2 = sigmoid (h1 @ W2 + b2)

P = softmax (h2 @ Wout + bout)

ce = cross_entropy (P, y)

[Figure: the same example as a computation graph — x flows through (W1, b1, sigmoid) to h1, through (W2, b2, sigmoid) to h2, and through (Wout, bout, softmax) to P; P and the label y feed cross_entropy to produce ce.]
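To make the example concrete, here is a minimal NumPy sketch of the same computation (forward pass plus cross-entropy). The dimensions M, H1, H2, J and the random initialization are illustrative choices, not values from the talk.

```python
import numpy as np

M, H1, H2, J = 20, 50, 50, 10            # input dim, hidden dims, number of classes (illustrative)
rng = np.random.default_rng(0)

# Randomly initialized parameters, just for the sketch
W1, b1     = 0.1 * rng.standard_normal((M, H1)),  np.zeros(H1)
W2, b2     = 0.1 * rng.standard_normal((H1, H2)), np.zeros(H2)
Wout, bout = 0.1 * rng.standard_normal((H2, J)),  np.zeros(J)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(M)               # input x in R^M
y = np.eye(J)[3]                         # one-hot label y in R^J

# The same composition of building blocks as in the description above
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P  = softmax(h2 @ Wout + bout)
ce = -y @ np.log(P)                      # cross_entropy(P, y): the negative of y^T log P
print(ce)
```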


LEGO-like composability allows CNTK to support a wide range of networks & applications

• Select the right reader and learner for LEGO-like training

• Describe the network as a computation network

• Use the Simple Network Builder, BrainScript, or Python to build the computation network

CNTK components (corpus → reader → network → learner → model):

• reader: task-specific deserializers; automatic randomization

• network: network definition; CPU/GPU execution engine

• learner: SGD variants (momentum, AdaGrad, …); minibatching, packing, padding

(A schematic sketch of how these pieces fit together follows.)
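Here is a small, library-agnostic Python sketch of the corpus → reader → network → learner → model loop. The class names (Reader, Network, SGDLearner) and the toy linear model are illustrative placeholders, not CNTK's actual API.

```python
import random

class Reader:
    """Task-specific deserializer with automatic randomization (sketch)."""
    def __init__(self, corpus, minibatch_size=4):
        self.corpus, self.mb = list(corpus), minibatch_size
    def minibatches(self):
        random.shuffle(self.corpus)                       # automatic randomization
        for i in range(0, len(self.corpus), self.mb):
            yield self.corpus[i:i + self.mb]              # minibatching

class Network:
    """Toy 'computation network': y = w * x with squared loss."""
    def __init__(self): self.w = 0.0
    def forward_backward(self, mb):
        loss = sum((self.w * x - y) ** 2 for x, y in mb) / len(mb)
        grad = sum(2 * (self.w * x - y) * x for x, y in mb) / len(mb)
        return loss, grad

class SGDLearner:
    """Plain SGD; momentum/AdaGrad variants would change the update rule."""
    def __init__(self, lr=0.01): self.lr = lr
    def update(self, net, grad): net.w -= self.lr * grad

corpus = [(x, 2.0 * x) for x in range(-5, 5)]             # the 'corpus'
reader, net, learner = Reader(corpus), Network(), SGDLearner()
for epoch in range(20):
    for mb in reader.minibatches():
        loss, grad = net.forward_backward(mb)             # CNTK: CPU/GPU execution engine
        learner.update(net, grad)
print(net.w)                                              # the trained 'model' (close to 2.0)
```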

• FCN-5/-8 – fully connected networks with different numbers of layers

• AlexNet/ResNet – typical convolutional networks for image classification / recognition

• LSTM-32/-64 – an LSTM model (two LSTM layers) with different batch sizes

• "Iteration" here means one (mini-batch) update.

• Caffe doesn't support RNN/LSTM computation.

• Experiments run on a Titan X GPU.

[Figure: speed comparison for single-GPU training, measured in iterations/second (higher = better), across FCN-5, FCN-8, AlexNet, ResNet, LSTM-32, and LSTM-64, for Caffe, CNTK, MXNet, TensorFlow, and Torch.]

• CNTK has 1-bit SGD enabled.

• All experiments are carried out on the FCN-5 model with a synchronous data-parallel algorithm on K40 GPUs connected via InfiniBand.

[Figure: speed comparison for parallel training, measured in iterations/second (higher = better), on FCN-5 with 1, 2, 4, and 8 GPUs (2 nodes), for Caffe, CNTK, TensorFlow, and Torch.]

http://github.com/Microsoft/cntk

• CNTK is fast for single-GPU training, and its speed is especially outstanding for RNN training. CNTK also scales best among the DNN tools compared here.

• Quick revisit on CNTK

• Mini-tutorial on distributed machine learning

• Details on CNTK’s parallel training

1. Partition the training data.

2. Train in parallel on different machines.

3. Synchronize the local updates.

4. Refresh the local model with the new parameters; go to step 2.

[Figure: BSP data parallelism — workers 1–4 each compute a local update Δω_i and merge it into the global model ω_t at a common synchronization barrier over time.]

Individual workers synchronize with each other every k mini-batches:

1) Aggregate the Δω_i's from all workers to refine the global model ω.
2) Broadcast the global model ω back to each worker.
3) After receiving the new global model, each worker starts its next step of training.

BSP: Bulk Synchronous Parallel
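The four-step procedure above, in its BSP form, can be simulated in a few lines of NumPy: partition the data, let each worker compute a local gradient, average the updates at the barrier, and refresh the global model. The toy least-squares objective, worker count, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K = 5, 1024, 4                        # model dim, #samples, #workers (illustrative)
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.01 * rng.standard_normal(n)

shards = np.array_split(np.arange(n), K)    # step 1: partition the training data
w = np.zeros(d)                             # global model
lr = 0.1

for step in range(200):
    grads = []
    for k in range(K):                      # step 2: parallel training (simulated serially here)
        Xk, yk = X[shards[k]], y[shards[k]]
        grads.append(2 * Xk.T @ (Xk @ w - yk) / len(yk))   # local least-squares gradient
    w -= lr * np.mean(grads, axis=0)        # steps 3-4: synchronize (average) and refresh;
                                            # under BSP every worker then continues from this same w

print(np.linalg.norm(w - w_true))           # small (limited only by the label noise)
```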

[Figure: SSP data parallelism — workers 1–4 push updates Δω_i to the global model ω_t over time, with no synchronization until the staleness threshold is hit. For example, with staleness = 3, if worker 1 has only finished iteration #2 while worker 3 has finished iteration #5, worker 3 will wait for worker 1 to catch up.]

Each worker pushes its update Δω_i to the global model ω every k mini-batches, until it notices that another worker is s steps behind. Thus SSP trades off between BSP and ASP:

1) When s = 0, SSP = BSP.
2) When s = ∞, SSP = ASP.

SSP: Stale Synchronous Parallel
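A toy simulation of the SSP staleness rule (the worker count, random scheduling, and staleness bound s = 3 are illustrative): a worker may advance only while it is fewer than s iterations ahead of the slowest worker; otherwise it waits.

```python
import random

random.seed(0)
K, s, total_iters = 4, 3, 12
progress = [0] * K                    # iterations finished by each worker
waits = 0

while min(progress) < total_iters:
    k = random.randrange(K)           # this worker gets scheduled next
    if progress[k] - min(progress) >= s:
        # e.g. if this worker finished iteration 5 while another is still at 2 (s = 3), it waits
        waits += 1
        continue
    progress[k] += 1                  # otherwise it trains one more mini-batch and pushes its update

print("final progress:", progress, "- times a worker had to wait:", waits)
```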

• BSP is a well-defined mechanism, which can be equivalent to single-machine SGD under certain conditions.

• BSP has a convergence guarantee, but might be inefficient due to frequent synchronization.

• ASP always runs fast due to its asynchronous nature; no time is wasted on waiting.

• ASP, in theory, might not converge when the differences between workers' progress are unbounded (a straggler can destroy convergence by pushing stale Δω onto the global model).

• SSP trades off efficiency and convergence: (1) it does not require strict synchronization; (2) it does not allow workers' progress to differ too much.

• SSP is proven to converge for convex losses and bounded staleness.

(The BSP model was introduced by Leslie G. Valiant.)

[Figure: ASP data parallelism — workers 1–4 each push their update Δω_i to the global model ω_t independently over time, with no synchronization.]

Each worker pushes its update Δω_i to the global model ω every k mini-batches, without waiting for the others:

1) Push the update Δω_i to the global model ω.
2) Pull back whatever global model is currently in the parameter server.
3) Proceed with training based on the latest ω on the local machine.

ASP: Asynchronous Parallel
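A minimal sketch of ASP using Python threads on a toy quadratic objective: each worker pulls the current global model, computes a gradient on its (possibly already stale) copy, and pushes the update without waiting for anyone. The objective, thread count, and step size are illustrative.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
w = np.zeros(5)                              # global model held by the "parameter server"
lock = threading.Lock()                      # protects the global model; there is no barrier
lr, K, steps = 0.05, 4, 200

def worker(k):
    global w
    for _ in range(steps):
        with lock:
            w_local = w.copy()               # pull whatever global model is there
        g = 2 * (w_local - w_true)           # gradient of ||w - w_true||^2 on the stale copy
        with lock:
            w -= lr * g                      # push the update, without waiting for others

threads = [threading.Thread(target=worker, args=(k,)) for k in range(K)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(w - w_true))            # small, despite the stale gradients
```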

A system approach

• The global model is partitioned into K sub-models without overlap.

• The sub-models are distributed over K local workers and serve as their local models.

• In each mini-batch, the local workers compute the gradients of their local weights by back propagation. (See the sketch below.)
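A tiny NumPy sketch of this model-parallel partitioning: one layer's weight matrix is split column-wise into K non-overlapping sub-models, and each "worker" computes the forward pass and the gradient only for its own slice. Layer sizes and the squared-error objective are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 8, 12, 3                         # input dim, output dim, #workers (illustrative)
x = rng.standard_normal(M)
target = rng.standard_normal(N)

# The global weight matrix W (M x N) is partitioned column-wise into K non-overlapping sub-models
W_parts = [0.1 * rng.standard_normal((M, N // K)) for _ in range(K)]
t_parts = np.split(target, K)              # each worker also holds the matching slice of the target

lr = 0.5 / (x @ x)                         # step size chosen so this toy iteration converges
for step in range(60):
    for k in range(K):                     # "worker" k touches only its own sub-model
        out_k  = x @ W_parts[k]                       # forward pass for this slice of the layer
        grad_k = np.outer(x, out_k - t_parts[k])      # backprop of squared error, local weights only
        W_parts[k] -= lr * grad_k

full_out = np.concatenate([x @ Wk for Wk in W_parts])  # logically the full layer output
print(np.linalg.norm(full_out - target))               # close to 0
```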

[Figure: taxonomy of distributed ML systems, organized along data parallelism vs. model parallelism and synchronous vs. asynchronous vs. irregular parallelism:

• Iterative MapReduce (LDA, LR) — synchronous data parallelism

• Parameter Server (deep learning, LDA, GBDT, LR) — synchronous or asynchronous; data and model parallelism

• Dataflow (deep learning) — irregular parallelism]

Iterative MapReduce

• Use MapReduce / AllReduce to sync parameters among workers (see the sketch below)

• Only synchronous updates (the pattern is local computation followed by a synchronous update)

• Example: Spark and other derived systems
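The pattern can be written literally as a map step followed by a reduce step. A toy NumPy sketch on a least-squares problem (shard sizes, learning rate, and the objective are illustrative):

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
shards = [rng.standard_normal((50, 3)) for _ in range(4)]   # one data shard per worker
targets = [X @ w_true for X in shards]

w = np.zeros(3)
def local_gradient(shard):
    X, y = shard
    return 2 * X.T @ (X @ w - y) / len(y)        # "map": local least-squares gradient on one shard

for step in range(100):
    grads = list(map(local_gradient, zip(shards, targets)))  # map over the workers
    avg = reduce(np.add, grads) / len(grads)                  # reduce / allreduce: synchronous sync
    w = w - 0.05 * avg
print(w)                                                      # approximately w_true
```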


Parameter Server

• Parameter server (PS) based solutions are proposed to support:

• Asynchronous updates

• Different mechanisms for model aggregation, especially in an asynchronous manner

• Model parallelism

• Examples: Google's DistBelief; Petuum; Multiverso PS (a minimal push/pull sketch follows)

+ NIPS'12 DistBelief (Google), NIPS'13 Petuum (Eric Xing), OSDI'14 Parameter Server (Mu Li), Multiverso PS, etc.
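At its core, a parameter server is a key-value store with push (send an update) and pull (read the current model). A minimal single-process sketch with illustrative names and a toy gradient; a real parameter server shards the keys across machines and handles network communication.

```python
import threading
import numpy as np

class ParameterServer:
    """Minimal key-value parameter server sketch: push adds an update, pull returns a copy."""
    def __init__(self):
        self._params = {}
        self._lock = threading.Lock()          # a real PS shards keys over many machines
    def init(self, key, value):
        with self._lock:
            self._params[key] = np.array(value, dtype=float)
    def push(self, key, delta):                # workers send updates; no barrier is imposed here
        with self._lock:
            self._params[key] += delta
    def pull(self, key):                       # workers read the current (possibly newer) model
        with self._lock:
            return self._params[key].copy()

# Usage sketch: a worker's inner loop is just pull -> compute -> push.
ps = ParameterServer()
ps.init("w", np.zeros(3))
w = ps.pull("w")                               # pull the global model
grad = 2 * (w - np.array([1.0, 2.0, 3.0]))     # toy gradient computed locally
ps.push("w", -0.1 * grad)                      # push the (scaled) update back
print(ps.pull("w"))
```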


Dataflow

Dataflow-based solutions are proposed to support:

• Irregular parallelism (e.g., hybrid data- and model-parallelism), particularly in deep learning

• Both high-level abstraction and low-level flexibility in implementation

• Example: Google's TensorFlow

+ TensorFlow, EuroSys'07 Dryad (Microsoft), NSDI'12 Spark (AMP Lab), CNTK, MXNet, etc.

Key aspects of a dataflow system:

• Task scheduling & execution based on: 1. data dependency; 2. resource availability

• Infrastructure: a parallel execution engine (e.g., based on gRPC) that packs messages, handles communication, etc.

• Computation allocation method: manually assigning computations to devices vs. automatically performing optimized allocation

• Parameter aggregation logic: inherits all the research output from the parameter-server side

The focus is still on data parallelism:

+ the data is huge

+ the model is of modest size

+ a cluster of machines works together to speed up training over the data partitions

• Quick revisit on CNTK

• Mini-tutorial on distributed machine learning

[Figure (recap): speed comparison for parallel training, measured in iterations/second (higher = better), on FCN-5 with 1, 2, 4, and 8 GPUs (2 nodes), for Caffe, CNTK, TensorFlow, and Torch.]

• Communicate less each time

• Communicate less often

• Asynchronous and Pipelined processing

https://github.com/Microsoft/CNTK/wiki/Multiple-GPUs-and-machines

• Quantize gradients to just 1 bit per value, with error feedback

• Each gradient value is reduced to plus or minus a single shared value; the remaining (quantization) error is carried over to the next minibatch

• Delay the update procedure

[Figure: transferred gradient size in bits/value (smaller is better) — 32 for float vs. 1 for 1-bit quantization.]

1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, InterSpeech 2014, F. Seide, H. Fu, J. Droppo, G. Li, D. Yu
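A NumPy sketch of the 1-bit quantization with error feedback described above: each gradient value is reduced to plus or minus one shared magnitude, and the quantization error is added back to the next minibatch's gradient. Using the mean absolute value as the shared magnitude is an illustrative choice, not necessarily CNTK's exact rule.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to +/- one shared value per tensor, carrying the error forward."""
    g = grad + residual                       # error feedback: add what was lost last time
    scale = np.mean(np.abs(g))                # one shared magnitude (illustrative choice)
    q = np.where(g >= 0, scale, -scale)       # only the sign is transmitted: 1 bit per value
    return q, g - q                           # the quantization error becomes the new residual

rng = np.random.default_rng(0)
residual = np.zeros(1000)
sum_true = np.zeros(1000)
sum_sent = np.zeros(1000)
for step in range(200):
    grad = rng.standard_normal(1000)          # stand-in for a minibatch gradient
    q, residual = one_bit_quantize(grad, residual)
    sum_true += grad
    sum_sent += q

# With error feedback, the transmitted updates differ from the true gradients
# only by the current (bounded) residual: sum_true - sum_sent == residual.
print(np.allclose(sum_true - sum_sent, residual))
```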

Block Momentum

K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016
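A schematic NumPy sketch of the blockwise model-update filtering (BMUF) idea from the cited paper: every worker processes its split of a data block starting from the same global model, the averaged model update G is filtered through a block-level momentum term, and the filtered update refreshes the global model. The toy objective and the hyper-parameter values (eta, zeta) are illustrative; see the paper for the exact algorithm, including its Nesterov-style variant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 4                                    # model dimension and number of workers (illustrative)
w_true = rng.standard_normal(d)

def local_sgd(w_init, steps=10, lr=0.05, seed=0):
    """One worker's intra-block SGD on a toy noisy quadratic ||w - w_true||^2."""
    r = np.random.default_rng(seed)
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * (2 * (w - w_true) + 0.1 * r.standard_normal(d))
    return w

W = np.zeros(d)                                # global model W(t-1)
delta = np.zeros(d)                            # filtered block-level update
eta, zeta = 0.5, 1.0                           # block momentum and block learning rate (illustrative)

for block in range(50):
    # intra-block parallel optimization: every worker starts the block from the same global model
    worker_models = [local_sgd(W, seed=block * K + k) for k in range(K)]
    G = np.mean(worker_models, axis=0) - W     # aggregated model update generated by this block
    delta = eta * delta + zeta * G             # blockwise model-update filtering
    W = W + delta                              # refresh the global model

print(np.linalg.norm(W - w_true))              # small
```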

[Figure: 1-bit/BMUF speedup factors in LSTM training on 4–64 GPUs (higher = better). Series: 1bit-average (2.9, 5.4, 8.0), 1bit-peak (3.3, 6.7, 10.8), BMUF-average (3.7, 6.9, 13.8, 25.5, 43.7), BMUF-peak (4.1, 8.1, 14.1, 27.3, 54.0); BMUF scales to 64 GPUs with up to ~54x peak speedup.]

Asynchronous and Pipelined processing

[Figure: communication barrier in synchronous parallelization vs. asynchronous parallelization.]

Asynchronous and Pipelined processing

[Figure: pipelined processing — the main (training) thread keeps training iteration after iteration, while a separate communication thread fills the freshly synced model into a GPU buffer, prepares updates, prepares the model, and switches buffers, overlapping communication with computation.]
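A schematic sketch of the double-buffering idea behind pipelined processing: a communication thread fills one buffer with the freshly synced model while the main thread keeps training on the other, and the two switch buffers at iteration boundaries. The timings, buffer contents, and thread structure are illustrative.

```python
import threading, time

buffers = {"A": "model v0", "B": "model v0"}     # stand-ins for the two GPU buffers
train_buf, comm_buf = "A", "B"
swap = threading.Barrier(2)                      # the threads only meet to switch buffers

def communication_thread():
    for it in range(3):
        time.sleep(0.02)                         # simulate push/pull of updates with the server
        buffers[comm_buf] = f"model v{it + 1}"   # fill the freshly synced model into its buffer
        swap.wait()                              # ready to switch buffers
        swap.wait()                              # wait until the trainer has switched

def main_thread():
    global train_buf, comm_buf
    for it in range(3):
        time.sleep(0.05)                         # simulate training on train_buf (overlaps comm)
        print(f"iteration {it}: trained on {buffers[train_buf]}")
        swap.wait()                              # communication for this iteration is ready
        train_buf, comm_buf = comm_buf, train_buf   # switch buffers
        swap.wait()                              # let the communication thread continue

t = threading.Thread(target=communication_thread)
t.start(); main_thread(); t.join()
# The output shows the pipelining delay: each iteration trains on the model synced one iteration earlier.
```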

• Sequential SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_{t+τ})

• Async SGD: w_{t+τ+1} = w_{t+τ} − η · g(w_t)

Delayed communication in asynchronous parallelization

ASGD:    w_{t+τ+1} = w_{t+τ} − η · g(w_t)

DC-ASGD: w_{t+τ+1} = w_{t+τ} − η · [ g(w_t) + λ · φ(g(w_t)) ⊙ (w_{t+τ} − w_t) ],  where φ(g) = g ⊙ g

[Figure: training curves for a ResNet DNN model on CIFAR-10.]

DC-ASGD is a work that directly targets the delay; it is experimentally effective and comes with a convergence analysis.
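A NumPy sketch of the delay-compensation idea on a toy quadratic: the update is applied at w_{t+τ} using a gradient computed at the stale w_t, optionally adding the λ · φ(g) ⊙ (w_{t+τ} − w_t) correction with φ(g) = g ⊙ g. The objective, the delay τ, and the values of η and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 10, 8                                    # model dimension and artificial delay (illustrative)
w_true = rng.standard_normal(d)

def noisy_grad(w):
    return 2 * (w - w_true) + 0.05 * rng.standard_normal(d)

def run(delay_compensated, lr=0.05, lam=0.5, steps=400):
    w = np.zeros(d)
    history = [w.copy()]                          # past models, so we can look up the stale one
    for t in range(steps):
        w_stale = history[max(0, t - tau)]        # the model the worker actually computed with
        g = noisy_grad(w_stale)                   # the gradient arrives tau steps late
        if delay_compensated:
            g = g + lam * g * g * (w - w_stale)   # DC-ASGD correction: lambda * g (.) g (.) (w - w_stale)
        w = w - lr * g
        history.append(w.copy())
    return np.linalg.norm(w - w_true)

print("ASGD   :", run(False))
print("DC-ASGD:", run(True))
```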

• CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.

  • Linux, Windows, Docker, .NET

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

  • automatic differentiation, deferred computation, optimized execution and memory use

  • powerful description language, composability

  • implicit handling of time; efficient static and recurrent NN training through batching

  • data parallelization across GPUs & servers: 1-bit SGD, Block Momentum

  • feed-forward DNN, RNN, LSTM, convolution, DSSM; speech, vision, text

• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

• Web site: https://cntk.ai/

• Github: https://github.com/Microsoft/CNTK

• Wiki: https://github.com/Microsoft/CNTK/wiki

• Issues: https://github.com/Microsoft/CNTK/issues

fseide@microsoft.com

CNTK: democratizing the AI tool chain

taifengw@Microsoft.com
