
Page 1
Page 2

• Published 150+ papers in top academic journals and conferences, cited more than 10,000 times.

• Served as organizing committee chair or area chair of many top academic conferences (SIGIR/ICML/NIPS/KDD/WWW/AAAI/WINE/ICTIR, etc.) and as associate editor of top academic journals (TOIS/TWEB/Neurocomputing, etc.).

• Released or co-released well-known open-source projects:

• Microsoft Cognitive Toolkit (CNTK)

• Microsoft Graph Engine

• Microsoft Distributed Machine Learning Toolkit (DMTK)

Page 3

ImageNet classification top-5 error (%):

ILSVRC2010  NEC America   28.2
ILSVRC2011  Xerox         25.8
ILSVRC2012  AlexNet       16.4
ILSVRC2013  Clarifai      11.7
ILSVRC2014  VGG            7.3
ILSVRC2014  GoogleNet      6.7
ILSVRC2015  ResNet         3.5

Microsoft took first place in all five tracks that year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

Page 4
Page 5
Page 6

https://arxiv.org/abs/1610.05256

Page 7

vs. Lee Sedol

Atari Games

Machine learning, as the driving force, is entering a new era of big models and big data.

Page 8

DNN: Deep Neural Networks

CNN: Convolutional Neural Networks

RNN: Recurrent Neural Networks

Page 9

Model                                          Hardware   Time cost
ResNet on ImageNet (~1M image samples, 1K classes)   K40 × 8    ~130 hours
GoogleNet on ImageNet                                 K40        ~570 hours
2000h speech LSTM model training                      K40        ~1100 hours
Neural translation model                              K40        ~2000 hours

Page 10

How can we make good use of computation resources to speed up the training of big models over big data?

Page 11
Page 12

• Mini-tutorial on distributed machine learning

• Details on CNTK’s parallel training

Page 13

• CNTK is Microsoft’s open-source, cross-platform toolkit for learning and evaluating deep neural networks.

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

Page 14
Page 15

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

Page 16

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)

h2 = σ(W2 h1 + b2)

P = softmax(Wout h2 + bout)

with input x ∈ R^M

Page 17

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)

h2 = σ(W2 h1 + b2)

P = softmax(Wout h2 + bout)

with input x ∈ R^M and one-hot label y ∈ R^J

and cross-entropy training criterion

ce = yᵀ log P

Σ_corpus ce = max

Page 18

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)             h1 = sigmoid (x @ W1 + b1)

h2 = σ(W2 h1 + b2)            h2 = sigmoid (h1 @ W2 + b2)

P = softmax(Wout h2 + bout)   P = softmax (h2 @ Wout + bout)

with input x ∈ R^M and one-hot label y ∈ R^J

and cross-entropy training criterion

ce = yᵀ log P                 ce = cross_entropy (P, y)

Σ_corpus ce = max

Page 19

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

h1 = sigmoid (x @ W1 + b1)

h2 = sigmoid (h1 @ W2 + b2)

P = softmax (h2 @ Wout + bout)

ce = cross_entropy (P, y)
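For readers who want to check the math, here is a minimal NumPy sketch of the same forward pass and criterion outside CNTK; the layer sizes and the helper functions sigmoid, softmax, and cross_entropy are illustrative stand-ins, not CNTK APIs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def cross_entropy(P, y):
    return -np.dot(y, np.log(P))         # y is a one-hot vector

M, H, J = 4, 8, 3                        # illustrative sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(M, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, H)), np.zeros(H)
Wout, bout = rng.normal(size=(H, J)), np.zeros(J)

x = rng.normal(size=M)                   # input x in R^M
y = np.eye(J)[1]                         # one-hot label y in R^J

h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, y)
print(ce)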

Page 20

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

[Computation graph: x, W1, b1 feed a "+" node and then σ to produce h1; h1, W2, b2 feed "+" and σ to produce h2; h2, Wout, bout feed "+" and softmax to produce P; P and the label y feed cross_entropy to produce ce.]

h1 = sigmoid (x @ W1 + b1)

h2 = sigmoid (h1 @ W2 + b2)

P = softmax (h2 @ Wout + bout)

ce = cross_entropy (P, y)

Page 21

“CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.”

[Computation graph repeated from the previous page.]

LEGO-like composability allows CNTK to support a wide range of networks & applications.

Page 22

• Select the right reader and learner to do LEGO-like training (a minimal loop sketch follows the diagram below)

• Describe the network as a computation network

• Use the Simple Network Builder, BrainScript, or Python to build the computation network

[CNTK training pipeline diagram: the corpus is consumed by a reader (task-specific deserializer, automatic randomization), evaluated by the network (network definition, CPU/GPU execution engine), and optimized by a learner (SGD with momentum, AdaGrad, ...; minibatching, packing, padding) to produce the model.]
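The training loop built from these pieces has the same LEGO-like shape in any framework. Below is a plain-Python sketch of the reader / network / learner roles; the class names and the toy linear model are illustrative only and are not CNTK's API.

import numpy as np

class Reader:
    """Task-specific deserializer with automatic randomization."""
    def __init__(self, features, labels, minibatch_size):
        self.features, self.labels, self.mb = features, labels, minibatch_size
    def minibatches(self):
        order = np.random.permutation(len(self.features))
        for start in range(0, len(order), self.mb):
            sel = order[start:start + self.mb]
            yield self.features[sel], self.labels[sel]

class Network:
    """Network definition plus execution engine (here: a linear softmax model on CPU)."""
    def __init__(self, dim, classes):
        self.W = np.zeros((dim, classes))
    def forward_backward(self, x, y):
        logits = x @ self.W
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        return x.T @ (probs - y) / len(x)        # gradient of the cross-entropy criterion

class Learner:
    """SGD with momentum."""
    def __init__(self, lr=0.1, momentum=0.9):
        self.lr, self.momentum, self.velocity = lr, momentum, 0.0
    def update(self, network, grad):
        self.velocity = self.momentum * self.velocity - self.lr * grad
        network.W += self.velocity

X = np.random.randn(256, 5)                      # toy corpus: 2 classes in 5 dimensions
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]
reader, net, learner = Reader(X, Y, 32), Network(5, 2), Learner()
for epoch in range(5):
    for xb, yb in reader.minibatches():
        learner.update(net, net.forward_backward(xb, yb))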

Page 23

• FCN-5/-8 – fully connected networks with different numbers of layers.

• AlexNet/ResNet – typical convolutional networks for image classification / recognition.

• LSTM-32/-64 – an LSTM model (two LSTM layers) with different batch sizes.

• An iteration here means one (mini-batch) update.

• Caffe doesn't support RNN/LSTM computation.

• Experiments run on a Titan X GPU.

[Chart: Speed comparison for single-GPU training, measured in iterations per second (higher = better), for FCN-5, FCN-8, AlexNet, ResNet, LSTM-32, and LSTM-64 across Caffe, CNTK, MXNet, TensorFlow, and Torch.]

• CNTK has 1-bit SGD enabled.

• All experiments use the FCN-5 model with a synchronous data-parallel algorithm on K40 GPUs connected by InfiniBand.

[Chart: Speed comparison for parallel training, measured in iterations per second (higher = better), for FCN-5 on 1, 2, 4, and 8 GPUs (2 nodes) across Caffe, CNTK, TensorFlow, and Torch.]

http://github.com/Microsoft/cntk

• CNTK is fast for single-GPU training, and its speed is especially outstanding for RNN training. CNTK is also the best among all these DNN tools in terms of scalability.

Page 24

• Quick revisit on CNTK

• Mini-tutorial on distributed machine learning

• Details on CNTK’s parallel training

Page 25

1. Partition the training data.

2. Parallel training on different machines.

3. Synchronize the local updates.

4. Refresh the local model with the new parameters; go to 2.
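A minimal NumPy sketch of this synchronous data-parallel loop, with the K workers simulated in-process and a toy least-squares model standing in for a real network:

import numpy as np

def local_gradient(w, x, y):
    return 2 * x.T @ (x @ w - y) / len(x)        # least-squares gradient on one data partition

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
K, lr = 4, 0.01
parts = np.array_split(np.arange(len(X)), K)     # 1. partition the training data
w = np.zeros(10)                                 # the global model

for step in range(100):
    grads = [local_gradient(w, X[p], Y[p]) for p in parts]   # 2. parallel training (simulated)
    w -= lr * np.mean(grads, axis=0)                          # 3. synchronize the local updates
    # 4. every worker starts the next step from the refreshed global model w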

Page 26

BSP: Bulk Synchronous Parallel

[Diagram: workers 1-4 each push their update Δω_i to the global model ω_t and wait at a synchronization barrier before continuing.]

Individual workers synchronize with each other every k mini-batches:

1) Aggregate the Δω_i's from all workers to refine the global model ω.

2) Broadcast the global model ω back to each worker.

3) After receiving the new global model, each worker starts the next step of training.

(BSP is named after Leslie G. Valiant's Bulk Synchronous Parallel model.)

SSP: Stale Synchronous Parallel

[Diagram: no synchronization until the staleness threshold is hit; e.g., with worker 1 at iteration #2 and worker 3 at iteration #5 and staleness = 3, worker 3 waits for worker 1 to catch up.]

Each worker pushes its update Δω_i to the global model ω every k mini-batches, until it notices that another worker is s steps behind; SSP thus trades off between BSP and ASP (described below):

1) When s = 0, SSP = BSP.

2) When s = ∞, SSP = ASP.

• BSP is a well-defined mechanism, which can be equivalent to single-machine SGD under certain conditions. BSP has a convergence guarantee, but might be inefficient due to frequent synchronization.

• ASP always runs fast due to its asynchronous nature; no time is wasted on waiting. In theory, however, ASP might not converge when the differences between workers' progress are unbounded (a straggler can destroy convergence by pushing stale Δω onto the global model).

• SSP trades off efficiency and convergence: (1) it does not require strict synchronization; (2) it does not allow workers' progress to differ too much. SSP is proven to converge for convex losses and bounded staleness.


ASP: Asynchronous Parallel

[Diagram: workers 1-4 push their updates Δω_i to the global model ω_t independently; no synchronization.]

Each worker pushes its update Δω_i to the global model ω every k mini-batches, without waiting for the others:

1) Push the update Δω_i to the global model ω.

2) Pull back whatever global model is currently in the parameter server.

3) Proceed with training based on the latest ω on the local machine.
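To make the three modes concrete, here is a small single-process simulation of the worker clocks; the ParameterServer class and the scheduling loop are illustrative simplifications, not any particular system's API.

import random

class ParameterServer:
    """In-process stand-in for a parameter server: a scalar model plus per-worker clocks."""
    def __init__(self, n_workers):
        self.model = 0.0
        self.clock = [0] * n_workers
    def push(self, worker_id, delta):
        self.model += delta                      # apply the worker's update to the global model
        self.clock[worker_id] += 1               # the worker has finished one more iteration

def can_proceed(mode, ps, worker_id, staleness):
    if mode == "BSP":
        # Barrier semantics: start iteration t+1 only when every worker has finished iteration t.
        return ps.clock[worker_id] == min(ps.clock)
    if mode == "SSP":
        # Bounded staleness: stay within `staleness` iterations of the slowest worker.
        return ps.clock[worker_id] - min(ps.clock) < staleness
    return True                                  # ASP: never blocks

random.seed(0)
for mode in ("BSP", "ASP", "SSP"):
    ps = ParameterServer(n_workers=4)
    for _ in range(200):                         # a scheduler picks a random worker each tick
        w = random.randrange(4)
        if can_proceed(mode, ps, w, staleness=3):
            ps.push(w, delta=0.01)               # stand-in for pushing a real gradient update
    print(mode, "worker clocks:", ps.clock)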

Page 27

A system approach

• The global model is partitioned into K sub-models without overlap.

• The sub-models are distributed over K local workers and serve as their local models.

• In each mini-batch, the local workers compute the gradients of their local weights by back-propagation.
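A toy NumPy illustration of this partitioning for a single linear layer, with the output dimension split across K simulated workers; the layer, loss, and sizes are placeholders chosen for brevity.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, K = 16, 8, 2
W = rng.normal(size=(d_in, d_out))
sub_models = np.array_split(W, K, axis=1)        # partition the model into K non-overlapping sub-models

x = rng.normal(size=(32, d_in))                  # one mini-batch, visible to every worker
target = rng.normal(size=(32, d_out))
target_parts = np.array_split(target, K, axis=1)

lr = 0.01
for k in range(K):                               # each worker only touches its own slice of the weights
    out_k = x @ sub_models[k]                                  # local forward pass
    grad_k = x.T @ (out_k - target_parts[k]) / len(x)          # local back-propagation (squared loss)
    sub_models[k] -= lr * grad_k                               # local update of the local weights

W_updated = np.concatenate(sub_models, axis=1)   # the sub-models together form the global model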

Page 28

[Diagram: a taxonomy of distributed ML systems, organized by synchronization (synchronous vs. asynchronous) and by parallelism (data parallelism, model parallelism, irregular parallelism): Iterative MapReduce (LDA, LR) offers synchronous data parallelism; Parameter Server (deep learning, LDA, GBDT, LR) adds asynchronous updates and model parallelism; Dataflow systems (deep learning) support irregular parallelism.]

Page 29

Iterative MapReduce

• Use MapReduce / AllReduce to sync parameters among workers.

• Only synchronous updates.

• Example: Spark and other derived systems.

[Diagram: local computation on each worker followed by a synchronous update.]
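As an illustration of the AllReduce pattern, the sketch below uses mpi4py (chosen only as an example; it is not part of the toolchain discussed here) to average gradients across workers so that every rank applies the same synchronous update.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each worker computes a gradient on its own data partition (random stand-in here).
local_grad = np.random.default_rng(rank).normal(size=10)

# AllReduce: every worker ends up with the sum of all local gradients.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size                  # average, so every worker applies the same synchronous update

# Run with, e.g.:  mpiexec -n 4 python allreduce_example.py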

Page 30

Iterative MapReduce → Parameter Server

• A parameter server (PS) based solution is proposed to support:

• Asynchronous updates

• Different mechanisms for model aggregation, especially in an asynchronous manner

• Model parallelism

• Examples: Google's DistBelief, Petuum, Multiverso PS

+ NIPS'12 DistBelief (Google), NIPS'13 Petuum (Eric Xing), OSDI'14 Parameter Server (Mu Li), Multiverso PS, etc.
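A minimal in-process sketch of the parameter-server interface, i.e., workers pushing deltas to and pulling fresh copies of named parameter shards; the class and method names are illustrative and not the API of Multiverso, DistBelief, or Petuum.

import numpy as np

class ParameterServer:
    """Holds the global model as named shards; workers push deltas and pull fresh copies."""
    def __init__(self, shapes):
        self.params = {name: np.zeros(shape) for name, shape in shapes.items()}
    def push(self, deltas):
        for name, delta in deltas.items():       # asynchronous-friendly: apply whatever arrives
            self.params[name] += delta
    def pull(self):
        return {name: p.copy() for name, p in self.params.items()}

def worker_step(ps, lr=0.1, seed=0):
    model = ps.pull()                            # get the latest global model
    grads = {name: np.random.default_rng(seed).normal(size=p.shape)   # stand-in gradients
             for name, p in model.items()}
    ps.push({name: -lr * g for name, g in grads.items()})             # send the update back

ps = ParameterServer({"W": (10, 2), "b": (2,)})
for step in range(3):
    worker_step(ps, seed=step)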

Page 31

Iterative MapReduce → Parameter Server → Dataflow

• A dataflow-based solution is proposed to support:

• Irregular parallelism (e.g., hybrid data and model parallelism), particularly in deep learning

• Both high-level abstraction and low-level flexibility in implementation

• Example: Google's TensorFlow

+ TensorFlow, EuroSys'07 Dryad (Microsoft), NSDI'12 Spark (AMP Lab), CNTK, MXNet, etc.

Task scheduling & execution based on:

1. Data dependency

2. Resource availability

[Diagram: the dataflow graph is mapped onto the available resources.]
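A small sketch of dependency-driven scheduling on a toy dataflow graph: a task becomes runnable as soon as its inputs are done, and each step runs as many ready tasks as there are free devices. The graph, task names, and device pool are all illustrative.

# Toy dataflow graph: node -> list of upstream dependencies.
graph = {
    "read": [],
    "augment": ["read"], "embed": ["read"],      # two independent branches can run in parallel
    "concat": ["augment", "embed"],
    "loss": ["concat"], "update": ["loss"],
}

def schedule(graph, n_devices=2):
    """Greedy scheduler: each step runs up to n_devices tasks whose dependencies are satisfied."""
    done, steps = set(), []
    while len(done) < len(graph):
        ready = [n for n in graph if n not in done and all(d in done for d in graph[n])]
        batch = ready[:n_devices]                # limited by resource availability
        steps.append(batch)
        done.update(batch)
    return steps

for i, batch in enumerate(schedule(graph)):
    print("step", i, "runs", batch)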

Page 32

• Computation infrastructure: the parallel execution engine (based on gRPC) — message packing, communication, etc.

• Computation allocation method: manually assigning computation to devices vs. automatically computing an optimized allocation.

• Parameter aggregation logic: inherits all the research output from the parameter-server side.

Page 33

The focus is still on data parallelism:

+ the data is huge

+ the model is of modest size

+ a cluster of machines works together to speed up training over the data partitions

Page 34

• Quick revisit on CNTK

• Mini-tutorial on distributed machine learning

Page 35

[Chart (repeated): Speed comparison for parallel training, measured in iterations per second (higher = better), for FCN-5 on 1, 2, 4, and 8 GPUs (2 nodes) across Caffe, CNTK, TensorFlow, and Torch.]

Page 36
Page 37

• Communicate less each time

• Communicate less often

• Asynchronous and Pipelined processing

https://github.com/Microsoft/CNTK/wiki/Multiple-GPUs-and-machines

Page 38

• Quantize gradients to just 1 bit per value, with error feedback: each parameter's gradient is reduced to plus or minus a single shared value, and the remaining quantization error is carried over to the next minibatch.

• Delay the update procedure.

[Chart: Transferred gradient size in bits per value (smaller is better) — 1-bit vs. 32-bit float.]

1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, InterSpeech 2014, F. Seide, H. Fu, J. Droppo, G. Li, D. Yu
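A hedged NumPy sketch of the quantize-with-error-feedback idea for one gradient tensor; using the mean magnitude as the shared value is one common choice, and the details here are illustrative rather than CNTK's exact scheme.

import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize grad + residual to one shared magnitude per tensor; return (quantized, new residual)."""
    g = grad + residual                          # error feedback: add back what was lost last time
    scale = np.mean(np.abs(g))                   # a single shared value for the whole tensor
    quantized = np.where(g >= 0, scale, -scale)  # plus/minus the same value, i.e. 1 bit per element
    return quantized, g - quantized              # carry the quantization error to the next minibatch

rng = np.random.default_rng(0)
residual = np.zeros(8)
for step in range(3):
    grad = rng.normal(size=8)                    # stand-in for a real minibatch gradient
    q, residual = one_bit_quantize(grad, residual)
    print("step", step, "transferred magnitude:", round(float(np.abs(q[0])), 4))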

Page 39
Page 40

K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016

Page 41

Block Momentum

K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016
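A hedged sketch of the blockwise model-update filtering step as described in the Chen & Huo paper: after each data block the K workers' models are averaged, and the resulting update is filtered with block-level momentum. The hyper-parameter values and the stand-in local training are placeholders.

import numpy as np

def bmuf_step(global_w, delta_prev, worker_models, block_momentum=0.9, block_lr=1.0):
    """One block-level update: average the workers' models, then filter the update with momentum."""
    w_avg = np.mean(worker_models, axis=0)               # result of intra-block parallel optimization
    g = w_avg - global_w                                 # model update generated by this data block
    delta = block_momentum * delta_prev + block_lr * g   # blockwise model-update filtering
    return global_w + delta, delta

rng = np.random.default_rng(0)
w, delta = np.zeros(6), np.zeros(6)
for block in range(3):
    worker_models = [w + 0.1 * rng.normal(size=6) for _ in range(4)]   # stand-in for local training
    w, delta = bmuf_step(w, delta, worker_models)
# Workers would start the next block from the refreshed (optionally momentum-shifted) global model w.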

Page 42

[Chart: 1-bit / BMUF speedup factors in LSTM training on 4, 8, 16, 32, and 64 GPUs, with average and peak speedups for both 1-bit SGD and BMUF; the largest reported speedup is 54.0× at 64 GPUs.]

K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” ICASSP 2016

Page 43
Page 44

Asynchronous and Pipelined processing

[Diagram: the communication barrier in synchronous parallelization vs. asynchronous parallelization.]

Page 45

Asynchronous and Pipelined processing

[Diagram: the main (training) thread and a communication thread run in parallel; while the main thread trains on one minibatch, the communication thread prepares updates, prepares the model, and fills a GPU buffer, and the two buffers are then switched so that training does not wait for communication.]
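A hedged Python sketch of the double-buffering idea: one thread trains on the active buffer while a communication thread fills the spare buffer, and the two are switched between minibatches. All names and the stand-in work functions are illustrative.

import threading
import numpy as np

buffers = [np.zeros(4), np.zeros(4)]             # two GPU-buffer stand-ins
active = 0                                       # index of the buffer the training thread is using

def communicate(target_buffer, step):
    """Communication thread: exchange updates and fill the spare buffer with a fresh model copy."""
    target_buffer += np.random.default_rng(step).normal(size=4) * 0.01   # stand-in for push/pull

for step in range(5):
    spare = 1 - active
    comm = threading.Thread(target=communicate, args=(buffers[spare], step))
    comm.start()                                 # communication overlaps with training
    _ = buffers[active] * 2.0                    # stand-in for training one minibatch on the active buffer
    comm.join()                                  # wait only at the switch point
    active = spare                               # switch buffers: train next on the freshly filled one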

Page 46

• Sequential SGD:  w_{t+τ+1} = w_{t+τ} − η · g(w_{t+τ})

• Async SGD:  w_{t+τ+1} = w_{t+τ} − η · g(w_t)

Delayed communication in asynchronous parallelization: the gradient is computed on a model that is τ steps stale.

Page 47

ASGD:  w_{t+τ+1} = w_{t+τ} − η · g(w_t)

DC-ASGD:  w_{t+τ+1} = w_{t+τ} − η · ( g(w_t) + λ · 𝜙(g(w_t)) ⊙ (w_{t+τ} − w_t) ),
where 𝜙(g) = g ⊙ g serves as a cheap approximation of the Hessian information needed to compensate for the delay.

[Figure: training curves for a ResNet model on CIFAR-10.]

A work that directly targets the delay; it is experimentally effective and comes with a convergence analysis.
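A hedged NumPy sketch of the delay-compensated update, following the formula above with 𝜙(g) = g ⊙ g; λ, the learning rate, and the simulated delay are illustrative.

import numpy as np

def dc_asgd_update(w_current, w_snapshot, grad_at_snapshot, lr=0.1, lam=0.04):
    """Apply a delayed gradient g(w_t) to the current model w_{t+tau} with delay compensation."""
    compensation = lam * grad_at_snapshot * grad_at_snapshot * (w_current - w_snapshot)
    return w_current - lr * (grad_at_snapshot + compensation)

rng = np.random.default_rng(0)
w = rng.normal(size=4)
w_snapshot, delayed_grad = w.copy(), w.copy()    # toy loss 0.5*||w||^2, so g(w_t) = w_t
for _ in range(3):
    w = w - 0.01 * rng.normal(size=4)            # other workers move the global model in the meantime
w = dc_asgd_update(w, w_snapshot, delayed_grad)
print(w)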

Page 48
Page 49
Page 50
Page 51

• CNTK is Microsoft's open-source, cross-platform toolkit for learning and evaluating deep neural networks.

• Linux, Windows, Docker, .NET

• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

• automatic differentiation, deferred computation, optimized execution and memory use

• powerful description language, composability

• implicit time; efficient static and recurrent NN training through batching

• data parallelization across GPUs & servers: 1-bit SGD, Block Momentum

• feed-forward DNN, RNN, LSTM, convolution, DSSM; speech, vision, text

• CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

Page 52

• Web site: https://cntk.ai/

• Github: https://github.com/Microsoft/CNTK

• Wiki: https://github.com/Microsoft/CNTK/wiki

• Issues: https://github.com/Microsoft/CNTK/issues

mailto:[email protected]

CNTK: democratizing the AI tool chain