
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Dheevatsa Mudigere†‡, Yuchen Hao†‡, Jianyu Huang‡, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie (Amy) Yang, Leon Gao, Dmytro Ivchenko,

Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov,

Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao

Facebook, Menlo Park, USA

Abstract

Deep learning recommendation models (DLRMs) have been used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations using a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism for optimizing communications for large-scale DLRM training. Our evaluation on 128 GPUs using 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× for training 12-trillion-parameter DLRM models deployed in production.

1 Introduction

Deep learning recommendation models (DLRMs) are ubiquitously used by online companies, including Amazon for

[email protected], †[email protected]
‡These authors contributed equally

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM . . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

Figure 1. Comparing deep learning models in total amount of compute, in petaflop/s-days (top) [44], and model capacity, in billions of parameters (bottom).

selecting items in its catalog [34, 36, 58], Netflix for showing movie options [12, 27], and Google for displaying personalized advertisements [6, 8, 17]. They have also been adopted by standard benchmarking organizations, such as MLCommons (MLPerf) [37, 51]. At Facebook, we have been


using recommendation models extensively for ranking and click through rate (CTR) prediction, including news feed and search services [13, 15, 41, 46]. DLRMs are the single largest AI application in terms of infrastructure demand in data centers.

Unlike conventional deep neural networks (DNNs) with mainly compute-intensive operators (e.g., convolution and matrix multiplication), DLRMs combine compute-intensive components with up to thousands of data-intensive embedding operators, each with a different resource requirement and performance characteristic [42]. As a result, DLRMs generally exhibit much lower arithmetic intensity and larger model sizes compared to their computer vision [7, 16, 28, 56, 59], natural language processing [4, 9, 61], and reinforcement learning counterparts [54, 55], with models having trillions of parameters being deployed in practice, as shown in Figure 1. Existing software and hardware solutions tailored for DNNs achieve only suboptimal performance and limited scalability for DLRM training due to the following software/hardware limitations.

On the software side, existing deep learning frameworks parallelize DNN training typically using either data, model or pipeline parallelism [3, 31, 47]. Frameworks that support combinations of these strategies are generally designed for specific DNN applications [14, 20, 40, 49]. However, existing parallelization strategies designed and optimized for compute-intensive DNN models achieve limited performance and scalability for DLRMs. In particular, data parallelism requires each device to save a replica of the entire model and therefore does not support DLRMs with up to trillions of parameters [31]. Moreover, a DLRM cannot be directly parallelized using model or pipeline parallelism due to the data-dependent behavior of its embedding operators. Specifically, processing different training samples may require accesses to different embedding parameters depending on the categorical inputs of each sample. This data-dependent behavior makes it infeasible to statically partition a DLRM's trainable parameters into disjoint subsets while satisfying data dependencies for all training samples, a necessity for employing model and pipeline parallelism.

In addition, today's DNN frameworks are designed and optimized for compute-intensive DNN computations and miss critical optimizations for data-intensive embedding operators. Specifically, DLRM models contain up to thousands of embedding operators. The forward processing, backward propagation, and gradient synchronization for these embedding operators require launching thousands of CUDA kernels in a training iteration and consume up to terabytes of aggregated GPU device memory, introducing significant runtime overheads and memory requirements.

On the hardware side, modern hardware platforms such as GPU-based clusters provide a significant capability boost for DNN models, but they are not designed to match the performance characteristics of DLRMs. Specifically, hardware platforms for DNN training are generally optimized for centralized inter-node communications (e.g., parameter servers [3]) and/or AllReduce communications (e.g., Horovod [53] and NCCL [1]). However, as identified in Section 3, performant and scalable DLRM training requires efficient hardware support for a mixture of diverse communication patterns, including AllReduce, AlltoAll, ReduceScatter, OneToMany, and ManyToMany.

1.1 Our Approach

We present Neo, a software-hardware co-designed system for fast and scalable DLRM training, building on top of three key techniques.

4D parallelism. To enable fast and scalable training of the massive embedding operators in DLRMs, it is crucial to effectively balance the workload distribution across GPUs while minimizing communication costs. We introduce a 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism to jointly optimize the parallelization performance of embedding operators. Additionally, Neo also supports applying 4D parallelism in a recursive manner at different levels of the hardware hierarchy to further improve load balance and hardware efficiency.

High-performance embedding computation. Neo employs two novel optimizations to minimize the computational costs and memory requirements of embedding operators. First, we introduce a hybrid kernel fusion technique that fuses (1) multiple embedding operators and (2) embedding computations and their parameter updates all in a single CUDA kernel. This is realized by co-designing the optimization algorithms and software implementation of embedding operators. Second, to provide sufficient memory capacity for DLRM training, Neo uses a software-managed caching mechanism to leverage the memory hierarchy of modern hardware platforms. Finally, a variety of compression techniques are further applied to minimize memory requirements.

Hardware platform design. We introduce ZionEX, a new hardware platform co-designed with Neo's 4D parallelism to optimize inter-node communications for distributed DLRM training. ZionEX supports a fully-connected topology across all GPUs in the cluster by using a dedicated RDMA over Converged Ethernet (RoCE) based scale-out network. This topology design promotes high-performance data transfers for the performance-dominating AlltoAll and ManyToMany communications in distributed DLRM training. Meanwhile, ZionEX supports both the RDMA and GPUDirect communication protocols and retains a flexible intra-node GPU fabric. This enables high-performance DLRM training on ZionEX, while ensuring compatibility with existing data-center infrastructure to allow wide deployment of ZionEX.


Table 1. Sample DLRM time-to-train resource demands

Total compute: 1+ PF/s
Total memory capacity: 1+ TB
Total memory BW: 100+ TB/s
Network injection BW per worker: 100+ GB/s
Network bisection BW: 1+ TB/s

Results. We have evaluated Neo on three DLRM models deployed in production for different tasks, including click through rate prediction, ranking, and engagement. These DLRM models represent a diverse set of production-level recommendation models. Our evaluation on 128 A100 GPUs on 16 ZionEX nodes shows that Neo is able to process up to 1.7 million queries per second for training DLRMs with 12 trillion parameters, a 40× speedup compared to existing solutions for DLRM training in production. Ablation studies show that 4D parallelism, high-performance embedding computation, and the new ZionEX platform are all critical to enabling fast and scalable DLRM training. To summarize, our contributions are:

  • We present Neo, a software-hardware co-designed system for fast and scalable training of DLRMs. Neo outperforms existing systems by up to 40× for training large-scale DLRMs with 12 trillion parameters.

  • We propose 4D parallelism, a combination of table-wise, row-wise, column-wise, and data parallelism for training embedding operators.

  • We develop and implement high-performance embedding operators using hybrid kernel fusion, software-managed caching, and quality-preserving compression.

  • We build ZionEX, a new hardware platform co-designed with Neo's 4D parallelism to accelerate AlltoAll and ManyToMany communications in DLRM training.

2 Background

DLRMs typically have two modes of training, offline and online, each with varying requirements. Offline training can be viewed more as pre-training, where a candidate model is trained on sufficiently large historical data and is expected to generalize when deployed to current/unseen samples. Once deployed, DLRMs continue to be trained in an online mode on the data they have already served. Offline training is throughput limited, fitting into the more conventional "train as fast as possible on as much data as possible" paradigm, whereas online training is more latency sensitive, with the frequency of re-training and update being an important factor. For online training, the throughput requirement is lower, hence it might be desirable to use proportionally fewer resources. This creates a unique requirement of training very large models at smaller scales while tolerating lower throughput.

Figure 2. Disaggregated parameter-server based system

This paper focuses on offline training with more demanding training throughput needs: up to millions of samples (queries) per second, resulting from processing through tens of petabytes of training data within a reasonable time. This drives the training platform requirements, as summarized in Table 1.

Traditionally, a disaggregated parameter-server (PS) based distributed CPU training system has been used for training DLRMs in a production setting [15, 41]. Specifically, the dense parameters from the MLP modules are duplicated between the trainers to exploit data parallelism, and their weights are synchronized with a centralized dense parameter server using the Elastic Averaging SGD method [66, 69]. On the other hand, the parameters from the embedding tables are partitioned and placed on multiple PSs to exploit model parallelism, since the size of the embedding parameters simply prevents model replication. To maximize training throughput, the parameters of embedding operators are updated using Hogwild! [50]. In addition, the readers are deployed on a separate tier of machines to feed training batches to the trainers, as illustrated in Fig. 2.

Such a PS-based system is well suited for DLRMs, allowing different components to be scaled separately and achieving balanced resource utilization when training different models with different trainer, parameter server and reader configurations. Moreover, resources in the system are largely fungible, making it low-cost for datacenter operations.

However, the need to support DLRMs with trillions of parameters, and therefore terabytes in size, poses a serious challenge to the scalability of this approach, necessitating a steeply growing number of trainers and parameter servers to meet the ever-growing training requirements. This quickly becomes intractable, degrading model accuracy with staleness due to increased asynchronous updates across a very large number of workers. To tackle these issues, we build a high-performance synchronous training solution for large DLRMs, decoupling distributed scaling from statistical quality.


Figure 3. Neo overview. Each box in the figure indicates a neural network component, while edges between boxes are tensors shared between different components.

The efficient design of the synchronous training system leads us to use a novel combination of 4D parallelism (Section 4) for memory-intensive embedding tables, data parallelism for compute-intensive DNN operators, and pipelining across different components. This hybrid parallelism requires AlltoAll communications for the embedding lookup results [41, 42], as well as embedding table input redistribution if the inputs are streamed from a database in batches, which is often the case. Unlike AllReduce communications for gradient synchronization, which can be overlapped, these AlltoAll communications are on the critical path due to data dependencies, stressing the performance of the interconnect and communication primitives. Furthermore, DLRMs are typically trained on very large amounts of data, which correspond to mostly unstructured and unlabeled interactions from a wide variety of applications. Typical dataset sizes are in the range of several petabytes, necessitating the use of a common, distributed network store, such as the Tectonic filesystem [45]. For training, this data needs to be streamed in, putting additional stress on the host network and host-to-device bandwidths.

3 Overview

Figure 3 shows an overview of Neo, a software-hardware co-designed system for fast and scalable training of DLRMs. This section briefly describes the key components of Neo.

First, Neo uses data parallelism for training compute-intensive DNN layers (shown in orange) and switches to a 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for efficient training of memory-intensive embedding operators.

Second, Neo is equipped with a high-performance implementation for embedding operators. This is achieved by a number of critical systems optimizations, including (1) a hybrid kernel fusion technique to reduce the computational cost of embedding operators, (2) a software-managed caching mechanism to leverage heterogeneous memories of modern hardware platforms, and (3) a variety of quality-preserving compression techniques to minimize the memory requirement for embedding computation.

Finally, Neo is deployed on ZionEX, a new hardware platform co-designed with Neo's 4D parallelism to optimize inter-node communications for DLRM training.

Data I/O. The adoption of fully synchronous training and accelerators has a number of implications on data I/O. First, the host-to-device transfer should be non-blocking and fast enough not to limit the overall training throughput, ideally overlapping the input data transfers with training using double buffering or pipelining. Second, even though mapping input data distribution to collective communications between trainers is faster, this introduces additional challenges for the input and output data layout of the collective communications. Initial experiments show that these could add significant latency to the critical path. We will illustrate how we overcome these practical challenges in Section 7.1.
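As a concrete illustration of the double-buffering idea, the following minimal PyTorch sketch overlaps the host-to-device copy of batch i+1 with the training step on batch i, using pinned memory and a side CUDA stream; the loader and batch layout (a dict of CPU tensors) are hypothetical stand-ins rather than Neo's actual data pipeline.

import torch

def prefetching_loader(loader, device):
    """Yield GPU-resident batches, overlapping the host-to-device copy of
    batch i+1 with the training step on batch i (which runs in the caller
    between two yields)."""
    copy_stream = torch.cuda.Stream(device=device)
    ready = None
    for cpu_batch in loader:
        # Pinned host memory enables asynchronous DMA to the GPU.
        pinned = {k: v.pin_memory() for k, v in cpu_batch.items()}
        with torch.cuda.stream(copy_stream):
            # non_blocking keeps the copy off the default compute stream.
            gpu_batch = {k: v.to(device, non_blocking=True)
                         for k, v in pinned.items()}
        if ready is not None:
            yield ready            # caller trains on batch i while batch i+1 copies
        copy_stream.synchronize()  # ensure batch i+1 has fully arrived
        ready = gpu_batch
    if ready is not None:
        yield ready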

4 4D Parallelism

To enable performant training for embedding operators, it is crucial to effectively balance the workload distribution across GPUs and minimize communication costs. We introduce 4D parallelism, which combines table-wise, row-wise, column-wise, and data parallelism for jointly optimizing the parallelization performance of embedding operators.

Table-wise parallelism. The most straightforward parallelism scheme is partitioning and parallelizing multiple embedding tables across GPUs, as shown in Figure 4a. Table-wise parallelism does not further split embedding tables; therefore this scheme requires no additional handling of embedding table input indices or pooled embedding results, leading to optimal communication efficiency. However, table-wise parallelism cannot handle large embedding tables that exceed the memory capacity of a single GPU, and the achieved load balance is often limited due to the skew in table sizes.

Row-wise parallelism. This scheme parallelizes large embedding tables by rows, assigning different table shards to different trainers. Since the embedding table inputs index tables by rows, they need to be bucketized based on the row-wise parallelism decision and distributed to the respective trainers, as illustrated in Figure 4b. Moreover, partial results on multiple trainers need to be reduced and then scattered to all trainers for downstream computations. This requires a ReduceScatter communication pattern in the forward pass. This scheme handles large tables well and leads to better load balance. However, the communication cost scales linearly with the number of trainers.


Figure 4. Embedding table sharding schemes ((a) table-wise parallelism, (b) row-wise parallelism, (c) column-wise parallelism, (d) data parallelism) with different implications on the communication cost, load balancing and memory requirement. Bottom MLP is omitted in this figure for simplicity of illustration.

Column-wise parallelism. Column-wise parallelism partitions the embedding tables along the embedding dimensions (see Figure 4c) and treats the partitioned tables with smaller embedding dimensions as individual operators. This scheme requires duplication of input indices for the partitioned tables. Compared with table-wise parallelism, it preserves the same flow and communication pattern (AlltoAll). A key advantage of column-wise parallelism is enabling finer-grained parallelism, especially for large tables. However, it works well only with large embedding dimensions and increases the payload for the input indices, which have to be replicated to all nodes with the column shards. Furthermore, since the rows of column-wise sharded tables are split across different trainers, using an independent row-wise update for these tables introduces additional parameters when using sparse optimizers: one for each shard of a row instead of just a single value for the entire row (see Section 5.1 for details).

Data parallelism. DLRM models tend to have a wide range of table sizes, while table-, row-, and column-wise parallelism are efficient for relatively large embedding tables that are prohibitively expensive to replicate. For smaller tables, data parallelism achieves better performance, since data parallelism does not involve any communication in the forward pass (see Figure 4d). Therefore, for small embedding tables, Neo treats them as dense parameters and replicates them across all trainers. AlltoAll is no longer needed for the pooled embeddings of data-parallel embedding tables; instead, AllReduce is required to synchronize across all replicas. As a result, the choice depends on the trade-off between the cost of AlltoAll of the pooled embeddings versus the cost of AllReduce on the entire table. In general, small embedding tables with fewer rows are good candidates for data parallelism. Input indices for these tables are passed through as data-parallel inputs and no longer require redistribution.

4.1 Parallelization Algorithms

Neo supports applying 4D parallelism strategies at the granularity of individual embedding operators to maximize flexibility. Practitioners can mix-and-match the above primitives to determine the best strategy to partition an embedding operator. Additionally, Neo also supports partitioning embedding operators in a recursive manner at different levels of the hardware hierarchy to further improve workload balance and hardware efficiency. For example, the table-wise then row-wise scheme first assigns a set of tables to a particular node, and within that node the tables are partitioned row-wise. This family of hierarchical parallelism schemes improves hardware locality by fully exploiting the fast GPU interconnects and reduces inter-node communications.

With a cost function defined for each of the above parallelism schemes, placement algorithms can be explored to minimize the cost differences between workers. We implement and evaluate two polynomial-time heuristics as a proof of concept. The first one is a simple greedy heuristic that sorts the costs of the available shards in descending order and allocates the largest shard first, one per worker. Then, the greedy algorithm iterates through all remaining shards and assigns the one with the top cost to the worker with the smallest sum of costs. A second heuristic is the largest differencing method (also known as the Karmarkar–Karp algorithm [24]). The main idea is to take the two largest numbers from the input and replace them by their difference. It directly reduces the difference of sums and generally outperforms the greedy heuristic.
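As an illustration, a minimal sketch of the greedy heuristic is shown below, assuming each shard has already been assigned a scalar cost; the shard representation and names are illustrative rather than Neo's actual implementation.

import heapq

def greedy_shard_placement(shard_costs, num_workers):
    """Assign shards to workers so that per-worker total cost stays balanced.
    shard_costs: list of (shard_id, cost) pairs.
    Returns a dict: worker_rank -> list of shard_ids."""
    # Sort shards by cost, largest first.
    ordered = sorted(shard_costs, key=lambda sc: sc[1], reverse=True)
    # Min-heap keyed on the accumulated cost of each worker.
    heap = [(0.0, rank) for rank in range(num_workers)]
    heapq.heapify(heap)
    placement = {rank: [] for rank in range(num_workers)}
    for shard_id, cost in ordered:
        total, rank = heapq.heappop(heap)   # worker with the smallest load so far
        placement[rank].append(shard_id)
        heapq.heappush(heap, (total + cost, rank))
    return placement

# Example: six embedding shards with skewed costs placed on four workers.
print(greedy_shard_placement([("t0", 9), ("t1", 7), ("t2", 5),
                              ("t3", 3), ("t4", 2), ("t5", 1)], 4))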

4.2 Pipelining

Although using GPUs as the main compute resource offers limited pipelining opportunities within model evaluation, we improve GPU utilization by pipelining inter-batch data movement and overlapping communication with computation.

When batch i is being evaluated, the same GPUs can start receiving and distributing batch i+1 using a separate stream. To minimize interference, we overlap the input AlltoAll of batch i+1 with the forward propagation of the top MLP of batch i, where no communication is involved. In addition, we overlap the pooled embedding AlltoAll with the forward propagation of the bottom MLP to hide latency.
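The following simplified sketch illustrates this inter-batch overlap with a side CUDA stream; all module and function names (bottom_mlp, dist_input_alltoall, and so on) are placeholders for the real components rather than Neo's API.

import torch

comm_stream = torch.cuda.Stream()

def train_step_with_overlap(batch_i, batch_i_plus_1, bottom_mlp, top_mlp,
                            interact, dist_input_alltoall, lookup_embeddings):
    # Batch i: dense and embedding paths on the default (compute) stream.
    dense_out = bottom_mlp(batch_i["dense"])
    emb_out = lookup_embeddings(batch_i["sparse"])

    # Kick off the input redistribution of batch i+1 on the side stream;
    # it proceeds while the compute stream evaluates interaction + top MLP.
    with torch.cuda.stream(comm_stream):
        redistributed_next = dist_input_alltoall(batch_i_plus_1["sparse"])

    logits = top_mlp(interact(dense_out, emb_out))

    # Make sure batch i+1's inputs have arrived before they are consumed.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return logits, redistributed_next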

5 Embedding Operators

A major difference between DLRM models and conventional deep neural networks is leveraging categorical features such as users, posts, or pages. The DLRM models used in production typically contain up to thousands of categorical features, each of which corresponds to a dedicated embedding operator. Optimizing the runtime performance of DLRM's embedding operators requires addressing two key challenges. First, the forward processing, backward propagation, and gradient updates for the embedding operators require launching thousands of GPU kernels in each training iteration, introducing significant GPU kernel launch overhead. Second, some embedding operators may include up to billions of parameters and do not fit in the device memory of a single GPU.

We introduce three novel techniques to reduce the computational cost and memory requirement of embedding operators. First, we introduce a hybrid kernel fusion technique to minimize the CUDA kernel launch overhead and allow each GPU worker to only launch two kernels (i.e., one for the forward pass and one for the backward pass and parameter update). Second, for parallelizing the computation of the embedding operators, we propose column-wise parallelism and row-wise parallelism in addition to data and model parallelism. The combinations of these four parallelism dimensions enable Neo to support embedding tables with up to trillions of parameters. Finally, Neo exploits a series of memory-saving techniques that leverage the memory hierarchy of the ZionEX platform to ensure sufficient memory capacity for DLRM.

5.1 Kernel Fusion

Neo uses a hybrid kernel fusion mechanism to minimize the CUDA kernel launch overhead for performing embedding computations in a training iteration. First, instead of applying a separate embedding lookup for each embedding table, Neo fuses multiple embedding lookups on the same GPU into a single CUDA kernel (Figure 5b), which improves the parallelism and bandwidth utilization and reduces the overhead of launching multiple CUDA kernels on GPUs.

Second, Neo also fuses the backward pass with the sparse optimizer to further reduce kernel launch overhead and avoid materializing gradients to the embedding tables.
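The forward-pass fusion can be illustrated with the following conceptual PyTorch sketch, which stacks all tables into a single weight tensor and serves every lookup with one embedding_bag call; this is a simplified stand-in for the FBGEMM GPU kernels, not their actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedEmbeddingTables(nn.Module):
    """All tables live back to back in one weight tensor so a whole batch of
    lookups across tables is served by a single pooled lookup call."""
    def __init__(self, rows_per_table, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(sum(rows_per_table), dim) * 0.01)
        starts = torch.tensor([0] + list(rows_per_table[:-1])).cumsum(0)
        self.register_buffer("table_starts", starts)  # row offset of each table

    def forward(self, indices_per_table, lengths_per_table):
        # indices_per_table[t]: 1-D int64 row ids for table t (batch concatenated)
        # lengths_per_table[t]: 1-D int64 bag sizes for table t (one per sample)
        shifted = torch.cat([idx + self.table_starts[t]
                             for t, idx in enumerate(indices_per_table)])
        lengths = torch.cat(lengths_per_table)
        offsets = torch.cat([lengths.new_zeros(1), lengths.cumsum(0)[:-1]])
        # One pooled (sum) lookup for all tables; output shape is (T * B, dim).
        return F.embedding_bag(shifted, self.weight, offsets, mode="sum")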

Figure 5. Embedding operator optimizations: (a) embedding computation workflow, (b) fusing multiple embedding table lookups, (c) fusing embedding backward and sparse optimizer.

The key challenge of such fusion is avoiding potential race conditions across gradient updates from different training samples and handling the non-linearity of advanced optimizers such as AdaGrad [10], LAMB [65], and Adam [25]. For example, both samples 1 and 2 in Figure 5a contribute to the gradients of embedding vectors 1 and 6. Directly sending these gradients to a non-linear sparse optimizer without aggregation would result in incorrect updates to the embedding tables.

To guarantee correctness while maximizing performance, Neo applies gradient sorting by rows so that gradients to the same embedding rows are processed by a single CUDA thread block, as shown in Figure 5c. Gradient aggregation is subsequently applied within each CUDA thread block using the much faster but smaller GPU shared memory.
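A host-side PyTorch sketch of this sort-then-aggregate step is shown below; the fused CUDA kernel performs the same logic on device, and the row-wise AdaGrad update here is only illustrative.

import torch

def fused_backward_rowwise_adagrad(weight, state_sum, rows, row_grads,
                                   lr=0.01, eps=1e-8):
    """weight: (N, D) embedding table; state_sum: (N,) row-wise AdaGrad state.
    rows: (K,) row ids touched this iteration (with duplicates across samples).
    row_grads: (K, D) per-sample gradients for those rows."""
    # Sort by row id so duplicates become contiguous (mirrors Figure 5c).
    sorted_rows, order = torch.sort(rows)
    sorted_grads = row_grads[order]
    # Aggregate gradients that target the same embedding row.
    unique_rows, inverse = torch.unique_consecutive(sorted_rows,
                                                    return_inverse=True)
    agg = torch.zeros(unique_rows.numel(), row_grads.size(1),
                      dtype=row_grads.dtype, device=row_grads.device)
    agg.index_add_(0, inverse, sorted_grads)
    # Row-wise AdaGrad: a single scalar state per embedding row.
    state_sum[unique_rows] += agg.pow(2).mean(dim=1)
    std = state_sum[unique_rows].sqrt().add(eps).unsqueeze(1)
    weight[unique_rows] -= lr * agg / std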


Neo's hybrid fusion technique for embedding operators leads to three performance benefits. First, Neo reduces the memory requirement for embedding operators by avoiding allocating GPU device memory for embedding gradients. Second, the memory accesses to GPU device memory are minimized by using GPU shared memory to save intermediate embedding gradients. Finally, kernel fusion improves the overall performance of embedding computations by up to 7× compared to a native implementation. The optimized embedding operator implementations are open sourced as part of the FBGEMM library∗ and integrated with PyTorch.

5.2 Managing Memory Hierarchy

For DLRM models with up to trillions of parameters, the embedding tables are too large to entirely fit on a single GPU. We leverage multiple levels of the memory hierarchy of the ZionEX platform, including HBM, DRAM and SSDs, to ensure sufficient memory capacity for the models, with the faster memory serving as a software cache for the subsequent layer. Neo's hierarchical memory management strategy is also useful for online training of DLRMs, which warrants using fewer nodes to train the original large models. One approach to managing the memory hierarchy is CUDA's unified memory (UVM) [43], which provides a single memory address space for different memory types and automatically replaces and evicts unused pages. However, random table lookups in embedding operators require caching and replacing unused parameters at the granularity of individual embedding rows, making UVM insufficient for DLRM. Instead, Neo uses a customized 32-way set-associative software cache [63] with least recently used (LRU) or least frequently used (LFU) cache replacement policies, where the associativity matches the warp size of GPUs. This enables fine-grained control of caching and replacement, allowing it to be tuned for target model characteristics. Note that UVM is bounded by PCIe bandwidth, while Neo's software cache can bridge the bandwidth gap between PCIe and HBM (a ~50× difference). The software cache improves the end-to-end performance of DLRM workloads by approximately 15% compared to UVM.

To further reduce the memory requirement of embedding operators, Neo also employs a variety of compression techniques, such as a row-wise sparse optimizer, low/mixed-precision training using a high-precision cache backed by low-precision embedding tables, and advanced factorization techniques. We omit the details of Neo's compression techniques due to space limits and will elaborate on them in the final paper.
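For illustration, a minimal host-side sketch of a set-associative software cache with LRU replacement within each set is shown below; Neo's actual cache is a CUDA implementation with 32-way sets matching the warp size, so all names and data layouts here are simplified assumptions.

import torch

class SetAssociativeCache:
    def __init__(self, num_sets, ways, dim, backing_rows):
        self.num_sets, self.ways = num_sets, ways
        self.tags = torch.full((num_sets, ways), -1, dtype=torch.long)  # cached row id per way
        self.lru = torch.zeros(num_sets, ways, dtype=torch.long)        # recency counter
        self.data = torch.zeros(num_sets, ways, dim)                    # cached rows (fast memory)
        self.backing = torch.zeros(backing_rows, dim)                   # slow memory (DRAM/SSD)
        self.clock = 0

    def lookup(self, row):
        """Return the cached copy of `row`, filling from the backing store on a miss."""
        self.clock += 1
        s = row % self.num_sets                       # set index
        hit = (self.tags[s] == row).nonzero()
        if hit.numel():                               # cache hit
            w = hit.item()
        else:                                         # miss: evict the LRU way in this set
            w = torch.argmin(self.lru[s]).item()
            if self.tags[s, w] >= 0:                  # write back the evicted row
                self.backing[self.tags[s, w]] = self.data[s, w]
            self.tags[s, w] = row
            self.data[s, w] = self.backing[row]
        self.lru[s, w] = self.clock
        return self.data[s, w]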

6 ZionEX: Hardware Platform Design

We begin by describing the limitations of our previous hardware platform for DLRM in Section 6.1. Section 6.2 introduces ZionEX, a new hardware platform for DLRM. We also outline the design principles used in the development of ZionEX.

∗PyTorch FBGEMM_GPU library: https://github.com/pytorch/FBGEMM/tree/master/fbgemm_gpu

Figure 6. The system architectures of the Zion (a) and ZionEX (b) platforms, and the overall training architecture (c).

6.1 Previous Platform: Zion

Zion [39], introduced in 2019, was designed as a high-performance hardware platform for training DLRMs. While Zion offers significantly improved capabilities at the single-node level, it falls short as a distributed platform, not being extensible to meet the rapidly growing DLRM training requirements. We critically appraise its limitations here, but other platforms based on a similar design share the same limitations; we discuss those platforms in Section 9.

Figure 6a shows the architecture of a Zion node, which has 8 CPU sockets with 1.5 TB memory, 8 GPUs, and 8 network interface cards (NICs). It provides a powerful heterogeneous super-node design for training DLRM by (1) offloading compute-heavy layers of DLRM (e.g., MLPs) onto GPUs and (2) leveraging CPUs for large embedding operators on the relatively cheaper DRAM, instead of HBM, for accommodating TB-scale DLRM models on a single node.


However, this heterogeneous design introduces a number of challenges to software design and performance. For example, it is critical to balance the workload on CPUs and GPUs to ensure maximum overlap. This requires elaborate pipelining between CPUs and GPUs and partitioning DLRM into fine-grained tasks using an accurate cost model. In addition, heterogeneous training of DLRM also introduces non-trivial runtime overheads, such as increased data transfers between CPUs and GPUs and inter-socket communication. Finally, a critical missing component in Zion is that each NIC is directly attached to a CPU. As a result, all of the inter-node communications (e.g., gradient synchronization and tensor transformation) necessitate CPU intervention and additional GPU-CPU transfers. Furthermore, these NICs are connected to the common shared data-center network infrastructure, which introduces overheads from network congestion and interference, and constrains them to more data-center-friendly topologies and protocols (TCP/IP) that are sub-optimal for distributed training.

Although each Zion node is equipped with 8× 100 Gbps NIC bandwidth, in reality we found it very difficult to scale out to multiple nodes due to networking overheads. With today's increasing DLRM model sizes, Zion is not able to scale well and fully utilize the powerful hardware resources.

6.2 ZionEX

To address these shortcomings, we introduce ZionEX, which we have designed to be more scalable than Zion with improved network capabilities, while retaining its flexibility and core advantages, such as the OAM form factor, modular design [39, 57], and flexible intra-node accelerator fabric [68]. Figure 6b shows the overall system architecture. We briefly highlight ZionEX's core design principles:

Scalability. Both Zion and ZionEX support heterogeneous training of DLRM, but the most striking difference is that ZionEX is designed with sufficient scale-up and scale-out network capabilities. As shown in Figure 6b, ZionEX employs a dedicated RDMA over Converged Ethernet (RoCE) NIC for each of the GPUs, connected via PCIe switches, to allow for dedicated inter-node connectivity (isolated from the common data-center network) and, importantly, to support more efficient RDMA/GPUDirect communication protocols [41]. These ZionEX nodes can be connected with a dedicated backend network to form a cluster for distributed scalable training. The extensible design of ZionEX allows for scaling the backend network to interconnect many thousands of nodes, forming a data-center scale AI training cluster.

High Performance. As a scale-out solution, we offload the entire DLRM to GPUs to fully leverage the massive parallelism and high memory bandwidth to accelerate MLPs and embedding computations. To transfer tensors and synchronize gradients, each GPU can communicate directly with GPUs on a different node through the dedicated low-latency, high-bandwidth RoCE NIC, without involving host CPUs. In addition, ZionEX also has a frontend NIC connected to each CPU. Data ingestion goes through the regular frontend network and PCIe, without interfering with activations or gradients. The host CPUs are only used to set up input batches and marshal the training process.

Capability. With ZionEX we ensure that the platform is compatible with existing infrastructure and can be widely deployed within our data centers, without causing major disruptions. This is critical for being able to effectively leverage the capability of the platform and make it readily available across a variety of applications and use cases. We achieve this by making the ZionEX platform compliant with the standard Open Rack specifications [2], which cover compatibility with other infrastructure components such as power, cooling, mechanicals and cabling. Furthermore, the platform is designed to be modular and relies on open, standards-based technologies, for instance the Ethernet-based network fabric for the high-performance scale-out solution.

Fig. 6c shows the overall training platform, along with the disaggregated data-ingestion service. This supports streaming input data from a network store such as Tectonic [45] and performing light-weight data pre-processing operations in a distributed fashion, so that data ingestion is not a bottleneck for end-to-end training and sufficient throughput is ensured in feeding the ZionEX trainers.

7 Implementation

We detail the implementation of the high-performance scalable training for DLRMs described above. We built a high-performance training software stack for DLRMs using PyTorch [47], with efficient CUDA implementations for most deep learning operators via the ATen library, and automatic handling of parameter replication and gradient synchronization with overlapped back-propagation and AllReduce via the PyTorch DistributedDataParallel library [31]. We have enabled the following components for efficient DLRM training.

7.1 Data ingestion

Data ingestion is a key component to ensure end-to-end training performance, especially for DLRMs, which typically process through order(s) of magnitude larger amounts of data than other typical DNN models. We observe that data ingestion, if left unoptimized, can incur significant latency and introduce non-trivial overheads to pipelining.

Originally designed for a distributed asynchronous CPU setup, our readers and data pre-processing module store the offsets and indices† of each sparse feature in separate tensors per embedding table. As a result, a DLRM with hundreds of embedding tables can easily get a thousand input tensors per iteration, which translates into significant overheads from CPU↔GPU transfers and was one of the key bottlenecks for the previous Zion platform, as detailed in Sec. 2.

†Please refer to the interface of nn.EmbeddingBag: https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

To overcome this practical challenge, we co-designed the data pre-processing module to use a combined format where lengths rather than offsets are used and inputs to different embedding tables are simply concatenated. The benefits of using the combined format are two-fold: (1) it optimizes CPU-GPU transfer by consolidating small transfers; (2) it can be directly consumed by the embedding kernel without additional layout transformations. We further optimized input data transfer by using pinned memory to avoid the extra copy.

With the combined format, we developed a module to efficiently distribute embedding table inputs based on the sharding strategy. In the case of table-wise sharding (shown in Fig. 4a), an AlltoAll is needed to distribute the global batch for local tables to each worker. Since the size of the indices is dependent on the values of the lengths, the communication is actually implemented as an AlltoAll for lengths followed by an AlltoAll for indices. In a setup with W workers, T local tables and local batch size B, this gives us indices in the order of (W, T, B), which need to be further permuted to (T, W, B) for embedding kernel consumption. We have developed custom GPU kernels for permute, bucketize and replicate to achieve maximum throughput on embedding input indices distribution for table-wise, row-wise and column-wise sharding schemes. Checkpointing the model has similar challenges: it must be frequent enough to write out such a large model while not becoming an overhead for training, as outlined in a recent paper [11].
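A simplified sketch of this two-phase redistribution using torch.distributed.all_to_all_single is shown below; it assumes an initialized NCCL process group and a lengths tensor laid out by destination worker, and the (W, T, B) to (T, W, B) permutation is shown for lengths only (the matching index gather is done by the custom GPU kernels mentioned above).

import torch
import torch.distributed as dist

def redistribute_sparse_inputs(lengths, indices, W, T, B):
    """lengths: (W*T*B,) bag sizes laid out as (W, T, B), destination worker first.
    indices: concatenated row ids matching `lengths`."""
    # Phase 1: exchange lengths (fixed-size AlltoAll, T*B entries per peer).
    recv_lengths = torch.empty_like(lengths)
    dist.all_to_all_single(recv_lengths, lengths)

    # Phase 2: exchange indices using splits derived from the lengths.
    in_splits = lengths.view(W, T * B).sum(dim=1).tolist()
    out_splits = recv_lengths.view(W, T * B).sum(dim=1).tolist()
    recv_indices = torch.empty(sum(out_splits), dtype=indices.dtype,
                               device=indices.device)
    dist.all_to_all_single(recv_indices, indices,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)

    # Received data is ordered (W, T, B); permute the lengths to (T, W, B) for
    # the embedding kernel (the indices need the matching gather, done on GPU).
    lengths_twb = recv_lengths.view(W, T, B).permute(1, 0, 2).contiguous()
    return lengths_twb, recv_indices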

7.2 Communication Primitives

High-performance collective communication is key to performant and scalable DLRM training. PyTorch provides the Process Group (PG) interface for collectives, an abstract platform/collectives-library agnostic API. DLRM uses this API directly (for Alltoall) or indirectly via DDP (for Allreduce) [31]. We use NVIDIA's Collective Communication Library (NCCL) as our primary collective communication library since it efficiently uses RDMA and NVLINK for best performance. We extended the PyTorch NCCL process group implementation to support Alltoall/Alltoallv collectives using NCCL Send/Recv primitives (requires NCCL 2.7.3 or later).
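A minimal usage sketch of these primitives is shown below: NCCL process group initialization, an AlltoAll on pooled embeddings (equal splits assumed), and DDP wrapping for the data-parallel dense MLPs. The helper names and the LOCAL_RANK environment variable are assumptions based on standard PyTorch launchers, not our exact training stack.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_collectives():
    # NCCL backend; AlltoAll via NCCL Send/Recv needs NCCL 2.7.3+.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def exchange_pooled_embeddings(pooled_local):
    # pooled_local: pooled embeddings for this rank's tables over the global
    # batch; equal per-peer splits are assumed for brevity.
    pooled_out = torch.empty_like(pooled_local)
    dist.all_to_all_single(pooled_out, pooled_local)
    return pooled_out

def wrap_dense(dense_mlp):
    # Dense MLPs stay data-parallel; DDP overlaps gradient AllReduce with backward.
    return DDP(dense_mlp.cuda(), device_ids=[torch.cuda.current_device()])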

8 Evaluation

We provide results for end-to-end training of production models, an operator-wise performance breakdown, and a performance roofline to establish the upper bound of achievable performance.

Figure 7. DLRM dependency graph

8.1 Performance modeling and benchmarking

In order to identify performance gaps and see how far we are from fully utilizing the platform capabilities, we establish the upper bound for achievable performance using an analytical roofline model. DLRMs can be broken down into the following major components: 1) bottom MLP; 2) embedding lookup and update; 3) AlltoAll communication of the model-parallel pooled embeddings; 4) interaction and top MLP; 5) AllReduce communication for the data-parallel MLP gradient synchronization. The execution dependencies between these different components are outlined in Fig. 7. As discussed above, each of these individually has different characteristics. The latency/performance of each component depends on different parts of the system; for instance, the embedding operators' performance depends on the achievable HBM bandwidth, whereas the MLP performance is bounded by the achievable compute flops. Even between the two collective communication primitives, AllReduce performance depends on both the scale-out and scale-up bandwidths, whereas AlltoAll performance primarily depends on the scale-out bandwidth. With estimates for the latencies of these individual components, the overall per-iteration latency can be estimated as shown in Eq. 1.

$$T_{fwd} = \max\big[BotMLP_{fwd},\ (Embedding\_lookup + alltoall_{fwd})\big] + Interaction_{fwd} + TopMLP_{fwd}$$

$$T_{bwd} = \max\big[TopMLP_{bwd} + Interaction_{bwd} + \max\{alltoall_{bwd} + Embedding\_update,\ BotMLP_{bwd}\},\ (TopMLP\_AllReduce + BotMLP\_AllReduce)\big]$$

$$T_{total} = T_{fwd} + T_{bwd} \qquad (1)$$
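Eq. 1 translates directly into the following helper, which estimates the per-iteration latency from benchmarked component latencies (all in the same time unit); the argument values would come from the operator-level benchmarks described below.

def iteration_latency(bot_mlp_fwd, bot_mlp_bwd, top_mlp_fwd, top_mlp_bwd,
                      emb_lookup, emb_update, interaction_fwd, interaction_bwd,
                      alltoall_fwd, alltoall_bwd,
                      top_mlp_allreduce, bot_mlp_allreduce):
    # Forward: bottom MLP overlaps with embedding lookup + forward AlltoAll.
    t_fwd = (max(bot_mlp_fwd, emb_lookup + alltoall_fwd)
             + interaction_fwd + top_mlp_fwd)
    # Backward: MLP AllReduce overlaps with the rest of the backward pass.
    t_bwd = max(top_mlp_bwd + interaction_bwd
                + max(alltoall_bwd + emb_update, bot_mlp_bwd),
                top_mlp_allreduce + bot_mlp_allreduce)
    return t_fwd + t_bwd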


Table 2. ZionEX per-node system specification.

Compute: 2.5 PFLOPS (FP32) / 20 PFLOPS (TF32) / 40 PFLOPS (FP16, BF16)
HBM: 320 GB, 12.4 TB/s
DDR: 1.5 TB, 320 GB/s
Scale-up bandwidth: 2.4 TB/s (uni-directional)
Scale-out bandwidth: 1600 Gbps (uni-directional)
Host NW: 4 × 100 Gbps

Table 3. Target model configurations

Model: model-A | model-F | model-I
Num parameters: 793B | 12T | 332B
MFLOPS per sample: 638 | 5 | 60
Num of emb tables: ≈1000s | ≈10s | ≈100s
Embedding table dims (range [min, max], avg): [4, 384], avg 93 | [256, 256], avg 256 | [92, 92], avg 92
Avg pooling size: 15 | 20 | 70
Num MLP layers: 20 | 7 | 43
Avg MLP size: 3375 | 490 | 682
Target local batch size: 512 | 512 | 2048
Achieved QPS: 1.2M | 1.7M | 3.4M

To estimate the performance and latencies of each of these components, we use operator-level benchmarks which allow evaluation of target operator shapes/sizes on candidate HW platforms. We benchmark‡ 1) the embedding operators, 2) typical MLP sizes, and 3) communication primitives. With these benchmarks we are able to establish the maximum achievable HBM bandwidth to be 850 GB/s on V100 and 1300 GB/s on A100 GPUs, and, for the MLP sizes of interest, achievable compute efficiencies of up to 78.6% (V100) and 70.5% (A100). Furthermore, we achieve 7 GB/s for a 256 MB AlltoAll and 60 GB/s for a 256 MB AllReduce; AllReduce achieves higher bandwidth because it utilizes both the scale-out network and NVLINKs. We provide detailed benchmarking results and the configurations used in Appendix A.
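As a simplified stand-in for these communication micro-benchmarks, the following sketch times a 256 MB AlltoAll with torch.distributed and reports an approximate per-rank bus bandwidth; it assumes an initialized NCCL process group and is not the PARAM benchmark suite referenced in the footnote.

import time
import torch
import torch.distributed as dist

def bench_alltoall(message_bytes=256 * 1024 * 1024, iters=20):
    world = dist.get_world_size()
    numel = message_bytes // 4                       # fp32 elements
    src = torch.rand(numel, device="cuda")
    dst = torch.empty_like(src)
    for _ in range(5):                               # warm-up
        dist.all_to_all_single(dst, src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_to_all_single(dst, src)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    # Each rank sends (world-1)/world of the buffer to its peers.
    return message_bytes * (world - 1) / world / elapsed / 1e9   # GB/s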

8.2 Experimental Setup

Table 2 summarizes the aggregated capabilities of a single ZionEX node with 8 NVIDIA A100 GPUs. The 8 GPUs in a node provide a total of 320 GB HBM with 12.4 TB/s aggregated memory bandwidth. The 4-socket CPUs provide 1.5 TB memory with 320 GB/s bandwidth. On network capabilities, the GPUs are interconnected with high-bandwidth NVLink for intra-node GPU communication, and each GPU has a dedicated 200 Gbps RoCE NIC for inter-node communication. We use a cluster of 16 ZionEX nodes in the experiments, with 5 TB total HBM capacity.

‡https://github.com/facebookresearch/param

Figure 8. Training quality comparison between asynchronous small batch on a distributed CPU platform and synchronous large batch on the proposed platform, measured in relative normalized entropy [18] over billions of samples.

8.3 End-to-End Training

We report results on three DLRM models deployed in production for different tasks, including click through rate (CTR) prediction, ranking, and engagement. Table 3 lists high-level characteristics of these candidate models. Model-A represents large and complex DLRMs that stress Neo's compute capability and communication bandwidth, using significantly higher FLOPS per sample and a large number of embeddings. Model-F presents a different practical challenge where, despite having low FLOPS per sample and a small number of embedding tables, it has a single massive table that cannot fit in the device memory of a single GPU. Finally, Model-I represents moderate-scale DLRMs stressing memory bandwidth with high average embedding pooling sizes. These target models are trained on up to 16 ZionEX nodes (128 GPUs) in the cluster. The model qualities are evaluated in normalized entropy [18], and the training throughput is measured in queries per second (QPS).

First, we use model-A to demonstrate the training quality, since it can also be trained on a distributed CPU platform. As shown in Figure 8, despite using a significantly larger batch size (64K vs. ~150), synchronous large-batch training on ZionEX provides on-par or better model quality (both using tuned hyperparameters). With the same configuration, Neo achieves 1.2 MQPS using 128 GPUs on 16 nodes, a 40× speedup compared to our previous-generation distributed CPU asynchronous training platform using 45 parameter servers and 15 trainers. While the previous solution was unable to scale out further without hurting training quality, fully synchronous training on ZionEX allows scaling beyond 16 nodes with even larger batch sizes.


Figure 9. Training throughput scaling for model-A and model-I, relative to 8 GPUs (1 node).

8.4 Scaling Performance

Figure 9 shows the normalized training throughput of model-A and model-I using up to 16 nodes, while keeping the per-GPU batch size constant. While the workload of data-parallel training remains the same with the scaling, the number of embedding tables per GPU reduces with scaling due to model parallelism. For the same reason, however, each GPU processes the entire global minibatch for each of its local tables, and this increases commensurately with scale, compensating for the reduced number of tables and making this still a weak scaling experiment. To run on smaller node counts, we reduce the embedding table cardinality and hash inputs to be within the reduced number of rows. This shrunk version of the model effectively reduces the model sizes with minimal/no impact on the performance characteristics, and hence is used for studying scaling performance.

As seen from the figure, on larger node counts, the scaling efficiency is around 50% for model-A and around 75% for model-I. While model-A and model-I come very close in terms of effective FLOPS and memory requirements after considering the target local batch size, model-A has larger fully exposed AlltoAll latency. This is because more embedding tables increase the AlltoAll payload, and mixed dimensions make it more difficult to balance embedding computations and AlltoAll communications at the same time. As a consequence, model-A suffers more from reduced AlltoAll efficiency when scaling out.

To better understand the scaling performance, we provide a breakdown of the serialized and exposed training iteration latency of model-A in Figure 10. Comparing between serialized and exposed latency, the CPU to GPU transfer (i.e., HtoD) is completely hidden, and the exposed communication latency is much less than the serialized AlltoAll and AllReduce latency combined. This demonstrates the effectiveness of Neo's pipelining optimization to overlap communications with computations (see Section 4.2).

Figure 10. Model-A (with local batch size per GPU = 512) results and dominant operator time breakdown (serialized and exposed time) per GPU, after optimizations, on 8 GPUs (single node) and 128 GPUs.

As node count increases, we observe increased AlltoAll and AllReduce latencies. Since most AlltoAll communications are on the critical path, the increased AlltoAll cost has a direct impact on the exposed communication and overall training latency. While AllReduce is mostly hidden on up to 16 nodes, the increased AllReduce latency and unchanged computation latency signify that AllReduce can become the bottleneck once the slack in the backward pass is completely used up with higher node counts and/or faster computation.

8.5 Training Throughput Optimizations

Using model-A as a case study, we detail the various optimizations and their contributions in achieving up to 1.5 MQPS, shown in Figure 11. The baseline performance for model-A on 128 GPUs is below 700 KQPS. Further profiling reveals large disparities in embedding lookup latency between different GPUs, signifying severe load imbalance. This is mitigated using a combination of table-wise, column-wise, and data parallelism for the ≈1000s of embedding tables to partition them across 128 GPUs. Note that even though column-wise parallelism introduces additional cost to its input AlltoAll, the benefit from better load balance outweighs the overheads and results in an overall QPS improvement of 20%. However, the scaling efficiency is still about 30% lower than ideal linear scaling.

As discussed previously, the two major issues limiting scaling efficiency are: (1) load imbalance and (2) increased AlltoAll latency. For model-A, further balancing the load using only HBM is particularly challenging because the model size in TF32 comes close to the 5 TB aggregated HBM capacity on 128 GPUs. After discounting for memory reserved by the PyTorch framework and NCCL on each rank, Neo has little room to explore placement strategies. To mitigate this issue, we use lower-precision (FP16) embedding tables, reducing the model size by a factor of 2.


Figure 11. Training throughput improvements enabled by optimized sharding, reduced precision embeddings, quantized communications and larger batch sizes.

While this alone does not provide a direct throughput benefit, Neo can leverage the headroom to strike a better balance. As a consequence, the training throughput is increased by another 20% due to improved load balancing.

Next, to address the increased AlltoAll latency, we incorporate the quantized collective communications proposed in [64], which directly reduce the communication volume. For model-A, we validate that using FP16 in the forward AlltoAll and BF16 in the backward AlltoAll provides almost a 30% speedup without any training quality loss.
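The idea can be sketched as a thin wrapper around the pooled-embedding AlltoAll that communicates in FP16 (the BF16 backward exchange is analogous) and casts back on arrival; this illustrates the approach from [64] rather than our production communication hooks, and assumes equal per-peer splits.

import torch
import torch.distributed as dist

def quantized_alltoall(pooled_fp32, comm_dtype=torch.float16):
    send = pooled_fp32.to(comm_dtype)        # halves the communication volume
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)
    return recv.float()                      # back to FP32 for downstream compute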

Lastly, we increase the global batch size from 64K to 256K. This directly increases activation sizes, which helps saturate GPUs and communication bandwidth better, while being complementary to all other optimizations. With appropriately tuned optimizer/hyper-parameters, we are able to achieve on-par training quality; however, more comprehensive experimentation is warranted since large-batch training of DLRMs is not as well studied and will be part of future work. Collectively, these techniques unlock an 87% improvement in training throughput compared to TF32 training with a 64K global batch size.

8.6 Model Capacity Limitation Study

We use model-F as an example to push the model capacity on the prototype system. Unlike model-A or model-I, efficiently training model-F presents two different challenges. First, with 12T parameters, model-F can easily require up to 96TB of memory using a naive training approach, far exceeding the total memory available on a 16-node cluster§. Second, the model has only a few massive embedding tables, each with ~10B rows and 256 columns and each requiring multiple nodes' worth of GPU and host memory to train.

§Considering FP32 precision and doubling for optimizer states: 12e12 × 4 × 2 = 96e12 bytes. The prototype cluster has 4TB HBM and 24TB DRAM in total.

To fit the model onto 16 nodes, we first apply the row-wise sparse AdaGrad optimizer to the embedding tables, which reduces the optimizer state from one entry per element to one per embedding row. Then we use FP16 precision for the embedding tables. These two optimizations collectively bring the model memory footprint from 96TB down to 24TB, just fitting within the 4TB HBM + 24TB DRAM memory hierarchy. For the massive embedding tables, we enable row-wise sharding to distribute each table across multiple nodes and adjust the training flow to use AlltoAll with bucketization and ReduceScatter, as shown in Figure 4b. With UVM enabled and HBM used as a cache, we are able to train model-F with throughput as high as 1.7 MQPS, demonstrating the capability of our HW/SW co-designed solution to push beyond the current state of the art.
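The memory arithmetic behind these numbers can be sketched as follows (back-of-the-envelope only; it ignores framework overheads and the dense part of the model):

params = 12e12                      # model-F embedding parameters
cols = 256                          # embedding dimension of the massive tables

naive = params * 4 * 2              # FP32 weights + element-wise optimizer state
fp16_weights = params * 2           # FP16 embedding weights
# Row-wise sparse AdaGrad keeps one FP32 state per embedding row instead of one
# per element, i.e. roughly 1/256 of the element-wise state for 256-column tables.
rowwise_state = (params / cols) * 4

print(naive / 1e12)                          # ~96 TB: exceeds 4 TB HBM + 24 TB DRAM
print((fp16_weights + rowwise_state) / 1e12) # ~24.2 TB: fits the HBM + DRAM hierarchy
print(10e9 * cols * 2 / 1e12)                # ~5.1 TB for one ~10B-row table in FP16,
                                             # more than a single node's combined memory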

9 Related work

Researchers have proposed various system-level innovations to tackle the challenges brought by extremely large models. DeepSpeed [49] fully shards model parameters, gradients and optimizer states across all nodes, and reconstructs necessary states on the fly using checkpoint partitioning and rematerialization [19, 26] to drastically reduce memory usage. GShard [30] trains a massive translation model with mixture of experts that are sharded across accelerators through annotation of the parallelization strategy at the tensor level. FlexFlow [20] uses automatic search to discover the best parallelization strategy for the operators in the graph. Building on this direction of auto-parallelization, recent work [38, 60] uses optimal synthesis and reinforcement learning to find optimized device placements that further improve parallelization without the need for manual intervention. However, these general systems are not specifically designed for highly sparse recommendation models.

To that end, Alibaba introduced XDL [21], an industry-scale training system designed for high-dimensional sparse data. XDL incorporates optimizations such as hierarchical sample compression, workflow pipelining, zero copy and CPU binding to improve the training efficiency of the sparse part of the model. Kraken [62] targets more efficient online training with decoupled key-value fetching and embedding, a cache eviction policy for the embedding tables co-designed with ML domain knowledge, memory-efficient optimizers for the sparse and dense parts of the model, and a non-colocated deployment model that allows the inference servers and parameter servers to grow independently. [23] optimizes CPU-based DLRM training through lock-free embedding table updates, tuned loop tiling for the dense MLP, the AlltoAll communication primitive, and a new split-SGD implementation that takes advantage of the bit aliasing between FP32 and BFloat16 to reduce memory footprint. Baidu's AIBox [67] takes a different approach from horizontal scaling and focuses on fitting the training of large recommendation models in a single node.


AIBox hides serving latency by pipelining network, disk and CPU/GPU tasks, and it reduces model update overhead and improves SSD life span through a grouped hashing scheme and a multi-level in-memory hashing system.

Much attention has been given to communication performance, as it has become a major bottleneck in distributed training at cluster and datacenter scale. BytePS and ByteScheduler [22, 48] harness idle CPU and network resources and better communication scheduling to improve parameter exchange efficiency. However, in a homogeneous training cluster where each job spans multiple nodes, there are fewer opportunities for finding and exploiting spare network resources, limiting the benefit of such approaches. SwitchML and ATP [29, 52] leverage programmable network switches to perform in-network aggregation for cross-rack bandwidth reduction in datacenter environments. [5, 35] discover and exploit datacenter network locality and form optimized, dynamic aggregation routes through learning and optimal synthesis. Alternatively, [32, 33] address the communication overhead by using various quantization schemes to reduce communication volume.

10 Conclusion

DLRMs are an important class of models widely used by many internet companies for a wide range of applications. They can often be the single largest AI application in terms of infrastructure demand in data-centers. These models have atypical requirements compared to other types of deep learning models, but they still follow a similar trend of rapid growth that is common across all deep learning-based applications. This growth constantly pushes the performance boundary required of the underlying software stack and hardware platform.

In this paper we co-design a solution that enables us to run models with trillions of parameters, while attaining 40× faster total training time for production recommendation models, using novel SW features such as 4D parallelism, high-performance operator implementations, hybrid kernel fusion and hierarchical memory support. On the HW side, the extensible ZionEX platform allows for scaling up to the full data center with thousands of nodes, thus enabling a data-center-scale AI training cluster to continue catering to the growing demands of deep learning models.

Acknowledgements

We would like to acknowledge all of the help from members of the hardware, datacenter and infrastructure teams, without which we could not have achieved any of the above reported results. This includes among others Jenny Yu, Matt Hoover, Hao Shen, Damien Chong, Jeff Puglis, Garnnet Thompson, Peter Bracewell, Anthony Chan, Wei Zhang, Michael Haken, Tiffany Jin, Joshua Held, Cheng Chen, Yin Hang, Ben Kim, Tyler Hart, Gada Badeer, Ahmed Qaid, Peichen Chang, Zhengyu Yang, Anil Agrawal, Viswesh Sankaran, Daniel Montgomery, James Taylor, Jeff Anderson, Amithash Prasad, Patrick Williams, Harsha Bojja, Arrow Luo, Changduk Kim, James Le, Rachel W Wang, Vignesh Mathimohan, Shockely Chen, Doug Wimer, James Allen, Vidya Rajasekaran, Kelly Zuckerman, Wenyin Fu, Valentin Andrei, Matt Skach, Philipp Keller, Olivier Raginel, and Danielle Costantino. We would also like to thank the other reviewers who have gone through multiple drafts of this paper, providing helpful inputs.


References
[1] NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
[2] OCP Open Rack Standard (v2). https://www.opencompute.org/wiki/Open_Rack/SpecsAndDesigns#RACK_Standards.
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[5] Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. Synthesizing optimal collective algorithms. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 62–75, 2021.
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide and deep learning for recommender systems. arXiv:1606.07792, 2016.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.
[8] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proc. 10th ACM Conf. Recommender Systems, pages 191–198, 2016.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
[11] Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Murali Annavaram, Krishnakumar Nair, and Misha Smelyanskiy. Check-n-run: A checkpointing system for training recommendation models, 2020.
[12] Carlos A. Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Trans. Manage. Inf. Syst., 6(4), December 2016.
[13] U. Gupta, C. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia, H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, and X. Zhang. The architectural implications of Facebook's DNN-based personalized recommendation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 488–501, 2020.
[14] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training, 2018.
[15] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 620–629. IEEE, 2018.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf. World Wide Web, pages 173–182, 2017.
[18] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014.
[19] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization. arXiv preprint arXiv:1910.02653, 2019.
[20] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358, 2018.
[21] Biye Jiang, Chao Deng, Huimin Yi, Zelin Hu, Guorui Zhou, Yang Zheng, Sui Huang, Xinyang Guo, Dongyue Wang, Yue Song, et al. XDL: an industrial deep learning framework for high-dimensional sparse data. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pages 1–9, 2019.
[22] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 463–479. USENIX Association, November 2020.
[23] Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, and Alexander Heinecke. Optimizing deep learning recommender systems training on CPU cluster architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press, 2020.
[24] Narenda Karmarker and Richard M. Karp. The differencing method of set partitioning. Technical report, USA, 1983.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. Dynamic tensor rematerialization. arXiv preprint arXiv:2006.09616, 2020.
[27] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, page 1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
[29] ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. ATP: In-network aggregation for multi-tenant learning.
[30] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
[31] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020.


[32] Hyeontaek Lim, David G Andersen, and Michael Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389, 2018.
[33] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
[34] Romain Lopez, Inderjit Dhillon S., and Michael Jordan I. Learning from extreme bandit feedback. In Proc. Association for the Advancement of Artificial Intelligence, 2021.
[35] Liang Luo, Peter West, Arvind Krishnamurthy, Luis Ceze, and Jacob Nelson. PLink: Discovering and exploiting datacenter network locality for efficient cloud-based distributed training, 2020.
[36] Yifei Ma, Balakrishnan (Murali) Narayanaswamy, Haibin Lin, and Hao Ding. Temporal-contextual recommendation in real-time. KDD '20, page 2291–2299, New York, NY, USA, 2020. Association for Computing Machinery.
[37] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia. MLPerf training benchmark, 2020.
[38] Azalia Mirhoseini, Hieu Pham, Quoc Le, Mohammad Norouzi, Samy Bengio, Benoit Steiner, Yuefeng Zhou, Naveen Kumar, Rasmus Larsen, and Jeff Dean. Device placement optimization with reinforcement learning. 2017.
[39] Dheevatsa Mudigere and Whitney Zhao. HW/SW co-design for future AI platforms - large memory unified training platform (Zion). In 2019 OCP Regional Summit, Amsterdam, 2019.
[40] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters. CoRR, abs/2104.04473, 2021.
[41] Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, and Mikhail Smelyanskiy. Deep learning training in Facebook data centers: Design of scale-up and scale-out systems, 2020.
[42] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019.
[43] NVIDIA. Unified memory in CUDA 6. https://developer.nvidia.com/blog/unified-memory-in-cuda-6/, 2021. Accessed: 2021-03-31.
[44] OpenAI. AI and compute. https://openai.com/blog/ai-and-compute/, 2018.
[45] Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt Lloyd. Facebook's Tectonic filesystem: Efficiency from exascale. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 217–231. USENIX Association, February 2021.
[46] Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, Juan Pino, Martin Schatz, Alexander Sidorov, Viswanath Sivakumar, Andrew Tulloch, Xiaodong Wang, Yiming Wu, Hector Yuen, Utku Diril, Dmytro Dzhulgakov, Kim Hazelwood, Bill Jia, Yangqing Jia, Lin Qiao, Vijay Rao, Nadav Rotem, Sungjoo Yoo, and Mikhail Smelyanskiy. Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications, 2018.
[47] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[48] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 16–29, 2019.
[49] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
[50] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24:693–701, 2011.
[51] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 446–459, 2020.
[52] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan RK Ports, and Peter Richtárik. Scaling distributed machine learning with in-network aggregation. arXiv preprint arXiv:1903.06701, 2019.
[53] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
[54] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
[55] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 10 2017.
[56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
[57] M. Smelyanskiy. Zion: Facebook next-generation large memory training platform. In 2019 IEEE Hot Chips 31 Symposium (HCS), 2019.
[58] Brent Smith and Greg Linden. Two decades of recommender systems at amazon.com. IEEE Internet Computing, 21(3):12–18, May 2017.
[59] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.


[60] Jakub Tarnawski, Amar Phanishayee, Nikhil R Devanur, Divya Mahajan, and Fanny Nina Paravecino. Efficient algorithms for device placement of DNN graph operators. arXiv preprint arXiv:2006.16423, 2020.
[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
[62] M. Xie, K. Ren, Y. Lu, G. Yang, Q. Xu, B. Wu, J. Lin, H. Ao, W. Xu, and J. Shu. Kraken: Memory-efficient continual learning for large-scale real-time recommendations. In 2020 SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–17, Los Alamitos, CA, USA, Nov 2020. IEEE Computer Society.
[63] Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, and Andrew Tulloch. Mixed-precision embedding using a cache, 2020.
[64] Jie Amy Yang, Jongsoo Park, Srinivas Sridharan, and Ping Tak Peter Tang. Training deep learning recommendation model with quantized collective communications. 2020.
[65] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
[66] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. Advances in Neural Information Processing Systems, 28:685–693, 2015.
[67] Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and Ping Li. AIBox: CTR prediction model training on a single node. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 319–328, 2019.
[68] Whitney Zhao, Dheevatsa Mudigere, Xiaodong Wang, Jongsoo Park, John Kim, and Mikhail Smelyanskiy. Accelerator fabric in Facebook Zion training system. In 2019 IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2019.
[69] Qinqing Zheng, Bor-Yiing Su, Jiyan Yang, Alisson Azzolini, Qiang Wu, Ou Jin, Shri Karandikar, Hagay Lupesko, Liang Xiong, and Eric Zhou. ShadowSync: Performing synchronization in the background for highly scalable distributed training. CoRR, 2003.03477, 2020.


Appendix-A

We collected and developed a set of operator-level benchmarks, which we have also open sourced as part of PARAM bench¶, to evaluate representative problem sizes and shapes on the candidate hardware platforms and to better understand the throughput and latency in compute, memory, and communications.

Compute Benchmark

GEMM benchmark. This benchmark calls the cuBLAS GemmEx routine to compute matrix multiplications on configurable problem sizes with multiple precision choices. On the V100 GPU, the benchmark supports FP32 GEMM on the CUDA cores and FP16 mixed-precision GEMM on the Tensor Cores. On the A100 GPU, it additionally supports TF32 GEMM and BF16 GEMM on the Tensor Cores. The benchmark results are shown in Figure 12.
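A PyTorch-level analogue of this measurement is sketched below (the actual benchmark calls cuBLAS GemmEx directly from C++; the sketch assumes a CUDA-capable GPU and only illustrates the timing methodology):

import torch

def gemm_tflops(m, n, k, dtype=torch.float16, iters=100):
    """Time C = A @ B on the GPU and report the achieved TF/s."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                # warm-up so cuBLAS heuristics settle
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(end) / 1e3 / iters
    return 2 * m * n * k / secs / 1e12

# One shape from the sweep in Figure 12: B x 4096 x 4096 with B = 1024.
print(gemm_tflops(1024, 4096, 4096, dtype=torch.float16))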

MLP benchmark. This benchmark implements the following multilayer perceptron (MLP) layers:
• Batch size = 128, 256, 512, 1024, 2048, 4096;
• 20 MLP layers, where each layer is 1K×1K, 2K×2K or 4K×4K;
• Each layer has ReLU and the final layer has SoftMax;
• Both forward and backward passes, including an SGD update as the optimizer after the backward pass;
• Precision support: FP16, BF16, TF32, FP32.

The batch size, layer dimensions, and number of layers can be configured to custom values. We implemented this MLP benchmark in C++, directly implementing FC and FCGradients in the MLP layer using the cuBLAS SGEMM/GemmEx functions, ReLU with the cuDNN cudnnActivationForward/cudnnActivationBackward functions, SoftMax with cudnnSoftmaxForward in the forward pass and a customized CUDA kernel for the backward pass, and the SGD optimizer with the cuBLAS axpy function. This benchmark can be used to project the performance of V100/A100 GPUs with a minimal MLP network, without the framework overhead of PyTorch. The benchmark results are shown in Figures 13 and 14.
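A simplified PyTorch version of the same benchmark is sketched below (the actual implementation is in C++ directly on cuBLAS/cuDNN; this framework-level sketch therefore includes PyTorch overheads the original deliberately avoids):

import torch
import torch.nn as nn

def mlp_benchmark(batch=1024, dim=1024, layers=20, iters=100, dtype=torch.float16):
    """Time forward + backward + SGD update for a stack of FC/ReLU layers."""
    blocks = []
    for _ in range(layers):
        blocks += [nn.Linear(dim, dim), nn.ReLU()]
    blocks.append(nn.Softmax(dim=1))
    model = nn.Sequential(*blocks).to("cuda", dtype)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(batch, dim, device="cuda", dtype=dtype)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        opt.zero_grad()
        model(x).sum().backward()      # dummy scalar loss; exercises fwd + bwd + SGD
        opt.step()
    end.record()
    torch.cuda.synchronize()
    print(f"{start.elapsed_time(end) / iters:.2f} ms per iteration")

mlp_benchmark()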

Memory Benchmark

This benchmark evaluates the achieved memory bandwidth of the embedding kernels described in Section 5. To eliminate L2 cache effects, a random tensor with 40 MB of data (the A100 L2 cache size) is allocated to flush the cache. The benchmark configuration is as follows (a simplified sketch follows the list):
• Supports evaluation of the forward and backward pass (the backward pass is fused with the optimizer);
• Precision support: FP32 and FP16;
• Number of rows: 1,000,000; number of tables: 64; embedding dimension: 128; pooling size: 32; rows per thread block: 32.
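A simplified single-table PyTorch sketch of the forward-pass measurement (the actual benchmark exercises the fused embedding kernels from Section 5 across 64 tables; here torch.nn.EmbeddingBag on one table stands in, so absolute numbers will differ):

import torch
import torch.nn as nn

def emb_forward_bandwidth(batch=65536, rows=1_000_000, dim=128, pooling=32, iters=20):
    """Achieved read bandwidth (GB/s) of one pooled embedding lookup, forward only."""
    emb = nn.EmbeddingBag(rows, dim, mode="sum").cuda()
    indices = torch.randint(rows, (batch * pooling,), device="cuda")
    offsets = torch.arange(0, batch * pooling, pooling, device="cuda")
    flush = torch.empty(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")  # ~A100 L2 size

    total_ms = 0.0
    for _ in range(iters):
        flush.random_()                # touch 40 MB so embedding rows are evicted from L2
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        emb(indices, offsets)
        end.record()
        torch.cuda.synchronize()
        total_ms += start.elapsed_time(end)
    bytes_read = batch * pooling * dim * emb.weight.element_size()
    return bytes_read / (total_ms / iters / 1e3) / 1e9

print(emb_forward_bandwidth())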

¶https://github.com/facebookresearch/param

[Bar chart: measured TF/s for GEMM shapes Bx4096x4096, Bx1024x1024, Bx4096x40928, Bx1024x2000, Bx40928x4096, 4096x40928xB, 4096x4096xB, 1024x1024xB and 1024x2000xB, with batch size B = 128–4096, on V100 FP16, A100 FP16 and A100 BF16.]
Figure 12. GEMM performance (TF/s) for V100 FP16 vs. A100 FP16/BF16.


The benchmark results are shown in Figures 15 and 16.

Communication Benchmark

Low-level collective communication benchmarks, e.g. NVIDIA's NCCL tests or the OSU MPI benchmarks, have the following limitations:

• They do not capture the behavior of actual workloads, i.e. exact message sizes, sequences of collective operations, etc. Instead, these benchmarks support power-of-two message sizes, which is helpful for detecting network trends.

• They are limited to one specific communication library. As the names suggest, NCCL tests work only with NCCL and the OSU MPI benchmarks are limited to MPI.

The PARAM comms benchmark addresses these gaps by:
• Creating common abstractions across platforms (e.g. NVIDIA GPUs, x86 CPUs, Google TPUs, etc.) to help standardize the benchmarking logic.

[Bar chart: measured TF/s for MLP sizes Bx1Kx1K, Bx2Kx2K and Bx4Kx4K, with batch size B = 128–4096, on V100 FP32, A100 FP32 and A100 TF32.]
Figure 13. MLP performance for V100 FP32 vs. A100 FP32/TF32.

[Bar chart: measured TF/s for MLP sizes Bx1Kx1K, Bx2Kx2K and Bx4Kx4K, with batch size B = 128–4096, on V100 FP16, A100 FP16 and A100 BF16.]
Figure 14. MLP performance for V100 FP16 vs. A100 FP16/BF16.


[Line chart: measured bandwidth (GB/s) vs. batch size, for V100 and A100 in FP32 and FP16.]
Figure 15. Achieved embedding lookup forward bandwidth using FP32 vs. FP16 on V100 vs. A100.

[Line chart: measured bandwidth (GB/s) vs. batch size, for V100 and A100 with SGD and RowWiseAdagrad in FP32 and FP16.]
Figure 16. Achieved embedding lookup backward+optimizer bandwidth using FP32 vs. FP16 on V100 vs. A100.

• Using the PyTorch Process Group APIs to provide a portable interface across different communication libraries (e.g. NCCL, MPI, and UCC).

The PARAM comms benchmark supports two types of collective benchmarks:
• Bench mode: The simplest mode of operation, similar to NCCL tests. It runs a single collective, in a blocking or non-blocking manner, across a fixed set of message sizes (e.g. power-of-two message sizes). This is mainly used for low-level HW testing.
• Replay mode: Replays a trace of collective communication calls to mimic exact workload behavior in terms of collective sizes.

Figure 17 presents AlltoAll and AllReduce benchmark scaling for power-of-two message sizes on 128 GPUs. AlltoAll achieves 7 GB/s and is primarily limited by the scale-out bandwidth (12.5 GB/s peak; 10.5 GB/s achievable on V100). AllReduce achieves higher bandwidth since it uses NVLINK more effectively.
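For reference, a minimal "bench mode" style AlltoAll sweep using the PyTorch process-group API might look like the sketch below (a simplified stand-in, not the PARAM implementation; it assumes one process per GPU launched with torchrun and the NCCL backend):

import os
import torch
import torch.distributed as dist

def alltoall_bandwidth(msg_bytes, iters=50):
    """Time AlltoAll at a given total message size and report algorithmic GB/s."""
    numel = msg_bytes // 4                          # FP32 elements
    send = torch.randn(numel, device="cuda")
    recv = torch.empty_like(send)
    for _ in range(5):                              # warm-up
        dist.all_to_all_single(recv, send)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_to_all_single(recv, send)
    end.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(end) / 1e3 / iters
    if dist.get_rank() == 0:
        print(f"{msg_bytes / 2**20:8.0f} MiB: {msg_bytes / secs / 1e9:6.2f} GB/s")

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
for size in [2**p for p in range(20, 29)]:          # 1 MiB ... 256 MiB
    alltoall_bandwidth(size)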


[Line chart: measured bandwidth (GB/s) vs. message size (bytes) for AlltoAll and AllReduce.]
Figure 17. Achieved AlltoAll and AllReduce bandwidth at 128 GPUs.
