In this blog post, I’d like to share some of the insights from my work at the High-Performance Computing (HPC) Texas Advanced Computing Center (TACC) cluster (circa 2010, TACC had a “Lonestar” cluster with 5200 2Gb-nodes), and within the Technology Group at Conoco-Phillips Corp. Specifically, I want to address the issue of Data vs. Model distributed computations.
But before I go into the details, let’s understand what ring-reduce and all-reduce are.
What Are Ring-Reduce And All-Reduce Operations?
All-Reduce is a parallel algorithm that aggregates the target arrays from all processes independently into a single array. Aggregation can be either concatenation or summation, or any other operation that allows for independent parallel processing of arrays. When applied to the gradient of deep learning, the summation is the most common operation.
On the other hand, in ring-reduce, the gradient is divided into consecutive blocks at each node, and simultaneously, each block is updated using the previous node and also provides an update to the next node, making a ring pattern. This method is popularized by tools like Horovod.
Ring-Reduce vs. All-Reduce
When it comes to All-Reduce vs. Ring-Reduce distribution, we often do not consider the price we would need to pay for Ring-Reduce.
For example, All-Reduce ACCURACY is absolutely equivalent to non-distributed training, i.e., we should expect the same results from All-reduce running on any number of nodes, including a single node. However, All-reduce won’t be able to train models that do not fit memory.
Ring-Reduce allows training large models that do not fit in single node memory. However, there is a price in ACCURACY that needs to be paid, and there are methods to deal with it to minimize this price.
Here is an intuitive explanation of Ring-Reduce using a big matrix solving notation that illustrates the ‘price’ that one pays both in terms of convergence rate and attainable convergence accuracy.
First of all, there is a SYNCHRONOUS Ring-Reduce that reproduces the single node accuracy (up to shuffling in case of deep learning batch training). Synchronous Ring-Reduce uploads the model into memory in chunks, but only one node runs at any time.
The next phase of introducing Ring-Reduce for ASYNCHRONOUS distributed computations is making Ring-Reduce reproduce one’s single node ACCURACY for a diagonal block matrix (diagram below, top), with a trivial model-nodes distribution which requires no intra-node communication.
Lastly, here is how we improve Ring-Reduce for a non-diagonal but sparse matrix. Model distribution is still possible, and now each node will need to communicate to its neighbor ASYNCHRONOUSLY and without blocking. It is done by having each node dealing with a ‘delayed’ part of the model weights computed from its neighbor. This deviates from the model update scheme for a single node when the whole model is updated at step n+1 using the model at step n.
I illustrate All-reduce vs. Ring-reduce using a small MPI c-program (Github) :
To introduce some complexity, I have chosen a non-diagonally-dominant matrix with a non-guaranteed convergence, as can be seen with the Gauss-Seidel update scheme. As a result, we will observe Ring-Reduce not being able to converge for some ranges of matrix size n (❤00).
Here are the main conclusions by comparing side-by-side All-Reduce and Ring-Reduce:
- Synchronous All-Reduce MPI matrix solver that attains absolute up to arbitrary epsilon convergence with Gauss-Seidel iterations.
- Ring-Reduce is able to distribute large models (large matrices)
- Ring-Reduce provides a straightforward extension to asynchronous model weights updates.
- Ring-Reduce may break update schemes. For example, the Gauss-Seidel solver may not attain an absolute arbitrary-epsilon convergence (for example, one can see it for n=100), and the resulting convergence error can be large.
- Ring-Reduce requires a larger absolute amount of computations, i.e., in my example, each node will make up to ten inter-node iterations between the intra-node communications.
- Ring-Reduce performance greatly depends on the linear equations sparsity. For a general non-sparse model, no model distribution could be attained.
I had All-Reduce and Ring-reduce implemented in MPI using a non-blocking synchronous MPI Reduce and point-to-point Send and Receive on each node.
Gauss-Seidel update attains arbitrary unlimitedconvergence accuracy of 0.002 with synchronous (blocking) distributed model weights update, versus limited convergence of 0.1 in an asynchronous update.