Large scale machine learning - Map-reduce and data parallelism

In this class, let's talk about a different approach to large scale machine learning called the Map-reduce approach. It is at least as important as stochastic gradient descent, and perhaps even more so. By using this idea, you might be able to scale learning algorithms to far larger problems than stochastic gradient descent can handle.

 figure-1

Let's say we want to fit a linear regression model or a logistic regression model or some such, and let's start again with batch gradient descent. Suppose m=400 for ease of explanation. Of course, by large scale machine learning standards, such an m is far too small; this technique is more commonly applied to problems with something closer to 400 million examples. Figure-1 shows the batch gradient descent update; when m is large, the sum over all m examples is a computationally expensive step.
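Concretely, for a linear regression hypothesis h_{\theta}(x), the batch gradient descent update (as in figure-1) sums a gradient term over every one of the 400 examples on every single step:

\theta _{j}=\theta _{j}-\alpha \frac{1}{400}\sum_{i=1}^{400}(h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}, for every j=0,...,n.

It is this sum over the whole training set that Map-reduce parallelizes.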

figure-2

Map-reduce is based on an idea due to Jeffrey Dean and Sanjay Ghemawat, and figure-2 illustrates it. Let's say we have a training set with 400 examples, denoted by the box. In Map-reduce, I'm going to split this training set into different subsets. Assuming we have 4 machines to run in parallel on my training set, I'm going to split it into 4 subsets:

  • The 1st machine uses only the first quarter of the training set, (x^{(1)}, y^{(1)}),...,(x^{(100)}, y^{(100)}), and computes the partial sum over just these first 100 training examples, temp^{(1)}_{j}=\sum_{i=1}^{100}(h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}. The superscript (1) denotes the first machine
  • Similarly, I'm going to take the second quarter of the data and send it to my second machine, which uses (x^{(101)}, y^{(101)}),...,(x^{(200)}, y^{(200)}) to compute temp^{(2)}_{j}
  • Likewise, machines three and four use the third and fourth quarters of my training set to compute temp^{(3)}_{j} and temp^{(4)}_{j}
  • Finally, after all these machines have done this work, the temp variables are sent to a centralized master server, which combines the results and updates the parameters: \theta _{j}=\theta _{j}-\alpha \frac{1}{400}(temp^{(1)}_{j}+temp^{(2)}_{j}+temp^{(3)}_{j}+temp^{(4)}_{j}), where j=0,...,n and n is the number of features (a code sketch follows this list)
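Here is a minimal sketch of this scheme in Python/NumPy, assuming a linear regression hypothesis h_{\theta}(x)=\theta^{T}x; the four machines are simulated by a plain loop over four data chunks, and the names partial_sum and mapreduce_gradient_step are just illustrative, not part of any particular framework.

```python
import numpy as np

def partial_sum(theta, X_chunk, y_chunk):
    """Work done by one 'machine': sum of (h_theta(x) - y) * x over its chunk."""
    errors = X_chunk @ theta - y_chunk      # h_theta(x^(i)) - y^(i) for this chunk
    return X_chunk.T @ errors               # vector of temp_j values for this chunk

def mapreduce_gradient_step(theta, X, y, alpha, num_machines=4):
    """One batch gradient descent step, assembled from per-machine partial sums."""
    m = len(y)
    X_chunks = np.array_split(X, num_machines)
    y_chunks = np.array_split(y, num_machines)
    # "Map" step: each machine computes its temp vector (simulated sequentially here).
    temps = [partial_sum(theta, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)]
    # "Reduce" step: the master server adds the partial sums and updates theta.
    return theta - alpha / m * sum(temps)
```

With m = 400 and four chunks, each call to partial_sum touches only 100 examples, exactly as described above.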

So now each machine only needs to do a quarter of the work, and thus presumably runs about 4x as fast.

figure-3

Figure-3 shows the general picture of the Map-reduce technique. We take some training set, split it as evenly as we can into four subsets, and send each subset to a different computer. Each computer computes a summation over just its quarter of the training set and sends the result to a centralized server, which combines the results. If there were no network latency and no cost of sending the data back and forth, you could potentially get up to a 4x speedup. In practice, because of network latency, the overhead of combining the results afterwards, and other factors, you get somewhat less than a 4x speedup. Nonetheless, this Map-reduce approach does offer a way to process much larger data sets than is possible with a single computer.

If you're thinking of applying Map-reduce to some learning algorithm in order to speed it up by parallelizing the computation over different computers, the key question to ask yourself is: can the learning algorithm be expressed as a summation over the training set?
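For the batch gradient descent example above, the answer is yes: the sum over all 400 examples splits cleanly into the four per-machine sums,

\sum_{i=1}^{400}(h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}=temp^{(1)}_{j}+temp^{(2)}_{j}+temp^{(3)}_{j}+temp^{(4)}_{j},

and any algorithm whose main work has this form can be parallelized in the same way.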

figure-4

Figure-4 shows one more example. Let's say we want to use one of the advanced optimization algorithms (L-BFGS, conjugate gradient, and so on) to train a logistic regression model. We need to supply two main quantities: a routine that computes the cost function, and a routine that computes its partial derivatives. Both are summations over the training set, so each machine computes the sums over just its small fraction of the training data and sends its results to a centralized server, which adds up the partial sums to obtain the overall cost function and the overall partial derivatives.
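As a minimal sketch (ignoring regularization), each machine can compute the unnormalized cost and gradient sums over its own fraction of the data, and the central server just adds them up and divides by m; the names partial_cost_grad and full_cost_grad are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_cost_grad(theta, X_chunk, y_chunk):
    """Work done by one machine: unnormalized cost and gradient sums over its chunk."""
    h = sigmoid(X_chunk @ theta)
    cost_sum = -(y_chunk @ np.log(h) + (1 - y_chunk) @ np.log(1 - h))
    grad_sum = X_chunk.T @ (h - y_chunk)
    return cost_sum, grad_sum

def full_cost_grad(theta, X, y, num_machines=4):
    """Master server: add up the partial sums and divide by m."""
    m = len(y)
    pieces = [partial_cost_grad(theta, Xc, yc)
              for Xc, yc in zip(np.array_split(X, num_machines),
                                np.array_split(y, num_machines))]
    J = sum(c for c, _ in pieces) / m
    grad = sum(g for _, g in pieces) / m
    return J, grad
```

Because full_cost_grad returns the pair (cost, gradient), it is the kind of routine an advanced optimizer expects; for instance, scipy.optimize.minimize(full_cost_grad, theta0, args=(X, y), jac=True, method='L-BFGS-B') accepts a function that returns both values when jac=True. The result is identical to a single-machine implementation; only the order of summation differs.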

figure-5

Besides parallelizing over multiple computers, Map-reduce can also be applied within a single computer. In particular, many computers now have multiple processing cores. If you have a large training set and a single computer with 4 cores, you can split the training set into multiple pieces and send a piece to each core. Each core sums over, say, 1/4 of your training set, and then the partial sums are combined to get the summation over the entire training set. The advantage of using Map-reduce this way is that you don't have to worry about network latency.
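Here is a sketch of that multi-core variant using Python's multiprocessing module, again with illustrative helper names; each worker process plays the role of one core summing over its chunk.

```python
import numpy as np
from multiprocessing import Pool

def core_partial_sum(args):
    """One core's share: sum of (h_theta(x) - y) * x over its chunk."""
    theta, X_chunk, y_chunk = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def multicore_gradient(theta, X, y, num_cores=4):
    """Full batch gradient computed by combining per-core partial sums."""
    chunks = [(theta, Xc, yc)
              for Xc, yc in zip(np.array_split(X, num_cores),
                                np.array_split(y, num_cores))]
    with Pool(num_cores) as pool:
        temps = pool.map(core_partial_sum, chunks)   # "map" across the cores
    return sum(temps) / len(y)                       # "reduce" on the same machine
```

Note that multiprocessing typically requires this to run under an `if __name__ == "__main__":` guard so worker processes can import the module cleanly, but there is no network latency to worry about.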

Finally, one last caveat on parallelizing within a multi-core machine. Depending on the details of your implementation, if you have a multi-core machine and certain numerical linear algebra libraries, it turns out that some of these libraries can automatically parallelize their linear algebra operations across the cores. If you're fortunate enough to be using one of those libraries, and you have a very good vectorized implementation of the learning algorithm, you can sometimes just write the standard algorithm in a vectorized fashion, let the library handle the parallelization, and not implement Map-reduce yourself.
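For instance, a fully vectorized batch gradient for linear regression is just a couple of matrix operations; if the NumPy installation happens to be linked against a multithreaded BLAS such as OpenBLAS or MKL (common, but not guaranteed), those operations are spread across the cores automatically, with no explicit Map-reduce code.

```python
import numpy as np

def vectorized_gradient(theta, X, y):
    """Batch gradient as matrix operations; a multithreaded BLAS can
    parallelize the matrix-vector products across cores on its own."""
    return X.T @ (X @ theta - y) / len(y)
```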
