Cloud Computing(3)_Basic MapReduce Algorithm Design_Pairs&Stripes

How do we aggregate partial counts efficiently?

Pairs

The pairs algorithm illustrates the use of complex keys to coordinate distributed computations: each key is itself a pair of co-occurring terms.

  • Each mapper takes a sentence and emits a count of 1 for every pair of co-occurring terms in it
  • Reducers sum up the counts associated with each pair
//"pairs" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w∈doc d do
            for all term u∈NEIGHBORS(w) do
                EMIT( pair(w, u) , count 1)   //EMIT count for each co-occurrence

class REDUCER
    method REDUCE(pair p, counts[c1, c2, ...])
        s = 0
        for all count c ∈counts[c1, c2, ...] do
            s = s + c
        EMIT(pair p, count s)

For each term, emit pairs: ( (a,b), 1 ). The key is the pair (a,b).
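The pseudocode above can be sketched in plain Python. This is a minimal single-process simulation, not a Hadoop job: the `neighbors` helper assumes that NEIGHBORS(w) means every other token in the same sentence, and `run_pairs` is a hypothetical stand-in for the shuffle phase that groups values by key.

```python
from collections import defaultdict

def neighbors(tokens, i):
    """Assumed NEIGHBORS: every other token in the same sentence."""
    return tokens[:i] + tokens[i + 1:]

def map_pairs(sentence):
    """Emit ((w, u), 1) for every co-occurrence in one sentence."""
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for u in neighbors(tokens, i):
            yield (w, u), 1

def reduce_pairs(pair, counts):
    """Sum all partial counts that arrived for one pair."""
    return pair, sum(counts)

def run_pairs(sentences):
    # Stand-in for the shuffle phase: group emitted values by key.
    grouped = defaultdict(list)
    for sentence in sentences:
        for key, value in map_pairs(sentence):
            grouped[key].append(value)
    return dict(reduce_pairs(k, v) for k, v in grouped.items())

pair_counts = run_pairs(["a b c", "a b"])
print(pair_counts[("a", "b")])  # co-occurrence count of a with b
```

Note that every intermediate record is tiny (one pair and the integer 1), but there is one record per co-occurrence.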

“Pairs Analysis” (each record is small, but there are many of them)
  • Advantages

    • Easy to implement, easy to understand: map finds the pairs, reduce sums the counts
  • Disadvantages

    • Lots of pairs to sort and shuffle around: a vocabulary of n distinct words can produce up to n² distinct ordered pairs, so the key space grows as O(n²)
    • Not many opportunities for combiners to work
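To see why combiners help only a little with pairs, the sketch below counts how many records a single sentence produces and how many distinct keys they share. It assumes, as an illustration only, that neighbors are all other tokens in the sentence; `emitted_pairs` is a hypothetical helper. A combiner can merge two records only when the exact same pair repeats within one mapper's output.

```python
from collections import Counter

def emitted_pairs(sentence):
    """All (w, u) keys a pairs mapper would emit for one sentence."""
    tokens = sentence.split()
    return [(w, u)
            for i, w in enumerate(tokens)
            for j, u in enumerate(tokens)
            if i != j]

sentence = "the cat saw the dog"
pairs = emitted_pairs(sentence)
distinct = Counter(pairs)            # what a local combiner could reduce to
n = len(set(sentence.split()))       # distinct words in the sentence

print(len(pairs), len(distinct), n * n)
```

Here the only savings come from the repeated word "the". With a large vocabulary most pairs in one mapper's output are unique, so local aggregation removes few records.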
Stripes

Co-occurrence information is first stored in an associative array, denoted H.
The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word.

  • Each mapper takes a sentence and, for each term, builds one associative array holding the counts of its neighbors
  • Reducers perform element-wise sum of associative arrays
//"stripes" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w∈doc d do
            H = new ASSOCIATIVEARRAY
            for all term u∈NEIGHBORS(w) do
                H{u} = H{u} + 1   //Tally words co-occurring with w
            EMIT( term w , Stripe H)  

class REDUCER
    method REDUCE(term w , Stripes [H1, H2, H3,...])
        Hf = new ASSOCIATIVEARRAY
        for all stripe H ∈stripes[H1, H2, H3, ...] do
            sum(Hf,H)
        EMIT(term w , Stripe Hf)

For each term, emit a stripe: a->{b:1, c:2, d:2, ….} The key is the term "a".
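The stripes pseudocode can be sketched the same way. This is a single-process simulation under the assumption that a term's neighbors are all other tokens in its sentence; `collections.Counter` plays the role of the associative array H, and `run_stripes` is a hypothetical stand-in for the shuffle phase.

```python
from collections import Counter, defaultdict

def map_stripes(sentence):
    """Emit (w, H) where H tallies the words co-occurring with w."""
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        stripe = Counter(tokens[:i] + tokens[i + 1:])
        yield w, stripe

def reduce_stripes(term, stripes):
    """Element-wise sum of all stripes that arrived for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts key by key
    return term, total

def run_stripes(sentences):
    # Stand-in for the shuffle phase: group stripes by term.
    grouped = defaultdict(list)
    for sentence in sentences:
        for term, stripe in map_stripes(sentence):
            grouped[term].append(stripe)
    return dict(reduce_stripes(t, hs) for t, hs in grouped.items())

stripe_counts = run_stripes(["a b c", "a b"])
print(dict(stripe_counts["a"]))  # neighbors of a with their counts
```

Each sentence now yields one record per token rather than one per co-occurrence, at the cost of a heavier value object.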

“Stripes Analysis” (each record is large, but there are few of them)
  • Advantages

    • Far less sorting and shuffling of key-value pairs
    • Can make better use of combiners
  • Disadvantages

    • More difficult to implement
    • Underlying object more heavyweight
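    • Fundamental limitation in terms of size of event space: each stripe must fit in memory, so a very large vocabulary can make a single associative array too big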
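The combiner advantage listed above can be illustrated with a small sketch. `combine_stripes` is a hypothetical local combiner, not part of any framework API: it merges stripes that share a term before they leave the mapper, using the same element-wise sum as the reducer.

```python
from collections import Counter

def combine_stripes(mapper_output):
    """Merge stripes that share a key before they are shuffled."""
    merged = {}
    for term, stripe in mapper_output:
        merged.setdefault(term, Counter()).update(stripe)
    return list(merged.items())

# Stripes one mapper might emit for the sentences "a b" and "a c".
mapper_output = [
    ("a", Counter({"b": 1})),
    ("b", Counter({"a": 1})),
    ("a", Counter({"c": 1})),
    ("c", Counter({"a": 1})),
]
combined = combine_stripes(mapper_output)
print(len(mapper_output), len(combined))
```

Because the key is a single term rather than a pair, keys repeat far more often within one mapper's output, which is what gives the combiner something to merge.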
Pairs vs. Stripes
  • When the data volume is small and few processing resources are involved, use pairs; otherwise, stripes is the better choice
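As a rough illustration of the trade-off, the sketch below counts the intermediate records each approach would emit for the same toy corpus and checks that both yield identical co-occurrence counts. The corpus and helper names are made up for this example, and neighbors are assumed to be all other tokens in the sentence.

```python
from collections import Counter

def pairs_records(sentences):
    """Intermediate records a pairs mapper would emit."""
    out = []
    for sentence in sentences:
        t = sentence.split()
        out += [((w, u), 1)
                for i, w in enumerate(t)
                for j, u in enumerate(t) if i != j]
    return out

def stripes_records(sentences):
    """Intermediate records a stripes mapper would emit."""
    out = []
    for sentence in sentences:
        t = sentence.split()
        out += [(w, Counter(t[:i] + t[i + 1:])) for i, w in enumerate(t)]
    return out

corpus = ["a b c d", "a b", "c d a"]
pair_recs = pairs_records(corpus)
stripe_recs = stripes_records(corpus)

# Aggregate both representations into (w, u) -> count for comparison.
pair_totals = Counter()
for key, count in pair_recs:
    pair_totals[key] += count
stripe_totals = Counter()
for w, stripe in stripe_recs:
    for u, count in stripe.items():
        stripe_totals[(w, u)] += count

print(len(pair_recs), len(stripe_recs))  # records to sort and shuffle
```

The final counts agree; what differs is how many records cross the network and how large each one is.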