Pig: Introduction to Latin - 3

  • flatten

players = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
bypos = group pos by position;

 

Jorge Posada,New York Yankees,{(Catcher),(Designated_hitter)},...

==>

Jorge Posada,Catcher
Jorge Posada,Designated_hitter

 

Note: A foreach with a flatten produces a cross product of every record in the bag with all of the other expressions in the generate statement. If more than one bag is flattened, the cross product is taken across the members of each bag as well as the other expressions in the generate statement.
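As a minimal sketch of the multi-bag case (the rel alias and its fields a and b are hypothetical):

-- rel: (name:chararray, a:bag{t:(x:chararray)}, b:bag{t:(y:chararray)})
crossed = foreach rel generate name, flatten(a), flatten(b);
-- a record (fred,{(p1),(p2)},{(q1),(q2)}) produces four records:
-- (fred,p1,q1), (fred,p1,q2), (fred,p2,q1), (fred,p2,q2)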

 

If the bag is empty, no records are produced for that row. You can avoid dropping such records by substituting a default value:

noempty = foreach players generate name,
          ((position is null or IsEmpty(position)) ? {('unknown')} : position) as position;

 

Flatten can also be applied to a tuple. In this case it does not produce a cross product; instead, it elevates each field of the tuple to a top-level field.
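A short sketch, assuming a hypothetical relation with a tuple field t:

pts = load 'points' as (name:chararray, t:tuple(x:int, y:int));
flat = foreach pts generate name, flatten(t);
-- flat has schema (name:chararray, t::x:int, t::y:int); one output record per input record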

 

  •  Nested foreach

 

daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields
grpd = group daily by exchange;
uniqcnt = foreach grpd {
           sym = daily.symbol;
           uniq_sym = distinct sym;
           generate group, COUNT(uniq_sym);
};

 


divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
grpd = group divs by symbol;
top3 = foreach grpd {
        sorted = order divs by dividends desc;
        top = limit sorted 3;
        generate group, flatten(top);
};

 

Note: Only distinct, filter, limit, and order are supported inside a nested foreach.
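For completeness, a small sketch of filter inside a nested foreach (the 0.1 threshold and the bigcnt alias are only illustrative):

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
grpd = group divs by exchange;
bigcnt = foreach grpd {
        big = filter divs by dividends > 0.1; -- filter applied to the bag of grouped records
        generate group, COUNT(big);
};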

 

  • fragment-replicate join

 

jnd = join daily by (exchange, symbol), divs by (exchange, symbol) using 'replicated';

 

Pig implements the fragment-replicate join by loading the replicated input into Hadoop's distributed cache. All but the first relation will be loaded into memory.

Note: Fragment-replicate join supports only inner and left outer joins.
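For example, a left outer fragment-replicate join could be written as follows (a sketch reusing the daily and divs aliases from above):

lojnd = join daily by (exchange, symbol) left outer, divs by (exchange, symbol) using 'replicated';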

 

  • skew join

In many data sets, there are a few keys that have three or more orders of magnitude more records than other keys. This results in one or two reducers that will take much longer than the rest.

 

Skew join works by first sampling one input for the join. In that input it identifies any keys that have so many records that skew join estimates it will not be able to fit them all into memory. Then, in a second MapReduce job, it does the join. For all records except those identified in the sample, it does a standard join, collecting records with the same key onto the same reducer. Those keys identified as too large are treated differently. Based on how many records were seen for a given key, those records are split across the appropriate number of reducers. The number of reducers is chosen based on Pig’s estimate of how wide the data must be split such that each reducer can fit its split into memory. For the input to the join that is not split, those keys that were split are then replicated to each reducer that contains that key.

 

users = load 'users' as (name:chararray, city:chararray);
cinfo = load 'cityinfo' as (city:chararray, population:int);
jnd = join cinfo by city, users by city using 'skewed';

 

Note: Skew join can be done on inner or outer joins; however, it can take only two join inputs. Pig looks at the record sizes in the sample and assumes it can use 30% of the JVM's heap to materialize records that will be joined. This fraction can be controlled with the pig.skewedjoin.reduce.memusage property.
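For instance, the fraction could be lowered in the script itself (a sketch; 0.1 is an arbitrary value, and the property can also be passed as a Java property on the pig command line):

set pig.skewedjoin.reduce.memusage '0.1';
jnd = join cinfo by city, users by city using 'skewed';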

 

  • merge join

daily = load 'NYSE_daily_sorted' as (exchange:chararray, symbol:chararray, date:chararray, open:float,
        high:float, low:float, close:float, volume:int, adj_close:float);
divs = load 'NYSE_dividends_sorted' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends:float);
jnd = join daily by symbol, divs by symbol using 'merge';

Note: Merge join requires both inputs to already be sorted on the join key.

 

 
