1. Data Types
Simple (scalar) types: int, long, float, double, chararray, bytearray; internally these correspond to the standard Java classes.
Complex types: map, tuple, bag.
Note: a bag is an unordered collection of tuples, and it does not need to fit entirely in memory.
2. Null Values
Null here means the same thing as in SQL, and differs from null in C, Java, or Python. In Pig, null means the value is unknown, which may be caused by missing data or by an error during processing. In programming languages, null instead means a variable has not been assigned, or does not point to a valid address or object.
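A minimal sketch of how this plays out in practice (file name and schema are assumptions for illustration): because a comparison involving null yields null rather than true or false, null fields should be tested explicitly with is null / is not null:
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends:float);
missing = filter divs by dividends is null;     -- records where the value is unknown
present = filter divs by dividends is not null; -- records with a known value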
3. Schemas
Pig supports explicitly declared schemas as well as implicit casts; the default type is bytearray.
Casts: no type may be cast to bytearray, but a bytearray may be cast to any other type.
int, long, float, double, and chararray can be cast among one another, but it is best to cast from a narrower type to a wider one, which loses no data; casting in the other direction can lose data.
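A sketch of these casting rules (the schema and file name are assumptions): bytearray may be cast away from, widening casts are safe, and narrowing casts can lose data:
daily = load 'NYSE_daily' as (symbol:chararray, volume:bytearray, close:float);
casts = foreach daily generate
    (int)volume,   -- bytearray to another type: allowed
    (double)close, -- float to double, low to high: no data loss
    (int)close;    -- float to int, high to low: the fractional part is lost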
4. Basics
1. Relation and field names must begin with an alphabetic character (unlike most programming languages, a leading underscore is not allowed), followed by zero or more letters, digits, or underscores.
2. Case sensitivity: keywords are case-insensitive; everything else is case-sensitive.
3. Comments: -- (single-line), /* */ (multi-line).
4. Load: load 'xxx' [as schema] [using load_function];
5. Store: store processed into 'xxx' [using store_function];
6. Output: dump processed;
7. Iterate over the records, keeping only the named fields: foreach relation generate field_name;
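The basics above can be combined into a minimal end-to-end script (the file names and the comma delimiter are assumptions):
/* load, project, store, and print a small data set */
raw = load 'input.txt' using PigStorage(',') as (name:chararray, score:int);
names = foreach raw generate name; -- keep only the name field
store names into 'output';
dump names;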
5. Relational Operators
Projection:
1. Maps are indexed with #: avg = foreach A generate bat#'batting_average';
2. Tuples are indexed with .: B = foreach A generate t.x, t.$1;
3. A field inside a bag cannot be addressed directly; instead, projecting on a bag creates a new bag containing the requested fields:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;
filter:
notstartswithcm = filter divs by not symbol matches 'CM.*';
Boolean operator precedence, highest first:
not > and > or
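A sketch of what this precedence means inside a filter (relation, symbols, and threshold are assumptions): without parentheses, and binds tighter than or:
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends:float);
f1 = filter divs by symbol == 'CM' or symbol == 'GE' and dividends > 0.1;
-- f1 is parsed as:
f2 = filter divs by symbol == 'CM' or (symbol == 'GE' and dividends > 0.1);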
group:
daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd = group daily by (exchange, stock);
-- or: grpd = group daily all;
-- or: grpd = group daily by exchange;
avg = foreach grpd generate group, AVG(daily.dividends);
describe grpd;
order by:
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
byclose = order daily by close desc, open;
dump byclose;
distinct:
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;
join:
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
(Self-joins are allowed, but the data is then loaded twice.)
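A sketch of the self-join case (schema assumed): Pig requires the input to be loaded under two separate aliases, which is why the data is read twice:
divs1 = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
divs2 = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join divs1 by symbol, divs2 by symbol; -- self-join via two aliases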
limit:
divs = load 'NYSE_dividends';
first10 = limit divs 10;
sample (draw a random sample):
divs = load 'NYSE_dividends';
some = sample divs 0.1; -- sample roughly 10% of the records
parallel:
Specifies the degree of parallelism for the reduce phase; it is appended to operators that trigger a reduce: group, order, distinct, join, cogroup, cross.
set default_parallel 10; -- sets a default that applies for the whole script
daily = load 'NYSE_daily' as (exchange, symbol, date, open);
bysymbl = group daily by symbol parallel 10;
Invoking static Java functions:
Using Java's Integer class to convert a decimal value to hexadecimal:
define hex InvokeForString('java.lang.Integer.toHexString', 'int');
divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
notnull = filter divs by volume is not null;
inhex = foreach notnull generate symbol, hex((int)volume);
flatten:
flatten splits open the contents of a tuple:
A contains: (a,(b,c))
B = foreach A generate $0, flatten($1);
B returns: (a,b,c)
flatten also opens up a bag:
A contains: ({(b,c),(d,e)})
B = foreach A generate flatten($0);
B returns: (b,c) and (d,e) as two separate records
Nested foreach:
daily = load 'NYSE_daily' as (exchange, symbol);
grpd = group daily by exchange;
uniqcnt = foreach grpd {
    sym = daily.symbol;
    uniq_sym = distinct sym;
    generate group, COUNT(uniq_sym);
};
Specifying a fragment-replicate join: using 'replicated'
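A sketch of a fragment-replicate join (schemas assumed): the last-listed input should be the small one, since it is replicated into memory on every map task:
daily = load 'NYSE_daily' as (exchange, symbol);
divs = load 'NYSE_dividends' as (exchange, symbol);
jnd = join daily by (exchange, symbol), divs by (exchange, symbol) using 'replicated';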
union:
A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:double);
C = union A, B;
describe C;
Result: C: {x: int, y: double}
cross:
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
tonsodata = cross daily, divs parallel 10;
If the inputs contain n and m records respectively, cross produces n x m output records.
Filtering records through an external Perl script, highdiv.pl (stream):
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
highdivs = stream divs through 'highdiv.pl' as (exchange, symbol, date, dividends);
mapreduce: combining a MapReduce job with Pig:
crawl = load 'webcrawl' as (url, pageid);
normalized = foreach crawl generate normalize(url);
goodurls = mapreduce 'blacklistchecker.jar' store normalized into 'input' load 'output' as (url, pageid);