pig-0.12.0-cdh5.2.0
Path: /opt/dev/pig/pig-0.12.0-cdh5.2.0
Start: pig
Stop: quit;
Environment variables
export PIG_HOME=/opt/dev/pig/pig-0.12.0-cdh5.2.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
source /etc/profile
Start
pig -x local
pig -x mapreduce
Usage:
1. Local mode (-x local): can read files from the local file system
2. MapReduce mode (-x mapreduce): can only read files on HDFS
3. Syntax
(1) load: load data
load 'filePath' [using PigStorage(',')] [as (schema)];
records = load '/opt/dev/pig/temp/t.txt' as (year: chararray,temperature: int);
records = load 'hdfs://wanggang:9000/bonc/student.txt' using PigStorage(',') as(classNo:chararray,studNo:chararray, score:int);
(2) store: store data
store data into 'filePath' [using PigStorage(':')];
store records into 'hdfs://localhost:9000/bonc/student_out' using PigStorage(':');
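PigStorage(':') writes each tuple as one line with its fields joined by ':'. A sketch of the result; the studNo and score values shown are hypothetical, not from the source data:

```pig
store records into 'hdfs://localhost:9000/bonc/student_out' using PigStorage(':');
-- the part files under student_out then contain one line per record, e.g.:
--   C01:N0101:82    (hypothetical values)
```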
(3) filter: filter (select) data
filter data by condition;
records_c01 = filter records by classNo=='C01';
(4) group: group data
group data by field [parallel 2];
parallel 2 means the job runs with 2 reduce tasks (2 parallel MapReduce reducers)
grouped_records = group records by classNo parallel 2;
grouped_records = group valid_records by year;
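As a sketch of what group produces: each output tuple pairs a group key with a bag of all matching input records. Assuming the student schema from the load example (the tuple values shown are hypothetical):

```pig
grouped_records = group records by classNo;
describe grouped_records;
-- grouped_records: {group: chararray,records: {(classNo: chararray,studNo: chararray,score: int)}}
-- each tuple looks like: (C01,{(C01,N0101,82),(C01,N0102,59)})  -- values hypothetical
```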
(5) foreach: iterate over data
foreach loops over every record in a relation and generates a new relation according to the given expression.
foreach data generate expression;
score_c01 = foreach records_c01 generate 'Teacher',$1,score;
max_temperature = foreach grouped_records generate group,MAX(valid_records.temperature);
(6) join: join data
join data by field, data by field;
r_joined = join r_student by classNo,r_teacher by classNo;
(7) cross: cross data (similar to a Cartesian product)
cross data, data;
r = cross r_student,r_teacher;
(8) order: sort data
order data by field desc|asc, ...;
r = order r_student by score desc, classNo asc;
(9) union: union data
union data, data;
r_union = union r_student, r_teacher;
(10) dump: output data
dump data;
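For example, dumping the records relation from the load example prints one tuple per line to the console (first rows shown, matching the sample data file below):

```pig
dump records;
-- (1990,21)
-- (1990,18)
-- ...
```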
(11) describe: show the schema of a relation
describe data;
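For example, describe on the records relation from the load example prints its schema:

```pig
describe records;
-- records: {year: chararray,temperature: int}
```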
4. Example:
Data file:
1990 21
1990 18
1991 21
1992 30
1992 999
1990 23
1990 18
1991 21
1992 30
1992 999
1990 23
Script:
records = load '/opt/dev/pig/temp/t.txt' as (year: chararray,temperature: int);
dump records;
describe records;
valid_records = filter records by temperature!=999;
grouped_records = group valid_records by year;
dump grouped_records;
describe grouped_records;
max_temperature = foreach grouped_records generate group,MAX(valid_records.temperature);
dump max_temperature;
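Given the data file above, after the filter drops the 999 sentinel rows the final dump should print the maximum valid temperature per year:

```pig
dump max_temperature;
-- (1990,23)
-- (1991,21)
-- (1992,30)
```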