1. Load the data:
records = load './Desktop/data.txt' using PigStorage as (year:int, temperature:int, quality:int);
2. View the data:
dump records;
3. Group records by the quality field:
grouped_records = group records by quality;
4. Aggregate the grouped data to count the weather records for each quality value:
count_records = foreach grouped_records generate group, COUNT($1);
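If the following error appears:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
it is usually caused by the wrong letter case in a function name. Function names in Pig are case-sensitive, so the COUNT above must be written in upper case.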
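As a small illustration of the case rule, the sketch below contrasts the failing and working spellings. It assumes the grouped_records relation from step 3 already exists; the alias bad_count is hypothetical and is left commented out because it would raise ERROR 1070.

```pig
-- Fails with ERROR 1070: built-in function names are case-sensitive.
-- bad_count = foreach grouped_records generate group, count($1);

-- Works: the built-in is named COUNT. Keywords such as FOREACH and
-- GENERATE, by contrast, may be written in either case.
count_records = FOREACH grouped_records GENERATE group, COUNT($1);
```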
Using join:
1. Load another relation:
temp_records = load './Desktop/data-1.txt' using PigStorage as (year:int, temperature:int);
2. Join the temp_records relation with records on year:
join_records = join records by year, temp_records by year;
3. View the data:
dump join_records;
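Because both inputs have year and temperature fields, the joined relation contains each of those names twice. A minimal sketch of how the duplicates can be told apart with the `::` prefix follows; the aliases joined_temps, temp_a and temp_b are made up for illustration.

```pig
-- join_records holds every field of both inputs, so a bare "temperature"
-- would be ambiguous; prefix the field with its source relation instead.
joined_temps = foreach join_records generate
    records::year             as year,
    records::temperature      as temp_a,
    temp_records::temperature as temp_b;

-- Print the resulting schema to confirm the new field names.
describe joined_temps;
```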
Sort records by the year column; the default order is ascending:
order_records = order records by year;
Sort in descending order:
order_records = order records by year desc;
Store the data in your own directory:
store records into './Desktop/result';
View the stored data:
cat ./Desktop/result;
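Compute the cross product (Cartesian product) of the two relations. Note that this pairs every row of records with every row of temp_records; it is not an intersection:
cross_records = cross records, temp_records;
Remove duplicate rows:
distinct_records = distinct temp_records;
Pig does not load data at the moment a load statement is entered. Execution is deferred, and the data is actually loaded only when a command that needs output, such as dump or store, is run.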
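The deferred behaviour can be observed in the Grunt shell with a sequence like the one below. This is only a sketch: the filter step, the alias good_records and the quality value 1 are assumptions added for illustration and do not come from the notes above.

```pig
-- Only adds a step to the plan; data.txt is not read at this point.
records = load './Desktop/data.txt' using PigStorage as (year:int, temperature:int, quality:int);

-- Still nothing runs; this just extends the plan (hypothetical filter value).
good_records = filter records by quality == 1;

-- Prints the schema derived from the plan, again without running a job.
describe good_records;

-- Triggers execution: the file is loaded, filtered, and printed.
dump good_records;
```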
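Using foreach (here projecting the first field, $0):
foreach_records = foreach records generate $0;
Using SUM:
1. Group the data by year:
grouped_records = group temp_records by year;
2. Compute the sum of the temperatures for each year:
sum_records = foreach grouped_records generate group, SUM(temp_records.temperature);
Flatten the bag, which turns each tuple inside it back into its own output row (I am not yet clear on the practical uses):
flatten_records = foreach grouped_records generate group, flatten(temp_records);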
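To make the effect of flatten concrete, here is a hand-worked sketch. The three input rows are invented sample data, not taken from data-1.txt, and the order of the output rows may differ on a real run.

```pig
-- Hypothetical contents of temp_records:
--   (1950,0)
--   (1950,22)
--   (1949,111)
grouped_records = group temp_records by year;
-- One row per year, with the matching tuples collected in a bag:
--   (1949,{(1949,111)})
--   (1950,{(1950,0),(1950,22)})

flatten_records = foreach grouped_records generate group, flatten(temp_records);
-- One row per tuple in each bag, with the group key repeated in front:
--   (1949,1949,111)
--   (1950,1950,0)
--   (1950,1950,22)
dump flatten_records;
```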
Using limit:
limit_records = limit records 10;
Using union:
union_records = union records, temp_records;
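One thing to keep in mind here: records has three fields and temp_records has two, and Pig's union does not require the schemas to match, so the result mixes tuples of different widths. A possible way to get a uniform result, sketched below with the hypothetical alias records_yt, is to project the shared fields first.

```pig
-- Keep only the two fields that both relations share.
records_yt = foreach records generate year, temperature;

-- Both inputs now have the schema (year:int, temperature:int).
union_records = union records_yt, temp_records;

-- Print the schema of the combined relation.
describe union_records;
```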
1. Load the data (a new records relation with name, age, and sex):
records = load './Desktop/data-2.txt' using PigStorage as (name:chararray, age:int, sex:int);
2. Group the data by sex:
temp_records = group records by sex;
3. Using the AVG (average) function:
avg_records = foreach temp_records generate group, AVG(records.age);
4. Using the CONCAT (concatenation) function:
concat_records = foreach records generate CONCAT('001-', name);
Note that CONCAT cannot be applied to an int field directly; cast it first, for example (chararray)age.
5. Using the COUNT function:
count_records = foreach temp_records generate group, COUNT(records);
6. Using the MAX function:
max_records = foreach temp_records generate group, MAX(records.age);
7. Using the MIN function:
min_records = foreach temp_records generate group, MIN(records.age);
8. Using the SIZE function to count the characters in a string:
size_records = foreach records generate SIZE(name);
9. Using the SUM function:
sum_records = foreach temp_records generate group, SUM(records.age);
10. Using the TOKENIZE (split) function:
t_records = load './Desktop/data-3.txt' as (str:chararray);
tokenize_records = foreach t_records generate TOKENIZE(str);
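TOKENIZE returns a bag of words for each input line, so it is often combined with flatten, group and COUNT from the earlier steps. The following is a minimal word-count sketch under that assumption; the aliases words, word_groups and word_counts and the field name occurrences are made up for illustration.

```pig
-- One row per word instead of one bag of words per line.
words = foreach t_records generate flatten(TOKENIZE(str)) as word;

-- Collect identical words together.
word_groups = group words by word;

-- Count how many times each word occurs.
word_counts = foreach word_groups generate group as word, COUNT(words) as occurrences;
dump word_counts;
```

This reuses the same group-then-aggregate pattern as the COUNT and SUM examples, with flatten supplying the one-word-per-row input that the grouping needs.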