Pig数据分析引擎 二 使用PigLatin语句分析数据

配置好环境
启动hadoop
--启动hadoop
start-all.sh
--查看是否处于安全模式
hdfs dfsadmin -safemode get
--启动historyServer 记录了所有的mapreduce程序的历史信息
[root@linux111 ~]# mr-jobhistory-daemon.sh start historyserver
进入pig命令提示符
[root@linux111 ~]# pig
grunt>
查看准备好的数据  hadoop input目录下的 emp.csv的文件
grunt> cat /input/emp.csv
7369,SMITH,CLERK,7902,1980/12/17,800,,20
7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
7566,JONES,MANAGER,7839,1981/4/2,2975,,20
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
7839,KING,PRESIDENT,,1981/11/17,5000,,10
7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
7900,JAMES,CLERK,7698,1981/12/3,950,,30
7902,FORD,ANALYST,7566,1981/12/3,3000,,20
7934,MILLER,CLERK,7782,1982/1/23,1300,,10
将emp.csv的文件加载为一个emp表(bag)
emp = load '/input/emp.csv';
查看emp表的结构
grunt> describe emp;
Schema for emp unknown.
如上所示 显示我们的emp表中的结构为空,因为我们直接是在加载外部csv的文件,在创建表的时候并没有指定结构

将emp.csv文件加载成一个表emp1(bag),并且指定tuple的schema
grunt> emp1 = load '/input/emp.csv' as (empno,ename,job,mgr,hiredate,sal,comm,deptno);
再次查看表结构
grunt> describe emp1;
emp1: {empno: bytearray,ename: bytearray,job: bytearray,mgr: bytearray,hiredate: bytearray,sal: bytearray,comm: bytearray,deptno: bytearray}
这个时候 就显示出来的emp1表的表结构,
注意:如果在创建bag的时候 没有指定列的数据类型  默认的数据类型是bytearray(字节数组)

查看emp1表的数据
//这里会触发一个计算,转换为mapreduce程序,在yarn容器记录中可以查看
dump emp1;
(7369,SMITH,CLERK,7902,1980/12/17,800,,20,,,,,,,)
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30,,,,,,,)
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30,,,,,,,)
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20,,,,,,,)
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30,,,,,,,)
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30,,,,,,,)
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10,,,,,,,)
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20,,,,,,,)
(7839,KING,PRESIDENT,,1981/11/17,5000,,10,,,,,,,)
(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30,,,,,,,)
(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20,,,,,,,)
(7900,JAMES,CLERK,7698,1981/12/3,950,,30,,,,,,,)
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20,,,,,,,)
(7934,MILLER,CLERK,7782,1982/1/23,1300,,10,,,,,,,)
将emp.csv文件加载成一个表emp2(bag),并且指定tuple的schema,并指定每列的数据类型
grunt> emp2 = load '/input/emp.csv' as (empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
再次查看emp2表的结构
grunt> describe emp2;
emp2: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}
然后使用dump命令 查看数据
dump emp2;
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
(,,,,,,,)
可以发现 上方emp2表中并没有数据,pig中在使用了数据类型之后,默认的分隔符是 tab 键,在tab键的情况下,数据自然查看不出

指定分隔符:使用pig的内置函数 PigStorage
emp2 = load '/input/emp.csv' using PigStorage (',') as (empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
pig在遇到表相同的情况 会自动覆盖以前的表

再次使用dump命令查看表内容
dump emp2;
(7369,SMITH,CLERK,7902,1980/12/17,800,,20)
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
(7839,KING,PRESIDENT,,1981/11/17,5000,,10)
(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
(7900,JAMES,CLERK,7698,1981/12/3,950,,30)
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
(7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
创建部门表
查看文件在本地的位置
grunt> sh ls ~/input
data.txt
dept.csv
emp.csv
使用pig的  copyFromLocal 将文件从linux 上赋值到 hadoop 上
grunt> copyFromLocal /root/input/dept.csv /input
查看部门表中有哪些数据
grunt> cat /input/dept.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
在pig上创建 dept bag
grunt> dept = load '/input/dept.csv' using PigStorage (',') as (deptno:int,dname:chararray,loc:chararray);
在Pig上进行查询操作
查询员工信息, 要求查询的列,员工号,员工姓名,薪水
SQL:select empno,ename,sal from emp2;
grunt> emp4 = foreach emp2 generate empno,ename,sal;
查询员工信息, 按sal排序
SQL:select * from emp2 order by sal;
PigLatin: emp5 = order emp2 by sal;
分组:查询每个部门工资的最大值   略微麻烦
SQL:select deptno,max(sal) from emp2 group by deptno;

PigLatin:需要分为两步操作
//先将emp2 表进行分组 按部门号 ----> emp61
emp61 = group emp2 by deptno;
//查看emp61表的结构
grunt> describe emp61;
emp61: {group: int,emp2: {(empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}
//查看表的数据
(10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),
                         (7839,KING,PRESIDENT,,1981/11/17,5000,,10),
                         (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})

                    (20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),
                         (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),
                         (7369,SMITH,CLERK,7902,1980/12/17,800,,20),
                         (7566,JONES,MANAGER,7839,1981/4/2,2975,,20),
                         (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})

                    (30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
                         (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
                         (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),
                         (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
                         (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
                         (7900,JAMES,CLERK,7698,1981/12/3,950,,30)})
//在emp61的表上在进行操作
emp62 = foreach emp61 generate group,MAX(emp2.sal);
//最后查看emp62表中最后的结果
(10,5000)
(20,3000)
(30,2850)
查询10部门员工的信息
SQL:select * from emp2 where deptno = 10;
PigLatin:emp7 = filter emp2 by deptno == 10;//注意此处为==
多表查询:查询员工信息: 员工姓名  部门名称
SQL:   select e.ename,d.dname from emp2 e,dept d where e.deptno=d.deptno;
PL:    emp81 = join dept by deptno,emp2 by deptno;
       emp82 = foreach emp81 generate dept::dname,emp2::ename
集合运算:并集  交集  差集
查询10和20号部门的员工信息
    SQL: select * from emp2 where deptno=10
         union
         select * from emp2 where deptno=20;
问题:在SQL中,是否任意的集合都可以进行集合运算?(做集合运算,对集合有要求吗?)
         有要求的:参与运算的各个集合必须列数相同且类型一致
    PL: emp10 = filter emp2 by deptno==10;
        emp20 = filter emp2 by deptno==20;
        执行集合运算: emp10_20 = union emp10,emp20;
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值