I. Pig: Overview, Installation, and Configuration
1. Originally developed at Yahoo, later donated to Apache
2. Language: Pig Latin, similar to SQL
3. Translator: Pig Latin ---> MapReduce (or Spark)
4. Installation and configuration
(1)tar -zxvf pig-0.17.0.tar.gz -C ~/training/
(2) Set environment variables: vi ~/.bash_profile
PIG_HOME=/root/training/pig-0.17.0
export PIG_HOME
PATH=$PIG_HOME/bin:$PATH
export PATH
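Reload the profile and verify the installation (a quick sanity check):
    source ~/.bash_profile
    pig -version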
Two configuration modes (run modes)
(1) Local mode: operates on the Linux file system
    Start: pig -x local
    Log: Connecting to hadoop file system at: file:///
(2) Cluster mode: connects to HDFS
    Set an environment variable pointing to the directory containing the Hadoop configuration files
    PIG_CLASSPATH=/root/training/hadoop-2.7.3/etc/hadoop
    export PIG_CLASSPATH
    Start: pig
    Log: Connecting to hadoop file system at: hdfs://bigdata11:9000
II. Common Pig Commands: Operating on HDFS
ls, cd, cat, mkdir, pwd
copyFromLocal (upload), copyToLocal (download)
sh: invokes operating-system commands
register, define =====> used with Pig user-defined functions
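A short grunt-shell session as illustration (the local file path is hypothetical):
    grunt> mkdir /scott
    grunt> copyFromLocal /root/emp.csv /scott/emp.csv
    grunt> ls /scott
    grunt> cat /scott/emp.csv
    grunt> sh date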
III. Pig's Data Model (important) ----> compare Apache Storm stream computing
Pig's data model: a bag corresponds to a table, a tuple to a row, and a field to a column.
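Literal notation, for illustration (the map line is an assumption; maps are not used below):
    field: 7369 or SMITH
    tuple: (7369,SMITH,CLERK)
    bag:   {(7369,SMITH,CLERK),(7499,ALLEN,SALESMAN)}
    map:   ['job'#'CLERK']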
IV. Analyzing and Processing Data with Pig Latin Statements
1. Requires Hadoop's HistoryServer
   mr-jobhistory-daemon.sh start historyserver
   http://192.168.157.11:19888/jobhistory
2. Common Pig Latin statements
(*) load       loads data into a bag (a table)
(*) foreach    like a loop: processes each tuple in the bag
(*) filter     equivalent to where
(*) group by   grouping
(*) join       joins
(*) generate   extracts columns
(*) union/intersect  set operations
(*) output: dump   prints directly to the screen
            store  writes the result to HDFS
Note: some statements trigger computation and some do not (see the sketch after this list).
Compare Spark operators (API methods): Transformations do not trigger computation;
                                       Actions do trigger computation.
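A minimal sketch of this lazy evaluation, reusing the emp table defined in the examples below:
    emp = load '/scott/emp.csv' using PigStorage(',') as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
    emp10 = filter emp by deptno==10;  -- no job is launched yet
    dump emp10;                        -- dump (like store) triggers the MapReduce job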
3. Examples (sample data row: 7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(1) Load the employee data into a table
    emp = load '/scott/emp.csv';
    Inspect the table structure:
    describe emp; ---> Schema for emp unknown.
(2) Load the employee data into a table, specifying a schema and types for each tuple
    emp = load '/scott/emp.csv' as(empno,ename,job,mgr,hiredate,sal,comm,deptno);
    Default data type: bytearray
    Default delimiter: tab
emp = load '/scott/emp.csv' as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
Final version:
emp = load '/scott/emp.csv' using PigStorage(',') as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
Create a department table:
dept = load '/scott/dept.csv' using PigStorage(',') as(deptno:int,dname:chararray,loc:chararray);
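Verify the schema (describe prints it in the same format as the grouped table below):
    describe emp;
    ---> emp: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}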
(3) Query employee info: empno, ename, sal
SQL: select empno,ename,sal from emp;
PL: emp3 = foreach emp generate empno,ename,sal;
(4) Query employees, sorted by monthly salary
SQL: select * from emp order by sal;
PL: emp4 = order emp by sal;
    (descending: emp4 = order emp by sal desc;)
(5) Grouping: find the maximum salary for each department
    SQL: select deptno,max(sal) from emp group by deptno;
    PL:  Step 1: group
         emp51 = group emp by deptno;
         Table structure:
emp51: {group: int,
emp: {(empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}
Data:
(10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),
(7839,KING,PRESIDENT,,1981/11/17,5000,,10),
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})
(20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),
(7369,SMITH,CLERK,7902,1980/12/17,800,,20),
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20),
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})
(30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
(7900,JAMES,CLERK,7698,1981/12/3,950,,30)})
Step 2: find the maximum salary for each department
emp52 = foreach emp51 generate group,MAX(emp.sal);
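On the data shown above, dumping emp52 should yield:
    dump emp52;
    (10,5000)
    (20,3000)
    (30,2850)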
(6) Query the employees of department 10
    SQL: select * from emp where deptno=10;
    PL:  emp6 = filter emp by deptno==10;   note: double equals
(7) Multi-table query
    Query employee info: employee name and department name
SQL: select e.ename,d.dname from emp e,dept d where e.deptno=d.deptno;
PL: emp71 = join dept by deptno,emp by deptno;
emp72 = foreach emp71 generate dept::dname,emp::ename;
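The joined columns keep the :: disambiguation, so describing the result should show:
    describe emp72;
    ---> emp72: {dept::dname: chararray,emp::ename: chararray}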
(8) Set operations. In relational databases such as Oracle, all sets in a set operation must have the same number of columns with matching types.
    Employees of departments 10 and 20
SQL: select * from emp where deptno=10
union
select * from emp where deptno=20;
PL: emp10 = filter emp by deptno==10;
emp20 = filter emp by deptno==20;
emp10_20 = union emp10,emp20;
(9) Implementing WordCount in Pig Latin (see P57)
① Load the data
   mydata = load '/data/data.txt' as (line:chararray);
② Split each line into words
   words = foreach mydata generate flatten(TOKENIZE(line)) as word;
③ Group by word
   grpd = group words by word;
④ Count the words in each group
   cntd = foreach grpd generate group,COUNT(words);
⑤ Print the result
   dump cntd;
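If /data/data.txt contains the single line "I love Beijing" (the line used in the load-function example below), the dump would print, in some order:
    (I,1)
    (love,1)
    (Beijing,1)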
V. Pig User-Defined Functions: also Java programs; Pig UDFs are more involved than Hive UDFs. There are three kinds:
1. Custom filter functions: equivalent to a where condition
2. Custom eval (computation) functions
3. Custom load functions: used by the load statement to produce a bag
   Default behavior: each line is parsed into one tuple
   Requires the MapReduce jars
Dependency jars:
1. /root/training/pig-0.17.0/pig-0.17.0-core-h2.jar
2. /root/training/pig-0.17.0/lib
3. /root/training/pig-0.17.0/lib/h2
4. $HADOOP_HOME/share/hadoop/common
5. $HADOOP_HOME/share/hadoop/common/lib
How to set up the development environment with Maven? (a sketch follows below)
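A minimal pom.xml sketch, assuming the versions installed above (the h2 classifier is an assumption matching pig-0.17.0-core-h2.jar):
    <dependencies>
        <dependency>
            <groupId>org.apache.pig</groupId>
            <artifactId>pig</artifactId>
            <version>0.17.0</version>
            <classifier>h2</classifier>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>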
Register the jar: register / define
    register /root/temp/p1.jar
    myresult3 = load '/input/data.txt' using demo.pig.MyLoadFunc();
Define an alias: define myload demo.pig.MyLoadFunc();
    then: myresult3 = load '/input/data.txt' using myload();
package demo.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Custom filter function: keep only employees whose salary is greater than 2000
public class IsSalaryTooHigh extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        /* The tuple holds the arguments passed at the call site.
         * Called from Pig Latin as:
         *   myresult1 = filter emp by demo.pig.IsSalaryTooHigh(sal)
         */
        // Extract the salary (first argument)
        int sal = (int) tuple.get(0);
        return sal > 2000;
    }
}
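Invocation sketch (jar path taken from the register example above):
    register /root/temp/p1.jar
    myresult1 = filter emp by demo.pig.IsSalaryTooHigh(sal);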
package demo.pig;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Custom eval function: determine an employee's grade from the salary.
// The generic type parameter is the type of the computed result.
public class CheckSalaryGrade extends EvalFunc<String> {

    @Override
    public String exec(Tuple tuple) throws IOException {
        // Called as: myresult2 = foreach emp generate ename,sal,demo.pig.CheckSalaryGrade(sal);
        int sal = (int) tuple.get(0);
        if (sal < 1000) return "Grade A";
        else if (sal >= 1000 && sal < 3000) return "Grade B";
        else return "Grade C";
    }
}
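Invocation sketch:
    register /root/temp/p1.jar
    myresult2 = foreach emp generate ename,sal,demo.pig.CheckSalaryGrade(sal);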
Pig custom load function
Implementation:
package demo.pig;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Custom load function: split each input line into words; each word becomes
// its own tuple, and all the tuples of one line go into a single bag
public class MyLoadFunc extends LoadFunc {

    // Holds the input stream (record reader)
    private RecordReader reader;

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Format of the input data: plain text lines
        return new TextInputFormat();
    }

    @Override
    public Tuple getNext() throws IOException {
        // Read one line from the input stream and parse it into the returned tuple
        // Example line: I love Beijing
        Tuple result = null;
        try {
            // Check whether a record was read
            if (!this.reader.nextKeyValue()) {
                // No more data
                return result; // ---> null signals end of input
            }
            // The line just read, e.g.: I love Beijing
            String data = this.reader.getCurrentValue().toString();

            // Create the tuple to return
            result = TupleFactory.getInstance().newTuple();

            // Split the line into words
            String[] words = data.split(" ");

            // Wrap each word in its own tuple, put those tuples into a bag,
            // then put the bag into the result tuple
            DataBag bag = BagFactory.getInstance().newDefaultBag();
            for (String w : words) {
                // One tuple per word
                Tuple aTuple = TupleFactory.getInstance().newTuple();
                aTuple.append(w); // put the word into the tuple
                // add the word tuple to the bag
                bag.add(aTuple);
            }
            // put the bag into the result tuple
            result.append(bag);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return result;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // reader represents the HDFS input stream
        this.reader = reader;
    }

    @Override
    public void setLocation(String path, Job job) throws IOException {
        // Input path on HDFS
        FileInputFormat.setInputPaths(job, new Path(path));
    }
}
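Usage sketch; for an input line "I love Beijing" the loader returns one tuple containing a bag of word tuples:
    register /root/temp/p1.jar
    myresult3 = load '/input/data.txt' using demo.pig.MyLoadFunc();
    dump myresult3;
    ---> ({(I),(love),(Beijing)})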