Installing and Using Pig on CentOS 7 (Beginner)

I. Pig Installation

  1. Download Pig from the official website:
    https://pig.apache.org/
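For example, with wget (the URL below assumes the Apache archive layout for the 0.17.0 release; pick whichever mirror suits you):

wget https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz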

  2. Upload and extract the tarball

tar -zxvf pig-0.17.0.tar.gz -C /home/hadoop/

  3. Configure environment variables

#set pig env
export PIG_HOME=/home/hadoop/pig-0.17.0
export PIG_CLASSPATH=$HADOOP_HOME/conf
export PATH=$PATH:$PIG_HOME/bin
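Assuming the exports above go into ~/.bash_profile (the original does not say which file; adjust to your setup), reload the environment and verify:

source ~/.bash_profile
echo $PIG_HOME
pig -version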

  4. Configure the pig.properties file
In Pig's conf folder there is a file named pig.properties, in which the various parameters shown below can be set.

[hadoop@hadoop ~]$ pig -h properties
The following properties are supported:
Logging:
verbose=true|false; default is false. This property is the same as -v switch
brief=true|false; default is false. This property is the same as -b switch
debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
aggregate.warning=true|false; default is true. If true, prints count of warnings
of each type rather than logging each warning.
Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner=true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery=true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression=true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination=true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg=true|false. Default is false.
Determines if partial aggregation is done within map phase,
before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records
by this factor, it gets disabled.
Miscellaneous:
exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs.
pig.artifacts.download.location=<path>; default is ~/.groovy/grapes
Determines the location to download the artifacts when registering jars using ivy coordinates.
Additionally, any Hadoop property can be specified.
19/04/30 15:10:01 INFO pig.Main: Pig script completed in 153 milliseconds (153 ms)
[hadoop@hadoop ~]$
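As a minimal illustration, a trimmed pig.properties for this walkthrough might contain the following (the values are examples, not recommendations from the original):

# conf/pig.properties (illustrative values)
exectype=mapreduce
stop.on.failure=true
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip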

[hadoop@hadoop ~]$ pig -version
Apache Pig version 0.17.0 (r1797386)
compiled Jun 02 2017, 15:41:58
[hadoop@hadoop ~]$

II. Pig Commands Overview
Pig has two execution modes:
Local
MapReduce

Pig execution mechanisms
Apache Pig scripts can be executed in three ways: interactive mode, batch mode, and embedded mode.
Interactive mode (Grunt shell) - You can run Apache Pig interactively using the Grunt shell, entering Pig Latin statements and getting the output (via the Dump operator).
Batch mode (script) - You can run Apache Pig in batch mode by writing Pig Latin statements in a single file with a .pig extension.
Embedded mode (UDF) - Apache Pig lets you define your own functions (UDFs, user-defined functions) in programming languages such as Java and use them in your scripts, as sketched below.
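A minimal sketch of the embedded-mode idea from the Pig Latin side; the jar name myudfs.jar and the class myudfs.ToUpper are hypothetical placeholders, not part of this walkthrough:

REGISTER myudfs.jar;
student = LOAD '/Pig_Data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
-- apply the (hypothetical) Java UDF to one column
upper_names = FOREACH student GENERATE id, myudfs.ToUpper(firstname);
Dump upper_names;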

Write a Pig script, for example:
vi sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Local mode: $ pig -x local sample_script.pig
MapReduce mode: $ pig -x mapreduce sample_script.pig

  1. Grunt Shell

1) sh command: executes shell commands from within the Grunt shell, for example:
grunt> sh ls -lsa
total 32
0 drwx------ 11 hadoop hadoop 292 Apr 30 15:08 .
0 drwxr-xr-x. 4 root root 35 Apr 30 10:48 ..
4 -rw------- 1 hadoop hadoop 3159 Apr 30 15:08 .bash_history
4 -rw-r--r-- 1 hadoop hadoop 18 Oct 31 01:07 .bash_logout

grunt> sh df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 52G 5.9G 47G 12% /
devtmpfs 2.0G 0 2.0G 0% /dev

2) fs command
With the fs command, we can invoke any FsShell command from the Grunt shell. Here the listing shows local files:

grunt> fs -ls
Found 16 items
-rw------- 1 hadoop hadoop 3159 2019-04-30 15:08 .bash_history
-rw-r--r-- 1 hadoop hadoop 18 2018-10-31 01:07 .bash_logout
-rw-r--r-- 1 hadoop hadoop 193 2018-10-31 01:07 .bash_profile
-rw-r--r-- 1 hadoop hadoop 797 2019-04-30 15:08 .bashrc
drwxrwxr-x - hadoop hadoop 18 2019-04-30 10:48 .cache

If Pig is running in mapreduce mode, fs operates on HDFS instead, as sketched below.
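For instance (using the /Pig_Data directory that is created later in this walkthrough; the listing will differ per cluster):

grunt> fs -ls /
grunt> fs -ls /Pig_Data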

3) clear: clears the screen.
4) help: lists the available commands.
5) history command
Displays a list of the statements executed since the Grunt shell was invoked.
6) set command
Use this command to assign values to the following keys (a usage example follows the exec/run commands below).
• default_parallel - sets the number of reducers for the MapReduce jobs; pass any integer as the value.
• debug - turns Pig's debug logging on or off; pass on/off to this key.
• job.name - sets the job name to the desired string.
• job.priority - sets the priority of a job; pass one of the following values: very_low, low, normal, high, very_high.
• stream.skippath - for streaming, sets the paths from which the data is not to be shipped; pass the desired paths as a string.
7) exec/run commands
With the exec or run command, we can execute a Pig script from the Grunt shell:
grunt> exec /sample_script.pig
grunt> run /sample_script.pig
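The difference is that exec runs the script in a separate context, while run executes it in the current Grunt context, so aliases defined by the script remain usable afterwards. A minimal sketch of a session combining set and run (the values are illustrative):

grunt> set default_parallel 10;
grunt> set job.name 'pig-demo-job';
grunt> set job.priority high;
grunt> run /sample_script.pig
grunt> dump student;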

III. Using and Testing Pig

  1. Create the Pig data directory:
    [hadoop@hadoop ~]$ hadoop fs -mkdir -p /Pig_Data
    [hadoop@hadoop ~]$ hadoop fs -ls /
    Found 3 items
    drwxr-xr-x - hadoop supergroup 0 2019-04-30 16:18 /Pig_Data
    drwxr-xr-x - hadoop supergroup 0 2019-04-30 16:04 /home
    drwxrwx--- - hadoop supergroup 0 2019-04-30 15:40 /tmp
    [hadoop@hadoop ~]$
    2. Prepare test data and upload it to HDFS:
    -rw-rw-r-- 1 hadoop hadoop 236 Apr 30 16:20 student_data.txt
    [hadoop@hadoop test]$ more student_data.txt
    001,Rajiv,Reddy,9848022337,Hyderabad
    002,siddarth,Battacharya,9848022338,Kolkata
    003,Rajesh,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
    006,Archana,Mishra,9848022335,Chennai.
    [hadoop@hadoop test]$

[hadoop@hadoop test]$ hadoop fs -put student_data.txt /Pig_Data
[hadoop@hadoop test]$ hadoop fs -ls /Pig_Data
Found 1 items
-rw-r--r-- 1 hadoop supergroup 236 2019-04-30 16:22 /Pig_Data/student_data.txt
[hadoop@hadoop test]$ hadoop fs -cat /Pig_Data/student_data.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
[hadoop@hadoop test]$

3. Loading and storing text files
[hadoop@hadoop test]$ pig -x mapreduce
grunt> student = load '/Pig_Data/student_data.txt' using PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,phone:chararray,city:chararray);

Load:
student = load '/Pig_Data/student_data.txt' using PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,phone:chararray,city:chararray);
Store:
store student into '/pig_output/' using PigStorage(',');

If the MapReduce job completes without errors, the store has succeeded; the result can be verified as follows.
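A quick check of the stored output (PigStorage writes part files under the output directory; the part file name below is typical for a map-only job but may vary):

[hadoop@hadoop test]$ hadoop fs -ls /pig_output/
[hadoop@hadoop test]$ hadoop fs -cat /pig_output/part-m-00000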

IV. Using and Testing Pig's Diagnostic Operators
A Load statement simply loads the data into the specified relation in Apache Pig. To verify the execution of a Load statement, you have to use the diagnostic operators. Pig Latin provides four different types of diagnostic operators:
• Dump operator
• Describe operator
• Explain operator
• Illustrate operator

1. dump: used for debugging; prints the contents of a relation
[hadoop@hadoop test]$ pig -x mapreduce
grunt> student = load '/Pig_Data/student_data.txt' using PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,phone:chararray,city:chararray);
grunt> dump student;

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai.)

2. describe: shows the schema of a relation, similar to a table structure

grunt> describe student;
student: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}
grunt>

3. explain: displays the logical, physical, and MapReduce execution plans of a relation.
grunt> explain student;

4. illustrate: gives you a step-by-step sample execution of a sequence of statements.
grunt> illustrate student;

V. Grouping and Join Operators

  1. Prepare the data

[hadoop@hadoop test]$ ll
total 20
-rw-rw-r-- 1 hadoop hadoop 10573 Apr 30 17:43 pig_1556616865324.log
-rw-rw-r-- 1 hadoop hadoop 236 Apr 30 16:20 student_data.txt
-rw-rw-r-- 1 hadoop hadoop 339 Apr 30 17:44 student_details.txt
[hadoop@hadoop test]$ hadoop fs -ls /
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2019-04-30 16:22 /Pig_Data
drwxr-xr-x - hadoop supergroup 0 2019-04-30 16:04 /home
drwxr-xr-x - hadoop supergroup 0 2019-04-30 17:18 /pig_output
drwxrwx--- - hadoop supergroup 0 2019-04-30 17:34 /tmp
[hadoop@hadoop test]$ hadoop fs -put student_details.txt /Pig_Data/
[hadoop@hadoop test]$ hadoop fs -ls /Pig_Data/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 236 2019-04-30 16:22 /Pig_Data/student_data.txt
-rw-r--r-- 1 hadoop supergroup 339 2019-04-30 17:44 /Pig_Data/student_details.txt
[hadoop@hadoop test]$

[hadoop@hadoop test]$ cat student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
[hadoop@hadoop test]$

Getting started:

grunt> student_details = LOAD '/Pig_Data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> group_data = GROUP student_details by age;
grunt> dump group_data;

2019-04-30 17:47:13,960 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

grunt> describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
grunt> group_multiple = GROUP student_details by (age, city);
grunt> group_all = GROUP student_details All;
grunt> dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
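Grouping pays off when combined with aggregation. A minimal sketch (the alias age_count is introduced here for illustration) that counts the students in each age group; with the data above it should report two students for each of the ages 21 through 24:

grunt> age_count = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS cnt;
grunt> dump age_count;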

The COGROUP operator works in much the same way as the GROUP operator. The only difference between the two is that the group operator is normally used with a single relation, while the cogroup operator is used in statements involving two or more relations.
cogroup_data = COGROUP student_details by age, employee_details by age;
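For the statement above to run, a second relation must be loaded first. A minimal sketch, assuming a hypothetical employee_details.txt (not part of this walkthrough's test data) whose third column is an age:

grunt> employee_details = LOAD '/Pig_Data/employee_details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
grunt> dump cogroup_data;

Each output tuple then holds an age followed by two bags: the matching student tuples and the matching employee tuples.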

Beyond this there are also further grouping and sorting operators, JOIN, filtering, and de-duplication, as well as built-in functions and UDFs, which you can dig into as the need arises. This concludes the document.
