1.下载软件:

wget http://apache.fayea.com/pig/pig-0.15.0/pig-0.15.0.tar.gz


2.解压

tar -zxvf pig-0.15.0.tar.gz

mv pig-0.15.0 /usr/local/

ln -s pig-0.15.0 pig


3.配置环境变量:

export PATH=PATH=$HOME/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/pig/bin:$PATH;

export PIG_CLASSPATH=/usr/local/hadoop/etc/hadoop;


4.进入grunt shell:


以本地模式登录pig: 该方式的所有文件和执行过程都在本地,一般用于测试程序

[hadoop@host61 ~]$ pig -x local

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 01:14:09 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType

2015-10-03 01:14:09,756 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 01:14:09,758 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443860049744.log

2015-10-03 01:14:10,133 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 01:14:12,648 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 01:14:12,656 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 01:14:12,685 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

2015-10-03 01:14:13,573 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

grunt> 

以Mapreduce模式登录:实际工作模式:

[hadoop@host63 ~]$ pig

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 02:11:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 02:11:55,086 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 02:11:55,087 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443863515062.log

2015-10-03 02:11:55,271 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 02:11:59,735 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:11:59,740 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 02:11:59,742 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 02:12:06,256 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 02:12:06,257 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001

2015-10-03 02:12:06,265 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> 


5.pig的运行方式有如下三种:

1.脚本

2.grunt

3.嵌入式

6.登录pig,并使用常用的命令:

[hadoop@host63 ~]$ pig

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE

15/10/03 06:01:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-03 06:01:01,412 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2015-10-03 06:01:01,413 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1443877261408.log

2015-10-03 06:01:01,502 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:03,657 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 06:01:03,662 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://host61:9000/

2015-10-03 06:01:05,968 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 06:01:05,968 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: host61:9001

2015-10-03 06:01:05,979 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> help

Commands:

<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]

[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.

-script - Explain the entire script.

-out - Store the output into directory rather than print to stdout.

-brief - Don't expand nested plans (presenting a smaller graph for overview).

-dot - Generate the output in .dot format. Default is text format.

-xml - Generate the output in .xml format. Default is text format.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

alias - Alias to explain.

dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:

exec [-param <param_name>=param_value] [-param_file <file_name>] <script> - 

Execute the script with access to grunt environment including aliases.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

run [-param <param_name>=param_value] [-param_file <file_name>] <script> - 

Execute the script with access to grunt environment. 

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

sh  <shell command> - Invoke a shell command.

kill <job_id> - Kill the hadoop job specified by the hadoop job id.

set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.

The following keys are supported: 

default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.

debug - Set debug on or off. Default is off.

job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

stream.skippath - String that contains the path. This is used by streaming.

any hadoop property.

help - Display this message.

history [-n] - Display the list statements in cache.

-n Hide line numbers. 

quit - Quit the grunt shell.

grunt> help sh

Commands:

<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.

explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]

[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.

-script - Explain the entire script.

-out - Store the output into directory rather than print to stdout.

-brief - Don't expand nested plans (presenting a smaller graph for overview).

-dot - Generate the output in .dot format. Default is text format.

-xml - Generate the output in .xml format. Default is text format.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

alias - Alias to explain.

dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:

exec [-param <param_name>=param_value] [-param_file <file_name>] <script> - 

Execute the script with access to grunt environment including aliases.

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

run [-param <param_name>=param_value] [-param_file <file_name>] <script> - 

Execute the script with access to grunt environment. 

-param <param_name - See parameter substitution for details.

-param_file <file_name> - See parameter substitution for details.

script - Script to be executed.

sh  <shell command> - Invoke a shell command.

kill <job_id> - Kill the hadoop job specified by the hadoop job id.

set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.

The following keys are supported: 

default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.

debug - Set debug on or off. Default is off.

job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

stream.skippath - String that contains the path. This is used by streaming.

any hadoop property.

help - Display this message.

history [-n] - Display the list statements in cache.

-n Hide line numbers. 

quit - Quit the grunt shell.

2015-10-03 06:02:22,264 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <EOL> "\n "" at line 2, column 8.

Was expecting one of:

"cat" ...

"clear" ...

"cd" ...

"cp" ...

"copyFromLocal" ...

"copyToLocal" ...

"dump" ...

"\\d" ...

"describe" ...

"\\de" ...

"aliases" ...

"explain" ...

"\\e" ...

"help" ...

"history" ...

"kill" ...

"ls" ...

"mv" ...

"mkdir" ...

"pwd" ...

"quit" ...

"\\q" ...

"register" ...

"using" ...

"as" ...

"rm" ...

"set" ...

"illustrate" ...

"\\i" ...

"run" ...

"exec" ...

"scriptDone" ...

<IDENTIFIER> ...

<PATH> ...

<QUOTEDSTRING> ...

Details at logfile: /home/hadoop/pig_1443877261408.log

查看当前目录:

grunt> ls

hdfs://host61:9000/user/hadoop/.Trash <dir>

查看根目录:

grunt> ls /

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

切换目录:

grunt> cd /

显示当前目录:

grunt> ls

hdfs://host61:9000/in <dir>

hdfs://host61:9000/out <dir>

hdfs://host61:9000/user <dir>

grunt> cd /in

grunt> ls

hdfs://host61:9000/in/jdk-8u60-linux-x64.tar.gz<r 3> 181238643

hdfs://host61:9000/in/mytest1.txt<r 3> 23

hdfs://host61:9000/in/mytest2.txt<r 3> 24

hdfs://host61:9000/in/mytest3.txt<r 3> 4

查看文件信息:

grunt> cat mytest1.txt

this is the first file


拷贝hdfs中的文件至操作系统:

grunt> copyToLocal /in/mytest5.txt /home/hadoop/mytest.txt

[hadoop@host63 ~]$ ls -l mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 mytest.txt

使用sh+操作系统命令可以在grunt中执行操作系统中的命令:

grunt> sh ls -l /home/hadoop/mytest.txt

-rw-r--r--. 1 hadoop hadoop 102 Oct  3 06:23 /home/hadoop/mytest.txt


7.pig的数据模型:

bag:表

tuple:行,记录

field:属性

pig不要求相同bag里面的不同tuple有相同数量或相同类型的field;


8.pig latin的常用语句:

LOAD:指出载入数据的方法;

FOREACH:逐行扫描并进行某种处理;

FILTER:过滤行;

DUMP:把结果显示到屏幕;

STORE:把结果保存到文件;


9.数据处理样例:


产生测试文件:

[hadoop@host63 tmp]$ ls -l / |awk '{if(NR != 1)print $NF"#"$5}' >/tmp/mytest.txt

[hadoop@host63 tmp]$ cat /tmp/mytest.txt

bin#4096

boot#1024

dev#3680

etc#12288

home#4096

lib#4096

lib64#12288

lost+found#16384

media#4096

mnt#4096

opt#4096

proc#0

root#4096

sbin#12288

selinux#0

srv#4096

sys#0

tmp#4096

usr#4096

var#4096

装载文件:

grunt> records = LOAD '/tmp/mytest.txt' USING PigStorage('#') AS (filename:chararray,size:int);

2015-10-03 07:35:48,479 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2015-10-03 07:35:48,480 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,497 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,716 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum

2015-10-03 07:35:48,723 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS


显示文件:

grunt> DUMP records;

(bin,4096)

(boot,1024)

(dev,3680)

(etc,12288)

(home,4096)

(lib,4096)

(lib64,12288)

(lost+found,16384)

(media,4096)

(mnt,4096)

(opt,4096)

(proc,0)

(root,4096)

(sbin,12288)

(selinux,0)

(srv,4096)

(sys,0)

(tmp,4096)

(usr,4096)

(var,4096)

显示records的结构:

grunt> DESCRIBE records;

records: {filename: chararray,size: int}

过滤记录:

grunt> filter_records =FILTER records BY size>4096;

grunt> DUMP fileter_records;

(etc,12288)

(lib64,12288)

(lost+found,16384)

(sbin,12288)

grunt> DESCRIBE filter_records;

filter_records: {filename: chararray,size: int}

分组:

grunt> group_records =GROUP records BY size;

grunt> DUMP group_records;

(0,{(sys,0),(proc,0),(selinux,0)})

(1024,{(boot,1024)})

(3680,{(dev,3680)})

(4096,{(var,4096),(usr,4096),(tmp,4096),(srv,4096),(root,4096),(opt,4096),(mnt,4096),(media,4096),(lib,4096),(home,4096),(bin,4096)})

(12288,{(etc,12288),(lib64,12288),(sbin,12288)})

(16384,{(lost+found,16384)})

grunt> DESCRIBE group_records;

group_records: {group: int,records: {(filename: chararray,size: int)}}

格式化:

grunt> format_records = FOREACH group_records GENERATE group, FLATTEN(records);

去重:

grunt> dis_records =DISTINCT records;

排序:

grunt> ord_records =ORDER dis_records BY size desc;

取前3行数据:

grunt> top_records=LIMIT ord_records 3;

求最大值:

grunt> max_records =FOREACH group_records GENERATE group,MAX(records.size);

grunt> DUMP max_records;

(0,0)

(1024,1024)

(3680,3680)

(4096,4096)

(12288,12288)

(16384,16384)

查看执行计划:

grunt> EXPLAIN max_records;

保存记录集:

grunt> STORE group_records INTO '/tmp/mytest_group';

grunt> STORE filter_records INTO '/tmp/mytest_filter';

grunt> STORE max_records INTO '/tmp/mytest_max';


10.UDF

pig支持使用java,python,javascript编写UDF