Pig: Use the Parallel Features

xiewenbo

set

Shows/Assigns values to keys used in Pig.

Syntax

set [key 'value']

Terms

key	Key (see table). Case sensitive.
value	Value for key (see table). Case sensitive.

Usage

Use the set command to assign values to keys, as shown in the table. All keys and their corresponding values (for Pig and Hadoop) are case sensitive. If the set command is used without a key/value pair argument, Pig prints all configurations and system properties.

Key	Value	Description
default_parallel	a whole number	Sets the number of reducers for all MapReduce jobs generated by Pig (see Use the Parallel Features).
debug	on/off	Turns debug-level logging on or off.
job.name	Single-quoted string that contains the job name.	Sets a user-specified name for the job.
job.priority	Acceptable values (case insensitive): very_low, low, normal, high, very_high	Sets the priority of a Pig job.
stream.skippath	String that contains the path.	For streaming, sets the path from which not to ship data (see DEFINE (UDFs, streaming) and About Auto-Ship).

All Pig and Hadoop properties can be set, either in the Pig script or via the Grunt command line.

Examples

In this example key value pairs are set at the command line.

grunt> SET debug 'on'
grunt> SET job.name 'my job'
grunt> SET default_parallel 100

In this example default_parallel is set in the Pig script; all MapReduce jobs that get launched will use 20 reducers.

SET default_parallel 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();

In this example multiple key-value pairs are set in the Pig script. Pig puts these pairs in the job configuration (job-conf), making them available to both Pig and Hadoop. This is a script-wide setting: if a key is defined multiple times in the script, the last value takes effect and is applied to all jobs generated by the script.

...
SET mapred.map.tasks.speculative.execution false; 
SET pig.logfile mylogfile.log; 
SET my.arbitrary.key my.arbitrary.value;
...

Use the Parallel Features

You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features. (The parallel features affect only the number of reduce tasks. Map parallelism is determined by the input file, with one map for each HDFS block.)

You Set the Number of Reducers

Use the set default_parallel command to set the number of reducers at the script level.

Alternatively, use the PARALLEL clause to set the number of reducers at the operator level. (In a script, the value set via the PARALLEL clause overrides any value set via "set default_parallel".) You can include the PARALLEL clause with any operator that starts a reduce phase: COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), and ORDER BY.
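
The precedence between the two settings can be sketched with a short script. This is an illustrative sketch rather than an example from the Pig documentation: it reuses the file and alias names from the other examples in this article, and the reducer counts 20 and 50 are arbitrary.

```pig
-- Script-level default: reduce phases use 20 reducers unless overridden.
SET default_parallel 20;

A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);

-- Operator-level override: the PARALLEL clause wins for this GROUP (50 reducers).
B = GROUP A BY t PARALLEL 50;
C = FOREACH B GENERATE group, COUNT(A.t) AS mycount;

-- No PARALLEL clause here, so ORDER falls back to default_parallel (20 reducers).
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();
```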

The number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 1 GB of data behaves efficiently.
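
As a rough sizing sketch based on the guideline of about 1 GB per reducer: the 40 GB figure below is a hypothetical assumption about the data, not a measured value, and the right number still depends on key distribution.

```pig
-- Hypothetical: assume the map output feeding this GROUP is roughly 40 GB.
-- At about 1 GB per reducer, about 40 reducers is a reasonable starting point.
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t PARALLEL 40;
```

If a few keys dominate the data, some reducers will still receive far more than 1 GB, so treat the count as a starting point to tune.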

Let Pig Set the Number of Reducers

If neither "set default_parallel" nor the PARALLEL clause is used, Pig sets the number of reducers using a heuristic based on the size of the input data. You can set the values for these properties:

pig.exec.reducers.bytes.per.reducer - Defines the number of input bytes per reducer; default value is 1000*1000*1000 (1 GB).
pig.exec.reducers.max - Defines the upper bound on the number of reducers; default is 999.

The formula, shown below, is very simple and will improve over time. The computed value takes all inputs within the script into account and is applied to all the jobs within the Pig script.

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer)
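
As a worked illustration of the formula, the sketch below overrides both properties and traces the arithmetic in comments. The property values, the 10 GB input size, and the output name 'mygroups' are assumptions chosen for easy arithmetic. How Pig rounds a non-integer quotient is not covered here.

```pig
-- Illustrative values, not recommendations.
SET pig.exec.reducers.bytes.per.reducer 500000000;  -- 500 MB of input per reducer
SET pig.exec.reducers.max 200;                      -- upper bound on reducers

-- Assume 'myfile.txt' is 10,000,000,000 bytes (10 GB):
--   10,000,000,000 / 500,000,000 = 20
--   #reducers = MIN(200, 20) = 20
-- Neither default_parallel nor PARALLEL is set, so the heuristic applies.
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
STORE B INTO 'mygroups' USING PigStorage();
```

With the default settings (1 GB per reducer, upper bound 999), the same assumed input would give MIN(999, 10) = 10 reducers.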

Examples

In this example PARALLEL is used with the GROUP operator.

A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;
...

In this example all the MapReduce jobs that get launched use 20 reducers.

SET default_parallel 20;
A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
STORE D INTO 'mysortedcount' USING PigStorage();
