I've recently started studying how MapReduce programs are written and how they run, and I'll post my findings here as I go. Plenty of blogs already explain how to run an MR program, so I won't repeat that. This post will be updated continuously as new material comes up.
The first problem I ran into was a permissions issue while developing MR in Eclipse:
the job failed with InvalidInputException: Input path does not exist: hdfs://master:8020/user/chenyi/test_in.
The cause is that when the job is submitted, the client picks up the current OS user's identity, and no home directory exists for that user on HDFS.
A simple workaround is:
set dfs.permissions to false.
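On the 0.20/1.x releases I'm using, this property lives in hdfs-site.xml on the NameNode (restart HDFS after changing it); note that newer Hadoop releases renamed the property to dfs.permissions.enabled. A minimal fragment:

```xml
<!-- hdfs-site.xml: disable HDFS permission checking (development only!) -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
```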
This makes development easier (in a production environment you would still keep permissions enabled). The run output looks like this:
2012-08-09 11:41:26.618 java[459:1203] Unable to load realm info from SCDynamicStore
12/08/09 11:41:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/08/09 11:41:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/08/09 11:41:27 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/09 11:41:27 INFO mapred.JobClient: Running job: job_local_0001
12/08/09 11:41:27 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/08/09 11:41:27 INFO mapred.MapTask: numReduceTasks: 1
12/08/09 11:41:27 INFO mapred.MapTask: io.sort.mb = 100
12/08/09 11:41:27 INFO mapred.MapTask: data buffer = 79691776/99614720
12/08/09 11:41:27 INFO mapred.MapTask: record buffer = 262144/327680
12/08/09 11:41:27 INFO mapred.MapTask: Starting flush of map output
12/08/09 11:41:27 INFO mapred.MapTask: Finished spill 0
12/08/09 11:41:27 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/08/09 11:41:28 INFO mapred.JobClient: map 0% reduce 0%
12/08/09 11:41:30 INFO mapred.LocalJobRunner: hdfs://10.200.187.77:8020/user/chenyi8888/test_in/sample.txt:0+529
12/08/09 11:41:30 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/08/09 11:41:30 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/08/09 11:41:30 INFO mapred.LocalJobRunner:
12/08/09 11:41:30 INFO mapred.Merger: Merging 1 sorted segments
12/08/09 11:41:30 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 24 bytes
12/08/09 11:41:30 INFO mapred.LocalJobRunner:
12/08/09 11:41:30 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/08/09 11:41:30 INFO mapred.LocalJobRunner:
12/08/09 11:41:30 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/08/09 11:41:30 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://10.200.187.77:8020/user/chenyi8888/test_out
12/08/09 11:41:31 INFO mapred.JobClient: map 100% reduce 0%
12/08/09 11:41:33 INFO mapred.LocalJobRunner: reduce > reduce
12/08/09 11:41:33 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/08/09 11:41:34 INFO mapred.JobClient: map 100% reduce 100%
12/08/09 11:41:34 INFO mapred.JobClient: Job complete: job_local_0001
12/08/09 11:41:34 INFO mapred.JobClient: Counters: 20
12/08/09 11:41:34 INFO mapred.JobClient: File Input Format Counters
12/08/09 11:41:34 INFO mapred.JobClient: Bytes Read=529
12/08/09 11:41:34 INFO mapred.JobClient: File Output Format Counters
12/08/09 11:41:34 INFO mapred.JobClient: Bytes Written=17
12/08/09 11:41:34 INFO mapred.JobClient: FileSystemCounters
12/08/09 11:41:34 INFO mapred.JobClient: FILE_BYTES_READ=45286
12/08/09 11:41:34 INFO mapred.JobClient: HDFS_BYTES_READ=1058
12/08/09 11:41:34 INFO mapred.JobClient: FILE_BYTES_WRITTEN=130144
12/08/09 11:41:34 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=17
12/08/09 11:41:34 INFO mapred.JobClient: Map-Reduce Framework
12/08/09 11:41:34 INFO mapred.JobClient: Map output materialized bytes=28
12/08/09 11:41:34 INFO mapred.JobClient: Map input records=5
12/08/09 11:41:34 INFO mapred.JobClient: Reduce shuffle bytes=0
12/08/09 11:41:34 INFO mapred.JobClient: Spilled Records=4
12/08/09 11:41:34 INFO mapred.JobClient: Map output bytes=45
12/08/09 11:41:34 INFO mapred.JobClient: Total committed heap usage (bytes)=259915776
12/08/09 11:41:34 INFO mapred.JobClient: Map input bytes=529
12/08/09 11:41:34 INFO mapred.JobClient: SPLIT_RAW_BYTES=113
12/08/09 11:41:34 INFO mapred.JobClient: Combine input records=5
12/08/09 11:41:34 INFO mapred.JobClient: Reduce input records=2
12/08/09 11:41:34 INFO mapred.JobClient: Reduce input groups=2
12/08/09 11:41:34 INFO mapred.JobClient: Combine output records=2
12/08/09 11:41:34 INFO mapred.JobClient: Reduce output records=2
12/08/09 11:41:34 INFO mapred.JobClient: Map output records=5
The second problem is how the input data format determines the corresponding key/value types, as shown in the table below:
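As a plain-Java illustration of the idea (no Hadoop dependency; the helper names below are mine, not Hadoop API), here is the shape of the records the two most common input formats hand to map(): TextInputFormat produces (byte offset, whole line), while KeyValueTextInputFormat splits each line at the first tab into (key, value). In real Hadoop code the types would be LongWritable/Text and Text/Text respectively.

```java
// Sketch of the (key, value) records the common input formats produce.
public class InputFormatDemo {

    // TextInputFormat: key = byte offset of the line, value = the line itself.
    static Object[] textRecord(long offset, String line) {
        return new Object[] { offset, line };
    }

    // KeyValueTextInputFormat: key = text before the first tab, value = the rest.
    static String[] keyValueRecord(String line) {
        int i = line.indexOf('\t');
        return i < 0 ? new String[] { line, "" }
                     : new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        Object[] t = textRecord(0L, "0067011990...");
        System.out.println(t[0] + " -> " + t[1]);

        String[] kv = keyValueRecord("1950\t+0000");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```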
Problem three: what the key passed to map() holds. The value is simple: it is one line of the file.
In my recent examples the map key has hardly been used (since the job uses TextInputFormat), so what does it actually contain? Stepping through with the debugger shows it holds the byte offset of the record within the data split. For example, given these 5 records:
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
map is called 5 times, and the key changes on each call as follows:
1st call: 0
2nd call: 106
3rd call: 212
4th call: 318
5th call: 424
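These keys can be reproduced outside Hadoop: each sample record is 105 characters plus a newline, i.e. 106 bytes, which is exactly the stride between successive keys. A small sketch (pure Java; class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class OffsetKeys {
    // Returns the byte offset at which each line starts -- the same values
    // TextInputFormat passes to map() as the key.
    static long[] offsets(String[] lines) {
        long[] keys = new long[lines.length];
        long pos = 0;
        for (int i = 0; i < lines.length; i++) {
            keys[i] = pos;
            // +1 accounts for the '\n' line terminator
            pos += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return keys;
    }

    public static void main(String[] args) {
        // Stand-ins for the 5 records above: each is 105 characters long.
        String[] lines = new String[5];
        for (int i = 0; i < 5; i++) lines[i] = "9".repeat(105);
        for (long k : offsets(lines)) System.out.println(k);
        // Prints 0, 106, 212, 318, 424 -- matching the keys seen in the debugger.
    }
}
```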
To be continued…