There is a formatted data file whose columns are separated by \t, and whose second column is the product name. The requirement is to split this data file into multiple files by product name. How can this be implemented with a MapReduce program?
Original file:
[root@localhost opt]# cat aprData
1 a1 a111
2 a2 a211
3 a1 a112
4 a1 a112
5 a1 a112
6 a1 a112
7 a2 a112
8 a2 a112
9 a2 a112
10 a3 a113
Approach:
1. Use one MapReduce job to find all distinct product names:
1.1 The map output <k2, v2> is <product name, null>
1.2 The reduce output <k3, v3> is <product name, null>
Implementation: the AprProduces class (a sketch is given below).
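The source for AprProduces is not listed here, so the following is a minimal sketch of what it might look like, assuming the org.apache.hadoop.mapreduce API on the Hadoop 1.x release shown in the log. The class name AprProduces and the input/output paths come from the text; the mapper and reducer names (ProduceMapper, ProduceReducer) and the job name are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AprProduces {

    // Map: split each line on \t and emit <product name, null>.
    public static class ProduceMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text product = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {              // the 2nd column is the product name
                product.set(fields[1]);
                context.write(product, NullWritable.get());
            }
        }
    }

    // Reduce: identical product names arrive grouped under one key, so writing
    // the key once per group yields the distinct product list <product name, null>.
    public static class ProduceReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "apr-produces");   // Hadoop 1.x style Job constructor
        job.setJarByClass(AprProduces.class);

        job.setMapperClass(ProduceMapper.class);
        job.setReducerClass(ProduceReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /aprData
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /aprProduce-output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using NullWritable as the value keeps the shuffle payload minimal; the de-duplication itself comes from the framework grouping identical keys before the reduce call.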
[root@localhost opt]# hadoop jar apr-produces.jar /aprData /aprProduce-output
Warning: $HADOOP_HOME is deprecated.
16/05/01 15:00:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/05/01 15:00:12 INFO input.FileInputFormat: Total input paths to process : 1
16/05/01 15:00:12 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/05/01 15:00:12 WARN snappy.LoadSnappy: Snappy native library not loaded
16/05/01 15:00:13 INFO mapred.JobClient: Running job: job_201605010048_0020
16/05/01 15:00:14 INFO mapred.JobClient: map 0% reduce 0%
16/05/01 15:00:33 INFO mapred.JobClient: map 100% reduce 0%
16/05/01 15:00:45 INFO mapred.JobClient: map 100% reduce 100%
16/05/01 15:00:50 INFO mapred.JobClient: Job complete: job_201605010048_0020
16/05/01 15:00:50 INFO mapred.JobClient: Counters: 29
16/05/01 15:00:50 INFO mapred.JobClient: Map-Reduce Framework
16/05/01 15:00:50 INFO mapred.JobClient: Spilled Records=20
16/05/01 15:00:50 INFO mapred.JobClient: Map output materialized bytes=56
16/05/01 15:00:50 INFO mapred.JobClient: Reduce input records=10
16/05/01 15:00:50 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3868389376
16/05/01 15:00:50 INFO mapred.JobClient: Map input records=10
16/05/01 15:00:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=89
16/05/01 15:00:50 INFO mapred.JobClient: Map output bytes=30
16/05/01 15:00:50 INFO mapred.JobClient: Reduce shuffle bytes=56
16/05/01 15:00:50 INFO mapred.JobClient: Physical memory (bytes) snapshot=240697344
16/05/01 15:00:50 INFO mapred.JobClient: Reduce input groups=3
16/05/01 15:00:50 INFO mapred.JobClient: Combine output records=0
16/05/01 15:00:50 INFO mapred.JobClient: Reduce output records=3
16/05/01 15:00:50 INFO mapred.JobClient: Map output records=10
16/05/01 15:00:50 INFO mapred.JobClient: Combine input records=0
16/05/01 15:00:50 INFO mapred.JobClient: CPU time spent (ms)=1490
16/05/01 15:00:50 INFO mapred.JobClient: Total committed heap usage (bytes)=177016832
16/05/01 15:00:50 INFO mapred.JobClient: File Input Format Counters
16/05/01 15:00:50 INFO mapred.JobClient: Bytes Read=101
16/05/01 15:00:50 INFO mapred.JobClient: FileSystemCounters
16/05/01 15:00:50 INFO mapred.JobClient: HDFS_BYTES_READ=190
16/05/01 15:00:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43049
16/05/01 15:00:50 INFO mapred.JobClient: FILE_BYTES_READ=56
16/05/01 15:00:50 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
16/05/01 15:00:50 INFO mapred.JobClient: Job Counters
16/05/01 15:00:50 INFO mapred.JobClient: Launched map tasks=1
16/05/01 15:00:50 INFO mapred.JobClient: Launched reduce tasks=1
16/05/01 15:00:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11002
16/05/01 15:00:50 INFO mapre
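Assuming the job's single reducer wrote the default part-r-00000 file, the distinct product names can be read back with hadoop fs -cat; from the sample data above (and the Reduce output records=3 counter), the expected content is a1, a2 and a3:

hadoop fs -cat /aprProduce-output/part-r-00000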