Prerequisites:
http://blog.csdn.net/wuxidemo/article/details/77115931
Set up the distributed deployment as described in the link above, and start it.
Developing a MapReduce program in IDEA:
1. Create a new Maven project and add the Hadoop-related dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.yf</groupId>
    <artifactId>mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>3.0.0-alpha4</hadoop.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
2. Write the MapReduce code:
package com.yf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.HashSet;

public class InvertedIndexMapReduce {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text documentId;
        private Text word = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Use the name of the input file as the document id for every record in this split
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            documentId = new Text(filename);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit a (word, documentId) pair for every whitespace-separated token
            for (String token : StringUtils.split(value.toString())) {
                word.set(token);
                context.write(word, documentId);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private Text docIds = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Deduplicate document ids; Hadoop reuses the Text instance, so copy each value
            HashSet<Text> uniqueDocIds = new HashSet<Text>();
            for (Text docId : values) {
                uniqueDocIds.add(new Text(docId));
            }
            docIds.set(StringUtils.join(uniqueDocIds, ","));
            context.write(key, docIds);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertedIndexMapReduce.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove any stale output so the job does not fail on an existing directory
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Submit the job and block until it completes
        job.waitForCompletion(true);
    }
}
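Before deploying, the map/shuffle/reduce flow above can be sanity-checked locally in plain Java, with no Hadoop dependency. This is only a sketch of the same logic (tokenize, group by word, deduplicate document ids); the file contents are an assumption inferred from the sample output at the end of this post:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // Simulates: map emits (word, docId), shuffle groups by word, reduce dedupes ids.
    public static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>(); // sorted keys, like the job output
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {        // "map" phase
                index.computeIfAbsent(word, k -> new LinkedHashSet<>())
                     .add(doc.getKey());                              // "reduce" dedupes
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "cat sat mat"); // hypothetical contents
        docs.put("file2.txt", "cat sat dog"); // hypothetical contents
        buildIndex(docs).forEach((word, ids) ->
                System.out.println(word + "\t" + String.join(",", ids)));
    }
}
```

With those inputs this prints the same four lines as the cluster run shown later (cat, dog, mat, sat with their file lists).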
3. Compile and package into a jar
Run mvn package in the project root directory; the jar is produced under target/.
4. Run it in the Hadoop environment
Copy the jar to the CentOS machine running Hadoop.
Create a few txt files and upload them to the /input directory on HDFS, for example:
hadoop fs -put file1.txt /input
hadoop fs -put file2.txt /input
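The contents of file1.txt and file2.txt are not shown above; the lines below are hypothetical contents, inferred from the final output at the end of this post, that reproduce that result. Any whitespace-separated text files would work:

```shell
# Hypothetical sample inputs (one line per file, three words each)
echo "cat sat mat" > file1.txt
echo "cat sat dog" > file2.txt
```

These are the files uploaded with the hadoop fs -put commands above.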
Run: hadoop jar xxx.jar com.yf.InvertedIndexMapReduce /input/*.txt /output
The run log looks roughly like this:
2017-08-14 20:46:36,031 INFO client.RMProxy: Connecting to ResourceManager at hadoop.master/192.168.0.116:8040
2017-08-14 20:46:37,022 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-08-14 20:46:37,459 INFO input.FileInputFormat: Total input files to process : 2
2017-08-14 20:46:37,614 INFO mapreduce.JobSubmitter: number of splits:2
2017-08-14 20:46:37,829 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2017-08-14 20:46:38,164 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1502708248236_0013
2017-08-14 20:46:38,517 INFO impl.YarnClientImpl: Submitted application application_1502708248236_0013
2017-08-14 20:46:38,645 INFO mapreduce.Job: The url to track the job: http://hadoop.master:8088/proxy/application_1502708248236_0013/
2017-08-14 20:46:38,646 INFO mapreduce.Job: Running job: job_1502708248236_0013
2017-08-14 20:46:48,992 INFO mapreduce.Job: Job job_1502708248236_0013 running in uber mode : false
2017-08-14 20:46:48,995 INFO mapreduce.Job: map 0% reduce 0%
......
2017-08-14 20:47:09,551 INFO mapreduce.Job: map 100% reduce 0%
......
2017-08-14 20:47:34,000 INFO mapreduce.Job: map 100% reduce 100%
2017-08-14 20:47:34,026 INFO mapreduce.Job: Job job_1502708248236_0013 completed successfully
2017-08-14 20:47:34,338 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=102
FILE: Number of bytes written=570254
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=224
HDFS: Number of bytes written=76
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed map tasks=4
Failed reduce tasks=1
Killed map tasks=1
Launched map tasks=6
Launched reduce tasks=2
Other local map tasks=4
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=48685
Total time spent by all reduces in occupied slots (ms)=17000
Total time spent by all map tasks (ms)=48685
Total time spent by all reduce tasks (ms)=17000
Total vcore-milliseconds taken by all map tasks=48685
Total vcore-milliseconds taken by all reduce tasks=17000
Total megabyte-milliseconds taken by all map tasks=49853440
Total megabyte-milliseconds taken by all reduce tasks=17408000
Map-Reduce Framework
Map input records=2
Map output records=6
Map output bytes=84
Map output materialized bytes=108
Input split bytes=200
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=108
Reduce input records=6
Reduce output records=4
Spilled Records=12
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=389
CPU time spent (ms)=2260
Physical memory (bytes) snapshot=564514816
Virtual memory (bytes) snapshot=7572033536
Total committed heap usage (bytes)=301146112
Peak Map Physical memory (bytes)=224354304
Peak Map Virtual memory (bytes)=2521751552
Peak Reduce Physical memory (bytes)=117166080
Peak Reduce Virtual memory (bytes)=2528530432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=24
File Output Format Counters
Bytes Written=76
Now inspect the output directory on HDFS; the results are as follows:
[glsc@hadoop java_app]$ hadoop fs -ls /output
Found 2 items
-rw-r--r-- 2 glsc supergroup 0 2017-08-14 20:47 /output/_SUCCESS
-rw-r--r-- 2 glsc supergroup 76 2017-08-14 20:47 /output/part-r-00000
[glsc@hadoop java_app]$ hadoop fs -cat /output/part-r-00000
cat file1.txt,file2.txt
dog file2.txt
mat file1.txt
sat file1.txt,file2.txt
This example builds an inverted index, and the output above shows it ran successfully.
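One loose end: the log above warns "Implement the Tool interface and execute your application with ToolRunner". The warning can be addressed by wrapping the driver as below. This is an untested sketch against the same Hadoop APIs used earlier (the class name InvertedIndexDriver is an invention of this sketch); ToolRunner parses generic options such as -D, -files, and -libjars before run() sees the remaining arguments:

```java
package com.yf;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Tool-based driver: reuses the Map and Reduce classes defined above.
public class InvertedIndexDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries any -D overrides parsed by ToolRunner
        Job job = Job.getInstance(getConf());
        job.setJarByClass(InvertedIndexMapReduce.class);
        job.setMapperClass(InvertedIndexMapReduce.Map.class);
        job.setReducerClass(InvertedIndexMapReduce.Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        Path outputPath = new Path(args[1]);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(getConf()).delete(outputPath, true);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new InvertedIndexDriver(), args));
    }
}
```

It would be launched the same way, e.g. hadoop jar xxx.jar com.yf.InvertedIndexDriver /input/*.txt /output, and the JobResourceUploader warning should disappear.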