To develop the WordCount example in IDEA, first create a Maven project. The dependencies to add to the pom.xml are:
<repositories>
    <repository>
        <id>apache</id>
        <url>http://maven.apache.org</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <configuration>
                <excludeTransitive>false</excludeTransitive>
                <stripVersion>true</stripVersion>
                <outputDirectory>./lib</outputDirectory>
            </configuration>
        </plugin>
    </plugins>
</build>
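The maven-dependency-plugin configured above copies the project's dependency jars into ./lib. It is invoked with the plugin's standard copy-dependencies goal:

mvn dependency:copy-dependencies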
After adding the dependencies, click the project you created and choose Open Module Settings, as shown below;
Next, import the Hadoop packages, as shown in the figure below:
Select your Hadoop installation path, then choose the folders shown below and confirm to import them.
之后点击配置,配置本项目的文件输入路径和输出路径,在program arguments中前一个为文件输入路径,后一个为输出路径,当然,此时的路径均为hdfs集群路径,应该将创建的文件夹上传到hdfs集群中,然后把该路径写入,否则会报找不到文件路径的错误,出错解决办法参考我上一篇博客。
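For example, assuming a NameNode listening at hdfs://localhost:9000 (a placeholder; substitute your cluster's address and your own directories), the two Program arguments might look like:

hdfs://localhost:9000/user/hadoop/wordcount/input hdfs://localhost:9000/user/hadoop/wordcount/output

The input files can be uploaded beforehand with the standard HDFS shell:

hdfs dfs -mkdir -p /user/hadoop/wordcount/input
hdfs dfs -put ./input/*.txt /user/hadoop/wordcount/input

Also note that the output directory must not exist before the job runs; Hadoop refuses to start a job whose output path already exists.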
Once the configuration is done, add the core-site.xml configuration file to the project, as shown in the figure below:
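For reference, a minimal core-site.xml is sketched below. fs.defaultFS is the standard Hadoop property naming the default file system; the host and port here are placeholders that must match your own cluster:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>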
With all the configuration in place, you can start coding. First create a Java class named WordCount; the complete code is shown below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    // TokenizerMapper extends Mapper and splits each input line into words
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        // the constant 1, emitted as the value for every word
        public static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // map: tokenize the input line (value) and emit a (word, 1) pair for each token
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer str = new StringTokenizer(value.toString());
            while (str.hasMoreTokens()) {
                word.set(str.nextToken());
                context.write(word, one);
            }
        }
    }

    // IntSumReducer extends Reducer and sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // reduce: add up all the counts emitted for a given word and write the total
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // main: configure and submit the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input path for the job (first program argument)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output path for the job (second program argument)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
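Note that IntSumReducer also serves as the combiner: because addition is associative and commutative, partial sums computed on the map side do not change the final result. As a concrete illustration (a hypothetical input, not from the original post), a file containing the single line

hello world hello

would produce the output

hello	2
world	1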
Once everything is ready, click Run to get the results. Of course, the Hadoop cluster must be started before running.
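If the cluster is not already up, it can be started with the standard Hadoop scripts, and the job's result can be inspected afterwards (the output path below assumes the example arguments used earlier):

start-dfs.sh
start-yarn.sh
hdfs dfs -cat /user/hadoop/wordcount/output/part-r-00000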