Preface:
The company is starting a big data project. Before diving in, I was asked to set up a three-node Hadoop + Hive cluster for a technical feasibility study. Once it was up, I wanted to check whether a MapReduce program would actually run. I had always used Eclipse as my development tool, but my current colleagues all use IDEA; after trying it I took to it immediately, so I decided to set up a Hadoop development environment in IDEA.
Note:
The IDEA Hadoop project is managed with Maven, and the MapReduce program runs on Windows, i.e. in local mode, so there is no need to install Hadoop locally or configure HADOOP_HOME.
To submit jobs cross-platform (from the Windows client to the Linux cluster), add the following property to the client-side configuration, typically mapred-site.xml:
<property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
</property>
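The same switch can also be set programmatically on the job's Configuration, which avoids editing any XML at all. A minimal sketch (the helper class name CrossPlatformConf is just for illustration):

import org.apache.hadoop.conf.Configuration;

public class CrossPlatformConf {

    // Returns a Configuration prepared for submitting a job from a Windows
    // client to a Linux cluster; equivalent to the XML property above.
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.app-submission.cross-platform", "true");
        return conf;
    }
}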
Project creation steps:
1. Create a new Maven project and click Next.
2. Fill in the project name, then click Next.
3. Click Finish.
pom.xml contents:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>MapReduceTest</artifactId>
        <groupId>com.qi</groupId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../../../pom.xml</relativePath>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>wordcount</artifactId>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
Java code:
package com.qi.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        public static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1) for every token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }
        }
    }

    public static class IntSumReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            // Sum all counts emitted for the same word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // For a local input/output directory, leave this commented out;
        // using the hostname "master" requires a host mapping in the Windows hosts file
        //conf.set("fs.defaultFS", "hdfs://master:9000");
        // The exception on the first run was caused by this property not being set:
        // submit the job as user "root" instead of the current Windows user
        System.setProperty("HADOOP_USER_NAME", "root");
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // For convenience the input and output paths are hard-coded;
        // they could also be passed in via Run -> Edit Configurations
        FileInputFormat.addInputPath(job, new Path("input"));
        // Delete the output directory if it already exists,
        // so it does not have to be removed by hand before every run
        FileSystem.get(conf).delete(new Path("output"), true);
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Create an input directory in the project root, add a words.txt text file to it, and put some test data in it.
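Any whitespace-separated text will do; for example, words.txt could contain made-up sample lines like:

hello hadoop
hello hive
hadoop mapreduce
hello world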
Copy the log4j file from the cluster's etc/hadoop directory into the project's resources directory.
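If you prefer not to copy the cluster's file, a minimal log4j.properties with a console appender placed in src/main/resources is enough to make the job's log output visible; this is just a sketch, not the cluster's exact file:

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n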
- Local directory + local run
- HDFS directory + local run
Local directory + local run
The first run threw an error.
At first I thought I had written the input path incorrectly; I had been using a relative path, so I switched to an absolute one, but it still failed. After setting breakpoints and debugging, I pinned down the cause of the exception: a locally submitted job is submitted as the current Windows user, and user/<username> is automatically prepended to the input path on every run (my Windows username is Chinese, so it showed up as garbled characters).
After adding System.setProperty("HADOOP_USER_NAME", "root"); so that the job is submitted as user root from Windows, it ran successfully.
Run result:
HDFS directory + local run
The input directory on HDFS:
Written this way, user/<username> still gets prepended to input.
Written this way, the job runs successfully, but user/root also gets prepended to output.
This way (an absolute path) is the correct one; alternatively, write the full path hdfs://192.168.0.xxx:9000/input.
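Put together, a sketch of how the relevant lines of the main() method above change for the HDFS case (192.168.0.xxx stands in for the actual NameNode address; the leading-slash and full-URI forms are equivalent):

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://192.168.0.xxx:9000"); // point the client at HDFS
System.setProperty("HADOOP_USER_NAME", "root");        // submit as root, as before

Job job = Job.getInstance(conf, "WordCount");
// A relative path such as new Path("input") would resolve to /user/root/input;
// an absolute path (or the full hdfs:// URI) avoids the user/<name> prefix:
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
// equivalent: new Path("hdfs://192.168.0.xxx:9000/input")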
The run result is as follows: