String Sorting in MapReduce
1. Requirements
- Goal: implement a custom string sort rule that reverses the default dictionary order. Starting from the standard wordcount program, modify it so that lowercase words come before uppercase words.
2. Test Data
3. Approach
- By default, Hadoop's Text keys sort in byte order, which for English letters puts uppercase before lowercase (A-Z before a-z). To reverse this, we only need a class that extends Text.Comparator and overrides its compare method.
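The effect of negating the comparison can be previewed with plain Java strings, since Text compares UTF-8 bytes and ASCII uppercase letters (65-90) precede lowercase letters (97-122). This is a standalone illustration, not part of the MapReduce job:

```java
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        String[] words = {"apple", "Apple", "banana", "Banana"};

        // Default lexicographic (byte-wise) order: uppercase sorts first
        Arrays.sort(words);
        System.out.println(Arrays.toString(words));
        // → [Apple, Banana, apple, banana]

        // Negating the comparison reverses the order: lowercase sorts first
        Arrays.sort(words, (a, b) -> -a.compareTo(b));
        System.out.println(Arrays.toString(words));
        // → [banana, apple, Banana, Apple]
    }
}
```

Note that negation produces the fully reversed order, which is exactly what the requirement asks for: lowercase before uppercase.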
4. Implementation Steps
- Create a Maven project in IDEA or Eclipse
- Add the Hadoop dependencies to pom.xml:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.3</version>
</dependency>
```
- Add a log4j.properties file under the resources directory with the following content (note that the appender Threshold takes a single level, not a list):

```properties
### Root logger ###
log4j.rootLogger = debug,console,fileAppender

### Console appender ###
log4j.appender.console = org.apache.log4j.ConsoleAppender
log4j.appender.console.Target = System.out
log4j.appender.console.layout = org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern = %d{ABSOLUTE} %5p %c:%L - %m%n

### File appender ###
log4j.appender.fileAppender = org.apache.log4j.FileAppender
log4j.appender.fileAppender.File = logs/logs.log
log4j.appender.fileAppender.Append = false
log4j.appender.fileAppender.Threshold = DEBUG
log4j.appender.fileAppender.layout = org.apache.log4j.PatternLayout
log4j.appender.fileAppender.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n
```
- Write the comparator class TextKeySortComparable.java, overriding compare to implement the custom sort order:

```java
import org.apache.hadoop.io.Text;

public class TextKeySortComparable extends Text.Comparator {
    /**
     * The default order is byte-wise, i.e. uppercase before lowercase.
     * Negating the result reverses it, so lowercase sorts first.
     */
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}
```
- Write the custom Mapper class
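The post leaves the Mapper to the reader. A conventional wordcount mapper consistent with the driver's type settings (Text map-output keys, IntWritable values) might look like the following sketch; splitting on whitespace is an assumption about the input format:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {   // skip empty tokens from leading spaces
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```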
- Write the custom Reducer class
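The Reducer is also not shown in the post. A minimal sketch that sums the counts for each key, matching the driver's output types, could be:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts per word; keys arrive in the order defined by TextKeySortComparable.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```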
- Write the Driver class:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.util.Random;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Register the custom sort comparator
        job.setSortComparatorClass(TextKeySortComparable.class);
        FileInputFormat.setInputPaths(job, new Path("D:\\word.txt"));
        FileOutputFormat.setOutputPath(job, new Path(getOutputDir()));
        boolean result = job.waitForCompletion(true);
        if (result)
            System.out.println("Job succeeded");
        else
            System.out.println("Job failed");
    }

    // Generates a fresh random output directory so reruns do not collide
    public static String getOutputDir() {
        String prefix = "F:\\NIIT\\hadoopOnWindow\\output\\";
        long time = System.currentTimeMillis();
        int random = new Random().nextInt(1000);
        return prefix + "result_" + time + "_" + random;
    }
}
```
- Run the job locally and verify that the output is correct
5. Package and Run on the Cluster
- Upload the test data to the datas directory in HDFS
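The upload might look like the following; the file name test.txt matches the submit command used later, so adjust it to your actual data file:

```shell
# Create the target directory in HDFS (no-op if it already exists)
hdfs dfs -mkdir -p /datas
# Upload the local test file, overwriting any previous copy
hdfs dfs -put -f test.txt /datas/
# Confirm the file landed
hdfs dfs -ls /datas
```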
- Once the local run is verified, change the Driver's input/output paths to take command-line arguments:

```java
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```
- To build the jar, configure the assembly plugin in pom.xml:

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <!-- Use Maven's predefined descriptor -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- Bind to the package phase -->
                    <phase>package</phase>
                    <goals>
                        <!-- Run once -->
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
Then run mvn package; the assembled jar appears under the target directory.
- Submit the job to the cluster with the following command:

```shell
hadoop jar packagedemo-1.0-SNAPSHOT.jar com.niit.mr.WordCountJob /datas/test.txt /output/wc/
```
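Once the job finishes, the sorted counts can be inspected directly from HDFS; the part file name below is the usual default for a single reducer and may differ in your setup:

```shell
# Show the first lines of the result: lowercase words should come first
hdfs dfs -cat /output/wc/part-r-00000 | head -n 20
```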
That completes all the steps. Give it a try, and good luck!