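I. Sorting
1. Default sorting
During a MapReduce job, the framework automatically sorts the key-value pairs emitted by the mapper by key, using the default ordering of the key type.
Rules: numeric keys (for example IntWritable) are sorted in ascending order; string keys (Text) are sorted in dictionary order, a < b < c …
When the automatic sort happens: 1. After a MapTask emits its key-value pairs, it sorts them by key and only then writes the sorted pairs to local disk. 2. After a ReduceTask has fetched the local intermediate results of n MapTasks, its merge phase combines those temporary files into one large file and sorts all the keys once more with a merge sort.
MapTask-side sort: each MapTask sorts only the keys of its own data. The tasks run in parallel, so the partial sorts happen at the same time, which is efficient.
ReduceTask-side sort: every input is already sorted, so a merge sort is enough. It needs fewer key comparisons than sorting from scratch, which is efficient.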
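The two sorting phases can be illustrated with plain Java collections. This is only a sketch, not Hadoop code: the class `TwoPhaseSortSketch` and its methods are hypothetical names, each inner list stands in for the output of one MapTask, and a `PriorityQueue` stands in for the reduce-side merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TwoPhaseSortSketch {
    // Map side: every "MapTask" sorts its own keys independently.
    static List<List<Integer>> sortLocally(List<List<Integer>> mapOutputs) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            List<Integer> copy = new ArrayList<>(output);
            Collections.sort(copy);
            sorted.add(copy);
        }
        return sorted;
    }

    // Reduce side: k-way merge of runs that are already sorted.
    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        // Each queue entry is {value, run index, position inside the run}.
        PriorityQueue<int[]> heads = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) {
                heads.add(new int[] {sortedRuns.get(run).get(0), run, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged.add(head[0]);
            List<Integer> run = sortedRuns.get(head[1]);
            int next = head[2] + 1;
            if (next < run.size()) {
                heads.add(new int[] {run.get(next), head[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(5, 1, 9), Arrays.asList(4, 8), Arrays.asList(7, 2, 6, 3));
        System.out.println(merge(sortLocally(mapOutputs)));
    }
}
```

The merge only ever compares the current head of each run, which is why the reduce side does not need a full sort.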
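Default sorting: the order is decided by the compareTo method of the mapper's output key type. For the built-in types this is ascending order.
For example, once the type of the output key has been declared, the framework calls the compareTo method of that type automatically when it sorts. Only keys are sorted, not values. For IntWritable, the raw Comparator registered in the same class produces the same order by comparing the serialized bytes.
The IntWritable source: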
public class IntWritable implements WritableComparable<IntWritable> {
private int value;
public IntWritable() {}
public IntWritable(int value) {this.set(value);}
public void set(int value) {
this.value = value;
}
public int get() {
return this.value;
}
public void readFields(DataInput in) throws IOException {
this.value = in.readInt();
}
public void write(DataOutput out) throws IOException {
out.writeInt(this.value);
}
public boolean equals(Object o) {
if (!(o instanceof IntWritable)) {
return false;
} else {
IntWritable other = (IntWritable)o;
return this.value == other.value;
}
}
public int hashCode() {
return this.value;
}
public int compareTo(IntWritable o) {
int thisValue = this.value;
int thatValue = o.value;
return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
}
public String toString() {
return Integer.toString(this.value);
}
static {
WritableComparator.define(IntWritable.class, new IntWritable.Comparator());
}
public static class Comparator extends WritableComparator {
public Comparator() {
super(IntWritable.class);
}
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
int thisValue = readInt(b1, s1);
int thatValue = readInt(b2, s2);
return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
}
}
}
public int compareTo(IntWritable o) {
int thisValue = this.value;
int thatValue = o.value;
return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
}
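The nested Comparator in IntWritable compares two keys while they are still serialized bytes. The following sketch reproduces that idea with the JDK only. `RawIntCompareSketch` is a hypothetical class, and `ByteBuffer` is used here as a stand-in for Hadoop's `readInt` helper; both read a 4-byte big-endian int.

```java
import java.nio.ByteBuffer;

class RawIntCompareSketch {
    // Produce the same 4-byte big-endian layout that DataOutput.writeInt writes.
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Compare two serialized ints directly on the bytes, without creating key objects.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        int thisValue = ByteBuffer.wrap(b1, s1, 4).getInt();
        int thatValue = ByteBuffer.wrap(b2, s2, 4).getInt();
        return Integer.compare(thisValue, thatValue);
    }

    public static void main(String[] args) {
        System.out.println(compareRaw(serialize(3), 0, serialize(10), 0));  // negative: 3 sorts first
        System.out.println(compareRaw(serialize(-1), 0, serialize(1), 0));  // negative: -1 sorts first
    }
}
```

Skipping deserialization is the reason a raw comparator is worth registering for a key type that is compared very often.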
compareTo comes from the java.lang.Comparable interface. The documentation comment on Comparable describes its contract:
/*
* Copyright (c) 1997, 2013, Oracle and/or its affiliates. All rights reserved.
* ORACLE PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
*/
package java.lang;
import java.util.*;
/**
* This interface imposes a total ordering on the objects of each class that
* implements it. This ordering is referred to as the class's <i>natural
* ordering</i>, and the class's <tt>compareTo</tt> method is referred to as
* its <i>natural comparison method</i>.<p>
*
* Lists (and arrays) of objects that implement this interface can be sorted
* automatically by {@link Collections#sort(List) Collections.sort} (and
* {@link Arrays#sort(Object[]) Arrays.sort}). Objects that implement this
* interface can be used as keys in a {@linkplain SortedMap sorted map} or as
* elements in a {@linkplain SortedSet sorted set}, without the need to
* specify a {@linkplain Comparator comparator}.<p>
*
* The natural ordering for a class <tt>C</tt> is said to be <i>consistent
* with equals</i> if and only if <tt>e1.compareTo(e2) == 0</tt> has
* the same boolean value as <tt>e1.equals(e2)</tt> for every
* <tt>e1</tt> and <tt>e2</tt> of class <tt>C</tt>. Note that <tt>null</tt>
* is not an instance of any class, and <tt>e.compareTo(null)</tt> should
* throw a <tt>NullPointerException</tt> even though <tt>e.equals(null)</tt>
* returns <tt>false</tt>.<p>
*
* It is strongly recommended (though not required) that natural orderings be
* consistent with equals. This is so because sorted sets (and sorted maps)
* without explicit comparators behave "strangely" when they are used with
* elements (or keys) whose natural ordering is inconsistent with equals. In
* particular, such a sorted set (or sorted map) violates the general contract
* for set (or map), which is defined in terms of the <tt>equals</tt>
* method.<p>
*
* For example, if one adds two keys <tt>a</tt> and <tt>b</tt> such that
* {@code (!a.equals(b) && a.compareTo(b) == 0)} to a sorted
* set that does not use an explicit comparator, the second <tt>add</tt>
* operation returns false (and the size of the sorted set does not increase)
* because <tt>a</tt> and <tt>b</tt> are equivalent from the sorted set's
* perspective.<p>
*
* Virtually all Java core classes that implement <tt>Comparable</tt> have natural
* orderings that are consistent with equals. One exception is
* <tt>java.math.BigDecimal</tt>, whose natural ordering equates
* <tt>BigDecimal</tt> objects with equal values and different precisions
* (such as 4.0 and 4.00).<p>
*
* For the mathematically inclined, the <i>relation</i> that defines
* the natural ordering on a given class C is:<pre>
* {(x, y) such that x.compareTo(y) <= 0}.
* </pre> The <i>quotient</i> for this total order is: <pre>
* {(x, y) such that x.compareTo(y) == 0}.
* </pre>
*
* It follows immediately from the contract for <tt>compareTo</tt> that the
* quotient is an <i>equivalence relation</i> on <tt>C</tt>, and that the
* natural ordering is a <i>total order</i> on <tt>C</tt>. When we say that a
* class's natural ordering is <i>consistent with equals</i>, we mean that the
* quotient for the natural ordering is the equivalence relation defined by
* the class's {@link Object#equals(Object) equals(Object)} method:<pre>
* {(x, y) such that x.equals(y)}. </pre><p>
*
* This interface is a member of the
* <a href="{@docRoot}/../technotes/guides/collections/index.html">
* Java Collections Framework</a>.
*
* @param <T> the type of objects that this object may be compared to
*
* @author Josh Bloch
* @see java.util.Comparator
* @since 1.2
*/
public interface Comparable<T> {
/**
* Compares this object with the specified object for order. Returns a
* negative integer, zero, or a positive integer as this object is less
* than, equal to, or greater than the specified object.
*
* <p>The implementor must ensure <tt>sgn(x.compareTo(y)) ==
* -sgn(y.compareTo(x))</tt> for all <tt>x</tt> and <tt>y</tt>. (This
* implies that <tt>x.compareTo(y)</tt> must throw an exception iff
* <tt>y.compareTo(x)</tt> throws an exception.)
*
* <p>The implementor must also ensure that the relation is transitive:
* <tt>(x.compareTo(y)>0 && y.compareTo(z)>0)</tt> implies
* <tt>x.compareTo(z)>0</tt>.
*
* <p>Finally, the implementor must ensure that <tt>x.compareTo(y)==0</tt>
* implies that <tt>sgn(x.compareTo(z)) == sgn(y.compareTo(z))</tt>, for
* all <tt>z</tt>.
*
* <p>It is strongly recommended, but <i>not</i> strictly required that
* <tt>(x.compareTo(y)==0) == (x.equals(y))</tt>. Generally speaking, any
* class that implements the <tt>Comparable</tt> interface and violates
* this condition should clearly indicate this fact. The recommended
* language is "Note: this class has a natural ordering that is
* inconsistent with equals."
*
* <p>In the foregoing description, the notation
* <tt>sgn(</tt><i>expression</i><tt>)</tt> designates the mathematical
* <i>signum</i> function, which is defined to return one of <tt>-1</tt>,
* <tt>0</tt>, or <tt>1</tt> according to whether the value of
* <i>expression</i> is negative, zero or positive.
*
* @param o the object to be compared.
* @return a negative integer, zero, or a positive integer as this object
* is less than, equal to, or greater than the specified object.
*
* @throws NullPointerException if the specified object is null
* @throws ClassCastException if the specified object's type prevents it
* from being compared to this object.
*/
public int compareTo(T o);
}
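The "consistent with equals" caveat in the documentation above can be reproduced with the BigDecimal case it mentions. The sketch below uses only the JDK; the class and method names are made up for illustration.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ConsistentWithEqualsSketch {
    // A TreeSet treats elements as duplicates when compareTo returns 0.
    static int distinctByCompareTo(List<BigDecimal> values) {
        Set<BigDecimal> sorted = new TreeSet<>(values);
        return sorted.size();
    }

    // A HashSet treats elements as duplicates when equals returns true.
    static int distinctByEquals(List<BigDecimal> values) {
        Set<BigDecimal> hashed = new HashSet<>(values);
        return hashed.size();
    }

    public static void main(String[] args) {
        List<BigDecimal> values = Arrays.asList(new BigDecimal("4.0"), new BigDecimal("4.00"));
        System.out.println(distinctByCompareTo(values)); // 1: compareTo says they are equal
        System.out.println(distinctByEquals(values));    // 2: equals also compares the scale
    }
}
```

This matters for MapReduce keys as well: by default, keys whose comparison returns 0 are handed to the same reduce call, so a custom compareTo also decides which records are grouped together.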
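2. Custom sorting
Define a class that sorts in descending order and use it as the mapper's output key type. Just as with IntWritable, the framework calls its compareTo method automatically when it sorts.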
package demo5;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DescIntWritable implements WritableComparable<DescIntWritable> {
    private int value; // the wrapped int, as in new IntWritable(9)

    // The no-arg constructor is needed so the framework can create instances when deserializing.
    public DescIntWritable() {
    }

    public DescIntWritable(int value) {
        this.value = value;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(value);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        value = dataInput.readInt();
    }

    @Override
    public int compareTo(DescIntWritable o) {
        // Descending: compare the other value against this one.
        // Integer.compare avoids the overflow that o.value - this.value can hit.
        return Integer.compare(o.value, this.value);
    }

    @Override
    public String toString() {
        return Integer.toString(this.value);
    }
}
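A descending compareTo is often written as the subtraction `o.value - this.value`. That works for small non-negative values, but the result can overflow when the operands are far apart. The sketch below compares the two styles; `DescendingCompareSketch` and its method names are hypothetical and exist only for this illustration.

```java
class DescendingCompareSketch {
    // Subtraction-based descending comparison: compact, but the int result can overflow.
    static int bySubtraction(int thisValue, int thatValue) {
        return thatValue - thisValue;
    }

    // Overflow-safe descending comparison.
    static int byIntegerCompare(int thisValue, int thatValue) {
        return Integer.compare(thatValue, thisValue);
    }

    public static void main(String[] args) {
        // Ordinary values: both styles agree that 3 sorts after 9 in descending order.
        System.out.println(bySubtraction(3, 9) > 0);
        System.out.println(byIntegerCompare(3, 9) > 0);

        // Extreme values: -2 should still sort after Integer.MAX_VALUE,
        // but MAX_VALUE - (-2) wraps around to a negative number.
        System.out.println(bySubtraction(-2, Integer.MAX_VALUE) > 0);    // false: wrong sign
        System.out.println(byIntegerCompare(-2, Integer.MAX_VALUE) > 0); // true
    }
}
```

For counters that are never negative the subtraction form cannot overflow, but `Integer.compare` costs nothing extra and is safe for every input.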
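3. Secondary sort
When defining the ordering, first check whether the two primary values are equal. If they differ, sort by that field in descending order; if they are equal, fall back to a second field.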
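Roughly, the code looks like this:
```java
public class PlayWritable implements WritableComparable<PlayWritable> {
    private int viewer;
    private int length;

    /**
     * Sort by viewer descending; if viewer is equal, sort by length descending.
     */
    public int compareTo(PlayWritable o) {
        if (this.viewer != o.viewer) {
            return Integer.compare(o.viewer, this.viewer);
        } else {
            return Integer.compare(o.length, this.length);
        }
    }
    // constructors (with and without arguments), getters and setters, serialization methods and toString
    ...
}
```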
Full implementation:
First, define a custom type:
```java
package demo6;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Player implements WritableComparable<Player> {
    private int peopleNum; // audience size
    private int videoTime; // streaming duration

    public Player() {
    }

    public Player(int peopleNum, int videoTime) {
        this.peopleNum = peopleNum;
        this.videoTime = videoTime;
    }

    @Override
    public int compareTo(Player o) {
        // 1. Sort by audience size, descending.
        // 2. If the audience size is equal, sort by streaming duration, descending.
        if (this.peopleNum == o.peopleNum) {
            return Integer.compare(o.videoTime, this.videoTime);
        }
        return Integer.compare(o.peopleNum, this.peopleNum);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(peopleNum);
        dataOutput.writeInt(videoTime);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Fields must be read in the same order in which write() wrote them.
        peopleNum = dataInput.readInt();
        videoTime = dataInput.readInt();
    }

    @Override
    public String toString() {
        return peopleNum + "\t" + videoTime;
    }
}
```
Then use this type as the mapper's output key type in the driver class. The framework calls its compareTo method automatically when it sorts.
package demo6;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class TestJob extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // Hand the job result to the shell as the process exit code.
        System.exit(ToolRunner.run(new TestJob(), args));
    }

    @Override
    public int run(String[] strings) throws Exception {
        // Reuse the Configuration prepared by ToolRunner instead of creating a new one.
        Configuration conf = getConf();
        conf.set("fs.defaultFS", "hdfs://hadoop10:9000");
        Job job = Job.getInstance(conf);
        job.setJarByClass(TestJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/testdata.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out"));

        job.setMapperClass(Sort3Mapper.class);
        job.setReducerClass(Sort3Reducer.class);

        // The map output key is Player, so the shuffle sorts with Player.compareTo.
        job.setMapOutputKeyClass(Player.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Player.class);

        // Exit code convention: 0 means success, non-zero means failure.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    static class Sort3Mapper extends Mapper<LongWritable, Text, Player, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Tab-separated input line: arr[0] is the name, arr[1] and arr[2] are the two sort fields.
            String[] arr = value.toString().split("\t");
            context.write(new Player(Integer.parseInt(arr[1]), Integer.parseInt(arr[2])), new Text(arr[0]));
        }
    }

    static class Sort3Reducer extends Reducer<Player, Text, Text, Player> {
        @Override
        protected void reduce(Player key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Keys arrive already sorted; swap key and value back for the output.
            for (Text name : values) {
                context.write(name, key);
            }
        }
    }
}
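To check what order such a job should produce, the same two-level ordering can be tried on a few rows in plain Java. This is a sketch under assumptions: the sample rows are invented, and `SecondarySortSketch`, `Row` and `sampleRows` are hypothetical names, not part of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    // Stand-in for one input line: name, audience size, streaming duration.
    static final class Row {
        final String name;
        final int peopleNum;
        final int videoTime;

        Row(String name, int peopleNum, int videoTime) {
            this.name = name;
            this.peopleNum = peopleNum;
            this.videoTime = videoTime;
        }
    }

    // Same ordering as the two-field compareTo: peopleNum descending, then videoTime descending.
    static final Comparator<Row> ORDER =
            Comparator.comparingInt((Row r) -> r.peopleNum).reversed()
                    .thenComparing(Comparator.comparingInt((Row r) -> r.videoTime).reversed());

    // Invented sample data for the illustration.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("a", 100, 30), new Row("b", 200, 10),
                new Row("c", 100, 50), new Row("d", 200, 40));
    }

    static List<String> sortedNames(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(ORDER);
        List<String> names = new ArrayList<>();
        for (Row row : copy) {
            names.add(row.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(sampleRows())); // [d, b, c, a]
    }
}
```

Rows d and b share the larger audience size, so they come first and are ordered among themselves by duration; c and a follow in the same way.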
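II. Components
1. FileInputFormat (TextInputFormat)
1. This class has two main roles: (1) it records the input paths to read, and (2) it computes the logical splits for the files under those paths.
2. Where the project uses it: (1) the input path is set by calling FileInputFormat.addInputPath (the output path is set with FileOutputFormat.setOutputPath); (2) just before that, the job's input format class is set to TextInputFormat, and its source shows that TextInputFormat itself extends FileInputFormat.
3. The first role is implemented by addInputPath. This method tells the Job which path to read its input files from.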
public static void addInputPath(Job job, Path path) throws IOException {
Configuration conf = job.getConfiguration();
path = path.getFileSystem(conf).makeQualified(path);
String dirStr = StringUtils.escapeString(path.toString());
String dirs = conf.get("mapreduce.input.fileinputformat.inputdir");
conf.set("mapreduce.input.fileinputformat.inputdir", dirs == null ? dirStr : dirs + "," + dirStr);
}
The last line shows that every call to addInputPath appends the new path (converted to a string) to the existing value, separated by ",". The existing value is not overwritten.
So there are three ways to specify the input:
// a single input file
FileInputFormat.addInputPath(job, new Path("/hdfs/file"));
// an input directory
FileInputFormat.addInputPath(job, new Path("/hdfs/dir"));
// several input files
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/hdfs/file1.txt"));
FileInputFormat.addInputPath(job, new Path("/hdfs/file2.txt"));
FileInputFormat.addInputPath(job, new Path("/hdfs/file3.txt"));
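The append behavior can be mimicked without Hadoop. In this sketch a `Map` stands in for the job `Configuration`, and path qualification and escaping are left out on purpose; `InputDirSketch` is a hypothetical class written only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class InputDirSketch {
    static final String INPUT_DIR = "mapreduce.input.fileinputformat.inputdir";

    // Mimics the last two lines of addInputPath: append to the old value, never overwrite it.
    static void addInputPath(Map<String, String> conf, String path) {
        String dirs = conf.get(INPUT_DIR);
        conf.put(INPUT_DIR, dirs == null ? path : dirs + "," + path);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addInputPath(conf, "/hdfs/file1.txt");
        addInputPath(conf, "/hdfs/file2.txt");
        addInputPath(conf, "/hdfs/file3.txt");
        System.out.println(conf.get(INPUT_DIR)); // /hdfs/file1.txt,/hdfs/file2.txt,/hdfs/file3.txt
    }
}
```

Because the separator is a comma, the real method escapes the path string before storing it.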
4. The other role is computing the logical splits from the input paths.
FileInputFormat has a method called getSplits, which relies on the following values:
// format-level lower bound on the split size: 1 byte
protected long getFormatMinSplitSize() {
    return 1L;
}
// upper bound on the split size: Long.MAX_VALUE (9223372036854775807L) unless configured otherwise
public static long getMaxSplitSize(JobContext context) {
    return context.getConfiguration().getLong("mapreduce.input.fileinputformat.split.maxsize", 9223372036854775807L);
}
public List<InputSplit> getSplits(JobContext job) throws IOException {
StopWatch sw = (new StopWatch()).start();
long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
List<InputSplit> splits = new ArrayList();
List<FileStatus> files = this.listStatus(job);
Iterator i$ = files.iterator();
while(true) {
while(true) {
while(i$.hasNext()) {
FileStatus file = (FileStatus)i$.next();
Path path = file.getPath();
long length = file.getLen();
if (length != 0L) {
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus)file).getBlockLocations();
} else {
FileSystem fs = path.getFileSystem(job.getConfiguration());
blkLocations = fs.getFileBlockLocations(file, 0L, length);
}
if (this.isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
long bytesRemaining;
int blkIndex;
for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
}
if (bytesRemaining != 0L) {
blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
}
} else {
if (LOG.isDebugEnabled() && length > Math.min(file.getBlockSize(), minSize)) {
LOG.debug("File is not splittable so no parallelization is possible: " + file.getPath());
}
splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
}
} else {
splits.add(this.makeSplit(path, 0L, length, new String[0]));
}
}
job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long)files.size());
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
}
return splits;
}
}
}
The code shows how each input file is cut into splits. How a file is cut depends on three values: maxSize, blockSize and minSize. The exact rule is described in the Split section below.
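2. Split
1. The concept of a split
Before the MapTasks start processing their share of the data, FileInputFormat divides each HDFS source file into logical pieces called splits. These are not physical data blocks.
A split is described by three values: start (the offset at which reading begins), length (how many bytes this MapTask reads), and hosts (the HDFS nodes that hold the data).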
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
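2. The rule is that splitSize is never smaller than minSize and never larger than maxSize. If blockSize satisfies both limits, splitSize is blockSize; otherwise it is clamped to maxSize or minSize.
3. A MapTask runs under a NodeManager, which is deployed on the same server as a DataNode. If the data the task has to process is stored on that same machine, nothing has to be fetched from other nodes over the network, which improves efficiency considerably. To change the split size, raise minSize above blockSize for larger splits, or lower maxSize below blockSize for smaller ones. These settings are normally left alone, so the split size equals the block size. The data comes from HDFS, where a block is 128 MB by default. A split larger than a block covers blocks that may sit on other nodes, so the MapTask has to read part of its data across the network; a split much smaller than a block multiplies the number of MapTasks. Both reduce efficiency.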
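A few worked values make the clamping rule concrete. The sketch copies the one-line formula of computeSplitSize into a standalone class; `SplitSizeSketch` and the chosen sizes are assumptions made for the example.

```java
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Same formula as computeSplitSize: clamp blockSize into [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): the split equals the block.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / MB); // 128
        // minSize raised above the block size: the split grows to minSize.
        System.out.println(computeSplitSize(block, 256 * MB, Long.MAX_VALUE) / MB); // 256
        // maxSize lowered below the block size: the split shrinks to maxSize.
        System.out.println(computeSplitSize(block, 1L, 64 * MB) / MB); // 64
    }
}
```

In a real job the two limits come from the configuration keys `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.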
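3. MapTask parallelism
1. Parallel MapTasks
Use the resources of several server nodes to process the data in parallel and speed up processing.
2. What determines MapTask parallelism
①. File size. Example: a 400 MB file produces 4 splits (each at most 128 MB), so 4 MapTasks are started.
②. Number of files. Example: reading a directory that contains four 0.1 MB files produces 4 splits, so 4 MapTasks are started.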
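Both examples can be verified by replaying the loop from getSplits for a single splittable file. This is a simplified sketch: `SplitCountSketch` is a hypothetical class, it only counts splits, and it assumes a split size of 128 MB.

```java
class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    static final double SPLIT_SLOP = 1.1; // the 1.1D factor used in getSplits

    // Count the splits that the getSplits loop would create for one splittable file.
    static int countSplits(long length, long splitSize) {
        if (length == 0) {
            return 1; // getSplits still adds one empty split for an empty file
        }
        int splits = 0;
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) bytesRemaining / (double) splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 split sizes) becomes the last split.
        if (bytesRemaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB;
        System.out.println(countSplits(400 * MB, splitSize)); // 4: 128 + 128 + 128 + 16
        System.out.println(countSplits(140 * MB, splitSize)); // 1: 140/128 is below the 1.1 factor
        int smallFiles = 0;
        for (int i = 0; i < 4; i++) {
            smallFiles += countSplits(100 * 1024L, splitSize); // four files of about 0.1 MB
        }
        System.out.println(smallFiles); // 4: every file gets its own split
    }
}
```

The 1.1 factor means a file slightly larger than one split is not cut in two, which avoids creating a tiny trailing split.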
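3. The problem this mechanism causes:
Hadoop is good at processing a large volume of data stored in large files, not a large number of small files. With huge numbers of small files, HDFS has to keep a large amount of metadata in NameNode memory. In addition, a MapReduce job over those files sees a large number of blocks, which become a large number of splits, which in turn start a large number of MapTasks. Each MapTask occupies one JVM process, so together they take a lot of memory and can exhaust server resources almost at once.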
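4. Optimizations:
①. HDFS: merge several data files that have the same business meaning into one file.
hdfs dfs -getmerge /xxx.log /local/dir
Drawback: the merged file ends up on the local Linux file system and usually has to be uploaded back to HDFS.
②. Archiving
hadoop archive -archiveName wordcount.har -p /mapreduce/demo7 /mapreduce/demo7_new
Note: this archives the small files on HDFS, but the command does not remove the source files, so the metadata already stored on the NameNode does not change.
③. MapReduce: change how splits are computed so that several blocks are merged into one split. Fewer splits mean fewer MapTasks and lower memory use on the servers that run the job.
Use CombineTextInputFormat.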
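4. CombineTextInputFormat
How it works:
A huge number of small files produces a huge number of small blocks. CombineTextInputFormat packs several small blocks into one split, up to a configured maximum split size. This reduces the number of splits and therefore the number of MapTasks, which improves MapReduce performance.
Implementation:
job.setInputFormatClass(CombineTextInputFormat.class); // set the input format class
CombineTextInputFormat.setMaxInputSplitSize(job, 10485760); // 10 MB: blocks whose total size stays within 10 MB are merged into one split
CombineTextInputFormat.addInputPath(job, new Path("/hdfs/dir")); // set the input path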
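The effect of the maximum split size can be pictured as a packing problem. The sketch below is a deliberately simplified greedy packer, not the algorithm Hadoop uses (the real input format also takes node and rack locality into account); `CombineSplitSketch` and the file sizes are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineSplitSketch {
    static final long MB = 1024L * 1024L;

    // Greedy packing: keep adding files to the current split until the next one would exceed the limit.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five small files of 4, 5, 3, 6 and 2 MB, packed with a 10 MB limit.
        List<Long> sizes = Arrays.asList(4 * MB, 5 * MB, 3 * MB, 6 * MB, 2 * MB);
        System.out.println(pack(sizes, 10 * MB).size()); // 3 splits instead of 5
    }
}
```

With one split per file the five files would need five MapTasks; packed under a 10 MB limit they need three.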