1. MapReduce
Hadoop MapReduce is a software framework that makes it easy to write applications that run on large clusters of thousands of commodity machines and process multi-terabyte datasets in parallel in a reliable, fault-tolerant way. The key points in this definition are: first, a software framework; second, parallel processing; third, reliability and fault tolerance; fourth, large clusters; and fifth, massive datasets.
2. Writing a MapReduce job
(1) Write the Mapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper: emits (word, 1) for every word in the input line.
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Output key (the word) and output value (the count 1)
        Text keyOut = new Text();
        IntWritable valueOut = new IntWritable();
        // split() splits the string on the given regular expression; "\\s+" splits the line into words on whitespace
        String[] arr = value.toString().split("\\s+");
        for (String s : arr) {
            keyOut.set(s);
            valueOut.set(1);
            context.write(keyOut, valueOut);
        }
    }
}
(2) Write the Reducer
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer: sums the counts for each word (class renamed to WCReducer to match the driver below).
 */
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable iw : values) {
            count = count + iw.get();
        }
        context.write(key, new IntWritable(count));
    }
}
(3) Write the main method
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");               // use the local filesystem

        Job job = Job.getInstance(conf);

        // Set the job's properties
        job.setJobName("WCApp");                            // job name
        job.setJarByClass(WCApp.class);                     // class used to locate the jar
        job.setInputFormatClass(TextInputFormat.class);     // input format

        // Output format class (optional)
        //job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Add the input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Maximum split size (optional)
        //FileInputFormat.setMaxInputSplitSize(job, 13);
        // Minimum split size (optional)
        //FileInputFormat.setMinInputSplitSize(job, 1L);

        // Custom partitioner (requires a separate Partitioner implementation such as MyPartitioner)
        //job.setPartitionerClass(MyPartitioner.class);
        // Combiner class: reuses the reducer to pre-aggregate on the map side
        job.setCombinerClass(WCReducer.class);

        job.setMapperClass(WCMapper.class);                 // mapper class
        job.setReducerClass(WCReducer.class);               // reducer class
        job.setNumReduceTasks(3);                           // number of reduce tasks

        job.setMapOutputKeyClass(Text.class);               // map output key type
        job.setMapOutputValueClass(IntWritable.class);      // map output value type
        job.setOutputKeyClass(Text.class);                  // final output key type
        job.setOutputValueClass(IntWritable.class);         // final output value type

        job.waitForCompletion(true);
    }
}
3. How an MR job runs in local mode
(1) Create the external Job (mapreduce.Job) and set its configuration.
(2) The JobSubmitter writes job.xml, the split files, etc. into a temporary staging directory.
(3) The JobSubmitter submits the job to the LocalJobRunner.
(4) The LocalJobRunner converts the external Job into an internal Job.
(5) The internal Job thread starts a separate thread to execute the job.
(6) The job-execution thread computes the Map and Reduce task information and spawns new threads from a thread pool to run the MR tasks.
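For reference, the driver-side settings that keep a job on this LocalJobRunner path are just the two properties below (both are standard Hadoop configuration keys; the fragment assumes the same WCApp driver shown above):
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "file:///");            // read and write the local filesystem instead of HDFS
conf.set("mapreduce.framework.name", "local");   // run via LocalJobRunner rather than submitting to YARN
Job job = Job.getInstance(conf);
// ... the remaining job setup is identical to WCApp above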
4. Running MR on a Hadoop cluster
(1) Build the Maven project into a jar.
(2) Upload the jar to the Hadoop cluster.
(3) Run the hadoop jar command (generated jar + main class + HDFS input/output paths):
$>hadoop jar HdfsDemo-1.0-SNAPSHOT.jar com.gao.hdfs.mr.WCApp hdfs://s200/user/gao/wc/data hdfs://s200/user/gao/wc/out
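Steps (1) and (2) might look like the following (the jar name and the s200 host come from the command above; the target user and directory are placeholders):
$>mvn clean package
$>scp target/HdfsDemo-1.0-SNAPSHOT.jar gao@s200:~/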
5. Input split size calculation (FileInputFormat)
// Minimum split size (>= 1), from mapreduce.input.fileinputformat.split.minsize
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));   // getFormatMinSplitSize() returns 1
// Maximum split size (<= Long.MAX_VALUE), from mapreduce.input.fileinputformat.split.maxsize
long maxSize = getMaxSplitSize(job);
// Block size of the file
long blockSize = file.getBlockSize();
// Split size: clamp the block size between minSize and maxSize
// splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
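To make the clamping rule concrete, here is a small self-contained sketch (the helper name computeSplitSize mirrors FileInputFormat's, and the block/min/max values are purely illustrative):
public class SplitSizeDemo {
    // Clamp blockSize between minSize and maxSize: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB block, for illustration
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size equals the block size
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize));                 // 134217728
        // A small maxSize (e.g. the 13 in the commented-out setMaxInputSplitSize call) shrinks the splits
        System.out.println(computeSplitSize(1L, 13L, blockSize));                            // 13
        // A minSize larger than the block size enlarges splits beyond one block
        System.out.println(computeSplitSize(256L * 1024 * 1024, Long.MAX_VALUE, blockSize)); // 268435456
    }
}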
6. Compression
7. Remote debugging