0. Review
- Quick tips
- Google's series of papers: Google File System, MapReduce, BigTable, Spanner
BigTable is a distributed data storage system designed by Google, a non-relational database built to handle massive amounts of data.
Spanner is a scalable, multi-version, globally distributed, synchronously replicated database developed by Google. It was the first system to distribute data on a global scale while supporting externally consistent distributed transactions.
TensorFlow
TensorFlow is a symbolic math system based on dataflow programming, widely used to implement machine learning algorithms; its predecessor is Google's neural network library DistBelief.
1. InputFormat
When studying a framework's design, don't start with the concrete implementation classes; read the top-level interface first, then the abstract classes, and trace everything back to its source.
//② Set the data formats for reading input and writing output
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
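For context, here is a minimal sketch of where step ② sits inside a typical driver; the class name, the paths, and the commented-out Mapper/Reducer are hypothetical placeholders, not part of the original notes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "driver-sketch");
        job.setJarByClass(DriverSketch.class);

        // ② set the data formats for reading in and writing out
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // input/output locations (paths are made up for illustration)
        TextInputFormat.addInputPath(job, new Path("file:///D:/demo/in"));
        TextOutputFormat.setOutputPath(job, new Path("file:///D:/demo/out"));

        // Mapper/Reducer and their key/value types would be wired up here, e.g.:
        // job.setMapperClass(MyMapper.class);   // hypothetical Mapper
        // job.setReducerClass(MyReducer.class); // hypothetical Reducer
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```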
- Focus on InputFormat
The top-level parent class of TextInputFormat is InputFormat.
/**
* <code>InputFormat</code> describes the input-specification for a
* Map-Reduce job. # describes the input specification of the MapReduce job
*
* <p>The Map-Reduce framework relies on the <code>InputFormat</code> of the
* job to:<p>
* <ol>
* <li>
* Validate the input-specification of the job. # validate the job's input specification [input validation]
* <li>
* Split-up the input file(s) into logical {@link InputSplit}s, each of
* which is then assigned to an individual {@link Mapper}.
*
* # split the input file(s) into logical fragments and assign each to an individual Mapper [logical splitting]
* </li>
* <li>
* Provide the {@link RecordReader} implementation to be used to glean
* input records from the logical <code>InputSplit</code> for processing by
* the {@link Mapper}.
*
* # provide the RecordReader that reads the records of a logical split and feeds them to the Mapper [record reading]
* </li>
* </ol>
* @see InputSplit
* @see RecordReader
* @see FileInputFormat
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {
/**
* getSplits
*/
public abstract
List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
/**
* createRecordReader
*/
public abstract
RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException;
}
So, for any InputFormat, the two questions to keep in mind are: how is the input split, and how is it read? A small illustration follows.
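As a minimal sketch (the class name is hypothetical), "how to split" can be answered by refusing to split, so every file becomes exactly one InputSplit, while "how to read" is inherited from TextInputFormat's line-oriented RecordReader:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: keep TextInputFormat's line-by-line reading ("how to read"),
// but change the splitting policy so each file is one split ("how to split").
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file, regardless of its size
    }
}
```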
- Focus on FileInputFormat
public abstract class FileInputFormat<K, V> extends InputFormat<K, V>
/**
* Of its many methods, we focus on getSplits
*/
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start(); // stopwatch, used only for logging
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // lower bound for a split
  long maxSize = getMaxSplitSize(job);                                    // upper bound for a split
  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job); // all files under the input path(s)
  for (FileStatus file : files) {
    Path path = file.getPath();
    long length = file.getLen();
    // ... the file's BlockLocation[] blkLocations is looked up here (elided) ...
    if (isSplitable(job, path)) { // after a few earlier branches, only splittable files reach this point
      long blockSize = file.getBlockSize(); // e.g. 128 MB by default on HDFS
      long splitSize = computeSplitSize(blockSize, minSize, maxSize); // normally equals blockSize
      long bytesRemaining = length; // total length of the current file
      // remaining bytes / splitSize > SPLIT_SLOP (1.1)
      // => the last split may be up to 1.1 * splitSize, i.e. about 140.8 MB for a 128 MB splitSize
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        // cut one split
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
            blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
        bytesRemaining -= splitSize;
      }
      // whatever no longer satisfies the while condition becomes the final split of this file
      if (bytesRemaining != 0) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
      }
    } else { // not splittable: the whole file becomes a single split
      splits.add(makeSplit(path, 0, length,
          blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
    }
  }
  // ... logging via sw and return of the splits list (elided) ...
}
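For reference, `computeSplitSize` simply clamps the block size between the configured bounds: `Math.max(minSize, Math.min(maxSize, blockSize))`, which with the defaults (minSize = 1, maxSize = Long.MAX_VALUE) gives splitSize = blockSize = 128 MB. A worked example under those defaults:
- a 300 MB file: 300/128 ≈ 2.34 > 1.1 → first split of 128 MB; 172/128 ≈ 1.34 > 1.1 → second split of 128 MB; 44/128 ≈ 0.34 ends the loop, and the remaining 44 MB becomes the third and final split;
- a 130 MB file: 130/128 ≈ 1.02 ≤ 1.1, so the loop never runs and the whole 130 MB becomes a single split — the 1.1 slop factor exists precisely to avoid producing a tiny trailing split.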
- Focus on TextInputFormat
TextInputFormat does not define its own getSplits method; the splitting logic is fully implemented in FileInputFormat.
What TextInputFormat does provide is createRecordReader:
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text>
createRecordReader(InputSplit split,
TaskAttemptContext context) {
String delimiter = context.getConfiguration().get(
"textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
return new LineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
final CompressionCodec codec =
new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
if (null == codec) {
return true; // uncompressed files are always splittable
}
// compressed files are splittable only if the codec supports it (e.g. bzip2)
return codec instanceof SplittableCompressionCodec;
}
}
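As the code above shows, the record delimiter is read from the configuration, so it can be overridden before the job is submitted. A small sketch; the `|` delimiter is only an example:

```java
// inside the driver, before creating the Job:
Configuration conf = new Configuration();
// LineRecordReader will now treat '|' instead of '\n' as the end of a record
conf.set("textinputformat.record.delimiter", "|");
Job job = Job.getInstance(conf);
job.setInputFormatClass(TextInputFormat.class);
```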
Question: is the MapReduce framework a good fit for processing large numbers of small files?
Not really. Every small file produces at least one logical split and therefore one MapTask, so a huge number of small files means a huge number of splits, and most of the memory and time goes into launching tasks rather than processing data. You can intervene manually by merging the small files into larger ones beforehand, or use the approach sketched below.
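One common mitigation is CombineTextInputFormat, which packs many small files into each split; a minimal driver sketch, where the 4 MB upper limit is only an illustrative value:

```java
// inside the driver, after the Job has been created:
job.setInputFormatClass(CombineTextInputFormat.class);
// small files are packed together until a split reaches roughly this size
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024); // 4 MB, illustrative
```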
3. Common InputFormat and OutputFormat hierarchy

| Common InputFormats | Common OutputFormats |
|---|---|
| CompositeInputFormat | DBOutputFormat √ |
| DelegatingInputFormat | FileOutputFormat |
| ComposableInputFormat | ┕ TextOutputFormat √ |
| DBInputFormat (RDBMS access) | ┕ SequenceFileOutputFormat |
| FileInputFormat (data on HDFS) | |
| ┕ TextInputFormat √ | MultipleOutputFormat |
| ┕ NLineInputFormat | TableOutputFormat (HBase) √ |
| ┕ CombineTextInputFormat √ | |
| ┕ FixedLengthInputFormat | |
| ┕ KeyValueTextInputFormat | |
| ┕ SequenceFileInputFormat | |
| MultipleInputs (most commonly used) √ | |
| TableInputFormat (HBase) √ | |
4. Characteristics of selected InputFormats
DBInputFormat (RDBMS: MySQL/Oracle) `rarely used`
FileInputFormat (data on HDFS)
TextInputFormat (√)
Split computation: per file, by splitSize
Key/Value: the key is the byte offset of the current line within the file, the value is one line of text
NLineInputFormat
Split computation: per file, split every N lines
# cuts every N lines; if N is not set, the default is one line per split
Key/Value: the key is the byte offset of the current line within the file, the value is one line of text
Setting: `conf.set("mapreduce.input.lineinputformat.linespermap","4");`
KeyValueTextInputFormat
Split computation: per file, by splitSize
# reads one line; the part before the separator becomes the key, the rest becomes the value
Key/Value: key + 'separator' + value = line
Setting: `conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",">");` (see the driver sketch at the end of this section)
SequenceFileInputFormat
CombineTextInputFormat (√)
Split computation: by splitSize, across files
# an optimization of TextInputFormat: combines files (which must share the same format) and splits by splitSize
Key/Value: the key is the byte offset of the current line within the file, the value is one line of text
Precondition: all the small text files being combined must have the same format.
MultipleInputs (√)
# Note: input paths are added differently, as shown below; each path gets its own InputFormat and Mapper, and all Mappers must emit the same intermediate key/value types
MultipleInputs.addInputPath(job,new Path("file:///D:/demo/sys1"), TextInputFormat.class,Sys1Mapper.class);
MultipleInputs.addInputPath(job,new Path("file:///D:/demo/sys2"), TextInputFormat.class,Sys2Mapper.class);
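The driver sketch referenced above for KeyValueTextInputFormat; the `>` separator comes from the setting shown in the list, while the class name is a hypothetical placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the part of each line before the first ">" becomes the key, the rest the value
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ">");
        Job job = Job.getInstance(conf);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // the Mapper must therefore accept Text keys and Text values:
        // Mapper<Text, Text, OUT_KEY, OUT_VALUE>
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // ... Mapper/Reducer classes and input/output paths as usual ...
    }
}
```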
5. Characteristics of selected OutputFormats
DBOutputFormat (RDBMS: MySQL/Oracle) `frequently used √`
FileOutputFormat (data on HDFS)
TextOutputFormat (√)
SequenceFileOutputFormat
MultipleOutputs
TableOutputFormat (HBase √)
- Key points of DBOutputFormat
//1. Build the Job object
Configuration conf=getConf();
DBConfiguration.configureDB(conf,
"com.mysql.jdbc.Driver",
"jdbc:mysql://CentOS:3306/demo?characterEncoding=utf8",
"root","root"
);
Job job=Job.getInstance(conf);
//3. Set the data location information (input | output)
DBOutputFormat.setOutput(job,"t_user_order","uid","items","total");
//5. Declare the Mapper's and Reducer's output KEY/VALUE types
job.setOutputKeyClass(UserOrderDBWritable.class);
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class UserOrderDBWritable implements DBWritable {
private String uid;
private String items;
private double total;
public UserOrderDBWritable(String uid, String items, double total) {
this.uid = uid;
this.items = items;
this.total = total;
}
public void write(PreparedStatement statement) throws SQLException { // like plain JDBC: fill the INSERT statement's placeholders
statement.setString(1,uid);
statement.setString(2,items);
statement.setDouble(3,total);
}
public void readFields(ResultSet resultSet) throws SQLException { // used by DBInputFormat when reading from the DB; not needed for output, so left empty
}
}
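To complete the picture, a sketch of a Reducer that emits these records; the Reducer's input types and the items value are assumptions. DBOutputFormat only uses the output key (calling its write(PreparedStatement)), so the value can simply be NullWritable:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UserOrderReducer
        extends Reducer<Text, DoubleWritable, UserOrderDBWritable, NullWritable> {
    @Override
    protected void reduce(Text uid, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double total = 0.0;
        for (DoubleWritable amount : amounts) {
            total += amount.get();
        }
        // each emitted key turns into one row inserted into t_user_order
        context.write(new UserOrderDBWritable(uid.toString(), "item-list-placeholder" /* hypothetical items value */, total),
                      NullWritable.get());
    }
}
```

The driver would correspondingly also call job.setOutputFormatClass(DBOutputFormat.class) and job.setOutputValueClass(NullWritable.class).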
- Questions