Day 5. Hadoop Study Notes 3 (Hands-On Focus)

大竹薙子

Article link: https://blog.csdn.net/weixin_42838993/article/details/84891313

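0. Review

Quick tips

Google published a series of papers: Google File System, MapReduce, Bigtable, Spanner

Bigtable is a distributed data storage system designed by Google, a non-relational database built to handle massive amounts of data.

Spanner is a scalable, multi-version, globally distributed, synchronously replicated database developed by Google. It was the first system to distribute data at global scale while supporting externally consistent distributed transactions.

TensorFlow
TensorFlow is a symbolic math system based on dataflow programming, widely used to implement machine learning algorithms. Its predecessor is DistBelief, Google's neural network library.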

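1. InputFormat

When studying a framework's design, don't start with the concrete implementation class. Read the top-level interface first, then the abstract classes. In short: trace things back to their source.

 		// Step 2: set the data formats (input / output)
        job.setInputFormatClass(TextInputFormat.class); 
        job.setOutputFormatClass(TextOutputFormat.class);

Focus on InputFormat
The top-level parent class of TextInputFormat is InputFormat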

/**
 * <code>InputFormat</code> describes the input-specification for a
 * Map-Reduce job.      # note: describes the input specification of a MapReduce job
 *
 * <p>The Map-Reduce framework relies on the <code>InputFormat</code> of the
 * job to:<p>
 * <ol>
 *   <li>
 *   Validate the input-specification of the job.   # note: validates the job's input (data validation)
 *   <li>
 *   Split-up the input file(s) into logical {@link InputSplit}s, each of
 *   which is then assigned to an individual {@link Mapper}.
 *
 *   # note: cuts the input files into logical splits, each assigned to its own Mapper (logical splitting)
 *   </li>
 *   <li>
 *   Provide the {@link RecordReader} implementation to be used to glean
 *   input records from the logical <code>InputSplit</code> for processing by
 *   the {@link Mapper}.
 *   
 *   # note: supplies the RecordReader that a map task uses to read records out of its split and hand them to the Mapper (record reading)
 *   </li>
 * </ol>
 * @see InputSplit
 * @see RecordReader
 * @see FileInputFormat
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {
    /**
    * getSplits
    */
    public abstract
    List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
    /**
    * createRecordReader
    */
    public abstract
    RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) 
    										throws IOException, InterruptedException;
}

So for any InputFormat there are two questions to answer: how is the input split (getSplits), and how is it read (createRecordReader)?

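Focus on FileInputFormat

public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {
    /**
     * Of the many methods in this class, getSplits is the one to study.
     * Abridged from the Hadoop 2.x source; "// ..." marks omitted code.
     */
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Stopwatch sw = new Stopwatch().start();	// stopwatch, used only for logging
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));	// minimum split size (1 byte unless configured)
        long maxSize = getMaxSplitSize(job);	// maximum split size (Long.MAX_VALUE unless configured)

        // generate splits
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<FileStatus> files = listStatus(job);	// all files under the input paths
        for (FileStatus file: files) {
            Path path = file.getPath();
            long length = file.getLen();
            // ... (zero-length check omitted)
            // block locations of this file, used to choose hosts for each split (simplified from the source)
            BlockLocation[] blkLocations = path.getFileSystem(job.getConfiguration())
                    .getFileBlockLocations(file, 0, length);

            if (isSplitable(job, path)) {	// line 397 of the source: only splittable files take this branch
                long blockSize = file.getBlockSize();	// HDFS block size, 128 MB by default
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);	// max(minSize, min(maxSize, blockSize))
                long bytesRemaining = length;	// starts as the total length of the current file

                // Keep cutting while remaining / splitSize > SPLIT_SLOP (the constant 1.1).
                // Conclusion: with a 128 MB split size, one split holds at most 140.8 MB (128 * 1.1).
                while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
                    // cut off one full-size split
                    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
                    splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                    bytesRemaining -= splitSize;
                }

                // whatever no longer satisfies the while condition becomes the last split as is
                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
                    splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                }
            } else { // not splitable: the whole file becomes a single split
                splits.add(makeSplit(path, 0, length,
                        blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
            }
        }
        // ...
        return splits;
    }
}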
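
To make the 1.1 rule concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that mirrors only the arithmetic of getSplits. The class name SplitSizeSketch and the method splitLengths are made up for this illustration; zero-length files, hosts and block locations are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSizeSketch {
    static final double SPLIT_SLOP = 1.1; // same constant as in FileInputFormat
    static final long MB = 1024L * 1024L;

    // Same formula as computeSplitSize: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: lengths of the splits that one splittable file would produce.
    static List<Long> splitLengths(long fileLen, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLen;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);      // cut off one full-size split
            remaining -= splitSize;
        }
        if (remaining != 0) {
            lengths.add(remaining);      // the tail becomes the last split
        }
        return lengths;
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // 140 MB / 128 MB is below 1.1, so the file stays in one split.
        System.out.println(splitLengths(140 * MB, block, 1, Long.MAX_VALUE));
        // 141 MB / 128 MB is above 1.1, so the file is cut into 128 MB + 13 MB.
        System.out.println(splitLengths(141 * MB, block, 1, Long.MAX_VALUE));
    }
}
```

With default min/max sizes the split size equals the block size, which is why the tail of a file is only cut off when it exceeds 10% of a block.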

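Focus on TextInputFormat
TextInputFormat has no getSplits method of its own; splitting is already fully implemented in FileInputFormat.
What TextInputFormat does provide is createRecordReader (the "how to read" half), plus an isSplitable override.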

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

}
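
Since splits are cut at byte positions, not at line boundaries, the reader has to make sure every line is still read exactly once. The sketch below simulates, in plain Java, the convention LineRecordReader follows as far as I understand it: a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader keeps going until the start of the next line lies beyond its split end. The class LineSplitReadSketch and its methods are hypothetical names for this illustration, and only "\n" line endings are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineSplitReadSketch {
    // Returns "offset:line" records for the split [start, start + length).
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            pos = nextLineStart(data, pos); // that line belongs to the previous split
        }
        List<String> records = new ArrayList<>();
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            // exclude the trailing '\n' from the value, if there is one
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(pos + ":" + line); // key = byte offset, value = line text
            pos = next;
        }
        return records;
    }

    // Index of the first byte after the next '\n' at or after 'from'.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 5 falls in the middle of "bbbb".
        System.out.println(readSplit(data, 0, 5));
        System.out.println(readSplit(data, 5, 6));
    }
}
```

In this run the first split emits "aa" and the whole of "bbbb" even though the line crosses the boundary, and the second split starts at "cc", so no line is lost or duplicated.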

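Question: is the MapReduce framework a good fit for small files?

Not recommended. Every file produces at least one logical split, so a huge number of small files means a huge number of splits, and most of the resources (memory in particular) go into starting tasks instead of processing data. You can of course intervene and merge the small files into large ones.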
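
A rough back-of-the-envelope sketch of why merging helps, in plain Java: it counts the splits for 10,000 files of 1 MB each versus the same data merged into a single file, assuming a 128 MB split size. The class SmallFilesSketch and the method countSplits are hypothetical; the counting rule is the SPLIT_SLOP loop from getSplits.

```java
final class SmallFilesSketch {
    static final double SPLIT_SLOP = 1.1;

    // Number of splits one splittable, non-empty file of fileLen bytes produces.
    static long countSplits(long fileLen, long splitSize) {
        long count = 0;
        long remaining = fileLen;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        return remaining != 0 ? count + 1 : count;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = 128 * mb;
        long fromSmallFiles = 10_000 * countSplits(1 * mb, splitSize); // one split per file
        long fromMergedFile = countSplits(10_000 * mb, splitSize);
        System.out.println("10,000 x 1 MB files : " + fromSmallFiles + " splits");
        System.out.println("1 x 10,000 MB file  : " + fromMergedFile + " splits");
    }
}
```

Under these assumptions the small files need 10,000 splits (and map tasks) while the merged file needs 79, for the same amount of data.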

3. Common InputFormat and OutputFormat Structure

Figure 5

Common InputFormats (√ = used most often):
CompositeInputFormat
DelegatingInputFormat
ComposableInputFormat
DBInputFormat (RDBMS database access)
FileInputFormat (HDFS data)
	┕ TextInputFormat √
	┕ NLineInputFormat
	┕ CombineTextInputFormat √
	┕ FixedLengthInputFormat
	┕ KeyValueTextInputFormat
	┕ SequenceFileInputFormat
MultipleInputs (the most commonly used) √
TableInputFormat (HBase) √

Common OutputFormats:
DBOutputFormat √
FileOutputFormat
	┕ TextOutputFormat √
	┕ SequenceFileOutputFormat
MultipleOutputFormat
TableOutputFormat (HBase) √

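4. Characteristics of Selected InputFormats

DBInputFormat (RDBMS: MySQL/Oracle) `rarely used`
FileInputFormat (HDFS data)
	TextInputFormat (√)
		Split calculation: per file, by SplitSize
		Key/Value: key is the byte offset of the current line within the file; value is the text of that line
	NLineInputFormat
		Split calculation: per file, every N lines form one split
					# if N is not set, the default is 1 line
		Key/Value: key is the byte offset of the current line within the file; value is the text of that line
		Setting: `conf.set("mapreduce.input.lineinputformat.linespermap","4");`
	KeyValueTextInputFormat
		Split calculation: per file, by SplitSize
		Key/Value: line = key + separator + value
					# each line is cut at the first separator: the text before it is the key, the rest is the value
		Setting: `conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",">");`
	SequenceFileInputFormat
	CombineTextInputFormat (√)
		Split calculation: by SplitSize
					# an optimization of TextInputFormat: packs several files into one split, then cuts by SplitSize
		Key/Value: key is the byte offset of the current line within the file; value is the text of that line
		Prerequisite: all the small text files being combined must share the same format.
MultipleInputs (√)
					# Note: input paths are registered differently here; each path gets its own InputFormat and Mapper
        MultipleInputs.addInputPath(job,new Path("file:///D:/demo/sys1"), TextInputFormat.class,Sys1Mapper.class);
        MultipleInputs.addInputPath(job,new Path("file:///D:/demo/sys2"), TextInputFormat.class,Sys2Mapper.class);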
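
The key/value rule of KeyValueTextInputFormat and the N-line grouping of NLineInputFormat listed above can be sketched in plain Java, without Hadoop. The class LineFormatsSketch and both methods are hypothetical names; the sketch assumes a single-character separator and, for a line with no separator, treats the whole line as the key with an empty value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineFormatsSketch {
    // KeyValueTextInputFormat-style: cut the line at the FIRST separator only.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            return new String[] { line, "" }; // no separator: whole line is the key
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    // NLineInputFormat-style: every n lines form one group (one split);
    // the last group may hold fewer than n lines.
    static List<List<String>> groupLines(List<String> lines, int n) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int to = Math.min(i + n, lines.size());
            groups.add(new ArrayList<>(lines.subList(i, to)));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Later separators stay inside the value.
        System.out.println(Arrays.toString(splitKeyValue("u001>apple>3", '>')));
        // Five lines with n = 2 give three groups.
        System.out.println(groupLines(Arrays.asList("l1", "l2", "l3", "l4", "l5"), 2));
    }
}
```

With linespermap set to "4" as in the setting above, a 10-line file would therefore be handled by three map tasks (4 + 4 + 2 lines).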

5. Characteristics of Selected OutputFormats

DBOutputFormat (RDBMS: MySQL/Oracle) `used a lot √` 
FileOutputFormat (HDFS data)
	TextOutputFormat (√)
	SequenceFileOutputFormat
MultipleOutputs
TableOutputFormat (HBase √)

Key points for DBOutputFormat

 		// 1. Build the Job object; the DB settings must be in conf before Job.getInstance(conf) copies it
        Configuration conf=getConf();
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://CentOS:3306/demo?characterEncoding=utf8",
                "root","root"
                );
        Job job=Job.getInstance(conf);
		// 3. Set where the data goes (input | output): target table, then its columns in order
		DBOutputFormat.setOutput(job,"t_user_order","uid","items","total");

		// 5. Declare the output KEY/VALUE types of the Mapper and Reducer
        job.setOutputKeyClass(UserOrderDBWritable.class);

import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserOrderDBWritable implements DBWritable {
    private String uid;
    private String items;
    private double total;

    public UserOrderDBWritable(String uid, String items, double total) {
        this.uid = uid;
        this.items = items;
        this.total = total;
    }

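    // Works like plain JDBC: bind each field to its placeholder, in the column order given to DBOutputFormat.setOutput
    public void write(PreparedStatement statement) throws SQLException {
        statement.setString(1,uid);
        statement.setString(2,items);
        statement.setDouble(3,total);
    }

    // Reads a row back from the database; left empty because this job only writes
    public void readFields(ResultSet resultSet) throws SQLException {
    }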
}
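
The parameter indexes 1, 2, 3 in write() only make sense next to the column order passed to DBOutputFormat.setOutput. A minimal sketch of that relationship in plain Java, assuming the output side executes a parameterized INSERT with one "?" per column; the helper buildInsert is hypothetical and is not Hadoop API.

```java
final class InsertStatementSketch {
    // Hypothetical helper: builds "INSERT INTO table (c1,c2,...) VALUES (?,?,...)".
    static String buildInsert(String table, String... fields) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table);
        sql.append(" (").append(String.join(",", fields)).append(") VALUES (");
        for (int i = 0; i < fields.length; i++) {
            sql.append(i == 0 ? "?" : ",?"); // placeholder i+1 belongs to fields[i]
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        // Same table and columns as in the DBOutputFormat.setOutput call above.
        System.out.println(buildInsert("t_user_order", "uid", "items", "total"));
    }
}
```

Placeholder 1 is uid, 2 is items and 3 is total, so reordering the columns in setOutput without reordering the setters in write() would put values in the wrong columns.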

Questions
