All of Hadoop: Become an Expert Fast

Hadoop

Overview

https://blog.csdn.net/u012926411/article/details/82756100

Hadoop: a distributed storage and computation platform suited to big data

Hadoop is not a single framework or component; it is an open-source distributed computing platform developed in Java under the Apache Software Foundation. It performs distributed computation over massive data sets on clusters built from large numbers of machines, making it a distributed storage and computation platform well suited to big data.

Hadoop 1.x includes two core components: MapReduce and the Hadoop Distributed File System (HDFS).

HDFS is responsible for distributed storage of massive data, while MapReduce provides computation over that data and aggregation of the results.

HDFS: Hadoop Distributed File System

MapReduce: the distributed computation framework in Hadoop, enabling parallel analysis and computation over massive data sets

The Hadoop Ecosystem

HDFS: Hadoop Distributed File System

MapReduce: the distributed computation framework in Hadoop, enabling parallel analysis and computation over massive data sets

HBase: a column-oriented NoSQL database

Hive: a SQL interpretation engine that translates SQL statements into MapReduce code and runs it on the cluster

Flume: a distributed log collection system

Kafka: a message queue; a distributed messaging system

ZooKeeper: a distributed coordination service, used for service registries, configuration centers, cluster leader election, status monitoring, and distributed locks

Big Data Analysis Approaches

MapReduce: represents disk-based, offline batch processing of static big data

Spark: represents memory-based, offline batch processing of static big data

Storm, Spark Streaming, Flink, Kafka Streams: real-time stream processing, achieving millisecond-level handling of record-level data

HDFS

Solves the storage problem

Installation (Pseudo-Distributed)

Prepare the virtual machine

Change the IP address (eth0)
Delete the MAC address
Change the hostname in /etc/sysconfig/network
reboot

Install JDK 8

# Extract to the target directory and configure the environment variables
export JAVA_HOME=/home/java/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin

Map the hostname to the IP address

[root@HadoopNode00 ~]# vi /etc/sysconfig/network     # set the hostname; if it was not set before, remember to reboot
NETWORKING=yes
HOSTNAME=HadoopNode00
[root@HadoopNode00 ~]# vi /etc/hosts                 # map the IP address to the hostname
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.126.10 HadoopNode00

Disable the firewall

[root@HadoopNode00 ~]# service iptables stop  	# stop the firewall
[root@HadoopNode00 ~]# chkconfig iptables off 	# disable firewall autostart

Configure passwordless SSH login
SSH stands for Secure Shell, a protocol defined by the IETF Network Working Group; SSH is a security protocol built on top of the application layer.

Password-based authentication: based on a username and password

Key-based authentication: relies on a key pair; before connecting, you create a key pair and place the public key on the server you need to access.

[root@HadoopNode00 .ssh]# ssh-keygen -t rsa    # generate a public/private key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
39:e3:a3:d6:5a:19:4f:c5:02:01:e0:00:a9:f0:5c:4a root@HadoopNode00
The key's randomart image is:
+--[ RSA 2048]----+
|.o. ....o.       |
|o Eo.    . .     |
|o+ o.     . o    |
|. +      . o     |
|        S .      |
|       . B       |
|       .= .      |
|      .o..       |
|     .o.         |
+-----------------+
[root@HadoopNode00 .ssh]# ssh-copy-id HadoopNode00  # copy the public key to HadoopNode00
The authenticity of host 'hadoopnode00 (192.168.126.10)' can't be established.
RSA key fingerprint is 4d:18:40:3d:24:1a:85:ce:ea:3c:a2:76:85:47:e8:12.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoopnode00,192.168.126.10' (RSA) to the list of known hosts.
root@hadoopnode00's password:
Now try logging into the machine, with "ssh 'HadoopNode00'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

[root@HadoopNode00 .ssh]# ssh hadoopnode00   # passwordless login to hadoopnode00
Last login: Thu Oct 24 19:30:44 2019 from 192.168.126.1
[root@HadoopNode00 ~]# exit;
logout
Connection to hadoopnode00 closed.

Extracting and Installing Hadoop

[root@HadoopNode00 ~]# tar -zxvf hadoop-2.6.0.tar.gz -C /home/hadoop/
hadoop-2.6.0/
hadoop-2.6.0/share/

Configure the Hadoop environment variables

[root@HadoopNode00 ~]# vi .bashrc
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[root@HadoopNode00 ~]# source .bashrc

The HADOOP_HOME environment variable is relied on by third-party components: when HBase, Hive, Flume, or Spark integrate with Hadoop, they locate the Hadoop installation by reading HADOOP_HOME.

Configure core-site.xml
Located under etc/hadoop/ in the Hadoop installation root

<configuration>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://HadoopNode00:9000</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hadoop-2.6.0/hadoop-${user.name}</value>
</property>

</configuration>

Configure hdfs-site.xml
Located under etc/hadoop/ in the Hadoop installation root

<configuration>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

</configuration>

Starting HDFS

# Only needed on the very first start

[root@HadoopNode00 ~]# hdfs namenode -format   # format the NameNode


[root@HadoopNode00 ~]# tree /home/hadoop/hadoop-2.6.0/hadoop-root   # inspect the directory structure
/home/hadoop/hadoop-2.6.0/hadoop-root
└── dfs
    └── name
        └── current
            ├── fsimage_0000000000000000000
            ├── fsimage_0000000000000000000.md5
            ├── seen_txid
            └── VERSION

3 directories, 4 files

[root@HadoopNode00 ~]# start-dfs.sh   # start HDFS

[root@HadoopNode00 ~]# jps  # check the related processes
13779 DataNode
13669 NameNode
14201 SecondaryNameNode
14413 Jps

[root@HadoopNode00 ~]# stop-dfs.sh  # stop HDFS

[root@HadoopNode00 ~]# jps  # check the related processes
15323 Jps

If installation and startup succeeded, you can see information about the current node in the web UI:
hostname|IP:50070
On Windows, remember to map the hostname to the IP address in
C:\Windows\System32\drivers\etc\hosts
192.168.126.10 HadoopNode00

HDFS Shell

[root@HadoopNode00 ~]# hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.


[root@HadoopNode00 ~]# hadoop fs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Uploading a file

[root@HadoopNode00 ~]# hadoop fs -put 1.txt /

Downloading a file

[root@HadoopNode00 ~]# hadoop fs -get /1.txt /root/2.txt

Listing files

[root@HadoopNode00 ~]# hadoop fs -ls /
Found 1 items
-rw-r--r--   1 root supergroup       8917 2019-10-24 23:44 /1.txt

Copying a file

[root@HadoopNode00 ~]# hadoop fs -cp /1.txt /2.txt

Creating a directory

[root@HadoopNode00 ~]# hadoop fs -mkdir /baizhi

Moving a file from the local file system

[root@HadoopNode00 ~]# hadoop fs -moveFromLocal 1.txt.tar.gz /

Copying from HDFS to the local file system

[root@HadoopNode00 ~]# hadoop fs -copyToLocal  /1.txt /root/3.txt

Deleting a file

[root@HadoopNode00 ~]# hadoop fs -rm -r -f /1.txt

Deleting a directory

[root@HadoopNode00 ~]# hadoop fs -rmdir /baizhi

Displaying file contents

[root@HadoopNode00 ~]# hadoop fs -cat /2.txt

Following the latest file contents

[root@HadoopNode00 ~]# hadoop fs -tail -f /2.txt

Appending to a file

[root@HadoopNode00 ~]# hadoop fs -appendToFile  1.txt /2.txt

The Trash Mechanism

core-site.xml

<property>
  <name>fs.trash.interval</name>
  <value>1</value>
 </property>

Remember to restart after changing the configuration

[root@HadoopNode00 ~]# hadoop fs -rm -r -f /1.txt # delete the file; it is moved to trash and purged after one minute
19/10/25 00:16:59 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://HadoopNode00:9000/1.txt' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current

[root@HadoopNode00 ~]# hadoop fs -ls /user/root/.Trash/191025001700   # the file is visible in the corresponding trash directory
Found 1 items
-rw-r--r--   1 root supergroup       8917 2019-10-25 00:16 /user/root/.Trash/191025001700/1.txt

[root@HadoopNode00 ~]# hadoop fs -ls /user/root/.Trash/191025001700   # after one minute the file can no longer be seen
ls: `/user/root/.Trash/191025001700': No such file or directory

Replacing the Native Libraries

If the following warning appears when starting or using HDFS:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/10/25 00:06:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Replace the files under lib/native in your configured Hadoop installation with all of the corresponding files from a properly compiled Hadoop 2.6.0.

HDFS Java API

Dependencies

 <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.0</version>
        </dependency>
    </dependencies>

The Hadoop version used here must match the installed version.

Windows Hadoop Environment Setup

  • Extract Hadoop into a suitable directory
  • Place winutils.exe and hadoop.dll into Hadoop's bin directory
  • Configure the Hadoop environment variables on Windows

Resolving Insufficient-Permission Errors

org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)

Option 1

//add the following system property in the code
System.setProperty("HADOOP_USER_NAME","root");

Option 2

//add a -D parameter to the JVM
-DHADOOP_USER_NAME=root

Option 3

Disable permission checking

hdfs-site.xml

<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
  <description>
    If "true", enable permission checking in HDFS.
    If "false", permission checking is turned off,
    but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode,
    owner or group of files or directories.
  </description>
</property>

Obtaining a Client

private Configuration conf;
private FileSystem fileSystem;

@Before
public void getClient() throws Exception {
    //System.setProperty("HADOOP_USER_NAME","root");
    conf = new Configuration();
    conf.addResource("core-site.xml");
    conf.addResource("hdfs-site.xml");
    fileSystem = FileSystem.newInstance(conf);
}

@After
public void close() throws Exception {
    fileSystem.close();
}

Upload

@Test
public void testUpload01() throws Exception {
    /*
     * Copy a local file to HDFS (upload)
     * */
    Path localFile = new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\Hadoop.md");
    Path hdfsFile = new Path("/3.md");

    fileSystem.copyFromLocalFile(localFile, hdfsFile);
}

@Test
public void testUpload02() throws Exception {
    /*
     * Input stream over the local file
     * */
    FileInputStream inputStream = new FileInputStream(new File("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\Hadoop.md"));

    /*
     * Output stream to the HDFS file
     * */
    FSDataOutputStream outputStream = fileSystem.create(new Path("/4.md"));

    IOUtils.copyBytes(inputStream, outputStream, 1024, true);
}

Download

@Test
public void testDownload01() throws Exception {
    fileSystem.copyToLocalFile(false, new Path("/4.md"), new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\5.md"), true);
}

@Test
public void testDownload02() throws Exception {
    /*
     * Output stream to the local file
     * */
    FileOutputStream outputStream = new FileOutputStream(new File("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\6.md"));

    FSDataInputStream inputStream = fileSystem.open(new Path("/4.md"));

    IOUtils.copyBytes(inputStream, outputStream, 1024, true);
}

Delete

@Test
public void testDelete() throws Exception {
    boolean delete = fileSystem.delete(new Path("/1.md"), true);
    if (delete) {
        System.out.println("Deleted successfully");
    } else {
        System.out.println("Delete failed");
    }
}

Checking Whether a File Exists

@Test
public void testExists() throws Exception {
    boolean exists = fileSystem.exists(new Path("/123122.md"));

    if (exists) {
        System.out.println("Exists");
    } else {
        System.out.println("Does not exist");
    }
}

Listing Files

@Test
public void testListFile() throws Exception {
    RemoteIterator<LocatedFileStatus> remoteIterator = fileSystem.listFiles(new Path("/"), true);

    while (remoteIterator.hasNext()) {
        LocatedFileStatus fileStatus = remoteIterator.next();
        System.out.println(fileStatus.getPath());
    }
}

Creating a Directory

@Test
public void testMkdir() throws Exception {
    boolean is = fileSystem.mkdirs(new Path("/baizhi"));
    if (is) {
        System.out.println("Created successfully");
    } else {
        System.out.println("Creation failed");
    }
}

HDFS Architecture

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
NameNode: stores the file system metadata (data that describes the data), such as the file namespace and the mapping from blocks to DataNodes; it manages the DataNodes

DataNode: a node that stores data; it serves client read and write requests and reports block information to the NameNode

Block: the smallest unit into which files are split, by default a 128 MB slice; every block is replicated, with a default replication factor of 3 configured via dfs.replication, and users can set the block size via dfs.blocksize

Rack: a physical grouping of storage nodes, used to optimize storage and computation

# View the rack topology
[root@HadoopNode00 ~]# hdfs dfsadmin -printTopology
Rack: /default-rack
   192.168.126.10:50010 (HadoopNode00)

What Is a Block?

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
      The default block size for new files, in bytes.
      You can use the following suffix (case insensitive):
      k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
      Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
</property>


  • Why 128 MB?
    In Hadoop 1.x the block size was 64 MB (an industry constraint of the time)
    Hardware constraint: cheap PCs with slow mechanical hard drives
    Software optimization: the commonly accepted optimum is for seek time to be about 1/100 of transfer time

  • Can the block size be set arbitrarily?
    No. If the block is too small, millions of small files in the cluster increase seek time and kill efficiency;
    if it is too large, space is wasted and reads and writes take too long, which is also inefficient.
    The right size is whatever fits the workload.
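The seek-time rule of thumb above can be checked with a bit of arithmetic. This is a sketch using assumed, illustrative disk numbers (10 ms seek, 100 MB/s transfer), not measurements from any real cluster:

```java
// A back-of-the-envelope check of the "seek time ≈ 1% of transfer time" rule.
public class BlockSizeEstimate {

    // Block size (in MB) for which seek time is the given fraction of transfer time.
    static double idealBlockSizeMb(double seekMs, double transferMbPerSec, double ratio) {
        double transferMs = seekMs / ratio;              // target transfer time per block
        return transferMbPerSec * (transferMs / 1000.0); // MB transferred in that time
    }

    public static void main(String[] args) {
        // Assumed disk: 10 ms seek, 100 MB/s transfer; seek = 1/100 of transfer time.
        System.out.println(idealBlockSizeMb(10, 100, 0.01)); // ~100 MB, near the 128 MB default
    }
}
```

With those assumptions the "ideal" block comes out around 100 MB, which is why a default in the 64 to 128 MB range is a sensible compromise for mechanical disks.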

Rack Awareness

When a file is stored with the default replication factor of 3, the first replica is placed on the local node in the local rack, the second on another node in the same rack, and the third on a node outside the local rack.
One third of the replicas sit on one node, two thirds of the replicas sit in one rack, and the remaining third is distributed across the other racks.
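The placement rule can be sketched as a toy function. This is an illustrative simplification, not HDFS's actual BlockPlacementPolicy implementation:

```java
import java.util.*;

// Toy sketch of the default 3-replica placement rule: writer's node,
// another node on the writer's rack, then a node on a different rack.
public class ReplicaPlacementSketch {

    // rackOf maps node -> rack; returns the chosen replica nodes in order.
    static List<String> place(String writer, Map<String, String> rackOf) {
        String localRack = rackOf.get(writer);
        List<String> replicas = new ArrayList<>();
        replicas.add(writer);                          // replica 1: the local node
        for (String node : rackOf.keySet())            // replica 2: same rack, other node
            if (!node.equals(writer) && rackOf.get(node).equals(localRack)) {
                replicas.add(node);
                break;
            }
        for (String node : rackOf.keySet())            // replica 3: a different rack
            if (!rackOf.get(node).equals(localRack)) {
                replicas.add(node);
                break;
            }
        return replicas;
    }
}
```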

The Relationship Between the SecondaryNameNode and the NameNode (Important)

FSImage: a snapshot backup of the metadata; it is loaded into memory

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html

Edits: the edits file records file create and update operations, which improves efficiency

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html

When the NameNode starts, it must load the edits file and the fsimage file, which is why the NameNode has to be formatted before the first start.

When a user uploads or downloads a file, the operation is recorded in the edits file, so the edits file plus the fsimage file always add up to the latest metadata.

As users keep operating, the edits file inevitably grows, which makes the next cluster startup take too long (both files must be loaded).

To solve this, the edits file and the fsimage file can be merged.

The NameNode cannot merge the metadata itself; that task is delegated to the SecondaryNameNode. It fetches the NameNode's current edits and fsimage files onto its own node, uses its own node's compute resources to perform the merge, and uploads the resulting fsimage back to the NameNode, which then loads the latest metadata.

If a client issues new operations while the merge is in progress, they would modify the edits file, and writing to it mid-merge could corrupt its data. How is this solved? New operations are written to a file called edits-inprogress; once the merge completes, edits-inprogress is renamed to become the system's current edits file.
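The merge can be modeled with a small sketch. This is an assumed, greatly simplified in-memory model: the fsimage is a map, the edits log is a list of operations, and a checkpoint replays the edits into the image:

```java
import java.util.*;

// Toy model of the fsimage + edits checkpoint: replaying edits produces a
// fresh image, and edits-inprogress becomes the new edits file afterwards.
public class CheckpointSketch {
    Map<String, String> fsimage = new HashMap<>();   // path -> metadata
    List<String[]> edits = new ArrayList<>();        // {op, path, value}
    List<String[]> editsInProgress = new ArrayList<>();

    void record(String op, String path, String value) {
        edits.add(new String[]{op, path, value});
    }

    // Merge the edits into the fsimage; writes arriving during the merge
    // would go to editsInProgress, which then takes over as the edits file.
    void checkpoint() {
        for (String[] e : edits) {
            if (e[0].equals("create")) fsimage.put(e[1], e[2]);
            else if (e[0].equals("delete")) fsimage.remove(e[1]);
        }
        edits = editsInProgress;
        editsInProgress = new ArrayList<>();
    }
}
```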
Checkpoint Mechanism
dfs.namenode.checkpoint.period (default: 1 hour) specifies the maximum delay between two consecutive checkpoints

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
</property>

dfs.namenode.checkpoint.txns (default: 1,000,000) defines the number of uncheckpointed transactions on the NameNode that forces an urgent checkpoint, even if the checkpoint period has not yet elapsed.

<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>The Secondary NameNode or CheckpointNode will create a checkpoint
  of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless
  of whether 'dfs.namenode.checkpoint.period' has expired.
  </description>
</property>

Safe Mode

HDFS enters safe mode by default at startup and exits it automatically once it finds that the vast majority of blocks are available.
Safe mode is a read-only mode for the HDFS cluster; to put HDFS into safe mode explicitly, use the hdfs dfsadmin -safemode command.

[root@HadoopNode00 ~]# hadoop fs -put 1.txt /
[root@HadoopNode00 ~]# hadoop fs -ls /
Found 6 items
-rw-r--r--   1 root          supergroup       8917 2019-10-25 23:27 /1.txt
-rw-r--r--   1 root          supergroup      16946 2019-10-25 18:01 /2.md
-rw-r--r--   1 Administrator supergroup      18279 2019-10-25 18:07 /3.md
-rw-r--r--   1 Administrator supergroup      18279 2019-10-25 18:16 /4.md
drwxr-xr-x   - Administrator supergroup          0 2019-10-25 18:37 /baizhi
drwx------   - root          supergroup          0 2019-10-25 00:15 /user
[root@HadoopNode00 ~]# hdfs  dfsadmin -safemode enter   # enter safe mode
Safe mode is ON
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /2.txt
put: Cannot create file/2.txt._COPYING_. Name node is in safe mode.
[root@HadoopNode00 ~]# hdfs  dfsadmin -safemode leave   # leave safe mode
Safe mode is OFF
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /2.txt

Why is HDFS poorly suited to storing small files?

Files                     NameNode memory usage            DataNode disk usage
one 128 MB file           1 metadata entry, ~1 KB          128 MB * 3
1000 files of 128 KB      1000 metadata entries, ~1 MB     128 MB * 3

Because the NameNode stores the metadata on a single machine, storing too many small files strains its memory.

How can the small-file storage problem be solved?
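The arithmetic behind the comparison can be sketched as follows. The ~1 KB of NameNode memory per metadata entry is an assumed ballpark figure, and 1024 files of 128 KB are used here (instead of 1000) so the data totals match exactly:

```java
// Small files cost the same DataNode disk but far more NameNode memory.
public class SmallFileCost {

    static long namenodeBytes(long fileCount, long bytesPerEntry) {
        return fileCount * bytesPerEntry;          // one metadata entry per file
    }

    static long datanodeBytes(long totalDataBytes, int replication) {
        return totalDataBytes * replication;       // every byte stored 'replication' times
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L, kb = 1024L;
        // one 128 MB file vs 1024 files of 128 KB: identical disk footprint...
        System.out.println(datanodeBytes(128 * mb, 3) == datanodeBytes(1024 * 128 * kb, 3));
        // ...but 1024x the NameNode memory
        System.out.println(namenodeBytes(1024, kb) / namenodeBytes(1, kb)); // 1024
    }
}
```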

HDFS Write Path

HDFS Read Path

DataNode Operation

MapReduce

Overview

MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its core ideas, the concepts "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it vastly easier for programmers to run their programs on distributed systems without knowing distributed parallel programming. Current implementations specify a Map function, which maps a set of key/value pairs to a new set of key/value pairs, and a concurrent Reduce function, which ensures that all mapped key/value pairs sharing the same key are processed together.

MapReduce is a parallel computation framework that splits a job into two phases, Map and Reduce. It makes full use of the compute resources of the physical hosts where the storage nodes live (CPU, memory, network, and a little disk) for parallel computation. MapReduce requires YARN to be started in the cluster, which launches the corresponding processes: each node runs one NodeManager that manages and allocates that node's compute resources. By default a NodeManager abstracts its node's physical host into eight compute units, each called a Container, and all NodeManagers are scheduled by the ResourceManager.

MapReduce is good at processing big data; the idea behind Map is "divide and conquer".

  • Map handles the "divide": a complex task is broken down into a number of small tasks
    • The scale of the data or computation is much smaller than the original task
    • Computation moves close to the data, i.e., tasks are assigned to the nodes that hold the required data
    • Tasks can run in parallel, with no interference or dependencies between them
  • Reduce aggregates the results of the Map phase
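The map -> group-by-key -> reduce model described above can be illustrated in plain Java, without Hadoop, on an in-memory word count:

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory illustration of the MapReduce model: the "map" phase
// emits one pair per word, the "reduce" phase sums the counts per key.
public class MapReduceModel {

    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // map phase: split each line into words (each word counts as 1)
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // shuffle + reduce phase: group by word and sum the ones
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello yarn")));
    }
}
```

The WCMapper and WCReducer classes later in this section express exactly these two phases, with Hadoop handling the grouping and the distribution across nodes.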

Strengths and Weaknesses of MapReduce

  • Strengths
    • Easy to write distributed applications
    • Scales well
    • Suited to offline batch processing
    • Highly fault tolerant
  • Weaknesses
    • Not suited to stream processing
    • Not good at graph computation

Setting Up YARN

What is YARN?

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

NodeManager: manages the compute resources on its host; it is the per-machine framework agent and reports its status to the RM.

ResourceManager: plans the cluster's compute resources as a whole and has final authority over them.

ApplicationMaster: the master of a computation job; it requests resources and coordinates the computation tasks.

YarnChild: performs the actual computation tasks (MapTask | ReduceTask).

Container: an abstraction of compute resources, representing an allocation of memory, CPU, and network; both the ApplicationMaster and each YarnChild consume a Container.
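The Container accounting described above can be sketched as a toy model. This is an assumed simplification: a node exposes a fixed number of Containers, and the ApplicationMaster plus every YarnChild each consume one:

```java
// Toy model of per-node Container accounting under a NodeManager.
public class ContainerAccounting {
    int freeContainers;

    ContainerAccounting(int capacity) {
        this.freeContainers = capacity;
    }

    // Try to allocate one Container; returns false when the node is full.
    boolean allocate() {
        if (freeContainers == 0) return false;
        freeContainers--;
        return true;
    }

    public static void main(String[] args) {
        ContainerAccounting nm = new ContainerAccounting(8); // 8 units per node
        nm.allocate();                              // the MRAppMaster takes one
        for (int i = 0; i < 7; i++) nm.allocate();  // seven YarnChild tasks
        System.out.println(nm.allocate());          // false: no capacity left
    }
}
```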
Environment setup
etc/hadoop/yarn-site.xml

 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
   <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>HadoopNode00</value>
  </property>    

etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Starting and Stopping YARN

[root@HadoopNode00 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-resourcemanager-HadoopNode00.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-nodemanager-HadoopNode00.out
[root@HadoopNode00 ~]# stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop

[root@HadoopNode00 ~]# jps
43856 Jps
43362 ResourceManager
2403 NameNode
43636 NodeManager
2776 SecondaryNameNode
2556 DataNode

Visit: http://hadoopnode00:8088/cluster
http://hostname:8088

Getting Started with MR

Dependencies

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.6.0</version>
</dependency>

Mapper Logic

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
/*
 * keyIn   LongWritable (Long):  byte offset of the line within the input file
 * valueIn Text (String):        the input line of text
 * */

public class WCMapper extends Mapper<LongWritable, Text,Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /*
        * Extract each word
        * */
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word),new IntWritable(1));
        }
    }
}

Reducer Logic

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
/*
 *
 * (key<Text>, value<IntWritable>)
 *
 * */

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum+=value.get();
        }
        context.write(key,new IntWritable(sum));
    }
}

The Job Class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WCJob {
    public static void main(String[] args) throws Exception {
        /*
         * 1. Create the Job object
         * */
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WC-JOB01");
        
        job.setJarByClass(WCJob.class);
        /*
         * 2. Set the input and output formats
         * */
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        /*
         * 3. Set the input and output paths
         * */
        TextInputFormat.setInputPaths(job, new Path("/wcdata.txt"));
        /*
         * Note: the specified output directory must not already exist
         * */
        TextOutputFormat.setOutputPath(job, new Path("/test01"));
        /*
         * 4. Set the computation logic
         * */
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        /*
         * 5. Set the Mapper and Reducer output key/value types
         * */
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        /*
         * 6. Submit the job
         * */
        //job.submit();
        job.waitForCompletion(true);
    }
}

Submitting a Job

Submitting as a jar

// set the class used to locate the jar
job.setJarByClass(WCJob.class);

Package it with Maven
Detail:
<packaging>jar</packaging>

Run clean first, then package

hadoop jar Hadoop_Test-1.0-SNAPSHOT.jar com.baizhi.mr.test01.WCJob
hadoop jar <jar file name> <job class name>

Submitting Locally
log4j

<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>

log4j.properties

log4j.rootLogger = info,stdout

log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
Submitting locally still does not require changing the code written earlier,
but the file paths must be specified as local paths.
A local path may be mistakenly interpreted as a path on HDFS; in that case, prefix the path with file:///

Cross-Platform Submission

  • Copy the relevant configuration files into the resources directory (core|hdfs|mapred|yarn-site.xml)

Enable cross-platform submission

Add the following to mapred-site.xml

<property>
  <description>If enabled, user can submit an application cross-platform
  i.e. submit an application from a Windows client to a Linux/Unix server or
  vice versa.
  </description>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>

Add the following code to the Job class

conf.set("mapreduce.app-submission.cross-platform","true");
conf.addResource("conf2/core-site.xml");
conf.addResource("conf2/hdfs-site.xml");
conf.addResource("conf2/mapred-site.xml");
conf.addResource("conf2/yarn-site.xml");

// set the path to the jar
conf.set(MRJobConfig.JAR,"D:\\大数据\\Code\\BigData\\Hadoop_Test\\target\\Hadoop_Test-1.0-SNAPSHOT.jar");

Custom Bean Objects

Why custom Bean objects?

Development is never one-size-fits-all: Hadoop only ships serialization wrappers for a handful of data types, and in real development these basic types often cannot cope with complex requirements, so Bean objects are introduced to simplify development. Every object involved must then be made serializable.

Hadoop has its own serialization scheme, Writable.
Why not use the JDK's built-in serialization?
Because it is heavyweight, whereas the MR computation process needs efficient, fast transport.
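The fixed-order encode/decode pattern that Writable uses can be sketched with plain java.io streams, with no Hadoop dependency. The class name and fields below are illustrative; they mirror the FlowBean example that follows:

```java
import java.io.*;

// Writable-style serialization sketch: fields are written and read back in
// the same fixed order, with no class metadata in the stream (unlike JDK
// serialization, which writes class descriptors along with the data).
public class FlowRecord {
    long upFlow, downFlow;

    void write(DataOutput out) throws IOException {     // serialize (encode)
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    void readFields(DataInput in) throws IOException {  // deserialize (decode)
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    // Serialize a record into a byte buffer and read it back; returns {up, down}.
    static long[] roundTrip(long up, long down) {
        try {
            FlowRecord a = new FlowRecord();
            a.upFlow = up;
            a.downFlow = down;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            a.write(new DataOutputStream(buf));          // 16 bytes: just two longs
            FlowRecord b = new FlowRecord();
            b.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return new long[]{b.upFlow, b.downFlow};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = roundTrip(12113, 123);
        System.out.println(r[0] + " " + r[1]); // 12113 123
    }
}
```

The compactness comes from the fixed field order: writer and reader must agree on it exactly, which is why write and readFields must stay symmetric.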

Requirements

15713770999 12113 123 hn
15713770929 12133 123 zz
15713770909 123 1123 bj
13949158086 13 1213 kf
15713770929 11 12113 xy
15713770999 11113 123 hn
15713770929 123233 123 zz
15713770909 12113 1123 bj
13949158086 13 1213 kf
15713770929 121 12113 xy
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {
    private String phone;
    private Long upFlow;
    private Long downFlow;
    private Long sumFlow;

    public FlowBean() {

    }


    public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow) {
        this.phone = phone;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public String getPhone() {
        return phone;
    }

    public void setPhone(String phone) {
        this.phone = phone;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        return "FlowBean{" +
                "phone='" + phone + '\'' +
                ", upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", sumFlow=" + sumFlow +
                '}';
    }

    /*
     * Serialize (encode)
     * */
    public void write(DataOutput dataOutput) throws IOException {
        // write a flag so that readFields knows whether a phone value follows;
        // checking this.phone on the reading side cannot work, because a
        // freshly created bean always has phone == null
        dataOutput.writeBoolean(this.phone != null);
        if (this.phone != null) {
            dataOutput.writeUTF(this.phone);
        }
        dataOutput.writeLong(this.upFlow);
        dataOutput.writeLong(this.downFlow);
        dataOutput.writeLong(this.sumFlow);
    }

    /*
     * Deserialize (decode)
     * */
    public void readFields(DataInput dataInput) throws IOException {
        if (dataInput.readBoolean()) {
            this.phone = dataInput.readUTF();
        }
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FlowJob {
    public static void main(String[] args) throws Exception {


        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);


        job.setJarByClass(FlowJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        TextInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
        //TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
        TextOutputFormat.setOutputPath(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\out6"));
        //TextOutputFormat.setOutputPath(job, new Path("/out312313"));


        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);


        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);


        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(FlowBean.class);


        job.waitForCompletion(true);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/*
 * Input line format: phone upFlow downFlow, e.g.
 * 15713770999 12113 123
 * */
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /*
        * Grab the raw line
        * */
        String line = value.toString();

        /*
        * Split on spaces
        * */
        String[] infos = line.split(" ");


        /*
        * The phone number
        * */
        String phone = infos[0];

        /*
        * The upstream traffic
        * */
        Long up = Long.valueOf(infos[1]);

        /*
        * The downstream traffic
        * */
        Long down = Long.valueOf(infos[2]);

        /*
        * key: the phone number
        *
        * value: a FlowBean object
        * */
        context.write(new Text(phone), new FlowBean(null, up, down, up + down));
    }
}
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowReducer extends Reducer<Text, FlowBean, NullWritable, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {

        /*
         * e.g. key 15713770999 with values [bean01, bean02, bean03]
         * */
        Long up = 0L;
        Long down = 0L;
        Long sum = 0L;
        for (FlowBean value : values) {
            up += value.getUpFlow();
            down += value.getDownFlow();
            sum += value.getSumFlow();
        }
        context.write(NullWritable.get(),new FlowBean(key.toString(),up,down,sum));
    }
}

MR computation flow (key point)

  1. The programmer's MR program is submitted via the command line.
  2. Once running it is called a Job (one MR program is one Job); the Job registers with the ResourceManager (RM), sending some metadata along.
  3. If registration succeeds, the Job copies its resources (jar, configuration, split information) to the shared file system.
  4. The Job then submits the complete application description to the RM.
  5. The RM computes the resources the Job needs to start, connects to a NodeManager (NM), and launches an MRAppMaster there.
  6. The MRAppMaster initializes the Job, then fetches the input splits from the shared file system.
  7. The MRAppMaster requests compute resources from the RM.
  8. It connects to the assigned NMs, consuming Containers to launch YarnChild processes.
  9. Each YarnChild fetches the full Job resources.
  10. The computation runs.

Job submission flow

  1. Check specs: validate the output specification (the output directory must be set and must not already exist).
  2. Create the resource path: the staging directory.
  3. Create the job path: a folder named after the JobID under the staging directory.
  4. Copy the job resources to the cluster:

files
libjar
archives
jobJar

  5. Compute the input splits and generate the split plan files.
  6. Write the job configuration file into the staging path.

MapReduce components

Overview

From writing the WC example we can see that a job reads input and writes output according to certain rules, and is finally submitted to the Hadoop cluster.

Hadoop splits the input data into input splits (InputSplit) and hands each split to one MapTask. The MapTask keeps parsing key/value pairs out of its split and passes each pair to the map() function; the results are then divided into partitions according to the number of ReduceTasks and written to disk.

Each ReduceTask reads its own partition from every MapTask's node, uses a sort-based method to group together records with the same key, calls the reduce() function, and writes the result to a file.

The description above still lacks three components:

(1) one that specifies the input file format, splits the input into splits, and parses each split into the key/value objects the map() function expects;

(2) one that decides which ReduceTask each key/value pair produced by map() is handed to;

(3) one that specifies the output file format, i.e. in what form each key/value pair is saved to the output file.

In MR these three components are InputFormat, Partitioner, and OutputFormat. Each can be configured to match the business need, but for WC the defaults are enough.

In total, Hadoop exposes five programmable components:

InputFormat, Mapper, Partitioner, Reducer, OutputFormat

There is one more component, the Combiner, which is usually used to optimize MR performance; whether it applies depends on the business scenario.

The InputFormat component

InputFormat describes the format of the input data and provides two functions:

  • Data splitting: cut the input into splits by some policy, which determines the number of MapTasks and the split each one reads
  • Feeding data to the Mapper

What is an input split?

How are splits computed?

public List<InputSplit> getSplits(JobContext job) throws IOException {
        // timing only; not part of the split logic
        Stopwatch sw = (new Stopwatch()).start();
        // minimum split size, an input to the splitSize computation
        long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
        // maximum split size, an input to the splitSize computation
        long maxSize = getMaxSplitSize(job);
        // the list that will hold the computed splits
        List<InputSplit> splits = new ArrayList();
        // all input files
        List<FileStatus> files = this.listStatus(job);
        // iterate over the files
        Iterator i$ = files.iterator();

        while(true) {
            while(true) {
                while(i$.hasNext()) {
                    // the file's status (an HDFS file or a local file)
                    FileStatus file = (FileStatus)i$.next();
                    // the file's path
                    Path path = file.getPath();
                    // the file's length
                    long length = file.getLen();
                    // only non-empty files are split
                    if (length != 0L) {
                        // array holding the file's block locations
                        BlockLocation[] blkLocations;
                        // a LocatedFileStatus already carries its block locations
                        if (file instanceof LocatedFileStatus) {
                            blkLocations = ((LocatedFileStatus)file).getBlockLocations();
                        } else {
                            // otherwise get a FileSystem object
                            FileSystem fs = path.getFileSystem(job.getConfiguration());
                            // and ask it for all of the file's block locations
                            blkLocations = fs.getFileBlockLocations(file, 0L, length);
                        }

                        // isSplitable returns true by default
                        if (this.isSplitable(job, path)) {

                            // the file's block size, typically 128 MB
                            long blockSize = file.getBlockSize();
                            /*
                            protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
                                return Math.max(minSize, Math.min(maxSize, blockSize));
                            }
                            The code above computes the split size: blockSize clamped into [minSize, maxSize].
                            */
                            long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);

                            // bytes of the file not yet assigned to a split
                            long bytesRemaining;
                            // index of the block a split starts in
                            int blkIndex;

                            // carve off one full split per iteration while the remainder
                            // is still more than 1.1x the split size
                            for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
                                blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                                // record the split: path, start offset, length, and host locations
                                splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                            }
                            // once the remainder is at most 1.1x the split size,
                            // it becomes the final split (if non-empty)
                            if (bytesRemaining != 0L) {
                                blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                                splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                            }
                        } else {
                            // unsplittable files become a single split
                            splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
                        }
                    } else {
                        // empty files get one empty split
                        splits.add(this.makeSplit(path, 0L, length, new String[0]));
                    }
                }

                job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long)files.size());
                sw.stop();
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.elapsedMillis());
                }
                return splits;
            }
        }
    }
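The split-size arithmetic above can be exercised outside Hadoop. The sketch below is a standalone simulation (not Hadoop source): it mirrors computeSplitSize and the 1.1 slack check from the loop above, so a file slightly larger than one split does not produce a tiny trailing split.

```java
public class SplitMath {

    // Mirrors FileInputFormat.computeSplitSize: clamp blockSize into [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Mirrors the getSplits loop: keep cutting full splits while the remainder
    // is more than 1.1x a split; any final remainder becomes one last split.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > 1.1) {
            remaining -= splitSize;
            splits++;
        }
        return remaining > 0 ? splits + 1 : splits;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;                                 // 128 MB default block size
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE)); // 134217728
        // A 129 MB file is NOT cut in two: 129/128 = 1.008 <= 1.1, so one split
        System.out.println(countSplits(129L * 1024 * 1024, block));      // 1
        // A 260 MB file: 260/128 > 1.1, cut once; 132/128 = 1.03 <= 1.1 -> 2 splits
        System.out.println(countSplits(260L * 1024 * 1024, block));      // 2
    }
}
```

This is why a file marginally over the block size still maps to a single MapTask.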

How data is fed to the Mapper

TextInputFormat uses LineRecordReader, the RecordReader under org.apache.hadoop.mapreduce.lib.input. Its initialize method works out the split's start and end positions and opens an input stream on the file via a FileSystem object. The mapper's key/value pairs are then produced by LineRecordReader.nextKeyValue: the key is set to the byte offset of the current line, and the value is set to the line of text read by LineReader's readDefaultLine method:

public boolean nextKeyValue() throws IOException {
        if (this.key == null) {
            this.key = new LongWritable();
        }

        this.key.set(this.pos);
        if (this.value == null) {
            this.value = new Text();
        }

        int newSize = 0;

        while(this.getFilePosition() <= this.end || this.in.needAdditionalRecordAfterSplit()) {
            if (this.pos == 0L) {
                newSize = this.skipUtfByteOrderMark();
            } else {
                newSize = this.in.readLine(this.value, this.maxLineLength, this.maxBytesToConsume(this.pos));
                this.pos += (long)newSize;
            }

            if (newSize == 0 || newSize < this.maxLineLength) {
                break;
            }

            LOG.info("Skipped line of size " + newSize + " at pos " + (this.pos - (long)newSize));
        }

        if (newSize == 0) {
            this.key = null;
            this.value = null;
            return false;
        } else {
            return true;
        }
    }
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
        str.clear();
        int txtLength = 0;
        int newlineLength = 0;
        boolean prevCharCR = false;
        long bytesConsumed = 0L;

        do {
            int startPosn = this.bufferPosn;
            if (this.bufferPosn >= this.bufferLength) {
                startPosn = this.bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed;
                }

                this.bufferLength = this.fillBuffer(this.in, this.buffer, prevCharCR);
                if (this.bufferLength <= 0) {
                    break;
                }
            }

            while(this.bufferPosn < this.bufferLength) {
                if (this.buffer[this.bufferPosn] == 10) {
                    newlineLength = prevCharCR ? 2 : 1;
                    ++this.bufferPosn;
                    break;
                }

                if (prevCharCR) {
                    newlineLength = 1;
                    break;
                }

                prevCharCR = this.buffer[this.bufferPosn] == 13;
                ++this.bufferPosn;
            }

            int readLength = this.bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength;
            }

            bytesConsumed += (long)readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }

            if (appendLength > 0) {
                str.append(this.buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while(newlineLength == 0 && bytesConsumed < (long)maxBytesToConsume);

        if (bytesConsumed > 2147483647L) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        } else {
            return (int)bytesConsumed;
        }
    }

The relationship between splits and MapTasks

The number of MapTasks is determined by the number of splits; the number of ReduceTasks can be set manually and defaults to 1.

Commonly used InputFormats

  • FileInputFormat (reads HDFS files)
    • TextInputFormat
      • key: LongWritable, the byte offset of the line
      • value: Text, the current line of text
        Splitting: per file, cut by splitSize
    • NLineInputFormat
      • key: LongWritable, the byte offset of the line
      • value: Text, the current line of text
        Splitting: per file, N lines per split; the default is one line per split, and N is configurable
    • CombineTextInputFormat
      • key: LongWritable, the byte offset of the line
      • value: Text, the current line of text
        Splitting: cut by splitSize; one split may span multiple files
  • DBInputFormat (reads an RDBMS)
  • TableInputFormat (reads HBase) (key point)

NLineInputFormat

NLineInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
NLineInputFormat.setNumLinesPerSplit(job,3);

CombineTextInputFormat

CombineTextInputFormat.setMinInputSplitSize(job,1024000);
CombineTextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));

DBInputFormat

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DBJob {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/hadoop", "root", "1234");

        Job job = Job.getInstance(conf);

        job.setInputFormatClass(DBInputFormat.class);

        job.setOutputFormatClass(TextOutputFormat.class);


        DBInputFormat.setInput(job, User.class, "select id,name from user", "select count(1) from user");


        TextOutputFormat.setOutputPath(job, new Path("D:\\大数据训练营课程\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test05\\out1"));


        job.setMapperClass(DBMapper.class);

        job.setReducerClass(DBReducer.class);


        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);


        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);


        job.waitForCompletion(true);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DBMapper extends Mapper<LongWritable, User, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, User value, Context context) throws IOException, InterruptedException {

        context.write(key, new Text(value.toString()));

    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DBReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        for (Text value : values) {

            context.write(key, value);
        }
    }
}
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class User implements Writable, DBWritable {

    private Integer id;
    private String name;

    public User() {
    }

    public User(Integer id, String name) {
        this.id = id;
        this.name = name;
    }

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    /*
     * Serialization: encode fields to the byte stream (Writable)
     * */
    public void write(DataOutput dataOutput) throws IOException {

        dataOutput.writeInt(this.id);
        dataOutput.writeUTF(this.name);
    }

    /*
     * Deserialization: decode fields from the byte stream (Writable)
     * */
    public void readFields(DataInput dataInput) throws IOException {

        this.id = dataInput.readInt();
        this.name = dataInput.readUTF();
    }

    /*
     * Write fields into the PreparedStatement (DBWritable)
     * */
    public void write(PreparedStatement pstm) throws SQLException {

        pstm.setInt(1, this.id);
        pstm.setString(2, this.name);

    }

    /*
     * Read fields from the ResultSet (DBWritable)
     * */
    public void readFields(ResultSet rs) throws SQLException {

        this.id = rs.getInt(1);
        this.name = rs.getString(2);


    }

    @Override
    public String toString() {
        return "User{" +
                "id=" + id +
                ", name='" + name + '\'' +
                '}';
    }
}
  • Local run mode
    add the MySQL dependency to the pom file
  • Submitting as a jar
    copy the MySQL jar into /home/hadoop/hadoop-2.6.0/share/hadoop/yarn/
  • Remote submission
    same as before

Custom InputFormat

Problem to solve: storing many small files in HDFS.

To solve the small-file problem, merge multiple files into one SequenceFile (a Hadoop-specific file format for storing binary key-value pairs; here one SequenceFile holds many files, with path + file name as the key and the file contents as the value).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

public class OwnInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;

    }

    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

        OwnRecordReader recordReader = new OwnRecordReader();
        recordReader.initialize(inputSplit, taskAttemptContext);
        return recordReader;
    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OwnJob {
    public static void main(String[] args) throws Exception {

        /*
         * 1. Build the Job object
         * */
        // System.setProperty("HADOOP_USER_NAME", "root");

        Configuration conf = new Configuration();
        /*
         * Enable cross-platform submission (needed only for remote submit)
         * */
       /* conf.set("mapreduce.app-submission.cross-platform", "true");

        conf.addResource("conf2/core-site.xml");
        conf.addResource("conf2/hdfs-site.xml");
        conf.addResource("conf2/mapred-site.xml");
        conf.addResource("conf2/yarn-site.xml");


         conf.set(MRJobConfig.JAR,"D:\\大数据训练营课程\\Code\\BigData\\Hadoop_Test\\target\\Hadoop_Test-1.0-SNAPSHOT.jar");
*/
        Job job = Job.getInstance(conf, "WC-JOB01");

        job.setJarByClass(OwnJob.class);




        /*
         * 2. Set the input and output formats
         * */

        job.setInputFormatClass(OwnInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);


        /*
         * 3. Set the input and output paths
         * */

        OwnInputFormat.addInputPath(job,new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
        //TextInputFormat.setInputPaths(job, new Path("/access.tmp2019-05-19-10-28.log"),new Path("/4.md"));
        //TextInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
        //CombineTextInputFormat.setMinInputSplitSize(job,1024000);
        //CombineTextInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
        //NLineInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
        //NLineInputFormat.setNumLinesPerSplit(job,3);
        /*
         * Note: the specified output folder must not already exist
         * */
        //TextOutputFormat.setOutputPath(job, new Path("/outtes1t1112111311"));
        //TextOutputFormat.setOutputPath(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\out2"));
        SequenceFileOutputFormat.setOutputPath(job,new Path("D:\\大数据训练营课程\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test06\\out4"));


        /*
         * 4. Set the computation logic (Mapper and Reducer)
         * */

        job.setMapperClass(OwnMapper.class);
        job.setReducerClass(OwnReducer.class);


        /*
         * 5. Set the Mapper and Reducer output key/value types
         * */

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);


        /*
         * 6. Submit the job
         * */

        //job.submit();
        job.waitForCompletion(true);

    }
}
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class OwnMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {

        context.write(key, value);

    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class OwnRecordReader extends RecordReader<Text, BytesWritable> {


    private FileSplit fileSplit;
    private Configuration conf;

    private Text key = new Text();
    private BytesWritable value = new BytesWritable();


    boolean isProgress = true;


    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

        /*
         * Initialization
         * */
        // the file split this reader serves
        this.fileSplit = (FileSplit) inputSplit;

        // the Configuration object
        conf = taskAttemptContext.getConfiguration();


    }

    public boolean nextKeyValue() throws IOException, InterruptedException {


        if (isProgress) {
            // byte array sized to hold the whole file's contents
            byte[] bytes = new byte[(int) fileSplit.getLength()];

            // the file's path
            Path path = fileSplit.getPath();

            FileSystem fileSystem = path.getFileSystem(conf);

            FSDataInputStream fsDataInputStream = fileSystem.open(path);


            IOUtils.readFully(fsDataInputStream, bytes, 0, bytes.length);

            /*
             * key = the file path
             * */
            key.set(path.toString());
            /*
             * value = the file contents
             * */
            value.set(bytes, 0, bytes.length);

            isProgress = false;

            return true;
        }
        return false;
    }

    public Text getCurrentKey() throws IOException, InterruptedException {
        return this.key;
    }

    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return this.value;
    }

    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    public void close() throws IOException {

    }
}
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class OwnReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {

        for (BytesWritable value : values) {
            context.write(key, value);
        }
    }
}

The Partitioner component

The default HashPartitioner computes a key's partition from its hash code:
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

Requirement:
write the flow records for different areas into different output files

public class App {
    public static void main(String[] args) {


        String hn = "hn";

        String zz = "zz";

        String xy = "xy";

        String kf = "kf";

        String bj = "bj";
        // HashPartitioner logic with numReduceTasks = 1 (2147483647 is Integer.MAX_VALUE;
        // the mask clears the sign bit), so every key maps to partition 0
        System.out.println((hn.hashCode()& 2147483647) % 1);
        System.out.println((zz.hashCode()& 2147483647) % 1);
        System.out.println((xy.hashCode()& 2147483647) % 1);
        System.out.println((kf.hashCode()& 2147483647) % 1);
        System.out.println((bj.hashCode()& 2147483647) % 1);

    }
}
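The App class above always uses % 1 (a single reducer, so everything lands in partition 0). With more reducers, the same HashPartitioner formula spreads keys across partitions. A standalone sketch (not Hadoop code; the class name is made up for illustration):

```java
public class HashPartitionDemo {

    // Default HashPartitioner logic: mask off the sign bit, then mod by the reducer count.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"hn", "zz", "xy", "kf", "bj"};
        for (String key : keys) {
            // With 1 reducer every key maps to 0; with 5 reducers, keys spread over 0..4.
            System.out.println(key + " -> " + partition(key, 1) + " / " + partition(key, 5));
        }
    }
}
```

Note that hash partitioning gives no control over which key lands where; that is exactly why the custom partitioner below maps area codes to fixed partition numbers.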

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AreaJob {
    public static void main(String[] args) throws Exception {


        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);


        job.setJarByClass(AreaJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setPartitionerClass(OwnPartatiner.class);

        TextInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
        //TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
        TextOutputFormat.setOutputPath(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\out112311111"));
        //TextOutputFormat.setOutputPath(job, new Path("/out312313"));


        job.setMapperClass(AreaMapper.class);
        job.setReducerClass(AreaReducer.class);


        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);


        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // The partitioner only returns partitions 0-4, so with 100 reduce
        // tasks the remaining 95 produce empty output files
        job.setNumReduceTasks(100);

        job.waitForCompletion(true);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class AreaMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String[] line = value.toString().split(" ");


        String area = line[3];

        String phone = line[0];
        Long up = Long.valueOf(line[1]);

        Long down = Long.valueOf(line[2]);

        Long sum = up + down;

        context.write(new Text(area), new FlowBean(phone, up, down, sum));

    }
}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class AreaReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {

        for (FlowBean value : values) {
            context.write(key, value);
        }
    }
}
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {
    private String phone;
    private Long upFlow;
    private Long downFlow;
    private Long sumFlow;

    public FlowBean() {

    }


    public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow) {
        this.phone = phone;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public String getPhone() {
        return phone;
    }

    public void setPhone(String phone) {
        this.phone = phone;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        return "FlowBean{" +
                "phone='" + phone + '\'' +
                ", upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", sumFlow=" + sumFlow +
                '}'  ;
    }

    /*
     * Serialization: encode fields to the byte stream
     * */
    public void write(DataOutput dataOutput) throws IOException {


        dataOutput.writeUTF(this.phone);

        dataOutput.writeLong(this.upFlow);
        dataOutput.writeLong(this.downFlow);
        dataOutput.writeLong(this.sumFlow);


    }

    /*
     * Deserialization: decode fields from the byte stream
     * */
    public void readFields(DataInput dataInput) throws IOException {
        this.phone = dataInput.readUTF();
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }
}

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;

public class OwnPartatiner extends Partitioner<Text, FlowBean> {

    private static HashMap<String, Integer> areaMap = new HashMap<String, Integer>();

    static {
        areaMap.put("hn", 0);
        areaMap.put("xy", 1);
        areaMap.put("kf", 2);
        areaMap.put("bj", 3);
        areaMap.put("zz", 4);
    }


    public int getPartition(Text key, FlowBean value, int i) {


        /*if (areaMap.get(key.toString()) != null) {
            Integer integer = areaMap.get(key.toString());
            return integer;
        } else {

            return 0;
        }
*/
        return areaMap.get(key.toString())==null ? 0 :areaMap.get(key.toString());
    }
}

OutputFormat

Custom OutputFormat

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyJob {
    public static void main(String[] args) throws Exception {


        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);


        job.setJarByClass(MyJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(OwnOutputFormat.class);


        TextInputFormat.setInputPaths(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
        //TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
        OwnOutputFormat.setOutputPath(job, new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\asdjk1ha1"));
        //TextOutputFormat.setOutputPath(job, new Path("/out312313"));


        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);


        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);


        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);


        job.waitForCompletion(true);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        context.write(key, value);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {


        for (Text value : values) {

            context.write(key, value);
        }
    }
}
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

// Custom RecordWriter: writes each value as one line of text to a
// hard-coded output file (the path below is specific to this demo).
public class OenRecordWriter extends RecordWriter<LongWritable, Text> {
    private FSDataOutputStream outputStream;

    public OenRecordWriter(TaskAttemptContext context) throws Exception {
        FileSystem fileSystem = FileSystem.get(context.getConfiguration());
        outputStream = fileSystem.create(new Path("D:\\大数据训练营课程\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test08\\out1\\1.txt"));
    }

    @Override
    public void write(LongWritable longWritable, Text text) throws IOException, InterruptedException {
        outputStream.write((text.toString() + "\n").getBytes());
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(outputStream);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

// Custom OutputFormat: hands each task an OenRecordWriter.
public class OwnOutputFormat extends FileOutputFormat<LongWritable, Text> {

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        try {
            return new OenRecordWriter(taskAttemptContext);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}

Combiner Component

(1) A Combiner is an MR component that sits between the Mapper and the Reducer.

(2) The Combiner's parent class is Reducer.

(3) A Combiner and a Reducer differ in where they run:

The Combiner runs on the node where each MapTask runs
The Reducer receives the output of all Mappers globally

(4) The point of a Combiner is to locally aggregate each MapTask's output, reducing network traffic.
(5) Using a Combiner must not change the final business result:

Summation works: adding partial sums gives the global sum.
Averaging does not: the average of 0, 20, 10, 25, 15 is 14, but if a Combiner first averages
(0, 10, 20) to 10 and (25, 15) to 20, the Reducer then averages (10, 20) and gets 15.

Suitable for summation, not for averaging.

(6) The Combiner's output KV types must match the Reducer's input KV types.

(7) Usage:

  • Create a new CombinerClass that extends Reducer, or
  • reuse the existing Reducer directly.

(8) Configuration:

job.setCombinerClass(WCReducer.class);
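The sum-versus-average pitfall above can be checked with plain Java, no Hadoop required; `CombinerSafety`, `sum`, and `avg` are illustrative names, not Hadoop APIs:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSafety {

    // Sum is associative: combining partial sums preserves the global sum.
    static long sum(List<Long> values) {
        return values.stream().mapToLong(Long::longValue).sum();
    }

    // Average is NOT associative: averaging partial averages is wrong.
    static double avg(List<Long> values) {
        return values.stream().mapToLong(Long::longValue).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Long> split1 = Arrays.asList(0L, 10L, 20L); // output of MapTask 1
        List<Long> split2 = Arrays.asList(25L, 15L);     // output of MapTask 2

        // Sum: the combiner changes nothing. 30 + 40 == 70 == global sum.
        long global = sum(Arrays.asList(0L, 10L, 20L, 25L, 15L));
        long combined = sum(split1) + sum(split2);
        System.out.println(global == combined); // true

        // Average: the combiner distorts the result, 15.0 instead of 14.0.
        double trueAvg = avg(Arrays.asList(0L, 10L, 20L, 25L, 15L));
        double combinerAvg = avg(Arrays.asList((long) avg(split1), (long) avg(split2)));
        System.out.println(trueAvg + " vs " + combinerAvg); // 14.0 vs 15.0
    }
}
```

This is exactly why the rule says "suitable for summation, not for averaging": to average safely through a combiner you would have to emit (sum, count) pairs instead of the average itself.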


MR Process

To let Reduce tasks process Map output in parallel, the Map output must be partitioned (Partition), sorted (Sort), combined (Combine), and merged (Merge) into intermediate results of the form <key, value-list>, which are then handed to the corresponding Reduce tasks. This process is called Shuffle: it turns unordered <key, value> pairs into ordered <key, value-list> groups.

Shuffle

The Shuffle phase is the core of MR; it covers everything between MapTask output and ReduceTask input. Because big-data processing runs on a cluster, with nodes working in parallel and possibly several jobs running at once, the efficiency of node-to-node data transfer affects the final computation speed. And because the MR framework uses local disk as the storage medium for spill files, disk I/O affects it as well.

From this analysis, the Shuffle process basically needs to:

  • pull data completely from the MapTask nodes to the ReduceTask;
  • consume as little network bandwidth as possible while pulling data;
  • minimize the impact of disk I/O on task execution efficiency.

Summary: Shuffle is the process of partitioning, sorting, and combining the Map output and handing it to the Reduce side. It splits into a Map side and a Reduce side.
Map-side Shuffle

 org.apache.hadoop.mapred;
 --MapTask
   --MapOutputBuffer
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting 
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>

<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a
  thread will begin to spill the contents to disk in the background. Note that
  collection will not block if this threshold is exceeded while a spill is
  already in progress, so spills may be larger than this threshold when it is
  set to less than .5</description>
</property>

The Map side of Shuffle first buffers results in memory (a circular buffer); by default, once the buffer reaches 80 MB its contents are spilled to disk. Before each spill the buffered data is partitioned and sorted, and only then written out. Every spill produces a new disk file, so many small files accumulate as the MapTask runs; when the Map-side computation finishes, these small files are merged into one large file, and the corresponding ReduceTasks are notified to fetch the data that belongs to them.

  • Map output is written to the in-memory buffer.
  • When the buffer reaches its threshold, the contents spill to disk.
  • The sorted partitions are merged into one large file of (key, value[]) groups.
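With the defaults from the config excerpt above (mapreduce.task.io.sort.mb = 100, mapreduce.map.sort.spill.percent = 0.80), the spill trigger works out to 80 MB; a minimal sketch of that arithmetic (class and method names are made up for illustration):

```java
public class SpillThreshold {

    // Spilling begins when buffer usage reaches sort.mb * spill.percent.
    static int spillTriggerMb(int sortMb, double spillPercent) {
        return (int) (sortMb * spillPercent);
    }

    public static void main(String[] args) {
        // Defaults: mapreduce.task.io.sort.mb=100, mapreduce.map.sort.spill.percent=0.80
        System.out.println(spillTriggerMb(100, 0.80)); // prints 80
    }
}
```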

Reduce-side Shuffle

  • Fetch the data (each ReduceTask pulls its partition from the Map side).
  • Merge the fetched data.
  • Feed the merged data to the Reduce task.


MR Optimization Strategies

(1) Small-file optimization: use CombineTextInputFormat to override the split-computation logic.

(2) Implement a custom Partitioner to guard against data skew.

(3) Tune the YarnChild memory parameters appropriately (see the YARN parameter configuration reference).

(4) Tune the spill buffer size and threshold appropriately.

(5) Tune the file-merge parallelism, mapreduce.task.io.sort.factor, appropriately.

(6) GZIP-compress the Map-side spill output to save network bandwidth:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>false</value>
  <description>Should the outputs of the maps be compressed before being
               sent across the network. Uses SequenceFile compression.
  </description>
</property>


<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>If the map outputs are compressed, how should they be 
               compressed?
  </description>
</property>

// enable compression of intermediate map output and select GZIP as the codec
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
conf.set("mapreduce.map.output.compress", "true");
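The bandwidth saving that strategy (6) buys can be illustrated with the JDK's own GZIP stream, no Hadoop needed (GzipCodec wraps the same DEFLATE family); the repetitive sample records below are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class SpillCompressionDemo {

    // GZIP-compress a byte buffer, as Hadoop does with map spill output
    // when mapreduce.map.output.compress is true.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive key/value text, typical of intermediate map output.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("1363157985066\t13726230503\t2481\t24681\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] zipped = gzip(raw);
        System.out.println(raw.length + " bytes -> " + zipped.length + " bytes");
    }
}
```

Highly repetitive intermediate output compresses well, so the CPU cost of compression usually pays for itself in reduced shuffle traffic.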