Apache Hadoop

Author: Lijb
Email: lijb1121@163.com
WeChat: ljb1121

Big Data

  • Volume: processing data at GB/TB/PB scale (storage and analysis)
  • Velocity: results must be produced within a bounded time (within a few hours)
  • Variety: data exists in many forms, e.g. sensor readings, web server logs, user behavior data
  • Veracity/value: the data must be valuable; collected data needs to be cleaned and de-noised

What problems does big data solve?

  • Storage

Single-machine storage has hard limits (limited capacity, data is not safe) and poor read/write efficiency (sequential I/O). Big data takes distributed storage as its guiding principle: by building a cluster, storage capacity can be scaled out horizontally across hardware. Common solutions for storing massive amounts of data today include Hadoop HDFS, FastDFS/GlusterFS (traditional file systems), MongoDB GridFS, S3, and so on.

  • Computation

A single machine can only process a limited amount of data, and the time required is hard to control. Big data proposes a different model: split a large computation into n small tasks, distribute those tasks to the compute nodes of a cluster, and let the nodes cooperate to finish the job. This model is called distributed computing. The mainstream computation modes today are offline (batch) computing, near-real-time computing, and online real-time computing. Offline computing is represented by Hadoop MapReduce, near-real-time computing by Spark's in-memory computation, and online real-time computing by Storm, Kafka Streams, and Spark Streaming.

Summary: all of the above solve the problem with distributed thinking. Because scaling hardware vertically is expensive, most companies prefer horizontal, distributed scaling.

The Birth of Hadoop

Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a subproject of Lucene. It was inspired by MapReduce and the Google File System (GFS), first developed at Google Labs. The storage and computation modules of the Nutch project were reworked; in 2006 a Yahoo team joined the Nutch effort and split the storage and computation modules out, which produced the Hadoop 1.x line. In 2010 Yahoo restructured Hadoop's computation module and further improved the management model of the MapReduce framework; this became the Hadoop 2.x line. The biggest difference between Hadoop 1.x and Hadoop 2.x is this rewrite of the computation framework: Hadoop 2.x introduced YARN as the resource scheduler for the computation layer, making the MapReduce framework more scalable and its runtime management more reliable.

  • HDFS: storage
  • MapReduce: computation (MR2 introduced the YARN resource manager)

The Big Data Ecosystem

Hadoop ecosystem, circa 2006 (the storage/compute foundation):

  • HDFS: Hadoop's distributed file storage system
  • MapReduce: Hadoop's distributed computation framework
  • HBase: a NoSQL database built on top of HDFS
  • Flume: a distributed log collection system
  • Kafka: a distributed message queue
  • Hive: a data warehouse / ETL (Extract-Transform-Load) tool that translates user-written SQL into MapReduce programs
  • Storm: a stream computing framework for online, real-time processing
  • Mahout: (a "mahout" is an elephant rider) a library of algorithms implemented on MapReduce, typically used for machine learning and AI

Compute layer, circa 2010:

  • Spark Core: drawing on the lessons of the MapReduce design, UC Berkeley's lab built an in-memory distributed computing framework in Scala; its performance is roughly 10 to 100 times that of Hadoop MapReduce.
  • Spark SQL: express Spark computation logic with SQL statements, improving interactivity.
  • Spark Streaming: Spark's stream computing framework.
  • Spark MLlib: a machine learning library built on Spark.
  • Spark GraphX: graph computation on Spark.

Big data application scenarios

  • Traffic management
  • E-commerce systems (log analysis)
  • Internet healthcare
  • Financial information
  • The state power grid

Hadoop Distributed File System

HDFS Architecture

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  • NameNode and DataNode

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

NameNode: the name node manages the cluster's metadata (which describes the mapping of blocks to DataNodes and the replica information for each block) as well as the DataNodes themselves. The number of block replicas (the replication factor) is controlled by dfs.replication; in single-node mode it must be set to 1. The client entry point of the NameNode is configured with fs.defaultFS. The first time HDFS is set up you must run hdfs namenode -format, which creates the NameNode's initial metadata file, fsimage.

DataNode: stores block data, serves client read and write requests for blocks, carries out NameNode instructions, and reports its own status back to the NameNode.

Block: the unit into which files are split, 128 MB by default; configurable via dfs.blocksize.

Rack: identifies the physical location of the host a DataNode runs on and is used to optimize storage and computation. The rack topology can be viewed with hdfs dfsadmin -printTopology.
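
A minimal sketch of inspecting this metadata from the Java API (the path /demo/access and the class name BlockInfo are only illustrative): it prints a file's block size and replication factor, plus the DataNode hosts that hold each block's replicas.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.Arrays;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://CentOS:9000");   // NameNode client entry point
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/demo/access"));   // hypothetical file
            System.out.println("blockSize=" + status.getBlockSize()
                    + " replication=" + status.getReplication());

            // One BlockLocation per block; getHosts() lists the DataNodes storing its replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " length=" + loc.getLength()
                        + " hosts=" + Arrays.toString(loc.getHosts()));
            }
            fs.close();
        }
    }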

  • NameNode & SecondaryNameNode

Reading: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Setting Up Hadoop HDFS (Distributed Storage)

  • Configure the hostname-to-IP mapping
    [root@CentOS ~]# vi /etc/hosts
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    192.168.29.128 CentOS
  • Turn off the system firewall
    [root@CentOS ~]# clear
    [root@CentOS ~]# service iptables stop
    iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
    iptables: Flushing firewall rules:                         [  OK  ]
    iptables: Unloading modules:                               [  OK  ]
    [root@CentOS ~]# chkconfig iptables off
    [root@CentOS ~]# chkconfig --list | grep iptables
    iptables        0:off   1:off   2:off   3:off   4:off   5:off   6:off
    [root@CentOS ~]# 
    
  • Configure passwordless SSH login to this machine
    [root@CentOS ~]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa): 
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    de:83:c8:81:77:d7:db:f2:79:da:97:8b:36:d5:78:01 root@CentOS
    The key's randomart image is:
    +--[ RSA 2048]----+
    |                 |
    |             E   |
    |              .  |
    |     .     .   . |
    |    . o S . .  .o|
    |     o = +   o..o|
    |      o o o o o..|
    |           . =.+o|
    |            ..=++|
    +-----------------+
    [root@CentOS ~]# ssh-copy-id CentOS
    root@centos's password: ****
    Now try logging into the machine, with "ssh 'CentOS'", and check in:
    
      .ssh/authorized_keys
    
    to make sure we haven't added extra keys that you weren't expecting
    [root@CentOS ~]# ssh CentOS
    Last login: Mon Oct 15 19:47:46 2018 from 192.168.29.1
    [root@CentOS ~]# exit
    logout
    Connection to CentOS closed.
  • Configure the Java development environment
    [root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm
    Preparing...                ########################################### [100%]
       1:jdk1.8                 ########################################### [100%]
    Unpacking JAR files...
            tools.jar...
            plugin.jar...
            javaws.jar...
            deploy.jar...
            rt.jar...
            jsse.jar...
            charsets.jar...
            localedata.jar...
            
     [root@CentOS ~]# vi .bashrc 
    JAVA_HOME=/usr/java/latest
    PATH=$PATH:$JAVA_HOME/bin
    CLASSPATH=.
    export JAVA_HOME
    export PATH
    export CLASSPATH
    [root@CentOS ~]# source .bashrc 
    [root@CentOS ~]# jps
    1674 Jps
  • Install and configure Hadoop HDFS
    [root@CentOS ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/
    [root@CentOS ~]# vi .bashrc
    HADOOP_HOME=/usr/hadoop-2.6.0
    JAVA_HOME=/usr/java/latest
    PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    CLASSPATH=.
    export JAVA_HOME
    export PATH
    export CLASSPATH
    export HADOOP_HOME
    [root@CentOS ~]# source .bashrc 
    [root@CentOS ~]# hadoop classpath
    /usr/hadoop-2.6.0/etc/hadoop:/usr/hadoop-2.6.0/share/hadoop/common/lib/*:/usr/hadoop-2.6.0/share/hadoop/common/*:/usr/hadoop-2.6.0/share/hadoop/hdfs:/usr/hadoop-2.6.0/share/hadoop/hdfs/lib/*:/usr/hadoop-2.6.0/share/hadoop/hdfs/*:/usr/hadoop-2.6.0/share/hadoop/yarn/lib/*:/usr/hadoop-2.6.0/share/hadoop/yarn/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.6.0/share/hadoop/mapreduce/*:/usr/hadoop-2.6.0/contrib/capacity-scheduler/*.jar
  • Edit core-site.xml

[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/core-site.xml

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://CentOS:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
    </property>
  • Edit hdfs-site.xml

[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
  • Edit the slaves file
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves
CentOS
  • Initialize (format) HDFS
    [root@CentOS ~]# hdfs namenode -format
    ...
    18/10/15 20:06:03 INFO namenode.NNConf: ACLs enabled? false
    18/10/15 20:06:03 INFO namenode.NNConf: XAttrs enabled? true
    18/10/15 20:06:03 INFO namenode.NNConf: Maximum size of an xattr: 16384
    18/10/15 20:06:03 INFO namenode.FSImage: Allocated new BlockPoolId: BP-665637298-192.168.29.128-1539605163213
    `18/10/15 20:06:03 INFO common.Storage: Storage directory /usr/hadoop-2.6.0/hadoop-root/dfs/name has been successfully formatted.`
    18/10/15 20:06:03 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
    18/10/15 20:06:03 INFO util.ExitUtil: Exiting with status 0
    18/10/15 20:06:03 INFO namenode.NameNode: SHUTDOWN_MSG: 
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at CentOS/192.168.29.128
    ************************************************************/
    [root@CentOS ~]# ls /usr/hadoop-2.6.0/
    bin  `hadoop-root`  lib      LICENSE.txt  README.txt  share
    etc  include      libexec  NOTICE.txt   sbin

This command only needs to be run the first time HDFS is set up; it does not need to be run again on later restarts.

  • Start the HDFS service
    [root@CentOS ~]# start-dfs.sh 
    Starting namenodes on [CentOS]
    CentOS: starting namenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-namenode-CentOS.out
    CentOS: starting datanode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-datanode-CentOS.out
    Starting secondary namenodes [0.0.0.0]
    The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
    RSA key fingerprint is 7c:95:67:06:a7:d0:fc:bc:fc:4d:f2:93:c2:bf:e9:31.
    Are you sure you want to continue connecting (yes/no)? yes
    0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
    0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-CentOS.out
    [root@CentOS ~]# jps
    2132 SecondaryNameNode
    1892 NameNode
    2234 Jps
    1998 DataNode

The NameNode web UI can then be accessed from a browser at http://192.168.29.128:50070

HDFS Shell

    [root@CentOS ~]# hadoop fs -help        # or: hdfs dfs -help
    Usage: hadoop fs [generic options]
    	[-appendToFile <localsrc> ... <dst>]
    	[-cat [-ignoreCrc] <src> ...]
    	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    	[-help [cmd ...]]
    	[-ls [-d] [-h] [-R] [<path> ...]]
    	[-mkdir [-p] <path> ...]
    	[-moveFromLocal <localsrc> ... <dst>]
    	[-moveToLocal <src> <localdst>]
    	[-mv <src> ... <dst>]
    	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
    	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
    	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    	[-tail [-f] <file>]
    	[-text [-ignoreCrc] <src> ...]
    	[-touchz <path> ...]
    	[-usage [cmd ...]]
    [root@CentOS ~]# hadoop fs -copyFromLocal /root/hadoop-2.6.0_x64.tar.gz /
    [root@CentOS ~]# hdfs dfs -copyToLocal /hadoop-2.6.0_x64.tar.gz ~/

Reference: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

Operating HDFS with the Java API

  • Windows development environment setup
    • Unzip the Hadoop distribution to C:/
    • Set the HADOOP_HOME environment variable
    • Copy winutils.exe and hadoop.dll into the bin directory of the Hadoop installation
    • Add the CentOS hostname-to-IP mapping on the Windows machine
    • Restart IDEA so that it picks up the HADOOP_HOME environment variable
  • Add the Maven dependencies
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
    </dependency>
  • Disable HDFS permission checking
    org.apache.hadoop.security.AccessControlException: Permission denied: user=HIAPAD, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
    	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
    ...

Option 1: configure hdfs-site.xml

    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>

Option 2: add the JVM startup parameter -DHADOOP_USER_NAME=root

java xxx -DHADOOP_USER_NAME=root
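
As a variant of option 2, the same effect can usually be achieved from client code by setting the HADOOP_USER_NAME system property before the FileSystem is created (a sketch that assumes simple authentication, i.e. no Kerberos):

    // Assumption: simple authentication; Hadoop then takes the user from HADOOP_USER_NAME.
    System.setProperty("HADOOP_USER_NAME", "root");
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://CentOS:9000");
    FileSystem fileSystem = FileSystem.get(conf);   // subsequent writes run as user "root"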

Option 3: run chmod to open up the directory's read/write permissions

    [root@CentOS ~]# hdfs dfs -chmod -R 777 /
  • Java API examples
    // A minimal JUnit test class wrapping the examples (the class name HdfsApiTest is illustrative).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.fs.Trash;
    import org.apache.hadoop.io.IOUtils;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import static org.junit.Assert.assertNotNull;

    public class HdfsApiTest {
        private FileSystem fileSystem;
        private Configuration conf;

        @Before
        public void before() throws IOException {
            conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://CentOS:9000");
            conf.set("dfs.replication", "1");
            conf.set("fs.trash.interval", "1");
            fileSystem = FileSystem.get(conf);
            assertNotNull(fileSystem);
        }

        @Test
        public void testUpload01() throws IOException {
            // Upload by streaming a local file into an HDFS output stream
            InputStream is = new FileInputStream("C:\\Users\\HIAPAD\\Desktop\\买家须知.txt");
            Path path = new Path("/bb.txt");
            OutputStream os = fileSystem.create(path);
            IOUtils.copyBytes(is, os, 1024, true);
            /*byte[] bytes = new byte[1024];
            while (true) {
                int n = is.read(bytes);
                if (n == -1) break;
                os.write(bytes, 0, n);
            }
            os.close();
            is.close();*/
        }

        @Test
        public void testUpload02() throws IOException {
            // Upload using the convenience copy method
            Path src = new Path("file:///C:\\Users\\HIAPAD\\Desktop\\买家须知.txt");
            Path dist = new Path("/dd.txt");
            fileSystem.copyFromLocalFile(src, dist);
        }

        @Test
        public void testDownload01() throws IOException {
            Path dist = new Path("file:///C:\\Users\\HIAPAD\\Desktop\\11.txt");
            Path src = new Path("/dd.txt");
            fileSystem.copyToLocalFile(src, dist);
            //fileSystem.copyToLocalFile(false, src, dist, true);
        }

        @Test
        public void testDownload02() throws IOException {
            // Download by streaming an HDFS input stream into a local file
            OutputStream os = new FileOutputStream("C:\\Users\\HIAPAD\\Desktop\\22.txt");
            InputStream is = fileSystem.open(new Path("/dd.txt"));
            IOUtils.copyBytes(is, os, 1024, true);
        }

        @Test
        public void testDelete() throws IOException {
            Path path = new Path("/dd.txt");
            fileSystem.delete(path, true);   // true = delete recursively
        }

        @Test
        public void testListFiles() throws IOException {
            Path path = new Path("/");
            RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(path, true);
            while (files.hasNext()) {
                LocatedFileStatus file = files.next();
                System.out.println(file.getPath() + " " + file.getLen() + " " + file.isFile());
            }
        }

        @Test
        public void testDeleteWithTrash() throws IOException {
            // Moves the file into the current user's .Trash directory instead of deleting it
            Trash trash = new Trash(conf);
            trash.moveToTrash(new Path("/aa.txt"));
        }

        @After
        public void after() throws IOException {
            fileSystem.close();
        }
    }

Hadoop MapReduce

MapReduce is a programming model for parallel computation over large data sets. Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages (which are well suited to shipping functions across a network), together with features borrowed from vector programming languages.

Hadoop's MapReduce framework makes full use of the memory, CPU, network, and a small amount of disk on the physical hosts that also act as storage nodes to perform distributed computation over large data sets. The framework normally starts a NodeManager service on every physical host that runs a DataNode; the NodeManager manages the compute resources of the node it runs on. In addition, the system starts a single ResourceManager that coordinates resource scheduling for the whole computation.

The core idea of MapReduce is to split one large computation into many small tasks. Each small task runs independently and produces a result, which is usually stored locally. When the first batch of tasks finishes, the system starts a second batch whose job is to pull the intermediate results of the first batch over the network, gather them locally, and perform the final aggregation.

In other words, when MapReduce is used for large-scale analysis, the system first partitions the input data; each partition is called an input split (in essence, a logical range mapping over the target data). At run time, the number of splits determines the parallelism of the Map phase, so the Map phase performs the local part of the computation, and one Map task corresponds to one unit of compute resource. For example, a 300 MB input with a 128 MB split size yields three splits and therefore three Map tasks. Once all Map tasks have finished the local computation over their own ranges, they store their results on local disk. The system then starts a number of Reduce tasks, according to the configured reduce parallelism, to aggregate the Map outputs and write the final result to HDFS, MySQL, a NoSQL store, and so on.

MapReduce 2 Architecture

ResourceManager: coordinates and manages the cluster's compute resources.

NodeManager: launches compute resources (Containers) such as MRAppMaster and YarnChild; the NodeManager also connects to the ResourceManager and reports its own resource usage.

MRAppMaster: the per-application master, responsible for monitoring tasks and handling failover during the computation; there is exactly one per job.

YarnChild: the collective name for MapTask and ReduceTask processes; each one is a compute process.

YARN architecture diagram

Setting Up the MR Runtime Environment

  • etc/hadoop/mapred-site.xml

[root@CentOS ~]# cp /usr/hadoop-2.6.0/etc/hadoop/mapred-site.xml.template /usr/hadoop-2.6.0/etc/hadoop/mapred-site.xml

    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>
  • etc/hadoop/yarn-site.xml
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>CentOS</value>
    </property>
  • Start YARN
    [root@CentOS ~]# start-yarn.sh 
    [root@CentOS ~]# jps
    5250 Jps
    4962 ResourceManager
    5043 NodeManager
    3075 DataNode
    3219 SecondaryNameNode
    2959 NameNode

MR Hello World

INFO 192.168.0.1 wx com.baizhi.service.IUserService#login
INFO 192.168.0.4 wx com.baizhi.service.IUserService#login
INFO 192.168.0.2 wx com.baizhi.service.IUserService#login
INFO 192.168.0.3 wx com.baizhi.service.IUserService#login
INFO 192.168.0.1 wx com.baizhi.service.IUserService#login

SQL

    create table t_access(
        level varchar(32),
        ip  varchar(64),
        app varchar(64),
        server varchar(128)
    )
    select ip,sum(1) from t_access group by ip;
        reduce(ip,[1,1,1,..])      map(ip,1)


  • Mapper logic
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import java.io.IOException;

    public class AccessMapper extends Mapper<LongWritable,Text,Text,IntWritable> {
        /**
         * INFO 192.168.0.1 wx com.baizhi.service.IUserService#login
         * @param key : byte offset of the line within the file
         * @param value : one line of text
         * @param context
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] tokens = value.toString().split(" ");
            String ip = tokens[1];                        // the second token is the client IP
            context.write(new Text(ip), new IntWritable(1));
        }
    }
  • Reducer logic
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class AccessReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int total=0;
            for (IntWritable value : values) {
                total+=value.get();
            }
            context.write(key,new IntWritable(total));
        }
    }
  • Job submission
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class CustomJobSubmitter extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            // 1. Create the job
            Configuration conf = getConf();
            Job job = Job.getInstance(conf);

            // 2. Set the input/output format classes
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // 3. Set the input and output paths
            Path src = new Path("/demo/access");
            TextInputFormat.addInputPath(job, src);
            Path dst = new Path("/demo/res");  // the output directory must not already exist
            TextOutputFormat.setOutputPath(job, dst);

            // 4. Set the Mapper and Reducer classes
            job.setMapperClass(AccessMapper.class);
            job.setReducerClass(AccessReducer.class);

            // 5. Set the Mapper and Reducer output key/value types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // 6. Submit the job and wait for it to finish
            job.waitForCompletion(true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            ToolRunner.run(new CustomJobSubmitter(), args);
        }
    }
  • Ways to submit and run the job
    • Deploy as a jar
       job.setJarByClass(CustomJobSubmitter.class);

    [root@CentOS ~]# hadoop jar mr01-1.0-SNAPSHOT.jar com.baizhi.mr.CustomJobSubmitter

    • Local simulation
      Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:557)
      Copy the NativeIO source file into your project so that it overrides the class shipped in the jar, and change the method at line 557 as follows:
         public static boolean access(String path, AccessRight desiredAccess)
                        throws IOException {
             return true;
         }
    • Cross-platform submission
      • Copy core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml into the project's resources directory
      • Enable cross-platform submission in mapred-site.xml
    
        <property>
            <name>mapreduce.app-submission.cross-platform</name>
            <value>true</value>
        </property>
  • Add the following snippet to the job submission code
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");
        conf.addResource("yarn-site.xml");
        conf.set("mapreduce.job.jar","file:///xxxx.jar");

When submitting from a local machine, the job's resources have to be copied to HDFS, so you may see the following exception:

    Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=HIAPAD, access=EXECUTE, inode="/tmp":root:supergroup:drwx------
    	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
    	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
    	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:208)
    	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:171)
To resolve this exception, refer to the earlier section on disabling HDFS permission checking.

InputFormat/OutputFormat

  • InputFormat
    • FileInputFormat
      • TextInputFormat
        key/value: the key is the byte offset of the line, the value is one line of text
        Split rule: computed per file, cut by splitSize
      • NLineInputFormat
        key/value: the key is the byte offset of the line, the value is one line of text
        Split rule: computed per file, every n lines form one split
        mapreduce.input.lineinputformat.linespermap=1000 controls how many lines go into one split (see the configuration sketch after this list)
      • KeyValueTextInputFormat
        key/value: each line is split into key and value on \t (by default)
        Split rule: computed per file, cut by splitSize
        mapreduce.input.keyvaluelinerecordreader.key.value.separator=\t
      • SequenceFileInputFormat
      • CombineTextInputFormat
        key/value: the key is the byte offset of the line, the value is one line of text
        Split rule: cut by splitSize across files
        Useful for computing over many small files and optimizing MR jobs; all the small files must share the same format
    • MultipleInputs
      Handles inputs in different formats; the key/value types emitted by every Mapper must be the same.
    • DBInputFormat
    • TableInputFormat (third-party)
  • OutputFormat
    • FileOutputFormat
      • TextOutputFormat
      • SequenceFileOutputFormat
      • LazyOutputFormat
    • MultipleOutputs
    • DBOutputFormat
    • TableOutputFormat (third-party)
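
A minimal sketch of wiring the input formats and split-size properties listed above into a job (the choice of NLineInputFormat/KeyValueTextInputFormat and the concrete values are illustrative assumptions):

    // Inside CustomJobSubmitter.run(), after the Job has been created:
    Job job = Job.getInstance(getConf());

    // NLineInputFormat: every 1000 lines of the input form one split (and thus one map task).
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.NLineInputFormat.class);
    job.getConfiguration().set("mapreduce.input.lineinputformat.linespermap", "1000");

    // Alternatively, KeyValueTextInputFormat: split each line into key/value on a separator
    // (tab by default; "," here is just an example value).
    // job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
    // job.getConfiguration().set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");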

MapReduce Shuffle

  • Understand the MR job submission flow in the source code
  • Resolve jar dependencies during MR job execution
    JobSubmitter(DFS|YARNRunner)
        submitJobInternal(Job.this, cluster);
        	checkSpecs(job);  # check the output specification (the output directory must not already exist)
        	JobID jobId = submitClient.getNewJobID();  # obtain the job id
        	# build the staging directory for job resources
        	Path submitJobDir = new Path(jobStagingArea, jobId.toString());
        	# upload the job jar, third-party jars, and configuration files
        	copyAndConfigureFiles(job, submitJobDir);
        	int maps = writeSplits(job, submitJobDir);  # compute input splits
        	writeConf(conf, submitJobFile);  # upload the job context information (job.xml)
        	status = submitClient.submitJob(  # submit the job to the ResourceManager
              jobId, submitJobDir.toString(), job.getCredentials());

Resolving jar dependencies at program run time

  • Submission-time dependencies: configure HADOOP_CLASSPATH
    HADOOP_CLASSPATH=/root/mysql-connector-java-xxxx.jar
    HADOOP_HOME=/usr/hadoop-2.6.0
    JAVA_HOME=/usr/java/latest
    PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    CLASSPATH=.
    export JAVA_HOME
    export PATH
    export CLASSPATH
    export HADOOP_HOME
    export HADOOP_CLASSPATH

This is typically needed early in job submission, for example when the client must connect to a third-party database in order to compute the input splits.

  • Compute-time dependencies
    • Can be set programmatically
      conf.set("tmpjars", "file://<jar path>,...");
    • If the job is launched with hadoop jar
      hadoop jar xxx.jar <main class> -libjars <jar path>,...
  • Map-side shuffle and reduce-side shuffle
  • Mapper only, no Reducer (data cleaning / noise reduction)

Extracting substrings with a regular expression

112.116.25.18 - [11/Aug/2017:09:57:24 +0800] "POST /json/startheartbeatactivity.action HTTP/1.1" 200 234 "http://wiki.wang-inc.com/pages/resumedraft.action?draftId=12058756&draftShareId=014b0741-df00-4fdc-90ca-4bf934f914d1" Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 - 0.023 0.023 12.129.120.121:8090 200
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String regex = "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s\\[(.*)\\]\\s\".*\"\\s(\\d+)\\s(\\d+)\\s.*";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);   // input holds one log line, e.g. the sample above
    if (matcher.matches()) {
        String value = matcher.group(1);        // the content captured by the first ()
    }
  • Custom WritableComparable
    • The Map output key type must implement WritableComparable (the key takes part in sorting); see the sketch after this list
    • The Map output value type only needs to implement Writable
      What should you watch out for in the Reduce output key/value types?
  • How to control the MR partitioning strategy
    public class CustomPartitioner<K, V> extends Partitioner<K, V> {
    
      /** Use {@link Object#hashCode()} to partition. */
      public int getPartition(K key, V value,
                              int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    
    }
    job.setPartitionerClass(CustomPartitioner.class);
    // or
    conf.setClass("mapreduce.job.partitioner.class",CustomPartitioner.class,Partitioner.class);
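
As referenced in the list above, a minimal sketch of a custom Map output key type (the class name IpKey is purely illustrative): it implements WritableComparable so the framework can serialize it and sort it during the shuffle, and it keeps hashCode/equals consistent so HashPartitioner distributes it sensibly.

    import org.apache.hadoop.io.WritableComparable;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    public class IpKey implements WritableComparable<IpKey> {
        private String ip = "";

        public IpKey() { }                          // no-arg constructor required for deserialization
        public IpKey(String ip) { this.ip = ip; }

        public void write(DataOutput out) throws IOException {    // serialization
            out.writeUTF(ip);
        }

        public void readFields(DataInput in) throws IOException { // deserialization
            ip = in.readUTF();
        }

        public int compareTo(IpKey other) {         // used by the sort/merge phase of the shuffle
            return ip.compareTo(other.ip);
        }

        @Override
        public int hashCode() { return ip.hashCode(); }   // keeps HashPartitioner consistent

        @Override
        public boolean equals(Object o) {
            return o instanceof IpKey && ip.equals(((IpKey) o).ip);
        }

        @Override
        public String toString() { return ip; }     // what TextOutputFormat writes for the key
    }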

Why does data skew occur in MR computations?

Because a poorly chosen key leads to an uneven distribution of the data. Choose an appropriate key as the basis for aggregation so that the data is spread evenly across the partitions. In general this requires the developer to have some prior understanding of the data being analyzed.

  • Enable map-output compression (can only be tested in a real cluster environment)

This can noticeably reduce the network bandwidth consumed by the reduce-side shuffle, at the cost of extra CPU spent compressing and decompressing data during the computation.

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
GzipCodec.class, CompressionCodec.class);

Local simulation may throw a "not a gzip file" error, so it is recommended to test this in a cluster environment.

  • combiner

job.setCombinerClass(CombinerReducer.class);

CombinerReducer here is simply a class that extends Reducer. The combiner generally runs during the spill phase and again when spill files are merged.
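
For the IP-count job above, the existing AccessReducer could in principle double as the combiner, since its input and output types are both Text/IntWritable and partial sums can safely be summed again; treat this as a sketch of that assumption rather than a required setup:

    // A combiner must consume the map output types and emit those same types,
    // and its operation must tolerate being applied more than once (e.g. partial sums).
    job.setCombinerClass(AccessReducer.class);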

HDFS | YARN HA

Environment

  • CentOS-6.5 64 bit
  • jdk-8u171-linux-x64.rpm
  • hadoop-2.6.0.tar.gz
  • zookeeper-3.4.6.tar.gz

Install the CentOS hosts (physical nodes)

CentOSA: 192.168.29.129
CentOSB: 192.168.29.130
CentOSC: 192.168.29.131

Basic configuration

  • Hostname-to-IP mapping
    [root@CentOSX ~]# clear
    [root@CentOSX ~]# vi /etc/hosts
    
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    192.168.128.133 CentOSA
    192.168.128.134 CentOSB
    192.168.128.135 CentOSC      
  • Turn off the firewall
    [root@CentOSX ~]# service iptables stop
    iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
    iptables: Flushing firewall rules:                         [  OK  ]
    iptables: Unloading modules:                               [  OK  ]
    [root@CentOSX ~]# chkconfig iptables off
  • Passwordless SSH login

    [root@CentOSX ~]# ssh-keygen -t rsa
    [root@CentOSX ~]# ssh-copy-id CentOSA
    [root@CentOSX ~]# ssh-copy-id CentOSB
    [root@CentOSX ~]# ssh-copy-id CentOSC
  • Synchronize the clocks on all physical nodes
    [root@CentOSA ~]# date -s '2018-09-16 11:28:00'
    Sun Sep 16 11:28:00 CST 2018
    [root@CentOSA ~]# clock -w
    [root@CentOSA ~]# date 
    Sun Sep 16 11:28:13 CST 2018

Install the JDK and configure JAVA_HOME

    [root@CentOSX ~]# rpm -ivh jdk-8u171-linux-x64.rpm 
    [root@CentOSX ~]# vi .bashrc 
    
    JAVA_HOME=/usr/java/latest
    CLASSPATH=.
    PATH=$PATH:$JAVA_HOME/bin
    export JAVA_HOME
    export CLASSPATH
    export PATH
    [root@CentOSX ~]# source .bashrc 

Install and start ZooKeeper

    [root@CentOSX ~]# tar -zxf zookeeper-3.4.6.tar.gz -C /usr/
    [root@CentOSX ~]# vi /usr/zookeeper-3.4.6/conf/zoo.cfg
    tickTime=2000
    dataDir=/root/zkdata
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=CentOSA:2887:3887
    server.2=CentOSB:2887:3887
    server.3=CentOSC:2887:3887
    [root@CentOSX ~]# mkdir /root/zkdata
    [root@CentOSA ~]# echo 1 >> zkdata/myid
    [root@CentOSB ~]# echo 2 >> zkdata/myid
    [root@CentOSC ~]# echo 3 >> zkdata/myid
    [root@CentOSX zookeeper-3.4.6]# ./bin/zkServer.sh start zoo.cfg
    JMX enabled by default
    Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED
    [root@CentOSX zookeeper-3.4.6]# ./bin/zkServer.sh status zoo.cfg

Install and Configure Hadoop

    [root@CentOSX ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/
    [root@CentOSX ~]# vi .bashrc 
    
    HADOOP_HOME=/usr/hadoop-2.6.0
    JAVA_HOME=/usr/java/latest
    CLASSPATH=.
    PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export JAVA_HOME
    export CLASSPATH
    export PATH
    export HADOOP_HOME
    [root@CentOSX ~]# source .bashrc 
  • core-site.xml
    <property>		
          <name>fs.defaultFS</name>		
          <value>hdfs://mycluster</value>	
    </property>
    <property>		
         <name>hadoop.tmp.dir</name>		
         <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>    
    </property>
    <property>		
         <name>fs.trash.interval</name>		
         <value>30</value>    
    </property>
    <property>		
         <name>net.topology.script.file.name</name>		
         <value>/usr/hadoop-2.6.0/etc/hadoop/rack.sh</value>    
    </property>

Create the rack script; it maps an IP address to the physical location (rack) of the machine.

    [root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/rack.sh
    while [ $# -gt 0 ] ; do
    	  nodeArg=$1
    	  exec</usr/hadoop-2.6.0/etc/hadoop/topology.data
    	  result="" 
    	  while read line ; do
    		ar=( $line ) 
    		if [ "${ar[0]}" = "$nodeArg" ] ; then
    		  result="${ar[1]}"
    		fi
    	  done 
    	  shift 
    	  if [ -z "$result" ] ; then
    		echo -n "/default-rack"
    	  else
    		echo -n "$result "
    	  fi
    done
   [root@CentOSX ~]# chmod u+x /usr/hadoop-2.6.0/etc/hadoop/rack.sh
   [root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/topology.data
   
   192.168.128.133 /rack1
   192.168.128.134 /rack1
   192.168.128.135 /rack2
   
   [root@CentOSX ~]# /usr/hadoop-2.6.0/etc/hadoop/rack.sh 192.168.128.133
   /rack1 
  • hdfs-site.xml
    <property>
    	<name>dfs.replication</name>
    	<value>3</value>
    </property> 
    <property>
    	<name>dfs.ha.automatic-failover.enabled</name>
    	<value>true</value>
    </property>
    <property>   
    	<name>ha.zookeeper.quorum</name>
    	<value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value> 
    </property>
    <property>
    	<name>dfs.nameservices</name>
    	<value>mycluster</value>
    </property>
    <property>
    	<name>dfs.ha.namenodes.mycluster</name>
    	<value>nn1,nn2</value>
    </property>
    <property>
    	<name>dfs.namenode.rpc-address.mycluster.nn1</name>
    	<value>CentOSA:9000</value>
       </property>
    <property>
    	 <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    	 <value>CentOSB:9000</value>
    </property>
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://CentOSA:8485;CentOSB:8485;CentOSC:8485/mycluster</value>
    </property>
    <property>
    	<name>dfs.client.failover.proxy.provider.mycluster</name>
    	<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
         <name>dfs.ha.fencing.methods</name>
         <value>sshfence</value>
    </property>
    <property>
         <name>dfs.ha.fencing.ssh.private-key-files</name>
         <value>/root/.ssh/id_rsa</value>
    </property>
  • slaves
    [root@CentOSX ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves
    
    CentOSA
    CentOSB
    CentOSC

Starting HDFS

    [root@CentOSX ~]# hadoop-daemon.sh start journalnode   // wait about 10 seconds before the next step
    [root@CentOSA ~]# hdfs namenode -format
    [root@CentOSA ~]# hadoop-daemon.sh start namenode
    [root@CentOSB ~]# hdfs namenode -bootstrapStandby   (copies the active NameNode's metadata)
    [root@CentOSB ~]# hadoop-daemon.sh start namenode
    [root@CentOSA|B ~]# hdfs zkfc -formatZK   (registers the NameNode information in ZooKeeper; run on either CentOSA or CentOSB)
    [root@CentOSA ~]# hadoop-daemon.sh start zkfc   (failover controller)
    [root@CentOSB ~]# hadoop-daemon.sh start zkfc   (failover controller)
    [root@CentOSX ~]# hadoop-daemon.sh start datanode

View the rack topology

    [root@CentOSA ~]# hdfs dfsadmin -printTopology
    Rack: /rack1
       192.168.29.129:50010 (CentOSA)
       192.168.29.130:50010 (CentOSB)
    
    Rack: /rack2
       192.168.29.131:50010 (CentOSC)

Starting and stopping the cluster

[root@CentOSA ~]# start-dfs.sh | stop-dfs.sh   # either command can be run from any node

If the NameNode fails to start during a restart because the JournalNodes initialize too slowly, run hadoop-daemon.sh start namenode on the NameNode host that failed to start.

Building the YARN Cluster

  • Edit mapred-site.xml
    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>
  • Edit yarn-site.xml
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.cluster-id</name>
      <value>cluster1</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>CentOSB</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>CentOSC</value>
    </property>
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value>
    </property>
  • Start YARN
    [root@CentOSB ~]# yarn-daemon.sh start resourcemanager
    [root@CentOSC ~]# yarn-daemon.sh start resourcemanager
    [root@CentOSX ~]# yarn-daemon.sh start nodemanager

Check the ResourceManager HA state

    [root@CentOSA ~]# yarn rmadmin -getServiceState rm1
    active
    [root@CentOSA ~]# yarn rmadmin -getServiceState rm2
    standby