Getting Started with Big Data via Hadoop

I. Hadoop Overview

1. What is Hadoop?

  • The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. It allows large data sets to be processed across clusters of machines using simple programming models, and it scales from a single server to thousands of machines, each providing local computation and storage. Hadoop treats node failure in the cluster as the norm: it detects and handles failed nodes automatically, so it does not rely on hardware to deliver high availability.
  • Simply put, we can use Hadoop to store massive amounts of data in a distributed way, and then run distributed computations over that data according to our own business needs. For example:
    • a line chart of taobao.com's user traffic over the past 24 hours, broken down by region, time slot, and device type
    • China
  • Modules included in the Apache™ Hadoop® project
    • Hadoop Common: the common utilities that support the other Hadoop modules
    • Hadoop Distributed File System (HDFS™): a high-throughput distributed file system
    • Hadoop YARN: a platform/framework for job scheduling and cluster resource management
    • Hadoop MapReduce: a YARN-based model for parallel computation over big data
  • Other related projects
    • Ambari
    • HBase
    • Hive
    • Spark
    • ZooKeeper

2. How Hadoop Came About

  • Doug Cutting is the founder of Lucene, Nutch, Hadoop, and other projects.
  • Hadoop originally grew out of Nutch. Nutch was designed to build a large-scale, whole-web search engine, covering web crawling, indexing, and querying. As the number of crawled pages grew, it ran into a serious scalability problem: how to store and index billions of web pages.
  • Two papers published by Google in 2003 and 2004 offered a feasible solution to this problem:
    • the Google File System (GFS), a distributed file system suited to storing massive numbers of web pages
    • MapReduce, a distributed computing framework suited to building indexes over those pages
  • Between 2003 and 2006, Google published three highly influential papers: GFS at SOSP 2003, MapReduce at OSDI 2004, and BigTable at OSDI 2006. SOSP and OSDI are both top-tier conferences in the operating systems field, rated class A in the CCF recommended-conference list. SOSP is held in odd years and OSDI in even years.
  • The Nutch developers built the corresponding open-source implementations, HDFS and MapReduce, which were then split out of Nutch into the independent Hadoop project. In 2008 Hadoop became an Apache top-level project and has developed rapidly ever since.
  • By now Hadoop has grown into a complete ecosystem.

3. Hadoop Architecture

  • 1. Distributed architecture in brief

    • Problems with a single machine

      • limited storage capacity
      • limited computing power
      • a single point of failure
      • ...
    • A distributed architecture solves these single-machine problems

    • The classic distributed master/slave architecture

      • The Master (the "boss") is responsible for management; there can be more than one, to guard against single-node failure
      • The Slaves (the "workers") do the actual work; there are many of them, and they can be added and removed dynamically
    2. Hadoop architecture
    • Hadoop 2.0
      • HDFS: NameNode (master), DataNode (slaves)
      • YARN: ResourceManager (master), NodeManager (slaves)
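
    • Once a cluster like the one configured later in this article is running, this master/slave split is easy to see with jps. A rough, illustrative sketch (hostnames follow this article's setup; the PIDs and the exact daemon list depend on what you start and where):

      # on the master node (hadoop01): the HDFS and YARN master daemons
      [root@hadoop01 ~]# jps
      2481 NameNode
      2743 ResourceManager
      3062 Jps

      # on a slave node (hadoop02): the worker daemons
      [root@hadoop02 ~]# jps
      1894 DataNode
      2011 NodeManager
      2210 Jps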

4. The Hadoop Ecosystem

  • The Hadoop ecosystem:
      • Sqoop (pronounced like "scoop") is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and export data from HDFS back into a relational database.
      • Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari already supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.
      • ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
      • Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses.
      • Pig is a platform for analyzing large data sets on Hadoop. It provides an SQL-like language called Pig Latin, whose compiler turns SQL-style analysis requests into a series of optimized MapReduce jobs. Pig offers a simple operational and programming interface for complex, massive-scale parallel computation.
      • Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine-learning algorithms, aimed at helping developers build intelligent applications more easily and quickly. Mahout includes implementations of clustering, classification, recommendation filtering, and frequent-itemset mining, and by building on the Apache Hadoop libraries it can scale effectively into the cloud.
      • HBase is a subproject of the Apache Hadoop project. Unlike typical relational databases, HBase is a database suited to storing unstructured data; another difference is that HBase uses a column-oriented rather than row-oriented model.
    • Big data itself is a very broad concept, and the Hadoop ecosystem (or wider ecosystem) exists essentially to process data beyond the scale of a single machine. You can compare it to the set of tools a kitchen needs: pots, bowls, and pans each have their own use, and their uses overlap. You can eat and drink soup straight from a stock pot, and you can peel with either a small knife or a peeler. But each tool has its own characteristics, and while odd combinations can work, they are rarely the best choice.
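
    • As a concrete illustration of the Hive point above: an SQL-like statement is compiled into MapReduce behind the scenes. A minimal sketch, assuming Hive is installed on the cluster and using a hypothetical page_views table (the table and columns are made up for illustration):

      $ hive -e "CREATE TABLE IF NOT EXISTS page_views (url STRING, region STRING);"
      $ hive -e "SELECT region, COUNT(*) FROM page_views GROUP BY region;"
      # the GROUP BY query is translated into a MapReduce job and submitted to the cluster,
      # so no hand-written mapper or reducer code is needed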

 

Setting Up a Hadoop Environment

Preparing the Environment

  • Software to prepare
    1. CentOS-7-x86_64-Minimal-1611.iso
    2. jdk-8u121-linux-x64.tar
    3. hadoop-2.7.3.tar
    4. VirtualBox
  • Skills to prepare

    1. Creating VMs in VirtualBox; network configuration: bridged, host-only, ...
    2. Common Linux commands: cd, mv, cp, scp, vi, cat, ...
    3. Linux NIC configuration, using CentOS 7 as an example: edit the NIC config with vi /etc/sysconfig/network-scripts/ifcfg-enp0s3 ...

       
      TYPE=Ethernet
      BOOTPROTO=static
      IPADDR=192.168.254.222
      NETMASK=255.255.255.0
      GATEWAY=192.168.254.1
      DEFROUTE=yes
      PEERDNS=yes
      PEERROUTES=yes
      IPV4_FAILURE_FATAL=no
      IPV6INIT=no
      IPV6_AUTOCONF=no
      IPV6_DEFROUTE=no
      IPV6_PEERDNS=no
      IPV6_PEERROUTES=no
      IPV6_FAILURE_FATAL=no
      NAME=enp0s3
      UUID=76a635c2-9600-437b-8cfb-57e9569f68da
      DEVICE=enp0s3
      ONBOOT=yes
      DNS1=114.114.114.114
      
    4. Linux service management, using CentOS 7 as an example (a quick way to verify the result is shown below):

      • restart the network service: systemctl restart network
      • stop the firewall: systemctl stop firewalld
      • disable the firewall at boot: systemctl disable firewalld
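
      • For example, a quick sanity check after applying the changes above (standard CentOS 7 commands; enp0s3 and the IP follow the sample NIC config above):

        $ systemctl restart network          # re-read the static IP configured above
        $ ip addr show enp0s3                # confirm 192.168.254.222 is assigned
        $ systemctl is-active firewalld      # expect "inactive" after stopping it
        $ systemctl is-enabled firewalld     # expect "disabled" after disabling it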
    5. SSH and passwordless SSH login

      • ssh-keygen, then press Enter through the prompts: generates a key pair on the host hadoop01 (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub)
      • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys : lets hadoop01 log in to itself without a password
      • appending id_rsa.pub to ~/.ssh/authorized_keys on hadoop02 enables passwordless login to hadoop02; the shortcut command is ssh-copy-id hadoop02 (the full sequence is sketched below)
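
      • Putting the above together, a minimal sketch run on hadoop01 (assuming the three hosts used in this article and the root user throughout):

        $ ssh-keygen -t rsa              # accept the defaults; creates ~/.ssh/id_rsa and id_rsa.pub
        $ ssh-copy-id hadoop01           # let hadoop01 ssh into itself without a password
        $ ssh-copy-id hadoop02           # append the public key to hadoop02's authorized_keys
        $ ssh-copy-id hadoop03
        $ ssh hadoop02 hostname          # should print "hadoop02" without prompting for a password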
  • The virtual machine environment used in this article

    1. Three CentOS 7 minimal VMs: hadoop01, hadoop02, hadoop03
    2. The /etc/hosts file has already been configured identically on all three machines
    3. The host machine's hosts file also maps the three VMs' hostnames (a sample is sketched below)
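
    • For reference, a plausible /etc/hosts for this setup, assuming the other two VMs sit in the same subnet as the sample NIC config above (the .223/.224 addresses are made up; use your actual ones):

      # appended to /etc/hosts on hadoop01, hadoop02, hadoop03 and the host machine
      192.168.254.222   hadoop01
      192.168.254.223   hadoop02
      192.168.254.224   hadoop03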
Single-Node Installation and Configuration

  • Standalone mode

    • What is standalone mode? No daemons run in the background and HDFS is not used; a process is started to simulate the MapReduce computation only when you run a MapReduce job with the hadoop jar command. It is generally used for testing and debugging programs.
    • Example:
    $ cp etc/hadoop/*.xml input
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
    $ cat output/* 
    
  • Pseudo-distributed mode

    • What is pseudo-distributed mode? The Hadoop daemons each run in their own process. In a fully distributed deployment, the NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode, and other daemons run on different machines in the cluster; here they all run on a single machine, separated only into different Java processes, hence "pseudo-distributed".
    • Edit the configuration files

      1. hadoop-env.sh, line 25: set JAVA_HOME
      1 # Licensed to the Apache Software Foundation (ASF) under one
      2 # or more contributor license agreements.  See the NOTICE file
      3 # distributed with this work for additional information
      4 # regarding copyright ownership.  The ASF licenses this file
      5 # to you under the Apache License, Version 2.0 (the
      6 # "License"); you may not use this file except in compliance
      7 # with the License.  You may obtain a copy of the License at
      8 #
      9 #     http://www.apache.org/licenses/LICENSE-2.0
      10 #
      11 # Unless required by applicable law or agreed to in writing, software
      12 # distributed under the License is distributed on an "AS IS" BASIS,
      13 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      14 # See the License for the specific language governing permissions and
      15 # limitations under the License.
      16
      17 # Set Hadoop-specific environment variables here.
      18
      19 # The only required environment variable is JAVA_HOME.  All others are
      20 # optional.  When running a distributed configuration it is best to
      21 # set JAVA_HOME in this file, so that it is correctly defined on
      22 # remote nodes.
      23
      24 # The java implementation to use.
      25 export JAVA_HOME=/root/jdk1.8
      26
      27 # The jsvc implementation to use. Jsvc is required to run secure datanodes
      28 # that bind to privileged ports to provide authentication of data transfer
      
      2. core-site.xml, lines 20-23: specify which machine in the cluster the NameNode runs on
      1 <?xml version="1.0" encoding="UTF-8"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20          <property>
      21                 <name>fs.defaultFS</name>
      22                 <value>hdfs://hadoop01:9000</value>
      23         </property>
      24 </configuration>
      
      3. hdfs-site.xml: set the replication factor and the NameNode and DataNode storage directories
      1 <?xml version="1.0" encoding="UTF-8"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20         <property>
      21                 <name>dfs.replication</name>
      22                 <value>1</value>
      23         </property>
      24         <property>
      25                 <name>dfs.name.dir</name>
      26                 <value>/usr/local/hadoop/hdfs/name</value>
      27         </property>
      28
      29 <property>
      30                 <name>dfs.data.dir</name>
      31                 <value>/usr/local/hadoop/hdfs/data</value>
      32         </property>
      33 <property>
      34   <name>dfs.namenode.secondary.http-address</name>
      35   <value>hadoop02:9001</value>
      36 </property>
      37 </configuration>
      
      4. mapred-site.xml: set YARN as the resource scheduling framework for MapReduce
      1 <?xml version="1.0"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20 <property>
      21         <name>mapreduce.framework.name</name>
      22         <value>yarn</value>
      23     </property>
      24 </configuration>
      
      5. yarn-site.xml: configure the NodeManager auxiliary service for the MapReduce shuffle
       1 <?xml version="1.0"?>
      2 <!--
      3   Licensed under the Apache License, Version 2.0 (the "License");
      4   you may not use this file except in compliance with the License.
      5   You may obtain a copy of the License at
      6
      7     http://www.apache.org/licenses/LICENSE-2.0
      8
      9   Unless required by applicable law or agreed to in writing, software
      10   distributed under the License is distributed on an "AS IS" BASIS,
      11   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      12   See the License for the specific language governing permissions and
      13   limitations under the License. See accompanying LICENSE file.
      14 -->
      15 <configuration>
      16
      17 <!-- Site specific YARN configuration properties -->
      18  <property>
      19         <name>yarn.nodemanager.aux-services</name>
      20         <value>mapreduce_shuffle</value>
      21     </property>
      22 </configuration>
      

Cluster Installation and Configuration

  • Fully distributed installation and configuration

    1. hadoop-env.sh: same as in the pseudo-distributed setup
    2. core-site.xml: same as in the pseudo-distributed setup
    3. hdfs-site.xml: same as in the pseudo-distributed setup
    4. mapred-site.xml: same as in the pseudo-distributed setup
    5. yarn-site.xml: needs to be modified
      1 <?xml version="1.0"?>
      2 <!--
      3   Licensed under the Apache License, Version 2.0 (the "License");
      4   you may not use this file except in compliance with the License.
      5   You may obtain a copy of the License at
      6
      7     http://www.apache.org/licenses/LICENSE-2.0
      8
      9   Unless required by applicable law or agreed to in writing, software
     10   distributed under the License is distributed on an "AS IS" BASIS,
     11   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     12   See the License for the specific language governing permissions and
     13   limitations under the License. See accompanying LICENSE file.
     14 -->
     15 <configuration>
     16
     17 <!-- Site specific YARN configuration properties -->
     18    <property>
     19                 <name>yarn.resourcemanager.hostname</name>
     20                 <value>hadoop01</value>
     21         </property>
     22         <property>
     23                 <name>yarn.nodemanager.aux-services</name>
     24                 <value>mapreduce_shuffle</value>
     25         </property>
     26         <property>
     27                 <name>yarn.log-aggregation-enable</name>
     28                 <value>true</value>
     29         </property>
     30         <property>
     31                 <name>yarn.log-aggregation.retain-seconds</name>
     32                 <value>604800</value>
     33         </property>
     34 </configuration>
    
    6. slaves: list the cluster's slave nodes (the configuration then needs to be present on every node, as sketched below)
    hadoop02
    hadoop03
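
    • After editing these files on hadoop01, the same Hadoop directory (and the JDK, if missing) must exist on every node. A minimal sketch, assuming Hadoop was unpacked to /root/hadoop-2.7.3 on hadoop01 (adjust the path to your actual install location):

      # run on hadoop01; the passwordless SSH configured earlier makes this non-interactive
      $ scp -r /root/hadoop-2.7.3 root@hadoop02:/root/
      $ scp -r /root/hadoop-2.7.3 root@hadoop03:/root/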
    
  • Starting and testing

    • Format the HDFS filesystem

       
      $ bin/hdfs namenode -format 
      
    • Start HDFS

       
      $ sbin/start-dfs.sh
      
    • HDFS operations

      • Create directories

         
        $ bin/hdfs dfs -mkdir /user
        $ bin/hdfs dfs -mkdir /user/root
        
      • Upload files

         
        $ bin/hdfs dfs -put etc/hadoop/* input
        
      • Download files

         
        $ bin/hdfs dfs -get input ~/input_from_hdfs
        
      • Delete files

         
        $ bin/hdfs dfs -rm -r input
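
      • To check the effect of each operation, listing the path is usually enough (paths follow the examples above; the exact output will differ):

        $ bin/hdfs dfs -ls /              # shows /user
        $ bin/hdfs dfs -ls                # defaults to /user/root; "input" appears after the upload
        $ bin/hdfs dfs -ls input          # lists the uploaded configuration files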
        
    • Start YARN

       
      $ sbin/start-yarn.sh
      
    • Run MapReduce on YARN

      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -mkdir input
      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -put etc/hadoop/* input
      [root@hadoop01 hadoop-2.7.3]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output1
      17/03/14 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.254.222:8032
      17/03/14 16:56:32 INFO input.FileInputFormat: Total input paths to process : 30
      17/03/14 16:56:32 INFO mapreduce.JobSubmitter: number of splits:30
      17/03/14 16:56:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1489480515993_0004
      17/03/14 16:56:32 INFO impl.YarnClientImpl: Submitted application application_1489480515993_0004
      17/03/14 16:56:32 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1489480515993_0004/
      17/03/14 16:56:32 INFO mapreduce.Job: Running job: job_1489480515993_0004
      17/03/14 16:56:40 INFO mapreduce.Job: Job job_1489480515993_0004 running in uber mode : false
      17/03/14 16:56:40 INFO mapreduce.Job:  map 0% reduce 0%
      17/03/14 16:57:04 INFO mapreduce.Job:  map 17% reduce 0%
      17/03/14 16:57:05 INFO mapreduce.Job:  map 20% reduce 0%
      17/03/14 16:57:11 INFO mapreduce.Job:  map 23% reduce 0%
      17/03/14 16:57:12 INFO mapreduce.Job:  map 37% reduce 0%
      17/03/14 16:57:13 INFO mapreduce.Job:  map 40% reduce 0%
      17/03/14 16:57:14 INFO mapreduce.Job:  map 47% reduce 0%
      17/03/14 16:57:29 INFO mapreduce.Job:  map 67% reduce 0%
      17/03/14 16:57:41 INFO mapreduce.Job:  map 77% reduce 0%
      17/03/14 16:57:44 INFO mapreduce.Job:  map 80% reduce 0%
      17/03/14 16:57:45 INFO mapreduce.Job:  map 97% reduce 0%
      17/03/14 16:57:46 INFO mapreduce.Job:  map 100% reduce 100%
      17/03/14 16:57:47 INFO mapreduce.Job: Job job_1489480515993_0004 completed successfully
      17/03/14 16:57:47 INFO mapreduce.Job: Counters: 50
          File System Counters
              FILE: Number of bytes read=75846
              FILE: Number of bytes written=3829330
              FILE: Number of read operations=0
              FILE: Number of large read operations=0
              FILE: Number of write operations=0
              HDFS: Number of bytes read=82217
              HDFS: Number of bytes written=36886
              HDFS: Number of read operations=93
              HDFS: Number of large read operations=0
              HDFS: Number of write operations=2
          Job Counters
              Killed map tasks=1
              Launched map tasks=31
              Launched reduce tasks=1
              Data-local map tasks=31
              Total time spent by all maps in occupied slots (ms)=762524
              Total time spent by all reduces in occupied slots (ms)=32420
              Total time spent by all map tasks (ms)=762524
              Total time spent by all reduce tasks (ms)=32420
              Total vcore-milliseconds taken by all map tasks=762524
              Total vcore-milliseconds taken by all reduce tasks=32420
              Total megabyte-milliseconds taken by all map tasks=780824576
              Total megabyte-milliseconds taken by all reduce tasks=33198080
          Map-Reduce Framework
              Map input records=2124
              Map output records=8032
              Map output bytes=107076
              Map output materialized bytes=76020
              Input split bytes=3548
              Combine input records=8032
              Combine output records=4044
              Reduce input groups=1586
              Reduce shuffle bytes=76020
              Reduce input records=4044
              Reduce output records=1586
              Spilled Records=8088
              Shuffled Maps =30
              Failed Shuffles=0
              Merged Map outputs=30
              GC time elapsed (ms)=10690
              CPU time spent (ms)=14790
              Physical memory (bytes) snapshot=6156226560
              Virtual memory (bytes) snapshot=64364421120
              Total committed heap usage (bytes)=4090552320
          Shuffle Errors
              BAD_ID=0
              CONNECTION=0
              IO_ERROR=0
              WRONG_LENGTH=0
              WRONG_MAP=0
              WRONG_REDUCE=0
          File Input Format Counters
              Bytes Read=78669
          File Output Format Counters
              Bytes Written=36886
      [root@hadoop01 hadoop-2.7.3]#
      [root@hadoop01 hadoop-2.7.3]#
      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -cat output1/*
      ...
      users   27
      users,wheel".   18
      uses    2
      using   14
      value   45
      value="20"/>    1
      value="30"/>    1
      values  4
      variable    4
      variables   4
      version 1
      version="1.0"   6
      version="1.0">  1
      version="1.0"?> 7
      via 3
      view,   1
      viewing 1
      w/  1
      want    1
      warnings.   1
      when    9
      where   4
      which   7
      while   1
      who 6
      ...
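
      Besides cat-ing the output as above, the result directory can be copied back to the local filesystem, and the job can be inspected in the ResourceManager web UI (the port matches the tracking URL printed in the job log above):

      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -get output1 ~/wordcount_output   # copy results to the local filesystem
      # then open http://hadoop01:8088 in a browser to see the application's status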