Getting Started with Big Data via Hadoop

I. Hadoop Overview

1. What is Hadoop?

  • The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. It allows large data sets to be processed across clusters of machines using simple programming models, and it scales from a single server to thousands of machines, each providing local computation and storage. Hadoop treats node failure in the cluster as the norm: it detects and handles failed nodes automatically, so it does not rely on hardware to deliver high availability.
  • Simply put, we can use Hadoop to store massive amounts of data in a distributed way, and then run distributed computations over that data according to our own business needs. For example:
    • a line chart of taobao.com's user traffic over the past 24 hours, broken down by region, time slot, and device type
    • China
  • Modules included in the Apache™ Hadoop® project
    • Hadoop Common: the common utilities that support the other Hadoop modules
    • Hadoop Distributed File System (HDFS™): a high-throughput distributed file system
    • Hadoop YARN: a platform/framework for job scheduling and cluster resource management
    • Hadoop MapReduce: a YARN-based model for parallel computation over big data
  • Other related projects
    • Ambari
    • HBase
    • Hive
    • Spark
    • ZooKeeper

2. How Hadoop Came About

  • Doug Cutting is the founder of Lucene, Nutch, Hadoop, and other projects.
  • Hadoop originally grew out of Nutch. Nutch was designed to build a large-scale, whole-web search engine, covering web crawling, indexing, and querying. As the number of crawled pages grew, it ran into a serious scalability problem: how to store and index billions of web pages.
  • Two papers published by Google in 2003 and 2004 offered a feasible solution to this problem:
    • the Google File System (GFS), a distributed file system suited to storing massive numbers of web pages
    • MapReduce, a distributed computing framework suited to building indexes over those pages
  • Between 2003 and 2006, Google published three highly influential papers: GFS at SOSP 2003, MapReduce at OSDI 2004, and BigTable at OSDI 2006. SOSP and OSDI are both top-tier conferences in the operating systems field, rated class A in the CCF recommended-conference list. SOSP is held in odd years and OSDI in even years.
  • The Nutch developers built the corresponding open-source implementations, HDFS and MapReduce, which were then split out of Nutch into the independent Hadoop project. In 2008 Hadoop became an Apache top-level project and has developed rapidly ever since.
  • By now Hadoop has grown into a complete ecosystem.

3. Hadoop Architecture

  • 1. Distributed architecture in brief

    • Problems with a single machine

      • limited storage capacity
      • limited computing power
      • a single point of failure
      • ...
    • A distributed architecture solves these single-machine problems

    • The classic distributed master/slave architecture

      • The Master (the "boss") is responsible for management; there can be more than one, to guard against single-node failure
      • The Slaves (the "workers") do the actual work; there are many of them, and they can be added and removed dynamically
    2. Hadoop architecture
    • Hadoop 2.0
      • HDFS: NameNode (master), DataNode (slaves)
      • YARN: ResourceManager (master), NodeManager (slaves)
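
    • Once a cluster like the one configured later in this article is running, this master/slave split is easy to see with jps. A rough, illustrative sketch (hostnames follow this article's setup; the PIDs and the exact daemon list depend on what you start and where):

      # on the master node (hadoop01): the HDFS and YARN master daemons
      [root@hadoop01 ~]# jps
      2481 NameNode
      2743 ResourceManager
      3062 Jps

      # on a slave node (hadoop02): the worker daemons
      [root@hadoop02 ~]# jps
      1894 DataNode
      2011 NodeManager
      2210 Jps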

4. The Hadoop Ecosystem

  • The Hadoop ecosystem:
      • Sqoop (pronounced like "scoop") is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and export data from HDFS back into a relational database.
      • Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari already supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog.
      • ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
      • Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses.
      • Pig is a platform for analyzing large data sets on Hadoop. It provides an SQL-like language called Pig Latin, whose compiler turns SQL-style analysis requests into a series of optimized MapReduce jobs. Pig offers a simple operational and programming interface for complex, massive-scale parallel computation.
      • Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine-learning algorithms, aimed at helping developers build intelligent applications more easily and quickly. Mahout includes implementations of clustering, classification, recommendation filtering, and frequent-itemset mining, and by building on the Apache Hadoop libraries it can scale effectively into the cloud.
      • HBase is a subproject of the Apache Hadoop project. Unlike typical relational databases, HBase is a database suited to storing unstructured data; another difference is that HBase uses a column-oriented rather than row-oriented model.
    • Big data itself is a very broad concept, and the Hadoop ecosystem (or wider ecosystem) exists essentially to process data beyond the scale of a single machine. You can compare it to the set of tools a kitchen needs: pots, bowls, and pans each have their own use, and their uses overlap. You can eat and drink soup straight from a stock pot, and you can peel with either a small knife or a peeler. But each tool has its own characteristics, and while odd combinations can work, they are rarely the best choice.
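
    • As a concrete illustration of the Hive point above: an SQL-like statement is compiled into MapReduce behind the scenes. A minimal sketch, assuming Hive is installed on the cluster and using a hypothetical page_views table (the table and columns are made up for illustration):

      $ hive -e "CREATE TABLE IF NOT EXISTS page_views (url STRING, region STRING);"
      $ hive -e "SELECT region, COUNT(*) FROM page_views GROUP BY region;"
      # the GROUP BY query is translated into a MapReduce job and submitted to the cluster,
      # so no hand-written mapper or reducer code is needed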

 

Setting Up a Hadoop Environment

Preparing the Environment

  • Software to prepare
    1. CentOS-7-x86_64-Minimal-1611.iso
    2. jdk-8u121-linux-x64.tar
    3. hadoop-2.7.3.tar
    4. VirtualBox
  • Skills to prepare

    1. Creating VMs in VirtualBox; network configuration: bridged, host-only, ...
    2. Common Linux commands: cd, mv, cp, scp, vi, cat, ...
    3. Linux NIC configuration, using CentOS 7 as an example: edit the NIC config with vi /etc/sysconfig/network-scripts/ifcfg-enp0s3 ...

       
      TYPE=Ethernet
      BOOTPROTO=static
      IPADDR=192.168.254.222
      NETMASK=255.255.255.0
      GATEWAY=192.168.254.1
      DEFROUTE=yes
      PEERDNS=yes
      PEERROUTES=yes
      IPV4_FAILURE_FATAL=no
      IPV6INIT=no
      IPV6_AUTOCONF=no
      IPV6_DEFROUTE=no
      IPV6_PEERDNS=no
      IPV6_PEERROUTES=no
      IPV6_FAILURE_FATAL=no
      NAME=enp0s3
      UUID=76a635c2-9600-437b-8cfb-57e9569f68da
      DEVICE=enp0s3
      ONBOOT=yes
      DNS1=114.114.114.114
      
    4. Linux service management, using CentOS 7 as an example (a quick way to verify the result is shown below):

      • restart the network service: systemctl restart network
      • stop the firewall: systemctl stop firewalld
      • disable the firewall at boot: systemctl disable firewalld
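
      • For example, a quick sanity check after applying the changes above (standard CentOS 7 commands; enp0s3 and the IP follow the sample NIC config above):

        $ systemctl restart network          # re-read the static IP configured above
        $ ip addr show enp0s3                # confirm 192.168.254.222 is assigned
        $ systemctl is-active firewalld      # expect "inactive" after stopping it
        $ systemctl is-enabled firewalld     # expect "disabled" after disabling it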
    5. SSH and passwordless SSH login

      • ssh-keygen, then press Enter through the prompts: generates a key pair on the host hadoop01 (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub)
      • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys : lets hadoop01 log in to itself without a password
      • appending id_rsa.pub to ~/.ssh/authorized_keys on hadoop02 enables passwordless login to hadoop02; the shortcut command is ssh-copy-id hadoop02 (the full sequence is sketched below)
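
      • Putting the above together, a minimal sketch run on hadoop01 (assuming the three hosts used in this article and the root user throughout):

        $ ssh-keygen -t rsa              # accept the defaults; creates ~/.ssh/id_rsa and id_rsa.pub
        $ ssh-copy-id hadoop01           # let hadoop01 ssh into itself without a password
        $ ssh-copy-id hadoop02           # append the public key to hadoop02's authorized_keys
        $ ssh-copy-id hadoop03
        $ ssh hadoop02 hostname          # should print "hadoop02" without prompting for a password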
  • The virtual machine environment used in this article

    1. Three CentOS 7 minimal VMs: hadoop01, hadoop02, hadoop03
    2. The /etc/hosts file has already been configured identically on all three machines
    3. The host machine's hosts file also maps the three VMs' hostnames (a sample is sketched below)
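
    • For reference, a plausible /etc/hosts for this setup, assuming the other two VMs sit in the same subnet as the sample NIC config above (the .223/.224 addresses are made up; use your actual ones):

      # appended to /etc/hosts on hadoop01, hadoop02, hadoop03 and the host machine
      192.168.254.222   hadoop01
      192.168.254.223   hadoop02
      192.168.254.224   hadoop03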
Single-Node Installation and Configuration

  • Standalone mode

    • What is standalone mode? No daemons run in the background and HDFS is not used; a process is started to simulate the MapReduce computation only when you run a MapReduce job with the hadoop jar command. It is generally used for testing and debugging programs.
    • Example:
    $ cp etc/hadoop/*.xml input
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
    $ cat output/* 
    
  • Pseudo-distributed mode

    • What is pseudo-distributed mode? The Hadoop daemons each run in their own process. In a fully distributed deployment, the NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode, and other daemons run on different machines in the cluster; here they all run on a single machine, separated only into different Java processes, hence "pseudo-distributed".
    • Edit the configuration files

      1. hadoop-env.sh, line 25: set JAVA_HOME
      1 # Licensed to the Apache Software Foundation (ASF) under one
      2 # or more contributor license agreements.  See the NOTICE file
      3 # distributed with this work for additional information
      4 # regarding copyright ownership.  The ASF licenses this file
      5 # to you under the Apache License, Version 2.0 (the
      6 # "License"); you may not use this file except in compliance
      7 # with the License.  You may obtain a copy of the License at
      8 #
      9 #     http://www.apache.org/licenses/LICENSE-2.0
      10 #
      11 # Unless required by applicable law or agreed to in writing, software
      12 # distributed under the License is distributed on an "AS IS" BASIS,
      13 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      14 # See the License for the specific language governing permissions and
      15 # limitations under the License.
      16
      17 # Set Hadoop-specific environment variables here.
      18
      19 # The only required environment variable is JAVA_HOME.  All others are
      20 # optional.  When running a distributed configuration it is best to
      21 # set JAVA_HOME in this file, so that it is correctly defined on
      22 # remote nodes.
      23
      24 # The java implementation to use.
      25 export JAVA_HOME=/root/jdk1.8
      26
      27 # The jsvc implementation to use. Jsvc is required to run secure datanodes
      28 # that bind to privileged ports to provide authentication of data transfer
      
      2. core-site.xml, lines 20-23: specify which machine in the cluster the NameNode runs on
      1 <?xml version="1.0" encoding="UTF-8"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20          <property>
      21                 <name>fs.defaultFS</name>
      22                 <value>hdfs://hadoop01:9000</value>
      23         </property>
      24 </configuration>
      
      3. hdfs-site.xml: set the replication factor and the NameNode and DataNode storage directories
      1 <?xml version="1.0" encoding="UTF-8"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20         <property>
      21                 <name>dfs.replication</name>
      22                 <value>1</value>
      23         </property>
      24         <property>
      25                 <name>dfs.name.dir</name>
      26                 <value>/usr/local/hadoop/hdfs/name</value>
      27         </property>
      28
      29 <property>
      30                 <name>dfs.data.dir</name>
      31                 <value>/usr/local/hadoop/hdfs/data</value>
      32         </property>
      33 <property>
      34   <name>dfs.namenode.secondary.http-address</name>
      35   <value>hadoop02:9001</value>
      36 </property>
      37 </configuration>
      
      4. mapred-site.xml: set YARN as the resource scheduling framework for MapReduce
      1 <?xml version="1.0"?>
      2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      3 <!--
      4   Licensed under the Apache License, Version 2.0 (the "License");
      5   you may not use this file except in compliance with the License.
      6   You may obtain a copy of the License at
      7
      8     http://www.apache.org/licenses/LICENSE-2.0
      9
      10   Unless required by applicable law or agreed to in writing, software
      11   distributed under the License is distributed on an "AS IS" BASIS,
      12   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      13   See the License for the specific language governing permissions and
      14   limitations under the License. See accompanying LICENSE file.
      15 -->
      16
      17 <!-- Put site-specific property overrides in this file. -->
      18
      19 <configuration>
      20 <property>
      21         <name>mapreduce.framework.name</name>
      22         <value>yarn</value>
      23     </property>
      24 </configuration>
      
      5. yarn-site.xml: configure the NodeManager auxiliary service for the MapReduce shuffle
       1 <?xml version="1.0"?>
      2 <!--
      3   Licensed under the Apache License, Version 2.0 (the "License");
      4   you may not use this file except in compliance with the License.
      5   You may obtain a copy of the License at
      6
      7     http://www.apache.org/licenses/LICENSE-2.0
      8
      9   Unless required by applicable law or agreed to in writing, software
      10   distributed under the License is distributed on an "AS IS" BASIS,
      11   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      12   See the License for the specific language governing permissions and
      13   limitations under the License. See accompanying LICENSE file.
      14 -->
      15 <configuration>
      16
      17 <!-- Site specific YARN configuration properties -->
      18  <property>
      19         <name>yarn.nodemanager.aux-services</name>
      20         <value>mapreduce_shuffle</value>
      21     </property>
      22 </configuration>
      

Cluster Installation and Configuration

  • Fully distributed installation and configuration

    1. hadoop-env.sh: same as in the pseudo-distributed setup
    2. core-site.xml: same as in the pseudo-distributed setup
    3. hdfs-site.xml: same as in the pseudo-distributed setup
    4. mapred-site.xml: same as in the pseudo-distributed setup
    5. yarn-site.xml: needs to be modified
      1 <?xml version="1.0"?>
      2 <!--
      3   Licensed under the Apache License, Version 2.0 (the "License");
      4   you may not use this file except in compliance with the License.
      5   You may obtain a copy of the License at
      6
      7     http://www.apache.org/licenses/LICENSE-2.0
      8
      9   Unless required by applicable law or agreed to in writing, software
     10   distributed under the License is distributed on an "AS IS" BASIS,
     11   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     12   See the License for the specific language governing permissions and
     13   limitations under the License. See accompanying LICENSE file.
     14 -->
     15 <configuration>
     16
     17 <!-- Site specific YARN configuration properties -->
     18    <property>
     19                 <name>yarn.resourcemanager.hostname</name>
     20                 <value>hadoop01</value>
     21         </property>
     22         <property>
     23                 <name>yarn.nodemanager.aux-services</name>
     24                 <value>mapreduce_shuffle</value>
     25         </property>
     26         <property>
     27                 <name>yarn.log-aggregation-enable</name>
     28                 <value>true</value>
     29         </property>
     30         <property>
     31                 <name>yarn.log-aggregation.retain-seconds</name>
     32                 <value>604800</value>
     33         </property>
     34 </configuration>
    
    6. slaves: list the cluster's slave nodes (the configuration then needs to be present on every node, as sketched below)
    hadoop02
    hadoop03
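
    • After editing these files on hadoop01, the same Hadoop directory (and the JDK, if missing) must exist on every node. A minimal sketch, assuming Hadoop was unpacked to /root/hadoop-2.7.3 on hadoop01 (adjust the path to your actual install location):

      # run on hadoop01; the passwordless SSH configured earlier makes this non-interactive
      $ scp -r /root/hadoop-2.7.3 root@hadoop02:/root/
      $ scp -r /root/hadoop-2.7.3 root@hadoop03:/root/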
    
  • Starting and testing

    • Format the HDFS filesystem

       
      $ bin/hdfs namenode -format 
      
    • Start HDFS

       
      $ sbin/start-dfs.sh
      
    • HDFS operations

      • Create directories

         
        $ bin/hdfs dfs -mkdir /user
        $ bin/hdfs dfs -mkdir /user/root
        
      • Upload files

         
        $ bin/hdfs dfs -put etc/hadoop/* input
        
      • Download files

         
        $ bin/hdfs dfs -get input ~/input_from_hdfs
        
      • Delete files

         
        $ bin/hdfs dfs -rm -r input
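
      • To check the effect of each operation, listing the path is usually enough (paths follow the examples above; the exact output will differ):

        $ bin/hdfs dfs -ls /              # shows /user
        $ bin/hdfs dfs -ls                # defaults to /user/root; "input" appears after the upload
        $ bin/hdfs dfs -ls input          # lists the uploaded configuration files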
        
    • Start YARN

       
      $ sbin/start-yarn.sh
      
    • Run MapReduce on YARN

      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -mkdir input
      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -put etc/hadoop/* input
      [root@hadoop01 hadoop-2.7.3]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output1
      17/03/14 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.254.222:8032
      17/03/14 16:56:32 INFO input.FileInputFormat: Total input paths to process : 30
      17/03/14 16:56:32 INFO mapreduce.JobSubmitter: number of splits:30
      17/03/14 16:56:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1489480515993_0004
      17/03/14 16:56:32 INFO impl.YarnClientImpl: Submitted application application_1489480515993_0004
      17/03/14 16:56:32 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1489480515993_0004/
      17/03/14 16:56:32 INFO mapreduce.Job: Running job: job_1489480515993_0004
      17/03/14 16:56:40 INFO mapreduce.Job: Job job_1489480515993_0004 running in uber mode : false
      17/03/14 16:56:40 INFO mapreduce.Job:  map 0% reduce 0%
      17/03/14 16:57:04 INFO mapreduce.Job:  map 17% reduce 0%
      17/03/14 16:57:05 INFO mapreduce.Job:  map 20% reduce 0%
      17/03/14 16:57:11 INFO mapreduce.Job:  map 23% reduce 0%
      17/03/14 16:57:12 INFO mapreduce.Job:  map 37% reduce 0%
      17/03/14 16:57:13 INFO mapreduce.Job:  map 40% reduce 0%
      17/03/14 16:57:14 INFO mapreduce.Job:  map 47% reduce 0%
      17/03/14 16:57:29 INFO mapreduce.Job:  map 67% reduce 0%
      17/03/14 16:57:41 INFO mapreduce.Job:  map 77% reduce 0%
      17/03/14 16:57:44 INFO mapreduce.Job:  map 80% reduce 0%
      17/03/14 16:57:45 INFO mapreduce.Job:  map 97% reduce 0%
      17/03/14 16:57:46 INFO mapreduce.Job:  map 100% reduce 100%
      17/03/14 16:57:47 INFO mapreduce.Job: Job job_1489480515993_0004 completed successfully
      17/03/14 16:57:47 INFO mapreduce.Job: Counters: 50
          File System Counters
              FILE: Number of bytes read=75846
              FILE: Number of bytes written=3829330
              FILE: Number of read operations=0
              FILE: Number of large read operations=0
              FILE: Number of write operations=0
              HDFS: Number of bytes read=82217
              HDFS: Number of bytes written=36886
              HDFS: Number of read operations=93
              HDFS: Number of large read operations=0
              HDFS: Number of write operations=2
          Job Counters
              Killed map tasks=1
              Launched map tasks=31
              Launched reduce tasks=1
              Data-local map tasks=31
              Total time spent by all maps in occupied slots (ms)=762524
              Total time spent by all reduces in occupied slots (ms)=32420
              Total time spent by all map tasks (ms)=762524
              Total time spent by all reduce tasks (ms)=32420
              Total vcore-milliseconds taken by all map tasks=762524
              Total vcore-milliseconds taken by all reduce tasks=32420
              Total megabyte-milliseconds taken by all map tasks=780824576
              Total megabyte-milliseconds taken by all reduce tasks=33198080
          Map-Reduce Framework
              Map input records=2124
              Map output records=8032
              Map output bytes=107076
              Map output materialized bytes=76020
              Input split bytes=3548
              Combine input records=8032
              Combine output records=4044
              Reduce input groups=1586
              Reduce shuffle bytes=76020
              Reduce input records=4044
              Reduce output records=1586
              Spilled Records=8088
              Shuffled Maps =30
              Failed Shuffles=0
              Merged Map outputs=30
              GC time elapsed (ms)=10690
              CPU time spent (ms)=14790
              Physical memory (bytes) snapshot=6156226560
              Virtual memory (bytes) snapshot=64364421120
              Total committed heap usage (bytes)=4090552320
          Shuffle Errors
              BAD_ID=0
              CONNECTION=0
              IO_ERROR=0
              WRONG_LENGTH=0
              WRONG_MAP=0
              WRONG_REDUCE=0
          File Input Format Counters
              Bytes Read=78669
          File Output Format Counters
              Bytes Written=36886
      [root@hadoop01 hadoop-2.7.3]#
      [root@hadoop01 hadoop-2.7.3]#
      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -cat output1/*
      ...
      users   27
      users,wheel".   18
      uses    2
      using   14
      value   45
      value="20"/>    1
      value="30"/>    1
      values  4
      variable    4
      variables   4
      version 1
      version="1.0"   6
      version="1.0">  1
      version="1.0"?> 7
      via 3
      view,   1
      viewing 1
      w/  1
      want    1
      warnings.   1
      when    9
      where   4
      which   7
      while   1
      who 6
      ...
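
      Besides cat-ing the output as above, the result directory can be copied back to the local filesystem, and the job can be inspected in the ResourceManager web UI (the port matches the tracking URL printed in the job log above):

      [root@hadoop01 hadoop-2.7.3]# bin/hdfs dfs -get output1 ~/wordcount_output   # copy results to the local filesystem
      # then open http://hadoop01:8088 in a browser to see the application's status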