1. Hardware environment
Hostname | IP address | OS |
---|---|---|
master | 172.16.34.101 | CentOS Linux release 7.3.1611 |
slave01 | 172.16.34.102 | CentOS Linux release 7.3.1611 |
slave02 | 172.16.34.103 | CentOS Linux release 7.3.1611 |
2. Software versions
Software | Version |
---|---|
hadoop | 2.7.7 |
hive | 1.2.2 |
spark | 2.3.4 |
zookeeper | 3.4.9 |
hbase | 1.3.6 |
jdk | 1.8+ |
Notes
- Hadoop: an open-source framework, written in Java, for storing massive data sets on distributed server clusters and running distributed analysis applications over them. Its core components are HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over it.
- HDFS: Hadoop's distributed file system. It introduces a NameNode server holding the file metadata and DataNode servers holding the actual data, so that data is stored and read in a distributed fashion. HDFS can be thought of as one large distributed, redundantly replicated, dynamically expandable disk for massive data.
- MapReduce: a computation framework whose core idea is to hand computation tasks out to the servers in the cluster. A job is split into Map and Reduce tasks, which the job scheduler (JobTracker) distributes across the cluster. MapReduce can be thought of as a compute engine: write Map and Reduce programs that follow its rules and it will carry out the computation.
- HBase: a highly reliable, high-performance, column-oriented, scalable, distributed NoSQL database written in Java. It runs on top of HDFS and stores massive amounts of loosely structured (unstructured and semi-structured) data fault-tolerantly.
- Hive: a data warehouse tool built on Hadoop. It maps structured data files (or even unstructured data) to database tables and offers simple SQL-like querying, translating the SQL statements into MapReduce jobs. Its advantage is a low learning curve: simple MapReduce statistics can be expressed in SQL-like statements without writing dedicated MapReduce applications, which makes it well suited to data warehouse analytics.
- Spark: a big-data parallel computing framework implemented in Scala and based on in-memory computation, suitable for building large, low-latency data analysis applications. Its computation model belongs to the MapReduce family but is not limited to Map and Reduce operations: it provides many more dataset operation types, so its programming model is more flexible than MapReduce's. Spark keeps the data being processed and the intermediate results in memory, greatly reducing I/O overhead. Spark is not a full replacement for Hadoop; it mainly replaces Hadoop's MapReduce computation model, while relying on Hadoop YARN for resource scheduling and on HDFS for distributed storage.
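The Map/Shuffle/Reduce pattern described above can be mimicked with plain shell tools (a toy illustration only, not Hadoop itself): `tr` plays the map step (emit one word per line), `sort` plays the shuffle (group identical keys together), and `uniq -c` plays the reduce (count per key).

```shell
# Toy word count in the MapReduce style:
#   map:     split the input into one word per line
#   shuffle: bring identical words next to each other
#   reduce:  count each group
echo "a b a c b a" | tr ' ' '\n' | sort | uniq -c | sort -rn
# prints the per-word counts: 3 a, 2 b, 1 c
```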
3. Disable the firewall and SELinux on all nodes
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
systemctl stop firewalld
systemctl disable firewalld
4. Configure NTP time synchronization on all nodes
Omitted.
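The original leaves this step out; for completeness, a minimal sketch using chrony (an assumption — any NTP client works), with master doubling as the local time source for the 172.16.34.0/24 subnet:

```shell
# On master (as root): serve time to the cluster subnet
yum install -y chrony
echo "allow 172.16.34.0/24" >> /etc/chrony.conf
systemctl enable --now chronyd

# On slave01/slave02 (as root): sync from master instead of the public pool
yum install -y chrony
sed -i '/^server /d' /etc/chrony.conf
echo "server master iburst" >> /etc/chrony.conf
systemctl enable --now chronyd
chronyc sources    # verify that master shows up as a source
```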
5. Add hosts entries on all nodes
cat >>/etc/hosts <<EOF
172.16.34.101 master
172.16.34.102 slave01
172.16.34.103 slave02
EOF
6. Create the hadoop user on all nodes
groupadd hadoop
useradd -m -g hadoop hadoop
# Set the hadoop user's password to "hadoop"
echo "hadoop" |passwd --stdin hadoop
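A note on the command above: `passwd --stdin` is a Red Hat/CentOS extension. A portable alternative with the same effect:

```shell
# Equivalent to `echo "hadoop" | passwd --stdin hadoop`, but works on most distros
echo "hadoop:hadoop" | chpasswd
```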
7. Install the JDK on all nodes
7.1 Download the Linux x64 JDK 8 from the Oracle website: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
7.2 Install the JDK as root
mkdir -p /usr/java
tar -zxf jdk-8u191-linux-x64.tar.gz -C /usr/java
Note: unless stated otherwise, the steps below are performed as the hadoop user on the master node.
8. Set up passwordless SSH from the master node to all nodes
# Switch to the hadoop user
[root@master ~]# su - hadoop
Last login: Thu Apr 16 17:12:59 CST 2020 on pts/1
# Generate a key pair
[hadoop@master ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
2b:70:a0:b3:9a:3a:60:98:4d:a9:a1:67:a3:12:61:2e hadoop@master
The key's randomart image is:
+--[ RSA 2048]----+
| |
| |
| .. |
|o.o. . |
|=Bo . . S |
|Eo=o o . |
|+=.. . . |
|+o . |
|*. |
+-----------------+
# Copy the public key to every node (including master itself)
[hadoop@master ~]$ ssh-copy-id master
The authenticity of host 'master (172.16.34.101)' can't be established.
ECDSA key fingerprint is 19:d0:5f:f0:7e:bd:96:0d:3b:5c:f7:c5:3d:fb:61:d5.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@master's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'master'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@master ~]$ ssh-copy-id slave01
The authenticity of host 'slave01 (172.16.34.102)' can't be established.
ECDSA key fingerprint is 19:d0:5f:f0:7e:bd:96:0d:3b:5c:f7:c5:3d:fb:61:d5.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@slave01's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'slave01'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@master ~]$ ssh-copy-id slave02
The authenticity of host 'slave02 (172.16.34.103)' can't be established.
ECDSA key fingerprint is 19:d0:5f:f0:7e:bd:96:0d:3b:5c:f7:c5:3d:fb:61:d5.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@slave02's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'slave02'"
and check to make sure that only the key(s) you wanted were added.
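The three ssh-copy-id runs above can be collapsed into one loop; a sketch (host names as defined in /etc/hosts; each iteration still prompts once for the hadoop password):

```shell
for host in master slave01 slave02; do
  ssh-copy-id "$host"
  # confirm that passwordless login now works for this host
  ssh -o BatchMode=yes "$host" true && echo "$host: OK"
done
```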
9. Download the required packages
mkdir -p /home/hadoop/bigdata
curl -o /home/hadoop/bigdata/hadoop-2.7.7.tar.gz http://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
curl -o /home/hadoop/bigdata/apache-hive-1.2.2-bin.tar.gz http://archive.apache.org/dist/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
curl -o /home/hadoop/bigdata/spark-2.3.4-bin-hadoop2.7.tgz http://archive.apache.org/dist/spark/spark-2.3.4/spark-2.3.4-bin-hadoop2.7.tgz
curl -o /home/hadoop/bigdata/zookeeper-3.4.9.tar.gz http://archive.apache.org/dist/zookeeper/zookeeper-3.4.9/zookeeper-3.4.9.tar.gz
curl -o /home/hadoop/bigdata/hbase-1.3.6-bin.tar.gz http://archive.apache.org/dist/hbase/hbase-1.3.6/hbase-1.3.6-bin.tar.gz
curl -o /home/hadoop/bigdata/scala-2.10.4.tgz https://www.scala-lang.org/files/archive/scala-2.10.4.tgz
10. Unpack the packages
cd /home/hadoop/bigdata
tar -zxf hadoop-2.7.7.tar.gz
tar -zxf apache-hive-1.2.2-bin.tar.gz
tar -zxf spark-2.3.4-bin-hadoop2.7.tgz
tar -zxf scala-2.10.4.tgz
tar -zxf zookeeper-3.4.9.tar.gz
tar -zxf hbase-1.3.6-bin.tar.gz
mv hadoop-2.7.7 hadoop
mv apache-hive-1.2.2-bin hive
mv spark-2.3.4-bin-hadoop2.7 spark
mv scala-2.10.4 scala
mv zookeeper-3.4.9 zk
mv hbase-1.3.6 hbase
11. Set environment variables
cat >>~/.bashrc <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_191
export HADOOP_HOME=/home/hadoop/bigdata/hadoop
export HIVE_HOME=/home/hadoop/bigdata/hive
export HADOOP_USER_NAME=hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$PATH
export SCALA_HOME=/home/hadoop/bigdata/scala
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/home/hadoop/bigdata/spark
export PATH=$PATH:$SPARK_HOME/bin
export ZK_HOME=/home/hadoop/bigdata/zk
export PATH=$PATH:$ZK_HOME/bin
export HBASE_HOME=/home/hadoop/bigdata/hbase
export PATH=$PATH:$HBASE_HOME/bin
EOF
Note: the settings below are baseline values; adjust them to your actual workload.
12. Configure hadoop
12.1 Create the data directories (these must match the paths referenced in core-site.xml and hdfs-site.xml below)
mkdir -p /home/hadoop/bigdata/data/hadoop/tmp
mkdir -p /home/hadoop/bigdata/data/hadoop/hdfs/{datanode,namenode}
12.2 Edit the configuration files
12.2.1 /home/hadoop/bigdata/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/bigdata/data/hadoop/tmp</value>
</property>
</configuration>
12.2.2 /home/hadoop/bigdata/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:9001</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/bigdata/data/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/bigdata/data/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<!-- only two datanodes (slave01/slave02), so more than 2 replicas cannot be placed -->
<value>2</value>
</property>
</configuration>
12.2.3 /home/hadoop/bigdata/hadoop/etc/hadoop/mapred-site.xml
Copy it from the mapred-site.xml.template template, then edit:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
12.2.4 /home/hadoop/bigdata/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
12.2.5 /home/hadoop/bigdata/hadoop/etc/hadoop/slaves
slave01
slave02
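A typo in any of the XML files above will only surface when the daemons start, so it can be worth validating them first. A small helper (a sketch assuming python3 is available; xmllint works equally well):

```shell
# check_xml FILE — succeeds iff FILE is well-formed XML (stdlib parser, no extra packages)
check_xml() {
  python3 -c "import sys, xml.dom.minidom; xml.dom.minidom.parse(sys.argv[1])" "$1"
}

# e.g. validate the four edited hadoop files:
# for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
#   check_xml "$HADOOP_HOME/etc/hadoop/$f" && echo "$f OK"
# done
```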
13. Configure hive
13.1 Copy the configuration files from the templates
cd /home/hadoop/bigdata/hive/conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j.properties.template hive-log4j.properties
13.2 Edit the configuration files
13.2.1 /home/hadoop/bigdata/hive/conf/hive-env.sh
export HADOOP_HOME=/home/hadoop/bigdata/hadoop
export HIVE_HOME=/home/hadoop/bigdata/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$PATH
13.2.2 /home/hadoop/bigdata/hive/conf/hive-log4j.properties
# Define some default values that can be overridden by system properties
hive.log.threshold=ALL
hive.root.logger=INFO,DRFA
#hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.dir=/home/hadoop/bigdata/hive/log
hive.log.file=hive.log
13.2.3 /home/hadoop/bigdata/hive/conf/hive-site.xml
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://master:9000/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>hdfs://master:9000/user/hive/scratchdir</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hadoop/bigdata/hive/tmp</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/hadoop/bigdata/hive/tmp</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/home/hadoop/bigdata/hive/tmp</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hadoop/bigdata/hive/logs</value>
<description>Location of Hive run time structured log file</description>
</property>
<!-- Note: the MySQL database below must be prepared in advance -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://172.16.34.25:3306/hivemeta?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- Driver class; download the JAR from the MySQL website into /home/hadoop/bigdata/hive/lib, e.g. mysql-connector-java-5.1.39.jar -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- MySQL account -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<!-- MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
13.3 Patch the launcher script (Spark 2.x ships a jars/ directory instead of the old spark-assembly JAR, so point the hive script at it)
sed -i "s#lib\/spark-assembly-\*.jar#jars\/\*.jar#g" /home/hadoop/bigdata/hive/bin/hive
14. Configure spark
14.1 Copy the configuration files from the templates
cd /home/hadoop/bigdata/spark/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves
14.2 Edit the configuration files
14.2.1 /home/hadoop/bigdata/spark/conf/spark-env.sh
export SCALA_HOME=/home/hadoop/bigdata/scala
export JAVA_HOME=/usr/java/jdk1.8.0_191
export HADOOP_HOME=/home/hadoop/bigdata/hadoop
export HADOOP_CONF_DIR=/home/hadoop/bigdata/hadoop/etc/hadoop
SPARK_MASTER_HOST=master
SPARK_LOCAL_DIRS=/home/hadoop/bigdata/spark
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=3G
14.2.2 /home/hadoop/bigdata/spark/conf/slaves
slave01
slave02
15. Configure zookeeper
15.1 Copy the configuration file from the template
cd /home/hadoop/bigdata/zk/conf/
cp zoo_sample.cfg zoo.cfg
15.2 Edit the configuration file
15.2.1 /home/hadoop/bigdata/zk/conf/zoo.cfg (first create the two data directories: mkdir -p /home/hadoop/bigdata/zk/{zkdata,zkdatalog})
dataDir=/home/hadoop/bigdata/zk/zkdata
dataLogDir=/home/hadoop/bigdata/zk/zkdatalog
server.1=master:2888:3888
server.2=slave01:2888:3888
server.3=slave02:2888:3888
15.2.2 /home/hadoop/bigdata/zk/zkdata/myid
1
16. Configure hbase
16.1 /home/hadoop/bigdata/hbase/conf/hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master:2181,slave01:2181,slave02:2181</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/bigdata/zk/zkdata</value>
</property>
</configuration>
16.2 /home/hadoop/bigdata/hbase/conf/regionservers
slave01
slave02
17. Copy the files from the master node to the slaves
cd /home/hadoop
scp -pr .bashrc slave01:`pwd`
scp -pr .bashrc slave02:`pwd`
scp -pr bigdata slave01:`pwd`
scp -pr bigdata slave02:`pwd`
18. Apply the environment variables
Run on all nodes:
su - hadoop
source .bashrc
19. Start the services
Note: unless stated otherwise, the start-up steps below are performed only as the hadoop user on the master node.
19.1 hadoop
19.1.1 Start
# Format the namenode before the first start-up
hdfs namenode -format
# Start HDFS and YARN
cd /home/hadoop/bigdata/hadoop/sbin
./start-dfs.sh
./start-yarn.sh
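If start-up succeeded, `jps` should list the expected daemons on each node (a quick sanity check; process names as in stock Hadoop 2.7):

```shell
# On master: expect NameNode, SecondaryNameNode and ResourceManager
jps
# On the slaves: expect DataNode and NodeManager
ssh slave01 jps
ssh slave02 jps
```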
19.1.2 hadoop web UIs
http://172.16.34.101:50070/
http://172.16.34.101:8088/
19.2 hive
19.2.1 Create the HDFS paths
hadoop fs -mkdir -p /user/hive/scratchdir
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod 777 /user/hive/scratchdir
hadoop fs -chmod 777 /user/hive/warehouse
19.2.2 Initialize the metastore database
schematool -initSchema -dbType mysql
19.2.3 Start
cd /home/hadoop/bigdata/hive/bin
nohup ./hive --service metastore &
nohup ./hive --service hiveserver2 &
19.2.4 Verify
Enter hive and run a few simple commands:
[hadoop@master ~]$ hive
Logging initialized using configuration in file:/home/hadoop/bigdata/hive/conf/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 0.924 seconds, Fetched: 1 row(s)
hive> create database test;
OK
Time taken: 0.146 seconds
hive> show databases;
OK
default
test
Time taken: 0.011 seconds, Fetched: 2 row(s)
hive> use test;
OK
Time taken: 0.022 seconds
hive> create table tb_test(id int, name string);
OK
Time taken: 0.151 seconds
hive> show tables;
OK
tb_test
Time taken: 0.023 seconds, Fetched: 1 row(s)
hive> insert into tb_test values(1,'a'),(2,'b');
Query ID = hadoop_20200417152249_eeb27aa3-4df2-49e1-a5c5-b0759ee45b0c
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1586939367269_0003, Tracking URL = http://master:8088/proxy/application_1586939367269_0003/
Kill Command = /home/hadoop/bigdata/hadoop/bin/hadoop job -kill job_1586939367269_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-04-17 15:22:55,572 Stage-1 map = 0%, reduce = 0%
2020-04-17 15:22:59,770 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1586939367269_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_test/.hive-staging_hive_2020-04-17_15-22-49_699_1452428249531261911-1/-ext-10000
Loading data to table test.tb_test
Table test.tb_test stats: [numFiles=1, numRows=2, totalSize=8, rawDataSize=6]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.98 sec HDFS Read: 3597 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 980 msec
OK
Time taken: 12.665 seconds
hive> select * from tb_test;
OK
1 a
2 b
Time taken: 0.132 seconds, Fetched: 2 row(s)
hive> exit;
[hadoop@master ~]$
19.3 spark
19.3.1 Start
cd /home/hadoop/bigdata/spark/sbin
./start-all.sh
19.3.2 spark web UI
http://172.16.34.101:8080/
19.4 zookeeper
19.4.1 Edit /home/hadoop/bigdata/zk/zkdata/myid
# As the hadoop user on slave01:
echo 2 > /home/hadoop/bigdata/zk/zkdata/myid
# As the hadoop user on slave02:
echo 3 > /home/hadoop/bigdata/zk/zkdata/myid
19.4.2 Start
# As the hadoop user, run on each of master, slave01 and slave02:
cd /home/hadoop/bigdata/zk/bin/
./zkServer.sh start
# Check the status
./zkServer.sh status
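Rather than logging in to each node, the whole ensemble's status can be polled from master (one node should report "leader", the others "follower"):

```shell
for host in master slave01 slave02; do
  echo "== $host =="
  ssh "$host" /home/hadoop/bigdata/zk/bin/zkServer.sh status
done
```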
19.5 hbase
19.5.1 Start
cd /home/hadoop/bigdata/hbase/bin
./start-hbase.sh
19.5.2 hbase web UI
http://172.16.34.101:16010/