1. Required software

- The required environment consists of `java` and `ssh`; `sshd` must be kept running so that the Hadoop scripts can manage the remote Hadoop daemons.
- Additional software requirement on Windows: Cygwin, which provides shell support on top of the software listed above.
2. Install the software

sudo apt-get install ssh
sudo apt-get install rsync

- Since Hadoop is written in Java, a JDK also needs to be installed (one way to do this is sketched below).
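A minimal sketch of installing a JDK so that it ends up at /usr/local/jdk1.8, the path used throughout this guide. The tarball name and the name of the extracted directory are assumptions; adjust them to whatever JDK package you actually downloaded.

```bash
# Assumes a JDK 8 tarball has already been downloaded as jdk-8u281-linux-x64.tar.gz
# (file and directory names here are placeholders).
sudo tar -xvf jdk-8u281-linux-x64.tar.gz -C /usr/local
# Rename the extracted directory to the path referenced by JAVA_HOME later on.
sudo mv /usr/local/jdk1.8.0_281 /usr/local/jdk1.8
# Quick check that the JVM runs.
/usr/local/jdk1.8/bin/java -version
```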
3. Download and install

Reference: https://www.jianshu.com/p/cdae5bab030f

- To get a Hadoop distribution, download the latest stable release from one of the Apache mirrors:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.0.tar.gz
tar -xvf hadoop-3.3.0.tar.gz -C /usr/local
cd /usr/local
mv hadoop-3.3.0 hadoop
- Configure the environment variables for Hadoop:

vim /etc/profile

Together with the previously installed jdk1.8, append the following at the end of the file:

export JAVA_HOME=/usr/local/jdk1.8
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

source /etc/profile

Test whether the installation succeeded:

hadoop version
root@iZuf63fv674pbylkkxs48qZ:/usr/local# hadoop version
Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.0.jar
root@iZuf63fv674pbylkkxs48qZ:/usr/local#
4. Modify the configuration files

sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following content:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Explanation of the configuration above:

<!-- Address on which the HDFS master (namenode) listens -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:9000</value>
</property>
<!-- Directory in which Hadoop stores the files it generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
- In the same directory, edit hdfs-site.xml and add:
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/hdfs/data</value>
<description>Physical location on the datanode where data blocks are stored</description>
</property>
<!-- Set the number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
- In hadoop-env.sh, change JAVA_HOME: comment out `export JAVA_HOME=${JAVA_HOME}` and replace it with

export JAVA_HOME=/usr/local/jdk1.8
5. Test and start

All of the operations below are run from the Hadoop installation directory, /usr/local/hadoop.

- Format the namenode:

/usr/local/hadoop# ./bin/hdfs namenode -format

- Start HDFS with ./sbin/start-dfs.sh, which launches the NameNode and DataNode daemons.
- If this fails with errors like the following:
root@iZuf63fv674pbylkkxs48qZ:/usr/local/hadoop# ./sbin/start-dfs.sh
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [iZuf63fv674pbylkkxs48qZ]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Solution

Under /hadoop/sbin, add the following at the top of start-dfs.sh and stop-dfs.sh:

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

At the top of start-yarn.sh and stop-yarn.sh add:

#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Then re-run ./start-all.sh from the sbin directory and the daemons will start; a quick jps check to confirm they are up is sketched below.
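Once start-all.sh has completed, a simple way to confirm the daemons are actually running is `jps`; this is only a sketch of what a healthy single-node setup typically shows (process IDs and the exact list depend on your configuration):

```bash
jps
# Typically expected on a single-node setup (PIDs will differ):
#   NameNode
#   DataNode
#   SecondaryNameNode
#   ResourceManager
#   NodeManager
#   Jps
```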
- During the restart, the following problem came up (a sketch of the usual fix follows the log below):
root@iZuf63fv674pbylkkxs48qZ:/usr/local/hadoop/sbin# sudo ./start-dfs.sh
WARNING: HADOOP_SECURE_DN_USER has been replaced by HDFS_DATANODE_SECURE_USER. Using value of HADOOP_SECURE_DN_USER.
Starting namenodes on [localhost]
localhost: root@localhost: Permission denied (publickey,password).
Starting datanodes
localhost: root@localhost: Permission denied (publickey,password).
Starting secondary namenodes [iZuf63fv674pbylkkxs48qZ]
iZuf63fv674pbylkkxs48qZ: root@izuf63fv674pbylkkxs48qz: Permission denied (publickey,password).
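The notes do not record how this was resolved; `Permission denied (publickey,password)` usually means the user running the scripts (root in this case) cannot ssh into localhost without a password. A minimal sketch of the usual fix, assuming root is the user starting the daemons:

```bash
# Generate a key pair for root if one does not exist yet (accept the defaults).
ssh-keygen -t rsa
# Authorize that key for password-less login to localhost.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify: this should log in without prompting for a password.
ssh localhost exit
```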
#### 5.1 Building a Hadoop cluster on virtual machines
- When installing, configure a static IP for every host. My own notes on configuring a static IP are here: http://note.youdao.com/s/dDpr8UkW
- Change the download sources of the Ubuntu system (a sed one-liner for this is sketched below):

sudo vim /etc/apt/sources.list

Replace http://archive.ubuntu.com/ubuntu/ with http://mirrors.aliyun.com/ubuntu/
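If you prefer not to edit the file by hand, the same replacement can be done with sed; this is only a convenience sketch, so back up the file first:

```bash
# Back up the original source list, then swap the mirror in place and refresh the package index.
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo sed -i 's|http://archive.ubuntu.com/ubuntu/|http://mirrors.aliyun.com/ubuntu/|g' /etc/apt/sources.list
sudo apt-get update
```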
- First install a single Ubuntu system to act as master: configure its static IP and install the JDK and Hadoop, then clone master with the same configuration to get node1 and node2. The hostnames mentioned here are set with:

sudo vim /etc/hostname
- Edit the hosts file:

sudo vim /etc/hosts

Append the following at the end of the file:
192.168.8.6 master
192.168.8.7 node1
192.168.8.8 node2
- Configure password-less SSH login
  - Run `cd ~` to go back to the home directory.
  - Run `ssh-keygen` and keep pressing Enter, which gives output like the following:
helloful@master:~$ cd ~
helloful@master:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/helloful/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/helloful/.ssh/id_rsa
Your public key has been saved in /home/helloful/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:tGRzFOZansyT58ahQZyWOaIxESIVbPSkzYowY7AQyYM helloful@master
The key's randomart image is:
+---[RSA 3072]----+
|*o.+=.+. +.      |
|E+ .oB . = +     |
|=.... * * %      |
|.+ . . B & +     |
| . . . S O o     |
|      B .        |
|     . +         |
|      .          |
|                 |
+----[SHA256]-----+
- Run `cd .ssh`
- Run `cat ./id_rsa.pub >> authorized_keys`
helloful@master:~$ cd .ssh
helloful@master:~/.ssh$ cat ./id_rsa.pub >> authorized_keys
helloful@master:~/.ssh$ cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDffnfOM4rgcxtm8lkBzPojolSX1zz26r5+hOd0Iy5lgS7atDZgZqQ7JITShpwaENNJ7N8qumsjwnyulBsP5DSRGa0oXzJTafO+Drj47p5V+bI4Nejl+SjXrB6X5RIFD8VmuIrXNMtRx4bQQ4oZQyAF/qSa4wcnsBz8gMPuY3JAnArlsm9MCHfhvTg/zeVTbJjjbyc+8tGXVsa0AVmL5lcrxOcBPc0bP53/agwzPMHuBtlTbvpX2X57XxvKFov8WngSbMZYRWALsW9EvvBZg1oyPVEXo16WK80hWRlZKWiQANJgdWF3sFIiac22ml12NoH7KzmmDEDigd0pqAPaBOlcLvCzWigOJf22hmW8UDTP68kvjR8M4JPDjkwDC5UjO4mzRQUEukeXqGMOxM7drHlyqKpoVE1/zi9rKFSroCnd59a5HIv+0pobMkjwQATh8ZUBEGeEK7yXNBnQTvxFvA8qmJZ62WzGguaty4AWDDQ9HMTkA1twvmlCqBksFSQOpFM= helloful@master
helloful@master:~/.ssh$
- Run the key-generation steps above on every host, then copy the keys of node1 and node2 into the authorized_keys file on master; likewise, copy those of master and node2 to node1, and so on for the remaining host. One way of copying the keys around is sketched below.
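A small sketch of distributing the keys, assuming the hostnames from /etc/hosts above and that each host already has its own key pair; `ssh-copy-id` appends the local public key to the remote user's authorized_keys:

```bash
# Run on master (repeat the analogous commands on node1 and node2),
# so that every host ends up holding every other host's public key.
ssh-copy-id helloful@node1
ssh-copy-id helloful@node2
# Verify that password-less login now works.
ssh node1 hostname
ssh node2 hostname
```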
- Modify the Hadoop configuration files. The files that need to be changed are edited under:

cd ~
cd /usr/local/hadoop/etc/hadoop
- sudo vim core-site.xml

Add the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
- sudo vim hadoop-env.sh

Add one line, shown below, pointing at your own JDK installation path:

export JAVA_HOME=/usr/local/jdk1.8
- In mapred-site.xml, change the addresses to your own:

sudo vim mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:49001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/usr/local/hadoop/var</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Modify the workers file:
  - on master, change it to node1 and node2
  - on node1, change it to master and node2
  - on node2, change it to master and node1
- Modify the yarn-site.xml file (a sketch for pushing the edited configuration files out to node1 and node2 follows the XML below):
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<description>The address of the scheduler interface.</description>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<description>The http address of the RM web application.</description>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<description>The https address of the RM web application.</description>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>${yarn.resourcemanager.hostname}:8090</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<description>The address of the RM admin interface.</description>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8033</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1024</value>
<description>Maximum allocation for every container request, in MB; the default is 8192 MB.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
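Since node1 and node2 were cloned from master, the easiest way to keep the configuration consistent after editing it on master is to copy the changed files to the workers. A small sketch, assuming the hostnames used above, the same paths on every node, and the password-less SSH set up earlier; the workers file is deliberately excluded because it differs per node in this setup:

```bash
# Push the shared configuration files from master to both workers.
for host in node1 node2; do
  scp /usr/local/hadoop/etc/hadoop/{core-site.xml,mapred-site.xml,yarn-site.xml,hadoop-env.sh} \
      "$host":/usr/local/hadoop/etc/hadoop/
done
```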
- Start Hadoop

Format script:

cd /usr/local/hadoop/bin
sudo ./hadoop namenode -format

Start script:

- Take special care here: because the ssh-keygen above was run with ordinary user privileges, do not add sudo when running this:

cd /usr/local/hadoop/sbin
./start-all.sh
- Problems encountered while running this
  - If it complains that files cannot be created, run the following command on every host:

sudo chmod 777 -R /usr/local/hadoop/
- Check that everything is running via the web UIs
  - To view DFS: on the Windows 10 host, open the master VM's IP on port 9870. For example, with 192.168.8.6 as master's IP:

192.168.8.6:9870

  - To view `YARN`:

192.168.8.6:8088
#### 5.2 Installing Hive
- Install `mysql` and grant privileges for `hive`:

grant all on *.* to root@'%' identified by '12345678';
- `Hive` download and installation
  - Download from https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/
  - sudo tar -xvf hive.tar.gz -C /usr/local (see the rename sketch below)
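The environment variables below assume Hive lives at /usr/local/hive, but the 3.1.2 tarball normally extracts to a versioned directory (apache-hive-3.1.2-bin); if so, rename it first. The extracted directory name is an assumption, so check what tar actually produced:

```bash
cd /usr/local
# Rename the extracted directory (name assumed for the 3.1.2 binary release) to match HIVE_HOME.
sudo mv apache-hive-3.1.2-bin hive
```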
Edit the following file:

sudo vim ~/.bashrc

Add the following content:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

After saving and exiting, run

source ~/.bashrc

so that the configuration takes effect immediately.
- Modify the files in the /usr/local/hive/conf folder, as follows:
  - Rename hive-default.xml.template to hive-default.xml;
  - Create a new file with `touch hive-site.xml`, and paste the following configuration into hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://47.117.137.112:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore; com.mysql.jdbc.Driver is deprecated</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>12345678</value>
<description>password to use against metastore database</description>
</property>
</configuration>
- The JDBC driver matching the configuration above is version 5.x.
- After mysql is installed, create the hive database:

create database hive

Once the database exists, configure its permissions so that mysql allows hive to connect. MySQL 8 no longer creates an account implicitly when a password is given in a grant, so the account has to be created/updated first and the privileges granted afterwards:

use mysql;
update user set host='%' where user='hive';
grant all privileges on *.* to 'hive'@'%';
alter user hive identified with mysql_native_password by '123456';
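After changing accounts and privileges it is worth reloading and double-checking them. A quick verification sketch from the shell; the root credentials are whatever you set when installing mysql:

```bash
# Reload the grant tables and confirm the hive account exists with the expected grants.
mysql -u root -p -e "FLUSH PRIVILEGES; SELECT user, host, plugin FROM mysql.user WHERE user = 'hive'; SHOW GRANTS FOR 'hive'@'%';"
```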
#### 5.3 Using Hive

Create the paths Hive needs on HDFS:

hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse

Change the permissions on the paths above so that the group has write access:
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
- Start hive
  - Go into hive/bin and run hive; it fails with the following error:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
Fix:
cd hive/conf
cp hive-env.sh.template hive-env.sh
sudo vim hive-env.sh
(add export HADOOP_HOME=/usr/local/hadoop)
source hive-env.sh
- Running hive then errors out, complaining that the metastore client cannot be instantiated:
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
The specific fix is to run the following statement to initialize the metastore schema:
schematool -dbType mysql -initSchema
The initialization is driven by the configuration in hive-site.xml, so we do not have to create the database ourselves, but we do need to grant the user privileges:
grant all on *.* to hive@'%' identified by '12345678';
During the initialization another error appears:

org.apache.hadoop.hive.metastore.HiveMetaException: Failed to load driver

which means the JDBC driver is missing. Copying the driver from hive/jdbc into hive/lib did not work, so I downloaded mysql-connector-java5.XX.XX.jar myself and put it under hive/lib.
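For reference, the connector jar can also be pulled straight from Maven Central into Hive's lib directory; the 5.1.49 version below is only an example, so pick whichever 5.x release matches your MySQL server:

```bash
# Download an example 5.1.x connector (version chosen here is an assumption) and place it in Hive's lib.
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
sudo cp mysql-connector-java-5.1.49.jar /usr/local/hive/lib/
```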
6. Systematic study of Hadoop

Reference: https://www.zhihu.com/question/333417513