基于Docker搭建Hadoop+Hive

原文件地址:基于Docker搭建Hadoop+Hive - 知乎

为配合生产hadoop使用,在本地搭建测试环境,使用docker环境实现(主要是省事~),拉取阿里云已有hadoop镜像基础上,安装hive组件,参考下面两个专栏文章:

克里斯:基于 Docker 构建 Hadoop 平台
docker上从零开始搭建hadoop和hive环境

由于hadoop与hive等存在版本兼容问题,安装前可以先通过官网确认版本兼容情况:

http://hive.apache.org/downloads.html

本次使用的各版本配置如下:

  • Docker 19.03.8
  • JDK 1.8
  • Hadoop 3.2.0
  • Hive 3.1.2
  • mysql 8.0.1
  • mysql-connector-java-5.1.49.jar
  • hive_jdbc_2.5.15.1040

Hadoop部分:

一、拉取镜像

docker pull registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base

二、运行容器

进入容器看worker里面有三台机子,分别是Master、Slave1、Slave2

关于worker路径,通过etc/profile环境变量配置的文件即可查看hadoop安装目录
#随意建立个容器查看配置情况
docker run -it --name hadopp-test registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base
#查看系统变量路径
vim etc/profile
#查看worker情况
vim /usr/local/hadoop/etc/hadoop/workers
  • 建立hadoop用的内部网络
#指定固定ip号段
docker network create --driver=bridge --subnet=172.19.0.0/16  hadoop

  • 建立Master容器,映射端口

10000端口为hiveserver2端口,后面本地客户端要通过beeline连接hive使用,有其他组件要安装的话可以提前把端口都映射出来,毕竟后面容器运行后再添加端口还是有点麻烦的

docker run -it --network hadoop -h Master --name Master -p 9870:9870 -p 8088:8088 -p 10000:10000 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • 创建Slave1容器
docker run -it --network hadoop -h Slave1 --name Slave1 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • 创建Slave2容器
docker run -it --network hadoop -h Slave2 --name Slave2 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • 三台机器,都修改host vim /etc/hosts
172.20.0.4	Master
172.20.0.3	Slave1
172.20.0.2	Slave2

三、启动hadoop

虽然容器里面已经把hadoop路径配置在系统变量里面,但每次进入需要运行source /etc/profile才能生效使用

  • 进入master,启动hadoop,先格式化hdfs
#进入Master容器
docker exec -it Master bash
#进入后格式化hdfs
root@Master:/# hadoop namenode -format
  • 启动全部,包含hdfs和yarn
root@Master:/usr/local/hadoop/sbin# ./start-all.sh
可以看到服务起来了,本地范围宿主机ip的8088及9870端口可以看到监控信息
Starting namenodes on [Master]
Master: Warning: Permanently added 'master,172.19.0.4' (ECDSA) to the list of known hosts.
Starting datanodes
Slave1: Warning: Permanently added 'slave1,172.19.0.3' (ECDSA) to the list of known hosts.
Slave2: Warning: Permanently added 'slave2,172.19.0.2' (ECDSA) to the list of known hosts.
Slave1: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Slave2: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [Master]
Starting resourcemanager
Starting nodemanagers

查看分布式文件系统状态
root@Master:/usr/local/hadoop/sbin# hdfs dfsadmin -report

四、运行内置WordCount例子

把license作为需要统计的文件

root@Master:/usr/local/hadoop# cat LICENSE.txt > file1.txt
root@Master:/usr/local/hadoop# ls
LICENSE.txt  NOTICE.txt  README.txt  bin  etc  file1.txt  include  lib  libexec  logs  sbin  share

在 HDFS 中创建 input 文件夹

root@Master:/usr/local/hadoop# hadoop fs -mkdir /input

上传 file1.txt 文件到 HDFS 中

root@Master:/usr/local/hadoop# hadoop fs -put file1.txt /input
2020-09-14 11:02:01,183 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

查看 HDFS 中 input 文件夹里的内容

root@Master:/usr/local/hadoop# hadoop fs -ls /input
Found 1 items
-rw-r--r--   2 root supergroup     150569 2020-09-14 11:02 /input/file1.txt

运作 wordcount 例子程序

root@Master:/usr/local/hadoop# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input /output

查看 HDFS 中的 /output 文件夹的内容

root@Master:/usr/local/hadoop# hadoop fs -ls /output
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-09-14 11:09 /output/_SUCCESS
-rw-r--r--   2 root supergroup      35324 2020-09-14 11:09 /output/part-r-00000

查看part-r-00000文件的内容,就是运行的结果

root@Master:/usr/local/hadoop# hadoop fs -cat /output/part-r-00000

Hadoop 部分结束了

HIVE部分:

Hive下载后上传到容器目录下:

下载地址:Index of /apache/hive/hive-3.1.2

                  https://mirror-hk.koddos.net/apache/hive/hive-3.1.2/

一、解压安装包

# 拷贝安装包进Master容器
docker cp apache-hive-3.1.2-bin.tar.gz Master:/usr/local
# 进入容器
docker exec -it Master bash
cd /usr/local/
# 解压安装包
tar xvf apache-hive-3.1.2-bin.tar.gz

二、修改配置文件

root@Master:/usr/local/apache-hive-3.1.2-bin/conf# cp hive-default.xml.template hive-site.xml
root@Master:/usr/local/apache-hive-3.1.2-bin/conf# vim hive-site.xml

在最前面添加下面配置:

  <property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>

三、配置Hive相关环境变量

vim /etc/profile

#文本最后添加
export HIVE_HOME="/usr/local/apache-hive-3.1.2-bin"
export PATH=$PATH:$HIVE_HOME/bin 

配置后执行source /etc/profile 生效

source /etc/profile

四、配置mysql作为元数据库

1.拉取Mysql镜像并生产容器:

#拉取镜像
docker pull mysql:8:0.18
#建立容器
docker run --name mysql_hive -p 4306:3306 --net hadoop --ip 172.19.0.5 -v /root/mysql:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=abc123456 -d mysql:8.0.18
#进入容器
docker exec -it mysql_hive bash
#进入myslq
mysql -uroot -p
#密码上面建立容器时候已经设置abc123456
#建立hive数据库
create database hive;
#修改远程连接权限
ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY 'abc123456';

ps:容器ip地址类似window的分配机制,如果不设置固定ip容器重启后会自动分配ip,因此数据库相关的容器建议设置为固定ip。

2.回去Master容器,修改关联数据库的配置

docker exec -it Master bash
vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

搜索关键词修改数据库url、驱动、用户名,url根据上面建容器时候地址。

  #还请注意hive配置文件里面使用&amp;作为分隔,高版本myssql需要SSL验证,在这里设置关闭
  <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>abc123456</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://172.19.0.5:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <property>

3.mysql驱动上传到hive的lib下

#前面已经跟hive安装包一起上传到容器/usr/local目录
root@Master:/usr/local# cp mysql-connector-java-5.1.49.jar /usr/local/apache-hive-3.1.2-bin/lib

4.jar包配置修改

对hive的lib文件夹下的部分文件做修改,不然初始化数据库的时候会报错

#slf4j这个包hadoop及hive两边只能有一个,这里删掉hive这边
root@Master:/usr/local/apache-hive-3.1.2-bin/lib# rm log4j-slf4j-impl-2.10.0.jar

#guava这个包hadoop及hive两边只删掉版本低的那个,把版本高的复制过去,这里删掉hive,复制hadoop的过去
root@Master:/usr/local/hadoop/share/hadoop/common/lib# cp guava-27.0-jre.jar /usr/local/apache-hive-3.1.2-bin/lib
root@Master:/usr/local/hadoop/share/hadoop/common/lib# rm /usr/local/apache-hive-3.1.2-bin/lib/guava-19.0.jar

#把文件hive-site.xml第3225行的特殊字符删除
root@Master: vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

五、初始化元数据库

root@Master:/usr/local/apache-hive-3.1.2-bin/bin# schematool -initSchema -dbType mysql

成功后提示:

Metastore connection URL:        jdbc:mysql://172.19.0.5:3306/hive?createDatabaseIfNotExist=true&useSSL=false
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
schemaTool completed

六、验证

  • 我们先创建一个数据文件放到/usr/local
cd /usr/local
vim test.txt
1,jack
2,hel
3,nack
  • 进入hive交互界面
root@Master:/usr/local# hive
Hive Session ID = 7bec2ab6-e06d-4dff-8d53-a64611875aeb

Logging initialized using configuration in jar:file:/usr/local/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 5cdee915-5c95-4834-bd3b-bec6f3d90e5b
hive> create table test(
    >     id      int
    >    ,name    string
    > )
    > row format delimited
    > fields terminated by ',';
OK
Time taken: 1.453 seconds
hive> load data local inpath '/usr/local/test.txt' into table test;
Loading data to table default.test
OK
Time taken: 0.63 seconds
hive> select * from test;
OK
1       jack
2       hel
3       nack
Time taken: 1.611 seconds, Fetched: 3 row(s)

hive安装完毕

启动 Hiveserver2

一、修改hadoop的一些权限配置

root@Master:/usr/local# vim /usr/local/hadoop/etc/hadoop/core-site.xml
  • 加入以下配置
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
  • 重启hdfs:
root@Master:/usr/local/hadoop/sbin# ./stop-dfs.sh
Stopping namenodes on [Master]
Stopping datanodes
Stopping secondary namenodes [Master]
root@Master:/usr/local/hadoop/sbin# ./start-dfs.sh
Starting namenodes on [Master]
Starting datanodes
Starting secondary namenodes [Master]

二、后台启动hiveserver2

root@Master:/usr/local/hadoop/sbin#  nohup hiveserver2 >/dev/null 2>/dev/null &
[2] 7713

三、验证

通过beeline连接并查询

查看10000端口运行正常,beeline命令,!connect连接,结果查询正常

!connect jdbc:hive2://localhost:10000/default
root@Master:/usr/local/hadoop/sbin# netstat -ntulp |grep 10000
tcp        0      0 0.0.0.0:10000           0.0.0.0:*               LISTEN      7388/java
root@Master:/usr/local/hadoop/sbin# beeline
Beeline version 3.1.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/default
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: root
Enter password for jdbc:hive2://localhost:10000/default: *********
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> select * from test;
INFO  : Compiling command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c): select * from test
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.id, type:int, comment:null), FieldSchema(name:test.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c); Time taken: 2.584 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c): select * from test
INFO  : Completed executing command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c); Time taken: 0.004 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+----------+------------+
| test.id  | test.name  |
+----------+------------+
| 1        | jack       |
| 2        | hel        |
| 3        | nack       |
+----------+------------+
3 rows selected (3.194 seconds)
0: jdbc:hive2://localhost:10000/default>

Hiveserver2配置完毕

本地使用SQL Developer连接Hive

工具oracle官网下载即可,免费,驱动到cloudera官网下,在工具第三方驱动处安装

Oracle SQL Developer 下载
Download Hive JDBC Driver 2.5.15

IP为宿主机的IP,端口则是一开始建立容器映射的10000端口

可以看到前面验证的test的表

镜像已上传阿里云:

registry.cn-shenzhen.aliyuncs.com/rango/myhadoop

更多信息请查看 博客 或者关注公众号:Z技术

  • 3
    点赞
  • 30
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
### 回答1: 很抱歉,我是一名语言模型AI,无法亲自进行实际操作。但是,我可以提供一些基于Docker容器快速搭建Hadoop+Spark+Hive+HBase集群的详细过程记录的参考资料,供您参考。 1. 准备工作 首先,需要安装DockerDocker Compose。可以参考官方文档进行安装。 2. 下载镜像 可以从Docker Hub上下载Hadoop、Spark、Hive和HBase的镜像。可以使用以下命令: ``` docker pull bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 docker pull bde2020/hive:2.3.7-postgresql-metastore docker pull bde2020/spark-base:2.4.5-hadoop2.7 docker pull bde2020/spark-master:2.4.5-hadoop2.7 docker pull bde2020/spark-worker:2.4.5-hadoop2.7 docker pull bde2020/hbase:2.2.4-hadoop3.2.1-java8 ``` 3. 编写docker-compose.yml文件 可以编写一个docker-compose.yml文件来定义Hadoop、Spark、Hive和HBase的容器。以下是一个示例: ``` version: '3' services: namenode: image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 container_name: namenode ports: - "9870:9870" volumes: - ./hadoop-data/namenode:/hadoop/dfs/name environment: - CLUSTER_NAME=hadoop-cluster datanode: image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 container_name: datanode volumes: - ./hadoop-data/datanode:/hadoop/dfs/data environment: - CLUSTER_NAME=hadoop-cluster - CORE_CONF_fs_defaultFS=hdfs://namenode:8020 resourcemanager: image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 container_name: resourcemanager ports: - "8088:8088" environment: - CLUSTER_NAME=hadoop-cluster - CORE_CONF_fs_defaultFS=hdfs://namenode:8020 - YARN_CONF_yarn_resourcemanager_hostname=resourcemanager nodemanager: image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 container_name: nodemanager environment: - CLUSTER_NAME=hadoop-cluster - CORE_CONF_fs_defaultFS=hdfs://namenode:8020 - YARN_CONF_yarn_resourcemanager_hostname=resourcemanager historyserver: image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 container_name: historyserver ports: - "8188:8188" environment: - CLUSTER_NAME=hadoop-cluster - CORE_CONF_fs_defaultFS=hdfs://namenode:8020 - YARN_CONF_yarn_resourcemanager_hostname=resourcemanager hive-metastore-postgresql: image: bde2020/hive:2.3.7-postgresql-metastore container_name: hive-metastore-postgresql ports: - "5432:5432" environment: - POSTGRES_PASSWORD=hivepassword - POSTGRES_USER=hiveuser - POSTGRES_DB=hivemetastore spark-master: image: bde2020/spark-master:2.4.5-hadoop2.7 container_name: spark-master ports: - "8080:8080" environment: - SPARK_CONF_spark_master_host=spark-master - SPARK_CONF_spark_eventLog_enabled=true - SPARK_CONF_spark_eventLog_dir=/tmp/spark-events - SPARK_CONF_spark_history_fs_logDirectory=hdfs://namenode:8020/spark-logs - SPARK_CONF_spark_history_ui_port=18080 spark-worker-1: image: bde2020/spark-worker:2.4.5-hadoop2.7 container_name: spark-worker-1 environment: - SPARK_CONF_spark_master_url=spark://spark-master:7077 - SPARK_CONF_spark_worker_cores=2 - SPARK_CONF_spark_worker_memory=2g spark-worker-2: image: bde2020/spark-worker:2.4.5-hadoop2.7 container_name: spark-worker-2 environment: - SPARK_CONF_spark_master_url=spark://spark-master:7077 - SPARK_CONF_spark_worker_cores=2 - SPARK_CONF_spark_worker_memory=2g hbase-master: image: bde2020/hbase:2.2.4-hadoop3.2.1-java8 container_name: hbase-master ports: - "16010:16010" environment: - HBASE_CONF_hbase_regionserver_hostname=hbase-master - HBASE_CONF_hbase_master_hostname=hbase-master hbase-regionserver: image: bde2020/hbase:2.2.4-hadoop3.2.1-java8 container_name: hbase-regionserver environment: - HBASE_CONF_hbase_regionserver_hostname=hbase-regionserver - HBASE_CONF_hbase_master_hostname=hbase-master ``` 4. 启动容器 可以使用以下命令启动容器: ``` docker-compose up -d ``` 5. 验证集群 可以使用以下命令验证集群: ``` docker exec -it namenode bash hdfs dfs -mkdir /test hdfs dfs -ls / exit ``` ``` docker exec -it spark-master bash spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-master:7077 /opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 10 exit ``` ``` docker exec -it hive-metastore-postgresql bash psql -h localhost -U hiveuser -d hivemetastore \dt \q exit ``` ``` docker exec -it hbase-master bash hbase shell create 'test', 'cf' list exit ``` 以上是一个基于Docker容器快速搭建Hadoop+Spark+Hive+HBase集群的详细过程记录。希望对您有所帮助。 ### 回答2: Docker是一种轻量级的虚拟化技术,可以在同一操作系统中运行多个独立的容器,各个容器之间相互隔离。通过Docker容器,快速搭建Hadoop、Spark、Hive和Hbase集群成为了可能。下面是基于Docker容器,快速搭建Hadoop、Spark、Hive和Hbase集群的详细过程记录: 1. 下载Docker技术栈 在运行Docker之前,我们需要先安装DockerDocker Compose。我们可以从官方Docker网站下载DockerDocker Compose: - Docker的下载链接:https://www.docker.com/get-started - Docker Compose的下载链接:https://docs.docker.com/compose/install/ 2. 创建docker-compose.yml文件 在运行Docker之前,我们需要创建一个docker-compose.yml文件,该文件定义了Docker容器的配置和组合。我们将以下容器定义在该文件中: - Hadoop NameNode - Hadoop DataNode - Hadoop ResourceManager - Hadoop NodeManager - Spark Master - Spark Worker - Hive Server - HBase Master 我们可以通过以下命令创建docker-compose.yml文件: ``` version: "2.2" services: namenode: container_name: namenode image: cloudera/quickstart:latest hostname: namenode ports: - "8020:8020" - "50070:50070" - "50075:50075" - "50010:50010" - "50020:50020" volumes: - ~/hadoop-data/namenode:/var/lib/hadoop-hdfs/cache/hdfs/dfs/name environment: SERVICE_PRECONDITION: HDFS_NAMENODE datanode: container_name: datanode image: cloudera/quickstart:latest hostname: datanode ports: - "50075:50075" - "50010:50010" - "50020:50020" volumes: - ~/hadoop-data/datanode:/var/lib/hadoop-hdfs/cache/hdfs/dfs/data environment: SERVICE_PRECONDITION: HDFS_DATANODE resourcemanager: container_name: resourcemanager image: cloudera/quickstart:latest hostname: resourcemanager ports: - "8088:8088" - "8030:8030" - "8031:8031" - "8032:8032" - "8033:8033" environment: SERVICE_PRECONDITION: YARN_RESOURCEMANAGER nodemanager: container_name: nodemanager image: cloudera/quickstart:latest hostname: nodemanager environment: SERVICE_PRECONDITION: YARN_NODEMANAGER sparkmaster: container_name: sparkmaster image: sequenceiq/spark:2.1.0 hostname: sparkmaster ports: - "8081:8081" command: bash -c "/usr/local/spark/sbin/start-master.sh && tail -f /dev/null" sparkworker: container_name: sparkworker image: sequenceiq/spark:2.1.0 hostname: sparkworker environment: SPARK_MASTER_HOST: sparkmaster command: bash -c "/usr/local/spark/sbin/start-worker.sh spark://sparkmaster:7077 && tail -f /dev/null" hiveserver: container_name: hiveserver image: bde2020/hive:2.3.4-postgresql-metastore hostname: hiveserver ports: - "10000:10000" environment: METASTORE_HOST: postgres META_PORT: 5432 MYSQL_DATABASE: hive MYSQL_USER: hive MYSQL_PASSWORD: hive POSTGRES_DB: hive POSTGRES_USER: hive POSTGRES_PASSWORD: hive hbasemaster: container_name: hbasemaster image: harisekhon/hbase hostname: hbasemaster ports: - "16010:16010" - "2181:2181" command: ["bin/start-hbase.sh"] ``` 3. 运行Docker容器 运行Docker容器的第一步是将docker-compose.yml文件放置在合适的路径下。在运行Docker容器之前,我们需要从Docker Hub拉取镜像,并运行以下命令: ``` $ docker-compose up -d ``` 该命令会运行所有定义在docker-compose.yml文件中的容器。 4. 配置集群 在运行Docker之后,我们需要进入相应的容器,例如进入namenode容器: ``` $ docker exec -it namenode bash ``` 我们可以使用以下命令检查Hadoop、Spark、Hive和HBase集群是否正确配置: - Hadoop集群检查: ``` $ hadoop fs -put /usr/lib/hadoop/README.txt / $ hadoop fs -ls / ``` - Spark集群检查: ``` $ spark-shell --master spark://sparkmaster:7077 ``` - Hive集群检查: ``` $ beeline -u jdbc:hive2://localhost:10000 ``` - HBase集群检查: ``` $ hbase shell ``` 5. 关闭Docker容器 在测试完成后,我们可以使用以下命令关闭所有Docker容器: ``` $ docker-compose down --volumes ``` 综上所述,Docker容器是快速搭建Hadoop、Spark、Hive和HBase集群的理想选择。通过docker-compose.yml文件,我们可以轻松配置和管理整个集群。使用这种方法,可以节省大量的时间和精力,并使整个搭建过程更加方便和高效。 ### 回答3: Docker容器是一种轻型的虚拟化技术,能够快速搭建大型分布式系统集群。可以使用Docker容器快速搭建Hadoop,Spark,Hive和HBase集群。下面是基于Docker容器搭建大数据集群的详细过程记录: 1.安装DockerDocker-Compose 首先需要安装DockerDocker-Compose。可以按照官方文档详细教程进行安装。 2.创建Docker文件 创建一个Dockerfile文件用于构建Hadoop,Spark,Hive和HBase的镜像。在该文件内添加以下内容: FROM ubuntu:16.04 RUN apt-get update # Install JDK, Python, and other dependencies RUN apt-get install -y openjdk-8-jdk python python-dev libffi-dev libssl-dev libxml2-dev libxslt-dev # Install Hadoop RUN wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz RUN tar -xzvf hadoop-2.7.7.tar.gz RUN mv hadoop-2.7.7 /opt/hadoop # Install Spark RUN wget http://www.eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz RUN tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz RUN mv spark-2.4.0-bin-hadoop2.7 /opt/spark # Install Hive RUN wget http://www.eu.apache.org/dist/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz RUN tar -zxvf apache-hive-2.3.4-bin.tar.gz RUN mv apache-hive-2.3.4-bin /opt/hive # Install HBase RUN wget http://www.eu.apache.org/dist/hbase/hbase-1.4.9/hbase-1.4.9-bin.tar.gz RUN tar -zxvf hbase-1.4.9-bin.tar.gz RUN mv hbase-1.4.9 /opt/hbase # Set Environment Variables ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64 ENV HADOOP_HOME /opt/hadoop ENV SPARK_HOME /opt/spark ENV HIVE_HOME /opt/hive ENV HBASE_HOME /opt/hbase ENV PATH $PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin # Format HDFS RUN $HADOOP_HOME/bin/hdfs namenode -format 3.创建Docker-Compose文件 创建一个docker-compose文件,里面有一个master节点和两个worker节点。在docker-compose文件中添加以下内容: version: "3" services: master: image: hadoop-spark-hive-hbase container_name: master hostname: master ports: - "22" - "8088:8088" - "8030:8030" - "8031:8031" - "8032:8032" - "9000:9000" - "10020:10020" - "19888:19888" - "50010:50010" - "50020:50020" - "50070:50070" - "50075:50075" volumes: - /data:/data command: - /usr/sbin/sshd - -D worker1: image: hadoop-spark-hive-hbase container_name: worker1 hostname: worker1 ports: - "22" - "50010" - "50020" - "50075" volumes: - /data:/data command: - /usr/sbin/sshd - -D worker2: image: hadoop-spark-hive-hbase container_name: worker2 hostname: worker2 ports: - "22" - "50010" - "50020" - "50075" volumes: - /data:/data command: - /usr/sbin/sshd - -D 4.构建镜像 运行以下命令来构建镜像: docker build -t hadoop-spark-hive-hbase . 5.启动容器 运行以下命令来启动容器: docker-compose up -d 6.测试集群 在浏览器中输入http://IP地址:8088,可以看到Hadoop和YARN的Web控制台。 在浏览器中输入http://IP地址:50070,可以看到HDFS的Web控制台。 在浏览器中输入http://IP地址:8888,可以看到Jupyter Notebook。 在Jupyter Notebook中,创建一个Python文件并运行以下代码来测试Spark集群: from pyspark import SparkContext sc = SparkContext() rdd1 = sc.parallelize(range(1000)) rdd2 = sc.parallelize(range(1000, 2000)) rdd3 = rdd1.union(rdd2) rdd3.take(10) 以上就是基于Docker容器快速搭建Hadoop,Spark,Hive和HBase集群的详细过程记录。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值