Basic Environment Setup
First, we need to set up the runtime environment. All the software used in this article runs on top of Java, so install the JDK on your machine and configure the environment variables (see the reference link). (The author has uploaded all the relevant software to a network drive: download link, extraction code: 8cmm.)
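For reference, here is a minimal sketch of doing this from the command line (the JDK path is only an example; substitute your own installation directory, and note that variables set with setx only take effect in newly opened cmd windows):
rem Set JAVA_HOME for the current user (example path; adjust it to your own JDK directory, with no spaces)
setx JAVA_HOME "C:\Java\jdk1.8.0_111"
rem Add %JAVA_HOME%\bin to PATH via the System Properties dialog (setx can truncate a long PATH)
rem Open a NEW cmd window, then verify:
java -version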
Notes:
- JDK 1.8 is recommended; 1.9 will throw errors
- None of the installation paths may contain spaces, especially the Java installation path
- spark-shell requires a Scala runtime, so install Scala before installing Spark: download Scala from the network drive, unzip it to your hard drive, create an environment variable named SCALA_HOME whose value is the unzipped directory, and add %SCALA_HOME%\bin to PATH
- Hadoop must be started before Hive, otherwise Hive cannot connect on port 9000 (a couple of quick checks for these last two points are sketched after this list)
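A sketch of those checks; scala -version assumes %SCALA_HOME%\bin is already on PATH, and the port check only makes sense once Hadoop has been started as described later:
rem Confirm the Scala runtime is reachable from PATH
scala -version
rem After Hadoop is running, confirm that something is listening on port 9000
netstat -ano | findstr :9000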
Spark Installation
Download the Spark package (the version on the network drive is spark-2.4.3-bin-hadoop2.7) and unzip it to your hard drive. Then create an environment variable named SPARK_HOME whose value is the unzipped directory, and add %SPARK_HOME%\bin to PATH.
Test: open a cmd window and run the spark-shell command.
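If spark-shell fails to start, the following quick checks may help (a sketch, run in a cmd window opened after the variables above were configured):
rem Each variable should echo a real path rather than the literal %...% text
echo %JAVA_HOME%
echo %SCALA_HOME%
echo %SPARK_HOME%
rem spark-submit sits in the same bin directory and prints the Spark version banner (2.4.3 for the network-drive package)
spark-submit --version
rem spark-shell itself should end at a "scala>" prompt; typing spark.version there should also print 2.4.3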
Hadoop Installation
Download the Hadoop package (the version on the network drive is hadoop-2.8.4.tar) and unzip it to your hard drive. Then download winutils-master.zip and unzip it as well.
Next, proceed as follows:
- Delete the bin and etc directories under the hadoop directory, and copy the bin and etc directories from hadooponwindows-master into the hadoop directory
- Edit the etc/hadoop/core-site.xml file
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Edit the etc/hadoop/mapred-site.xml file
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Edit the etc/hadoop/hdfs-site.xml file (first create data/namenode and data/datanode folders under the Hadoop installation directory)
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/xxx(your installation directory)/hadoop-2.8.4/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/xxx(your installation directory)/hadoop-2.8.4/data/datanode</value>
</property>
</configuration>
- Edit the etc/hadoop/yarn-site.xml file
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
- Edit the etc/hadoop/hadoop-env.cmd file
@rem set JAVA_HOME=%JAVA_HOME%
set JAVA_HOME=xxx(your Java installation directory)\jdk1.8.0_111
- Create an environment variable named HADOOP_HOME whose value is the unzipped directory, and add %HADOOP_HOME%\bin to PATH
- Format the namenode: open a cmd window and run hdfs namenode -format
- Start Hadoop: open a cmd window, change into the sbin directory and run start-all.cmd; four windows should open: namenode, datanode, yarn resourcemanager, and yarn nodemanager
At this point Hadoop is up and running (a short smoke test is sketched below).
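A sketch of that smoke test, assuming the four windows above are still open (the web UI ports mentioned in the comments are the Hadoop 2.x defaults):
rem jps should list NameNode, DataNode, ResourceManager and NodeManager
jps
rem Creating and listing a directory in HDFS confirms the NameNode on port 9000 is reachable
hdfs dfs -mkdir -p /tmp
hdfs dfs -ls /
rem The NameNode web UI is at http://localhost:50070 and the YARN UI at http://localhost:8088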
Hive Installation
Download the Hive package (the version on the network drive is apache-hive-2.1.1-bin.tar) and unzip it to your hard drive. Create an environment variable named HIVE_HOME whose value is the unzipped directory, and add %HIVE_HOME%\bin to PATH.
Because Hive needs a relational database to store its metadata, we need to install MySQL first (the version on the network drive is mysql-5.6.36-winx64); for the installation steps, see the blog post on configuring the no-install (zip) edition of MySQL 5.6.13.
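Once the MySQL service is running, it is worth confirming that you can log in with the account Hive will use (the hive-site.xml example below uses root/root); a minimal check:
rem Log in with the metastore account and print the server version
mysql -u root -p -e "SELECT VERSION();"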
- Copy the hive-default.xml.template file in the conf folder, rename the copy to hive-site.xml, and update the settings for the metastore database: search for each property by its name and replace the value with your own configuration.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
<description>password to use against metastore database</description>
</property>
- Next, fix the settings whose values contain variables of the form "${system:...}". hive-site.xml references such variables in many places, but they cause errors on Windows, so each one needs to be replaced with a concrete path. (The folders referenced below can simply be created.)
<property>
<name>hive.exec.local.scratchdir</name>
<value>xxx(your Hive installation directory)/scratch_dir</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>xxx(your Hive installation directory)/resources_dir/${hive.session.id}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>xxx(your Hive installation directory)/querylog_dir</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>xxx(your Hive installation directory)/operation_dir</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
In addition, the paths configured for hive.metastore.warehouse.dir and hive.user.install.directory have to be created in HDFS, as sketched below.
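This sketch assumes the default values are kept (typically /user/hive/warehouse and /user respectively); adjust the paths to match whatever your hive-site.xml actually contains:
rem Hadoop must be running; the paths must match the values in hive-site.xml
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user
hdfs dfs -ls /user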
- Copy the hive-log4j2.properties.template file, rename the copy to hive-log4j2.properties, and change its contents as follows
status = INFO
name = HiveLog4j2
packages = org.apache.hadoop.hive.ql.log
# list of properties
property.hive.log.level = INFO
property.hive.root.logger = DRFA
property.hive.log.dir = hive_log
property.hive.log.file = hive.log
property.hive.perflogger.log.level = INFO
# list of all appenders
appenders = console, DRFA
# console appender
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{ISO8601} %5p [%t] %c{2}: %m%n
# daily rolling file appender
appender.DRFA.type = RollingRandomAccessFile
appender.DRFA.name = DRFA
appender.DRFA.fileName = ${hive.log.dir}/${hive.log.file}
# Use %pid in the filePattern to append <process-id>@<host-name> to the filename if you want separate log files for different CLI session
appender.DRFA.filePattern = ${hive.log.dir}/${hive.log.file}.%d{yyyy-MM-dd}
appender.DRFA.layout.type = PatternLayout
appender.DRFA.layout.pattern = %d{ISO8601} %5p [%t] %c{2}: %m%n
appender.DRFA.policies.type = Policies
appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy
appender.DRFA.policies.time.interval = 1
appender.DRFA.policies.time.modulate = true
appender.DRFA.strategy.type = DefaultRolloverStrategy
appender.DRFA.strategy.max = 30
# list of all loggers
loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX, PerfLogger
logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn
logger.NIOServerCnxn.level = WARN
logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO
logger.ClientCnxnSocketNIO.level = WARN
logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = ERROR
logger.Datastore.name = Datastore
logger.Datastore.level = ERROR
logger.JPOX.name = JPOX
logger.JPOX.level = ERROR
logger.PerfLogger.name = org.apache.hadoop.hive.ql.log.PerfLogger
logger.PerfLogger.level = ${hive.perflogger.log.level}
# root logger
rootLogger.level = ${hive.log.level}
rootLogger.appenderRefs = root
rootLogger.appenderRef.root.ref = ${hive.root.logger}
- Create a hive database in MySQL and run the scripts\metastore\upgrade\mysql\hive-txn-schema-2.1.0.mysql.sql script against it (the exact commands are sketched at the end of this section)
- Copy the MySQL driver jar into $HIVE_HOME/lib (the version on the network drive is mysql-connector-java-5.1.46)
- With Hadoop running, open a cmd window and run the following command to start the metastore
hive --service metastore -hiveconf hive.root.logger=DEBUG
Then open a new cmd window and run the following command to start hiveserver2
hive --service hiveserver2
Finally, you can open the Hive CLI
hive --service cli
- Run a test statement
show databases;
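As referenced in the MySQL step above, here is a sketch of preparing the metastore database and then checking hiveserver2 from a second client. It assumes the root/root credentials from hive-site.xml, that the commands are run from %HIVE_HOME%, and the default hiveserver2 port 10000:
rem Create the metastore database and load the transaction schema shipped with Hive 2.1.1
mysql -u root -p -e "CREATE DATABASE hive;"
mysql -u root -p hive < scripts\metastore\upgrade\mysql\hive-txn-schema-2.1.0.mysql.sql
rem With hiveserver2 running, beeline (also in %HIVE_HOME%\bin) can run the same test query over JDBC
beeline -u jdbc:hive2://localhost:10000 -n root -e "show databases;"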
Summary
At this point, the pseudo-distributed Spark + Hadoop + Hive test environment on Windows is complete. Feel free to get in touch if you run into any problems.