Big Data Platform Learning Series (3): Installing hive-2.3.4 and Compiling spark-2.3.3

1. Background

In the previous post, Hadoop was installed and passwordless SSH was configured. This post records the process of installing Hive and compiling Spark.

2. Files to Prepare

ubuntu 16.04 

http://releases.ubuntu.com/16.04/ubuntu-16.04.6-desktop-amd64.iso.torrent?_ga=2.96890143.1440843407.1553350287-1855693555.1552535409

hadoop-2.8.5 cluster

http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz

spark-2.3.3 source package

https://archive.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-sources.tgz (when I tried this link it returned "The requested URL /dist/spark/spark-2.3.3/spark-2.3.3-bin-sources.tgz was not found on this server." If it does not work for you either, download from the Tsinghua mirror instead: https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3.tgz)

hive-2.3.4 official binary release

https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz

maven 3.5.4 (optional)

https://archive.apache.org/dist/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz

scala-2.11.8 (use this exact version, not the latest one)

https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

3. Installing Hive

I set up Hive on the data1 node.

Run the following commands on data1:

$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
$ tar zxvf apache-hive-2.3.4-bin.tar.gz -C /usr/local/
$ mv /usr/local/apache-hive-2.3.4-bin /usr/local/hive-2.3.4

Edit the Hive configuration (under hive/conf). The properties in the first group below need to be added; the ones in the second group already exist in the file and only need to be modified, and searching for the property name will find each one. The file is commented.

The database connection username and password need to be changed to your own.

$ cd /usr/local/hive-2.3.4/conf
$ cp hive-default.xml.template hive-site.xml
$ gedit hive-site.xml
<!-- The following properties need to be added -->
<property>
    <name>system:java.io.tmpdir</name>
    <value>/usr/local/hadoop-2.8.5/tmp</value>
    <description/>
  </property>
<property>
    <name>system:user.name</name>
    <value>hadoop</value>
  </property>
<property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
    <description/>
  </property>
<property>
    <name>spark.home</name>
    <value>/usr/local/spark-2.3.3</value>
    <description/>
  </property>
<property>
    <name>spark.master</name>
    <value>yarn-client</value>
  </property>
<property>
    <name>spark.default.parallelism</name>
    <value>6</value>
  </property>
<property>
    <name>spark_worker_cores</name>
    <value>2</value>
  </property>
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://master:9000/spark-jars/*</value>
  </property>
<property>
    <name>spark.submit.deployMode</name>
    <value>client</value>
  </property>
<property>
    <name>spark.executor.memory</name>
    <value>512m</value>
    <description/>
  </property>
<property>
    <name>spark.executor.instances</name>
    <value>20</value>
    <description/>
  </property>
<property>
    <name>spark.driver.memory</name>
    <value>512m</value>
    <description/>
  </property>
<property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
    <description/>
  </property>
<property>
    <name>spark.executor.extraJavaOptions</name>
    <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"</value>
  </property>
<property>
    <name>spark.eventLog.dir</name>
    <value>hdfs://master:9000/spark-log</value>
    <description/>
  </property>
<property>
    <name>hive.enable.spark.execution.engine</name>
    <value>true</value>
  </property>
<!-- End of properties to add -->
<!-- The following properties already exist and need to be modified (search for each name to find it) -->
<property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>
      max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is
      negative, Hive will use this one as the max number of reducers when automatically determine number of reducers.
    </description>
  </property>
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.0.11:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>192.168.0.11</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
<property>
    <name>hive.server2.webui.host</name>
    <value>192.168.0.11</value>
    <description>The host address the HiveServer2 WebUI will listen on</description>
  </property>
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>
      Expects one of [mr, tez, spark].
      Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
      remains the default engine for historical reasons, it is itself a historical engine
      and is deprecated in Hive 2 line. It may be removed without further warning.
    </description>
  </property>
<!-- End of properties to modify -->

Since the Hive metadata will be stored in MySQL, install mysql-server:

$ sudo apt-get install mysql-server 

Set a root password when prompted during installation.

Then create the Hive metastore database in MySQL (in my example the root password is 123123):

$ mysql -uroot -p123123
# paste the following statements directly into the mysql shell
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY '123123';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'192.168.0.11' IDENTIFIED BY '123123';
flush privileges;
create database hive character set utf8 collate utf8_general_ci;
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'192.168.0.11' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive';
flush privileges;
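
Before moving on, you can verify that the hive account works by logging in with it and listing its databases (a quick sanity check of my own, not part of the original steps):

$ mysql -uhive -phive -e "show databases;"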

Edit the environment variables:

$ vim ~/.bashrc
# add the following variables
export HIVE_HOME=/usr/local/hive-2.3.4
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/hive-2.3.4/lib/*
$ source ~/.bashrc
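
As a quick check (my addition, not in the original), confirm that the hive command now resolves from the PATH:

$ hive --version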

Add mysql-connector-java.jar:

$ wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar
$ cp mysql-connector-java-5.1.47.jar /usr/local/hive-2.3.4/lib/mysql-connector-java.jar
$ sudo cp mysql-connector-java-5.1.47.jar ${JAVA_HOME}/lib/

Initialize the Hive metastore schema in MySQL:

$ schematool -initSchema -dbType mysql
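
Optionally, you can confirm the schema was created; schematool has an -info switch that reports the connection URL and schema version (this check is my addition):

$ schematool -dbType mysql -info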

Next, compile Spark.

4. Preparing to Compile Spark (spark-2.3.3)

Download the spark-2.3.3 source:

$ wget https://archive.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-sources.tgz 

If the link reports that the resource does not exist, download from https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3.tgz instead; if that also fails, wait for the mirrors to be fixed.

Extract the file into the home directory:

$ tar xvf spark-2.3.3.tgz -C ~/

Add an environment variable pointing at the build directory:

 export SPARK_HOME=/home/hadoop/spark-2.3.3

Download scala-2.11.8:

$ wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ tar xvf scala-2.11.8.tgz -C /usr/local/

Add the Scala environment variables:

$ vim ~/.bashrc
export SCALA_HOME=/usr/local/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
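
After editing, reload the shell configuration and confirm the Scala version (my addition):

$ source ~/.bashrc
$ scala -version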

5. Compiling Spark

$ cd ~/spark-2.3.3
~/spark-2.3.3$ ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.8.5 -Phive -Phive-thriftserver -DskipTests clean package

Note: run the build from the directory that contains pom.xml, otherwise the build will stop because of missing definitions.
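
If the build dies with an out-of-memory error, a common remedy (my suggestion, not part of the original post) is to give Maven more heap before re-running the command above:

$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"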

If zinc-0.3.15 downloads too slowly, you can download it in a browser and copy the file into /home/hadoop/spark-2.3.3/build/.

The same goes for scala-2.11.8.

After build/mvn finishes downloading Maven, compilation starts, but dependency downloads are very slow because the default repositories are overseas. Switch to a domestic mirror (Aliyun).

Edit settings.xml under /home/hadoop/spark-2.3.3/build/apache-maven-3.3.9/conf:

$ cd /home/hadoop/spark-2.3.3/build/apache-maven-3.3.9/conf
$ gedit settings.xml

Add the mirrors (inside the <mirrors> section):

       <mirror>
           <id>alimaven</id>
           <mirrorOf>central</mirrorOf>
           <name>aliyun maven</name>
           <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
       </mirror>
   
       <!-- Central repository mirror 1 -->
       <mirror>
           <id>repo1</id>
           <mirrorOf>central</mirrorOf>
           <name>Human Readable Name for this Mirror.</name>
           <url>http://repo1.maven.org/maven2/</url>
       </mirror>
   
       <!-- Central repository mirror 2 -->
       <mirror>
           <id>repo2</id>
           <mirrorOf>central</mirrorOf>
           <name>Human Readable Name for this Mirror.</name>
           <url>http://repo2.maven.org/maven2/</url>
      </mirror>

Then run the build command above again to continue compiling; it takes quite a while.

Once the build succeeds, run the following to generate the distribution package:

~/spark-2.3.3$ ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-2.7 -Dhadoop.version=2.8.5 -Pyarn -Phive -Phive-thriftserver

Partway through you may hit ImportError: No module named setuptools.

setuptools is required; installing pip is enough to pull it in:

$ sudo apt-get install python-pip

When the build finishes, the distribution tarball can be found in the spark-2.3.3 folder. Extract it to /usr/local/:

$ tar zxvf ~/spark-2.3.3/spark-2.3.3-bin-custom-spark.tgz -C /usr/local/
$ mv /usr/local/spark-2.3.3-bin-custom-spark /usr/local/spark-2.3.3

Edit the configuration files under the conf folder:

$ cd /usr/local/spark-2.3.3/conf
$ cp spark-env.sh.template spark-env.sh
$ cp spark-defaults.conf.template spark-defaults.conf
$ cp slaves.template slaves
$ cp log4j.properties.template log4j.properties

Edit spark-env.sh and add the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/local/spark-2.3.3
export SCALA_HOME=/usr/local/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop-2.8.5
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
#export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin
#export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_MASTER_INSTANCES=10
#export SPARK_WORKER_MEMORY=2048m
export SPARK_DRIVER_MEMORY=1024m
export SPARK_YARN_AM_MEMORY=512 
export SPARK_MASTER_HOST=192.168.0.10
export SPARK_EXECUTOR_MEMORY=1024m
export SPARK_WORKER_CORES=2
export SPARK_LIBRARY_PATH=hdfs://192.168.0.10:9000/spark-jars
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_WORKER_DIR=$SPARK_HOME/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=$SPARK_HOME/logs
export SPARK_PID_DIR='/usr/local/spark-2.3.3/run'
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/spark-2.3.3/jars/mysql-connector-java.jar

Edit spark-defaults.conf and add the following:

spark.master                     yarn-client
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://master:9000/spark-log
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" -Dsun.io.serialization.extendedDebugInfo=true
spark.yarn.jars                  hdfs://master:9000/spark-jars/*
spark.driver.extraJavaOptions    -Dsun.io.serialization.extendedDebugInfo=true

Edit slaves:

master
data1
data2

Edit log4j.properties:

Change log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console
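
The same change can be made in one line if you prefer (my shortcut, not in the original):

$ sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' log4j.properties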

Add the environment variables (and remove the Spark variables set earlier):

$ vim ~/.bashrc
export SPARK_HOME=/usr/local/spark-2.3.3
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-2.8.5/bin/hadoop classpath)
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
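
Reload the configuration and make sure the newly built Spark is the one being picked up (an extra check on my part):

$ source ~/.bashrc
$ spark-submit --version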

Copy mysql-connector-java.jar into spark/jars:

$ cp ~/mysql-connector-java-5.1.47.jar /usr/local/spark-2.3.3/jars/mysql-connector-java.jar

Copy conf/hive-site.xml from the Hive directory into spark/conf/ and distribute it to the other nodes:

$ cp /usr/local/hive-2.3.4/conf/hive-site.xml /usr/local/spark-2.3.3/conf/
$ scp /usr/local/hive-2.3.4/conf/hive-site.xml hadoop@master:/usr/local/spark-2.3.3/conf/
$ scp /usr/local/hive-2.3.4/conf/hive-site.xml hadoop@data2:/usr/local/spark-2.3.3/conf/

Distribute the spark and scala folders to the other nodes:

$ scp -r /usr/local/spark-2.3.3 hadoop@data1:/usr/local/
$ scp -r /usr/local/scala-2.11.8 hadoop@data1:/usr/local/
$ scp -r /usr/local/spark-2.3.3 hadoop@data2:/usr/local/
$ scp -r /usr/local/scala-2.11.8 hadoop@data2:/usr/local/

Copy some of Spark's jars into Hive's lib directory so Hive can use the Spark engine:

$ cp /usr/local/spark-2.3.3/jars/spark-* /usr/local/hive-2.3.4/lib/
$ cp /usr/local/spark-2.3.3/jars/scala-* /usr/local/hive-2.3.4/lib/

Start the Hadoop cluster by running the following on the master node:

$ start-all.sh
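
To confirm the daemons came up, run jps on each node (my addition); the master should show NameNode and ResourceManager, and the data nodes should show DataNode and NodeManager:

$ jps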

Create the HDFS directories referenced by the configuration above:

$ hdfs dfs -mkdir  /spark-jars
$ hdfs dfs -mkdir  /tmp
$ hdfs dfs -mkdir /spark-log
$ hdfs dfs -chmod -R 777 /tmp

Upload the jars from spark/jars to HDFS so Spark does not have to upload them from the local machine on every run:

$ hdfs dfs -put /usr/local/spark-2.3.3/jars/* /spark-jars/

These jars are used constantly, so keep 3 replicas spread across the 3 nodes (rack awareness):

$ hdfs dfs -setrep -w 3 /spark-jars/
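
The new replication factor shows up in the second column of a directory listing (another optional check of mine):

$ hdfs dfs -ls /spark-jars/ | head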

Start the Hive metastore on the data node (data1):

$ hive --service metastore 
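
The command above keeps the metastore in the foreground. If you would rather run it in the background (my variation, not from the original), something like this works:

$ nohup hive --service metastore > ~/metastore.log 2>&1 &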

If a warning about conflicting SLF4J bindings appears at startup, deleting the log4j-slf4j-impl-* jar from Hive's lib directory makes it go away.

hive on spark 

Using hive:

$ hive
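
As a quick smoke test (the table name here is made up for illustration), run a statement that forces a job and confirm it executes as a Spark job on YARN:

hive> create table if not exists smoke_test(id int);
hive> insert into smoke_test values (1);
hive> select count(*) from smoke_test;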

 

Start hiveserver2:

$ hiveserver2
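
Once hiveserver2 is up, you can connect to it with beeline on the default port 10000 (host and user taken from the configuration above; adjust to your setup):

$ beeline -u jdbc:hive2://192.168.0.11:10000 -n hadoop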

 

To stop hiveserver2, just press Ctrl+C to kill the process.

spark-sql 

$ spark-sql --master yarn

A warning about the same package appearing in different versions shows up; it can be ignored.

There will also be many warnings about parameters that do not exist; these can be ignored as well.

Once the shell prompt appears, statements execute without errors.
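
For example, the shared metastore should be visible from the spark-sql prompt (a quick check of my own):

spark-sql> show databases;
spark-sql> show tables;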

Using pyspark:

The thrift server needs to be started first with start-thriftserver.sh:

$ start-thriftserver.sh --master yarn

Then use pyspark with the HiveContext to query data from the tables:

$ pyspark --master yarn
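
Inside the PySpark shell the built-in SparkSession (which replaces the old HiveContext in Spark 2.x) can query Hive directly; for example (smoke_test is the illustrative table from earlier, substitute any table you have):

>>> spark.sql("show databases").show()
>>> spark.sql("select * from smoke_test").show()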

 

At this point everything basically runs without problems.

The next post will cover compiling Hue and configuring it for use.

If you have questions, leave a comment and I will try to answer promptly. Thanks!

If you found this useful, give it a like!
