Installing Hive on Spark

Hive is a data warehouse built on Hadoop: HDFS provides its storage, and MapReduce serves as the execution engine for its SQL. Because MapReduce writes intermediate results to disk at many stages, it falls behind engines such as Spark in both speed and flexibility, which is why Hive on MapReduce is often disappointingly slow. This post walks through Hive on Spark: replacing Hive's execution engine with Spark brings a large speedup.

1. Environment preparation

CentOS 6.5
Hadoop 2.6 cluster (HDFS and YARN are required)
Hive 2.0.0
Spark 1.5 source code
Maven 3.5 (install it yourself)
JDK 1.8 (install it yourself)
Scala 2.10 (install it yourself)

2. Building Spark with Maven

Download the Spark 1.5 source from the official site. Hive on Spark needs a Spark build that does not bundle the Hive jars, so build a "without-hive" distribution by running the following in the source root:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"

This produces spark-1.5.0-bin-hadoop2-without-hive.tgz.

3. Installing the Hadoop 2.6 cluster

1) Set up passwordless SSH login and the hostnames:

ssh-keygen -t rsa
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop

2) Unpack the Hadoop tarball.

3) Configure the environment variables:

export JAVA_HOME=/usr/local/jdk1.8.0_121/
export JAVA_BIN=$JAVA_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
HADOOP_HOME=/home/hadoop/apps/hadoop-2.6.0-cdh5.5.2
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_HOME HADOOP_CONF_DIR PATH

source .bash_profile

4) Edit core-site.xml:

vi core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://weisc:9000</value>
    <description>NameNode URI.</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>Size of read/write buffer used in SequenceFiles.</description>
  </property>
</configuration>

5) Edit hdfs-site.xml, creating the local storage directories first:

[hadoop@h201 hadoop-2.6.0]$ mkdir -p dfs/name
[hadoop@h201 hadoop-2.6.0]$ mkdir -p dfs/data
[hadoop@h201 hadoop-2.6.0]$ mkdir -p dfs/namesecondary
[hadoop@h201 hadoop]$ vi hdfs-site.xml

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>weisc:50090</value>
  <description>The secondary namenode http server address and port.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/apps/hadoop-2.6.0-cdh5.5.2/dfs/name</value>
  <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/apps/hadoop-2.6.0-cdh5.5.2/dfs/data</value>
  <description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///home/hadoop/apps/hadoop-2.6.0-cdh5.5.2/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

6) Edit mapred-site.xml:

[hadoop@h201 hadoop]$ cp mapred-site.xml.template mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>weisc:10020</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>weisc:19888</value>
  <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

Note: the property mapreduce.framework.name selects the runtime framework used to execute MapReduce jobs. It defaults to local and must be changed to yarn here.

7) Edit yarn-site.xml:

[hadoop@h201 hadoop]$ vi yarn-site.xml

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>weisc</value>
  <description>The hostname of the RM.</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Shuffle service that needs to be set for Map Reduce applications.</description>
</property>

8) Point hadoop-env.sh at the JDK 1.8 installation configured above:

[hadoop@h201 hadoop]$ vi hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_121

9) List the slave nodes:

[hadoop@h201 hadoop]$ vi slaves
h202
h203

10) Format the NameNode:

bin/hadoop namenode -format
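After formatting, the HDFS and YARN daemons still need to be started before the cluster can be used. A minimal sketch, assuming the stock start scripts under $HADOOP_HOME/sbin and working passwordless SSH to the slaves:

# Start HDFS (NameNode, DataNodes, SecondaryNameNode) and YARN
sbin/start-dfs.sh
sbin/start-yarn.sh

# Each node should now list its daemons
jps

# HDFS should report the live DataNodes
bin/hdfs dfsadmin -report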
4. Installing mysql-server

Install and start MySQL, then create the hive user and the metastore database. Remote login for the hive user must be enabled:

yum -y install mysql-server
service mysqld start
mysql

mysql> create user 'hive' identified by 'hive';
mysql> grant all privileges on *.* to hive@'%' identified by 'hive' with grant option;
mysql> flush privileges;
mysql> create database hive;

5. Installing Hive 2.0

1) Edit conf/hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
  <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/local</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/hive/resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
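Before initializing the schema in the next step, the MySQL JDBC driver has to be on Hive's classpath, and a quick login test catches credential problems early. A minimal sketch — the connector jar name and version here are assumptions; any MySQL Connector/J 5.x jar dropped into $HIVE_HOME/lib works:

# Jar name/version assumed; use whichever Connector/J 5.x jar you downloaded
cp mysql-connector-java-5.1.38-bin.jar $HIVE_HOME/lib/

# Sanity-check that the hive user can log in with the configured password
mysql -u hive -phive -e "show databases;"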
2) Initialize the metastore database:

bin/schematool -initSchema -dbType mysql

6. Installing Scala 2.10.1

tar -zxvf scala-2.10.1.tgz

Edit .bash_profile and update the environment variables:

export SCALA_HOME=/apps/scala-2.10.1
PATH=$HADOOP_HOME/bin:$PATH:$SCALA_HOME/bin

Check the installation with scala -version.

7. Installing Spark

Unpack the spark-1.5.0-bin-hadoop2-without-hive.tgz built in step 2 (here under /apps/spark, matching spark.home below) and verify it:

bin/run-example org.apache.spark.examples.SparkPi

Then add the following to hive-site.xml:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>hive.enable.spark.execution.engine</name>
  <value>true</value>
</property>
<property>
  <name>spark.home</name>
  <value>/apps/spark</value>
</property>
<!-- SparkContext settings -->
<property>
  <name>spark.master</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>2g</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>1g</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>2</value>
</property>
<property>
  <name>spark.executor.instances</name>
  <value>4</value>
</property>
<property>
  <name>spark.app.name</name>
  <value>myInceptor</value>
</property>
<!-- transaction-related settings -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>spark.executor.extraJavaOptions</name>
  <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

8. Running Hive

Launch the hive CLI and run a query against an existing table; with the configuration above it executes as a Spark job on YARN instead of MapReduce:

select count(*) from test;
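To confirm the engine switch end to end, it helps to create a small table and watch the first query get submitted as a Spark application. A minimal sketch of such a session; the table name t1 and the sample rows are placeholders:

hive

hive> set hive.execution.engine;
hive.execution.engine=spark
hive> create table t1 (id int);
hive> insert into table t1 values (1), (2), (3);
hive> select count(*) from t1;

The first query starts a Spark application on YARN, which shows up in the ResourceManager web UI (port 8088 by default). If Hive instead complains that it cannot find the Spark classes, the usual remedy for this version pairing is to copy or symlink the spark-assembly jar from the Spark build into $HIVE_HOME/lib.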