Hive Installation and Configuration


weixin_33860147


Original link: https://my.oschina.net/tearsky/blog/629763


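Background: What Is Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language, called QL, that lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.
Hive is a SQL parsing engine: it translates SQL statements into M/R jobs and runs them on Hadoop.
A Hive table is simply an HDFS directory, with one folder per table name. For a partitioned table, each partition value is a subfolder, and the data can be used directly in M/R jobs.

Hive acts as a client-side tool for Hadoop. It does not have to be deployed on the cluster nodes themselves; it can be placed on any single node.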

I. Installation and Configuration

1. Download Hive

http://hive.apache.org/downloads.html

2. Extract the Hive archive

tar -zxvf apache-hive-1.2.1-bin.tar.gz

3. Go to $HIVE_HOME/conf/ and create the configuration files from the templates
cp hive-env.sh.template hive-env.sh

cp hive-default.xml.template hive-site.xml

Note: Hive only recognizes hive-site.xml. A file named hive-default.xml is not loaded, so the copy must be named hive-site.xml.

4. Configure environment variables

#vim /etc/profile

# Add the HIVE_HOME setting

export HIVE_HOME=/u01/apache-hive-1.2.1-bin

export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HBASE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin

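Hive can then be started in three ways:

1. Hive command-line mode: run the /hive/bin/hive executable directly, or run hive --service cli
Used for command-line queries on Linux. The query syntax is largely similar to MySQL's.
2. Hive web interface: hive --service hwi
Used to access Hive from a browser. In practice it is of limited use.

3. Hive remote service (port 10000): nohup hive --service hiveserver2 &

#/u01/apache-hive-1.2.1-bin/bin/hiveserver2 >/dev/null 2>/dev/null &

The default port is 10000.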

II. Configuration Notes

Configure where tables are stored in HDFS, that is, the mapping between tables and directories:

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>

</property>
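The table-to-directory mapping can be sketched in a few lines of Python. This is only an illustration of the default layout for managed tables, not Hive's own code: the helper name table_location, the table name logs, and the partition column dt are hypothetical. It assumes the usual conventions that table names are stored in lowercase, that a non-default database gets a "<name>.db" directory, and that partition directories are named key=value.

```python
import posixpath


def table_location(warehouse_dir, table, database="default", partitions=None):
    """Return the directory a managed table would use under the warehouse root."""
    parts = [warehouse_dir]
    # Tables in the default database sit directly under the warehouse root;
    # other databases get their own "<name>.db" directory.
    if database.lower() != "default":
        parts.append(database.lower() + ".db")
    # Hive stores table names in lowercase.
    parts.append(table.lower())
    # Each partition column adds one "key=value" directory level.
    for key, value in (partitions or {}).items():
        parts.append(f"{key}={value}")
    return posixpath.join(*parts)


print(table_location("/user/hive/warehouse", "YHJHK_IY02"))
print(table_location("/user/hive/warehouse", "logs", partitions={"dt": "2015-12-17"}))
```

Changing hive.metastore.warehouse.dir only changes the first argument; the rest of the path is derived from the database, table and partition names.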

III. Basic Operations

1. Interacting with the outside world from Hive

Linux shell commands are prefixed with !
!ls;
!pwd;

HDFS commands use dfs
dfs -ls /;

dfs -mkdir /hive;

2. CLI operations

Start the CLI by running the executable directly: #/hive/bin/hive

List databases

hive> show databases;
OK
default

Time taken: 0.061 seconds, Fetched: 1 row(s)

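List tables

hive>show tables;

Create a table named YHJHK_IY02

hive>create table YHJHK_IY02(id int,name string);

Load data (the path after inpath is an HDFS path, with the file system prefix omitted here; the last argument is the table name, and the table must be created beforehand)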

hive> load data inpath '/data/IY02_C.txt' into table YHJHK_IY02;
Loading data to table default.yhjhk_iy02
Table default.yhjhk_iy02 stats: [numFiles=1, totalSize=5581057933]
OK

Time taken: 10.845 seconds

Switch databases

hive>use database_name;

Show the table structure

hive> describe YHJHK_IY02 ;

Rename the table

hive>ALTER TABLE YHJHK_IY02 RENAME TO IY02 ;

Quit

hive>quit;

#hadoop fs -ls /user/hive/warehouse/

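Parameter to change: hive.metastore.warehouse.dir

It controls the mapping between tables and directories.

Note: with the Derby database, if you change to a different directory and then start the Hive client, the tables you just created cannot be found. They only show up when the hive command is run from the directory where they were created, because embedded Derby keeps its metastore_db directory in the current working directory. See problem 2 below for details.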

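3. Create a table and load a txt file from HDFS

The file data looks like this:

386 陆风汽车 88

The first field is the ID, the second is the name, and the third is the parent ID. Fields are separated by tab characters.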

hive> create table brands

> (id int, name string,pid int)

> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'

> STORED AS TEXTFILE;

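This creates a table named brands. Each line of the file is one row, fields are terminated by a tab character (\t, the whitespace produced by pressing the Tab key), and the data is stored as a plain text file.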
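To make the row format concrete, here is a small Python sketch of how one line of such a file maps onto the (id, name, pid) columns. It is an approximation for illustration, not Hive's SerDe code, and the helper names parse_int and parse_brand_line are hypothetical. It assumes the usual behaviour of delimited text tables: a field that cannot be read as an int, or a field that is missing, comes back as NULL (None here).

```python
def parse_int(text):
    """Convert a field to int; return None (NULL) if it is missing or not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_brand_line(line):
    """Split one tab-delimited line into the (id, name, pid) columns of brands."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so that missing trailing fields become NULL.
    fields += [None] * (3 - len(fields))
    raw_id, name, raw_pid = fields[:3]
    return parse_int(raw_id), name, parse_int(raw_pid)


print(parse_brand_line("386\t陆风汽车\t88\n"))
print(parse_brand_line("not-a-number\t陆风汽车"))
```

The second call shows why a wrong delimiter in the table definition tends to produce columns full of NULL rather than an error.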

After this step a brands directory is visible in HDFS, for example with hadoop fs -ls /user/hive/warehouse/

Now load the data

hive>load data inpath '/data/brands.txt' into table brands;

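This puts the brands.txt file from HDFS under Hive's management and maps its contents onto the brands table defined above. The operation moves the file: brands.txt is removed from its original HDFS location and placed in the Hive data warehouse.

Its new location is: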

[bdata@bigdata4 data]$ hdfs dfs -cat /user/hive/warehouse/brands/brands.txt

We can now query the data with Hive's SQL-like statements, for example:

hive> select * from brands;

386     陆风汽车        88
387     斯柯达(进口)    67
388     上海大众斯柯达 67

389 东风雪铁龙 72

hive> select * from brands where id = 399;
OK
399 东风英菲尼迪 73

Time taken: 0.165 seconds, Fetched: 1 row(s)

Aggregations work as well. For example, to compute the sum of all ids:

hive> select sum(id) from brands;
Query ID = bdata_20151217161725_79ff770f-b5de-46e9-a9f4-6869ec8d03ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-17 16:17:27,804 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local298820850_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 26564 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
45227

Time taken: 1.928 seconds, Fetched: 1 row(s)

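There is a clear difference between these two operations: the plain query did not trigger a MapReduce job, while the aggregation did. In day-to-day development, avoid triggering a job whenever possible, because doing so saves time.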
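The reason the aggregation needs a job can be illustrated with a toy map/reduce in Python. This is a conceptual sketch only, not how Hive implements sum(): the function names and the two-way split are assumptions, and the rows are the four shown in the query output above. A plain select can simply stream rows back (Hive decides this through its hive.fetch.task.conversion setting), whereas a sum has to visit every row and combine partial results.

```python
from functools import reduce


def map_partial_sum(split):
    """Map side: each task sums the ids in its own chunk of rows."""
    return sum(row[0] for row in split)


def reduce_total(partials):
    """Reduce side: combine the partial sums into a single total."""
    return reduce(lambda a, b: a + b, partials, 0)


rows = [
    (386, "陆风汽车", 88),
    (387, "斯柯达(进口)", 67),
    (388, "上海大众斯柯达", 67),
    (389, "东风雪铁龙", 72),
]
# Pretend the input file was divided into two splits, one per map task.
splits = [rows[:2], rows[2:]]
total = reduce_total(map_partial_sum(s) for s in splits)
print(total)
```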

Create an external table (the original file is not moved into the warehouse)

Creating an external table requires the external keyword

hive> create external table brands_f(id int,name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/home/external';
OK
Time taken: 0.086 seconds
hive> show tables;
OK
brands
brands_f
tst
Time taken: 0.018 seconds, Fetched: 3 row(s)
hive> dfs -ls /home
drwxr-xr-x - bdata supergroup 0 2015-12-17 17:47 /home/external
hive> load data inpath '/home/external/brands.txt' into table brands_f;
Loading data to table default.brands_f
Table default.brands_f stats: [numFiles=0, totalSize=0]

Check the files after the load

hive> dfs -ls /home/external/;

Found 1 items

-rwxr-xr-x 3 bdata supergroup 3317 2015-12-17 17:48 /home/external/brands_copy_1.txt

As you can see, the file was not moved into the data warehouse

hive> dfs -ls /user/hive/warehouse;
Found 3 items
drwxr-xr-x   - bdata supergroup          0 2015-12-17 16:07 /user/hive/warehouse/brands
drwxr-xr-x   - bdata supergroup          0 2015-12-17 11:56 /user/hive/warehouse/test
drwxr-xr-x   - bdata supergroup          0 2015-12-17 15:44 /user/hive/warehouse/tst

hive>

4. Replace the default Derby database with MySQL for storing metadata

You can delete the entire current contents of hive-site.xml and enter the following instead

[bdata@bdata4 conf]$ vim hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><configuration>

<property>
    <name>hive.exec.scratchdir</name>

<value>hdfs://dfscluster/hive/scratchdir</value>

<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>

</property>

   <property>
    <name>hive.scratch.dir.permission</name>
    <value>733</value>
    <description>The permission for the user specific scratch directories that get created.</description>

</property>

<property>
<name>hive.metastore.warehouse.dir</name>

<value>hdfs://dfscluster/hive/warehouse</value>

<description>location of default database for the warehouse</description>
</property>

<property>
<name>hive.querylog.location</name>

<value>hdfs://dfscluster/hive/logs</value>

<description>Location of Hive run time structured log file</description>
</property>

<property>
<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://192.168.8.46:3306/hive?createDatabaseIfNotExist=true</value>

</property>

<property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
</property>

<property>
        <name>javax.jdo.option.ConnectionUserName</name>

<description>db username</description>

</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>

<description>db password</description>

</property>

<property>
<name>hive.metastore.uris</name>

<value>thrift://192.168.10.34:9083</value>

<description>jdbc/odbc connection hive,if mysql must set </description>

</property>

</configuration>
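Hand-edited XML is easy to get wrong, and a single missing tag can keep the file from being parsed. As one possible safeguard, the sketch below generates the same kind of file from a Python dictionary and parses it back to confirm it is well-formed. The helper name build_hive_site is hypothetical; the three properties are taken from the configuration above, and the username and password entries are deliberately left out because their values are not given here.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Build a hive-site.xml document from a {name: value} mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        # ElementTree escapes characters such as & and < in the value.
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


settings = {
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://192.168.8.46:3306/hive?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "hive.metastore.uris": "thrift://192.168.10.34:9083",
}
xml_text = build_hive_site(settings)

# Parsing the result back raises ParseError if the document is malformed.
parsed = ET.fromstring(xml_text)
names = [p.findtext("name") for p in parsed.findall("property")]
print(names)
```

The same ET.fromstring check can be run on an existing hive-site.xml before restarting any Hive service.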

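Special notes:

4.1. When MySQL is used to store the metadata, the database character set must be changed to latin1. Otherwise you will hit the same error as in problem 5.

To change it, you can either create the database manually (for example hive) and choose latin1, or run the hive command on the client so that the hive database is created automatically, then connect to the database with a tool and run the following command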

>alter database hive character set latin1;

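4.2. If we later want to connect to Hive over jdbc/odbc, the metastore Thrift service has to be running, so hive.metastore.uris must be configured. Once the metastore is configured, the metastore service must be started before the Hive client; otherwise a connection error occurs.

Command to start the metastore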

[bdata@bdata4 bin]$ ./hive --service metastore -hiveconf hbase.zookeeper.quorum=bdata1,bdata2,bdata3 -hiveconf hbase.zookeeper.property.clientPort=2181
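Because the client fails if the metastore is not up yet, a startup script may want to wait for the Thrift port before launching anything else. The following Python sketch shows one way to do that with a plain TCP connection test. The function name wait_for_metastore and its retry parameters are assumptions for illustration; a successful connection only shows that something is listening on the port, not that the metastore is healthy.

```python
import socket
import time
from urllib.parse import urlparse


def wait_for_metastore(uri, attempts=10, delay=1.0, timeout=2.0):
    """Return True once a TCP connection to the metastore URI succeeds."""
    parsed = urlparse(uri)  # e.g. thrift://host:9083 -> hostname, port
    for attempt in range(attempts):
        try:
            with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
                return True
        except OSError:
            # Not reachable yet: pause before the next try, unless this was the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

With the configuration above, a script would call wait_for_metastore("thrift://192.168.10.34:9083") and start the hive client only if it returns True.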

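4.3. hive.aux.jars.path lists the jar files needed for HBase integration, so it must be added if we want to integrate with HBase. When collecting these jars, look in HBase's lib directory, because by default they are not present in Hive's lib directory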

<property>
     <name>hive.aux.jars.path</name>
     <value>file:///usr/local/hive/lib/hive-hbase-handler-0.13.1.jar,file:///usr/local/hive/lib/protobuf-java-2.5.0.jar,file:///usr/local/hive/lib/hbase-client-0.96.0-hadoop2.jar,file:///usr/local/hive/lib/hbase-common-0.96.0-hadoop2.jar,file:///usr/local/hive/lib/zookeeper-3.4.5.jar,file:///usr/local/hive/lib/guava-11.0.2.jar</value>
     <description>jar files required for Hive to connect to HBase</description>

</property>
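The value of hive.aux.jars.path is a single comma-separated line of file:// URIs, which is tedious to maintain by hand. A small Python sketch for assembling it from a directory and a list of jar names might look like this. The helper name aux_jars_value is hypothetical, and the two jar names are taken from the property above.

```python
import posixpath


def aux_jars_value(lib_dir, jar_names):
    """Join jar files under lib_dir into a comma-separated list of file:// URIs."""
    if not posixpath.isabs(lib_dir):
        raise ValueError("lib_dir must be an absolute path")
    # "file://" plus an absolute path yields the file:///... form used in the property.
    return ",".join("file://" + posixpath.join(lib_dir, name) for name in jar_names)


value = aux_jars_value(
    "/usr/local/hive/lib",
    ["hive-hbase-handler-0.13.1.jar", "zookeeper-3.4.5.jar"],
)
print(value)
```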

5. Startup commands

Start the metastore (the remote MySQL server must already be running):

[bdata@bdata4 ~]$ hive --service metastore &

When integrated with HBase

[bdata@bdata4 ~]$ hive --service metastore -hiveconf hbase.zookeeper.quorum=bdata1,bdata2,bdata3 -hiveconf hbase.zookeeper.property.clientPort=2181 &

Start HiveServer2:

[bdata@bdata4 ~]$ hiveserver2 &

When integrated with HBase

[bdata@bdata4 ~]$ hiveserver2 -hiveconf hbase.zookeeper.quorum=bdata1,bdata2,bdata3 -hiveconf hbase.zookeeper.property.clientPort=2181 &

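6. Enable Hive on Spark

By default Hive uses Hadoop MapReduce as its execution engine. Here we switch to the Spark engine to speed up computation with Spark's in-memory processing

After starting hive, set the execution engine to spark: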

hive> set hive.execution.engine=spark;

Set the Spark run mode. For Standalone:

hive> set spark.master=spark://bdata4:7077;

Or for YARN

hive>set spark.master=yarn;

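Comparison of runs before and after the change:

Statement executed (looks up a batch of rows by one ID number; total data volume: 20,801,219 rows):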

hive> select * from TABLE_01 where AB12 = '510922197303151981';

A. Execution details with Hadoop MapReduce

Query ID = bdata_20160113161216_63688fae-c556-4e62-97d3-d4dd55f7095f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-01-13 16:12:17,377 Stage-1 map = 0%, reduce = 0%
2016-01-13 16:12:19,380 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local643350913_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 413571707859 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec

(data omitted)

Time taken: 49.32 seconds, Fetched: 7 row(s)

B. Execution details with Spark

Query ID = bdata_20160113162148_30be5045-5913-4798-9773-d9c70272b682
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Spark Job = 2b03fdf1-c4e0-468d-9929-555cdf513183
Status: SENT
Failed to execute spark task, with exception 'java.lang.IllegalStateException(RPC channel is closed.)'

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

Reposted from: https://my.oschina.net/tearsky/blog/629763