Note: MySQL was installed as root, so the Hadoop working user XXX_user must be added to the root group; otherwise Hive initialization fails with:
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Caused by: java.sql.SQLException: Access denied for user 'hive'@'host22VLK1737' (using password: YES)
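One hedged way to apply that fix (a sketch; XXX_user stands in for your actual Hadoop working user):
sudo usermod -a -G root XXX_user   # append the user to the root group
id XXX_user                        # verify the new group membership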
Note: a cluster normally runs a single Hive instance; installing it on the master node is enough. Hive's data is stored on HDFS.
This article covers installing Hive, integrating Hive with Spark, and basic Hive usage examples.
一、Run Modes
Like Hadoop, Hive has three run modes:
1. Embedded mode
Metadata is kept in a local embedded Derby database. This is the simplest way to use Hive, but its drawback is obvious: an embedded Derby database can only access one data file at a time, which means it does not support multiple concurrent sessions.
2. Local mode
Metadata is kept in a separate local database (usually MySQL), which supports multiple sessions and multiple users.
3. Remote mode
This mode suits deployments with many Hive clients. The MySQL database is split out and the metadata is kept in a standalone remote MySQL service, which avoids the redundancy of installing a MySQL service on every client. Install the Hive service on the master node and copy the Hive package to the other nodes; the other nodes can then access the Hive service remotely through the hive client.
二、Download and Installation
Main reference: http://blog.csdn.net/u014695188/article/details/54646575
Other references:
http://blog.csdn.net/yinlei_81/article/details/69569403
http://blog.csdn.net/skywalker_only/article/details/34847043
http://blog.csdn.net/u010454030/article/details/53189943
http://blog.csdn.net/u011086367/article/details/52863345
http://www.jianshu.com/p/6108e0aed204
1、Download page:
http://hive.apache.org/downloads.html
which redirects to http://apache.fayea.com/hive/
apache-hive-2.1.1-bin.tar.gz   2016-12-08 21:45   143M
apache-hive-2.1.1-src.tar.gz   2016-12-08 21:45   19M   (Java source package)
2、Environment variables
/etc/profile
********************
# Hive environment
export HIVE_HOME=/app/data_platform/soft/hive/apache-hive-2.1.1-bin
export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:$PATH
********************
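After editing /etc/profile, reload it and sanity-check that Hive is on the PATH (a quick sketch):
source /etc/profile
which hive      # should print $HIVE_HOME/bin/hive
hive --version  # prints the Hive version if the binaries are visible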
3、Hive configuration
Install path: /app/data_platform/soft/hive/apache-hive-2.1.1-bin
Config file: [hive-site.xml]
Configure as follows:
cd /app/data_platform/soft/hive/apache-hive-2.1.1-bin/conf
cp hive-default.xml.template hive-site.xml
[The following parameters can be left at their defaults]
hive.metastore.warehouse.dir  # the Hive data warehouse directory; defaults to /user/hive/warehouse on HDFS.
hive.exec.scratchdir          # the Hive temporary-file directory; defaults to /tmp/hive on HDFS.
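To confirm the effective values later, the Hive CLI prints a property when you run set without assigning a value (sketch):
hive -e 'set hive.metastore.warehouse.dir; set hive.exec.scratchdir;'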
[hive-env.sh]
cd /app/data_platform/soft/hive/apache-hive-2.1.1-bin/conf
cp hive-env.sh.template hive-env.sh
********************
# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=/app/data_platform/soft/hadoop
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/app/data_platform/soft/hive/apache-hive-2.1.1-bin/conf
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export HIVE_AUX_JARS_PATH=/app/data_platform/soft/hive/apache-hive-2.1.1-bin/lib
********************
4、Create HDFS paths
hadoop fs -ls /
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -mkdir -p /tmp/hive
hadoop fs -chmod 777 /user/hive/warehouse
hadoop fs -chmod 777 /tmp/hive
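A quick verification that the directories exist with the expected permissions (sketch):
hadoop fs -ls /user/hive   # warehouse should show drwxrwxrwx
hadoop fs -ls /tmp         # hive should show drwxrwxrwx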
5、Change the io.tmpdir path
Every value in hive-site.xml that contains ${system:java.io.tmpdir} must be replaced with a real path (in vim, / starts a search: type /hello and press Enter to find "hello"). Create a directory of your own to substitute for it,
for example: /app/data_platform/data/hive
Replace every occurrence of ${system:java.io.tmpdir} in hive-site.xml with your directory.
For a global replace in vim: press Esc, then Shift+:, paste one of the commands below, and press Enter (4 lines are changed in total):
%s#${system:java.io.tmpdir}#/opt/sdg_date/json_data/hive/iotmp#g
%s#${system:java.io.tmpdir}#/app/data_platform/data/hive#g
(The two commands show two example target directories; use whichever matches your environment.)
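If you prefer a non-interactive edit over vim, an equivalent sed one-liner (a sketch, assuming the /app/data_platform/data/hive target directory from above):
mkdir -p /app/data_platform/data/hive
sed -i 's#${system:java.io.tmpdir}#/app/data_platform/data/hive#g' hive-site.xml
grep -c 'system:java.io.tmpdir' hive-site.xml   # should now print 0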
6、Run (embedded mode)
[your_user@XXXX apache-hive-2.1.1-bin]$ ./bin/schematool -initSchema -dbType derby
which: no hbase in (…long PATH listing omitted…)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/data_platform/soft/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.derby.sql
Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
[your_user@XXXX apache-hive-2.1.1-bin]$ rm -rf metastore_db
[your_user@XXXX apache-hive-2.1.1-bin]$ bin/schematool -initSchema -dbType derby
which: no hbase in (…long PATH listing omitted…)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/data_platform/soft/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.derby.sql
Initialization script completed
schemaTool completed
[your_user@XXXX apache-hive-2.1.1-bin]$ bin/hive
Logging initialized using configuration in jar:file:/app/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
hive> create database aaa;
OK
Time taken: 0.188 seconds
hive> create table t1 ( f1 string, f2 int);
OK
Time taken: 0.808 seconds
hive> show tables;
OK
t1
Time taken: 0.199 seconds, Fetched: 1 row(s)
hive> insert into t1 values('aaa',333);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = es_user_20170830153032_252e01ae-3531-476c-969a-fd8d5669fde3
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1504074498349_0001, Tracking URL = http://your_ip:8090/proxy/application_1504074498349_0001/
Kill Command = /app/data_platform/soft/hadoop/bin/hadoop job -kill job_1504074498349_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-30 15:31:16,337 Stage-1 map = 0%, reduce = 0%
2017-08-30 15:31:33,731 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.17 sec
MapReduce Total cumulative CPU time: 3 seconds 170 msec
Ended Job = job_1504074498349_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://XXXX:9000/user/hive/warehouse/aaa.db/t1/.hive-staging_hive_2017-08-30_15-30-32_792_1024599595931100846-1/-ext-10000
Loading data to table aaa.t1
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.17 sec HDFS Read: 4046 HDFS Write: 70 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 170 msec
OK
Time taken: 63.441 seconds
hive> select * from t1;
OK
aaa 333
Time taken: 0.129 seconds, Fetched: 1 row(s)
Errors and issues:
[ERROR] Failed with exception java.io.IOException: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: ${system:user.name}
Fix: in hive-site.xml,
change /opt/sdg_date/json_data/hive/iotmp/${system:user.name}
to:    /opt/sdg_date/json_data/hive/iotmp/${user.name}
Note: Hive ships with an embedded database, Derby, but it has a drawback: it only allows one user session at a time.
7、Configure MySQL as the metastore (local mode / remote mode)
MySQL installation: http://blog.csdn.net/u014695188/article/details/51532410
For installing MySQL and granting privileges, see the earlier article http://qindongliang.iteye.com/blog/2337865
or《【二、大数据环境篇】003、Hive安装(上)- MySql安装》.
vi hive-site.xml
Configure the following parameters.
Add this property at the very top:
<property>
<name>system:user.name</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://your_ip:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf-8</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.schema.verification.record.version</name>
<value>false</value>
<description>
When true the current MS version is recorded in the VERSION table. If this is disabled and verification is
enabled the MS will be unusable.
</description>
</property>
***********************
Replace every remaining occurrence of the ${system:java.io.tmpdir} variable with an absolute path; the embedded-mode setup above already did this.
Finally, remember to copy the MySQL JDBC driver jar into hive/lib:
/app/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/mysql-connector-java-5.1.42-bin.jar
A few notes:
The JDBC URL above sets the database encoding to utf-8; inside the XML file the & character must be escaped as &amp;. The raw URL is:
jdbc:mysql://192.168.10.40:3306/hive?createDatabaseIfNotExist=true&characterEncoding=utf-8
Also, by default Hive looks for the HBase libs; per the original note, Hive may fail to start without HBase, in which case download HBase and set the corresponding HBASE_HOME (all environment variables are listed at the end). In practice the "which: no hbase" message below is only a notice.
Note: when switching the metastore database, the old metastore data must be cleared first:
rm -rf metastore_db
bin/schematool -initSchema -dbType mysql
If the hive database already exists in MySQL, re-initialization may still fail; log in to MySQL as root and drop the hive database.
[your_user@XXX apache-hive-2.1.1-bin]$ bin/schematool -initSchema -dbType mysql
which: no hbase in (…long PATH listing omitted…)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://your_ip:3306/hive?createDatabaseIfNotExist=true&characterEncoding=utf-8
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Error: Duplicate key name 'PCS_STATS_IDX' (state=42000,code=1061)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
【If the host configured in the current hive-site is not allowed to access the MySQL server, log in to MySQL and grant access; see《XXXX hive安装(上)》for details:
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' IDENTIFIED BY 'hive' WITH GRANT OPTION;
FLUSH PRIVILEGES;】
Fix: in MySQL, drop the metastore database and recreate it (the metastore database defined above is named hive).
Log in to the MySQL client: on the Linux command line run mysql -uhive -phive (-u user, -p password; no space between the flag and its value!).
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
+--------------------+
2 rows in set (0.00 sec)
mysql> drop database hive;
Query OK, 57 rows affected (1.85 sec)
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
+--------------------+
1 row in set (0.00 sec)
Re-run the initialization:
[XXXXX@XXXX apache-hive-2.1.1-bin]$ bin/schematool -initSchema -dbType mysql
which: no hbase in (…long PATH listing omitted…)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://your_ip:3306/hive?createDatabaseIfNotExist=true&characterEncoding=utf-8
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
schemaTool completed
8、Run (local mode / remote mode)
[your_user@XXXX apache-hive-2.1.1-bin]$ bin/hive
which: no hbase in (…long PATH listing omitted…)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/data05/opt/data_platform/soft/hive/apache-hive-2.1.1-bin/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
Note: "which: no hbase in" only means HBase is not installed; it is just a notice and does not affect use.
hive> create database test_db;
OK
Time taken: 0.214 seconds
hive> show databases;
OK
default
test_db
Time taken: 0.027 seconds, Fetched: 2 row(s)
hive> use test_db;
OK
Time taken: 0.028 seconds
hive> create table test1(name string, age int);
OK
Time taken: 0.492 seconds
hive> insert into test1 values('tom',20);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = es_user_20170830160012_868d00f4-c775-487e-a934-bced26f22c54
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1504074498349_0002, Tracking URL = http://ES100:8090/proxy/application_1504074498349_0002/
Kill Command = /data05/opt/data_platform/soft/hadoop/bin/hadoop job -kill job_1504074498349_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-30 16:01:08,915 Stage-1 map = 0%, reduce = 0%
2017-08-30 16:01:15,149 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.6 sec
MapReduce Total cumulative CPU time: 2 seconds 600 msec
Ended Job = job_1504074498349_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://ES100:9000/user/hive/warehouse/test_db.db/test1/.hive-staging_hive_2017-08-30_16-00-12_103_8680995249472473972-1/-ext-10000
Loading data to table test_db.test1
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.6 sec HDFS Read: 4103 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 600 msec
OK
Time taken: 64.554 seconds
hive> select * from test1;
OK
tom 20
Time taken: 0.115 seconds, Fetched: 1 row(s)
Start the Hive metastore service [remote mode]:
bin/hive --service metastore 2>&1 >> /tmp/hive_metastore.log &
or
hive --service metastore &
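To confirm the metastore service actually came up, check the process and its Thrift port (9083 by default), a sketch:
ps -ef | grep -i [m]etastore   # the HiveMetaStore process should be listed
netstat -ant | grep 9083       # the port should be in LISTEN state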
Errors and issues:
***************************
Startup errors
【1、Check the MySQL status】
Start MySQL:
service mysqld start  (the service is named mysqld in 5.0)
service mysql start   (the service is named mysql in 5.5.7)
service is a root command, so prefix it with sudo:
[your_user@XXXX bin]$ sudo service mysqld start
Starting mysqld:                        [OK]
[your_user@XXXX bin]$ sudo service mysqld status
mysqld (pid 28955) is running...
mysql -uhive -phive
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
+--------------------+
Note: Hive's own metastore database is not visible from the hive client.
2 rows in set (0.00 sec)
mysql> use hive
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive |
+---------------------------+
| AUX_TABLE |
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_COMPACTIONS |
| COMPLETED_TXN_COMPONENTS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| DELEGATION_TOKENS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| HIVE_LOCKS |
| IDXS |
| INDEX_PARAMS |
| KEY_CONSTRAINTS |
| MASTER_KEYS |
| NEXT_COMPACTION_QUEUE_ID |
| NEXT_LOCK_ID |
| NEXT_TXN_ID |
| NOTIFICATION_LOG |
| NOTIFICATION_SEQUENCE |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
| WRITE_SET |
+---------------------------+
57 rows in set (0.00 sec)
Note: these are Hive's internal metastore tables; we do not need to pay much attention to them.
【2、Connection failure】
Note: in metastore-server mode the metastore service must be started first: hive --service metastore &
Otherwise an error is reported.
【3、Timeout disconnects】
If a database connection (java.sql.Connection) sits idle for wait_timeout seconds, MySQL closes it.
Your Java application's connection pool still legitimately holds a reference to that connection, and the error above appears as soon as the connection is used for a database operation.
vi /etc/my.cnf
Append wait_timeout=1814400
Restart MySQL.
mysql> show global variables like "wait_timeout";
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| wait_timeout | 1814400 |
+---------------+---------+
1 row in set (0.00 sec)
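Restarting mysqld is not strictly required: the same value can also be applied at runtime with SET GLOBAL (it affects new connections only; the my.cnf entry keeps it across restarts). A sketch:
mysql -uroot -p -e "SET GLOBAL wait_timeout=1814400;"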
【4】
hive> show tables;
OK
Failed with exception java.io.IOException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:user.name%7D
Time taken: 0.02 seconds
Fix:
add the following to hive-site.xml:
<property>
<name>system:java.io.tmpdir</name>
<value>/home/grid/apache-hive-2.1.0-bin/tmpdir</value>
</property>
<property>
<name>system:user.name</name>
<value>hive</value>
</property>
9、Calling Hive from other nodes
Copy the Hive directory above directly to the other nodes,
configure /etc/profile and run source /etc/profile,
configure the io.tmpdir path,
and you can then connect straight to the Hive on the master node.
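A sketch of those steps (the host name node2 is a placeholder):
scp -r /app/data_platform/soft/hive/apache-hive-2.1.1-bin node2:/app/data_platform/soft/hive/
ssh node2 'source /etc/profile && hive -e "show databases;"'   # should list the databases created on the master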
10、Integrating Spark with Hive
One branch of Spark SQL is Spark on Hive:
it reuses Hive's HQL parsing, logical-plan generation, and plan optimization,
so you can roughly think of it as replacing only the physical execution plan, from MR jobs to Spark jobs.
The pre-built Spark assembly does not support Hive; to use Hive from Spark you must rebuild it
with the -Phive option, as follows:
cd $SPARK_HOME
./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -DskipTests -Dhadoop.version=2.6.0 -Phive
After the build, three extra jars appear in SPARK_HOME's lib directory:
datanucleus-api-jdo-3.2.6.jar,
datanucleus-core-3.2.10.jar,
datanucleus-rdbms-3.2.9.jar;
these are all required by Hive. If you don't want to build, copy them into lib directly.
Start the Hive metastore service:
bin/hive --service metastore 2>&1 >> /tmp/hive_metastore.log &
Create a hive-site.xml in SPARK_HOME/conf and add just the hive.metastore.uris property
(you can also copy the entire hive-site.xml from HIVE_HOME/conf to Spark's conf directory):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
       http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
--><configuration>
<!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead. -->
<!-- Hive Execution Parameters -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://your_ip:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
</configuration>
Then copy the MySQL driver jar into Spark's lib directory.
Start spark-sql (run bin/spark-sql --help to see the CLI options):
bin/spark-sql --master spark://your_ip:7077 --executor-memory 1g
【Error:】
Mon Feb 13 12:01:42 EST 2017 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:591)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:226)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:366)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:310)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:290)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:266)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:558)
... 9 more
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3367)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3406)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3386)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3640)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:236)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:221)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1652)
... 23 more
Caused by: MetaException(message:Version information not found in metastore. )
at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:7753)
at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:7731)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
at com.sun.proxy.$Proxy21.verifySchema(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:565)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:626)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:416)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6490)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:238)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
... 28 more
Fix: add the following property to hive-site.xml:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in is compatible with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
【Normal startup】:
[XXX@XXXXX spark]$ bin/spark-sql --master spark://XXXXX:7077 --executor-memory 1g
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/sdg_date/json_data/alluxio-spark-getting-started/alluxio-1.5.0-hadoop-2.6/client/spark/alluxio-1.5.0-spark-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data05/opt/data_platform/soft/spark/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Mon Sep 04 16:16:54 CST 2017 Thread[main,5,main] java.io.FileNotFoundException: derby.log (permission denied)
----------------------------------------------------------------
Mon Sep 04 16:16:54 CST 2017:
Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance a816c00e-015e-4bf6-b9c3-00012b5dd0f8
on database directory /tmp/spark-f8fc4bec-3624-43cc-84d2-58f2cd74392a/metastore with class loader sun.misc.Launcher$AppClassLoader@42a57993
Loaded from file:/data05/opt/data_platform/soft/spark/lib/spark-assembly-1.6.3-hadoop2.6.0.jar
java.vendor=Oracle Corporation
java.runtime.version=1.8.0_111-b14
user.dir=/data05/opt/data_platform/soft/spark
os.name=Linux
os.arch=amd64
os.version=2.6.32-642.13.1.el6.x86_64
derby.system.home=null
Database Class Loader started -derby.database.classpath=''
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql>
Check the web UI:
http://XXXXX:8080/ shows the spark-sql application currently running.
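Once spark-sql starts cleanly, a simple way to verify the integration is to query the Hive table created earlier (test_db.test1 from section 8) from the spark-sql prompt:
spark-sql> show databases;
spark-sql> select * from test_db.test1;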
Additional notes:
1) Copy $HIVE_HOME/conf/hive-site.xml to $SPARK_HOME/conf.
It is better to write a separate minimal config, because the stock hive-site.xml has far too many entries:
<property>
<name>hive.metastore.uris</name>
<value>thrift://XXXX:9083</value>
<description>Thrift URI for the remote metastore. Used bymetastore client to connect to remote metastore.</description>
</property>
<property>
<name>hive.server2.thrift.min.worker.threads</name>
<value>5</value>
<description>Minimum number of Thrift worker threads</description>
</property>
<property>
<name>hive.server2.thrift.max.worker.threads</name>
<value>500</value>
<description>Maximum number of Thrift worker threads</description>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
<description>Port number of HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>XXXX</value>
<description>Bind host on which to run the HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST</description>
</property>
2) Add the following to spark-env.sh:
export HADOOP_DIR_CONF=/usr/local/soft/hadoop/etc/hadoop
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/soft/hive/lib/mysql-connector-java-5.1.10.jar
3) Remember to sync all of the above configuration to the other Spark nodes.
Launch options:
1: bin/spark-sql --master spark://hadoop1:7077 --jars /usr/local/soft/hive/lib/mysql-connector-java-5.1.10.jar
2: bin/spark-shell --master yarn-client --jars /usr/local/soft/hive/lib/mysql-connector-java-5.1.10.jar
If everything above is set in the environment variables, just run:
bin/spark-sql or bin/spark-shell
Before judging whether the integration succeeded, make sure that:
Hive itself works,
Hadoop (HDFS, MapReduce, YARN) works,
and MySQL works.
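One-liners for those three checks (a sketch; adjust credentials to your environment):
hadoop fs -ls /                             # HDFS reachable
hive -e "show databases;"                   # Hive works end to end
mysql -uhive -phive -e "select version();"  # metastore MySQL reachable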
三、Basic Usage
For more conceptual background, see the Hive chapters in《004、Spark实战系列-L篇.docx》.
1、database operations
hive> show databases;
hive> create database my_hive;
hive> drop database aaaa;
The following operations all run against the my_hive database:
use my_hive;
2、table operations
2.1 Internal tables
Creating an internal table
Because Hive uses SQL-like syntax, the HQL for creating an internal table only adds row and field delimiters on top of ordinary SQL:
CREATE TABLE SOGOUQ2(DT STRING,WEBSESSION STRING,WORD STRING,S_SEQ INT,C_SEQ INT,WEBSITE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n' ;
The HQL, formatted:
CREATE TABLE SOGOUQ2 (
  DT STRING,
  WEBSESSION STRING,
  WORD STRING,
  S_SEQ INT,
  C_SEQ INT,
  WEBSITE STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n' ;
Hive's default field delimiter is '\001' (Ctrl-A); the statement above declares it explicitly rather than relying on the default.
Note: a custom delimiter must be a single character; something like @#@ is only recognized by its first character, @.
Commonly used delimiters (a sketch using them follows this list):
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
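A hedged sketch of a table that uses all four delimiters (the table name and columns are made up for illustration):
CREATE TABLE t_delim_demo (
  name    STRING,
  hobbies ARRAY<STRING>,      -- items separated by '\002'
  scores  MAP<STRING, INT>    -- entries separated by '\002', key and value by '\003'
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n';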
Dropping an internal table
hive> drop table SOGOUQ2;
Loading external data
Data files can be loaded into a table from HDFS or from the local file system:
- to load a file that is on HDFS, use LOAD DATA INPATH;
- to load a local file, use LOAD DATA LOCAL INPATH.
The default storage path for Hive tables is set by the hive.metastore.warehouse.dir property in ${HIVE_HOME}/conf/hive-site.xml.
When a table is created, a folder named after the table is created under the directory that hive.metastore.warehouse.dir points to; in this walkthrough that is /user/hive/warehouse.
A local data file is copied into the table's directory, while a data file already on HDFS is moved into it: the file disappears from its original path.
The Sogou query-log data can be downloaded from http://www.sogou.com/labs/resource/q.php; the full version is roughly 2 GB. The fields in each record are:
access time \t
user ID \t
[query terms] \t
rank of this URL in the results \t
sequence number of the user's click \t
URL the user clicked.
Sogou issues a per-user download account and password; do not share those credentials with others, or Sogou will revoke your download access.
[es_user@ES100 hive]$ wc -l SogouQ.sample
10000 SogouQ.sample
The \t delimiters in the file are problematic and need fixing; convert them to the new delimiter: sed "s/v-v/^A/g" SogouQ.sample > SogouQ.sample_A
(In a Linux terminal, ^A is typed as (Ctrl+V)+A.)
hive> LOAD DATA LOCAL INPATH '/data05/opt/data_platform/soft/hive/SogouQ.sample_A' INTO TABLE SOGOUQ2;
hive> select * from SOGOUQ2 limit 10;
http://es100:50070/explorer.html#/user/hive/warehouse/my_hive.db/sogouq2
Checking the MR job behind an HQL statement
hive> select count(*) from Sogouq2;
Check the job status:
[es_user@ES100 hive]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /data05/opt/data_platform/soft/hadoop/logs/mapred-es_user-historyserver-xxx.out
[xxxxxx@xxxxx hive]$ jps
12512 AlluxioWorker
31457 DataNode
12643 AlluxioProxy
12324 AlluxioMaster
8844 JobHistoryServer
6384 Worker
31794 ResourceManager
31635 SecondaryNameNode
31924 NodeManager
31317 NameNode
8887 Jps
6265 Master
5662 RunJar
【yarn-site.xml】
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>xxxxx:8090</value>
</property>
Counting rows
hive> select count(*) from Sogouq2;
Rows containing baidu
hive> select count(*) from SOGOUQ2 where WEBSITE like '%baidu%';
Rows ranked 1st in the results and clicked 2nd, with a URL containing baidu
hive> select count(*) from SOGOUQ2 where S_SEQ=1 and C_SEQ=2 and WEBSITE like '%baidu%';
Query-count leaderboard
Group by session ID, order by query count, and show the top 10 sessions:
hive> select WEBSESSION,count(WEBSESSION) as cw from SOGOUQ2 group by WEBSESSION order by cw desc limit 10;
2.2 External tables
Creating a table associated with existing data
hadoop fs -mkdir -p /my_hive/Sougo_data/
hadoop fs -put SogouQ.sample_A /my_hive/Sougo_data/
hadoop fs -lsr /my_hive/Sougo_data/
Creating an external table with an explicit storage directory
Compared with an internal table, an external table adds the EXTERNAL keyword and specifies the folder the table maps to, e.g. LOCATION '/class5/sogouq1':
CREATE EXTERNAL TABLE SOGOUQ1 (
  DT STRING,
  WEBSESSION STRING,
  WORD STRING,
  S_SEQ INT,
  C_SEQ INT,
  WEBSITE STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/my_hive/Sougo_data/';
【Note】Dropping an internal table deletes both the table's metadata and its data files; dropping an external table deletes only the metadata and leaves the data files alone.
Loading data files into the external table's directory
There are two ways to associate a Hive external table with its data files:
- point the external table's location directly at the directory where the data files already live; suitable when the files already exist on HDFS;
- specify a data directory when creating the table, then load data into that directory afterwards.
The table above uses the first way: the source data sits in /my_hive/Sougo_data/ and the external table also points at /my_hive/Sougo_data/, so table and data are associated directly and can be queried right away.
The second way is demonstrated below:
CREATE EXTERNAL TABLE SOGOUQ3 (
  DT STRING,
  WEBSESSION STRING,
  WORD STRING,
  S_SEQ INT,
  C_SEQ INT,
  WEBSITE STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/my_hive/Sougo_data/SOGOUQ3';
The table's directory is created as soon as the external table is created:
hadoop fs -lsr /my_hive/Sougo_data/
drwxr-xr-x - es_user supergroup 0 2017-08-30 18:24 /my_hive/Sougo_data/SOGOUQ3
Load data into that directory:
(1) copy a file up from the local file system into the table's HDFS directory:
hadoop fs -copyFromLocal /data05/opt/data_platform/soft/hive/SogouQ.sample_A /my_hive/Sougo_data/SOGOUQ3
(2) or copy a file from another HDFS path into the table's directory:
hadoop fs -cp /my_hive/Sougo_data/SogouQ.sample_A /my_hive/Sougo_data/SOGOUQ3
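A quick check that the external table now sees the copied file (sketch):
hive> select count(*) from SOGOUQ3;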
Comprehensive case study: querying transaction data
2.3 Partitioned tables
To be completed.
2.4 Buckets
To be completed.