Hive Notes

Scathon

Article link: https://blog.csdn.net/qq_31617409/article/details/70751896

Hive installation:
================================================
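1. Unpack the installation package into the target directory.
2. Configure Hive:
Go to the conf folder in the Hive installation directory and run vi hive-site.xml
Enter the following configuration: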
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
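Save and exit.
Before starting Hive, start the Hadoop cluster first, then run the hive command in the bin directory to start Hive.
An exception may occur during startup because the jline jar shipped with Hadoop and the one shipped with Hive are different versions.
Hive's version is usually the newer one, so replace $HADOOP_HOME/share/hadoop/yarn/lib/jline-×.××.jar
with $HIVE_HOME/lib/jline-***.jar. The newer version is compatible with the older one, which resolves
the exception.
3. MySQL installation; the main point is granting privileges.
The statements below let any table in any database be accessed from any IP. (This GRANT ... IDENTIFIED BY form works on MySQL 5.x; MySQL 8.0 removed it, so there the account has to be created with CREATE USER first.)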
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;

Commonly used Hive commands
================================================
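1. Empty a table:
truncate table tableName;
2. Use Hive through a client:
a. Start HiveServer2: bin/hiveserver2
b. Start Beeline: bin/beeline
c. Run !connect jdbc:hive2://localhost:10000, then enter the username and password when prompted.
3. Create an external table: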
create external table table02(id int,name string) row format delimited fields
terminated by "\t" stored as textfile location '/externalTable';
4. Show table information:
--------------------------------------------------------------------------------
a. desc tableName;
The output looks like this:
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| name | string | |
+-----------+------------+----------+--+
----------------------------------------------------------------------------
b. desc extended tableName; (returns richer information, but unformatted)
id int
name string

Detailed Table Information Table(tableName:table02, dbName:hive01, owner:root, createTime:1493063234, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null)], location:hdfs://hadoop1:9000/externalTable, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format=
Time taken: 0.174 seconds, Fetched: 4 row(s)
----------------------------------------------------------------------------
c. desc formatted tableName; (formatted output)
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| id | int | |
| name | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | hive01 | NULL |
| Owner: | root | NULL |
| CreateTime: | Tue Apr 25 03:47:14 CST 2017 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://hadoop1:9000/externalTable | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | transient_lastDdlTime | 1493063234 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | \t |
| | serialization.format | \t |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+

--------------------------------------------------------------------------------
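5. Load data into the external table:
load data local inpath 'filePath' [overwrite] into table tableName;
Without overwrite the file is appended to the table's data; with overwrite the existing data is replaced.
6. Difference between external and internal (managed) tables:
Dropping an external table deletes only the metadata and leaves the data in HDFS; dropping an internal table deletes the metadata together with the data in HDFS.
7. Hive partition operations:
a. Create a partitioned table:
create table tbl_partition(id int,name string) partitioned by (country string)
row format delimited fields terminated by ',';
b. Load data:
load data local inpath 'dataFilePath' into table tableName partition(partitionName="pName1");
load data local inpath 'dataFilePath' into table tableName partition(partitionName="pName2");
//the loaded data ends up in two partitions: pName1 and pName2
For example:
1. //select all data (rows from every partition are returned)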
jdbc:hive2://localhost:10000> select * from tbl_partition;
+-------------------+---------------------+------------------------+--+
| tbl_partition.id | tbl_partition.name | tbl_partition.country |
+-------------------+---------------------+------------------------+--+
| 1 | zhangsan | China |
| 2 | wangwu | China |
| 3 | zhaosi | China |
| 4 | lisi | China |
| 1 | kobe | usa |
| 2 | james | usa |
| 3 | curry | usa |
| 4 | anthony | usa |
+-------------------+---------------------+------------------------+--+
2. jdbc:hive2://localhost:10000> select * from tbl_partition where country='China';
The partition column can be used in queries like an ordinary column:
+-------------------+---------------------+------------------------+--+
| tbl_partition.id | tbl_partition.name | tbl_partition.country |
+-------------------+---------------------+------------------------+--+
| 1 | zhangsan | China |
| 2 | wangwu | China |
| 3 | zhaosi | China |
| 4 | lisi | China |
+-------------------+---------------------+------------------------+--+
3. Add a new partition:
alter table tbl_partition add partition (country = "japan");
1: jdbc:hive2://localhost:10000> show partitions tbl_partition;
+----------------+--+
| partition |
+----------------+--+
| country=China |
| country=japan |
| country=usa |
+----------------+--+
4. Drop a partition:
alter table tbl_partition drop partition (country = "japan");
1: jdbc:hive2://localhost:10000> show partitions tbl_partition;
+----------------+--+
| partition |
+----------------+--+
| country=China |
| country=usa |
+----------------+--+
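
On HDFS, each partition is stored as its own subdirectory named column=value under the table directory, which is why a filter such as country='China' only has to read one directory. The sketch below builds such a path in plain Java. The table directory /warehouse/tbl_partition and the class and method names are hypothetical, chosen only for illustration; the real location depends on the warehouse configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathSketch {
    // Builds <tableDir>/<col>=<value>[/<col>=<value>...] in partition-column order.
    public static String partitionDir(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table directory; a LinkedHashMap keeps the column order stable.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("country", "China");
        System.out.println(partitionDir("/warehouse/tbl_partition", spec));
    }
}
```

A table partitioned by several columns would simply nest one directory level per column, in the order the columns were declared.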

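8. Hive bucketing:
a. Create the table:
create table tbl_buck (id int,name string) clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by ",";
b. Enable bucketing (the first setting is needed on Hive 1.x; Hive 2.0 removed the property and always enforces bucketing):
set hive.enforce.bucketing = true;
//set the number of reducers equal to the number of buckets
set mapreduce.job.reduces=4;
c. insert into tbl_buck select id,name from tbl_test distribute by (id) sort by (id);
This is equivalent to:
insert into tbl_buck select id,name from tbl_test cluster by (id);
d. order by is a global sort and runs in a single reducer even if a reducer count has been set; sort by only sorts within each reducer.
e. In strict mode, order by must be followed by limit, otherwise the query fails:
//switch to strict mode
set hive.mapred.mode=strict;
Running select * from score order by courseId; fails with: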
Error: Error while compiling statement: FAILED: SemanticException 1:29 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'courseId' (state=42000,code=40000)
Running select * from score order by courseId limit 2; returns:
+------------------+-----------------+--------------+--+
| score.studentid | score.courseid | score.score |
+------------------+-----------------+--------------+--+
| 95001 | 1 | 81 |
| 95014 | 1 | 91 |
+------------------+-----------------+--------------+--+
f. Saving query results:
1. //store the query result [distributed by sex and sorted by age] as a new table [newTbl]
create table newTbl as select * from students stu distribute by (stu.sex) sort by (stu.age);
2. //write the query result to the local file system
insert overwrite local directory '/root/hiveData/test' select * from students stu distribute by (stu.sex) sort by (stu.age);
3. //insert the query result into an existing table
insert into newTbl select * from students stu distribute by (stu.sex) sort by (stu.age);
4. //write the query result to HDFS
insert overwrite directory 'hdfsPath' select * from students stu distribute by (stu.sex) sort by (stu.age);
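
For the bucketing in item 8: conceptually, a row lands in a bucket by hashing the clustered by column and taking the result modulo the number of buckets. The sketch below assumes the classic scheme in which an int column hashes to its own value; newer Hive versions may use a different hash function, so treat it as an illustration of the idea and not as Hive's exact code. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit so the modulo is never negative.
    public static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id); // for an int this is the value itself
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups ids into numBuckets lists, one list per bucket (one output file per bucket in Hive).
    public static List<List<Integer>> distribute(int[] ids, int numBuckets) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int id : ids) {
            buckets.get(bucketFor(id, numBuckets)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // With 4 buckets, ids 1..8 are spread by id % 4.
        System.out.println(distribute(new int[] {1, 2, 3, 4, 5, 6, 7, 8}, 4));
    }
}
```

This also shows why the reducer count is set to the bucket count: each reducer can then write exactly one bucket.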
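9. Kinds of join:
inner join
left join === left outer join: every left-table row is returned; right-table columns are filled in where a match exists and are NULL where it does not;
right join === right outer join: the mirror image of the left outer join;
full outer join: rows from both sides are returned, with NULL on the side that has no match;
left semi join: returns only the left-table rows that have a match in the right table, and only left-table columns

10. Older Hive versions (before 0.13) have no exists subquery, but such queries can often be rewritten with left semi join;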
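
A left semi join behaves like an EXISTS filter: a left row is kept, at most once, if at least one right row has the same key, and no right-table columns appear in the result. A plain-Java sketch of that behaviour follows; the row layout (key in position 0) and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SemiJoinSketch {
    // Keeps each left row whose key (row[0]) occurs among the right-side keys.
    public static List<String[]> leftSemiJoin(List<String[]> leftRows, List<String> rightKeys) {
        Set<String> keys = new HashSet<>(rightKeys); // duplicates on the right collapse here
        List<String[]> result = new ArrayList<>();
        for (String[] row : leftRows) {
            if (keys.contains(row[0])) {
                result.add(row); // only left-table columns are emitted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
                new String[] {"1", "zhangsan"},
                new String[] {"2", "wangwu"},
                new String[] {"3", "zhaosi"});
        List<String> right = Arrays.asList("1", "1", "3", "9");
        for (String[] row : leftSemiJoin(left, right)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Unlike an inner join, the duplicate key "1" on the right side does not duplicate the matching left row.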

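11. How to test Hive built-in functions:
a. Create a dual table, e.g. create table dual(dummy string); (the column name is arbitrary)
b. Load data into it (a single row containing one space)
c. Test functions by selecting from it
12. Using a user-defined function (UDF):
① Write Java code that extends the UDF class: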
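package com.hive.customFuncs;

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // Maps a three-digit number prefix to a city name.
    public static Map<String, String> province = new HashMap<>();
    static {
        province.put("136", "beijing");
        province.put("137", "shanghai");
        province.put("138", "shenzhen");
    }

    // Lower-cases a string column; a NULL input stays NULL instead of throwing.
    public String evaluate(String field) {
        return field == null ? null : field.toLowerCase();
    }

    // Looks up the city for the first three digits, or returns "notExist".
    public String evaluate(int phoneNumber) {
        String pnb = String.valueOf(phoneNumber);
        if (pnb.length() < 3) {
            return "notExist"; // too short to contain a prefix
        }
        String tmp = province.get(pnb.substring(0, 3));
        return tmp == null ? "notExist" : tmp;
    }
}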
② Package it as a jar and upload it to the node where Hive runs;
③ Add the jar to the classpath with:
add JAR jarFilePath;
④ Map a function name to the class in the jar:
Command: create temporary function functionName as "com.hive.customFuncs.ToLowerCase";
⑤ Run a test query:
select field,functionName(field),field from tableName;
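
One caveat about the prefix lookup in the UDF above: an 11-digit mobile number does not fit in a Java int (the maximum is 2147483647), so a possible variant takes the number as a string. The sketch below is plain Java without the Hive UDF base class, so the lookup logic can be tried without a Hive installation. The class and method names are hypothetical; the prefix table is the one from the notes.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixLookupSketch {
    // Same prefix-to-city table as in the UDF from the notes.
    private static final Map<String, String> PROVINCE = new HashMap<>();
    static {
        PROVINCE.put("136", "beijing");
        PROVINCE.put("137", "shanghai");
        PROVINCE.put("138", "shenzhen");
    }

    // Returns the city for the first three characters, or "notExist" when
    // the input is null, shorter than three characters, or has an unknown prefix.
    public static String lookup(String phoneNumber) {
        if (phoneNumber == null || phoneNumber.length() < 3) {
            return "notExist";
        }
        String city = PROVINCE.get(phoneNumber.substring(0, 3));
        return city == null ? "notExist" : city;
    }

    public static void main(String[] args) {
        System.out.println(lookup("13612345678"));
        System.out.println(lookup("13900000000"));
    }
}
```

Wrapping this in a class that extends UDF, with an evaluate(String) method delegating to lookup, would follow the same steps ② to ⑤.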
13. Parsing JSON strings with custom logic:
① Java code:
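import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.UDF;
// ObjectMapper is Jackson's; import the one that matches the Jackson version on the
// classpath (org.codehaus.jackson.map.ObjectMapper for 1.x,
// com.fasterxml.jackson.databind.ObjectMapper for 2.x).

public class JSONParser extends UDF {
    public String evaluate(String jsonLine) {
        ObjectMapper mapper = new ObjectMapper();
        try {
            // Bind the JSON line to a bean; the query further down relies on
            // MovieRateBean.toString() returning the fields separated by tabs.
            MovieRateBean bean = mapper.readValue(jsonLine, MovieRateBean.class);
            return bean.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return ""; // the line could not be parsed
    }
}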
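② Package it, upload it to the Hive node, and create the temporary function (called parsejson in the queries below);
③ First create a table to hold the JSON data: create table json_tbl(line string) row format delimited;
Raw data format: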
{"movie":"1191","rate":"5","timestamp":"978300450","uid":"1"}
{"movie":"1180","rate":"2","timestamp":"978300920","uid":"1"}
{"movie":"1133","rate":"3","timestamp":"978300120","uid":"1"}
{"movie":"1115","rate":"1","timestamp":"978300238","uid":"1"}
{"movie":"1923","rate":"4","timestamp":"978303340","uid":"1"}
After loading the data, the table contains:
+----------------------------------------------------------------+--+
| json_tbl.line |
+----------------------------------------------------------------+--+
| {"movie":"1191","rate":"5","timestamp":"978300450","uid":"1"} |
| {"movie":"1180","rate":"2","timestamp":"978300920","uid":"1"} |
| {"movie":"1133","rate":"3","timestamp":"978300120","uid":"1"} |
| {"movie":"1115","rate":"1","timestamp":"978300238","uid":"1"} |
| {"movie":"1923","rate":"4","timestamp":"978303340","uid":"1"} |
+----------------------------------------------------------------+--+
④ The first pass still returns each record as a single column, as shown below; the values are not yet split into separate attributes:

+---------------------+--+
| _c0 |
+---------------------+--+
| 1191 5 978300450 1 |
| 1180 2 978300920 1 |
| 1133 3 978300120 1 |
| 1115 1 978300238 1 |
| 1923 4 978303340 1 |
+---------------------+--+
Store each attribute in its own column:
create table mv_rate as select
split(parsejson(line),'\t')[0] as movieid,
split(parsejson(line),'\t')[1] as rate,
split(parsejson(line),'\t')[2] as timestring,
split(parsejson(line),'\t')[3] as uid
from json_tbl;
Result:
+------------------+---------------+---------------------+--------------+--+
| mv_rate.movieid | mv_rate.rate | mv_rate.timestring | mv_rate.uid |
+------------------+---------------+---------------------+--------------+--+
| 1191 | 5 | 978300450 | 1 |
| 1180 | 2 | 978300920 | 1 |
| 1133 | 3 | 978300120 | 1 |
| 1115 | 1 | 978300238 | 1 |
| 1923 | 4 | 978303340 | 1 |
+------------------+---------------+---------------------+--------------+--+
Each field is now stored in its own column.
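
MovieRateBean is not shown in these notes. Because the query splits the result of parsejson on '\t', its toString() presumably joins the four fields with tabs. A hypothetical version is sketched below: the field names are taken from the JSON keys, and everything else (the String types, the getters and setters that Jackson's default bean binding relies on) is an assumption.

```java
public class MovieRateBean {
    // One field per JSON key; all values arrive as strings in the sample data.
    private String movie;
    private String rate;
    private String timestamp;
    private String uid;

    public String getMovie() { return movie; }
    public void setMovie(String movie) { this.movie = movie; }

    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }

    public String getTimestamp() { return timestamp; }
    public void setTimestamp(String timestamp) { this.timestamp = timestamp; }

    public String getUid() { return uid; }
    public void setUid(String uid) { this.uid = uid; }

    // Tab-separated so that split(..., '\t') in the query can pick out each field by index.
    @Override
    public String toString() {
        return movie + "\t" + rate + "\t" + timestamp + "\t" + uid;
    }
}
```

The field order in toString() has to match the indexes used in the create table mv_rate statement: 0 for the movie id, 1 for the rating, 2 for the timestamp and 3 for the user id.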
