Hive Operations

Article link: https://blog.csdn.net/u012045426/article/details/78903765

These are my notes from the Shangxuetang (尚学堂) Hive video course. They mainly cover creating tables, inserting data, and functions in Hive.

I. Table data

The sample data for the person-information table is shown below. Each record has four fields: id, name, likes, and address.
- 1,xiaoming,book-tv-football,beijing:haidian-tianjin:wuqing
- 2,sunjian,tv-football,xian:gaoxin-tianjin:wuqing
- 3,liuyang,book-code-football,henan:xinxiang-liaoning:dalian

II. DDL: several ways to create a table

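1. Create an internal (managed) table
Hive has a rich set of data types, including the array type and the key-value map type used below. In the data above, fields are separated by ','. The likes field is an array whose elements are separated by '-'. The address field is a map whose entries are also separated by '-', with each key and value separated by ':'. Note how these delimiters are declared in the table definition below.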

create table psn1(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
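
To see how the delimiters map onto the sample data, here is a minimal sketch. It assumes the three sample records are saved in a local file at /root/data (the path used later in this post for psn5) and that the file uses ASCII commas as field separators.

```sql
-- Load the sample file into psn1 (the path is an assumption; adjust as needed)
load data local inpath '/root/data' into table psn1;

-- Inspect the parsed rows
select * from psn1;

-- With the delimiters declared above, the first sample row should be parsed as:
--   id      = 1
--   name    = xiaoming
--   likes   = ["book","tv","football"]
--   address = {"beijing":"haidian","tianjin":"wuqing"}
```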

2. Create an external table

create external table psn2(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/psn2';
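
One way to check whether a table is managed or external is to look at its metadata. This is a small sketch, not part of the original notes. If the table was declared with the external keyword, the "Table Type" row should read EXTERNAL_TABLE; otherwise it reads MANAGED_TABLE.

```sql
-- Show detailed metadata, including "Table Type" and "Location"
describe formatted psn1;
describe formatted psn2;
```

The practical difference shows up on drop table: for a managed table Hive deletes both the metadata and the data files, while for an external table only the metadata is removed and the files under the location stay in HDFS.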

3. Create table 3 from a query on another table (useful when creating intermediate tables)

create table psn3
as
select id,name,likes,address from psn1;

4. Create table 4 with the same structure as psn1 (schema only, no data)

create table psn4 like psn1;

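5. Partitions
Create a partitioned table, for example partitioned by sex. A partition column must not have the same name as a regular column of the table.
When data is loaded, it is stored separately for each partition.
(Typical use: when log files grow too large, partition them by time.)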

create table psn5(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

Specify the partition when loading data

load data local inpath '/root/data' into table psn5 partition (sex='boy');
load data local inpath '/root/data' into table psn5 partition (sex='girl');

Query by partition

select * from psn5 where sex='boy';

Add a partition

alter table psn5 add partition (sex='qita');

Drop a partition (for a managed table such as psn5, this also deletes the partition's data)

alter table psn5 drop partition (sex='qita');

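When a table defines several partition columns (they are ordered in the definition and form a hierarchy), every partition column must be specified when loading data.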
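
A minimal sketch of a two-level partition follows. The table psn6 and the age partition column are hypothetical names introduced only for illustration, and /root/data is assumed to hold comma-separated id,name records.

```sql
-- Two partition columns: sex is the outer level, age the inner level
create table psn6(
  id int,
  name string
)
partitioned by (sex string, age int)
row format delimited
fields terminated by ',';

-- Both partition columns must be given when loading;
-- the data lands in a directory such as .../psn6/sex=boy/age=10/
load data local inpath '/root/data' into table psn6 partition (sex='boy', age=10);

-- List the partitions that now exist
show partitions psn6;

-- Filtering on partition columns lets Hive read only the matching directories
select * from psn6 where sex='boy' and age=10;
```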

6. Hive can clean data with regular expressions (RegexSerDe)

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
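
The sketch below shows how such a table might be used. The file path /root/access.log is a hypothetical example, and the log line in the comment is only an illustration of the Apache combined log format that the regex targets. Each capture group of the regex fills one column, in order.

```sql
-- Illustrative input line (Apache combined log format):
-- 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

-- Load a raw log file (hypothetical path)
load data local inpath '/root/access.log' into table apachelog;

-- Look at a few parsed columns
select host, request, status from apachelog limit 5;

-- Count requests per HTTP status code
select status, count(*) as hits
from apachelog
group by status;
```

Lines that do not match the regex typically come back with NULL in every column, so a quick check for NULL hosts can reveal malformed input.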

III. DML

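1. Insert data with load
(load is the most commonly used method; it does not need to be converted into a MapReduce job)
Load local data (the local file is uploaded to HDFS and then placed in the table's directory)

load data local inpath 'local_path' into table psn1;

Load data from HDFS (the HDFS file is moved into the table's directory)

load data inpath 'hdfs_path' into table psn1;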

2. from ... insert (typically used to analyze psn1 and store the result in a result table such as psn2)

from psn1 pvs
insert into table psn2
select pvs.id,pvs.name,pvs.likes,pvs.address;
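
One reason to put from first is that Hive allows several insert clauses to share a single scan of the source table. The following is a hedged sketch of that multi-insert form. Using psn4 as a second target and the filter id > 1 are assumptions made for illustration.

```sql
-- psn1 is scanned once; each insert clause writes to its own target table
from psn1 pvs
insert into table psn2
  select pvs.id, pvs.name, pvs.likes, pvs.address
insert overwrite table psn4
  select pvs.id, pvs.name, pvs.likes, pvs.address
  where pvs.id > 1;
```

insert into appends to the target, while insert overwrite replaces its existing contents.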

3. insert into (converted into a MapReduce job, so it is inefficient and generally not used)

IV. Functions in Hive

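1. Relational operators (A=B, ...)
2. Arithmetic operators (A+B, A%B, ...)
3. Complex-type operators (A[n], A[key], ...)
4. Built-in table-generating functions (UDTF) (there are too many to list, so here are a few links)
Official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
cnblogs (in Chinese):
https://www.cnblogs.com/MOBIN/p/5618747.html
Yiibai tutorial:
http://www.yiibai.com/hive/hive_built_in_functions.html
5. You can also write your own functions and upload them to the server.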
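
The items above can be sketched against the psn1 sample table as follows. The jar path /root/my_udf.jar, the class com.example.hive.MyLower, and the function name my_lower are hypothetical and stand in for whatever custom function you build.

```sql
-- Complex-type operators
select name,
       likes[0]           as first_like,    -- A[n]: array element, zero-based index
       address['beijing'] as beijing_area,  -- A[key]: map value, NULL if the key is absent
       size(likes)        as like_count     -- number of elements in the array
from psn1;

-- explode() is a built-in UDTF: it emits one row per array element
select explode(likes) as like_item from psn1;

-- lateral view pairs each exploded element with the columns of its source row
select name, like_item
from psn1
lateral view explode(likes) t as like_item;

-- Registering a custom function from an uploaded jar (hypothetical names)
add jar /root/my_udf.jar;
create temporary function my_lower as 'com.example.hive.MyLower';
select my_lower(name) from psn1;
```

A temporary function lasts only for the current session, so it has to be registered again after reconnecting.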