Hive Source Code Analysis and Development in Practice: Course Notes (Instructor: Jia Jie)

燃烧的岁月_

Article link: https://blog.csdn.net/china_demon/article/details/51821428

Hive in Practice
Contents
Hadoop ecosystem
Log analysis system
Introduction to Hive
Common Hive shell options
hive -e
hive -f
hive -v
hive -i
hive -S

Setting up the Hive environment
Basic Hive usage
-----------------------------------------------------------------------------
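Log analysis system: workflow
Data collection => data cleaning => data storage and management => data analysis => data presentation

Hadoop log analysis system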

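Introduction to Hive
What is Hive?
Hive is a data warehouse tool built on Hadoop.
It maps structured data files to database tables and provides SQL-like (HQL) queries.
It translates SQL statements into MapReduce jobs for execution.
It can be used for extract, transform, load (ETL) work.

Advantages and disadvantages
Low cost and quick to learn.
Simple MapReduce statistics can be written as SQL-like statements, with no need to develop a dedicated MapReduce application.
Real-time queries are not supported.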

Installing Hive
Prerequisites
JDK 1.6+
Hadoop
1.x, 2.x
Hive release package
0.9, ..., 0.13
MySQL
mysql-connector-java

Download the archive
Extract it with tar -xzf into a directory, for example /home/hive-0.9
Configure environment variables
vi /etc/profile
export HIVE_HOME=/home/hive-0.9
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile

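Edit the Hive configuration file
Copy $HIVE_HOME/conf/hive-default.xml.template
to hive-site.xml in the same directory

Edit the contents of hive-site.xml (optional)
hive.metastore.warehouse.dir
hive.querylog.location

In a terminal, type hive and press Enter
show tables; (end the command with a semicolon, then press Enter)
Output: OK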

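Hive metadata storage
Derby
Single session only
Creates the metadata files in the directory where the terminal was started
Cannot be shared by multiple users

MySQL
Install MySQL; configure an account and its privileges
Copy mysql-connector-java-5.1.22-bin.jar into the lib directory of the Hive installation
Edit hive-site.xml

MySQL settings (property name, then value)
javax.jdo.option.ConnectionURL
jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true

javax.jdo.option.ConnectionDriverName
com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName
root
javax.jdo.option.ConnectionPassword
12345

In the Hive CLI:
>>show tables;
>>create table test1(name string);

Then check the metastore database from the shell:
>>mysql -uroot -p
>>show databases;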

Hive Client
Ways to access Hive
CLI
HWI (Hive Web Interface)
hive --service hwi
http://localhost:9999/hwi
HiveServer
hive --service hiveserver
Hive-JDBC
private static String
HiveDriver="org.apache.hadoop.hive.jdbc.HiveDriver";
private static String
url="jdbc:hive://192.168.1.198:10000/default";
private static String name = "";
private static String password="";

Class.forName(HiveDriver);
Connection conn =
DriverManager.getConnection(url,name,password);
Statement stat = conn.createStatement();
String sql = "show tables";

Start HiveServer
>>hive --service hiveserver
It listens on 192.168.1.198:10000 (a Thrift port, not a web page)

Check that the port is open:
netstat -ano | grep 10000

HiveJdbc.java

package com.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbc {
    // Driver class and URL scheme for the original HiveServer
    private static final String HIVE_DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";
    private static final String URL = "jdbc:hive://127.0.0.1:10000/default";
    private static final String NAME = "";
    private static final String PASSWORD = "";

    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // Load the Hive JDBC driver; fails fast if the jar is missing from the classpath
        Class.forName(HIVE_DRIVER);
        String sql1 = "show tables";
        String sql2 = "select * from import_stock_d limit 1000"; // alternative query, not run here
        Connection conn = DriverManager.getConnection(URL, NAME, PASSWORD);
        try {
            Statement stat = conn.createStatement();
            ResultSet rs = stat.executeQuery(sql1);
            // Print the first column of every row (the table names)
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } finally {
            conn.close(); // always release the connection
        }
    }
}
---------------------------------------------------------------------------------------------
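For Hive 1.2 and later, hive --service hiveserver is no longer available; use the hiveserver2 command directly. With HiveServer2 the JDBC driver class is org.apache.hive.jdbc.HiveDriver and the URL scheme is jdbc:hive2://.
In the Linux shell:
[root@hadoop0 ~]# hiveserver2 &

hadoop dfsadmin -safemode leave (takes HDFS out of safe mode)
set hive.cli.print.current.db=true; (shows the current database name in the prompt)
set hive.cli.print.header=true; (prints column headers in query results)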

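-----------------------------------Hive table operations----------------------------------------------
Hive data types
Hive file formats
Creating Hive tables
Operating on Hive tables
Hive table partitions
Querying Hive tables

Basic Hive usage: data types
Primitive data types
tinyint, smallint, int, bigint, boolean, float, double, string, binary, timestamp, decimal, char, varchar, date
-------------------------------------------------------------------------
RCFile: splits rows into row groups and stores each column's values contiguously within a group, so queries that read only a few columns run faster.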
----------------------------------------------------------------------------------------------------
Basic Hive usage: tables
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

create [external] table employees(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
row format delimited fields terminated by '\t' -- field delimiter; the default is \001 (Ctrl-A)
lines terminated by '\n' stored as textfile; -- row delimiter; the default is \n

-- managed (internal) table example
create table testtable(
name string comment 'name value',
address string comment 'address value'
)
row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
load data local inpath '/home/data/data' overwrite into table testtable;

show tables;
desc testtable;
desc extended testtable;
desc formatted testtable;
-- drop a table
drop table tablename;
-- show the statement that created a table
show create table tablename;

-- external table example
create external table if not exists employees(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
row format delimited
fields terminated by '\t' -- delimiter between field values
collection items terminated by ',' -- delimiter between elements of a collection-typed field
map keys terminated by ':' -- delimiter between key and value inside a map entry
lines terminated by '\n'
stored as textfile
location '/warehouse/employee';

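Data format (fields separated by tabs):
wang 123 a1,a2,a3 k1:1,k2:2,k3:3 s1,s2,s3,4
liu 123 a4,a5,a6 k1:1,k2:2,k3:3 s1,s2,s3,4
zhang 123 a7,a8,a9 k1:1,k2:2,k3:3 s1,s2,s3,4

select subordinates[1] from employees; -- the array element at index 1 (indexes start at 0)
select deductions["k2"] from employees; -- the map value for key k2
select address.city from employees; -- the city member of the struct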
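
To make the delimiter settings concrete, here is a minimal plain-Java sketch (not Hive code) that splits one sample row the way the table definition describes: tab between fields, comma between collection items, colon between map key and value. The class name EmployeeRowDemo is hypothetical, and treating the struct as a comma-separated list of members is an assumption based on the sample data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for what the text SerDe does with one line of the employees file.
final class EmployeeRowDemo {
    final String name;
    final float salary;
    final List<String> subordinates;     // array<string>
    final Map<String, Float> deductions; // map<string,float>
    final String[] address;              // struct<street,city,state,zip>

    EmployeeRowDemo(String line) {
        String[] fields = line.split("\t");                  // fields terminated by '\t'
        name = fields[0];
        salary = Float.parseFloat(fields[1]);
        subordinates = Arrays.asList(fields[2].split(","));  // collection items terminated by ','
        deductions = new LinkedHashMap<>();
        for (String entry : fields[3].split(",")) {
            String[] kv = entry.split(":");                  // map keys terminated by ':'
            deductions.put(kv[0], Float.parseFloat(kv[1]));
        }
        address = fields[4].split(",");                      // struct members, in declaration order
    }

    public static void main(String[] args) {
        EmployeeRowDemo row =
            new EmployeeRowDemo("wang\t123\ta1,a2,a3\tk1:1,k2:2,k3:3\ts1,s2,s3,4");
        System.out.println(row.subordinates.get(1));  // corresponds to subordinates[1]
        System.out.println(row.deductions.get("k2")); // corresponds to deductions["k2"]
        System.out.println(row.address[1]);           // corresponds to address.city
    }
}
```

The three print statements mirror the three select statements above: an array index, a map lookup and a struct member access.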

Other ways to create a table
Create a table with the same schema as another table
create table test3 like test2;

Create a table from a query on another table
create table test4 as select name,addr from test5;

Reading Hive tables stored in different file formats
stored as textfile
view the file directly on HDFS
hadoop fs -text
stored as sequencefile
hadoop fs -text
stored as rcfile
hive --service rcfilecat path
stored as inputformat 'class'
outputformat 'class'

create table test_text(name string,val string) stored as textfile;
create table test_seq(name string,val string) stored as sequencefile;

Note: the custom input/output format below was never fully implemented.
Custom outputformat and inputformat
--------------inputformat-----------------------------------------------------------------
UDInputFormat.java
package com.zyf.hive.inputformat;

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hive works with the old org.apache.hadoop.mapred API, so FileSplit must come from that package too
public class UDInputFormat<K, V> extends FileInputFormat<K, V> {
    @Override
    public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new UDRecordReader<K, V>(job, (FileSplit) split);
    }

    // For MapFile directories, read the data file inside the directory
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        FileStatus[] files = super.listStatus(job);
        for (int i = 0; i < files.length; i++) {
            FileStatus file = files[i];
            if (file.isDir()) {
                Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
                FileSystem fs = file.getPath().getFileSystem(job);
                files[i] = fs.getFileStatus(dataFile);
            }
        }
        return files;
    }
}

---------------------------------------------------------------------------------
UDRecordReader.java
package com.zyf.hive.inputformat;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class UDRecordReader<K, V> extends SequenceFileRecordReader<K, V> {
    public UDRecordReader(Configuration conf, FileSplit split) throws IOException {
        // Hand the split to the parent reader so it can open the sequence file
        super(conf, split);
    }
}
-----------------------------------------------------
-----------------------------------------------------
add jar /root/dev_store/UDInputFormat.jar;

drop table testinputformat;
create table if not exists testinputformat(
name string comment 'name value',
addr string comment 'addr value'
)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as inputformat 'com.zyf.hive.inputformat.UDInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
location '';

load data local inpath '/home/data/data' into table testtable;

add jar path;
add jar /home/data/UDInputFormat.jar;
-----------------------------------------------------------------------------------
Hive SerDe
SerDe is short for "Serializer and Deserializer".
Hive uses a SerDe (together with a FileFormat) to read and write table rows.
The read and write paths are:
HDFS file --> InputFileFormat --> <key,value> --> Deserializer --> Row object
Row object --> Serializer --> <key,value> --> OutputFileFormat --> HDFS file

create table apachelog(
t_host STRING,
t_identity STRING,
t_user STRING,
t_time STRING,
t_request STRING,
t_status STRING,
t_size STRING,
t_referer STRING,
t_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*)"
)STORED AS TEXTFILE;

load data local inpath '/root/dev_store/apache.access.log' overwrite into table apachelog;

load data local inpath '/root/dev_store/apache.access.2.log' overwrite into table apachelog;
select t_host from apachelog;

add jar /usr/hadoop/apache-hive-1.2.1-bin/lib/hive-contrib-1.2.1.jar;

drop table apachelog;
create table if not exists apachelog(
t_host STRING,
t_identity STRING,
t_user STRING,
t_time STRING,
t_request STRING,
t_status STRING,
t_size STRING,
t_referer STRING,
t_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")STORED AS TEXTFILE;

Data:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
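
A regex SerDe pattern can be checked outside Hive before loading data. The sketch below uses java.util.regex to test two patterns against the sample log line: the simple space-delimited pattern from the first table definition, and a bracket- and quote-aware pattern with nine groups. The class name AccessLogRegexDemo is hypothetical, and requiring a full-line match is an assumption about how the SerDe applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccessLogRegexDemo {
    // Simple pattern: every field is a run of non-space characters.
    static final Pattern SIMPLE = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*)");

    // Nine groups; the timestamp may be bracketed, the request may be quoted,
    // and the last two groups (referer, agent) are optional.
    static final Pattern QUOTED = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
        + "(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?");

    static final String SAMPLE =
        "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326";

    public static void main(String[] args) {
        // The simple pattern cannot cope with the spaces inside [...] and "..."
        System.out.println("simple pattern matches: " + SIMPLE.matcher(SAMPLE).matches());
        Matcher m = QUOTED.matcher(SAMPLE);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println("group " + i + " = " + m.group(i)); // absent optional groups print null
            }
        }
    }
}
```

A row whose line does not match the pattern would come back with NULL columns, so a quick check like this saves a load-and-query round trip.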

---------------------------------------------------------------------------------------------------

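Hive partitioned tables
Partitions
A Hive select normally scans the whole table, which wastes a lot of time on unnecessary work when only part of the data is needed.
A partitioned table declares its partition columns at creation time; each partition is stored separately, so a query that filters on a partition column reads only the matching partitions.

Partition syntax
create table tablename(
name string
)
partitioned by (key type,...)
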
Partitioned table examples
create table employee(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string, city:string, state:string, zip:int>
)
partitioned by (dt string,type string)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n' stored as textfile;

create table if not exists employees(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (dt string,type string)
row format delimited
fields terminated by '\t' -- delimiter between field values
collection items terminated by ',' -- delimiter between elements of a collection-typed field
map keys terminated by ':' -- delimiter between key and value inside a map entry
lines terminated by '\n'
stored as textfile;

>>desc formatted employees;

Add a partition
alter table employees add if not exists partition(dt='xxx',type='yyyy');

Drop a partition
alter table employees drop if exists partition(dt='xxx',type='yyyy');

-- show partitions
show partitions employees;
---------------------------------------------------------------------------------------------------------
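Hive bucketing
Buckets
Each table or partition can be further organized into buckets; a bucket is a finer-grained division of the data.
Hive buckets on a specific column.
Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a row goes into.

Benefits
More efficient query processing.
More efficient sampling.
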
create table bucketed_user
(
id int,
name string
)
clustered by(id) sorted by(name) into 4 buckets
row format delimited fields terminated by '\t' stored as textfile;
Set hive.enforce.bucketing = true;
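
The hash-then-remainder rule can be sketched in a few lines of plain Java. The class name BucketDemo is hypothetical. The sketch assumes that the hash of an int column is the value itself and that the sign bit is cleared before the modulo; treat both as assumptions and check them against the Hive version in use.

```java
final class BucketDemo {
    // Bucket index for an int key: hash the value, clear the sign bit, take the remainder.
    static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id); // for an int this is just the value
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // matches "into 4 buckets" in bucketed_user
        for (int id = 1; id <= 8; id++) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, numBuckets));
        }
    }
}
```

With 4 buckets, ids 1 and 5 land in the same bucket, which is why a query or join on id only needs to look at one bucket file per key.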

Basic Hive usage: queries
Basic syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_name
[WHERE where_condition]

-----------------------------------------------------------------------------------------
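*Hive data operations*

Ways to run Hive commands
Hive data operations
Hive dynamic partitions
Hive advanced queries
Hive buckets
Hive indexes
Hive views

Covered earlier:
cli, jdbc, hwi
beeline runs on the client side and connects over JDBC
The CLI shell in detail
hive -help
hive --help
list, source
Note: command scripts must be run on a cluster node or on a Hive client machine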

hive -e "select * from test3" > /home/cloudera/data
hive -S -e "select * from test3" > /home/cloudera/data
hive -v -e "select * from test3" > /home/cloudera/data
hive -f "/home/data/hql/select_hql"

test.sh
#!/bin/bash
time=''
hive -e "select * from testtext where name='${time}'"
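Hive operations: variables
Configuration variables
set val='';
${hiveconf:val}

Environment variables
${env:HOME} (run env in the shell to list all environment variables)

set val=wer;
set val;
select * from testtext where name = '${hiveconf:val}';
select '${env:HOME}' from testtext;

-- \N is how Hive stores NULL in text files by default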

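Things to watch when loading data into Hive
Delimiters: by default a delimiter is a single character only
Data types must correspond
With load, a field that cannot be converted to the column type is returned as NULL by queries
With insert ... select, a value that cannot be converted to the column type is inserted as NULL

With insert ... select, values must be in the same order as the table's columns; the names may differ
Hive does not validate data at load time; it checks at query time
An external partitioned table shows data only after its partitions have been added

Loading data into Hive
Loading data into external tables
Specify the data location when creating the table
create external table tablename() location ''
Insert from a query, same as for managed tables
Copy data to the table location with Hadoop commands (run from the Hive shell or the Linux shell)

Loading data into partitioned tables
Managed and external partitioned tables
A managed partitioned table is loaded much like a managed table
An external partitioned table is loaded much like an external table
Note: the directory layout of the data must match the table's partitions;
if the partition has not been added to the table, queries return nothing
even when data already exists under the target path
Difference
When loading, specify the partition as well as the target table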

Loading data into Hive partitioned tables
Loading local data
load data local inpath 'localpath' [overwrite] into table tablename partition(pn='')

Loading HDFS data
load data inpath 'hdfspath' [overwrite] into table tablename partition(pn='')

Loading data from a query
insert {overwrite | into} table tablename partition(pn='')
select col1,col2 from table where ...
----------------------------------------------------------------
-- Data export and dynamic partitions

Exporting Hive data
Export methods
Hadoop commands
get
text
INSERT ... DIRECTORY
insert overwrite [local] directory '/tmp/ca_employees'
[row format delimited fields terminated by '\t']
select name,salary,address from employees

example1:
insert overwrite local directory '/home/data3'
row format delimited fields terminated by '\t'
select name,addr from testtext;

Shell commands with pipes: hive -f/-e | sed/grep/awk > file
Third-party tools
----------------------------------------------------------------------
hadoop fs -text /warehouse/testtext/*

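Hive dynamic partitions
Dynamic partitioning
No need to write a separate insert statement for each partition
Use it when the partitions are not known in advance and must be derived from the data

Key parameters
set hive.exec.dynamic.partition=true; -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- non-strict mode
In strict mode at least one partition column must be static, and static partition columns must come first (for example, with three partition columns, one static and two dynamic, the static one goes first)

set hive.exec.max.dynamic.partitions.pernode=10000; -- maximum number of dynamic partitions created per node
set hive.exec.max.dynamic.partitions=100000; -- maximum number of dynamic partitions created in total
set hive.exec.max.created.files=1500000; -- maximum number of files one job may create
set dfs.datanode.max.xcievers=8192; -- upper limit on the number of files open at one time

create table d_part(
name string
)
partitioned by (value string)
row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;

show partitions d_part;

select * from d_part;

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;

insert overwrite table d_part partition(value)
select name,addr as value from testtext;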
-------------------------------------------------------------------------------------------------------------
Table property operations
Rename a table
alter table table_name rename to new_table_name;
Rename or modify a column
alter table tablename change column c1 c2 int comment 'xxxxxxxxxx'
after severity; -- places the column after the named column; use first to make it the first column
Add columns
alter table tablename add columns(c1 string comment 'xxxxxxxxxx',c2 bigint comment 'yyyyyyyyyy');

alter table test change column type type string after name;
alter table test change column type type string first;
alter table test change column type col2 int;

Modify tblproperties
alter table table_name set tblproperties(property_name=property_value,property_name=property_value,...)
Modify serdeproperties (the statement differs for unpartitioned and partitioned tables)
Unpartitioned table
alter table table_name
set serdeproperties('field.delim'='\t');
Partitioned table
alter table test1 partition(dt='xxxxxx') set serdeproperties('field.delim'='\t');
alter table test set tblproperties('comment'='xxxxxx');
-- unpartitioned table
create table city(
time string,
country string,
province string,
city string
)
row format delimited fields terminated by '#' lines terminated by '\n'
stored as textfile;
load data local inpath '/home/data/city' into table city;
alter table city set serdeproperties('field.delim'='\t');

-- partitioned table
create table city(
time string,
country string,
province string,
city string
)
partitioned by (dt string)
row format delimited fields terminated by '#' lines terminated by '\n'
stored as textfile;

Table property operations (continued)
Change the location
alter table table_name [partition()] set location 'path'
alter table table_name set TBLPROPERTIES
('EXTERNAL'='TRUE'); -- managed table to external table
alter table table_name set TBLPROPERTIES
('EXTERNAL'='FALSE'); -- external table to managed table

alter table city set location 'hdfs://master:9000/location';
alter table city set tblproperties('EXTERNAL'='TRUE');

Further alter statements, described in the Hive wiki under LanguageManual DDL:
1. alter table properties
2. alter serde properties
3. alter table/partition file format
4. alter table storage properties
5. alter table rename partition
6. alter table set location

show partitions test_part;

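Hive advanced queries
Query operations
group by, order by, join, distribute by, sort by, cluster by, union all

Underlying implementation
MapReduce

A few simple aggregate operations
count: counts rows
count(*) count(1) count(col)
sum: sums values
sum(value convertible to a number) returns bigint for integer input and double for floating-point input
avg: averages values
avg(value convertible to a number) returns double
distinct: number of distinct values
count(distinct col)

Order by
Sorts by the given columns
Example
select col1,other...
from table
where condition
order by col1,col2 [asc|desc]
Notes
order by can sort on several columns; the default direction is ascending, and strings compare lexicographically
order by is a global sort
order by needs a reduce phase with exactly one reducer, regardless of the configured reducer count

group by
Groups rows by the values of the given columns; rows with equal values are put together
Example
select col1[,col2],count(1),sel_expr (aggregate)
from table
where condition
group by col1[,col2]
[having]
Notes
Every non-aggregate column in the select list must appear in group by
Apart from those plain columns, the select list may contain only aggregates
group by also accepts expressions, such as substr(col)

Characteristics
Uses a reduce phase and is limited by the number of reducers, set with mapred.reduce.tasks
The number of output files equals the number of reducers; file size depends on how much data each reducer processes
Problems
Heavy network load
Data skew; tune with hive.groupby.skewindata
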
set mapred.reduce.tasks=5;
set hive.groupby.skewindata=true;

select * from m order by col desc,col2 asc;
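----------------------------------------------------------
--------------------Join operations----------------------------
Table joins
Tables m and n are joined on the on condition; one row from m and one row from n combine into a new row
join (equi-join): a row is output only when the key value exists in both m and n
left outer join: every row of the left table is output whether or not it has a match in the right table; right-table values appear only for keys that exist in the left table, otherwise NULL
right outer join: the mirror image of left outer join
left semi join: similar to exists; only left-table columns can be selected
mapjoin: completes the join on the map side, in memory, without a reduce phase; it is an optimization

Table a (col, col2)
1 w
3 r
5 e

Table b (col3, col4)
1 f
1 g
5 j
2 p

select s.col,s.col2,t.col4 from
(select col,col2 from a) s join
(select col3,col4 from b) t on s.col = t.col3;

Result:
1 w f
1 w g
5 e j
-------------------------------------------
select s.col,s.col2,t.col4 from
(select col,col2 from a) s left outer join
(select col3,col4 from b) t on s.col = t.col3;
Result:
1 w f
1 w g
5 e j
3 r NULL
------------------------------------------
select s.col,s.col2,t.col4 from
(select col,col2 from a) s right outer join
(select col3,col4 from b) t on s.col = t.col3;
Result:
1 w f
1 w g
5 e j
NULL NULL p
-------------------------------------------
select s.col,s.col2 from
(select col,col2 from a) s left semi join
(select col3,col4 from b) t on s.col = t.col3;

Result:
1 w
5 e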
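
The difference between the join types is easiest to see by running them over the two small sample tables. This is a plain-Java sketch with nested loops, not how Hive executes a join; the class name JoinDemo is hypothetical, and the rows are the sample data for tables a and b.

```java
import java.util.ArrayList;
import java.util.List;

final class JoinDemo {
    static final String[][] A = {{"1", "w"}, {"3", "r"}, {"5", "e"}};             // (col, col2)
    static final String[][] B = {{"1", "f"}, {"1", "g"}, {"5", "j"}, {"2", "p"}}; // (col3, col4)

    // join: one output row per matching pair
    static List<String> innerJoin() {
        List<String> out = new ArrayList<>();
        for (String[] a : A) {
            for (String[] b : B) {
                if (a[0].equals(b[0])) out.add(a[0] + " " + a[1] + " " + b[1]);
            }
        }
        return out;
    }

    // left outer join: unmatched left rows are kept, padded with NULL
    static List<String> leftOuterJoin() {
        List<String> out = new ArrayList<>();
        for (String[] a : A) {
            boolean matched = false;
            for (String[] b : B) {
                if (a[0].equals(b[0])) {
                    out.add(a[0] + " " + a[1] + " " + b[1]);
                    matched = true;
                }
            }
            if (!matched) out.add(a[0] + " " + a[1] + " NULL");
        }
        return out;
    }

    // left semi join: each left row at most once, left columns only
    static List<String> leftSemiJoin() {
        List<String> out = new ArrayList<>();
        for (String[] a : A) {
            for (String[] b : B) {
                if (a[0].equals(b[0])) {
                    out.add(a[0] + " " + a[1]);
                    break; // stop at the first match, like exists
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("join:            " + innerJoin());
        System.out.println("left outer join: " + leftOuterJoin());
        System.out.println("left semi join:  " + leftSemiJoin());
    }
}
```

Note how key 1 produces two rows in the inner join (it matches f and g) but only one row in the semi join. Row order in real Hive output is not guaranteed.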

--------------------------------------------------------------------------
Example
select m.col as col,m.col2 as col2,n.col3 as col3
from
(select col,col2 from test where ... (runs on the map side)) m (left table)
[left outer | right outer | left semi] join
n (right table)
on m.col = n.col
where condition (runs on the reduce side)
set hive.optimize.skewjoin = true;
Output comparison
Left table M (col, col2): A 1, C 5, B 2, C 3
Right table N (col, col3): C 4, D 5, A 6
|----------------|------------------|------------------|
| join           | left outer join  | right outer join |
|----------------|------------------|------------------|
| col col2 col3  | col col2 col3    | col col2 col3    |
| A 1 6          | A 1 6            | A 1 6            |
| C 5 4          | C 5 4            | C 5 4            |
| C 3 4          | B 2 NULL         | C 3 4            |
|                | C 3 4            | D NULL 5         |
|----------------|------------------|------------------|

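Mapjoin
mapjoin (map-side join)
The small table is loaded into memory on the map side; the large table is then read and joined against the in-memory small table
This relies on the distributed cache

Advantages and disadvantages
Uses no reduce resources of the cluster (reducers are relatively scarce)
Skips the reduce phase, so the job runs faster
Lowers network load

Uses some memory, so the table loaded into memory must not be too large, because every compute node loads its own copy
Produces a larger number of small files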
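
The idea behind a map-side join can be sketched as a hash join: build an in-memory map from the small table, then stream the large table past it. This plain-Java sketch is only an illustration of the technique, not Hive's implementation; the class name MapJoinDemo is hypothetical, and the rows reuse the M and N sample tables from the output comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapJoinDemo {
    // Joins bigTable (key, value) with smallTable (key, value) on the key, without any shuffle.
    static List<String> mapJoin(String[][] bigTable, String[][] smallTable) {
        // Step 1: load the small table into memory, keyed by the join column
        Map<String, List<String>> inMemory = new HashMap<>();
        for (String[] row : smallTable) {
            inMemory.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Step 2: stream the big table and probe the in-memory map for each row
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            List<String> matches = inMemory.get(row[0]);
            if (matches == null) continue; // inner join: drop unmatched rows
            for (String value : matches) {
                out.add(row[0] + " " + row[1] + " " + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] m = {{"A", "1"}, {"C", "5"}, {"B", "2"}, {"C", "3"}}; // large table (col, col2)
        String[][] n = {{"C", "4"}, {"D", "5"}, {"A", "6"}};             // small table (col, col3)
        System.out.println(mapJoin(m, n));
    }
}
```

Because every mapper needs its own copy of the in-memory map, the memory cost grows with the size of the small table, which is the constraint described above.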

Enabling mapjoin
Option 1: set the following parameter so that Hive chooses between a common join and a map join automatically, based on the SQL
set hive.auto.convert.join = true;
hive.mapjoin.smalltable.filesize defaults to 25 MB
Option 2: specify it manually with a hint
select /*+ mapjoin(n) */ m.col,m.col2,n.col3 from m
join n
on m.col = n.col
In short, mapjoin suits these cases:
1. One of the joined tables is very small
2. Non-equi joins

select /*+ mapjoin(n) */ m.city,n.province from
(select province,city from city) m
join
(select province from province) n
on m.province = n.province;

load data local inpath '/home/data/city' into table city;
set hive.auto.convert.join; (prints the current value)
set hive.auto.convert.join=true;

Hive bucketing
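Using buckets
select * from bucketed_user
tablesample(bucket 1 out of 2 on id)
bucket join
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Bucket map join
Two tables bucketed on the same columns, where those columns include the join columns, can be joined efficiently with a map-side join. When both tables are bucketed on the join column, only buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data each join handles.
In a map-side join of two tables bucketed the same way, the mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket. The mapper therefore fetches only that bucket (a small part of the data stored in the right table) to perform the join. This optimization does not require the two tables to have the same number of buckets; it also works when one bucket count is a multiple of the other.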

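------------Distribute by and Sort by----------------------------------------------------------------
Distribute: spreads data across reducers
distribute by col
sends rows to different reducers according to the col column
Sort: orders data
sort by col2
sorts rows by the col2 column within each reducer
select col1,col2 from M
distribute by col1
sort by col1 asc,col2 desc;
Used together, they guarantee that each reducer's output is sorted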
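
A small simulation shows what "sorted per reducer" means. The sketch below partitions rows by col1 across a fixed number of reducers and then sorts each partition by col1 ascending, col2 descending, mirroring the query above. It is plain Java for illustration only; the class name DistributeSortDemo, the sample rows, and the hash-modulo partitioning are assumptions, not Hive internals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DistributeSortDemo {
    // rows are {col1, col2}; returns one sorted list per reducer
    static List<List<int[]>> distributeAndSort(int[][] rows, int reducers) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new ArrayList<>());
        }
        // distribute by col1: rows with the same col1 always reach the same reducer
        for (int[] row : rows) {
            int reducer = (Integer.hashCode(row[0]) & Integer.MAX_VALUE) % reducers;
            partitions.get(reducer).add(row);
        }
        // sort by col1 asc, col2 desc: applied inside each reducer only
        Comparator<int[]> order = (x, y) -> x[0] != y[0]
            ? Integer.compare(x[0], y[0])
            : Integer.compare(y[1], x[1]);
        for (List<int[]> partition : partitions) {
            partition.sort(order);
        }
        return partitions;
    }

    static String format(List<int[]> partition) {
        StringBuilder sb = new StringBuilder();
        for (int[] row : partition) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(row[0]).append(':').append(row[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] rows = {{3, 1}, {1, 5}, {2, 7}, {1, 9}, {4, 2}, {2, 3}};
        List<List<int[]>> out = distributeAndSort(rows, 2);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + format(out.get(i)));
        }
    }
}
```

Each reducer's list is ordered, but concatenating the two lists does not give a globally sorted result, which is the difference from order by.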

Comparison
distribute by vs group by
Both split data by key value
Both use a reduce phase
The only difference: distribute by merely spreads the data, whereas group by gathers rows with the same key together and must be followed by aggregation

--------------------------------------------
order by vs sort by
order by is a global sort
sort by only guarantees that the output of each reducer is sorted; with a single reducer it behaves the same as order by

Use cases
Map output files of uneven size
Reduce output files of uneven size
Too many small files
Very large files
----------------------------------------
cluster by
Gathers rows with the same value together and sorts them
Equivalence
cluster by col
is the same as distribute by col sort by col

set mapred.reduce.tasks=5;
insert overwrite table city
select time,
country,
province,
city
from info
distribute by province;

set mapred.reduce.tasks=1;
insert overwrite table province partition(dt='20140901')
select time,
country,
province,
city
from city
distribute by country;
--------------------------------------
Union all
Merges the rows of several tables into one result; Hive versions before 1.2.0 support only union all, not plain union
Example
select col
from(
select a as col from t1
union all
select b as col from t2
)tmp
------------------------------------
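Hive functions
Contents
Function categories
|------------------------|-----------------------------------------|
| Built-in functions     | Simple functions: map phase             |
|                        | Aggregate functions: reduce phase       |
|                        | Collection functions: map phase         |
|                        | Special functions                       |
|------------------------|-----------------------------------------|
| User-defined functions | UDF: map phase                          |
|                        | UDAF: reduce phase                      |
|------------------------|-----------------------------------------|

Built-in functions
Regular expressions
User-defined functions

CLI commands
1. List the functions available in the current session
SHOW FUNCTIONS;
2. Show a function's description
DESC FUNCTION concat;
3. Show a function's extended description
DESC FUNCTION EXTENDED concat;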
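------------------Simple functions----------------------------
Granularity: one row at a time
Relational operators
Mathematical operators
Logical operators
Numeric functions
Type conversion
Date functions
Conditional functions
String functions
Statistical functions

-----------------Aggregate functions-------------------------------
Granularity: multiple rows
sum(): sum
count(): number of rows
avg(): average
distinct: number of distinct values
min: minimum
max: maximum
-----------------Collection functions--------------------------
Constructing complex types
Accessing complex types
Size of complex types
-------------------Special functions-----------------------
Window functions
Analytic functions
Miscellaneous functions
UDTF
----------------Window functions--------------------------------
Use cases
Sorting within partitions
Dynamic group by
Top N
Running totals
Hierarchical queries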
Windowing functions
lead
lag
FIRST_VALUE
LAST_VALUE
------------------Analytic functions-------------------------------------
The OVER clause
COUNT
SUM
MIN
MAX
AVG
Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
-------------------Miscellaneous functions----------------------------
java_method(class,method[,arg1[,arg2..]])
reflect(class,method[,arg1[,arg2...]])
hash(a1[,a2...])
----------------UDTF----------------------
Table-generating functions
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
Example
the explode function
------------------------------------------
select id,money from winfunc
where id='1001' or id='1002' and money='100'

cast(money as bigint)

if(con,'','') is equivalent to case when con then ... when ... then ... else ... end

get_json_object

select get_json_object('{"name":"jack","age":"20"}','$.name') from winfunc limit 1
(the input must be well-formed JSON)

select parse_url('http://baidu.com/path/p.php?k1=v1&k2=v2#Ref1','HOST') from winfunc limit 1;

concat
concat_ws(string SEP,array<string>) (arguments: separator, then the strings to join)

select concat(type,'123') from winfunc;

collect_set()
collect_list()

sum(money)
count(*)

first_value(money) over(partition by id order by money rows between 1 preceding and 1 following)

select id,name, first_value(money) over(partition by id order by money rows between 1 preceding and 3 following) from winfunc;

select id,name,rank() over(partition by id order by name) from winfunc;

select id,name,dense_rank() over(partition by id order by name) from winfunc;

select id,name,money,cume_dist() over(partition by id order by money) from winfunc;

select id,name,money,ntile(2) over (partition by id order by money desc) from winfunc;

select id,name,money,java_method("java.lang.Math","sqrt",cast(id as double)) from winfunc;

select id,adid from winfunc lateral view explode(split(type,'B')) tt as adid

select userid,pageid,visitdate,rank() over(partition by userid order by pageid) from (select distinct userid,pageid,visitdate from test) a;

select id,name,row_number() over(partition by id order by name) from winfunc;
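
The three ranking functions differ only in how they treat ties. The plain-Java sketch below computes row_number, rank and dense_rank for one partition that is already sorted by the ordering column. The class name RankDemo is hypothetical; the values are the money column for id 1001 in the winfunc sample data, sorted ascending.

```java
import java.util.Arrays;

final class RankDemo {
    // row_number: position in the ordered partition, ties get different numbers
    static int[] rowNumber(double[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) out[i] = i + 1;
        return out;
    }

    // rank: ties share a rank, and the next rank skips ahead
    static int[] rank(double[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = (i > 0 && sorted[i] == sorted[i - 1]) ? out[i - 1] : i + 1;
        }
        return out;
    }

    // dense_rank: ties share a rank, and the next rank is consecutive
    static int[] denseRank(double[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0) out[i] = 1;
            else out[i] = (sorted[i] == sorted[i - 1]) ? out[i - 1] : out[i - 1] + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] money = {100.0, 150.0, 150.0, 200.0}; // id 1001, ordered by money
        System.out.println("row_number: " + Arrays.toString(rowNumber(money)));
        System.out.println("rank:       " + Arrays.toString(rank(money)));
        System.out.println("dense_rank: " + Arrays.toString(denseRank(money)));
    }
}
```

For the tied value 150.0, rank gives 2, 2 and then jumps to 4, while dense_rank gives 2, 2 and continues with 3.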

hive -e "select distinct userid,pageid, first_value(pageid) over(partition by userid order by pageid rows between 1 preceding and 5 following) first_value from test where visitdate='20150501'" >test.txt

hive -e "select distinct userid,pageid, last_value(pageid) over(partition by userid order by pageid rows between 1 preceding and 5 following) last_value from test where visitdate='20150501'" >test.txt

hive -e "select distinct userid,pageid, first_value(pageid) over(partition by userid order by pageid ) first_value from test where visitdate='20150501'" >test.txt

hive -e "select userid,pageid,visitdate,rank() over(partition by userid order by pageid) from test" >test.txt

lead(money,2) over(order by money)

rank() over(partition by id order by money)
dense_rank() over(partition by id order by money)

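cume_dist() over(partition by id order by money) = (largest row number among rows with the same value) / (number of rows in the partition), that is, the fraction of rows whose value is less than or equal to the current one

percent_rank() over(partition by id order by money) = ((smallest row number among rows with the same value) - 1) / ((number of rows in the partition) - 1)

The first row's percent_rank is always 0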
select id,money,cume_dist() over(partition by id order by money),percent_rank() over(partition by id order by money) from winfunc;
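
The two formulas can be checked by hand on one partition. This plain-Java sketch applies them to a partition that is already sorted by the ordering column; the class name DistributionDemo is hypothetical, and the values are the money column for id 1001 in the winfunc sample data.

```java
import java.util.Arrays;

final class DistributionDemo {
    // cume_dist = (largest row number among ties) / (rows in partition)
    static double[] cumeDist(double[] sorted) {
        int n = sorted.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int last = i;
            while (last + 1 < n && sorted[last + 1] == sorted[i]) last++;
            out[i] = (last + 1) / (double) n;
        }
        return out;
    }

    // percent_rank = (smallest row number among ties - 1) / (rows in partition - 1)
    static double[] percentRank(double[] sorted) {
        int n = sorted.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int first = i;
            while (first > 0 && sorted[first - 1] == sorted[i]) first--;
            out[i] = (n == 1) ? 0.0 : first / (double) (n - 1); // first is the zero-based row index
        }
        return out;
    }

    public static void main(String[] args) {
        double[] money = {100.0, 150.0, 150.0, 200.0}; // id 1001, ordered by money
        System.out.println("cume_dist:    " + Arrays.toString(cumeDist(money)));
        System.out.println("percent_rank: " + Arrays.toString(percentRank(money)));
    }
}
```

For the four rows of id 1001, cume_dist is 0.25, 0.75, 0.75, 1.0 and percent_rank is 0, 1/3, 1/3, 1.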

ntile(2) over (order by money desc nulls last) splits the ordered rows into 2 slices
select id,money,ntile(2) over (order by money desc ) from winfunc;

select java_method("java.lang.Math","sqrt",cast(id as double)) from winfunc;

select id,adid from winfunc lateral view explode(split(type,'B')) tt as adid;

select id,money,first_value(money) over(partition by id order by money) from winfunc;
select id,money,first_value(money) over (partition by id order by money rows between 1 preceding and 1 following) from winfunc;
select id,money,lead(money,2) over(order by money) from winfunc;
select id,money,rank() over(partition by id order by money) from winfunc;

-------------------------------------------
Sample data for winfunc (id, money, name):
1001 100.0 ABC
1001 150.0 BCD
1001 200.0 CDE
1001 150.0 DEF
1002 200.0 ABC
1002 200.0 ABC
1002 100.0 BCD
1002 300.0 CDE
1002 50.0 DEF
1002 400.0 EFG
1003 100.0 ABC
1003 50.0 BCD
1004 100.0 ABC
1004 90.0 ABC
1004 30.0 ABC
1004 80.0 ABC
1004 40.0 ABC
1004 70.0 ABC
1004 50.0 ABC
1004 60.0 ABC

create table winfunc(id int,money float,name string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile;

------------------------------------------------
select if(2>1,'v1','v2') from winfunc

select case when id='1001' then 'v1' when id ='1002' then 'v2' else 'v3' end from winfunc

Regular expressions
Functions and operators that use regular expressions

A LIKE B: the character '_' matches any single character and '%' matches any number of characters
A RLIKE B
regexp_replace(string A,string B,string C)
regexp_extract(string subject,string pattern,int index)

select 1 from dual where 'footbar' rlike '^f.*r$';
select regexp_replace('foobar','oo|ar','') from dual;
select regexp_extract('foothebar','foo(.*?)(bar)',1) from winfunc;
select regexp_extract('979|7.10.80|8684','.*\\|(.*)',1) from winfunc limit 1;
select regexp_extract('979|7.10.80|8684','(.*?)\\|(.*)',1) from winfunc limit 1;
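
Hive's regular expression functions follow Java regex syntax, so the patterns above can be tried out with java.util.regex first. The sketch below provides rough stand-ins for rlike, regexp_replace and regexp_extract; the class name RegexpFunctionsDemo is hypothetical, and returning an empty string when nothing matches is an assumption of this sketch rather than a statement about Hive's behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexpFunctionsDemo {
    // Stand-in for A RLIKE B: true if the pattern is found anywhere in the string
    static boolean rlike(String a, String b) {
        return Pattern.compile(b).matcher(a).find();
    }

    // Stand-in for regexp_replace(A, B, C): replace every match of B in A with C
    static String regexpReplace(String a, String b, String c) {
        return a.replaceAll(b, c);
    }

    // Stand-in for regexp_extract(subject, pattern, index): the given group of the first match
    static String regexpExtract(String subject, String pattern, int index) {
        Matcher m = Pattern.compile(pattern).matcher(subject);
        return m.find() ? m.group(index) : ""; // empty string on no match (assumption)
    }

    public static void main(String[] args) {
        System.out.println(rlike("footbar", "^f.*r$"));
        System.out.println(regexpReplace("foobar", "oo|ar", ""));
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 1));
        // Greedy .* runs to the last '|', so group 1 is the final segment
        System.out.println(regexpExtract("979|7.10.80|8684", ".*\\|(.*)", 1));
        // Reluctant .*? stops at the first '|', so group 1 is the first segment
        System.out.println(regexpExtract("979|7.10.80|8684", "(.*?)\\|(.*)", 1));
    }
}
```

The last two calls show why the greedy and reluctant versions of the pattern extract different segments of the same string.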
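-------------------User-defined functions--------------------------------------------------------------
UDF <-- user-defined functions --> UDAF

UDF
UDF: user-defined function
Operates on a single row
Creating a function
Write a Java class
Extend the UDF class
Implement the evaluate method
Package it as a jar
Run add jar in Hive
Create a temporary function in Hive
Use it in HQL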

---------------------------------------------------------------------------------------
UDF example
BigThan.java

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class BigThan extends UDF {
    // Hive looks the method up by name, so it must be spelled evaluate
    public boolean evaluate(final Text t1, final Text t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        double num = Double.parseDouble(t1.toString());
        double tmp = Double.parseDouble(t2.toString());
        return num > tmp;
    }
}

udftest.java

package com.peixun.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class udftest extends UDF {
    // Returns true when the first argument is numerically greater than the second
    public boolean evaluate(Text t1, Text t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        double d1 = Double.parseDouble(t1.toString());
        double d2 = Double.parseDouble(t2.toString());
        return d1 > d2;
    }
}

add jar /home/jar/function.jar;
create temporary function bigthan as 'com.peixun.udf.udftest';
select name,addr,bigthan(addr,80) from testtext;

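UDAF
UDAF: user-defined aggregation function
Operates on a set of rows

Developing a generic UDAF takes two steps
First, write the resolver class; the resolver handles type checking and operator overloading.
Second, write the evaluator class; the evaluator implements the actual UDAF logic.

Typically the top-level UDAF class implements
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, and a nested evaluator class inside it implements the UDAF logic

1. Implementing the resolver
A resolver usually implements
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, but extending AbstractGenericUDAFResolver is recommended because it insulates the UDAF from future changes to Hive's interfaces. The difference between the GenericUDAFResolver and GenericUDAFResolver2 interfaces is that
the latter gives the evaluator implementation access to more information, such as the DISTINCT qualifier and the wildcard FUNCTION(*).
2. Implementing the evaluator
Every evaluator must extend the abstract class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator. The subclass must implement its abstract methods, which carry the UDAF logic.
Mode
This enum matters: it identifies the MapReduce stage the UDAF is running in. Understanding what each Mode means explains how a Hive UDAF executes.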
public static enum Mode{
PARTIAL1,
PARTIAL2,
FINAL,
COMPLETE
};
------------------------------------------------------------------------------------------------------------
udaftest.java
public class udaftest extends AbstractGenericUDAFResolver{

}
CountBigThan.java
public class CountBigThan extends AbstractGenericUDAFResolver{

}
----------------------------------------------------------------
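PARTIAL1: the map phase of MapReduce. Goes from raw data to a partial aggregate; calls iterate() and terminatePartial()
PARTIAL2: the map-side Combiner phase, which merges map output on the map side. Goes from partial aggregates to a partial aggregate; calls merge() and terminatePartial()
FINAL: the reduce phase of MapReduce. Goes from partial aggregates to the full aggregate; calls merge() and terminate()
COMPLETE: if this mode appears, the job has a map phase only and no reduce, so the map side produces the result directly. Goes from raw data straight to the full aggregate; calls iterate() and terminate()

Learn from the source code
src\ql\src\java\org\apache\hadoop\hive\ql\udf\generic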

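-------------Permanent functions------------------------
To define a function in Hive that is available permanently, add the function class to the source tree, then edit ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java and add the corresponding registration code.
registerUDF("parse_url",UDFParseUrl.class,false);
Alternatively, create a hiverc file
Put the jar in the installation directory or another chosen directory
$HOME/.hiverc
Put the initialization statements (add jar, create temporary function) in that file
----------------------------------------------
select bigthan(addr,80) from testtext;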

Methods called in each MapReduce phase
MAP
init()
iterate()
terminatePartial()
REDUCE
init()
merge()
terminate()
Combiner
merge()
terminatePartial()
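
The call sequence above can be simulated without Hive. The sketch below is a plain-Java stand-in for a "count values greater than a threshold" aggregate. The method names iterate, terminatePartial, merge and terminate follow the evaluator lifecycle described in these notes, but the signatures are simplified and the class CountBigThanSim is hypothetical; a real evaluator extends GenericUDAFEvaluator and works with ObjectInspectors and aggregation buffers.

```java
final class CountBigThanSim {
    private final double threshold;
    private long count; // the aggregation buffer

    CountBigThanSim(double threshold) {
        this.threshold = threshold;
    }

    // MAP (PARTIAL1) or COMPLETE: consume one raw row
    void iterate(double value) {
        if (value > threshold) count++;
    }

    // MAP or Combiner: hand the partial aggregate downstream
    long terminatePartial() {
        return count;
    }

    // Combiner (PARTIAL2) or REDUCE (FINAL): fold in one partial aggregate
    void merge(long partial) {
        count += partial;
    }

    // REDUCE (FINAL) or COMPLETE: produce the final result
    long terminate() {
        return count;
    }

    // Simulates one reducer receiving partial results from several map tasks
    static long run(double[][] mapInputs, double threshold) {
        CountBigThanSim reducer = new CountBigThanSim(threshold);
        for (double[] split : mapInputs) {
            CountBigThanSim mapper = new CountBigThanSim(threshold); // one evaluator per map task
            for (double value : split) {
                mapper.iterate(value);
            }
            reducer.merge(mapper.terminatePartial());
        }
        return reducer.terminate();
    }

    public static void main(String[] args) {
        double[][] splits = {{100, 150, 200}, {50, 300}};
        System.out.println(run(splits, 120)); // values above 120: 150, 200, 300
    }
}
```

Splitting the work this way is only valid when merging partial results gives the same answer as aggregating all rows at once, which holds for a count.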
---------------------Hive HQL optimization---------------
Hive execution
Hive table optimization
Hive SQL optimization
Hive job optimization
Hive map optimization
Hive shuffle optimization
Hive reduce optimization
Hive permission management
---------------------------------------------------------------------------
Optimizing Hive queries
Join optimization
set hive.optimize.skewjoin=true; -- set to true if the join suffers from skew
set hive.skewjoin.key=1000000; -- a join key with more rows than this is treated as skewed and optimized
mapjoin
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25 MB
select /*+ mapjoin(A) */ f.a,f.b from A t join B f on (f.a = t.a)
In short, mapjoin suits these cases:
1. One of the joined tables is very small
2. Non-equi joins
----------------------------------------------------------------------------
bucket join
两个表以相同方式划分桶
两个表的桶个数是倍数关系

create table order(cid int,price float) clustered by(cid) into 32 buckets;
create table customer(id int,first string) clustered by(id) into 32 buckets;
select price from order t join customer s on t.cid=s.id;
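
Why the bucket counts must be multiples can be seen in a short Java sketch. It assumes, as a simplification, that an int key hashes to its own value and that the bucket is the hash modulo the bucket count; BucketSketch and the 16-bucket table are hypothetical and only there to show the alignment.

```java
// Hypothetical illustration of bucket alignment; the hash is simplified.
class BucketSketch {
    // Assumption: an int key hashes to its own value; the mask keeps it non-negative.
    static int bucketOf(int key, int numBuckets) {
        return (key & Integer.MAX_VALUE) % numBuckets;
    }

    // When the larger bucket count is a multiple of the smaller one, a bucket of
    // the larger table only has to be joined with one bucket of the smaller table.
    static int matchingSmallBucket(int largeBucket, int smallBuckets) {
        return largeBucket % smallBuckets;
    }

    public static void main(String[] args) {
        int cid = 1234;
        int largeBucket = bucketOf(cid, 32);   // table bucketed into 32
        int smallBucket = bucketOf(cid, 16);   // hypothetical table bucketed into 16
        System.out.println(largeBucket + " " + smallBucket);
        System.out.println(matchingSmallBucket(largeBucket, 16) == smallBucket);
    }
}
```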

Before join optimization
select m.cid,u.id from order m join customer u on m.cid = u.id where m.dt = '2013-12-12';
After optimization
select m.cid,u.id from (select cid from order where dt='2013-12-12') m join customer u on m.cid = u.id;
--------------------------------------------------------------------------
count distinct optimization
Before optimization
Select count(distinct id) from tablename;
After optimization
Select count(1) from (select distinct id from tablename) tmp;
Select count(1) from (select id from tablename group by id) tmp;
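
A rough Java simulation of why the rewrite helps: the inner "group by id" lets several reducers each de-duplicate their own share of the ids, and the outer count(1) only has to add up the results. DistinctCountSketch and its hash partitioning are hypothetical simplifications, not Hive code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the two-stage rewrite; not Hive code.
class DistinctCountSketch {
    static long countDistinct(List<Integer> ids, int numReducers) {
        // Stage 1: "group by id" spreads ids over reducers by hash,
        // so each reducer de-duplicates only its own share.
        List<Set<Integer>> reducers = new ArrayList<Set<Integer>>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new HashSet<Integer>());
        }
        for (int id : ids) {
            int r = (id & Integer.MAX_VALUE) % numReducers;
            reducers.get(r).add(id);
        }
        // Stage 2: "count(1)" only has to add up the per-reducer sizes.
        long total = 0;
        for (Set<Integer> s : reducers) {
            total += s.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 2, 3, 3, 3, 7, 8, 8);
        System.out.println(countDistinct(ids, 3));
    }
}
```

Because equal ids always hash to the same reducer, the per-reducer sets never overlap and their sizes can simply be summed.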
----------------------------------------------------------------------------------
select count(distinct city) from info;
select count(distinct city) from (select distinct city from info) tmp;
set mapred.reduce.tasks=3;
-----------------------------------
select a,sum(b),count(distinct c),count(distinct d) from test group by a;
After optimization
select a,sum(b) as b,count(c) as c,count(d) as d from(
select a,0 as b,c,null as d from test group by a,c
union all
select a,0 as b,null as c,d from test group by a,d
union all
select a,b,null as c,null as d from test
) tmp1 group by a;
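-------Hive optimization and permission management------------------
Hive optimization goal
High execution efficiency with limited resources
Common problems
Data skew
Setting the number of maps
Setting the number of reduces
Other
---------------------------------
Hive execution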

||
VV
HQL

||
VV
Job

||
VV
Map/Reduce
----------------------------------
Execution plan
View the execution plan
explain [extended] hql

explain extended hql shows a more detailed execution plan

Example
select col,count(1) from test2 group by col;
explain select col,count(1) from test2 group by col;
------Syntax tree-----------
ABSTRACT SYNTAX TREE:
(TOK_QUERY(TOK_FROM(TOK_TABREF
(TOK_TABNAME test2)))(TOK_INSERT
(TOK_DESTINATION(TOK_DIR TOK_TMP_FILE))
(TOK_SELECT(TOK_SELEXPR
(TOK_TABLE_OR_COL col))(TOK_SELEXPR
(TOK_FUNCTION count 1)))(TOK_GROUPBY
(TOK_TABLE_OR_COL col))))
------Execution stages--------------------------
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
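----------Hive execution process (screenshot needed)----------

----------Hive table optimization--------------
Partitioning
Static partitions
Dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Bucketing
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
Data
Keep identical data clustered together as far as possible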
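----------Hive job optimization-----------------------------
Parallel execution
Hive turns each query into multiple stages; stages that do not depend on each other can run in parallel, which reduces execution time

set hive.exec.parallel = true;
set hive.exec.parallel.thread.number=8;
Local execution
set hive.exec.mode.local.auto = true;
A job actually runs in local mode only when it meets all of the following conditions:
1. The job's input size must be smaller than:
hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. The job's number of maps must be smaller than:
hive.exec.mode.local.auto.tasks.max (default 4)
3. The job's number of reduces must be 0 or 1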
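
The three conditions combine into a single predicate. The Java sketch below is a hypothetical helper (LocalModeCheck is not part of Hive) that mirrors the rule as stated in these notes, using the default limits quoted above.

```java
// Hypothetical helper mirroring the three conditions; not Hive code.
class LocalModeCheck {
    static boolean canRunLocally(long inputBytes, int mapTasks, int reduceTasks,
                                 long maxInputBytes, int maxTasks) {
        return inputBytes < maxInputBytes                 // condition 1
            && mapTasks < maxTasks                        // condition 2
            && (reduceTasks == 0 || reduceTasks == 1);    // condition 3
    }

    public static void main(String[] args) {
        long maxBytes = 128L * 1024 * 1024;  // default quoted in the notes: 128 MB
        int maxTasks = 4;                    // default quoted in the notes: 4
        System.out.println(canRunLocally(50L * 1024 * 1024, 2, 1, maxBytes, maxTasks));
        System.out.println(canRunLocally(500L * 1024 * 1024, 2, 1, maxBytes, maxTasks));
    }
}
```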
------------------------------------------------
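Merging small input files for a job
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
The number of files merged is determined by the size limit mapred.max.split.size
Merging small output files for a job
set hive.merge.smallfiles.avgsize=256000000; -- when the average output file size is below this value, a new job is started to merge the files
set hive.merge.size.per.task=64000000; -- size of the files after merging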
-------------------------------------------
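JVM reuse
set mapred.job.reuse.jvm.num.tasks=20;
JVM reuse lets a job hold on to its slots until the job finishes. This is very useful for jobs with many tasks or many small files, and it reduces execution time. The value should not be set too high, though: some jobs have reduce tasks, and until those reduce tasks finish, the slots held by the map tasks cannot be released, so other jobs may have to wait.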
---------------------------------------------
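Compressing data
Intermediate compression applies to the data passed between the multiple jobs of a Hive query; for intermediate compression, choose a codec that is cheap in CPU time
set hive.exec.compress.intermediate = true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
The final output of a Hive query can also be compressed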
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
-------------------------------------------------
set mapred.map.tasks=10;
(1) Default number of maps
default_num = total_size/block_size;
(2) Desired number of maps
goal_num=mapred.map.tasks;
(3) Size of the data each split handles
split_size=max(mapred.min.split.size,block_size);
split_num=total_size/split_size;
(4) Computed number of maps
compute_map_num = min(split_num,max(default_num,goal_num))
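
The four steps can be written out as a small Java method. This is only a sketch of the formula exactly as given in these notes, with integer division; MapCountSketch and the sample sizes are hypothetical, and a real Hadoop InputFormat may compute splits differently.

```java
// Sketch of the formula as written in the notes; not Hadoop's actual split logic.
class MapCountSketch {
    static long computeMapNum(long totalSize, long blockSize,
                              long minSplitSize, long goalNum) {
        long defaultNum = totalSize / blockSize;             // step (1)
        long splitSize = Math.max(minSplitSize, blockSize);  // step (3)
        long splitNum = totalSize / splitSize;
        return Math.min(splitNum, Math.max(defaultNum, goalNum));  // step (4)
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 1 GB of input, 64 MB blocks, tiny min split size, mapred.map.tasks=10
        System.out.println(computeMapNum(1024 * mb, 64 * mb, 1, 10));
        // Same input, but mapred.min.split.size raised to 256 MB
        System.out.println(computeMapNum(1024 * mb, 64 * mb, 256 * mb, 10));
    }
}
```

The second call shows the effect of raising mapred.min.split.size: the split size grows, so fewer maps are computed.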
--------------------------------------------
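Hive Map optimization
From the analysis above, setting the number of maps comes down to the following:
(1) To increase the number of maps, set mapred.map.tasks to a larger value.
(2) To decrease the number of maps, set mapred.min.split.size to a larger value.
Case 1: the input files are huge, but they are not small files
Increase mapred.min.split.size
Case 2: there is a huge number of input files and all of them are small, i.e. each file is smaller than blockSize. Increasing mapred.min.split.size does not work here; use CombineFileInputFormat to combine multiple input paths into one InputSplit per mapper, which reduces the number of mappers.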
--------------------------------------
Map-side aggregation
set hive.map.aggr=true;
Speculative execution
mapred.map.tasks.speculative.execution
--------Hive Shuffle optimization-------------------------------

Map side
io.sort.mb
io.sort.spill.percent
min.num.spills.for.combine
io.sort.factor
io.sort.record.percent

Reduce side
mapred.reduce.parallel.copies
mapred.reduce.copy.backoff
io.sort.factor
mapred.job.shuffle.input.buffer.percent
mapred.job.reduce.input.buffer.percent
--------------Hive Reduce optimization------------
Queries that need a reduce phase
Aggregate functions
sum,count,distinct...
Advanced queries
group by,join,distribute by,cluster by ...
order by is a special case: it uses only one reduce
Speculative execution
mapred.reduce.tasks.speculative.execution
hive.mapred.reduce.tasks.speculative.execution

Reduce optimization
set mapred.reduce.tasks=10; -- set directly
hive.exec.reducers.max
hive.exec.reducers.bytes.per.reducer default: 1G
Formula
numRTasks = min[maxReducers,input.size/perReducer]
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer
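
A minimal Java sketch of this formula follows. Rounding the division up and keeping at least one reducer are assumptions added here so that a small input still gets a reducer; ReducerCountSketch and the sample limit of 999 are hypothetical.

```java
// Sketch of numRTasks = min[maxReducers, input.size / perReducer].
class ReducerCountSketch {
    static long numReduceTasks(long inputSize, long bytesPerReducer, long maxReducers) {
        // Assumption: round up and keep at least one reducer.
        long byInput = (inputSize + bytesPerReducer - 1) / bytesPerReducer;
        byInput = Math.max(1, byInput);
        return Math.min(maxReducers, byInput);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        System.out.println(numReduceTasks(10 * gb, gb, 999));    // limited by input size
        System.out.println(numReduceTasks(2500 * gb, gb, 999));  // capped by maxReducers
    }
}
```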
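----------------Hive case study---------------------------------
Log processing flow
Flume-ng
Kafka
Flume-ng+Kafka+Hdfs
Hive warehouse
Log processing

Log processing flow
Data collection ==> data cleaning ==> data storage and management ==> data analysis ==> data presentation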

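--------------Introduction to Flume-ng-------------------------------------
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating and transporting massive amounts of log data
Flume supports custom data senders in the logging system for collecting data
Flume can apply simple processing to the data and write it to a variety of (customizable) data receivers.
Flume currently has two lines: the Flume 0.9X releases are known as Flume-og, and the Flume 1.X releases are known as Flume-ng.
Main elements
agent source channel sink

An Agent runs Flume inside a JVM. Each machine runs one agent, but one agent can contain multiple sources and sinks
A Client produces the data and runs in its own thread
A Source collects data from the Client and passes it to a Channel
A Sink collects data from the Channel and runs in its own thread
A Channel connects sources and sinks
Events can be log records, avro objects, and so on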

https://cwiki.apache.org/confluence/display/FlUME/Getting+Started
-----------------------------------------------------------
Flume installation
Edit the configuration file
agent.sources=r1
agent.sinks=s1
agent.channels=c1
agent.sources.r1.channels=c1
agent.sinks.s1.channel=c1
#Describe/configure the source
agent.sources.r1.type=exec
agent.sources.r1.command=tail -F /home/flume/loginfo
#Use a channel which buffers events in memory
agent.channels.c1.type=memory
agent.channels.c1.capacity=1000
agent.channels.c1.transactionCapacity=100
agent.sinks.s1.type=logger

Start Flume
bin/flume-ng agent --conf ./conf/ -f conf/flume-conf.properties -Dflume.root.logger=DEBUG,console -n agent
------------------------------------------------------------------------------
Kafka installation
Kafka is a distributed message queue for high-throughput log processing
Key Kafka concepts
broker
producer
consumer
topic
partition

Download the package
Extract the package
tar -xzf kafka-<VERSION>.tgz
Build the code
./sbt update
./sbt package
Set the environment variables and reload the profile
export KAFKA_HOME=/home/kafka/kafka-0.7.2-incubating-src
export KAFKA_CONF_DIR=/home/kafka/kafka-0.7.2-incubating-src/config
export PATH=$PATH:$KAFKA_HOME/bin

Common Kafka operations
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Reference
http://kafka.apache.org/07/quickstart.html
----------------Flume----------------------------------------------------------
Main Flume configuration
One source with multiple sinks
HDFS
Kafka (a special case that needs a custom sink)
Points to watch
JDK version
Hadoop jar versions
HDFS port

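-----------Hive data warehouse------------------------------------------
Contents
Handling data sources in different formats
Convert different data formats to one format
Unify the fields of data from different sources
Use collection types for fields that cannot be unified
Use partitions for data from different sources

-------------Log processing------------------------------------------
Flume collects the logs
Log distribution
hdfs
kafka
Log processing
MR processes the HDFS data
Spark processes the Kafka data
Hive manages the HDFS data
Write HQL to compute statistics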
----------------------------------------------------------------
create table testflume (a string,b string) row format delimited fields terminated
by '\t' lines terminated by '\n' stored as textfile location '/root/sou';
-----------------------------------------------------------------
desc formatted testflume;
alter table testflume set tblproperties('EXTERNAL'='TRUE');
select * from testflume;
---------------------------------------------
LogHandler.java
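import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogHandler extends Configured implements Tool{
    public static void main(String[] args) throws Exception{
        int exit = ToolRunner.run(new LogHandler(),args);
        System.exit(exit);
    }

    @Override
    public int run(String[] args) throws Exception{
        Configuration conf = new Configuration();
        Path dst = new Path(args[1]);
        FileSystem fs = FileSystem.get(new Path("hdfs://192.168.1.198:8020").toUri(),conf);
        // Remove an existing output directory so the job can be rerun
        if(fs.exists(dst)){
            fs.delete(dst,true);
        }
        Job job = new Job(conf,"LogHandler");
        // Lets the cluster locate the jar that contains the mapper class
        job.setJarByClass(LogHandler.class);
        job.setMapperClass(LogMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    // Rewrites each comma-separated line as a tab-separated line
    public static class LogMapper extends Mapper<LongWritable,Text,Text,Text>{
        @Override
        protected void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
            String str = value.toString().replace(",","\t");
            context.write(new Text(str),new Text(""));
        }
    }

    // Writes each distinct line once (defined here but not registered on the job)
    public static class LogReducer extends Reducer<Text,Text,Text,Text>{
        @Override
        protected void reduce(Text key,Iterable<Text> it,Context context) throws IOException,InterruptedException{
            context.write(key,new Text(""));
        }
    }
}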

hdfs://192.168.1.198:8020/root/source hdfs://192.168.1.198:8020/root/result
-------Telecom case study------------------------------------------------------------------------------------------------------------------
Orders and goods module
--Main order table ods_b2c_orders
drop table if exists itqsc.ods_b2c_orders;
create external table itqsc.ods_b2c_orders(
order_id bigint , --order ID
order_no string , --order number
order_date timestamp, --order date
user_id bigint , --user ID
user_name string , --login name
order_money double , --order amount
order_type string , --order type
order_status string , --order status
pay_type string , --payment type
pay_status string , --payment status
order_source string , --order source
last_update_time timestamp, --time the order was last modified
dw_date timestamp
)
partitioned by
(dt string)
LOCATION 'hdfs://hadoop0:9000/user/hadoop/dev/itqsc/ods_b2c_orders';
------------------------------------------------------------------------------
--Order goods table ods_b2c_orders_goods
--Order detail table ods_b2c_orders_desc
--Orders and goods wide table dm_b2c_orders_goods
--Orders wide table dm_b2c_orders
--Shopping cart table ods_b2c_cart
--Order metrics table dm_user_order_tag
--Goods table ods_b2c_goods
--Goods summary table dm_user_goods_amt
--Calling from a shell script--------------------------------
#!/bin/bash
#============================
#dm_b2c_orders.sh
#==============================
DT=$(date -d '-1 day' "+%Y-%m-%d")
sysdate=$(date "+%Y-%m-%d")
if [ -n "$1" ]; then
DT=$1
fi

SQL="
insert overwrite table itqsc.dm_b2c_orders partition(dt='"${DT}"')
select a.order_id,
a.order_no,
a.order_date,
a.user_id,
a.user_name,
a.order_money,
a.order_type,
a.order_status,
a.pay_type,
a.pay_status,
a.order_source,
b.consignee,
b.area_id,
b.area_name,
b.address,
b.mobilephone,
b.telphone,
b.coupon_id,
b.coupon_money,
b.carriage_money,
b.create_time,
a.last_update_time,
'${sysdate}' dw_date
from (select * from itqsc.ods_b2c_orders where dt = '"${DT}"') a
join (select * from itqsc.ods_b2c_orders_desc where dt = '"${DT}"') b
on (a.order_id = b.order_id);
"
echo "${SQL}"
hive -e "$SQL"
-------------------------------------------------------------------------------------------
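Collection types
ARRAY: a sequence of elements of the same data type; elements are accessed by index, e.g. array[1]
MAP: holds key->value pairs; elements are accessed by key, e.g. map['key']
STRUCT: can hold elements of different data types; elements are accessed with dot syntax, e.g. struct.key1

Basic Hive usage: files
File formats
textfile
Sequencefile
Rcfile

Extension interfaces
Default file reading method
Custom inputformat
Custom serde

load data inpath '/home/data/data' overwrite into table testtable;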