Summary of Hive HQL Operations

On Linux, Hive's logs are written under /tmp/username/

Invoking Hive:

hive -f script.sql
hive -e 'select * from table'
hive --config /user/another/config/directory
hive --hiveconf fs.defaultFS=hdfs://localhost

 

HQL is very similar to the SQL used in MySQL:

show tables;
dfs -ls /;
SET hive.some.parameter;   	# view the value of parameter
SET hive.some.parameter=1; 	# set the value
SET dfs.block.size;         # HDFS block size

describe tablename; 		# show the information of a table
describe function length; 

create table tablename(ID int);
load data inpath '/user/input/data.txt' into table tablename;
load data inpath '/user/input/data.txt' overwrite into table tablename;

create external table tablename(ID int); 
load data local inpath '/user/input/sample.txt' into table tablename; # with "local", data is loaded from the local filesystem, not HDFS

create table tablename (ID int, field1 int, field2 String)
ROW FORMAT DELIMITED      # means one row in file is equivalent to one row in hive table
fields terminated by ','; # column delimiter
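ROW FORMAT DELIMITED with fields terminated by ',' makes Hive split each line of the data file on the comma to produce the table's columns. The effect can be sketched in Python (the sample line and type casts are hypothetical, for illustration only):

```python
# Sketch of how Hive splits a delimited row into typed columns.
# Hypothetical sample line matching the (ID int, field1 int, field2 String) schema.
line = "42,100,hello"
schema = [("ID", int), ("field1", int), ("field2", str)]

row = {name: cast(value)
       for (name, cast), value in zip(schema, line.split(","))}
print(row)  # {'ID': 42, 'field1': 100, 'field2': 'hello'}
```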

# Creates a separate new table; CREATE-TABLE-AS-SELECT cannot create an external table
create table tablename as select * from oldtable; 

select map('a',1,'b',2);
select array(1,2);

show create table tablename;

Dropping an external table in Hive does not delete the underlying data files! Only the mapping in the metastore is removed.

 


Partition & Bucket:

# make one column as partition column
create table logs(ts bigint, line string) partitioned by (dt string);

# use hash to divide data into different groups
create table logs(ts bigint, line string) clustered by (ts) into 4 buckets;

create table logs(ts bigint, line string) clustered by (ts) sorted by (ts asc) into 5 buckets;

# use both
create table logs(ts bigint, line string)
partitioned by (dt string)
clustered by (ts) into 4 buckets;

select * from bucketed_table tablesample(bucket 1 out of 4 on id);
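CLUSTERED BY assigns each row to a bucket by hashing the bucket column, and TABLESAMPLE(BUCKET 1 OUT OF 4 ...) then reads only rows that hash to the first bucket. A rough Python sketch of the idea (hypothetical data; for Hive integer columns the hash is the value itself, but this is not Hive's exact hash for all types):

```python
# Rough sketch: hash-partitioning rows into 4 buckets, then sampling bucket 1.
# Mirrors CLUSTERED BY (ts) INTO 4 BUCKETS + TABLESAMPLE(BUCKET 1 OUT OF 4 ON ts).
NUM_BUCKETS = 4
rows = [{"ts": ts} for ts in range(10)]  # hypothetical data

def bucket_of(value, num_buckets=NUM_BUCKETS):
    # For int columns Hive's hash is the value itself (illustrative).
    return value % num_buckets

# "BUCKET 1 OUT OF 4": bucket numbering is 1-based, i.e. hash % 4 == 0.
sample = [r for r in rows if bucket_of(r["ts"]) == 0]
print(sample)  # rows with ts in {0, 4, 8}
```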

To benefit from partitioning in a JOIN, the partition column's values must be specified in the WHERE clause or the ON clause.

 

The following is quoted from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-PartitionBasedQueries

Partition Based Queries

In general, a SELECT query scans the entire table (other than for sampling). If a table is created using the PARTITIONED BY clause, a query can do partition pruning and scan only a fraction of the table relevant to the partitions specified by the query. Hive currently does partition pruning if the partition predicates are specified in the WHERE clause or the ON clause in a JOIN. For example, if table page_views is partitioned on column date, the following query retrieves rows for just days between 2008-03-01 and 2008-03-31.

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'

If a table page_views is joined with another table dim_users, you can specify a range of partitions in the ON clause as follows:

SELECT page_views.*
FROM page_views JOIN dim_users
  ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')

Using a regular expression to parse files into a table:

create table users(user STRING,address STRING,phone STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' with SERDEPROPERTIES(
    "input.regex" = "(\\d{6} (\\d{5} (.{29}))) .*"
);
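RegexSerDe maps the regex's capturing groups to the table's columns in order: group 1 → user, group 2 → address, group 3 → phone. The mapping can be checked with Python's re module (note the groups in the table definition above are nested and overlap; a flat, hypothetical three-group pattern and sample line are used here for illustration):

```python
import re

# RegexSerDe assigns capturing group i to column i (1-based).
# Hypothetical flat pattern: 6-digit user id, 5-digit address code, phone.
input_regex = r"(\d{6}) (\d{5}) (\d{3}-\d{4})"
line = "123456 90210 555-1234"  # hypothetical sample line

m = re.match(input_regex, line)
user, address, phone = m.groups()
print(user, address, phone)  # 123456 90210 555-1234
```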

Inserting data (insert into appends; overwrite clears the table before inserting; don't forget the "table" keyword):

insert into table destination_table select col1, col2 from source;

insert overwrite table destination_table select col1, col2 from source;

insert overwrite table destination_table partition (part='1') select col1, col2 from source;

insert overwrite table destination_table partition (dt) select col1, col2, dt from source; # the dt partition is generated dynamically from a source column

insert into table destination_table partition (year,month) (id,num2,num3) select id, year(dt6) as year ,month(dt6) as month  from employee  limit 1000;

insert into table destination_table partition (year,month) select id, year(dt6) as year ,month(dt6) as month  from employee  limit 1000;

insert overwrite table destination_table partition (year=2003,month) select * ,month(dt6) as month  from employee  limit 1000;

These INSERT statements actually involve Hive's dynamic and static partitioning. A dynamic partition is created at run time from the SELECT clause attached to the INSERT; a static partition is the opposite. In the last example above, year=2003 makes the year partition static, while month is generated dynamically from month(dt6).

In Hive, a static partition cannot be a subpartition of a dynamic one: for example, making year dynamic while month is static is not allowed.

Dynamic partitioning is disabled by default; enable it with:

set hive.exec.dynamic.partition=true;

By default, to keep users from accidentally creating a huge number of partitions, Hive requires at least one static partition (strict mode). If you want all partitions to be dynamic, run:

set hive.exec.dynamic.partition.mode=nonstrict;
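With dynamic partitioning, Hive derives each row's partition directory from the trailing SELECT columns and groups rows accordingly. Conceptually (hypothetical rows; the key=value directory scheme is Hive's standard partition layout):

```python
from collections import defaultdict

# Hypothetical source rows; the last two fields play the roles of the
# dynamic partition columns (year, month), as in the INSERT examples above.
rows = [(1, 2003, 1), (2, 2003, 2), (3, 2004, 1)]

partitions = defaultdict(list)
for rid, year, month in rows:
    # Hive writes each row under .../year=YYYY/month=M/
    partitions[f"year={year}/month={month}"].append(rid)

print(sorted(partitions))  # ['year=2003/month=1', 'year=2003/month=2', 'year=2004/month=1']
```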

 

Advanced HQL queries:

# use DISTRIBUTE BY so that all rows for a given year end up in the same reducer
from table
select year, temperature
distribute by year
sort by year asc, temperature desc;
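DISTRIBUTE BY year routes all rows with the same year to the same reducer; SORT BY then orders rows within each reducer only, not globally. A small Python simulation of that contract (hypothetical data, two simulated reducers):

```python
from collections import defaultdict

# Hypothetical (year, temperature) rows.
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78)]
NUM_REDUCERS = 2

# DISTRIBUTE BY year: same year -> same reducer (hash partitioning).
reducers = defaultdict(list)
for year, temp in rows:
    reducers[year % NUM_REDUCERS].append((year, temp))

# SORT BY year ASC, temperature DESC: each reducer sorts only its own rows,
# so the output is ordered per reducer, not across the whole table.
for r in reducers.values():
    r.sort(key=lambda x: (x[0], -x[1]))

print(dict(reducers))
```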

# add a file to the distributed cache, then use it as the mapper script
ADD FILE /user/script/mapper.py;
from table select transform(year, temperature, quality)
using 'mapper.py' as year, temperature;

FROM (
	FROM table
	MAP year, temperature, quality
	USING 'mapper.py'
	AS year, temperature) map_output
REDUCE year, temperature
USING 'reducer.py'
AS year, temperature;
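Scripts run via TRANSFORM / MAP / USING follow the Hadoop streaming contract: read tab-separated input columns on stdin, write tab-separated output columns on stdout. A minimal hypothetical mapper.py in that shape (the quality filter is illustrative, not from the source):

```python
import sys

def run_mapper(lines, out=sys.stdout):
    """Hadoop-streaming-style mapper: tab-separated in, tab-separated out.

    Reads (year, temperature, quality) rows and emits (year, temperature)
    for rows whose quality flag passes a hypothetical check.
    """
    for line in lines:
        year, temperature, quality = line.rstrip("\n").split("\t")
        if quality in ("0", "1"):  # illustrative quality filter
            out.write(f"{year}\t{temperature}\n")

if __name__ == "__main__":
    run_mapper(sys.stdin)
```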

UDF(User Defined Function):

create function strip as 'com.company.hive.Strip'
using jar '/path/to/some.jar'; # local path or an HDFS path
