1. Beeline connection
bin/hiveserver2
hive --service metastore
bin/beeline
!connect jdbc:hive2://zhangbk:10000 root password01
or
bin/beeline -u jdbc:hive2://zhangbk:10000 -n root -p password01
HiveServer2 JDBC
Store the analysis results in a Hive table (result); the front end then queries the data through DAO code.
2. Data compression
1) Install snappy
2) Recompile Hadoop with snappy support
Compression settings: set mapreduce.map.output.compress=true
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
set mapreduce.output.fileoutputformat.compress=true
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
input -> map -> shuffle -> reduce -> output
Benefits of data compression:
smaller data volume
less local disk I/O
less network I/O
Compression formats: bzip2, gzip, lzo, snappy, etc.; snappy is the common choice nowadays
Compression ratio: bzip2 > gzip > lzo
Decompression speed: lzo > gzip > bzip2
Typically: one block -> one map task
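The ratio/speed trade-off above can be sketched with Python's standard-library codecs (an illustration on a small text sample, not a Hadoop benchmark; lzo and snappy are not in the standard library, so only gzip and bzip2 are compared):

```python
import bz2
import gzip
import time

# A few hundred KB of repetitive, log-like text.
data = b"127.0.0.1 - frank GET /index.html 200 2326\n" * 10000

for name, codec in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = codec(data)
    elapsed = time.perf_counter() - start
    print("%-6s %8d bytes  %.4fs" % (name, len(out), elapsed))

# Both codecs shrink the data dramatically; bzip2 usually squeezes
# harder but runs slower, which is the ordering noted above.
```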
===================================================================================
file_format:
: SEQUENCEFILE -- serialized (binary) file
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE
| ORC
| PARQUET
| AVRO
| JSONFILE
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
Data storage
* row-oriented storage: TEXTFILE, SEQUENCEFILE
* column-oriented storage: RCFILE, ORC, PARQUET
The ORC format greatly reduces storage size and also speeds up queries.
Summary: in real-world projects, Hive table data should use
storage format: orcfile or parquet
data compression: snappy
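Putting the two recommendations together, a table definition might look like this (a sketch; the emp_orc table and its columns are illustrative):

```sql
CREATE TABLE emp_orc (
    empno INT,
    ename STRING,
    sal   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Populate it from an existing text-format table:
-- INSERT OVERWRITE TABLE emp_orc SELECT empno, ename, sal FROM emp;
```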
================================================================
Enterprise Hive optimization
hive-site.xml
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
<description>
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
</description>
</property>
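At session level the same property can be overridden with SET; for example (a sketch against the emp table used later in these notes):

```sql
SET hive.fetch.task.conversion=more;
-- With 'more', a simple SELECT/FILTER/LIMIT query is served as a
-- single FETCH task instead of launching a MapReduce job:
SELECT * FROM emp LIMIT 10;
```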
Split large tables into sub-tables
Use external tables and partitioned tables
Data: storage format (textfile, orcfile, parquet) plus data compression
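A sketch of the external + partitioned pattern (table name, columns, and location are illustrative):

```sql
CREATE EXTERNAL TABLE logs (
    ip     STRING,
    url    STRING,
    status STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/warehouse/logs';

-- Each day's data goes into its own partition:
-- LOAD DATA INPATH '/input/2023-01-01' INTO TABLE logs PARTITION (dt='2023-01-01');
```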
Hive Join reference: http://shiyanjun.cn/archives/588.html
Common/Shuffle/Reduce Join
the join happens in the Reduce Task; suited for large table joined with large table
table data is read from files
Map Join
the join happens in the Map Task; suited for small table joined with large table
the large table's data is read from files, the small table's from memory
SMB Join
Sort-Merge-Bucket Join
set hive.auto.convert.join=true; -- automatically choose the join strategy
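The bucketed sort-merge variant needs its own switches in addition to the auto-convert flag above (property names from the Hive configuration; they only help when both tables are bucketed and sorted on the join key):

```sql
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
```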
Execution plan:
EXPLAIN select * from emp ;
EXPLAIN EXTENDED select * from emp ;
Parallel execution:
set hive.exec.parallel=true
set hive.exec.parallel.thread.number=16
JVM reuse:
set mapreduce.job.jvm.numtasks=3
Number of reduce tasks:
set mapreduce.job.reduces=3
Speculative execution:
set hive.mapred.reduce.tasks.speculative.execution=false ;
set mapreduce.map.speculative=false ;
set mapreduce.reduce.speculative=false ;
Dynamic partition tuning:
table types: external tables, partitioned tables
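The usual dynamic-partition switches, with an insert sketch (table names are illustrative):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partitions to be dynamic

-- The partition column (dt) is taken from the last column of the SELECT:
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT ip, url, status, dt FROM logs_staging;
```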
======================================================================================
Regular expressions: https://www.regexpal.com/ for online testing
CREATE TABLE serde_regex(
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
----------------------------------------------
Embedding Python in Hive: MovieLens User Ratings
Download the data: wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
-- create the table
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- load the data
LOAD DATA LOCAL INPATH '/home/hadoop/datas/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
-- create the Python script
weekday_mapper.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
-- create the target table
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
-- add the script (session scope)
add FILE /home/hadoop/datas/py/weekday_mapper.py;
-- process the data with the Python 2 script
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
Python script problem:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row
{"userid":47,"movieid":324,"rating":3,"unixtime":"879439078"}
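One likely cause of the error above (an assumption, since the stack trace is truncated): `python` on the worker nodes resolves to Python 3, where the Python 2 `print` statement is a syntax error, so the TRANSFORM child process dies and Hive reports a runtime error on the current row. A version of weekday_mapper.py that runs under both Python 2.7 and 3, with the row logic pulled into a function:

```python
import sys
import datetime

def to_weekday(line):
    """Map one tab-separated rating row to userid, movieid, rating, weekday."""
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

if __name__ == '__main__':
    # Hive TRANSFORM streams rows on stdin and reads rows from stdout.
    for line in sys.stdin:
        print(to_weekday(line))
```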