Pitfalls of hive-0.11


1. After altering a table's schema (ADD COLUMNS), data loads into the table succeed, but the newly added columns cannot be read back. Newly created tables are not affected.


1.1 Confirm the data exists in the source table

hive>
select
os,
sre,
sco,
lla
from access_log a
where
dt='2013-09-24'
limit 100;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Num of R tasks determined at compile time: 1
Starting Job = job_201309251224_0731
Job Stage-1: num of M:429; num of R:1
09-25 16:59:59 S-1 M=3%, R=0%, CPU 72.67s
09-25 17:00:19 S-1 M=79%, R=26%, CPU 1722.15s
09-25 17:00:39 S-1 M=91%, R=27%, CPU 1984.64s
09-25 17:00:59 S-1 M=100%, R=100%, CPU 2194.84s
09-25 17:01:20 S-1 M=100%, R=100%, CPU 2194.84s
MR Total CPU time: 36 minutes 34 seconds 840 msec
Ended Job = job_201309251224_0731
MapReduce Jobs Launched:
Job 0: Map: 429 Reduce: 1 Cumulative CPU: 2194.84 sec HDFS Read: 992762379 HDFS Write: 2801 SUCCESS
Total MapReduce CPU Time Spent: 36 minutes 34 seconds 840 msec
OK
expect download select size:2801
WindowsXp 1440.900 32 zh_cn
Windows7 1366.768 32 zh_cn
WindowsXp 1280.800 32 zh_cn
WindowsXp 1366.768 32 zh_cn
WindowsXp 1280.800 32 zh_cn
WindowsXp 1366.768 32 zh_cn
Windows7 1366.768 32 zh_cn
WindowsXp 1280.1024 32 zh_cn
WindowsXp 1440.900 32 zh_cn
WindowsXp 1440.900 32 zh_cn
Windows7 1920.1080 32 zh_cn
Windows7 1366.768 32 zh_cn
WindowsGetFailed 1366.768 32 zh_cn
WindowsXp 1920.1080 32 zh_cn
Windows7 1366.768 32 zh_cn
Windows7 1366.768 32 zh_cn
WindowsXp 1920.1080 32 zh_cn
WindowsXp 1280.720 32 zh_cn
WindowsXp 1680.1050 32 zh_cn
Windows7 2560.1440 32 zh_cn
WindowsXp 1440.900 32 zh_cn
WindowsXp 1920.1080 32 zh_cn
WindowsXp 1024.768 32 zh_cn
WindowsXp 1024.768 32 zh_cn
WindowsXp 1440.900 32 zh_cn
WindowsXp 1360.768 32 zh_cn
WindowsXp 1280.960 32 zh_cn
WindowsXp 1600.900 32 zh_cn
WindowsXp 1440.900 32 zh_cn
WindowsXp 1280.1024 32 zh_cn
Windows7 1360.768 32 zh_cn
WindowsXp 1440.900 32 zh_cn
WindowsXp 1024.768 32 zh_cn

[quote]
Source table schema:
CREATE TABLE IF NOT EXISTS page_view (
access_time string COMMENT 'access time',
ip string COMMENT 'IP address',
country string COMMENT 'country',
province string COMMENT 'province',
city string COMMENT 'city',
isp string COMMENT 'broadband ISP',
mid string COMMENT 'machine ID',
path string COMMENT 'access path',
bve string COMMENT 'browser version'
) COMMENT 'login event staging table'
PARTITIONED BY (dt string)
STORED AS TEXTFILE;

Altering the table schema:
ALTER TABLE page_view ADD COLUMNS (
os string COMMENT 'operating system',
sre string COMMENT 'screen resolution',
sco string COMMENT 'screen color depth',
lla string COMMENT 'language'
);
[/quote]
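To confirm the metastore actually recorded the new columns before loading data, a quick check is (sketch; the bug described here is in reading the data, not in the metadata itself, so the new columns do show up):

```sql
-- Verify the ALTER TABLE was recorded: os, sre, sco and lla
-- should appear after bve, before the partition column dt.
DESCRIBE page_view;
```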

1.2 Run the data load

INSERT OVERWRITE TABLE page_view PARTITION (dt='2013-09-24')
SELECT
access_time,
ip,
country,
province,
city,
isp,
mid,
path,
bve,
os,
sre,
sco,
lla
FROM access_log a
WHERE
dt='2013-09-24'
DISTRIBUTE BY stime
;


1.3 Verify whether the load succeeded
hive> 
select
os,
sre,
sco,
lla
from page_view a where
dt='2013-09-24'
and sre is not null
limit 100;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Num of R tasks determined at compile time: 1
Starting Job = job_201309251224_0771
Job Stage-1: num of M:36; num of R:1
09-25 17:08:22 S-1 M=0%, R=0%
09-25 17:08:42 S-1 M=89%, R=17%, CPU 151.51s
09-25 17:09:02 S-1 M=100%, R=100%, CPU 168.92s
09-25 17:09:22 S-1 M=100%, R=100%, CPU 168.92s
MR Total CPU time: 2 minutes 48 seconds 920 msec
Ended Job = job_201309251224_0771
MapReduce Jobs Launched:
Job 0: Map: 36 Reduce: 1 Cumulative CPU: 168.92 sec HDFS Read: 965029852 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 48 seconds 920 msec
OK
expect download select size:0
Time taken: 105.48100000000001 seconds

No rows returned.


However, a hive-0.10 client using the same metastore reads the data correctly, and inspecting the files with hadoop fs -cat /dir/page_view/ shows the data on HDFS is intact.

So the conclusion is: metadata written by hive-0.11's ADD COLUMNS cannot be read back correctly even by a hive-0.11 client.


As a workaround, the only option was to flip the table property to external, drop the table (keeping the data), recreate it with the full schema, and re-add the partitions, as follows:

alter table page_view set TBLPROPERTIES ('EXTERNAL'='TRUE');
drop table page_view;
CREATE TABLE IF NOT EXISTS page_view (
access_time string COMMENT 'access time',
ip string COMMENT 'IP address',
country string COMMENT 'country',
province string COMMENT 'province',
city string COMMENT 'city',
isp string COMMENT 'broadband ISP',
mid string COMMENT 'machine ID',
path string COMMENT 'access path',
bve string COMMENT 'browser version',
os string COMMENT 'operating system',
sre string COMMENT 'screen resolution',
sco string COMMENT 'screen color depth',
lla string COMMENT 'language'
) COMMENT 'login event staging table'
PARTITIONED BY (dt string)
STORED AS TEXTFILE;

alter table page_view add partition(dt='2013-06-23');
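When the table has many partitions, adding them back one by one is tedious. Assuming the partition directories on HDFS follow the standard dt=... layout, Hive's metastore check command can rediscover them in one shot (a sketch, not from the original post):

```sql
-- Scan the table's HDFS location and register any partition
-- directories that are missing from the metastore.
MSCK REPAIR TABLE page_view;
```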



Upgrade with caution.

2. Also in hive-0.11, window functions frequently fail with "Reset on PersistentByteBasedList not supported":


Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Reset on PersistentByteBasedList not supported
at org.apache.hadoop.hive.ql.exec.PTFPersistence$PersistentByteBasedList.reset(PTFPersistence.java:895)
at org.apache.hadoop.hive.ql.exec.PTFPersistence$PartitionedByteBasedList.reset(PTFPersistence.java:456)
at org.apache.hadoop.hive.ql.exec.PTFPartition.reset(PTFPartition.java:67)
at org.apache.hadoop.hive.ql.exec.PTFOperator.processOp(PTFOperator.java:132)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:848)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
... 7 more



set mapred.reduce.tasks=16;
set hive.exec.parallel=true;
set mapred.child.java.opts=-Xmx3600m -XX:-UseGCOverheadLimit;
set hive.ptf.partition.persistence.memsize=671088640;
The settings above increase the child JVM heap and the PTF partition persistence buffer.
Additionally, use DISTRIBUTE BY rand() when generating the data.
Together these work around both the data skew and the unsupported oversized-partition-list problem.
Even so, it is best to upgrade to hive-0.12 or later to avoid falling into more pits.
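A sketch of the DISTRIBUTE BY rand() trick: spread rows randomly across reducers when materializing an intermediate table, so that no single reducer (and no single PTF partition) receives a disproportionate share of the data. Table and column names here are hypothetical:

```sql
SET mapred.reduce.tasks=16;
-- Hypothetical staging step: rand() assigns each row to a random
-- reducer, breaking up skewed key groups before the window
-- function runs against the staged table.
INSERT OVERWRITE TABLE events_staged
SELECT *
FROM events
DISTRIBUTE BY rand();
```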


3. Multi-groupby optimization fails when the same distinct column is used twice or more:
[url]https://issues.apache.org/jira/browse/HIVE-3852[/url]