1) 使用like 关键字
在使用hive DDL语句时,遇上了一些问题:
create table if not exists temp.temp1 like vv_20111023;
表创建成功,describe temp1也输出正常;
然后执行数据插入操作:
insert OVERWRITE table temp1 select a.* from vv.vv_20111023 a limit 10;
执行正常,日志如下:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201110171502_22133, Tracking URL = http://haa22:50030/jobdetails.jsp?jobid=job_201110171502_22133
Kill Command = /opt/modules/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=haa22:9001 -kill job_201110171502_22133
2011-11-01 10:32:09,433 Stage-1 map = 0%, reduce = 0%
2011-11-01 10:32:15,464 Stage-1 map = 8%, reduce = 0%
2011-11-01 10:32:16,470 Stage-1 map = 10%, reduce = 0%
2011-11-01 10:32:18,482 Stage-1 map = 25%, reduce = 0%
2011-11-01 10:32:19,488 Stage-1 map = 37%, reduce = 0%
2011-11-01 10:32:20,496 Stage-1 map = 47%, reduce = 0%
2011-11-01 10:32:21,512 Stage-1 map = 53%, reduce = 0%
2011-11-01 10:32:22,519 Stage-1 map = 54%, reduce = 0%
2011-11-01 10:32:23,562 Stage-1 map = 59%, reduce = 0%
2011-11-01 10:32:24,679 Stage-1 map = 64%, reduce = 0%
2011-11-01 10:32:25,685 Stage-1 map = 76%, reduce = 0%
2011-11-01 10:32:26,691 Stage-1 map = 80%, reduce = 0%
2011-11-01 10:32:27,697 Stage-1 map = 90%, reduce = 0%
2011-11-01 10:32:29,709 Stage-1 map = 98%, reduce = 0%
2011-11-01 10:32:30,715 Stage-1 map = 100%, reduce = 0%
2011-11-01 10:32:33,732 Stage-1 map = 100%, reduce = 25%
2011-11-01 10:32:39,764 Stage-1 map = 100%, reduce = 30%
2011-11-01 10:32:42,780 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201110171502_22133
Loading data to table temp.temp1
Deleted hdfs://haa22:9000/user/hive/warehouse/temp.db/temp1
Table temp.temp1 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 2096]
10 Rows loaded to temp1
OK
Time taken: 47.768 seconds
查看输出结果:
select * from temp1;
输出:OK
Time taken: 0.096 seconds
奇怪的是插入操作显示正常,但是在执行查询时却没有输出结果;
查看该表的数据文件:
dfs -cat /user/hive/warehouse/temp.db/temp1/000000_0;
输出:
20111023 18 219.139.149.24 2011-10-23 18:00:01 GET /vvlog.html {4A85BC70-9529-45B3-9B92-6F9DF2DB956E} newbox 9709 13 123132 2 -1 0.0 0.0 newbox1/movie/detail.html 3.11.08.04 \N 中国 湖北省 宜昌市 电信
20111023 18 58.47.57.158 2011-10-23 18:00:01 GET /vvlog.html {6C05A61C-84AC-4F23-9211-08A637187F3A} tre17640 13 207087 1 0 0.0 0.0 tree 3.11.10.12 \N 中国 湖南省 常德市 电信
20111023 18 113.111.164.148 2011-10-23 18:00:01 GET /vvlog.html {356C047F-0573-2ABC-8203-53B1913EF441} bov20555 13 237069 1 0 0.0 0.0 bov 5.03.1001.1111 \N 中国 广东省 广州市 电信
20111023 18 60.189.194.51 2011-10-23 18:00:01 GET /vvlog.html {BBEF15BB-8FC9-4CB4-8413-9A4AB12D0F1B} newbox 18248 13 212575 2 0 2490.665 2489.9822 newbox1/index/index_3.html 3.11.06.29 \N中国 浙江省 台州市 电信
20111023 18 111.79.192.55 2011-10-23 18:00:01 GET /vvlog.html {D2867EF8-B912-49E4-B557-9290567765A1} tre12052 13 152434 1 0 0.0 0.0 tree 3.11.09.26 \N 中国 江西省 宜春市 电信
20111023 18 218.93.113.61 2011-10-23 18:00:01 GET /vvlog.html {DC353BFB-3A17-4599-9E28-A080F7D0FD33} newbox 19367 13 228561 2 0 2401.314 2401.191 newbox1/movie/detail.html 3.10.12.01 \N中国 江苏省 常州市 电信
20111023 18 113.16.173.182 2011-10-23 18:00:01 GET /vvlog.html {4EEAC4F0-27AC-414F-BBAA-04220074F9F0} newbox 15776 13 185594 1 0 0.0 0.0 newbox1/search/result.html 3.11.08.04 \N 中国 广西南宁市 电信
20111023 18 219.146.114.68 2011-10-23 18:00:01 GET /vvlog.html {210D0C4B-1BD8-4C9B-9B0D-064217EEF9B9} newbox 20399 13 235465 2 0 60.371 21.960085 newbox1/movie/detail.html 3.11.10.12 \N 中国山东省 潍坊市 电信
20111023 18 180.113.39.152 2011-10-23 18:00:01 GET /vvlog.html {2A887575-077D-456B-941E-46E8AE962869} newbox 18767 13 237061 1 0 0.0 0.0 newbox1/movie/detail.html 3.11.08.04 \N 中国 江苏省 无锡市 电信
20111023 18 113.106.102.77 2011-10-23 18:00:01 GET /vvlog.html {F91F94C2-20AD-0F58-DB1C-7A8AE3D775D4} newbox 1007 24 1176 1 0 0.0 0.0 newbox1/movie/film/p2p/list-areaid-2-sid-1-p-3.shtml 5.01.0820.2122 \N 中国 广东省 珠海市 电信
很明显,数据已经成功的提取了出来,但是查询却无法输出结果,很有可能时定义表的时候产生的问题。
重新创建一张新表temp2(不使用like语法)
执行数据插入操作:
insert OVERWRITE table temp2 select a.* from vv.vv_20111023 a limit 10;
查询:
select * from temp2 limit 3;
输出结果:
20111023 18 219.139.149.24 2011-10-23 18:00:01 GET /vvlog.html {4A85BC70-9529-45B3-9B92-6F9DF2DB956E} newbox 9709 13 123132 2 -1 0.0 0.0 newbox1/movie/detail.html 3.11.08.04 \N 中国 湖北省 宜昌市 电信
20111023 18 58.47.57.158 2011-10-23 18:00:01 GET /vvlog.html {6C05A61C-84AC-4F23-9211-08A637187F3A} tre17640 13 207087 1 0 0.0 0.0 tree 3.11.10.12 \N 中国 湖南省 常德市 电信
20111023 18 113.111.164.148 2011-10-23 18:00:01 GET /vvlog.html {356C047F-0573-2ABC-8203-53B1913EF441} bov20555 13 237069 1 0 0.0 0.0 bov 5.03.1001.1111 \N 中国 广东省 广州市 电信
2)自定义列分隔符
hive目前只支持单字符分割,如果想使用多字符分割,可以采用如下方式:
create table my_table(
col1 string,
col2 string,
col3 string,
col4 string,
col5 string)ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ( "input.regex" = "正则表达式" ) STORED AS TEXTFILE;
注意:表的字段只能是string类型