关闭

01-Hive创建表

标签: hive
5478人阅读 评论(1) 收藏 举报
分类:

声明:本实验环境是Apache hadoop-2.2.0,zookeeper-3.4.5,mysql Server version: 5.1.73作为元数据库,hive版本是apache-hive-0.9.0-bin,都是apache,不是CDH和其他。本实验集群3台,一个主节点(hadoop1),三个从节点(hadoop1,hadoop2,hadoop3),zk在三个节点中都安装。本实验的数据都是自己设计和添加的。

  1. 启动zk:zkServer.sh start
  2. 启动hdfs:start-dfs.sh
  3. 启动yarn:start-yarn.sh
  4. 启动mysql:service mysqld start
  5. jps
[root@hadoop1 host]# jps
4388 RunJar
3435 NameNode
3693 SecondaryNameNode
3550 DataNode
4576 Jps
3843 ResourceManager
2076 QuorumPeerMain
3943 NodeManager

6.进入hive的命令行模式:

[root@hadoop1 host]# hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/host/hive/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201606011756_819925906.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/host/hadoop/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/host/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive> 

可能是hive版本较老了,出现以上warning警告忽略就可以了,不是error。
7. 进入default数据库

hive> show databases;
OK
default
Time taken: 4.34 seconds
hive> use default;
OK
Time taken: 0.057 seconds
hive> show tables;
OK
testtable
Time taken: 0.32 seconds
----------
目前有一个testable表,我先删除,然后再看看是怎么创建的。

8.删除testable表

hive> drop table testable;
OK
Time taken: 4.373 seconds

9.创建表testable,注意一定要在语句后面加分号“;”,否则会出错。

hive> create table testtable(name string comment 'name value',addr string comment 'addr value');
OK
Time taken: 0.262 seconds

10.看看表的描述信息

hive> desc testtable;
OK
name    string  name value
addr    string  addr value
Time taken: 0.215 seconds
#这里面有两个字段name和addr,都是string类型的,描述信息comment分别是“name value”和“addr value”。
#也可以看拓展的描述信息
hive> desc formatted testtable;
OK
# col_name              data_type               comment             

name                    string                  name value          
addr                    string                  addr value          

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 18:20:24 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtable    
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464830424          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.249 seconds

注意以上的两个地方:Location: hdfs://hadoop1:9000/user/hive/warehouse/testtable
Table Type: MANAGED_TABLE
Location表示数据的存储位置,Table Type表示这是一个内部表。除了内部表之外还会有外部表。
11.我们来建立一个外部表看看长什么样。

hive> create external table testtext(name string comment 'name value',addr string comment 'addr value');
OK
Time taken: 0.08 seconds
hive> 

12 输入desc formatted testtext;

hive> desc formatted testtext;
OK
# col_name              data_type               comment             

name                    string                  name value          
addr                    string                  addr value          

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 18:34:19 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtext     
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    EXTERNAL                TRUE                
    transient_lastDdlTime   1464831259          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.2 seconds
hive> 

现在加了关键字external后,可以看到Table Type变为了EXTERNAL_TABLE,有没有觉得很神奇?hah
13.接下来我们先把testtext表删除,再重新建立,但是【子段】之间的空格符和【行】之间的空格符

hive> drop table testtext;
OK
Time taken: 0.19 seconds
hive> create table if not exists testtext(
    > name string comment 'name value',
    > addr string comment 'addr value')
    > row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
OK
Time taken: 0.059 seconds

比之前的多了row format delimited fields terminated by ‘\t’ lines terminated by ‘\n’ stored as textfile; 这句话的意思是说字段之间以制表符(Tab)隔开,行之间以换行符隔开,format delimited意思是格式界定,格式包括字段之间的格式和行之间的格式,terminated 是结束的意思。
我们再来看一下具体的描述信息:

hive> desc formatted testtext;
OK
# col_name              data_type               comment             

name                    string                  name value          
addr                    string                  addr value          

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 18:41:03 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/testtext     
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464831663          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    field.delim             \t                  
    line.delim              \n                  
    serialization.format    \t                  
Time taken: 0.314 seconds
hive> 

信息上说,这是一个内部表,因为没有加关键字external嘛。另外看最后面的Storage Desc Params: field.delim 是\t,line.delim 是\n ;serialization.format 是\t,简单吧~

14.现在我们有两个表了testtable和testtext吧,我们插入点数据来玩玩吧。我是在本地的/usr/host 目录下建立data文件,里面的数据如下。当心:

[root@hadoop1 host]# vi data
test    table
test    table1

15 把linux上的/usr/host/data文件加载到testtable上。

hive> load data local inpath '/usr/host/data' overwrite into table testtable;
Copying data from file:/usr/host/data
Copying file: file:/usr/host/data
Loading data to table default.testtable
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/testtable
OK
Time taken: 0.981 seconds
hive> select * from testtable;
OK
test    table   NULL
test    table1  NULL
Time taken: 0.328 seconds
hive> 

哎?我select * from testtable,为啥后面多了NULL?出现这样的情况很正常,当初建立testtable的时候我就没有指定字段之间用什么分割的对吧,hive就把“test tabel”当做一个字段了,那后面还有一个字段怎么办?用NULL来表示咯,这个NULL地球人都知道的吧?
那再来插入数据一次,看看什么情况:

hive> load data local inpath '/usr/host/data'  into table testtable;
Copying data from file:/usr/host/data
Copying file: file:/usr/host/data
Loading data to table default.testtable
OK
Time taken: 0.215 seconds
hive> select * from testtable;
OK
test    table   NULL
test    table1  NULL
test    table   NULL
test    table1  NULL
Time taken: 0.159 seconds
hive> 

嘿嘿,又追加了数据了吧,注意哦,overwrite是覆盖,把overwrite换成into是追加。
16 好了,砸门再来建一个表xielaoshi

hive> drop table if exists xielaoshi;
OK
Time taken: 0.042 seconds
hive> create external table if not exists xielaoshi(
    > name string,
    > salary float,
    > meinv array<string>,
    > haoche map<string,float>,
    > haoza struct<street:string,city:string,state:string, zip:int>)
    > row format delimited fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n'
    > stored as textfile
    > location '/data/';
OK
Time taken: 0.137 seconds
hive> 

这里的字段有五个,name,salary,meinv,haoche,haoza.其中meinv所对应的数据类型是数组型的array,haoche对应的是map键值对型的,haoza对应的是struct结构体。
问题又来啦!stored as textfile是什么鬼?其实就是文件格式啦!文件格式在hive中主要是三种:textfile、Sequencefile(序列化文件,学hadoop的都会知道啦)、Rcfile。
17 来来来,看看描述信息

hive> desc formatted xielaoshi;
OK
# col_name              data_type               comment             

name                    string                  None                
salary                  float                   None                
meinv                   array<string>           None                
haoche                  map<string,float>       None                
haoza                   struct<street:string,city:string,state:string,zip:int>  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 19:15:27 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/data     
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    EXTERNAL                TRUE                
    transient_lastDdlTime   1464833727          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    colelction.delim        ,                   
    field.delim             \t                  
    line.delim              \n                  
    mapkey.delim            :                   
    serialization.format    \t                  
Time taken: 0.163 seconds
hive> 

看到那个Location木有啊?这是什么情况呢?其实很简单得,就是指定了数据位置在哪里,为毛要指定?其实这是外部表的一个特点,砸门慢慢观察。

18 看看表谢老师里面的数据吧!

hive> select * from xielaoshi;
OK
wang    123.0   ["a1","a2","a3"]    {"k1":1.0,"k2":2.0,"k3":3.0}    {"street":"s1","city":"s2","state":"s3","zip":4}
liu 456.0   ["a4","a5","a6"]    {"k4":4.0,"k5":5.0,"k6":6.0}    {"street":"s4","city":"s5","state":"s6","zip":6}
zhang   789.0   ["a7","a8","a9"]    {"k7":7.0,"k8":8.0,"k9":9.0}    {"street":"s7","city":"s8","state":"s9","zip":9}
Time taken: 0.183 seconds
hive> 

这里写图片描述

如果你是初学者你又会疑惑这数据怎么搞的呀?
其实就两步:

[root@hadoop1 host]# vi xielaoshi
wang    123     a1,a2,a3        k1:1,k2:2,k3:3  s1,s2,s3,4
liu     456     a4,a5,a6        k4:4,k5:5,k6:6  s4,s5,s6,6
zhang   789     a7,a8,a9        k7:7,k8:8,k9:9  s7,s8,s9,9

[root@hadoop1 host]# hadoop fs -put xielaoshi /data/
[root@hadoop1 host]# hadoop fs -text /data/xielaoshi
wang    123 a1,a2,a3    k1:1,k2:2,k3:3  s1,s2,s3,4
liu 456 a4,a5,a6    k4:4,k5:5,k6:6  s4,s5,s6,6
zhang   789 a7,a8,a9    k7:7,k8:8,k9:9  s7,s8,s9,9`

搞定了吧,所以接下来实验的数据其实都是这么弄来的,现在linux上编辑,需要上传到hdfs上就执行hadoop fs -put [local 文件] [目标目录]

19 hive的功能是啥?除了存储就查找嘛,当然我们要来玩玩怎么查找啊!
上代码!!!

hive> select meinv[1] from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
16/06/01 19:34:48 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20160601193434_f64326c5-c901-47b8-b0b6-163aa3ccd4ce.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/host/hadoop/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/host/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:35:10,267 null map = 0%,  reduce = 0%
2016-06-01 19:35:24,814 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:35:25,943 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:35:27,205 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
MapReduce Total cumulative CPU time: 800 msec
Ended Job = job_1464828076391_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
a2
a5
a8
Time taken: 45.655 seconds

哎哟!这屎长屎长的代码,走的是啥呀?MapReduce 啊!其实很多是警告信息,怪我咯!Total MapReduce jobs = 1,Hadoop job information for null: number of mappers: 1; number of reducers: 0。看到没,reducers是0个,这没啥奇怪的!

继续:

hive>select haoche["k2"] from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
警告信息省略
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:38:21,377 null map = 0%,  reduce = 0%
2016-06-01 19:38:32,483 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
2016-06-01 19:38:33,560 null map = 100%,  reduce = 0%, Cumulative CPU 0.71 sec
MapReduce Total cumulative CPU time: 710 msec
Ended Job = job_1464828076391_0002
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
2.0
NULL
NULL
Time taken: 31.13 seconds

为啥后面出来两个NULL?你猜~
继续:

hive> select haoza.city from xielaoshi;
Total MapReduce jobs = 1
Launching Job 1 out of 1
警告信息省略
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 19:41:14,958 null map = 0%,  reduce = 0%
2016-06-01 19:41:25,138 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
2016-06-01 19:41:26,255 null map = 100%,  reduce = 0%, Cumulative CPU 0.8 sec
MapReduce Total cumulative CPU time: 800 msec
Ended Job = job_1464828076391_0003
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
s2
s5
s8
Time taken: 28.25 seconds
hive> 

有点累了,休息一下。如果你看到此文,想进一步学习或者和我沟通,加我微信公众号:名字:五十年后
这里写图片描述
蟹蟹你啊!

2
0

查看评论
* 以上用户言论只代表其个人观点,不代表CSDN网站的观点或立场
    个人资料
    • 访问:37263次
    • 积分:577
    • 等级:
    • 排名:千里之外
    • 原创:18篇
    • 转载:0篇
    • 译文:0篇
    • 评论:6条
    联系方式
    weichat:xiehuadong1 E_mail:xiehuadong1@qq.com qq:1025699566 微信公众号:五十年后
    文章分类
    最新评论