Managed Tables
Managed tables, also called internal tables, are the default: any table created without the external keyword is a managed table.
Example:
hive> create table if not exists stocks (exchange1 string, symbol string, ymd string) row format delimited fields terminated by ',';
hive> load data inpath '/user/hadoop/stocks/stocks.csv' into table stocks;  -- load data
hive> drop table stocks;
Characteristics:
1) Within the same filesystem (HDFS), loading data actually moves the files into Hive's warehouse directory.
2) Dropping a managed table deletes its data along with it; the original files no longer exist.
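To check which kind of table you are dealing with, a quick sketch using the standard describe command (run against the stocks table from the example above):

```sql
-- Show full table metadata; the "Table Type" field reads
-- MANAGED_TABLE for managed tables and EXTERNAL_TABLE for external ones.
describe formatted stocks;
```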
External Tables
Example:
hive> create external table if not exists stocks (exchange1 string, symbol string) row format delimited fields terminated by ',';
hive> load data inpath '/user/hadoop/stocks/stocks.csv' into table stocks;  -- load data
hive> drop table stocks;
Characteristics:
1) Within the same filesystem (HDFS), loading data still moves the files into Hive's warehouse directory.
2) Dropping an external table leaves the data in Hive's warehouse directory; what is actually deleted is the metadata (the schema) stored in MySQL (the metastore).
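External tables are often created pointing at data that already lives outside the warehouse, using the standard location clause. A minimal sketch (the path here is hypothetical):

```sql
-- Point the table at an existing HDFS directory; drop table removes
-- only the metastore entry, never the files under this path.
create external table if not exists stocks_ext (exchange1 string, symbol string)
row format delimited fields terminated by ','
location '/user/hadoop/stocks';
```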
Partitioned Tables
Hive can organize a table into partitions, a mechanism for coarsely dividing a table according to the values of its partition columns (partition column).
Partitions speed up queries over slices of the data; partition columns are declared with the partitioned by keyword.
-
Static Partitions
Example:
1) Create table travel, partitioned by date dt and city city, using ',' as the field delimiter:
hive (default)> create table travel (id bigint, intro string) partitioned by (dt string, city string) row format delimited fields terminated by ',';
2) Create three files: travel_beijing, travel_changchun, travel_shanghai

travel_beijing:
1,changcheng 1day
2,gugong 2days
3,tiantan 1day
4,niaochao 1day
5,shuilifang 2day
6,wangfujing 3 days

travel_changchun:
1,diaosu 2days
2,changying 3days
3,jingyuetan 1day
4,weimanhuanggong 2days

travel_shanghai:
1,waitan 1day
2,mingzhuta 2 days
3,dishini 3days
4,gouwu 4days
3) Load the data from the local filesystem into travel:
load data local inpath '/home/dragon/travel_beijing' into table travel partition (dt='2018-08-09', city='beijing');
load data local inpath '/home/dragon/travel_changchun' into table travel partition (dt='2018-08-10', city='changchun');
load data local inpath '/home/dragon/travel_shanghai' into table travel partition (dt='2018-08-10', city='shanghai');
4) Inspect how the data is organized (note the nesting: one directory level per partition column):
hive (default)>dfs -lsr /user/hive/warehouse/
/user/hive/warehouse/travel
/user/hive/warehouse/travel/dt=2018-08-09
/user/hive/warehouse/travel/dt=2018-08-09/city=beijing
/user/hive/warehouse/travel/dt=2018-08-09/city=beijing/travel_beijing
/user/hive/warehouse/travel/dt=2018-08-10
/user/hive/warehouse/travel/dt=2018-08-10/city=changchun
/user/hive/warehouse/travel/dt=2018-08-10/city=changchun/travel_changchun
/user/hive/warehouse/travel/dt=2018-08-10/city=shanghai
/user/hive/warehouse/travel/dt=2018-08-10/city=shanghai/travel_shanghai
5) Query the data
hive (default)> select * from travel;
Notice two extra columns, travel.dt and travel.city. They do not exist in the actual data files; we call them partition columns.
travel.id	travel.intro	travel.dt	travel.city
1	changcheng 1day	2018-08-09	beijing
2	gugong 2days	2018-08-09	beijing
3	tiantan 1day	2018-08-09	beijing
4	niaochao 1day	2018-08-09	beijing
5	shuilifang 2day	2018-08-09	beijing
6	wangfujing 3 days	2018-08-09	beijing
1	diaosu 2days	2018-08-10	changchun
2	changying 3days	2018-08-10	changchun
3	jingyuetan 1day	2018-08-10	changchun
4	weimanhuanggong 2days	2018-08-10	changchun
1	waitan 1day	2018-08-10	shanghai
2	mingzhuta 2 days	2018-08-10	shanghai
3	dishini 3days	2018-08-10	shanghai
4	gouwu 4days	2018-08-10	shanghai
hive (default)> select * from travel where city='shanghai' and id=2;
travel.id	travel.intro	travel.dt	travel.city
2	mingzhuta 2 days	2018-08-10	shanghai
6) View the partitions
hive (default)> show partitions travel;
partition
dt=2018-08-09/city=beijing
dt=2018-08-10/city=changchun
dt=2018-08-10/city=shanghai
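Partitions can also be managed directly with standard alter table statements; a sketch (the dt/city values below are made up for illustration):

```sql
-- Add an empty partition ahead of loading data into it.
alter table travel add partition (dt='2018-08-11', city='shenzhen');
-- Drop a partition; on a managed table this also deletes its data directory.
alter table travel drop partition (dt='2018-08-11', city='shenzhen');
```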
7) Drawback of static partitioning:
The partition column values must be set by hand for every load; with many partitions this becomes tedious.
-
Dynamic Partitions
Dynamic partitioning workflow:
First create an unpartitioned managed table, then partition it dynamically on the desired columns (create a partitioned table and insert the data into it). A stock-data example:
1) Set the dynamic-partition properties:
set hiveconf:hive.exec.dynamic.partition=true;
set hiveconf:hive.exec.dynamic.partition.mode=nonstrict;
set hiveconf:hive.exec.max.dynamic.partitions=100000;
set hiveconf:hive.exec.max.dynamic.partitions.pernode=100000;
-- Alternatively, put these statements in ~/.hiverc; the client loads that file on startup.
2) Create a stock table stocks:
hive (hive2)> create table stocks (exchange1 string, ymd string, price_open float) row format delimited fields terminated by ',';
3) Load some data into stocks:
$ hadoop fs -put /mnt/hgfs/share/dataTest/stocks_17.csv /user/hive/warehouse/hive2.db/stocks/
4) Create a partitioned stock table stocks_partitions (same columns and types as stocks), partitioned by year and month:
hive (hive2)> create table stocks_partitions (exchange1 string, ymd string, price_open float) partitioned by (year string, month string) row format delimited fields terminated by ',';
5) Select from stocks and write the rows into stocks_partitions:
insert overwrite table stocks_partitions partition (year, month)
select exchange1, ymd, price_open, substr(ymd,0,4) as year, substr(ymd,6,2) as month from stocks;
-- substr() extracts a substring: the first four characters of ymd become year,
-- and the two characters starting at position 6 become month.
Note that the select list must match the columns of stocks_partitions, followed by the dynamic partition columns in order.
6) After the job finishes, check the partition layout:
hive (hive2)> dfs -lsr /user/hive/warehouse/hive2.db/;
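With the partitions in place, filtering on the partition columns lets Hive scan only the matching directories (partition pruning). A sketch; the year/month values assume the 2017 stock data loaded above:

```sql
-- Only the year=2017/month=01 directory is read; all other partitions are skipped.
select exchange1, ymd, price_open
from stocks_partitions
where year = '2017' and month = '01';
```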
7) If a task hits its physical memory limit, give the VM more memory and raise the relevant YARN/MapReduce memory parameters.
[Which parameter to raise depends on the job you are running; you can also raise both.]
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
  <description>Amount of memory to request from the scheduler for each map task</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
  <description>Amount of memory to request from the scheduler for each reduce task</description>
</property>
Bucketing
Meaning: bucketing essentially splits a table's data into separate files. The number of buckets in Hive equals the number of reduce tasks in Hadoop.
-
First enable bucketing (the default is false):
hive> set hive.enforce.bucketing=true;
-
Create a bucketed table
hive> create table buckets_logs (id int, line string) clustered by (id) into 3 buckets;
Syntax: clustered by (column) into N buckets
[buckets the rows into N files by the given column]
-
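Rows are assigned to buckets by hashing the clustering column modulo the bucket count. The target bucket can be previewed with Hive's built-in hash and pmod functions; a sketch against the logs source table used below:

```sql
-- Each row lands in bucket pmod(hash(id), 3),
-- corresponding to the files 000000_0 .. 000002_0.
select id, pmod(hash(id), 3) as bucket from logs limit 5;
```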
Write data
hive> insert overwrite table buckets_logs select id, line from logs;
(logs has the same columns as buckets_logs and already contains data.)
-
View the bucket files
hive> dfs -lsr /user/hive/warehouse/hive.db;  -- database path
/user/hive/warehouse/hive.db/buckets_logs
/user/hive/warehouse/hive.db/buckets_logs/000000_0
/user/hive/warehouse/hive.db/buckets_logs/000001_0
/user/hive/warehouse/hive.db/buckets_logs/000002_0
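Bucketed tables also support efficient sampling via the standard tablesample clause; a sketch reading just one of the three buckets:

```sql
-- Read only bucket 1 of 3, i.e. roughly a third of the rows, from a single file.
select * from buckets_logs tablesample (bucket 1 out of 3 on id);
```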