A First Look at Hive and Its Basic Syntax (Part 1)

jhchengxuyuan

Article link: https://blog.csdn.net/jhchengxuyuan/article/details/100717485

hive

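Background of Hive:

Facebook developed Hive to analyze massive data sets through a SQL-like language instead of hand-written MapReduce (MR) jobs.

Definition of Hive

Hive is data warehouse software that uses a SQL-like language to read, write, and manage large data sets stored on a cluster. Hive can impose structure on data that already exists in storage, and it provides a command-line tool and a JDBC driver for users to connect to it.

Relationship between Hive and Hadoop:

Hive is built on Hadoop. Hive itself neither stores nor processes data: storage is left to HDFS and computation to an execution engine such as MapReduce, while Hive only translates SQL into an execution plan.

Features of Hive:

Scalable
	A Hive cluster can be scaled freely, usually without restarting the service.
Extensible
	Hive supports user-defined functions, so users can implement functions for their own needs.
Fault tolerant
	Hive has good fault tolerance: a SQL statement can still finish even if a node fails.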

The official website describes Hive as providing:

1. Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
2. A mechanism to impose structure on a variety of data formats
3. Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
4. Query execution via Apache Tez™, Apache Spark™, or MapReduce
5. Procedural language with HPL-SQL
6. Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

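Hive architecture:

Overview of the Hive architecture

[Figure: Hive architecture diagram (image not available)]

User clients: CLI, JDBC/ODBC, web GUI
Thrift server: the service through which third-party clients access Hive
Metastore: stores Hive's metadata (database names, table names, column names, column types, partitions, buckets, creation time, creator, and so on).
Parser: turns an HQL statement into an abstract syntax tree.
Compiler: performs lexical, syntactic, and semantic analysis of the HQL statement (this requires the metadata) and then produces an execution plan in the form of a directed acyclic graph. With the MapReduce engine, the plan is compiled into MapReduce jobs.
Optimizer: optimizes the execution plan, for example by pruning unneeded columns and by using partitions and indexes. In other words, it optimizes the jobs.
Executor: submits the optimized execution plan to YARN on Hadoop for execution. In other words, it submits the jobs.

Hive metadata storage:

Derby: Hive's embedded database. It holds only a small amount of data and supports a single session.

MySQL (or another relational database with JDBC support): holds much more data and supports multiple sessions.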

Installing Hive:

1. With the embedded Derby database
[root@hadoop01 home]# tar -zxvf /home/apache-hive-1.2.1-bin.tar.gz -C /usr/local/
[root@hadoop01 home]# mv /usr/local/apache-hive-1.2.1-bin/ /usr/local/hive-1.2.1
Configure the environment variables:
vi /etc/profile

export HIVE_HOME=/usr/local/hive-1.2.1/

export PATH=$PATH:$JAVA_HOME/bin:$MONGODB_HOME/bin:$ZK_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$FLUME_HOME/bin:$SQOOP_HOME/bin:

source /etc/profile
which hive


The metadata is stored in Derby, so no further setup is needed; just run hive.

2. Using MySQL as the metastore database

a. Configuration:
vi ./conf/hive-site.xml
Add the following content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<!-- root directory of the Hive warehouse -->
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
</property>

<!-- Hive metastore service -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01:9083</value>
</property>

<!-- MySQL connection string -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<!-- MySQL JDBC driver -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<!-- MySQL login user -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>

<!-- MySQL login password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>

</configuration>

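b. Copy the MySQL driver jar into the lib directory of the Hive installation:
[root@hadoop01 hive-1.2.1]# cp /home/mysql-connector-java-5.1.6-bin.jar ./lib/

c. Create the metastore database manually and set its character set to latin1.

d. Start the metastore service:
[root@hadoop01 hive-1.2.1]# hive --service metastore &
e. Test:
hive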

3. Installing Hive on another server that accesses the metastore server
a. Configuration:
vi ./conf/hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop01:9083</value>
</property>

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>

<!-- HiveServer2 properties -->
 <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
</property>

</configuration>

b. Connection test:
hive

Basic SQL operations (a skeleton of the clause order, not runnable as written):

set  ...;
with tmp as()
select
from(
    select
    a.id id, -- second-level id
    a.name name

    from test a
    left join test1 b
    on ...
    join ...
    where 
    group by
    having
    order by/sort by

    union /union all
)

In Hive, database and table names cannot be keywords or strings that start with a digit, and they are not case-sensitive.

test.a1 test_user_userInfo user_id user order

phonx.user_login phonix.user_register

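Creating a database:

create database if not exists gjh24;

**What creating a database really does:** it creates a directory under Hive's warehouse directory, named <database name>.db.

Switching databases:

use gjh24;

**Creating a table:** in essence this creates a directory and maps it to the metadata.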

create table gjh24.t_user(id int,name string);

create table t_user(id int,name string);
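The directory mapping described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not Hive code: the helper names `database_dir` and `table_dir` are made up, and the warehouse root is assumed to be the `/user/hive/warehouse` value configured earlier.

```python
import posixpath

# Assumed value of hive.metastore.warehouse.dir (taken from the config above).
WAREHOUSE_DIR = "/user/hive/warehouse"


def database_dir(db_name):
    """Directory that backs a database: <warehouse>/<db>.db.
    The built-in default database uses the warehouse root itself."""
    if db_name == "default":
        return WAREHOUSE_DIR
    return posixpath.join(WAREHOUSE_DIR, db_name + ".db")


def table_dir(db_name, table_name):
    """Directory that backs a table created without an explicit location."""
    return posixpath.join(database_dir(db_name), table_name)


print(database_dir("gjh24"))
print(table_dir("gjh24", "t_user"))
```

The second path is the same directory that the `hdfs dfs -put` and `location` examples in this article point at.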

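Hive's default delimiters:

Default column (field) delimiter: ^A, written as \u0001 or \001

Default collection item delimiter: ^B, written as \002

Default map key delimiter: ^C, written as \003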
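To make these control characters concrete, here is a minimal Python sketch (not Hive code) that splits one text line laid out with the default delimiters. The table shape (id int, name string, hobbies array<string>, scores map<string,int>), the sample row, and the function name `parse_default_row` are all invented for illustration.

```python
FIELD_SEP = "\x01"  # ^A, between columns
ITEM_SEP = "\x02"   # ^B, between items of an array or entries of a map
KEY_SEP = "\x03"    # ^C, between a map key and its value


def parse_default_row(line):
    """Split one line of a hypothetical table
    (id int, name string, hobbies array<string>, scores map<string,int>)."""
    raw_id, name, raw_hobbies, raw_scores = line.rstrip("\n").split(FIELD_SEP)
    hobbies = raw_hobbies.split(ITEM_SEP)
    scores = {}
    for entry in raw_scores.split(ITEM_SEP):
        key, value = entry.split(KEY_SEP)
        scores[key] = int(value)
    return int(raw_id), name, hobbies, scores


# One made-up row written with the default delimiters.
sample = "1\x01tom\x01read\x02run\x01math\x0390\x02art\x0385\n"
print(parse_default_row(sample))
```

Because these characters rarely appear in real text, they are safe defaults; the tables in this article use a space instead because the sample files are space-separated.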

create table if not exists t_user1(
id int comment "this is userid",
name string
)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile -- storage format
location '/user/hive/warehouse/gjh24.db/t_user'  -- must be a directory in HDFS; files already in it become the table's data
;

Table creation syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] TABLENAME(
[COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...] )

[COMMENT 'TABLE COMMENT']
[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]
[CLUSTERED BY (COLUMNNAME,...) [SORTED BY (COLUMNNAME [ASC|DESC],...)] INTO NUM_BUCKETS BUCKETS]
[ROW FORMAT ROW_FORMAT]
[STORED AS FILEFORMAT]
[LOCATION HDFS_PATH]

;

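Differences between managed (internal) and external tables:
1. Tables are managed by default; creating an external table requires the external keyword.
2. External tables are the usual choice (long-lived tables, large data volumes, data whose blocks should not be deleted). Managed tables suit temporary tables, or cases where all of the data and metadata can be wiped once the table is no longer needed.
3. Dropping a managed table deletes both the metadata and the table's directory in HDFS, whereas dropping an external table deletes only the metadata and keeps the data directory in HDFS.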


create external table t_user2(id int,name string);
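The drop behaviour in point 3 can be modelled with a toy Python simulation. It is a sketch under simple assumptions: the metastore is a plain dict, HDFS is a set of directory paths, and `drop_table` is a made-up helper, not a Hive API.

```python
def drop_table(metastore, hdfs_dirs, table_name):
    """Simulate `drop table`: the metadata entry is always removed,
    the data directory only when the table is managed (not external)."""
    table = metastore.pop(table_name)  # metadata goes in both cases
    if not table["external"]:
        hdfs_dirs.discard(table["location"])  # managed: data goes too


# Hypothetical metastore entries and the HDFS directories behind them.
metastore = {
    "t_user": {"external": False,
               "location": "/user/hive/warehouse/gjh24.db/t_user"},
    "t_user2": {"external": True,
                "location": "/user/hive/warehouse/gjh24.db/t_user2"},
}
hdfs_dirs = {table["location"] for table in metastore.values()}

drop_table(metastore, hdfs_dirs, "t_user")
drop_table(metastore, hdfs_dirs, "t_user2")
print(metastore)  # both metadata entries are gone
print(hdfs_dirs)  # only the external table's directory is left
```

This is why external tables are the safer default for shared or long-lived data: an accidental drop loses only the table definition, which can be recreated over the same directory.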

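Loading data into Hive tables:

1. Upload the data file straight into the table's directory in HDFS with an HDFS command.
[root@hadoop01 hivedata]# hdfs dfs -put ./t1 /user/hive/warehouse/gjh24.db/t_user/

2. Point the table at its directory with location when creating the table.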
create table if not exists gjh24.t_user2(
id int comment "this is userid",
name string
)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile -- storage format
location '/user/hive/warehouse/gjh24.db/t_user'  -- must be a directory in HDFS; files already in it become the table's data
;

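3. Loading data with load
load data local inpath '/home/hivedata/t1' into table gjh24.t_user; -- a local file is copied
load data local inpath '/home/hivedata/t' overwrite into table gjh24.t_user;
load data inpath '/t1' overwrite into table gjh24.t_user;   -- an HDFS file is moved

4. Using insert into
Method 1:
set hive.exec.mode.local.auto=true; -- run in local mode when possible
insert into table t_user2
select id,name from t_user;

Method 2:
-- multi-insert: every insert clause needs its own select
-- (t_user3 is assumed to exist with columns id and name)
from t_user
insert into table t_user2
select
id,
name
insert into table t_user3
select
id,
name
;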

from t_user
insert into table t_user2
select
id,
name
where id > 2
;

Method 3:
set hive.exec.mode.local.auto=true;
with tmp as(
select
id,
name
from t_user
)
insert into table t_user2
select * from tmp;

5. Using CTAS (create table ... as select)
create table t_user3
as
select
name
from t_user
;

6. Using like (clones the table definition)
create table t_user4 like t_user2;
create table t_user4 like t_user2 location '/user/hive/warehouse/gjh24.db/t_user2';

And so on.

Note:
Methods 1, 2, and 4 are the most commonly used.

Viewing a table's description:

desc t_user2;  -- show table information
desc extended t_user2;  -- show detailed table information
show create table t_user2;  -- same as in MySQL
describe t_user2;

CREATE TABLE gjh24.log1(
id string COMMENT 'this is id column',
phonenumber bigint,
mac string,
ip string,
url string,
status1 string,
status2 string,
upflow int,
downflow int,
status3 string,
dt string
)
COMMENT 'this is log table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
stored as textfile;

load data local inpath '/home/data.log' into table gjh24.log1; -- a local file is copied


set hive.exec.mode.local.auto=true;
select
l.url url,
round(sum(l.upflow)/1024.0,2) upflows,
round(sum(l.downflow/1024.0),2) downflows,
round(sum(l.upflow+l.downflow)/1024.0,2) total_flows
from gjh24.log1 l
group by l.url
order by total_flows desc
limit 3
;
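For readers who find the aggregation easier to follow step by step, the query above can be mirrored in plain Python. This is only a sketch of the same group by / order by / limit logic: the function name `top_urls` and the sample rows are made up, and each row is reduced to (url, upflow, downflow).

```python
from collections import defaultdict


def top_urls(rows, n=3):
    """Sum upflow and downflow per url, convert to KB, and return the n urls
    with the largest total, like group by + order by desc + limit."""
    up = defaultdict(int)
    down = defaultdict(int)
    for url, upflow, downflow in rows:
        up[url] += upflow
        down[url] += downflow
    result = [
        (url,
         round(up[url] / 1024.0, 2),
         round(down[url] / 1024.0, 2),
         round((up[url] + down[url]) / 1024.0, 2))
        for url in up
    ]
    result.sort(key=lambda row: row[3], reverse=True)  # order by total desc
    return result[:n]                                  # limit n


# Hypothetical log rows: (url, upflow, downflow).
rows = [
    ("a.com", 1024, 2048),
    ("b.com", 512, 512),
    ("a.com", 1024, 0),
    ("c.com", 4096, 4096),
    ("d.com", 256, 256),
]
for row in top_urls(rows):
    print(row)
```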

Modifying tables:

Table name: rename to
alter table log1 rename to log3;

Column name / column type / column position:
alter table t7 change column name name1 string;
alter table log3 change column id myid int after mac;
alter table log3 change column myid id string first;

Adding columns: add columns
alter table log3 add columns(dt1 int,dt2 string);


Dropping columns: replace columns (list only the columns to keep)
alter table log3 replace columns(
id string,
phonenumber bigint,
mac string,
ip string,
url string,
status1 string,
status2 string,
upflow int,
downflow int,
status3 string,
dt string
);


Managed and external tables:
alter table log3 set tblproperties('EXTERNAL'='TRUE'); -- managed to external; TRUE must be uppercase
alter table log3 set tblproperties('EXTERNAL'='false'); -- external to managed; false works in either case

Notes:
There is no before or last keyword for positioning a column (only first and after).
For the EXTERNAL property, false may be upper or lower case, but TRUE must be uppercase.

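Dropping a database

set hive.cli.print.current.db=true; -- show the current database in the CLI prompt

drop database gjh24;

drop database gjh24 cascade; -- force the drop even if the database still contains tables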

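Partitioned tables in Hive:

Why partition:
Partitioning avoids full table scans and so speeds up queries. Without partitions, a query scans the whole table.

What to partition by:
Dates, regions, or any other column that spreads the data out evenly.

Partitioning syntax:
[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]
1. Partition values are case-sensitive (partition column names, like other column names, are not).
2. A partition column is a pseudo-column, but it can still be used in queries like a regular column.
3. A table can have one or more partitions, and a partition can in turn contain one or more sub-partitions.
4. Partition columns must be columns that are not already declared in the table body.

What a partition really is:
A further directory created under the table's directory or under a parent partition's directory. The directory is named column=value (for example dt=2019-09-09).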
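The directory naming rule can be sketched as a small Python helper. It is an illustration only: `partition_dir` is a made-up name, the table directories are assumed to live under the gjh24 database used in this article, and the escaping Hive applies to special characters in partition values is ignored.

```python
import posixpath


def partition_dir(table_dir, partition_spec):
    """Build the directory of a partition from ordered (column, value) pairs.
    Each partition level adds one `column=value` directory."""
    levels = ["{}={}".format(column, value) for column, value in partition_spec]
    return posixpath.join(table_dir, *levels)


# One-level partition, as in a table partitioned by (dt string).
print(partition_dir("/user/hive/warehouse/gjh24.db/part1",
                    [("dt", "2019-09-09")]))

# Two-level partition, as in a table partitioned by (year int, month int).
print(partition_dir("/user/hive/warehouse/gjh24.db/part2",
                    [("year", 2019), ("month", 9)]))
```

A filter on a partition column therefore lets Hive read only the matching directories instead of every file of the table.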

Examples:
Creating a table with one partition level:
create table if not exists part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ' '
;

Loading data:
load data local inpath '/home/hivedata/t1' overwrite into  table part1 partition(dt='2019-09-09');
load data local inpath '/hivedata/user.txt' into table part1 partition(dt='2018-03-20');

Creating a table with two partition levels:
create table if not exists part2(
id int,
name string
)
partitioned by (year int,month int)
row format delimited fields terminated by ' '
;

Loading data:
load data local inpath '/home/hivedata/t1' overwrite into  table part2 partition(year=2019,month=9);
load data local inpath '/home/hivedata/t' overwrite into  table part2 partition(year=2019,month=10);

Querying the contents of a single partition:
select * from part2 where year=2019 and month=10;

Modifying partitions:

1. Listing partitions
show partitions table_name;


2. Adding partitions
alter table part1 add partition(dt='2019-09-10');
alter table part1 add partition(dt='2019-09-13') partition(dt='2019-09-12');
alter table part1 add partition(dt='2019-09-11') location  '/user/hive/warehouse/gjh1704.db/part1/dt=2019-09-10';

3. Renaming a partition
alter table part1 partition(dt='2019-09-10') rename to partition(dt='2019-09-14');

4. Changing a partition's location
alter table part1 partition(dt='2019-09-14') set location '/user/hive/warehouse/gjh24.db/part1/dt=2019-09-09';    -- incorrect usage: path without the hdfs:// scheme
alter table part1 partition(dt='2019-09-14') set location 'hdfs://hadoop01:9000/user/hive/warehouse/gjh24.db/part1/dt=2019-09-09';  -- correct: absolute path with scheme, host, and port

5. Dropping partitions
alter table part1 drop partition(dt='2019-09-14');
alter table part1 drop partition(dt='2019-09-12'),partition(dt='2019-09-13');


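Static partitioning: the data is loaded into a partition whose value is given explicitly.
Dynamic partitioning: the partition values are not known in advance; the partitions to create are determined from the values in the data.
Mixed partitioning: static and dynamic partition columns are used together.

Dynamic partition properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;   -- or nonstrict
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

strict: at least one partition column must be static.
nonstrict: all partition columns may be dynamic, but the number of dynamic partitions should be estimated beforehand.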
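The strict / nonstrict rule can be written down as a tiny Python check. This sketch covers only the rule stated above (strict mode needs at least one static partition column); the function name `allowed_in_mode` and the dict representation, where None marks a dynamic column, are assumptions made for illustration.

```python
def allowed_in_mode(partition_spec, mode):
    """partition_spec maps each partition column to its static value,
    or to None when the column is dynamic."""
    if mode not in ("strict", "nonstrict"):
        raise ValueError("mode must be 'strict' or 'nonstrict'")
    has_static = any(value is not None for value in partition_spec.values())
    has_dynamic = any(value is None for value in partition_spec.values())
    if mode == "strict" and has_dynamic and not has_static:
        return False  # all-dynamic inserts are rejected in strict mode
    return True


# partition(year=2019, month): mixed, fine in strict mode.
print(allowed_in_mode({"year": 2019, "month": None}, "strict"))
# partition(dt): fully dynamic, needs nonstrict mode.
print(allowed_in_mode({"dt": None}, "strict"))
print(allowed_in_mode({"dt": None}, "nonstrict"))
```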

Examples:
create table if not exists dy_part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ' '
;

load data local inpath '/home/hive/c3.txt' overwrite into  table part1 partition(dt='2019-09-09');

set hive.exec.mode.local.auto=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- every partition column below is dynamic
insert into table dy_part1 partition(dt)
select
id,
name,
dt
from part1
;
Dynamic partitioning (part3 is assumed to have columns id and name and to be partitioned by year):
insert into table part3 partition(year)
select
id,
name,
year
from part2;
Mixed partitioning:
create table if not exists dy_part2(
id int,
name string
)
partitioned by (year int,month int)
row format delimited fields terminated by ' '
;

set hive.exec.mode.local.auto=true;
set hive.exec.dynamic.partition.mode=strict;
insert into table dy_part2 partition(year=2019,month)
select
id,
name,
month
from part2
where year=2019
;

Hive's strict mode:

  <property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>
  
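Strict mode blocks five kinds of queries:
1. Cartesian products
set hive.mapred.mode=strict;
select
*
from dy_part1 d1
join dy_part2 d2
;

2. Queries on a partitioned table without a filter on a partition column
set hive.mapred.mode=strict;
-- allowed: filters on the partition column dt
select
*
from dy_part1 d1
where d1.dt='2019-09-09'
;

Not allowed (filters only on a regular column):
select
*
from dy_part1 d1
where d1.id > 2
;

Allowed (filters on the partition column year):
select
*
from dy_part2 d2
where d2.year >= 2019
;

3. order by without limit raises an error
select
*
from log3
order by id desc
;

4. Comparing bigints and strings.

5. Comparing bigints and doubles.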

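Hive's read/write model:

Hive is strictly schema-on-read: it does not check the data when it is written, and on read any value that does not fit the schema is replaced with NULL.
MySQL is schema-on-write: it checks the data when it is written and raises an error if something is wrong.

load data local inpath '/home/hivedata/t' into  table t_user; -- Hive: the file is accepted without any check
insert into stu(id,sex) value(1,abc); -- MySQL: checked at write time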
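Schema-on-read can be illustrated with a short Python sketch. It is a simplified model, not Hive's actual reader: the function name `read_row`, the two supported types, and the sample lines are invented, and the space delimiter matches the tables created in this article.

```python
def read_row(line, schema, sep=" "):
    """Apply a schema to a stored text line at read time.
    A value that cannot be converted, or a missing column, becomes None
    (the Python stand-in for NULL) instead of raising an error."""
    casters = {"int": int, "string": str}
    fields = line.rstrip("\n").split(sep)
    row = []
    for index, (_name, type_name) in enumerate(schema):
        if index >= len(fields):
            row.append(None)  # column missing in the file
            continue
        try:
            row.append(casters[type_name](fields[index]))
        except ValueError:
            row.append(None)  # value does not fit the declared type
    return row


schema = [("id", "int"), ("name", "string")]
print(read_row("1 tom", schema))
print(read_row("abc jerry", schema))
print(read_row("3", schema))
```

The file itself is never rejected: the load always succeeds, and problems only show up as NULL values when a query reads the rows.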
