Big Data: Hive

Latest recommended article published on 2024-05-16 20:12:02

晓枫桥亭

Category: Big Data Analysis. Tags: Big Data, Hive, BigData

Article link: https://blog.csdn.net/heaterhip/article/details/97366889

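Hive Technology

Introduction

What Is Hive

Hive was open-sourced by Facebook and donated to the Apache Software Foundation, where it is a top-level project: hive.apache.org
Hive is a data warehouse (DataWareHouse) technology built on top of big data technologies.
    Database (DataBase)
           small data volume, high value density
    Data warehouse (DataWareHouse)
           large data volume, low value density
Underneath, Hive relies on HDFS for storage and MapReduce for computation.

Benefits of Hive

With Hive, programmers write SQL statements and Hive translates them into MapReduce jobs to run, which simplifies the programmer's work.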

How Hive Works

Setting Up the Hive Environment

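1. Prepare the Linux server: IP-to-hostname mapping, hostname, firewall disabled, SELinux disabled, passwordless SSH login, JDK
2. Set up the Hadoop environment
3. Install Hive
   3.1 Unpack the Hive archive
   3.2 Rename hive_home/conf/hive-env.sh.template to hive-env.sh and set
       HADOOP_HOME=/opt/install/hadoop-2.5.2
       export HIVE_CONF_DIR=/opt/install/apache-hive-0.13.1-bin/conf
   3.3 Create two directories in HDFS
       /tmp
       /user/hive/warehouse
       bin/hdfs dfs -mkdir /tmp
       bin/hdfs dfs -mkdir -p /user/hive/warehouse
   3.4 Start Hive
       bin/hive
   3.5 Check with jps
       a RunJar process should be listed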

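Basic Hive Operations

# Create a database
create database [if not exists] hive_1;
# List all databases
 show databases;
# Switch to a database
 use db_name;
# Drop a database: the plain form only works on an empty database; cascade also drops its tables
 drop database db_name;
 drop database db_name cascade;
# What a database really is
  A Hive database is an HDFS directory: /user/hive/warehouse/hive_1.db
  
# List all tables in the current database
  show tables;
# Create a table
  create table t_user(
    id int ,
    name string
   )row format delimited fields terminated by '\t';
# What a table really is
  A Hive table is an HDFS directory: /user/hive/warehouse/hive_1.db/t_user
# Drop a table
  drop table t_user;
  
# Load data into a Hive table
  load data local inpath '/root/hive/data' into table t_user;
# What loading data really does
  load data local inpath '/root/hive/data' into table t_user;
  1. Loading data is essentially an HDFS file upload
  bin/hdfs dfs -put /root/hive/data /user/hive/warehouse/hive_1.db/t_user
  2. If a file with the same name is loaded again, Hive renames the new file automatically
  3. When a table is queried, Hive reads the contents of every file in the table's directory and returns them together
  
  
# SQL (SQL-like; HQL = Hive Query Language)
select * from t_user;
select id from t_user;
1. Hive translates SQL into MapReduce (a pure data-cleansing query has no Reduce phase)
2. Hive runs MapReduce in most cases, but not for select * with limit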
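
The idea that a table is just a directory of files can be sketched in plain Python. This is an illustration only, not Hive code: a local temporary directory stands in for /user/hive/warehouse/hive_1.db/t_user, and the helper names `load_file` and `scan_table`, as well as the `_copy_N` renaming scheme, are assumptions made for this example.

```python
import os
import shutil
import tempfile


def load_file(src_path, table_dir):
    """Copy src_path into table_dir, picking a new name on a clash (like LOAD DATA LOCAL)."""
    os.makedirs(table_dir, exist_ok=True)
    name = os.path.basename(src_path)
    target = os.path.join(table_dir, name)
    n = 1
    while os.path.exists(target):
        # Assumed naming scheme for duplicates: data, data_copy_1, data_copy_2, ...
        target = os.path.join(table_dir, f"{name}_copy_{n}")
        n += 1
    shutil.copy(src_path, target)
    return target


def scan_table(table_dir, sep="\t"):
    """Read every file in the table directory and split each line into columns."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name), encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    rows.append(line.split(sep))
    return rows


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data")
with open(src, "w", encoding="utf-8") as f:
    f.write("1\tsuns1\n2\tsuns2\n")

table_dir = os.path.join(workdir, "warehouse", "hive_1.db", "t_user")
first = load_file(src, table_dir)
second = load_file(src, table_dir)   # same file again: stored under a new name
rows = scan_table(table_dir)
print(os.path.basename(second), len(rows))
```

Loading the same file twice leaves two files in the directory, and the scan returns the rows of both, which is why a repeated load shows up as duplicated rows in a query.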

Replacing the MetaStore

The Hive MetaStore stores the mapping between the HDFS directory structure and the table structure. By default, however, the MetaStore uses an embedded Derby database, which allows only one client to connect at a time.

Replacing the Derby Metadata Database with MySQL (or Oracle)

0. Delete the HDFS directory /user/hive/warehouse and create it again
1. Install MySQL on Linux
   yum -y install mysql-server
2. Start the MySQL service and set the administrator password
   service mysqld start
   /usr/bin/mysqladmin -u root password '123456'
3. Enable remote access to MySQL
   GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456';
   flush privileges;
   use mysql;
   delete from user where host like 'hadoop%';
   delete from user where host like 'l%';
   delete from user where host like '1%';
   service mysqld restart
4. Create conf/hive-site.xml
   mv hive-default.xml.template hive-site.xml
   Set the following properties in hive-site.xml:
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://hadoop21:3306/metastore?createDatabaseIfNotExist=true</value>
     <description>the URL of the MySQL database</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
     <description>Driver class name for a JDBC metastore</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>123456</value>
   </property>

   <property>
     <name>hive.cli.print.header</name>
     <value>true</value>
   </property>

   <property>
     <name>hive.cli.print.current.db</name>
     <value>true</value>
   </property>
5. Upload the MySQL JDBC driver jar to hive_home/lib

Hive Syntax Details

HQL (SQL)

1. Basic queries
   select * from table_name # does not start MR
   select id from table_name # starts MR
2. Conditional queries: where
   select id,name from t_users where name = 'suns1';
   2.1 Comparison operators  =  !=  >=  <=
       select id,name from t_users where age > 20;
   2.2 Logical operators  and  or  not
       select id,name,age from t_users where name = 'suns' or age>30;
   2.3 Predicates
       between and
       select name,salary from t_users where salary between 100 and 300;
       in
       select name,salary from t_users where salary in (100,300);
       is null
       select name,salary from t_users where salary is null;
       like
       select name,salary from t_users where name like 'suns%';
       select name,salary from t_users where name like 'suns__';
       select name,salary from t_users where name like 'suns%' and length(name) = 6;
3. Sorting: order by [implemented with the MapReduce sort: map-side sort, group sort, compareTo]
   select name,salary from t_users order by salary desc;
4. Deduplication: distinct
   select distinct(age) from t_users;
5. Paging [MySQL lets you specify the starting row; the Hive version used here does not]
   select * from t_users limit 3;
6. Aggregate (group) functions: count() avg() max() min() sum()
   count(*) counts every row; count(id) counts only the rows where id is not null
7. group by
   select max(salary) from t_users group by age;
   Rule: the select list may contain only the grouping columns and aggregate functions (Oracle raises an error; MySQL does not, but the result is wrong)
8. having
   Conditions on aggregate functions after grouping go in having
   select max(salary) from t_users group by age having max(salary) > 800;
9. Subquery support in Hive is limited: subqueries are allowed in the from clause, and Hive 0.13 adds in / exists subqueries in the where clause
10. Hive built-in functions
    show functions

    length(column_name)  returns the length of the string in the column
    substring(column_name,start_pos,total_count)
    concat(col1,col2)
    to_date('yyyy-MM-dd')
    year(date)  returns the year
    month(date)
    date_add
    ....
    select year(to_date('1999-10-11')) ;
11. Multi-table operations
    inner join
    select e.name,e.salary,d.dname
    from t_emp as e
    inner join t_dept as d
    on e.dept_id = d.id;

    left join
    select e.name,e.salary,d.dname
    from t_emp as e
    left join t_dept as d
    on e.dept_id = d.id;

    right join
    select e.name,e.salary,d.dname
    from t_emp as e
    right join t_dept as d
    on e.dept_id = d.id;

    full join [not supported by MySQL]
    select e.name,e.salary,d.dname
    from t_emp as e
    full join t_dept as d
    on e.dept_id = d.id;
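
The count(*) versus count(id) distinction and the group by / having rule are standard SQL behavior, so they can be tried without a Hive cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive; the sample rows are invented for the example, and SQLite is not Hive, so only the semantics shown here should be read across.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t_users (id int, name text, age int, salary int)")
conn.executemany(
    "insert into t_users values (?, ?, ?, ?)",
    [
        (1, "suns1", 20, 500),
        (2, "suns2", 20, 600),
        (None, "suns3", 30, 700),   # id is NULL on purpose
        (4, "suns4", 30, 1000),
    ],
)

# count(*) counts rows; count(id) skips rows whose id is NULL.
count_all, count_id = conn.execute(
    "select count(*), count(id) from t_users"
).fetchone()

# Only the grouping column and aggregate functions appear in the select list;
# having filters the groups after aggregation.
grouped = conn.execute(
    "select age, max(salary) from t_users "
    "group by age having max(salary) > 800 order by age"
).fetchall()

print(count_all, count_id, grouped)
```

With this data the age-20 group is filtered out by having because its highest salary is 600.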

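Table Operations

Managed Tables (MANAGED_TABLE)

1. Creating a basic managed table
create table if not exists table_name(
column_name data_type,
column_name data_type
)row format delimited fields terminated by '\t' [location 'hdfs_path'];

2. Creating a managed table with the as keyword
create table if not exists table_name [location 'hdfs_path'] as select id,name from t_users;
The table structure is determined by the selected columns, and the query result is inserted into the new table

3. Creating a managed table with the like keyword
create table if not exists table_name like t_users [location 'hdfs_path'];
The table structure matches the table named after like, but no data is copied, so the new table is empty

Details

1. Data types: int string varchar char double float boolean
2. location hdfs_path
   Sets where the table is created; the default is /user/hive/warehouse/db_name.db/table_name
   create table t_suns(
   id int,
   name string
   )row format delimited fields terminated by '\t' location '/xiaohei';
   Takeaway: in practice the HDFS directory and files usually exist first, and the table is created on top of them afterwards.
3. Commands for viewing a Hive table's structure
   desc table_name        describe table_name
   desc extended table_name
   desc formatted table_name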

External Tables

1. Basic
create external table if not exists table_name(
id int,
name string
) row format delimited fields terminated by '\t' [location 'hdfs_path'];
2. as
Not supported in the Hive version used here: the target of create table ... as select cannot be an external table.
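3. like
create external table if not exists table_name like t_users [location 'hdfs_path'];

1. Difference between managed and external tables
drop table t_users_as; Dropping a managed table deletes its metadata from the metastore and also deletes the HDFS directory and data files
drop table t_user_ex;  Dropping an external table deletes only its metadata from the metastore; the HDFS directory and data files remain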
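
The drop behavior described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not how Hive is implemented: a dict named `metastore` stands in for the real metastore, local temporary directories stand in for HDFS, and `create_table` and `drop_table` are hypothetical helpers.

```python
import os
import shutil
import tempfile

metastore = {}   # toy stand-in for the metastore: table name -> metadata


def create_table(name, location, external=False):
    """Register the table and make sure its data directory exists."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}


def drop_table(name):
    """Remove the metadata; remove the data directory only for managed tables."""
    meta = metastore.pop(name)
    if not meta["external"]:
        shutil.rmtree(meta["location"])


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "t_users_as")
external_dir = os.path.join(root, "t_user_ex")
create_table("t_users_as", managed_dir)
create_table("t_user_ex", external_dir, external=True)

drop_table("t_users_as")
drop_table("t_user_ex")
print(os.path.exists(managed_dir), os.path.exists(external_dir))
```

After both drops the metadata is gone for both tables, but only the external table's directory is still on disk.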
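2. Difference in how external and managed tables are used

3. Partitioned tables [query optimization]
Idea: each partition is stored as a separate subdirectory of the table directory, so a query that filters on the partition column reads only the matching subdirectories.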

~~~sql
create table t_user_part(
id int,
name string,
age int,
salary int)partitioned by (data string) row format delimited fields terminated by '\t';

load data local inpath '/root/data15' into table t_user_part partition (data='15');
load data local inpath '/root/data16' into table t_user_part partition (data='16');

-- without a partition filter, the query covers the data of the whole table
select * from t_user_part;

-- with a filter on the partition column, only partition data='15' is read
select id from t_user_part where data='15' and age>20;
~~~
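
To make the effect of the partition filter concrete, here is a small Python sketch of a partitioned table laid out as `data=15/` and `data=16/` subdirectories. It is a simplified model, not Hive's implementation: the local directory layout, the file name `part-0`, the sample rows, and the helpers `add_partition` and `scan` are all assumptions made for illustration.

```python
import os
import tempfile


def add_partition(table_dir, key, value, lines):
    """Write rows into the partition directory, e.g. t_user_part/data=15/."""
    part_dir = os.path.join(table_dir, f"{key}={value}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


def scan(table_dir, key, value=None):
    """Return (rows, directories_read); a partition filter skips other directories."""
    prefix = f"{key}="
    rows, dirs_read = [], []
    for part in sorted(os.listdir(table_dir)):
        part_value = part[len(prefix):]
        if value is not None and part_value != value:
            continue   # pruned: this directory is never opened
        dirs_read.append(part)
        part_dir = os.path.join(table_dir, part)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name), encoding="utf-8") as f:
                for line in f:
                    # The partition value comes from the directory name, not the file.
                    rows.append(line.rstrip("\n").split("\t") + [part_value])
    return rows, dirs_read


table_dir = os.path.join(tempfile.mkdtemp(), "t_user_part")
add_partition(table_dir, "data", "15", ["1\tsuns1\t25\t500", "2\tsuns2\t18\t300"])
add_partition(table_dir, "data", "16", ["3\tsuns3\t30\t800"])

all_rows, all_dirs = scan(table_dir, "data")
rows_15, dirs_15 = scan(table_dir, "data", "15")
ids = [r[0] for r in rows_15 if int(r[2]) > 20]   # where data='15' and age>20
print(all_dirs, dirs_15, ids)
```

The unfiltered scan opens both directories, while the filtered one opens only `data=15`, which is the saving a partitioned table is meant to provide.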

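Bucketed tables [omitted]: used for sampling
Temporary tables [omitted]

Importing Data

Basic import [key point]

load data local inpath 'local_path' into table table_name;

Importing with the as keyword [key point]

Load data through a query while creating the table
create table if not exists table_name as select id,name from t_users;

Importing with insert [key point]

# The table already exists; load data into it through a query.
create table t_users_like like t_users;

insert into table t_users_like select id,name,age,salary from t_users;

Importing from HDFS [for reference]

load data inpath 'hdfs_path' into table table_name;

Overwriting data during import [for reference]

load data inpath 'hdfs_path' overwrite into table table_name;
What it does: deletes all existing files in the table's directory, then adds the new ones

Uploading files with the HDFS client [for reference]

bin/hdfs dfs -put /xxxx  /user/hive/warehouse/db_name.db/table_name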

Exporting Data

sqoop [key point]

An auxiliary tool in the Hadoop ecosystem:  HDFS/Hive  <------> RDB (MySQL,Oracle)

Exporting with insert [for reference]

# The directory xiaohei must not already exist; it is created automatically
insert overwrite [local] directory '/root/xiaohei' select name from t_user;

Downloading files with the HDFS client [for reference]

bin/hdfs dfs -get /user/hive/warehouse/db_name.db/table_name /root/xxxx

Command-line script [for reference]

bin/hive --database 'baizhi_150' -f /root/hive.sql > /root/result

Hive's own import and export tools [for reference]

1. export
	export table tb_name to 'hdfs_path';
2. import
	import table tb_name from 'hdfs_path';

Hive Configuration Parameters

1. hive-default.xml 
2. hive-site.xml 

javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
hive.cli.print.current.db
hive.cli.print.header

<property>
   <name></name>
   <value></value>
</property>

3. bin/hive --hiveconf hive.cli.print.current.db=false
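4. hive> set hive.cli.print.current.db;       show the parameter's value
   hive> set hive.cli.print.current.db=true;  set the parameter

# MR-related parameters
# Number of map tasks: one map task per split, and by default a split corresponds to one HDFS block
Map --> Split  ---> Block 
# Number of reducers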
mapred-site.xml
<property>
     <name>mapreduce.job.reduces</name>
     <value>1</value>
</property>
hive-site.xml
<!--1G-->
<property>
	  <name>hive.exec.reducers.bytes.per.reducer</name>
	  <value>1000000000</value>
</property>
<property>
     <name>hive.exec.reducers.max</name>
     <value>999</value>
</property>

<!-- Whether a query starts MR is decided by the following parameter -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>
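
How the two reducer settings above interact can be sketched with a little arithmetic. The function below is a rough model, assuming that Hive estimates the reducer count as the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped by hive.exec.reducers.max, when the number of reducers is not fixed explicitly. The name `estimate_reducers` is made up for this example, and the real planner takes more factors into account.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate reducers as ceil(input / bytes_per_reducer), clamped to [1, max_reducers]."""
    wanted = -(-input_bytes // bytes_per_reducer)   # integer ceiling division
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(2_500_000_000))
```

With the values shown in the configuration, 2.5 GB of input gives 3 reducers, and very large inputs stop at the configured maximum of 999.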

Hive Startup Parameters

1. Set Hive configuration parameters temporarily when starting the Hive shell
  bin/hive --hiveconf hive.cli.print.current.db=false
2. Choose the database to use when starting Hive
  bin/hive --database Hive_1
3. Run a SQL command when starting Hive; Hive exits when the command finishes
  bin/hive -e 'sql'
  bin/hive --database Hive_1 -e 'sql'
  bin/hive --database Hive_1 -e 'select * from t_user' > /root/result
  bin/hive --database Hive_1 -e 'select * from t_user' >> /root/result
4. To run several SQL statements, put them in a separate file and run the file; Hive exits when it finishes
  bin/hive -f /root/hive.sql
  bin/hive --database Hive_1 -f /root/hive.sql > /root/result
  bin/hive --database Hive_1 -f /root/hive.sql >> /root/result
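
When these calls are driven from a script, it can help to assemble the argument list in one place. The helper below is a sketch that only builds the command from the flags listed above (`--hiveconf`, `--database`, `-e`, `-f`) and does not run Hive; the function name `hive_command` and the default path `bin/hive` are assumptions for this example.

```python
import shlex


def hive_command(database=None, sql=None, script=None, hiveconf=None, hive_bin="bin/hive"):
    """Build the argument list for a non-interactive Hive CLI call."""
    if (sql is None) == (script is None):
        raise ValueError("pass exactly one of sql or script")
    argv = [hive_bin]
    for key, value in (hiveconf or {}).items():
        argv += ["--hiveconf", f"{key}={value}"]
    if database:
        argv += ["--database", database]
    argv += ["-e", sql] if sql is not None else ["-f", script]
    return argv


cmd = hive_command(database="Hive_1", sql="select * from t_user")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` with `stdout` pointing at a file such as /root/result, which mirrors the `>` redirection shown above without going through a shell.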