Summary of Common Hive Operations


wenpu_Di


Article link: https://blog.csdn.net/qq_29486343/article/details/53267931


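Hive directories and files
  ● bin:
        Contains the executables for the various Hive services, for example the CLI (command-line interface).
  ● .hiverc:
        A file in the user's home directory; create it if it does not exist.
        The commands in it are executed automatically each time the CLI starts.
  ● metastore (metadata store):
    The metastore is the only component Hive needs that Hadoop does not already provide. It stores
    metadata such as table schemas and partition information, which the user supplies when running
    create table x... or alter table y...
    Hive can be configured to keep this metadata in MySQL (the default is an embedded Derby database).
  ● .hivehistory:
        Stores the history of commands executed in the CLI.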

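1. Data types in Hive:
Primitive data types:
tinyint: 1 byte
smallint: 2 bytes
int: 4 bytes
bigint: 8 bytes
boolean: true or false
float: single-precision floating-point number
double: double-precision floating-point number
string: sequence of characters
timestamp: can be given as an integer, a floating-point number, or a string
binary: array of bytes

Collection data types:
struct:
    struct('john', 'doe')
map:
    map('first','john','last','doe')
array:
    array('john', 'doe')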
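------------------------Example----------------------
create table employee(
   name         string,
   salary       float,
   subordinates array<string>,          -- array type
   deductions   map<string, float>,     -- map type
   address      struct<street:string, city:string, state:string, zip:int>   -- struct type
)
row format delimited            -- these keywords must come before the delimiter clauses below
fields terminated by '\t'       -- columns are separated by '\t'
collection items terminated by ',';     -- elements inside a collection are separated by ','

-- Note on the difference between struct and array:
--  a struct has named fields, and each field can have a different data type;
--  all elements of an array share one data type (string in this example).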
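To make the delimiter clauses concrete, here is a minimal Python sketch of how one text row of the employee table could be split into its columns. It is only an illustration, not Hive's own parser. The sample row and the ':' separator between map keys and values are assumptions made for this sketch (the DDL above does not set a map key delimiter).

```python
from typing import NamedTuple

FIELD_SEP = "\t"     # fields terminated by '\t'
ITEM_SEP = ","       # collection items terminated by ','
MAP_KEY_SEP = ":"    # ASSUMPTION: not set in the DDL above, chosen for this sketch


class Address(NamedTuple):
    street: str
    city: str
    state: str
    zip: int


def parse_employee(line: str) -> dict:
    """Split one delimited text row into the five employee columns."""
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD_SEP)
    deductions = {}
    for item in deds.split(ITEM_SEP):
        key, value = item.split(MAP_KEY_SEP)
        deductions[key] = float(value)
    # Struct fields are separated by the same delimiter as collection items.
    street, city, state, zip_code = addr.split(ITEM_SEP)
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subs.split(ITEM_SEP),        # array<string>
        "deductions": deductions,                    # map<string, float>
        "address": Address(street, city, state, int(zip_code)),  # struct
    }


# Hypothetical sample row, made up for illustration.
row = parse_employee(
    "John Doe\t100000.0\tMary Smith,Todd Jones\t"
    "Federal Taxes:0.2,State Taxes:0.05\t1 Michigan Ave.,Chicago,IL,60600"
)
```

The array keeps one element type throughout, while the struct mixes string and int fields, which is the distinction described in the note above.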

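2. Databases in Hive: concept and operations
  ● What a database is in Hive:
        Simply a directory, or namespace, that holds tables
  ● Hive creates one directory per database:
        The tables of a database are stored as subdirectories of that database's directory (default is the exception:
        its tables sit directly under the warehouse root)
        For example /user/hive/warehouse/table_name/<table files>    (in HDFS)
  ● Older versions of Hive (before 0.14) do not support row-level insert, update, or delete, and do not support transactions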
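The directory layout described above can be sketched as a small Python helper. It assumes the default warehouse root /user/hive/warehouse and a database created without an explicit location clause; the function name table_dir is made up for this illustration.

```python
import posixpath

# Default warehouse root (the hive.metastore.warehouse.dir setting).
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(database: str, table: str, root: str = WAREHOUSE_ROOT) -> str:
    """Return the HDFS directory where a managed table's files would live."""
    if database == "default":
        # Tables of the default database sit directly under the root.
        return posixpath.join(root, table)
    # Any other database gets its own "<name>.db" directory.
    return posixpath.join(root, database + ".db", table)


default_path = table_dir("default", "person")
mydb_path = table_dir("mydb", "person")
```

A database created with an explicit location uses that path instead of the "<name>.db" directory.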


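----------------------Database operations----------------------------

hive>show tables;   -- list the tables in the current database

hive>show tables in mydb;   -- list the tables in the given database

hive>show databases like 'h.*';     -- list the databases whose names start with h

hive>create table person2 like person;      -- create an empty table with the same schema as person (no data is copied)

hive>create database mydb location '/home/wenpu_di/';   -- location: sets where the database directory is stored

hive>create database mydb comment 'my database' location '/home/diwenpu/';
-- comment: 'my database' is descriptive text about this database; the comment clause must come before location

hive>drop database mydb cascade;    -- drop a non-empty database (one that still contains tables)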

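3. Basic table operations in Hive

hive>create table table1(i int, name string);   -- create a table

hive>desc table1;   -- show the table schema

hive>drop table table1;     -- drop the table

-- Altering a table changes only its metadata, not the data itself

hive>alter table mytable rename to MYTABLE;     -- rename (Hive table names are case-insensitive, so in practice the new name must differ by more than case)


hive>alter table mytable add columns(name string, age int); -- add new columns

hive>alter table log_message add partition(year=2011, month=1, day=1) location '/log/data1';
-- this command works only on partitioned tables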
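  ● External tables:
Hive does not fully own an external table: dropping the table does not delete the data it points to,
although the metadata describing the table is removed.
-- Create an external table:
create external table stocks(
    age int,
    name string
)
row format delimited 
fields terminated by ','
location '/data/stocks';    -- the semicolon goes at the very end

-- Notes:
-- external: marks the table as an external table (a non-partitioned one here); the location clause
-- tells Hive which path the data lives under

hive>desc extended tablename;   -- shows whether the table is managed or external

-- tableType:MANAGED_TABLE: managed table
-- tableType:EXTERNAL_TABLE: external table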
  ● Partitioned tables:
Partitioning changes how Hive organizes the stored data: each partition gets its own subdirectory.

-- Create a partitioned table
create table employees(
    name string,
    salary float,
    subordinates array<string>,
    deductions map<string, float>,
    address     struct<street:string, city:string, state:string, zip:int>
)
partitioned by (country string, state string);

-- Partition columns: country, state. When querying, users do not need to care whether a column is a partition column

hive>describe extended employees;    -- show the partition keys

hive>show partitions employees;     -- list all partitions of the table

hive>show partitions employees partition(country='US'); -- list every partition with country='US'
country=US/state=AL
country=US/state=AK
...

hive>show partitions employees partition(country='US', state='AL');  -- list one specific partition
country=US/state=AL
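The key=value names printed by show partitions are also the subdirectory names under the table directory. A small Python sketch of that naming, and of filtering by a partial partition spec, might look like this. The helper names and the third partition (country=CA/state=AB) are made up for illustration.

```python
def partition_path(spec: dict) -> str:
    """Render a partition spec as its relative directory, e.g. country=US/state=AL."""
    return "/".join(f"{key}={value}" for key, value in spec.items())


def show_partitions(partitions: list, **filter_spec) -> list:
    """List partitions whose values match every key given in filter_spec."""
    return [
        partition_path(p)
        for p in partitions
        if all(p.get(key) == value for key, value in filter_spec.items())
    ]


partitions = [
    {"country": "US", "state": "AL"},
    {"country": "US", "state": "AK"},
    {"country": "CA", "state": "AB"},  # hypothetical extra partition
]

us_partitions = show_partitions(partitions, country="US")
one_partition = show_partitions(partitions, country="US", state="AL")
```

A partial spec (only country) selects several partitions, while a full spec selects at most one.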


-- Add a partition: alter table table_name add partition(...)
hive>alter table log_message add partition(year=2012, month=1, day=2)
location 'hdfs://master:9000/S1/data';

-- Point a partition at a different path (this changes where the partition is stored, not the table)
hive>alter table log_message partition(year=2011, month=12, day=2)
set location 'hdfs://master:9000/S2/data';

-- Drop a partition
hive>alter table log_message drop partition(year=2011, month=12, day=2);

-- Copy the data at the given path into the given partition
hive>load data local inpath '/home/wenpu_di/sougou.txt' into table employees partition(country='US', state='AL');
-- Note: the data under the path has to match the partition you specify.
-- Hive loads the files as they are and does not check that every row belongs to that partition.
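4. Loading data
  ● load: load from a given file path
hive>load data local inpath '/home/wenpu_di/data.txt' overwrite into table employees partition(country='US', state='CA');
Notes:
-- inpath: the given path must not contain subdirectories
-- If the target partition does not exist, it is created first and the data is then copied into its directory
-- If the target table is not partitioned, omit the partition clause
-- The local keyword:
-- with local, the data is copied from the local file system to the target location in HDFS
-- without local, the data is moved (not copied) from its HDFS path to the target location,
-- e.g. from "hdfs://master:9000/S/data" to the table's location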
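The copy-versus-move distinction behind the local keyword can be imitated on an ordinary file system. The Python sketch below is only an analogy under that assumption (it does not talk to HDFS or Hive); the function load_data and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src: Path, target_dir: Path, local: bool) -> None:
    """Imitate load data: copy when local is set, move otherwise."""
    target_dir.mkdir(parents=True, exist_ok=True)  # create the target if missing
    if local:
        shutil.copy(src, target_dir / src.name)    # source file stays in place
    else:
        shutil.move(str(src), str(target_dir / src.name))  # source file is gone


work = Path(tempfile.mkdtemp())
local_file = work / "data.txt"
local_file.write_text("a\t1\n")
staged_file = work / "staged.txt"
staged_file.write_text("b\t2\n")
target_dir = work / "warehouse" / "employees"

load_data(local_file, target_dir, local=True)
load_data(staged_file, target_dir, local=False)

loaded = sorted(p.name for p in target_dir.iterdir())
```

After the two calls the copied source still exists, while the moved one is no longer at its original path.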
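  ● Create a table and load data in a single query
create table ca_employee as 
select name, salary, address from employees 
where state='CA';
-- Selects the needed subset out of a large, wide table; the table created
-- this way cannot be an external table.
  ● Use the result of a query on one table as the input of another table
insert overwrite table sougou select uid,url from sougou500;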
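5. Joins
1. Inner join:
    select a.no,b.no from a join b on a.no = b.no;
An inner join keeps only the rows whose join keys match in both tables

2. left outer join
    select a.no,b.no from a left outer join b on a.no = b.no;
Left outer join: the left table is the base table; left rows with no match are kept too, with null for the right table's columns
That is: every left row appears, and null fills the right side where the on condition is not met

3. right outer join
    select a.no,b.no from a right outer join b on a.no = b.no;
Right outer join: the right table is the base table; right rows with no match are kept too, with null for the left table's columns
That is: every right row appears, and null fills the left side where the on condition is not met

4. full outer join
    select a.no,b.no from a full outer join b on a.no = b.no;
Rows from both tables all appear: unmatched left rows get null on the right side, and unmatched right rows get null on the left side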
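The four join types above differ only in which unmatched rows they keep. A minimal Python sketch over two small lists of keys, standing in for a.no and b.no, shows this; None plays the role of null. The function join and the sample values are made up for illustration, and real Hive output order is not guaranteed.

```python
def join(left: list, right: list, how: str = "inner") -> list:
    """Equi-join two key lists, returning (a.no, b.no) pairs."""
    rows = []
    matched_right = set()
    for a in left:
        hits = [i for i, b in enumerate(right) if a == b]
        for i in hits:
            rows.append((a, right[i]))
            matched_right.add(i)
        # Left and full outer joins keep unmatched left rows.
        if not hits and how in ("left", "full"):
            rows.append((a, None))
    # Right and full outer joins keep unmatched right rows.
    if how in ("right", "full"):
        for i, b in enumerate(right):
            if i not in matched_right:
                rows.append((None, b))
    return rows


a = [1, 2, 3]
b = [2, 3, 4]

inner = join(a, b, "inner")
left_outer = join(a, b, "left")
right_outer = join(a, b, "right")
full_outer = join(a, b, "full")
```

With these inputs the inner join yields two rows, each outer join adds the unmatched row from its base table, and the full outer join adds both.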

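5. left semi join 
    select a.no from a left semi join b on a.no = b.no;
Returns the rows of a whose key also appears in b:
    That is: a is the base table, and a row of a is returned once as soon as it has a match in b, no matter how many rows of b match it. Columns of b may be used only in the on clause, not in the select list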
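How a left semi join differs from an inner join when b holds duplicate keys can be sketched in Python as follows. The lists stand in for a.no and b.no, and the function name and sample values are made up for illustration.

```python
def left_semi_join(left: list, right: list) -> list:
    """Keep each left row that has at least one match on the right."""
    right_keys = set(right)          # duplicates on the right do not matter
    return [a for a in left if a in right_keys]


a = [1, 2, 2, 3]
b = [2, 2, 3, 3, 4]

semi = left_semi_join(a, b)
# For comparison: an inner join emits one row per matching pair.
inner = [(x, y) for x in a for y in b if x == y]
```

Each row of a appears at most once in the semi join result, whereas the inner join multiplies rows by the number of matches in b.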

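6. map join
The join is performed in the mappers' memory: the small table named in the hint (a here) is loaded into memory, so no reduce phase is needed
select /*+MAPJOIN(a)*/ a.no,b.no from a join b on a.no = b.no;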
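The idea behind a map join is a hash join: build an in-memory lookup table from the small input, then stream the large input past it. The Python sketch below illustrates that idea only; it is not how Hive implements it internally, and the function name and sample rows are assumptions.

```python
def map_join(small: list, big: list):
    """Hash join: small is held in memory, big is streamed row by row."""
    # Build phase: index the small table by join key.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Probe phase: each big-table row is matched locally, with no shuffle step.
    for key, value in big:
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)


small = [(1, "x"), (2, "y")]
big = [(2, "p"), (3, "q"), (2, "r"), (1, "s")]

result = list(map_join(small, big))
```

This only pays off when the small side fits comfortably in memory, which is why the hint should name the smaller table.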
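6. Other special operations
#hive -e "desc database test";      # -e: run the statement, then exit the CLI automatically

#hive -S -e "desc database test" >>  /home/diwenpu/temp.txt # -S: silent mode, drops the OK and timing lines; -e: exit the CLI right after

#hive -f /home/wenpu_di/hive.sql        # run a file of SQL statements; -e and -f cannot be used together
#cat hive.sql   
use database1;
select * from sougou limit 5;
...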
