Hive Summary

Xiao's Blog (肖的博客)

Category: hadoop

Article link: https://blog.csdn.net/ajax_jquery/article/details/23038231

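Hive summary:
Introduction:
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who know SQL query the data. The language also lets developers who know MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.
Hive has no dedicated data format of its own. It works well on top of Thrift, lets you control the delimiters, and also lets users specify the data format.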
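1. Updates, transactions, and indexes
These three are among the most important features of a traditional database, but at the time of writing Hive had no plans to support them (later releases, from Hive 0.14, added ACID transactions with row-level update and delete on transactional tables).
  Hive does not support updating records. As for transactions, Hive does not define clear semantics for concurrent access to a table, so applications have to implement their own application-level concurrency control or locking. Integration with HBase also changes this picture, because HBase supports row updates and column indexing.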
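2. Managed tables (Table) and external tables (External Table)
  An External Table points to data that already exists in HDFS, and it can have partitions. Its metadata is organized the same way as a Table's, but the actual data is stored quite differently.
  For a Table there are two steps, creating the table and loading the data (both can be done in a single statement). During the load the actual data is moved into the data warehouse directory, and all later access reads it directly from there. Dropping the table deletes both the data and the metadata.
  An External Table has only one step: creating the table and attaching the data happen together (CREATE EXTERNAL TABLE ... LOCATION). The data stays in the HDFS path given after LOCATION and is not moved into the data warehouse directory. Dropping an External Table deletes only the metadata; the data itself is left in place.
  In short, a managed table moves the data into Hive's warehouse directory and an external table does not. As a rule of thumb, use a managed table if all processing is done by Hive, and an external table if Hive and other tools work on the same dataset.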
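The difference in load and drop behaviour can be sketched on a local file system. This is only a toy model in Python, not Hive code: the `metastore` dict, the helper functions, and the temporary directories are all made up for illustration, with a local folder standing in for the warehouse directory in HDFS.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
warehouse = os.path.join(root, "warehouse")   # stands in for Hive's warehouse directory
os.makedirs(warehouse)
metastore = {}                                # hypothetical: table name -> (kind, data directory)

def write_file(path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("1\tx\n")

def create_managed(name, source_file):
    table_dir = os.path.join(warehouse, name)
    os.makedirs(table_dir)
    shutil.move(source_file, table_dir)       # loading moves the data into the warehouse
    metastore[name] = ("managed", table_dir)

def create_external(name, location):
    metastore[name] = ("external", location)  # only metadata is recorded, data stays put

def drop_table(name):
    kind, data_dir = metastore.pop(name)      # metadata is removed in both cases
    if kind == "managed":
        shutil.rmtree(data_dir)               # only a managed table loses its data

staging_file = os.path.join(root, "staging", "trade.txt")
external_dir = os.path.join(root, "td_ext")
write_file(staging_file)
write_file(os.path.join(external_dir, "part.txt"))

create_managed("trade_detail", staging_file)
create_external("td_ext", external_dir)
moved = not os.path.exists(staging_file)      # the source file is gone after the load

drop_table("trade_detail")
drop_table("td_ext")

print(moved)
print(os.path.exists(os.path.join(warehouse, "trade_detail")))
print(os.path.exists(os.path.join(external_dir, "part.txt")))
```

After both drops the managed table's directory no longer exists, while the external table's file is still at its original location.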
3. Five ways to create a table
  3.1 Create a managed (internal) table
create table trade_detail(id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
  3.2 Create a partitioned table
create table td_part(id bigint, account string, income double, expenses double, time string) partitioned by (logdate string) row format delimited fields terminated by '\t';
  3.3 Create an external table (note: the path given in location is an HDFS path; naming the HDFS directory after the table makes it easier to understand and find)
create external table td_ext(id bigint, account string, income double) row format delimited fields terminated by '\t' location '/td_ext';
  3.4 Create a new target table from a query result: target takes its column types from the query result, and the result rows are inserted into it at the same time
create table target as select col1, col2 from source;
  3.5 Create an empty table with the same structure as another table; no data is copied
create table new_table like existing_table;
4. Three ways to import data
  4.1 Use sqoop to import data from a relational database into Hive
sqoop import --connect jdbc:mysql://192.168.35.100:3306/databasename --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
  4.2 Use load to import a local or HDFS file into a Hive table
Import data from the local file system into Hive:
load data local inpath '/root/user.txt' into table user;
Import data from HDFS into Hive:
load data inpath '/user.txt' into table user;
  4.3 Use insert overwrite
  4.3.1 Overwrite the data in the target table with the result of a select statement
insert overwrite table target select col1, col2 from source;
  4.3.2 Write to a specified partition
insert overwrite table target partition (dt='2010-01-01') select col1, col2 from source;
  4.3.3 Dynamic partition insert (the partition column comes last in the select list)
insert overwrite table target partition (dt) select col1, col2, dt from source;
  4.3.4 Read from one table and insert into several tables (put the shared source table at the very front)
from source
insert overwrite table records_by_year select year, count(1) group by year
insert overwrite table good_records_by_year select year, count(1) where quality == 0 group by year;
5. Modifying a table
  5.1 Rename a table
alter table source rename to newsource;
  5.2 Add columns to a table
alter table source add columns (col1 string, col2 int ...);
  5.3 Change a column's name, type, or comment
Setup: first create a table named source
create table source (a int comment 'first column', b string)
comment 'test table'
row format delimited fields terminated by '\t';
Rename column a to a1, change its type to string, and change its comment to 'alert new column'
alter table source change a a1 string comment 'alert new column';
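6. Partition operations
  List a table's partitions: show partitions tablename;
  6.1 Add a partition (note: a partition column name must not duplicate a regular column name of the table, and the table must have been declared as partitioned when it was created; otherwise adding a partition fails with the following error)
FAILED: Error in metadata: table is not partitioned but partition spec exists: {logdate=20140406}
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Create the partitioned table first: create table parttest(id string) partitioned by (logdate string) row format delimited fields terminated by '\t';
Add the partition: alter table parttest add partition (logdate='20140406') location '/user/hive/warehouse/parttest/logdate=20140406';
After the partition is added, the parttest directory contains a new folder for it, named logdate=20140406 according to the location given above.
  6.2 Drop a partition: alter table parttest drop partition (logdate='20140406');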
7. Dropping a table
  drop table source;
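8. Sorting and aggregating
  Order by produces a fully sorted result, as you would expect, but it does so by using a single reducer, which makes it very inefficient on large datasets. In many cases a global sort is not needed, and you can switch to Hive's non-standard extension sort by, which produces one sorted file per reducer. Sometimes you also need to control which reducer a particular row goes to, usually so that a later aggregation can be done. Hive's distribute by clause does that.
  -- Sort the weather data by year and temperature, making sure all rows with the same year end up in the same reducer partition
select year, temperature
from record
distribute by year
sort by year asc, temperature desc;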
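To make the per-reducer behaviour concrete, here is a minimal Python sketch of what distribute by plus sort by does to a handful of rows. It is a model, not Hive's implementation: the sample rows are invented, `distribute_and_sort` is a hypothetical helper, and a plain modulo stands in for the hash Hive uses to pick a reducer.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers):
    """Model of: distribute by year sort by year asc, temperature desc."""
    reducers = defaultdict(list)
    for year, temperature in rows:
        # distribute by: rows with the same key always land on the same reducer
        # (assumption: modulo replaces Hive's real hash partitioning)
        reducers[year % num_reducers].append((year, temperature))
    # sort by: every reducer sorts only its own rows, so there is no global order
    return {
        reducer_id: sorted(part, key=lambda row: (row[0], -row[1]))
        for reducer_id, part in reducers.items()
    }

# made-up sample rows of (year, temperature)
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
output = distribute_and_sort(rows, num_reducers=2)
for reducer_id in sorted(output):
    print(reducer_id, output[reducer_id])
```

Each reducer's output is sorted by year ascending and temperature descending, and every year appears in exactly one reducer's output, which is what a later per-year aggregation needs.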
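9. Joins
  One advantage of Hive over using MapReduce directly is that it simplifies common operations.
  Inner join: in the Hive releases covered here, only equijoins are supported and only one table may appear in the from clause, but you can join several tables with multiple join ... on ... clauses, and Hive will work out how to run the join with the smallest number of MapReduce jobs.
  select sales.*, things.* from sales join things on (sales.id = things.id);
  Not supported in these releases (the implicit comma join was added in Hive 0.13): select sales.*, things.* from sales, things where sales.id = things.id;
  Outer join: an outer join lets you find the rows in the joined tables that have no match.
  select sales.*, things.* from sales left outer join things on (sales.id = things.id);
  select sales.*, things.* from sales right outer join things on (sales.id = things.id);
  select sales.*, things.* from sales full outer join things on (sales.id = things.id);
  Semi join:
  select * from things left semi join sales on (sales.id = things.id);
  -- Similar to an in subquery: select * from things where things.id in (select id from sales);
  -- A left semi join query must obey one restriction: the right table may appear only in the on clause.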
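The difference between an inner join and a left semi join is easiest to see on rows where the right table has duplicate keys. The following Python sketch models both on in-memory lists; the sample rows and the two helper functions are made up for illustration and say nothing about how Hive executes the joins.

```python
def inner_join(things, sales):
    # one output row for every matching (thing, sale) pair
    return [(t, s) for t in things for s in sales if t["id"] == s["id"]]

def left_semi_join(things, sales):
    # each left row is returned at most once, and only left-table columns come back
    sale_ids = {s["id"] for s in sales}
    return [t for t in things if t["id"] in sale_ids]

# made-up sample data; id 4 was sold twice, id 3 was never sold
things = [{"id": 2, "name": "Tie"}, {"id": 3, "name": "Hat"}, {"id": 4, "name": "Coat"}]
sales = [{"id": 2, "buyer": "Joe"}, {"id": 4, "buyer": "Ann"}, {"id": 4, "buyer": "Bob"}]

joined = inner_join(things, sales)
semi = left_semi_join(things, sales)
print(len(joined))
print([t["name"] for t in semi])
```

The inner join repeats the coat once per sale, while the semi join lists each sold item once, which is the same answer the in subquery form gives.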
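10. Subqueries
  Hive's support for subqueries is very limited: in the releases covered here a subquery may appear only in the from clause of a select statement (subqueries in the where clause were added in Hive 0.13). The columns of the subquery must have unique names so that the outer query can refer to them.
  select station, year, avg(max_temperature)
  from (
    select station, year, max(temperature) as max_temperature
    from record2
    where temperature != 999 and quality != 0
    group by station, year
  ) mt
  group by station, year;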
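11. Views
  A view is a virtual table defined by a select statement; Hive does not materialize it to disk. Creating a view does not run the query; the view's select statement runs only when a statement that refers to the view is executed. If a view performs a heavy transformation on its base tables, or is queried frequently, it may be worth creating a new table and storing the view's contents in it, which amounts to materializing it (create table ... as select).
  Create a view
  create view viewname as select col1, col2 ... from tablename;
  Query data through the view
  select * from viewname;
  Query data through the table
  select * from tablename;
  Note: a full select * through the view runs a MapReduce job, whereas a select * directly on the table does not.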
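The contrast between a view and a create table ... as select copy can be modeled in a few lines of Python. This is only an analogy under simple assumptions: a list of dicts stands in for the base table, and `create_view` and `create_table_as_select` are hypothetical helpers, not Hive APIs.

```python
# made-up base table
table = [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]

def create_view(base):
    # only the query definition is kept; nothing is executed at creation time
    return lambda: [{"col1": row["col1"]} for row in base]

def create_table_as_select(base):
    # the query runs once, right now, and its result is stored
    return [{"col1": row["col1"]} for row in base]

view = create_view(table)
snapshot = create_table_as_select(table)

# new data arrives in the base table afterwards
table.append({"col1": 3, "col2": "c"})

view_rows = view()        # the view's select runs at query time
print(len(view_rows))
print(len(snapshot))
```

The view sees the row added after it was created because its query runs on every access; the materialized copy stays as it was when it was built, which is the trade-off to weigh before materializing a frequently used view.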
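12. User-defined functions
  A UDF must be written in Java. There are three kinds: regular UDFs, UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions). They differ in the number of rows they take as input and produce as output:
  A UDF operates on a single row and produces a single row as output;
  A UDAF takes multiple input rows and produces a single output row;
  A UDTF takes a single row and produces multiple rows (a table) as output.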
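The three row-count contracts can be illustrated with plain Python functions. Real Hive UDFs are Java classes, so this sketch models only the shape of input and output; the function names and sample values are invented for the example.

```python
def strip_udf(value):
    # UDF shape: one row in, one row out
    return value.strip()

def max_udaf(values):
    # UDAF shape: many rows in, one row out
    return max(values)

def explode_udtf(values):
    # UDTF shape: one row (holding an array) in, many rows out
    for value in values:
        yield value

rows = ["  a ", " b", "c  "]
udf_out = [strip_udf(row) for row in rows]   # applied to each row separately
udaf_out = max_udaf([3, 9, 4])               # collapses all rows into one value
udtf_out = list(explode_udtf([1, 2, 3]))     # expands one row into several
print(udf_out, udaf_out, udtf_out)
```

Three input rows give three output rows for the UDF, one value for the UDAF, and a single array-valued row becomes three rows for the UDTF.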

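Problems:
1. While importing data with sqoop I stopped the job with ctrl+c and then restarted it, which produced the following error
  ERROR tool.ImportTool: Encountered IOException running import job:
  org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory user_info already exists
   Cause: during an import sqoop first writes the data files to the default directory in HDFS, /user/root, and only deletes them from /user/root once the job has finished. When I reran the import, the files from the interrupted run were still under /user/root, and the conflict caused the error. Deleting the leftover output directory (user_info here) before rerunning avoids it.
2. Adding a partition to a table fails:
  FAILED: Error in metadata: table is not partitioned but partition spec exists: {logdate=20140406}
  FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
  Cause: a table can only have partitions added if it was declared as partitioned when it was created, for example with partitioned by (logdate string).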