Basic Operations on a Hadoop Cluster (Part 4: Basic Hive Operations)

Eyeshort

Categories: Operating Systems, Big Data Technology, Hadoop Learning. Tags: Hive, Hadoop, cluster operations, big data

Article link: https://blog.csdn.net/qq_37823605/article/details/90488917

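Objectives and Requirements

Objective:

(1) Learn to use Hive, the data warehouse tool.

Requirements:

Be able to use the Hive data warehouse;
Be able to work with databases, tables, and data.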

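Experiment Environment

Five standalone PC-style virtual machines;
Working network connectivity between the hosts;
At least 2 GB of memory and a 50 GB disk per host;
64-bit CentOS 7 installed on every host;
Static network addresses, hostnames, and host address mappings configured on every host;
Hadoop platform set up;
MySQL database platform set up;
HBase installed;
Hive data warehouse installed.

Software version:

Hive 2.1.1, package name apache-hive-2.1.1-bin.tar.gz.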

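Exercises

Step 1: Installing and configuring the Hive tool

1. Start the cluster.

★ Perform every step in this item as admin, the user dedicated to the cluster.

★ Before starting the HBase cluster, make sure the Zookeeper cluster is running (5 machines in this experiment). Zookeeper has to be started by hand on each node. If the start command fails when run from the home directory, change into the zookeeper/bin directory and run it there.

★ Before starting the HBase cluster, also make sure the Hadoop cluster is running (5 machines in this experiment). Hadoop only needs its start command run on the master node.

a) On every host in the cluster, run "zkServer.sh status" to check the state of the Zookeeper service on that node. The cluster is healthy if exactly one node reports "leader" and all the others report "follower". If Zookeeper is not running, start it on every host with "zkServer.sh start".

b) On the master node, list the Java processes. If processes named "NameNode" and "ResourceManager" are present, the Hadoop master node started successfully. On each data node, processes named "DataNode" and "NodeManager" indicate that the data node started successfully. If any of these processes is missing, start the Hadoop cluster from the master node with its start command.

[Screenshot: master node and standby master node]

[Screenshot: communication nodes]

c) Once Hadoop is confirmed to be running, start the HBase cluster from the master node with its start command, then run "jps" on every host in the cluster to check the processes.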
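
The checks in a) and b) are easy to script. Below is a minimal sketch in POSIX shell. The helper names missing_daemons and single_leader are made up for illustration and are not part of Hadoop or Zookeeper. The sketch assumes that jps prints one "<pid> <name>" pair per line and that "zkServer.sh status" prints a "Mode: leader" or "Mode: follower" line on each node.

```shell
#!/bin/sh
# Hypothetical helpers for the startup checks; not part of Hadoop or Zookeeper.

# Print every expected daemon name that is absent from jps output read on stdin.
# Arguments: the daemon names that should be running on this node.
missing_daemons() {
    # jps prints one "<pid> <name>" pair per line; keep only the names.
    running=$(awk '{print $2}')
    for name in "$@"; do
        # -x matches whole lines, so "NameNode" does not match "SecondaryNameNode".
        printf '%s\n' "$running" | grep -qxF "$name" || printf '%s\n' "$name"
    done
}

# Succeed only if exactly one "Mode: leader" line appears in the combined
# "zkServer.sh status" output of all nodes, read on stdin.
single_leader() {
    # grep -c still prints 0 when nothing matches; "|| true" hides its exit code.
    leaders=$(grep -c '^Mode: leader' || true)
    [ "$leaders" -eq 1 ]
}
```

On the master node you might run "jps | missing_daemons NameNode ResourceManager"; empty output means both processes are up. For single_leader, collect the "zkServer.sh status" output from all five hosts and pipe it in.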

2. On the master node, run "hive" to start Hive. On success you enter the Hive console.

3. In the console, run "show databases;" to list the current databases.

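Exercises:

1. Start Hive and try the common commands.

Commands (the text after # is an annotation and should not be typed):

$ hive    # start Hive; on success you enter the Hive console

> show databases;    # list the current databases

> create database test1;    # create a database

> show databases;

> use test1;    # switch to the database

> create table testable(id int, name string, age int, tel string) row format delimited fields terminated by ',' stored as textfile;

> show tables;

> drop table testable;    # drop the table

> drop database test1;    # drop the database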

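2. Hive data model: internal tables

- Conceptually similar to a table in a relational database.

- Each table has a corresponding directory in Hive where its data is stored.

- All table data (External Tables excepted) is kept in this directory.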

Exercise:

Commands (lines starting with $ are run in a shell, lines starting with > in the Hive console):

$ hive

> create database test2;

> use test2;

> create database test3;

> use test3;

> create table t1(tid int, tname string, age int);

> create table t2(tid int, tname string, age int) location '/mytable/hive/t2';

> create table t3(tid int, tname string, age int) row format delimited fields terminated by ',';

> create table t4 as select * from t1;

$ hadoop fs -ls /user/hive/warehouse/

$ hadoop fs -ls /user/hive/warehouse/test3.db

$ hadoop fs -ls /user/hive/warehouse/test3.db/t1

$ hadoop fs -ls /mytable/hive/

> desc t1;

> alter table t1 add columns(english int);

> desc t1;

> drop table t1;

$ hdfs dfs -ls /user/hive/warehouse/test3.db

3. Hive data model: partitioned tables

Commands:

$ hive

> create database test4;

> use test4;

a) Prepare the data table.

> create table sampledata(sid int, sname string, gender string, language int, math int, english int) row format delimited fields terminated by ',' stored as textfile;

b) Prepare the data file.

In the admin user's home directory, create sampledata.txt with the following content:

1,Tom,M,60,80,96

2,Mary,F,11,22,33

3,Jerry,M,90,11,23

4,Rose,M,78,77,76

5,Mike,F,99,98,98

c) Load the text data into the table.

> load data local inpath '/home/admin/sampledata.txt' into table sampledata;

> select * from sampledata;

- A partition corresponds to a dense index on the partition column in a database.
- In Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory.
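
To get a feel for the one-directory-per-partition layout, the sketch below imitates it on the local filesystem with standard shell tools. It is only an analogy: partition_by_column and the data.txt file name are hypothetical, and Hive itself manages these directories on HDFS. As with a Hive partition column, the value moves into the directory name (for example gender=M) and is dropped from the stored rows.

```shell
#!/bin/sh
# Local imitation of Hive's "<column>=<value>" partition directories.
# Arguments: input file, 1-based column index, column name, output directory.
# partition_by_column is a hypothetical helper, not a Hive command.
partition_by_column() {
    infile=$1
    col=$2
    colname=$3
    outdir=$4

    # First pass: create one directory per distinct value of the column.
    cut -d, -f"$col" "$infile" | sort -u | while IFS= read -r value; do
        mkdir -p "$outdir/$colname=$value"
    done

    # Second pass: append each row, minus the partition column, to the
    # data file inside its partition directory.
    awk -F, -v col="$col" -v name="$colname" -v out="$outdir" '
    BEGIN { c = col + 0 }
    {
        row = ""; sep = ""
        for (i = 1; i <= NF; i++) {
            if (i == c) continue
            row = row sep $i
            sep = ","
        }
        print row >> (out "/" name "=" $c "/data.txt")
    }' "$infile"
}
```

For example, "partition_by_column sampledata.txt 3 gender ./partition_table" should leave two directories, gender=F and gender=M. A query that filters on gender then only needs to read one of them, which is the point of partitioning.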

d) Create the partitioned table.

Commands:
> create table partition_table(sid int, sname string) partitioned by (gender string) row format delimited fields terminated by ',';
> select * from partition_table;

e) Insert data into the partitioned table.

Commands:
> insert into table partition_table partition(gender='M') select sid, sname from sampledata where gender='M';
> insert into table partition_table partition(gender='F') select sid, sname from sampledata where gender='F';

> select * from partition_table;

> show partitions partition_table;    # show the table's partition information

Note: a plain select scans the whole table, which takes a long time. Because people are often interested in only part of a table's data, the concept of partitions was introduced at table-creation time.

Open http://192.168.10.111:8088/cluster/apps to check the status of the jobs.

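4. Hive data model: external tables

External Table

- Points to data that already exists in HDFS, and can have partitions.

- Its metadata is organized the same way as an internal table's, but the actual data is stored quite differently.

- An external table involves a single step: creating the table and loading the data happen together. The data is not moved into the data warehouse directory; the table only links to the external data. Dropping an external table removes only that link.

The experiment proceeds as follows:
- Prepare several txt data files with the same structure and put them in the /input directory on HDFS.
- In Hive, create an external table named external_student with the matching structure and set its location to the HDFS /input directory. external_student is then automatically associated with the files under /input.
- Query the external table.
- Delete some of the files under /input.
- Query the external table. The data from the deleted files is gone.
- Put the deleted files back into /input.
- Query the external table. The data from the restored files reappears.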
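
The behaviour in this outline follows from the table being only a pointer to a directory. As a rough local analogy, and assuming nothing about Hive internals, the sketch below treats a "query" as reading whatever files are in the location at that moment. scan_location is a hypothetical helper, not a Hive or HDFS command.

```shell
#!/bin/sh
# Local analogy of an external table: the "table" is just a directory path,
# and a "query" returns the rows of every file currently in that directory.
# scan_location is a hypothetical helper, not a Hive or HDFS command.
scan_location() {
    location=$1
    for f in "$location"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        if [ -f "$f" ]; then
            cat "$f"
        fi
    done
}
```

Removing a file from the directory and calling scan_location again returns fewer rows, and putting the file back restores them, which mirrors the delete and restore steps of the experiment.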

(1) Prepare the data:

In the admin home directory, create student1.txt with the content:

1,Tom,M,60,80,96

2,Mary,F,11,22,33

student2.txt with the content:

3,Jerry,M,90,11,23

student3.txt with the content:

4,Rose,M,78,77,76

5,Mike,F,99,98,98

$ hdfs dfs -ls /

$ hdfs dfs -mkdir /input

Put the files into the HDFS file system.

Syntax:

$ hdfs dfs -put localFileName hdfsFileDir

$ hdfs dfs -put student1.txt /input

$ hdfs dfs -put student2.txt /input

$ hdfs dfs -put student3.txt /input

$ hive

> create database test5;

> use test5;

(2) Create the external table

> create external table external_student(sid int, sname string, gender string, language int, math int, english int) row format delimited fields terminated by ',' location '/input';

(3) Query the external table

> select * from external_student;

(4) Delete student1.txt from HDFS

$ hdfs dfs -rm /input/student1.txt

(5) Query the external table

> select * from external_student;

(6) Put student1.txt back into the HDFS /input directory

$ hdfs dfs -put student1.txt /input

(7) Query the external table

> select * from external_student;

5. Hive data model: bucketed tables

Commands:

$ hive

> create database test6;

> use test6;

> create table users(sid int, sname string, age int) row format delimited fields terminated by ',';

Prepare the text data:

In the admin user's home directory, create users.txt with the following content:

1,Bear,18

2,Cherry,23

3,Lucky,33

4,Dino,26

5,Janel,28

Commands:

hive> load data local inpath '/home/admin/users.txt' into table users;

hive> create table bucket_table(sid int, sname string, age int) clustered by (sname) into 5 buckets row format delimited fields terminated by ',';

hive> insert overwrite table bucket_table select * from users;

hive> select * from bucket_table;

$ hadoop fs -ls /user/hive/warehouse/test6.db/bucket_table/
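
Which bucket file a row ends up in depends on a hash of the clustering column. The sketch below assumes the legacy scheme that Hive 2.x is understood to apply to string columns: a Java-style 31-multiplier hash over the bytes, masked to a non-negative value and taken modulo the bucket count. Newer Hive versions can use a different hash, so treat this as an illustration of the idea rather than a prediction of your file layout. string_hash and bucket_for are hypothetical helper names, and the code handles ASCII input only.

```shell
#!/bin/sh
# Assumption: legacy Hive bucketing for a single string column, i.e.
# bucket = (hash & 0x7FFFFFFF) % num_buckets with a Java-style string hash.
# Requires a shell with 64-bit arithmetic (the usual case on Linux).

# Java-style 31-multiplier hash of an ASCII string, kept to the low 32 bits.
string_hash() {
    s=$1
    h=0
    while [ -n "$s" ]; do
        rest=${s#?}                    # everything after the first character
        c=${s%"$rest"}                 # the first character itself
        code=$(printf '%d' "'$c")      # numeric code of that character
        h=$(( (h * 31 + code) & 0xFFFFFFFF ))
        s=$rest
    done
    printf '%s\n' "$h"
}

# Bucket index for a value and a bucket count under the assumption above.
bucket_for() {
    h=$(string_hash "$1")
    printf '%s\n' $(( (h & 0x7FFFFFFF) % $2 ))
}
```

With these definitions, "bucket_for Bear 5" prints 3 and "bucket_for Dino 5" prints 4. Rows whose sname hashes to the same index are written to the same bucket file, so the directory listed above would typically hold one file per bucket.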

6. Hive data model: views

Syntax:

Create a view

Create view viewName as select data from table where condition;

Show a view's structure

Desc viewName;

Query a view

Select * from viewName;

Drop a view

DROP VIEW [IF EXISTS] view_name

Commands:

$ hive

> create database test7;

> use test7;

a) Create a test table:

hive> create table test01(id int, name string) row format delimited fields terminated by ',';

hive> desc test01;

$ vi data1.txt

1,tom

2,jack

hive> load data local inpath '/home/admin/data1.txt' overwrite into table test01;

hive> select * from test01;

b) Before creating a view, use explain to see how Hive interprets and executes the create view statement.

hive> explain create view test_view(id, name_length) as select id, length(name) from test01;

c) Create the view.

hive> create view test_view(id, name_length) as select id, length(name) from test01;

d) Before querying the view, use explain to see the execution plan it is translated into.

hive> explain select sum(name_length) from test_view;

hive> explain extended select sum(name_length) from test_view;

e) Finally, run a query against the view. The output shows that Stage-1 runs a MapReduce job over the underlying table test01.

hive> select sum(name_length) from test_view;

Problems and Solutions

Error 1. Starting hive prints: ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory

[sxc@master ~]$ hive
ls: cannot access /software/spark/spark-2.2.0-bin-hadoop2.7/lib/spark-assembly-*.jar: No such file or directory
17/11/27 13:12:56 WARN conf.HiveConf: HiveConf of name hive.metastore.local does not exist
Logging initialized using configuration in jar:file:/software/hive/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties

Cause: after the upgrade to Spark 2, the single large JAR that used to live in the lib directory was split into many smaller JARs. spark-assembly-*.jar no longer exists, so Hive cannot find it.

Solution:

Open the bin directory under the Hive installation directory and find the hive file.

Locate the following section:

# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" ]]
then
  sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi

Fix: change the sparkAssemblyPath line (highlighted in red in the original post) so that the section reads:

# add Spark assembly jar to the classpath
if [[ -n "$SPARK_HOME" ]]
then
  sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`
  CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
fi
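
One detail of the modified snippet is worth knowing: ls inside backticks returns the JAR names separated by newlines, whereas a Java classpath expects entries separated by colons. If that matters in your setup, the names can be joined explicitly. The following is a hedged sketch of that idea, not the change the fix above makes; join_jars is a hypothetical helper.

```shell
#!/bin/sh
# Join every *.jar file in a directory into one colon-separated string that
# can be appended to CLASSPATH. Prints nothing when the directory has no JARs.
# join_jars is a hypothetical helper, not part of the hive script.
join_jars() {
    dir=$1
    joined=""
    for jar in "$dir"/*.jar; do
        # Skip the literal pattern that remains when nothing matches.
        if [ -f "$jar" ]; then
            # Prefix a colon only when something has already been collected.
            joined="${joined:+$joined:}$jar"
        fi
    done
    printf '%s' "$joined"
}
```

Inside the hive script this could be used as sparkAssemblyPath=$(join_jars "${SPARK_HOME}/jars"), assuming the function is defined earlier in the same script.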

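Further Reading

Differences between Hive and HBase

1. What are they?

Apache Hive is a data warehouse built on top of the Hadoop infrastructure. With Hive you can query data stored in HDFS using HQL, a SQL-like language whose statements are ultimately translated into Map/Reduce jobs. Although Hive offers SQL query capability, it is not suited to interactive queries, because it can only run batch jobs on Hadoop.

Apache HBase is a key/value system that runs on top of HDFS. Unlike Hive, HBase operations run in real time against its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families must be declared in the schema; a column family groups a set of columns of a given kind (the columns themselves need no schema definition). For example, a "message" column family might contain the columns "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of a row-key, column family, column, and timestamp. In HBase, a row is a collection of key/value mappings identified uniquely by its row-key. HBase builds on the Hadoop infrastructure and scales horizontally on commodity hardware.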

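2. Characteristics

Hive helps people who know SQL run MapReduce jobs. Because it is JDBC compliant, it also integrates with existing SQL tools. Hive queries can take a long time because, by default, they scan all the data in a table. This drawback can be limited through Hive's partitioning mechanism, which controls how much data one scan touches. Partitioning lets a filter query run over data sets stored in separate folders, so that only the specified folders (partitions) are read. It can be used, for example, to process only files from a given time range, as long as the file names include the time.

HBase works by storing key/value pairs. It supports four main operations: adding or updating rows, scanning a range of cells, getting a specified row, and deleting specified rows, columns, or column versions. Versioning is used to retrieve historical data (the history of each row can be deleted, and the space is then reclaimed through HBase compactions). Although HBase has tables, a schema is required only for tables and column families, not for columns. HBase tables also include increment/counter features.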

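3. Limitations

By default, Hive does not support update operations, and it is not ACID compliant out of the box (ACID transactional tables, introduced in Hive 0.14, must be enabled explicitly). In addition, because Hive runs batch operations on Hadoop, queries take a long time, typically from several minutes to several hours, to return results. Hive also requires a predefined schema that maps files and directories to columns.

HBase queries are written in a dedicated language that has to be learned. SQL-like functionality is available through Apache Phoenix, at the cost of having to provide a schema. HBase is also not fully ACID compliant, although it supports some of the properties. Last but not least, running HBase requires Zookeeper, a distributed coordination service that provides configuration services, metadata maintenance, and namespace services.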

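4. Use cases

Hive is suited to analytical queries over data collected during a period of time, for example computing trends or analyzing website logs. Hive should not be used for real-time queries, because results take a long time to come back.

HBase is well suited to real-time queries over big data. Facebook uses HBase for messaging and real-time analytics. It can also be used to count Facebook connections.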

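5. Summary

Hive and HBase are two different technologies built on Hadoop: Hive is a SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on top of Hadoop. The two tools can of course be used together. Just as you use Google for search and Facebook for social networking, Hive can handle analytical queries and HBase real-time queries. Data can also be written from Hive to HBase, and even written back from HBase to Hive.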