Using the HDFS Web UI

隔壁老K~

Published 2024-04-17 08:32:14


Tags: hdfs, hadoop, big data

Article link: https://blog.csdn.net/weixin_48098867/article/details/137854353


The HDFS web UI can be opened at node1:9870, but only if the hostname mapping has been configured on Windows; otherwise it is reachable only by IP, at 192.168.88.161:9870.

Open the file c:\windows\system32\drivers\etc\hosts
and add the following host mappings at the bottom:
192.168.88.161 node1
192.168.88.162 node2
192.168.88.164 node3
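
To double-check a mapping before opening the web UI, a small lookup over a hosts-format file can help. The sketch below is only an illustration: it runs against a temporary sample file rather than the real Windows hosts file, and the function name `lookup_host` is an assumption made for this example, not part of any tool.

```shell
#!/bin/sh
# Sample file in hosts format (same syntax as c:\windows\system32\drivers\etc\hosts).
hosts_file="$(mktemp)"
cat > "$hosts_file" <<'EOF'
# cluster nodes
192.168.88.161 node1
192.168.88.162 node2
192.168.88.164 node3
EOF

# lookup_host FILE NAME: print the IP mapped to NAME, ignoring comment lines.
lookup_host() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

lookup_host "$hosts_file" node1   # prints 192.168.88.161
```

If the lookup prints nothing for a name, that name is not mapped and the web UI can only be reached by IP.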

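Note: in enterprise development, files are rarely uploaded through the web page.

1. In big data development, files are usually transferred from one server to another and rarely uploaded from a local machine, because local resources are limited.

2. Uploading through the page generally requires elevated permissions. Those permissions are not granted to everyone, so new hires usually do not have them.

3. As a result, page uploads are generally acceptable in a test environment but not in production.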

Data loading

Importing data into a table from another location is called data loading.

-- Data loading
-- Data preparation
CREATE TABLE test_db.test_load(
                             dt string comment 'time (hh:mm:ss)',
                             user_id string comment 'user ID',
                             word string comment 'search term',
                             url string comment 'URL visited by the user'
) comment 'search engine log table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- 1. On Linux, upload the data to the table's directory on HDFS so that the table maps it
-- Step 1: place the data file in /root/hive_data
-- Step 2: use a shell command to upload the data file to the HDFS directory (/user/hive/warehouse/test_db.db/test_load)
hadoop fs -put /root/hive_data/search_log.txt /user/hive/warehouse/test_db.db/test_load
-- Step 3: check that the table maps the file
select * from test_db.test_load;

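-- 2. The load data SQL statement can load data directly from a local file
-- Step 1: copy the table structure of test_load
create table test_db.tset_load1 like test_db.test_load;

-- Step 2: use load data to load the data from the local file system into the table
-- Syntax: load data local inpath 'file path' into table table_name;
load data local inpath '/home/hadoop/search_log.txt' into table test_db.tset_load1;
-- Step 3: check that the data was mapped into the table
select * from test_db.tset_load1;
-- Conclusion: load data local behaves much like put; the source file still exists after loading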

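-- 3. The load data SQL statement can also load data from HDFS into the table directory
-- Step 1: copy the table structure of test_load
create table test_db.test_load2 like test_db.test_load;
-- Step 2: upload the data to /tmp/test_data/ on HDFS
-- Step 3: load /tmp/test_data/search_log.txt from HDFS into the table
-- Syntax: load data inpath 'file path on HDFS' into table table_name;
load data inpath 'hdfs://node1:8020/tmp/test_data/search_log.txt' into table test_db.test_load2;
-- Step 4: check that the data was mapped
select * from test_db.test_load2;
-- Conclusion: when load data reads from HDFS, the file is moved into the table directory and disappears from its original location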

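-- 4. load data ... overwrite ...: overwrite on load
-- Loading into the same table several times appends the data each time; to avoid duplicate loads, use overwrite
-- tset_load1 has already been loaded several times, so use an overwriting load here: the existing data is cleared and then the new data is inserted
load data local inpath '/root/hive_data/search_log.txt' overwrite into table test_db.tset_load1;
-- Check that the load succeeded: the earlier data is gone and only the rows from this load remain
select * from test_db.tset_load1;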

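Notes:

1. For an external table, we generally do not create the table first and then move the data into it.

2. Instead, we create the external table and point its location at the directory where the data already lives.

3. With the local keyword, data is loaded from the local file system; without local, it is loaded from HDFS. In development, data is usually loaded from HDFS.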
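
Notes 1 and 2 can be sketched as a script file that is run with hive -f. This is a minimal sketch under assumptions: the table name test_load_ext is hypothetical, and the location /tmp/test_data is borrowed from the HDFS example above and assumed to contain only tab-separated log files. The script calls hive only if the client is installed; otherwise it just leaves the HiveQL in a file.

```shell
#!/bin/sh
# Write the HiveQL to a temporary script file.
# test_load_ext is a hypothetical table name used only for this sketch.
hql_file="$(mktemp)"
cat > "$hql_file" <<'EOF'
-- External table mapped onto the directory that already holds the data.
CREATE EXTERNAL TABLE IF NOT EXISTS test_db.test_load_ext (
    dt      STRING COMMENT 'time (hh:mm:ss)',
    user_id STRING COMMENT 'user ID',
    word    STRING COMMENT 'search term',
    url     STRING COMMENT 'URL visited by the user'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/test_data';

SELECT * FROM test_db.test_load_ext LIMIT 5;
EOF

# Run the script only where the hive client is available.
if command -v hive >/dev/null 2>&1; then
    hive -f "$hql_file"
else
    echo "hive client not found; HiveQL written to $hql_file"
fi
```

Because the table is external and only points at the directory, no load data step is needed and the files stay where they are.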

Other ways to load data

-- 1. insert into ... values
-- Step 1: copy the table structure of test_load
create table test_db.test_load3 like test_db.test_load;
-- Step 2: use insert into ... values to load two rows
insert into test_db.test_load3 values ('00:00:01','1233215666','淘宝','http://www.taobao.cn'),('01:06:00','3233217666','大数据','http://www.itcast.cn');
-- Step 3: check that the rows were loaded; the data file is written with the delimiter specified when the table was created
select * from test_db.test_load3;
-- Conclusion: loading data this way is too slow, so it is rarely used

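-- 2. insert into ... select
-- Load the result of a query or cleaning step into a table
-- Step 1: copy the table structure of test_load
create table test_db.test_load4 like test_db.test_load;
-- Step 2: read the rows of test_load that satisfy user_id > '123' (a string comparison, since user_id is a string) and store them in test_load4
insert into test_db.test_load4 select * from test_db.test_load where user_id>'123';
-- Step 3: check that the data was loaded
select * from test_db.test_load4;
-- Conclusion: this approach is used more often than the previous one. Time spent on the mechanical act of inserting rows is wasted, but time spent inserting cleaned results is worthwhile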

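-- 3. insert overwrite ... select
-- Overwrite the target table with the result of a query or cleaning step
-- Step 1: read the rows of test_load with user_id < '13452' and overwrite test_load4 with them
insert overwrite table test_db.test_load4 select * from test_db.test_load where user_id<'13452';
-- Step 2: check that the table was overwritten correctly
select * from test_db.test_load4;
-- Conclusion: overwriting is not used very often either, because historical data usually has to be kept in development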

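The create table ... as select ... from syntax stores a query result in a new table. Compared with insert into ... select ... from, it saves a step: the table is created and populated by a single statement, whereas insert into requires the target table to be created first and then loaded. The trade-off is control: create table ... as select takes its column names and types from the query, so the table definition cannot be specified in as much detail as with an explicit create table.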
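
As a rough illustration of the single-statement form, the sketch below passes a create table ... as select statement to hive -e. The target table name test_load5 is hypothetical, and the filter is copied from the insert into ... select example earlier; the command is only executed if the hive client is installed.

```shell
#!/bin/sh
# test_load5 is a hypothetical table name used only for this sketch.
ctas_sql="CREATE TABLE test_db.test_load5 AS SELECT * FROM test_db.test_load WHERE user_id > '123';"

# Run the statement only where the hive client is available.
if command -v hive >/dev/null 2>&1; then
    hive -e "$ctas_sql"
else
    echo "hive client not found; would run: $ctas_sql"
fi
```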

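Data export

Moving data out of a table to another location is called data export.

Question: inserting rows directly with insert into is too inefficient, so how do we usually load data?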

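-- Data export
-- 1. insert overwrite local directory '...' select * from ...
-- Save the query result to a local directory
-- With local, data is saved to or loaded from the local file system
-- Step 1: export the data in test_load to /root/aaa on node1
-- Note: the export target must be a directory, not a file
insert overwrite local directory '/root/aaa' select * from test_db.test_load;
-- Step 2: inspect the exported file. Without a row format clause the fields are separated by Hive's default delimiter (\001), not \t. The target directory is overwritten: all existing files are removed and only the export result remains, under a default file name
-- Step 3: specify the delimiter when exporting
insert overwrite local directory '/root/aaa' row format delimited fields terminated by '\t' select * from test_db.test_load;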


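-- 2. insert overwrite directory '...' select * from ...
-- Save the query result to a directory on HDFS
insert overwrite directory '/tmp/small' select * from test_db.test_load;

-- 3. Export has no append mode: insert into directory is not valid syntax, so the statement below is rejected
-- insert into directory '/tmp/small' select * from test_db.test_load;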

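Notes:

1. Be very careful when exporting with insert overwrite directory: all existing data in the target directory is cleared and cannot be recovered.

2. Data is exported into the specified directory. Any file name given in the path has no effect; the output ends up in a file with a default name.

3. Export can only overwrite, never append.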
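
Given note 1, one possible safeguard for local exports is to check the target directory before running the export. The helper below is a sketch and not part of Hive: the function name safe_export_dir is made up for this example, and it only covers local directories such as /root/aaa, not HDFS paths.

```shell
#!/bin/sh
# safe_export_dir DIR: succeed only if DIR does not exist yet or is an empty directory.
safe_export_dir() {
    [ ! -e "$1" ] && return 0      # nothing there yet, nothing to lose
    [ -d "$1" ] || return 1        # exists but is not a directory
    [ -z "$(ls -A "$1")" ]         # directory must contain no entries
}

target="/root/aaa"
if safe_export_dir "$target"; then
    echo "OK to export into $target"
else
    echo "Refusing: $target exists and is not empty" >&2
fi
```

Running the check first means insert overwrite local directory is only issued against a directory whose contents do not matter.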

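Running Hive SQL from the hive shell command

Question: how many shell clients does Hive have?

Only Hive's first-generation client (the hive command) is a shell client; the second-generation client (beeline) connects remotely over JDBC and is not.

1. Run a SQL statement from the terminal
hive -e 'select * from test_db.test_load'
2. Run SQL statements from a script file
First create a file named hive.hql containing: select * from test_db.test_load
Then run: hive -f hive.hql

-e runs the SQL statement given on the command line

-f runs a script file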
