Hadoop之命令操作

最新推荐文章于 2023-03-05 15:18:25 发布

李东要努力

最新推荐文章于 2023-03-05 15:18:25 发布

阅读量507

点赞数

本文链接：https://blog.csdn.net/weixin_45034355/article/details/102613584

版权

在虚拟机上删除hdfs文件：
删除hdfs下的文件： hadoop fs -rm /文件夹名
删除hdfs下的文件夹：hadoop fs -rm -r 文件名
查看根目录下的内容： hadoop fs -cat /文件名
增加权限：chmod +x 文件名
ps -ef|grep 文件名：查看某个进程号
复制文件： cp 文件名文件名
mv 既可以改文件名也可以移动文件（在同一台机器上可以）
将一台机器上的文件复制到另一台机器：scp 文件路径 hdp-2:$PWD

-r 递归移动文件夹

批量替换文件里面的内容：
1。vi a.txt(文件名)
2。按esc键
3。按shift+：
4。在后面输入：%s/a/b/g
注：s:代表 substitute（替换）%:代表所有行 a替换成b g:代表global（全体的）

在hive中进行操作：

（hive的数据分为两部分：一部分真实数据，存放在HDFS上，另一部分是真实数据的元数据，一边存放在MySQL上，可以修改 hive-site.xml来完成）

首先进入hive：

1.启动hiveserver

2.启动hive服务端（克隆一台）

另一种启动hive （直接启动）：

小提示：在做小数据量查询测试时，可以让hive将mrjob提交给本地运行器运行，可以在hive会话中设置如下参数：(语法执行的快)

hive> set hive.exec.mode.local.auto=true;

hive> set hive.exec.mode.local.auto=false;(退出 )

设置一些基本参数，让hive使用起来更便捷，比如：

让提示符显示当前库： hive>set hive.cli.print.current.db=true;

显示查询结果时显示字段名称：hive>set hive.cli.print.header=true;

创建表：create table 库名.表名（id string，name string ,age string);

例：create table person.teacher(id string,name string,age string) row format delimited fields terminated by ',';

往表中插入数据：insert into 表名 values（'id' ，'name' ,'age' );

例：insert into teacher values ('1','zhangsan','25');

删除表：drop table 表名；

查看表结构：desc 表名；

更加详细查看表的信息： desc formatted 表名；

创建外部表（external table）：要在创建表的时候自己定义路径，不可以是hive默认的存储路径：/user/hive/warehouse

在hive上删除外部表时（drop 表名）：只会删除元数据，在hdfs上表仍会存在

创建内部表(managed table)：再创建表的时候不用定义表路径，

在hive上删除内部表时，会将元数据和hdfs上的数据都删除

元数据：就是数据的描述信息，例如表的字段、存储位置、时间、大小

一个分区字段的实例：

示例如下：

1.创建带分区的表

create table person.t_access(ip string,url string,access_time string)
partitioned by(dt string)   #分区
row format delimited
fields terminated by ',';

2.做假数据在/root/hive_data 里面创建access.log1文件

192.168.33.2,http://www.sina.com/a,2019-10-17 20:08:24
192.168.33.3,http://www.sina.com/b,2019-10-17 21:08:24
192.168.33.4,http://www.sina.com/a,2019-10-17 22:08:24
192.168.33.5,http://www.sina.com/c,2019-10-17 23:08:24
192.168.33.2,http://www.sina.com/a,2019-10-17 24:08:24
192.168.33.7,http://www.sina.com/b,2019-10-17 20:09:24
192.168.33.2,http://www.sina.com/a,2019-10-17 20:12:24

另一组假数据 access.log2

192.168.33.2,http://www.sina.com/a,2019-10-18 20:08:24
192.168.33.3,http://www.sina.com/b,2019-10-18 21:08:24
192.168.33.4,http://www.sina.com/a,2019-10-18 22:08:24
192.168.33.5,http://www.sina.com/c,2019-10-18 23:08:24
192.168.33.2,http://www.sina.com/a,2019-10-18 24:08:24
192.168.33.7,http://www.sina.com/b,2019-10-18 20:09:24
192.168.33.2,http://www.sina.com/a,2019-10-18 20:12:24

3.向分区中导入数据：（在hive下进行），将两组假数据传入到表中

load data local inpath '/root/hive_data/access.log1' into table t_access partition(dt='20191017');（20191017带引号原因是string类型）

load data local inpath '/root/hive_data/access.log2' into table t_access partition(dt='20191017');

4.针对分区数据进行查询

a.统计10月17号的总PV

select count(*) from t_access where dt='20170804';

5.通过已存在表来建表：

create table 表1 like 表2；

（新建的表1的结构定义和表2一致，但是没有数据），可以 desc 查看

select count(*) from 表名;

select max(ip) from 表名;

6.可以在建表的同时插入数据

create table 表1 
as
select * from 表2；

1.数据的导入导出

1.1将数据文件导入hive的表

方式一：导入数据的一种方式：（手动用hdfs命令，将文件放入表目录（hive））

hadoop fs -put 本地文件 hdfs路径（hive里也会出现数据）

例：hadoop fs -put access.log3 /user/hive/warehouse/person.db/t_access

方式二：在hive的交互式shell中用hive命令来导入本地文件到表路径中

load data local inpath 本地路径 into table 表名；(hive和hdfs上都会有数据)

例： load data local inpath '/root/hive_data/p3.data' into table teacher;

方式三：用hive命令导入hdfs中的数据文件到表目录中

load data inpath ‘hdfs上文件路径’ into table 表名 partition（分区名='用什么进行分区'）

例：load data inpath '/user/hive/warehouse/person.db/t_access/access.log3' into table t_access partition(dt='20191018');

注意：导本地文件和导hdfs文件区别：

本地文件导入表：复制。hdfs导入表：移动

（命令中由local是导入的本地文件，没有的话导入的是hdfs文件）

数字类型;

java : boolean 、char 、byte、 short 、int 、long 、float、 double

2.复合类型

2.1 array数组类型

假如有如下数据需要用hive表去映射。

Zhan 2,wujing:liushishi:yangyang,2017-08-16
Zhan 3,fanbingbing:lichen,2017-08-20

设想：如果主演信息用一个数组来映射比较方便

在hive中建表：

create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';

导入数据(在hive下)：load data local inpath '/root/hive_data/movie.dat' into table t_movie;

查询：

select t_movie.movie_name,actors[0] from t_movie;（查询电影名字，和第一个演员）

select t_movie.movie_name,actors from t_movie where array_contains(actors,'fanbingbing');（包含fanbingbing的）

select t_movie.movie_name,size(actors) as actor_num from t_movie;（as 起别名。查看电影中演员总人数）

2.2 map类型

1）假如有以下数据（在本地里创建）

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

可以用一个map类型来对上述数据中的家庭成员进行描述

在hive下建表：

create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

导入数据（在hive下）：load data local inpath ‘/root/hive_data/relation_data’ into table t_person;

查询

select * from t_person;

+--------------+----------------+----------------------------------------------------------------+---------------+--+
| t_person.id | t_person.name | t_person.family_members | t_person.age |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| 1 | zhangsan | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28 |
| 2 | lisi | {"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22 |
| 3 | wangwu | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29 |
| 4 | mayun | {"father":"mayongzhen","mother":"angelababy"} | 26 |
+--------------+----------------+----------------------------------------------------------------+---------------+--+

## 取map字段的指定key的值

select t_person.family_members['father'] as father from t_person;

##取map字段中所有key值（取values值一样）

select id, name，map_keys(t_person.family_members) as relation from t_person;

+-----+-----------+--------------------------------+--+
| id | name | relation |
+-----+-----------+--------------------------------+--+
| 1 | zhangsan | ["father","mother","brother"] |
| 2 | lisi | ["father","mother","brother"] |
| 3 | wangwu | ["father","mother","sister"] |
| 4 | mayun | ["father","mother"] |
+-----+-----------+--------------------------------+--+

## 综合：查询有brother的用户信息

select id,name,father from

(select id,name,t_person.family_members['brother'] as father from t_person) tmp

where father is not null;

先查询括号里的，命名为tmp ，然后在where father is not null ，最后select id,name,father

+-----+-----------+---------+--+
| id | name | father |
+-----+-----------+---------+--+
| 1 | zhangsan | xiaoxu |
| 2 | lisi | guanyu |
+-----+-----------+---------+--+

2.3struct 类型

1)假设有以下数据

1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai

其中的用户信息包含：年龄：整数，性别：字符串，地址：字符串

设想用一个字段来描述整个用户信息，可以采用struct

2）建表

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

导入数据（在hive中）：load data local inpath '/root/hive_data/t_person/' into table t_person_struct;

+---------------------+-----------------------+----------------------------------------------+--+
| t_person_struct.id | t_person_struct.name | t_person_struct.info |
+---------------------+-----------------------+----------------------------------------------+--+
| 1 | zhangsan | {"age":18,"sex":"male","addr":"beijing"} |
| 2 | lisi | {"age":28,"sex":"female","addr":"shanghai"} |
+---------------------+-----------------------+----------------------------------------------+--+

查询：select id,name,info.age from t_person_struct;

+-----+-----------+------+--+
| id | name | age |
+-----+-----------+------+--+
| 1 | zhangsan | 18 |
| 2 | lisi | 28 |
+-----+-----------+------+--+

3.修改表定义

3.1修改表名

alter table t_1 rename to t_x;

3.2 修改分区表中分区字段的值

alter table t_1 partition（department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);

注意：如果要修改的值有相同的话，都会被修改

3.3 删除分区表中数据

alter table t_1 partition (dt=20190804');

4.hive 查询语法

4.1 基本查询示例

假如有 a.txt

a,1
b,2
c,3
d,4

假如有b.txt

a,xx
b,yy
d,zz
e,pp

建表：

create table t_a (name string,num string) row format delimited fields terminated by ',';

create table t_b(name string,num string) row format delimited fields terminated by ',';

导入数据（在hive中）

内连接(右连接)

select a.*，b.* from t_a a join t_b b;

+---------+--------+---------+--------+--+
| a.name | a.num | b.name | b.num |
+---------+--------+---------+--------+--+
| a | 1 | a | xx |
| b | 2 | a | xx |
| c | 3 | a | xx |
| d | 4 | a | xx |
| a | 1 | b | yy |
| b | 2 | b | yy |
| c | 3 | b | yy |
| d | 4 | b | yy |
| a | 1 | d | zz |
| b | 2 | d | zz |
| c | 3 | d | zz |
| d | 4 | d | zz |
| a | 1 | e | pp |
| b | 2 | e | pp |
| c | 3 | e | pp |
| d | 4 | e | pp |
+---------+--------+---------+--------+--+

左连接；select a.*,b.* from t_a a left outer join t_b b;

+---------+--------+---------+--------+--+
| a.name | a.num | b.name | b.num |
+---------+--------+---------+--------+--+
| a | 1 | a | xx |
| a | 1 | b | yy |
| a | 1 | d | zz |
| a | 1 | e | pp |
| b | 2 | a | xx |
| b | 2 | b | yy |
| b | 2 | d | zz |
| b | 2 | e | pp |
| c | 3 | a | xx |
| c | 3 | b | yy |
| c | 3 | d | zz |
| c | 3 | e | pp |
| d | 4 | a | xx |
| d | 4 | b | yy |
| d | 4 | d | zz |
| d | 4 | e | pp |
+---------+--------+---------+--------+--+

group by 分组聚合

（有Group by的时候，那个group by 会在select、alter、update 等之前运行）

select dt,count(*),max(ip) from t_access group by dt;

max 最大

select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170804';

having dt>'20170804' 的意思是表中的dt大于20170804这个数字

执行顺序：先groupby 再having，在再select。having是对group by聚合之后的结果进行再过滤；

李东要努力

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hadoop之命令操作

在虚拟机上删除hdfs文件：删除hdfs下的文件： hadoop fs -rm /文件夹名删除hdfs下的文件夹：hadoop fs -rm -r 文件名查看根目录下的内容： hadoop fs -cat /文件名增加权限：chmod +x 文件名ps -ef|grep 文件名：查看某个进程号复制文件： cp 文件名文件名mv 既可以改文件名也可以移动文件（在同一台机器上可以）...
复制链接

扫一扫