Hive进行词频统计

 HIve对某热搜数据进行词频统计

1.创建database和table,并把数据放到db2.sogou中

hive> show databases;
OK
db1
db2
default
Time taken: 0.057 seconds, Fetched: 3 row(s)
hive> drop database db2 cascade; //删除某个以存在的database;加cascade是强制删除,hive不允许删除含有内容的database;
OK
Time taken: 3.491 seconds
hive> create database db2;
OK
Time taken: 0.162 seconds
hive> create table if not exists db2.sogou(  //创建
    > time string,
    > uid string,
    > keyword string,
    > rank int,
    > seq int
    > )
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.39 seconds
hive> load data local inpath '/root/sogou.500w.utf8' into table db2.sogou;

2.启动mysql

[root@master ~]# mysql -u hive -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 31
Server version: 5.7.26 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> shows tables;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'shows tables' at line 1
mysql> show tables;
ERROR 1046 (3D000): No database selected
mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| AUX_TABLE                 |
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |
| COMPLETED_COMPACTIONS     |
| COMPLETED_TXN_COMPONENTS  |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| HIVE_LOCKS                |
| IDXS                      |
| INDEX_PARAMS              |
| KEY_CONSTRAINTS           |
| MASTER_KEYS               |
| NEXT_COMPACTION_QUEUE_ID  |
| NEXT_LOCK_ID              |
| NEXT_TXN_ID               |
| NOTIFICATION_LOG          |
| NOTIFICATION_SEQUENCE     |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TXNS                      |
| TXN_COMPONENTS            |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
| WRITE_SET                 |
+---------------------------+
57 rows in set (0.00 sec)

mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|     1 | NULL    | ip          | string    |           0 |
|     1 | NULL    | size        | string    |           3 |
|     1 | NULL    | time        | string    |           1 |
|     1 | NULL    | url         | string    |           2 |
|     2 | NULL    | ip          | string    |           0 |
|     2 | NULL    | size        | string    |           3 |
|     2 | NULL    | time        | string    |           1 |
|     2 | NULL    | url         | string    |           2 |
|     6 | NULL    | ip          | string    |           0 |
|     6 | NULL    | time        | string    |           1 |
|     6 | NULL    | url         | string    |           2 |
|     7 | NULL    | ip          | string    |           0 |
|     7 | NULL    | time        | string    |           1 |
|     7 | NULL    | url         | string    |           2 |
|    12 | NULL    | keyword     | string    |           2 |
|    12 | NULL    | rank        | int       |           3 |
|    12 | NULL    | seq         | int       |           4 |
|    12 | NULL    | time        | string    |           0 |
|    12 | NULL    | uid         | string    |           1 |
+-------+---------+-------------+-----------+-------------+
19 rows in set (0.00 sec)

mysql> 

3.hive查看table

hive> select * from db2.sogou limit 10;(hive模拟sql语句,文件数量太大,只输出前十行)
OK
20111230000005  57375476989eea12893c0c3811607bcf        奇艺高清        1       1
20111230000005  66c5bb7774e31d0a22278249b26bc83a        凡人修仙传      3       1
20111230000007  b97920521c78de70ac38e3713f524b50        本本联盟        1       1
20111230000008  6961d0c97fe93701fc9c0d861d096cd9        华南师范大学图书馆      1       1
20111230000008  f2f5a21c764aebde1e8afcc2871e086f        在线代理        2       1
20111230000009  96994a0480e7e1edcaef67b20d8816b7        伟大导演        1       1
20111230000009  698956eb07815439fe5f46e9a4503997        youku   1       1
20111230000009  599cd26984f72ee68b2b6ebefccf6aed        安徽合肥365房产网       1       1
20111230000010  f577230df7b6c532837cd16ab731f874        哈萨克网址大全  1       1
20111230000010  285f88780dd0659f5fc8acc7cc4949f2        IQ数码  1       1
Time taken: 18.74 seconds, Fetched: 10 row(s)

 

4.创建keyword_count用于存放词频统计的结果

hive> create table if not exists db2.keyword_count(
    > keyword string,
    > count int
    > )
    > row format delimited fields terminated by '\t';
OK
Time taken: 1.294 seconds

5.进行统计

hive> insert overwrite table db2.keyword_count
    > select keyword,count(time) from db2.sogou group by keyword;
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1558881978093_0001
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2019-05-27 00:42:46,578 Stage-1 map = 0%,  reduce = 0%
2019-05-27 00:43:29,284 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 17.5 sec
2019-05-27 00:43:36,215 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 38.47 sec
·······
Total MapReduce CPU Time Spent: 1 minutes 47 seconds 690 msec
OK

6.输出结果的前几行

hive> select * from db2.keyword_count limit 20;
OK
! ( NEW ) ! gay japan - [coat] Male Body Undress 2 [CTO340] (2 029 125 632).avi 1935M 
"http://10.1.5.211/"    1
"http://17tmp3.com/"    1
"http://192.168.1.1/"   34
"http://192.168.1.101/" 1
"http://210.38.198.40/" 1
"http://221.136.78.219:8080/sionline"   1
"http://222.90.111.150/invoicing"       1
"http://222.90.111.150/invoicing//"     2
Time taken: 0.51 seconds, Fetched: 20 row(s)

7.按次数降序排列,修改select语句,count(time) as c:time是唯一的,把time当做c,group by:按什么统计

order by c:出现个数按降序排列,desc:降序简写

hive> insert overwrite table db2.keyword_count
    > select keyword,count(time) as c from db2.sogou group by keyword order by c desc;
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2019-05-27 00:51:46,489 Stage-1 map = 0%,  reduce = 0%
2019-05-27 00:52:17,217 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 17.39 sec
2019-05-27 00:53:26,775 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 90.19 sec
MapReduce Total cumulative CPU time: 1 minutes 30 seconds 190 msec
Selecting local mode for task: Stage-2
Job running in-process (local Hadoop)
2019-05-27 00:53:31,291 Stage-2 map = 0%,  reduce = 0%
2019-05-27 00:53:53,484 Stage-2 map = 100%,  reduce = 100%
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 90.19 sec   HDFS Read: 573709281 HDFS Write: 62953072 SUCCESS
Stage-Stage-2:  HDFS Read: 125931322 HDFS Write: 176545197 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 30 seconds 190 msec
OK
Time taken: 150.705 seconds
hive> 

处理过程有两个stage,1个MapReduce完成不了,要分两个阶段

8.查看统计结果,只看前几行

hive> select * from db2.keyword_count limit 10;
OK
百度    38441
baidu   18312
人体艺术        14475
4399小游戏      11438
qq空间  10317
优酷    10158
新亮剑  9654
馆陶县县长闫宁的父亲    9127
公安卖萌        8192

Time taken: 0.529 seconds, Fetched: 10 row(s)
hive> 

数据地址

链接:https://pan.baidu.com/s/1iEjTtCOEP4pPjYzlA6LoKw 
提取码:uhlo 


Hive简单命令

1.创建hiveTestTable:

hive> CREATE TABLE IF NOT EXISTS db1.hiveTestTable(
    > ip string,
    > time string,
    > url string,
    > size string
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 2.858 seconds

2.删除database,和table

Drop Table语句

语法:

DROP TABLE [IF EXISTS] table_name;

 一下查询删除一个名为hello的表

hive> DROP TABLE IF EXISTS hello;

删除刚才创建的表

hive> DROP TABLE hiveTestTable;
OK
Time taken: 0.094 seconds
hive> 

 删除一个名为db2的database

hive> show databases;
OK
db1
db2
default
Time taken: 0.057 seconds, Fetched: 3 row(s)
hive> drop database db2 cascade;
OK
Time taken: 3.491 seconds
hive> create database if not exists db1;
ovide truststore for server certificate verification.
OK
Time taken: 78.89 seconds

3.载入数据到hiveTestTable1表中

数据

95096	20111230000009	698956eb07815439fe5f46e9a4503997	http://www.youku.com/
95144	20111230000009	599cd26984f72ee68b2b6ebefccf6aed	http://hf.house365.com/
95192	20111230000010	f577230df7b6c532837cd16ab731f874	http://www.kz321.com/
95192	20111230000010	285f88780dd0659f5fc8acc7cc4949f2	http://www.iqshuma.com/
95048	20111230000010	f4ba3f337efb1cc469fcd0b34feff9fb	http://mobile.zol.com.cn/148/1487938.html
95144	20111230000010	3d1acc7235374d531de1ca885df5e711	http://baike.baidu.com/view/6500.htm
95144	20111230000010	dbce4101683913365648eba6a85b6273	http://zhidao.baidu.com/question/38626533
95096	20111230000011	58e7d0caec23bcb4daa7bbcc4d37f008	http://tv.sogou.com/vertical/2xc3t6wbuk24jnphzlj35zy.html?p=40230600
95096	20111230000011	a3b83dc38b2bbc35660dffcab4ed9da8	http://www.7183.info/
95192	20111230000011	b89952902d7821db37e8999776b32427	http://wenwen.soso.com/z/q131927207.htm
95048	20111230000011	7c54c43f3a8a0af0951c26d94a57d6c8	http://www.baidu.com/
95192	20111230000011	2d6c22c084a501c0b8f7f0a845aefd9f	http://www.dy241.com/
95048	20111230000011	11097724dae8b9fdcc60bd6fa4ce4df2	http://118123.net/
95144	20111230000012	1d374b57fbbc81aa0cc38e6f4efb88ec	http://tui.qihoo.com/28302631/article_2893190.html
95144	20111230000012	76029a8965e815b413cba0b50d2ec2b0	http://baike.baidu.com/view/1941330.htm
95192	20111230000013	22201bdc15845bfb33384efc3a283ef4	http://cf.qq.com/
95144	20111230000013	e0d255845fc9e66b2a25c43a70de4a9a	http://hanyu.iciba.com/wiki/1230433.shtml
95048	20111230000013	b89952902d7821db37e8999776b32427	http://zhidao.baidu.com/question/224925866
95096	20111230000013	072fa3643c91b29bd586aff29b402161	http://download.csdn.net/detail/think1919/3935722

操作

①载入数据到hiveTestTable1中

hive> LOAD DATA local inpath '/root/hiveTestData.txt' into table db1.hiveTestTable1;
Loading data to table db1.hivetesttable1
OK
Time taken: 20.104 seconds
hive> SELECT * from db1.hiveTestTable1;//查看hiveTableTest1表中的数据
OK
95096   20111230000009  698956eb07815439fe5f46e9a4503997        http://www.youku.com/
95144   20111230000009  599cd26984f72ee68b2b6ebefccf6aed        http://hf.house365.com/
95192   20111230000010  f577230df7b6c532837cd16ab731f874        http://www.kz321.com/
95192   20111230000010  285f88780dd0659f5fc8acc7cc4949f2        http://www.iqshuma.com/
95048   20111230000010  f4ba3f337efb1cc469fcd0b34feff9fb        http://mobile.zol.com.cn/148/1487938.html
95144   20111230000010  3d1acc7235374d531de1ca885df5e711        http://baike.baidu.com/view/6500.htm
95144   20111230000010  dbce4101683913365648eba6a85b6273        http://zhidao.baidu.com/question/38626533
Time taken: 4.585 seconds, Fetched: 19 row(s)

 ②从表hiveTestTable1中查询部分数据,并将其作为依据创建表hiveTestTable2

hive> CREATE TABLE db1.hiveTestTable2 AS SELECT ip,time,url from db1.hiveTestTable1;
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20190526225930_78b79294-34da-4369-86a0-224189628dee
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-05-26 22:59:35,283 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local1707359481_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/hive/warehouse/db1.db/.hive-staging_hive_2019-05-26_22-59-30_592_3294678209563217364-1/-ext-10002
Moving data to directory hdfs://master:9000/hive/warehouse/db1.db/hivetesttable2
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 3388 HDFS Write: 2796 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 7.071 seconds
hive> SELECT * from db1.hiveTestTable2;
OK
95096   20111230000009  698956eb07815439fe5f46e9a4503997
95144   20111230000009  599cd26984f72ee68b2b6ebefccf6aed
95192   20111230000010  f577230df7b6c532837cd16ab731f874
95192   20111230000010  285f88780dd0659f5fc8acc7cc4949f2
95048   20111230000010  f4ba3f337efb1cc469fcd0b34feff9fb
Time taken: 0.61 seconds, Fetched: 19 row(s)

③利用表hiveTestTable2的表结构创建hiveTestTable3,并查看表的结构

hive> CREATE TABLE IF NOT EXISTS db1.hiveTestTable3 LIKE db1.hiveTestTable2;
OK
Time taken: 0.397 seconds
hive> DESC db1.hiveTestTable3;
OK
ip                      string                                      
time                    string                                      
url                     string                                      
Time taken: 0.19 seconds, Fetched: 3 row(s)
hive>

 

 

 

  • 2
    点赞
  • 10
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值