HIve对某热搜数据进行词频统计
1.创建database和table,并把数据放到db2.sogou中
hive> show databases;
OK
db1
db2
default
Time taken: 0.057 seconds, Fetched: 3 row(s)
hive> drop database db2 cascade; //删除某个以存在的database;加cascade是强制删除,hive不允许删除含有内容的database;
OK
Time taken: 3.491 seconds
hive> create database db2;
OK
Time taken: 0.162 seconds
hive> create table if not exists db2.sogou( //创建
> time string,
> uid string,
> keyword string,
> rank int,
> seq int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.39 seconds
hive> load data local inpath '/root/sogou.500w.utf8' into table db2.sogou;
2.启动mysql
[root@master ~]# mysql -u hive -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 31
Server version: 5.7.26 MySQL Community Server (GPL)
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> shows tables;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'shows tables' at line 1
mysql> show tables;
ERROR 1046 (3D000): No database selected
mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive |
+---------------------------+
| AUX_TABLE |
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_COMPACTIONS |
| COMPLETED_TXN_COMPONENTS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| DELEGATION_TOKENS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| HIVE_LOCKS |
| IDXS |
| INDEX_PARAMS |
| KEY_CONSTRAINTS |
| MASTER_KEYS |
| NEXT_COMPACTION_QUEUE_ID |
| NEXT_LOCK_ID |
| NEXT_TXN_ID |
| NOTIFICATION_LOG |
| NOTIFICATION_SEQUENCE |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
| WRITE_SET |
+---------------------------+
57 rows in set (0.00 sec)
mysql> select * from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
| 1 | NULL | ip | string | 0 |
| 1 | NULL | size | string | 3 |
| 1 | NULL | time | string | 1 |
| 1 | NULL | url | string | 2 |
| 2 | NULL | ip | string | 0 |
| 2 | NULL | size | string | 3 |
| 2 | NULL | time | string | 1 |
| 2 | NULL | url | string | 2 |
| 6 | NULL | ip | string | 0 |
| 6 | NULL | time | string | 1 |
| 6 | NULL | url | string | 2 |
| 7 | NULL | ip | string | 0 |
| 7 | NULL | time | string | 1 |
| 7 | NULL | url | string | 2 |
| 12 | NULL | keyword | string | 2 |
| 12 | NULL | rank | int | 3 |
| 12 | NULL | seq | int | 4 |
| 12 | NULL | time | string | 0 |
| 12 | NULL | uid | string | 1 |
+-------+---------+-------------+-----------+-------------+
19 rows in set (0.00 sec)
mysql>
3.hive查看table
hive> select * from db2.sogou limit 10;(hive模拟sql语句,文件数量太大,只输出前十行)
OK
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1
20111230000007 b97920521c78de70ac38e3713f524b50 本本联盟 1 1
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1
20111230000008 f2f5a21c764aebde1e8afcc2871e086f 在线代理 2 1
20111230000009 96994a0480e7e1edcaef67b20d8816b7 伟大导演 1 1
20111230000009 698956eb07815439fe5f46e9a4503997 youku 1 1
20111230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房产网 1 1
20111230000010 f577230df7b6c532837cd16ab731f874 哈萨克网址大全 1 1
20111230000010 285f88780dd0659f5fc8acc7cc4949f2 IQ数码 1 1
Time taken: 18.74 seconds, Fetched: 10 row(s)
4.创建keyword_count用于存放词频统计的结果
hive> create table if not exists db2.keyword_count(
> keyword string,
> count int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 1.294 seconds
5.进行统计
hive> insert overwrite table db2.keyword_count
> select keyword,count(time) from db2.sogou group by keyword;
Kill Command = /usr/hadoop/bin/hadoop job -kill job_1558881978093_0001
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2019-05-27 00:42:46,578 Stage-1 map = 0%, reduce = 0%
2019-05-27 00:43:29,284 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 17.5 sec
2019-05-27 00:43:36,215 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 38.47 sec
·······
Total MapReduce CPU Time Spent: 1 minutes 47 seconds 690 msec
OK
6.输出结果的前几行
hive> select * from db2.keyword_count limit 20;
OK
! ( NEW ) ! gay japan - [coat] Male Body Undress 2 [CTO340] (2 029 125 632).avi 1935M
"http://10.1.5.211/" 1
"http://17tmp3.com/" 1
"http://192.168.1.1/" 34
"http://192.168.1.101/" 1
"http://210.38.198.40/" 1
"http://221.136.78.219:8080/sionline" 1
"http://222.90.111.150/invoicing" 1
"http://222.90.111.150/invoicing//" 2
Time taken: 0.51 seconds, Fetched: 20 row(s)
7.按次数降序排列,修改select语句,count(time) as c:time是唯一的,把time当做c,group by:按什么统计
order by c:出现个数按降序排列,desc:降序简写
hive> insert overwrite table db2.keyword_count
> select keyword,count(time) as c from db2.sogou group by keyword order by c desc;
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2019-05-27 00:51:46,489 Stage-1 map = 0%, reduce = 0%
2019-05-27 00:52:17,217 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 17.39 sec
2019-05-27 00:53:26,775 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 90.19 sec
MapReduce Total cumulative CPU time: 1 minutes 30 seconds 190 msec
Selecting local mode for task: Stage-2
Job running in-process (local Hadoop)
2019-05-27 00:53:31,291 Stage-2 map = 0%, reduce = 0%
2019-05-27 00:53:53,484 Stage-2 map = 100%, reduce = 100%
Stage-Stage-1: Map: 3 Reduce: 3 Cumulative CPU: 90.19 sec HDFS Read: 573709281 HDFS Write: 62953072 SUCCESS
Stage-Stage-2: HDFS Read: 125931322 HDFS Write: 176545197 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 30 seconds 190 msec
OK
Time taken: 150.705 seconds
hive>
处理过程有两个stage,1个MapReduce完成不了,要分两个阶段
8.查看统计结果,只看前几行
hive> select * from db2.keyword_count limit 10;
OK
百度 38441
baidu 18312
人体艺术 14475
4399小游戏 11438
qq空间 10317
优酷 10158
新亮剑 9654
馆陶县县长闫宁的父亲 9127
公安卖萌 8192
Time taken: 0.529 seconds, Fetched: 10 row(s)
hive>
数据地址
链接:https://pan.baidu.com/s/1iEjTtCOEP4pPjYzlA6LoKw
提取码:uhlo
Hive简单命令
1.创建hiveTestTable:
hive> CREATE TABLE IF NOT EXISTS db1.hiveTestTable(
> ip string,
> time string,
> url string,
> size string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 2.858 seconds
2.删除database,和table
Drop Table语句
语法:
DROP TABLE [IF EXISTS] table_name;
一下查询删除一个名为hello的表
hive> DROP TABLE IF EXISTS hello;
删除刚才创建的表
hive> DROP TABLE hiveTestTable;
OK
Time taken: 0.094 seconds
hive>
删除一个名为db2的database
hive> show databases;
OK
db1
db2
default
Time taken: 0.057 seconds, Fetched: 3 row(s)
hive> drop database db2 cascade;
OK
Time taken: 3.491 seconds
hive> create database if not exists db1;
ovide truststore for server certificate verification.
OK
Time taken: 78.89 seconds
3.载入数据到hiveTestTable1表中
数据
95096 20111230000009 698956eb07815439fe5f46e9a4503997 http://www.youku.com/
95144 20111230000009 599cd26984f72ee68b2b6ebefccf6aed http://hf.house365.com/
95192 20111230000010 f577230df7b6c532837cd16ab731f874 http://www.kz321.com/
95192 20111230000010 285f88780dd0659f5fc8acc7cc4949f2 http://www.iqshuma.com/
95048 20111230000010 f4ba3f337efb1cc469fcd0b34feff9fb http://mobile.zol.com.cn/148/1487938.html
95144 20111230000010 3d1acc7235374d531de1ca885df5e711 http://baike.baidu.com/view/6500.htm
95144 20111230000010 dbce4101683913365648eba6a85b6273 http://zhidao.baidu.com/question/38626533
95096 20111230000011 58e7d0caec23bcb4daa7bbcc4d37f008 http://tv.sogou.com/vertical/2xc3t6wbuk24jnphzlj35zy.html?p=40230600
95096 20111230000011 a3b83dc38b2bbc35660dffcab4ed9da8 http://www.7183.info/
95192 20111230000011 b89952902d7821db37e8999776b32427 http://wenwen.soso.com/z/q131927207.htm
95048 20111230000011 7c54c43f3a8a0af0951c26d94a57d6c8 http://www.baidu.com/
95192 20111230000011 2d6c22c084a501c0b8f7f0a845aefd9f http://www.dy241.com/
95048 20111230000011 11097724dae8b9fdcc60bd6fa4ce4df2 http://118123.net/
95144 20111230000012 1d374b57fbbc81aa0cc38e6f4efb88ec http://tui.qihoo.com/28302631/article_2893190.html
95144 20111230000012 76029a8965e815b413cba0b50d2ec2b0 http://baike.baidu.com/view/1941330.htm
95192 20111230000013 22201bdc15845bfb33384efc3a283ef4 http://cf.qq.com/
95144 20111230000013 e0d255845fc9e66b2a25c43a70de4a9a http://hanyu.iciba.com/wiki/1230433.shtml
95048 20111230000013 b89952902d7821db37e8999776b32427 http://zhidao.baidu.com/question/224925866
95096 20111230000013 072fa3643c91b29bd586aff29b402161 http://download.csdn.net/detail/think1919/3935722
操作
①载入数据到hiveTestTable1中
hive> LOAD DATA local inpath '/root/hiveTestData.txt' into table db1.hiveTestTable1;
Loading data to table db1.hivetesttable1
OK
Time taken: 20.104 seconds
hive> SELECT * from db1.hiveTestTable1;//查看hiveTableTest1表中的数据
OK
95096 20111230000009 698956eb07815439fe5f46e9a4503997 http://www.youku.com/
95144 20111230000009 599cd26984f72ee68b2b6ebefccf6aed http://hf.house365.com/
95192 20111230000010 f577230df7b6c532837cd16ab731f874 http://www.kz321.com/
95192 20111230000010 285f88780dd0659f5fc8acc7cc4949f2 http://www.iqshuma.com/
95048 20111230000010 f4ba3f337efb1cc469fcd0b34feff9fb http://mobile.zol.com.cn/148/1487938.html
95144 20111230000010 3d1acc7235374d531de1ca885df5e711 http://baike.baidu.com/view/6500.htm
95144 20111230000010 dbce4101683913365648eba6a85b6273 http://zhidao.baidu.com/question/38626533
Time taken: 4.585 seconds, Fetched: 19 row(s)
②从表hiveTestTable1中查询部分数据,并将其作为依据创建表hiveTestTable2
hive> CREATE TABLE db1.hiveTestTable2 AS SELECT ip,time,url from db1.hiveTestTable1;
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20190526225930_78b79294-34da-4369-86a0-224189628dee
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-05-26 22:59:35,283 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1707359481_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/hive/warehouse/db1.db/.hive-staging_hive_2019-05-26_22-59-30_592_3294678209563217364-1/-ext-10002
Moving data to directory hdfs://master:9000/hive/warehouse/db1.db/hivetesttable2
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 3388 HDFS Write: 2796 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 7.071 seconds
hive> SELECT * from db1.hiveTestTable2;
OK
95096 20111230000009 698956eb07815439fe5f46e9a4503997
95144 20111230000009 599cd26984f72ee68b2b6ebefccf6aed
95192 20111230000010 f577230df7b6c532837cd16ab731f874
95192 20111230000010 285f88780dd0659f5fc8acc7cc4949f2
95048 20111230000010 f4ba3f337efb1cc469fcd0b34feff9fb
Time taken: 0.61 seconds, Fetched: 19 row(s)
③利用表hiveTestTable2的表结构创建hiveTestTable3,并查看表的结构
hive> CREATE TABLE IF NOT EXISTS db1.hiveTestTable3 LIKE db1.hiveTestTable2;
OK
Time taken: 0.397 seconds
hive> DESC db1.hiveTestTable3;
OK
ip string
time string
url string
Time taken: 0.19 seconds, Fetched: 3 row(s)
hive>