Hive - 3. Hive example: Sogou user search logs

Data source:

User query logs from the official Sogou Labs website: http://www.sogou.com/labs/resource/q.php

Column 1: search time

Column 2: user ID

Column 3: search query

Column 4: which line of the search results page the result appeared on

Column 5: which line of the page the user clicked

Column 6: the URL the user clicked

Note that columns 4 and 5 are separated by a space rather than a tab. One fix is Notepad's find-and-replace: Notepad will not accept a tab typed directly into the dialog, so copy an existing tab from the file and paste it in. A sed one-liner also works; see the sketch below.
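A quicker alternative to Notepad, as a sketch: this replaces every space in the file with a tab, which is only safe under the assumption that the search-query column itself contains no spaces (check your copy of the sample first).

[root@hadoop01 ~]# sed -i 's/ /\t/g' /test/SogouQ.sample   # GNU sed: \t in the replacement inserts a literal tab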

 

Data import:

1. Create a database named hive

[root@hadoop01 ~]# hive

hive> show databases;

OK

default

Time taken: 7.305 seconds, Fetched: 1 row(s)

hive> create database hive;

OK

Time taken: 0.53 seconds

hive> show databases;

OK

default

hive

Time taken: 0.01 seconds, Fetched: 2 row(s)

hive> use hive;

OK

Time taken: 0.013 seconds

2. Create the table Sogou

hive> create table Sogou(Time string,ID string,word string,location1 int,location2 int,website string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 1.277 seconds
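You can confirm the column definitions before loading anything (a quick check, not part of the original session):

hive> describe Sogou;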

3. Load the local data into the table Sogou

hive> load data local inpath '/test/SogouQ.sample' into table Sogou;

Loading data to table hive.sogou

OK

Time taken: 1.788 seconds

4. Query the Sogou table

hive> select * from Sogou;

There is a bug here: the Chinese text comes out garbled, so for now queries can only match English keywords.

(On the Hive Chinese-garbling issue, see: https://www.cnblogs.com/DreamDrive/p/7469476.html)
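One likely cause, stated here as an assumption (the linked post covers metastore-side fixes): the Sogou sample file is GBK-encoded, while Hive reads data files as UTF-8. Converting the file before loading usually restores the Chinese text; the output filename below is made up for illustration.

[root@hadoop01 ~]# iconv -f GBK -t UTF-8 /test/SogouQ.sample -o /test/SogouQ.utf8.sample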

 

Extensions:

To load data that is already in HDFS, drop the local keyword, for example:
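A minimal sketch (the HDFS path is hypothetical). Note that, unlike a local load, loading from an HDFS path moves the source file into the table's warehouse directory:

hive> load data inpath '/data/SogouQ.sample' into table Sogou;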

Internal vs. external tables: an internal (managed) table stores its data inside the Hive warehouse; an external table's data is not stored in the Hive warehouse.

Dropping an internal table deletes both the table definition and the data files; dropping an external table deletes only the table definition (metadata), and the underlying files remain where they are. See the sketch below.
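For contrast, the same schema declared as an external table, as a sketch (the HDFS directory /data/sogou is hypothetical):

hive> create external table Sogou_ext(Time string, ID string, word string, location1 int, location2 int, website string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/data/sogou';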

 

 

Analyzing the search data with Hive:

1. count

Count the total number of records; the full output below also shows how Hive executes the query as a MapReduce job.

hive> select count(*) from Sogou;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527162925_920c8536-f6d8-4246-8d03-89ba851ba58f

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0002, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0002/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:29:43,241 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:29:59,731 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.53 sec

2019-05-27 16:30:24,303 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.91 sec

MapReduce Total cumulative CPU time: 6 seconds 910 msec

Ended Job = job_1558939718678_0002

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.91 sec HDFS Read: 885689 HDFS Write: 105 SUCCESS

Total MapReduce CPU Time Spent: 6 seconds 910 msec

OK

10000

Time taken: 61.229 seconds, Fetched: 1 row(s)

hive>

 

2. Count how many records contain the search keyword baidu

hive> select count(*) from Sogou where word like '%baidu%';

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527163553_998bf42e-2eb6-437a-b7c5-4ddeb72a2f90

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0003, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0003/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:36:23,228 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:36:49,593 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec

2019-05-27 16:37:05,957 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.94 sec

MapReduce Total cumulative CPU time: 6 seconds 940 msec

Ended Job = job_1558939718678_0003

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.94 sec HDFS Read: 886490 HDFS Write: 102 SUCCESS

Total MapReduce CPU Time Spent: 6 seconds 940 msec

OK

17

Time taken: 75.007 seconds, Fetched: 1 row(s)

hive>

 

3. Count how many records contain the keyword baidu where both the result rank and the clicked line are 1

hive> select count(*) from Sogou where word location1=1 and location2=1 and like '%baidu%';

FAILED: ParseException line 1:38 missing EOF at 'location1' near 'word'

The first attempt fails because the conditions were shuffled: word is followed directly by location1=1, leaving the like operator without its column. Rewriting the WHERE clause so word like '%baidu%' stays together works:

hive> select count(*) from Sogou where location1=1 and location2=1 and word like '%baidu%';

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527164056_0135e26b-58d3-409f-99e5-cbda737139d2

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0004, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0004/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0004

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:41:18,492 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:41:43,808 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.68 sec

2019-05-27 16:42:04,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.79 sec

MapReduce Total cumulative CPU time: 11 seconds 790 msec

Ended Job = job_1558939718678_0004

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.79 sec HDFS Read: 887049 HDFS Write: 102 SUCCESS

Total MapReduce CPU Time Spent: 11 seconds 790 msec

OK

10

Time taken: 70.628 seconds, Fetched: 1 row(s)

hive>

 

These runs also show that Hive table names and database names are case-insensitive.
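For instance, the following all hit the same table (a sketch, not from the original session; Hive stores identifiers in lowercase internally):

hive> select count(*) from Sogou;

hive> select count(*) from sogou;

hive> select count(*) from SOGOU;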
