Analyzing the Sogou Labs User Query Log

You can download the original data set yourself from: http://download.labs.sogou.com/resource/q.php
Here we use a SQL Server database to demonstrate the basic analysis techniques.

1) Importing the data:
To simplify the import, first merge all of the log files into a single file by running the following at the Windows command prompt:   copy *.filter data.dat

The files are stored with Unix (LF) line endings; to simplify processing, convert them to Windows (CRLF) format with a linux2dos tool, by running at the command prompt:    linux2dos data.dat data2.dat
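If a linux2dos tool is not at hand, the merge and line-ending conversion can also be done with a short Python script. This is only a sketch: the `*.filter` pattern and the `data2.dat` output name are assumptions matching the commands above.

```python
import glob

def merge_and_convert(pattern="*.filter", out_path="data2.dat"):
    """Concatenate all log files (like `copy *.filter data.dat`) and
    rewrite Unix LF line endings as Windows CRLF (like linux2dos)."""
    with open(out_path, "wb") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name, "rb") as f:
                data = f.read()
            # Normalize any existing CRLF first so CR is never doubled.
            out.write(data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n"))
```

Working on the files in binary mode avoids any decoding of the log's character encoding at this stage.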

Then run the following in the SQL Server query window:


CREATE TABLE [data] (
[userid] [varchar] (100) COLLATE Chinese_PRC_CI_AS NULL ,
[kw] [varchar] (1200) COLLATE Chinese_PRC_CI_AS NULL ,
[ab] [varchar] (1200) COLLATE Chinese_PRC_CI_AS NULL ,
[url] [varchar] (500) COLLATE Chinese_PRC_CI_AS NULL
) ON [PRIMARY]
GO

BULK INSERT data FROM 'C:\data2.dat'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n',
MAXERRORS=100000
)

2) Cleaning the data

Because a small number of records were split incorrectly during import, some cleanup is needed (the volume of bad data is very small):

delete from data where ab like '%]%'
delete from data where ascii(ab)>127
delete from data where ascii(lower(ab))>=ascii('a') and ascii(lower(ab))<=ascii('z')
delete from data where charindex(' ',ab)<=0

Each of these deletes removes records whose ab field is malformed as a result of splitting errors; a well-formed ab field contains exactly two space-separated integers.
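Stated directly, the predicate these deletes approximate is that a well-formed ab field holds two space-separated integers. A minimal Python version of that check:

```python
import re

# A well-formed ab field is "<rank> <seq>": two integers separated by
# one space. The SQL deletes above approximate this test.
AB_PATTERN = re.compile(r"^\d+ \d+$")

def ab_is_valid(ab):
    return bool(AB_PATTERN.match(ab))
```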

3) Add rank and seq columns to the data table and populate them (in effect this re-splits the ab field, which originally carried two values, back into its components, and strips the square brackets from around the query term):

The columns must be added before they can be populated:

alter table data add rank int NULL, seq int NULL

update data set rank=CONVERT(int, substring([ab],1,charindex(' ',[ab])-1))
update data set seq=CONVERT(int, substring([ab],charindex(' ',[ab])+1,10))
update data set kw=substring([kw],2,len([kw])-2)
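The same transformation, expressed in Python for clarity. The field layout follows the table definition above; interpreting rank as the URL's position in the result list and seq as the order of the user's click follows the usual description of the Sogou log format.

```python
def split_record(userid, kw, ab, url):
    """Mirror the three UPDATE statements above: split ab into the two
    integers rank and seq, and strip the square brackets around kw."""
    rank_str, seq_str = ab.split(" ", 1)
    return {
        "userid": userid,
        "kw": kw[1:-1],         # remove the surrounding [ ]
        "rank": int(rank_str),  # position of the URL in the result list
        "seq": int(seq_str),    # order of the user's click
        "url": url,
    }
```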

delete from data where len(kw)>900

(Over-long query terms are removed because SQL Server limits index keys to 900 bytes.) Finally, build an index on each of the columns used in the analyses below, such as kw, userid, and rank, to speed up grouping and sorting.

4) Common analysis queries:

Per-user activity:
select top 100 userid,count(*) from data group by userid order by count(*) desc
Crawler traffic is present; some user IDs issue far more queries than any human user would:
select * from data where userid='13505142608748166' order by kw,seq

Query-term statistics:
select top 100 kw,count(*) from data group by kw order by count(*) desc
select top 100 kw,count(distinct userid) from data group by kw order by count(distinct userid) desc
select avg(len(kw)) from data
select len(kw),count(*) from data group by len(kw) order by count(*) desc
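The two GROUP BY queries above can be mirrored in Python for spot checks on a sample of parsed records; representing records as dicts with kw and userid keys is an assumption matching the table columns.

```python
from collections import Counter

def top_keywords(records, n=100):
    """Query terms by total occurrences (cf. group by kw, count(*))."""
    return Counter(r["kw"] for r in records).most_common(n)

def top_keywords_by_users(records, n=100):
    """Query terms by number of distinct users issuing them
    (cf. count(distinct userid))."""
    users = {}
    for r in records:
        users.setdefault(r["kw"], set()).add(r["userid"])
    return Counter({kw: len(u) for kw, u in users.items()}).most_common(n)
```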

Distribution of clicked result positions:
select rank,count(*) from data group by rank order by count(*) desc

URLs hit by a particular query term:
select url,count(*) from data where kw='张玉凤' group by url order by count(*) desc
