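For details on setting up the Hadoop environment, see the first and third posts in the hadoop series (the Hadoop configuration directly affects whether this program runs).
Data preparation (https://download.csdn.net/download/elmo66/10636257):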
[hadoop@yourname ~]$ hadoop dfs -mkdir /UrlViewerCounter
[hadoop@yourname ~]$ hadoop dfs -mkdir /UrlViewerCounter/input
[hadoop@yourname ~]$ hadoop dfs -copyFromLocal access.log.10 /UrlViewerCounter/input/
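For yourname, see the first post in the hadoop series. hadoop is the user name used to log in to the Linux system, and ~ refers to the /home/hadoop directory. The file access.log.10 sits in /home/hadoop and is uploaded to the /UrlViewerCounter/input/ directory in HDFS.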
Sample record:
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
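To make the layout of the sample record easier to see, here is a minimal sketch that splits it on single spaces and prints the token found at each position. It is not the parser used in this series: the names LogLineSplitDemo, splitFields and userAgent are hypothetical, and the token positions are inferred from this one sample line only, so malformed or differently shaped records would need extra checks.

```java
import java.util.Arrays;

public class LogLineSplitDemo {

    // The sample record shown above, with the inner quotes escaped for Java.
    static final String SAMPLE = "60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "
            + "\"GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0\" 200 185524 "
            + "\"http://cos.name/category/software/packages/\" "
            + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/29.0.1547.66 Safari/537.36\"";

    // Assumption: tokens in a record are separated by exactly one space.
    static String[] splitFields(String line) {
        return line.split(" ");
    }

    // The user agent contains spaces, so tokens 11..end are joined back together.
    static String userAgent(String[] f) {
        return String.join(" ", Arrays.copyOfRange(f, 11, f.length));
    }

    public static void main(String[] args) {
        String[] f = splitFields(SAMPLE);
        System.out.println("client address : " + f[0]);
        System.out.println("client name    : " + f[1]);
        // Drop the leading '[' of the time token and the trailing ']' of the zone token.
        System.out.println("request time   : " + f[3].substring(1) + " "
                + f[4].substring(0, f[4].length() - 1));
        // Drop the leading quote of the method and the trailing quote of the protocol.
        System.out.println("request method : " + f[5].substring(1));
        System.out.println("request url    : " + f[6]);
        System.out.println("protocol       : " + f[7].substring(0, f[7].length() - 1));
        System.out.println("status         : " + f[8]);
        System.out.println("body bytes     : " + f[9]);
        System.out.println("referer        : " + f[10]);
        System.out.println("user agent     : " + userAgent(f));
    }
}
```

Splitting on spaces keeps the sketch short, but it relies on the URL and referer containing no spaces; a regular expression over the whole record would be a sturdier choice for real logs.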
Class designed from the sample:
public class KPI {
    private String clientAddr; // client IP address
    private String clientName; // client user name; the "-" placeholder is ignored
    private String clientRequestTime; // access time and time zone
    private String clientRequestMethod; // request method
    private String clientRequestUrl; // requested URL
    private String clientRequestProtocol; // HTTP protocol of the request
    private String responseStatus; // response status; 200 means success
    private String responseBytes; // size of the response body sent to the client
    private String urlReferer; // page from which the request was linked
    private String httpAgent; // information about the client's browser
    private boolean valid = true; // whether the record is valid

    public String getClientAddr() {
        return clientAddr;
    }

    public void setClientAddr(String clientAddr) {
        this.clientAddr = clientAddr;
    }

    public String getClientName() {
        return clientName;
    }

    public void set