Loading Apache access.log data into Pig with a custom load function

Sample access.log data:

127.0.0.1 - - [08/Jan/2012:21:46:31 +0800] "GET / HTTP/1.1" 200 44
127.0.0.1 - - [08/Jan/2012:21:46:31 +0800] "GET /favicon.ico HTTP/1.1" 404 209
127.0.0.1 - - [08/Jan/2012:22:47:15 +0800] "GET /aa.php HTTP/1.1" 200 61261
127.0.0.1 - - [08/Jan/2012:22:47:15 +0800] "GET /aa.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2524
127.0.0.1 - - [08/Jan/2012:22:47:15 +0800] "GET /aa.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2146
127.0.0.1 - - [08/Jan/2012:22:47:15 +0800] "GET /favicon.ico HTTP/1.1" 404 209
127.0.0.1 - - [08/Jan/2012:22:49:39 +0800] "GET /aa.php HTTP/1.1" 200 61496
127.0.0.1 - - [08/Jan/2012:22:49:39 +0800] "GET /aa.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2524
127.0.0.1 - - [08/Jan/2012:22:49:39 +0800] "GET /aa.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2146
127.0.0.1 - - [08/Jan/2012:22:49:39 +0800] "GET /favicon.ico HTTP/1.1" 404 209
127.0.0.1 - - [08/Jan/2012:23:05:28 +0800] "GET /tiki HTTP/1.1" 301 230
127.0.0.1 - - [08/Jan/2012:23:05:28 +0800] "GET /tiki/ HTTP/1.1" 200 30566
127.0.0.1 - - [08/Jan/2012:23:05:28 +0800] "GET /favicon.ico HTTP/1.1" 404 209
127.0.0.1 - - [08/Jan/2012:23:06:23 +0800] "GET /tiki/index.php HTTP/1.1" 302 -
127.0.0.1 - - [08/Jan/2012:23:06:24 +0800] "GET /tiki/tiki-install.php HTTP/1.1" 200 10974
127.0.0.1 - - [08/Jan/2012:23:06:25 +0800] "GET /tiki/lib/tiki-js.js HTTP/1.1" 200 54004
127.0.0.1 - - [08/Jan/2012:23:06:25 +0800] "GET /tiki/styles/fivealive.css HTTP/1.1" 200 21404
127.0.0.1 - - [08/Jan/2012:23:06:26 +0800] "GET /tiki/lib/jquery_tiki/tiki-jquery.js HTTP/1.1" 200 94701
127.0.0.1 - - [08/Jan/2012:23:06:26 +0800] "GET /tiki/img/tiki/Tiki_WCG.png HTTP/1.1" 200 9362
127.0.0.1 - - [08/Jan/2012:23:06:26 +0800] "GET /tiki/pics/icons/help.png HTTP/1.1" 200 740


The Pig loader, ApacheLoader:

package pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ApacheLoader extends LoadFunc {
	// 7-field log line: host ident user [time] "request" status bytes.
	// Compiled once instead of on every record.
	private static final Pattern LOG_PATTERN = Pattern.compile(
			"^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)");

	protected RecordReader recordReader = null;

	@Override
	public InputFormat getInputFormat() throws IOException {
		// Read the log one line at a time.
		return new TextInputFormat();
	}

	@Override
	public Tuple getNext() throws IOException {
		try {
			// Returning null tells Pig that the input is exhausted, so lines
			// that do not match are skipped rather than ending the load early.
			while (recordReader.nextKeyValue()) {
				Text value = (Text) recordReader.getCurrentValue();
				Matcher matcher = LOG_PATTERN.matcher(value.toString());
				if (!matcher.matches()) {
					continue;
				}
				List<String> list = new ArrayList<String>();
				list.add(matcher.group(1)); // client IP
				list.add(matcher.group(4)); // timestamp
				list.add(matcher.group(5)); // request line
				list.add(matcher.group(6)); // status code
				list.add(matcher.group(7)); // response size
				return TupleFactory.getInstance().newTuple(list);
			}
			return null;
		} catch (Exception e) {
			throw new ExecException(e);
		}
	}

	@Override
	public void prepareToRead(RecordReader reader, PigSplit split)
			throws IOException {
		this.recordReader = reader;
	}

	@Override
	public void setLocation(String location, Job job) throws IOException {
		FileInputFormat.setInputPaths(job, location);

	}

}
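
The pattern can be checked outside Pig before building the jar. The following is a minimal sketch in plain Java, with no Hadoop or Pig dependency; the class name CommonLogRegexDemo and the parse helper are made up for illustration and are not part of the loader. It applies the same 7-field pattern to two lines from the sample data and shows that a line whose size field is "-" does not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to try the pattern locally.
class CommonLogRegexDemo {
    // Same 7-field pattern that the loader uses.
    static final Pattern LOG_PATTERN = Pattern.compile(
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)");

    // Returns [ip, time, request, status, bytes], or null when the line does not match.
    static List<String> parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> fields = new ArrayList<String>();
        fields.add(m.group(1)); // client IP
        fields.add(m.group(4)); // timestamp
        fields.add(m.group(5)); // request line
        fields.add(m.group(6)); // status code
        fields.add(m.group(7)); // response size
        return fields;
    }

    public static void main(String[] args) {
        String ok = "127.0.0.1 - - [08/Jan/2012:21:46:31 +0800] \"GET / HTTP/1.1\" 200 44";
        String dash = "127.0.0.1 - - [08/Jan/2012:23:06:23 +0800] \"GET /tiki/index.php HTTP/1.1\" 302 -";
        // prints [127.0.0.1, 08/Jan/2012:21:46:31 +0800, GET / HTTP/1.1, 200, 44]
        System.out.println(parse(ok));
        // prints null, because "-" is not matched by (\d+)
        System.out.println(parse(dash));
    }
}
```

The second line matters for the loader because the sample log contains exactly such a "302 -" entry.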


Run it from the Pig grunt shell:

register apacheLoader.jar;

A = load 'access.log' using pig.ApacheLoader();

dump A;

 

Note: if access.log has 9 columns, the regular expression becomes


 String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";


The program above is written for the 7-column format.
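
To see what the 9-column pattern captures, here is a small sketch in plain Java. The class name CombinedLogRegexDemo, the parse helper, and the sample line (with "-" as referer and "Mozilla/5.0" as user agent) are assumptions made up for illustration; they do not come from the log shown earlier.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for trying the 9-column pattern locally.
class CombinedLogRegexDemo {
    // The 9-column pattern: the 7 fields plus quoted referer and user agent.
    static final Pattern COMBINED = Pattern.compile(
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // Returns the nine captured groups in order, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line in the 9-column layout.
        String line = "127.0.0.1 - - [08/Jan/2012:21:46:31 +0800] \"GET / HTTP/1.1\" 200 44 \"-\" \"Mozilla/5.0\"";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

In a loader for this format, groups 8 and 9 (referer and user agent) would be added to the tuple alongside the groups already used. Because both use `[^\"]+`, an entry with an empty `""` referer or user agent would not match.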

 

Regular expression references:

http://nc100.blog.sohu.com/148887042.html

http://www.cnblogs.com/csurn/archive/2010/06/22/1762791.html

 

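Improvement:

If the last field (the response size) can be "-", as on the 302 line in the sample data, change the pattern to: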

String pattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+|-)";
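
A quick sketch of the improved pattern in plain Java follows. The class name DashSizeRegexDemo and the sizeOf helper are hypothetical, and treating "-" as 0 bytes is an assumption made here for illustration; the loader itself can simply keep the "-" string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for trying the improved pattern locally.
class DashSizeRegexDemo {
    // Improved pattern: the size field is either digits or a single "-".
    static final Pattern LOG_PATTERN = Pattern.compile(
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+|-)");

    // Returns the response size in bytes, 0 when the field is "-" (an assumption),
    // or -1 when the line does not match at all.
    static long sizeOf(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return -1L;
        }
        String size = m.group(7);
        return "-".equals(size) ? 0L : Long.parseLong(size);
    }

    public static void main(String[] args) {
        String redirect = "127.0.0.1 - - [08/Jan/2012:23:06:23 +0800] \"GET /tiki/index.php HTTP/1.1\" 302 -";
        String normal = "127.0.0.1 - - [08/Jan/2012:21:46:31 +0800] \"GET / HTTP/1.1\" 200 44";
        System.out.println(sizeOf(redirect)); // prints 0
        System.out.println(sizeOf(normal));   // prints 44
    }
}
```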
