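When writing a crawler, the most tedious part is writing the regular expressions: you have to try them one at a time and debug again and again, which eats up a lot of time. So I wrote a web-based tool: enter the URL to crawl and a regular expression, and the page shows the data that was matched.
The idea is simple: send the URL and the regex to the server, let the server fetch the page and apply the regex, then return the results to the front end. I used bootcss (Bootstrap, for the front end) and bottle (Python, for the back end). The code is simple, but getting there was a bit involved. The parameter being passed is itself a URL, and the back end treats slashes (/.../) as the boundaries of a route parameter, so passing the value failed every time. Eventually I thought of Base64-encoding the values before sending them (Base64 is an encoding, not encryption).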
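The round trip can be sketched in a few lines. This is a minimal illustration written for Python 3, not the code used in this post: the helper names `encode_path` and `decode_path` are hypothetical, and the URL-safe Base64 alphabet is my own choice so that the encoded text never contains `/`. Both sides have to use the same alphabet.

```python
import base64


def encode_path(url, reg):
    """Pack a URL and a regex into one comma-separated path segment."""
    enc_url = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    enc_reg = base64.urlsafe_b64encode(reg.encode("utf-8")).decode("ascii")
    # A comma is not part of the Base64 alphabet, so it is a safe separator.
    return enc_url + "," + enc_reg


def decode_path(webstr):
    """Reverse encode_path: split on the comma and decode both halves."""
    enc_url, enc_reg = webstr.split(",")
    url = base64.urlsafe_b64decode(enc_url).decode("utf-8")
    reg = base64.urlsafe_b64decode(enc_reg).decode("utf-8")
    return url, reg


sample_url = "http://example.com/a/b?x=1"
sample_reg = r'<a href="(.*?)">'
packed = encode_path(sample_url, sample_reg)
unpacked = decode_path(packed)
```

The packed string contains no slashes, so it fits into a single route parameter, and decoding it returns the original URL and regex unchanged.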
webRegx.py
import urllib2
import re
import json


def getHtml(url):
    # Download the page and return its raw HTML.
    html = urllib2.urlopen(url).read()
    return html


def getResult(url, reg):
    # Apply the regex to the page and return every match as a JSON string.
    html = getHtml(url)
    reg = re.compile(reg)
    results = reg.findall(html)
    if len(results) > 0:
        for result in results:
            print result
    else:
        print "no result"
    return json.dumps(results)
Note: the function must return its results as JSON-formatted data.
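To see what that JSON looks like without fetching anything, here is a small Python 3 sketch that runs `findall` on a hard-coded HTML string. The helper `find_matches` and the sample markup are made up for illustration. Note that a pattern with several groups yields tuples, which `json.dumps` turns into nested arrays.

```python
import json
import re


def find_matches(html, pattern):
    """Return all regex matches in html as a JSON array string."""
    return json.dumps(re.compile(pattern).findall(html))


sample_html = '<a href="/one">1</a><a href="/two">2</a>'

# One capture group: each match is a plain string.
links = find_matches(sample_html, r'href="(.*?)"')

# Two capture groups: each match is a tuple, serialized as a nested array.
pairs = find_matches(sample_html, r'href="(.*?)">(.*?)<')

# No match at all still gives valid JSON (an empty array).
nothing = find_matches(sample_html, r"no-such-text")
```

The front end can therefore always parse the response as an array, whether or not anything matched.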
main.py
from bottle import route, request, template, run, Bottle, static_file
from webRegx import getResult
import base64

app = Bottle()


@app.route('/')
def show():
    return template('templates/index')


@app.route('/jiexi/:webstr#.*?#', method='post')
def test(webstr):
    #return "hello{}!".format(name)
    #webstr = webstr.replace(',','?')
    # The path carries two Base64 strings separated by a comma.
    base64_url, base64_reg = webstr.split(",")
    url = base64.decodestring(base64_url)  # decode
    reg = base64.decodestring(base64_reg)
    return getResult(url, reg)


@app.route('/templates/:filename')
def send_static(filename):
    return static_file(filename, root='./templates')


run(app, host='localhost', port=8080)
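The handler above assumes well-formed input. As a rough sketch of how the same decode-and-match step might look in Python 3 with some error handling, consider the function below. It is only an illustration under several assumptions: the name `handle`, the injected `fetch` callable (which stands in for the real download so the logic can be tried offline), and the `{"error": ...}` response shape are all mine, not part of the original program. It uses `base64.b64decode` because `base64.decodestring` was removed in Python 3.9.

```python
import base64
import json
import re


def handle(webstr, fetch):
    """Decode 'b64url,b64regex', fetch the page, return matches as JSON."""
    parts = webstr.split(",")
    if len(parts) != 2:
        return json.dumps({"error": "expected two comma-separated values"})
    try:
        url = base64.b64decode(parts[0]).decode("utf-8")
        pattern = re.compile(base64.b64decode(parts[1]).decode("utf-8"))
    except (ValueError, re.error) as exc:
        # Covers bad Base64 padding, undecodable bytes, and invalid regexes.
        return json.dumps({"error": str(exc)})
    html = fetch(url)  # fetch must return the page as text
    return json.dumps(pattern.findall(html))


def fake_fetch(url):
    """Stand-in for a real download; always returns the same markup."""
    return "<b>x</b><b>y</b>"


good = (base64.b64encode(b"http://example.com/").decode("ascii") + ","
        + base64.b64encode(b"<b>(.*?)</b>").decode("ascii"))
response = handle(good, fake_fetch)
```

Passing the downloader in as an argument keeps the parsing and matching logic testable without a network connection.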
index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Sticky Footer Template for Bootstrap</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap.min.css">
<!-- Optional Bootstrap theme (usually not needed) -->
<link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap-theme.min.css">
<!-- jQuery; must be loaded before bootstrap.min.js -->
<script src="http://cdn.bootcss.com/jquery/1.11.1/jquery.min.js"></script>
<script src="./templates/base64.js"></script>
<!-- Bootstrap core JavaScript -->
<script src="http://cdn.bootcss.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<!-- Custom styles for this template -->
<style type="text/css">
/* Sticky footer styles
-------------------------------------------------- */
html {
position: relative;
min-height: 100%;
}
body {
/* Margin bottom by footer height */
margin-bottom: 60px;
font-family: 'microsoft yahei', 'Times New Roman', 宋体, Times, serif;
}
.footer {
position: absolute;
bottom: 0;
width: 100%;
/* Set the fixed height of the footer here */
height: 60px;
background-color: #f5f5f5;
}
/* Custom page CSS
-------------------------------------------------- */
/* Not required for template or sticky footer method. */
.container {
width: auto;
max-width: 800px;
padding: 0 15px;
}
.container .text-muted {
margin: 20px 0;
}
</style>
</head>
<body>
<!-- Begin page content -->
<div class="container">
<div class="page-header">
<h1>Regex Matching</h1>
</div>
<div>
<div class="input-group input-group-lg">
<span class="input-group-addon">url</span>
<input type="text" class="form-control" placeholder="Enter a URL" id="url" name ="url">
</div><br/>
<div class="input-group input-group-lg">
<span class="input-group-addon">reg</span>
<input type="text" class="form-control" placeholder="Enter a regular expression" id="reg" name ="reg">
<span class="input-group-btn">
<button class="btn btn-default" type="submit" onclick="HtmlRegx()" id="myButton">Search</button>
</span>
</div>
<div class="modal fade" id="tip">
<div class="modal-dialog">
<div class="modal-content">
<h3 class="modal-title">Notice</h3>
<div class="modal-body"><p><h3>Loading...</h3></p></div>
</div>
</div>
</div>
</div>
<br/>
<div>
<ul class="list-group" id="data-table">
</ul>
</div>
</div>
<div class="footer">
<div class="container">
<p class="text-muted">Place sticky footer content here.</p>
</div>
</div>
</body>
<script type="text/javascript">
function HtmlRegx()
{
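var url = document.getElementById("url").value; // URL to fetch
var reg = document.getElementById("reg").value; // regular expression
if(url=="" || reg=="")
{
alert("The URL or the regular expression is empty");
return;
}
// Show the loading dialog only after the inputs have been validated.
$('#tip').modal('show');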
var base64 = new Base64();
var base64_url = base64.encode(url);
var base64_reg = base64.encode(reg);
//var posturl = "/jiexi/"+ url.split("?")+""+reg;
var posturl = "/jiexi/"+base64_url+","+base64_reg;
postdata(posturl,reg);
}
function postdata(url,reg)
{
$.ajax({
type:"POST",
url:url,
dataType:"json",
success:function(data)
{
console.log(data[0]);
/* $("#table").append('<tr><td>' + data.length + '</td></tr>')*/
show(data);
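},
error:function()
{
// Close the loading dialog if the request fails.
$('#tip').modal('hide');
alert("Request failed");
}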
});
}
function show(data)
{
$('#tip').modal('hide');
for(var i=0;i<data.length;i++)
{
// Insert each match as plain text so matched HTML is shown, not rendered.
$("#data-table").append($('<li class="list-group-item"></li>').text(String(data[i])));
}
}
</script>
</html>
The query is sent via AJAX.
Final result: