I've been studying web crawlers lately, in particular Huang Yihua's webMagic. After working through the ideas behind his crawler design, I tried writing a simple crawler of my own, which I'd like to share here.
First, if we set out to fetch information from a web page, the intuitive picture involves two things: getting the current page, and getting the URLs it contains. A crawler is built around these two cores and grows into three major modules:
1. describing or defining the crawl target (fetching the current page);
2. analyzing and filtering the page or its data (selecting URLs);
3. a search strategy over URLs (deduplication and so on).
For each module there are different tools.
I. Page Download
1. Simulate an HTTP request, then receive and parse the response (Apache HttpClient).
2. Embed a browser and grab the page after it has fully loaded (Selenium); use this when the page content is generated dynamically by JavaScript (see the sketch below).
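For completeness, here is a minimal sketch of option 2. It assumes the Selenium WebDriver jars and a matching ChromeDriver binary are installed; everything else in this post uses option 1 (HttpClient).
SeleniumFetch.java (sketch)
package com.http;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
public class SeleniumFetch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // launches a real browser process
        try {
            // load the page; JavaScript runs just as it would for a user
            driver.get("http://share.renren.com/share/hotlist/v7?t=1");
            String html = driver.getPageSource(); // the HTML after JS has executed
            System.out.println(html.length());
        } finally {
            driver.quit(); // always release the browser process
        }
    }
}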
II. Page Analysis
1. Jsoup
2. XPath
These are fairly simple and easy to pick up on your own; a minimal Jsoup example follows.
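As a quick taste, this sketch parses an HTML string and pulls out every link's href, which is exactly the pattern the crawler below relies on:
JsoupDemo.java (sketch)
package com.test;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupDemo {
    public static void main(String[] args) {
        String html = "<a href='http://share.renren.com/share/1'>a share</a>";
        Document doc = Jsoup.parse(html);          // build a DOM from the HTML string
        for (Element link : doc.select("a")) {     // CSS selector: every <a> element
            System.out.println(link.attr("href")); // print each link's href attribute
        }
    }
}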
III. URL Management
1. Deduplication (with an eye on memory usage)
This module stores the URLs that have already been crawled. To avoid fetching the same URL twice, we deduplicate with a HashSet or a Bloom filter. A HashSet gives exact membership answers, but its memory footprint grows with the number of URLs; a Bloom filter uses far less memory in exchange for a small, controllable false-positive probability.
- HashSet
- Bloom filter
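Here is a minimal sketch of the Bloom filter option. It assumes a recent Google Guava jar is added to the project (Guava is not in the minimal jar list below):
BloomDedup.java (sketch)
package com.test;
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomDedup {
    public static void main(String[] args) {
        // sized for ~1,000,000 URLs with a 1% false-positive rate
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.01);
        String url = "http://share.renren.com/share/hotlist/v7?t=1";
        if (!seen.mightContain(url)) { // "probably unseen"; a rare false positive skips a URL
            seen.put(url);
            // ... download and parse the page here ...
        }
    }
}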
2. Priority blocking queue
PriorityBlockingQueue is the thread-safe choice for a multithreaded crawler; it holds the URLs that have not yet been crawled.
The deduplication flow works as follows: first initialize the queue and the HashSet. When a seed URL enters the program, check whether it is already in the HashSet; since the HashSet was just initialized, the URL is not there, so it goes into the priority queue. The queue's poll() method hands out the current URL, which is downloaded via HttpClient; the downloaded page is then parsed, the URLs of interest are extracted, and each one is checked against the HashSet again, and so on.
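To make that flow concrete, here is a minimal single-threaded sketch. fetch() and extractLinks() are hypothetical placeholders standing in for the HttpClient download and Jsoup parsing implemented later in this post, and the plain String queue orders URLs lexicographically; the full demo below attaches explicit priorities via a Request class instead.
DedupLoop.java (sketch)
package com.test;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.PriorityBlockingQueue;
public class DedupLoop {
    static Set<String> visited = new HashSet<String>(); // URLs already seen
    static PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<String>(); // URLs to crawl
    public static void main(String[] args) {
        String seed = "http://share.renren.com/share/hotlist/v7?t=1";
        if (visited.add(seed)) { // add() returns false if the URL was seen before
            queue.add(seed);
        }
        while (!queue.isEmpty()) {
            String url = queue.poll();               // take the next URL
            String page = fetch(url);                // placeholder for the HttpClient download
            for (String link : extractLinks(page)) { // placeholder for the Jsoup parsing
                if (visited.add(link)) {             // enqueue only unseen URLs
                    queue.add(link);
                }
            }
        }
    }
    // placeholders so the sketch compiles; the real versions appear below
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String page) { return new ArrayList<String>(); }
}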
That covers the basic principles of a crawler. Now let's look at how to implement one.
The minimal set of jars needed is:
① httpclient-4.3.6.jar (HTTP communication)
② commons-logging-1.1.3.jar
③ httpcore-4.3.3.jar (name-value pair parameters)
④ jsoup-1.7.2.jar (page parsing)
I implemented each module with just a few simple methods; treat this as a starting point rather than a finished design.
RenRenHttp.java
package com.http;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenRenHttp {

    // Send a POST request with the given form parameters and return the
    // response body as a string.
    public String realizeHttpPost(CloseableHttpClient httpclient, HttpPost httpPost, List<NameValuePair> nvps) {
        StringBuilder result = new StringBuilder();
        String line;
        CloseableHttpResponse response;
        httpPost.setEntity(new UrlEncodedFormEntity(nvps, Consts.UTF_8));
        try {
            response = httpclient.execute(httpPost);
            HttpEntity entity = response.getEntity();
            InputStream is = entity.getContent();
            BufferedReader br = new BufferedReader(new InputStreamReader(is, Consts.UTF_8));
            while ((line = br.readLine()) != null) {
                result.append(line);
            }
            EntityUtils.consume(entity); // release the connection back to the pool
            is.close();
            httpPost.abort();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    // Send a GET request and return the response body. A 301/302 response is
    // handled by parsing the redirect page's first <a> link and fetching it.
    public String realizeHttpGet(CloseableHttpClient httpclient, HttpGet httpGet) {
        StringBuilder result = new StringBuilder();
        String line;
        InputStream is;
        CloseableHttpResponse response;
        try {
            response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == 200) {
                is = entity.getContent();
                BufferedReader br = new BufferedReader(new InputStreamReader(is, Consts.UTF_8));
                while ((line = br.readLine()) != null) {
                    result.append(line);
                }
                EntityUtils.consume(entity);
                is.close();
                httpGet.abort();
            } else if (statusCode == 302 || statusCode == 301) {
                // Read the redirect page, extract the target URL from its
                // first <a> tag, and fetch that URL instead.
                is = entity.getContent();
                BufferedReader br = new BufferedReader(new InputStreamReader(is, Consts.UTF_8));
                while ((line = br.readLine()) != null) {
                    result.append(line);
                }
                Document document = Jsoup.parse(result.toString());
                String url = document.select("a").attr("href");
                EntityUtils.consume(entity);
                is.close();
                httpGet.abort();
                // recurse so the caller receives the final page, not the redirect stub
                return realizeHttpGet(httpclient, new HttpGet(url));
            }
        } catch (IllegalStateException e) {
            e.printStackTrace();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return result.toString();
    }
}
Request.java
package com.model;

// A crawl request: a URL plus the priority used to order the queue.
public class Request {

    private String url;
    private long priority;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public long getPriority() {
        return priority;
    }

    public void setPriority(long priority) {
        this.priority = priority;
    }
}
ReqComparator.java
package com.test;

import java.util.Comparator;

import com.model.Request;

// Orders Requests so that smaller priority values are served first by the queue.
public class ReqComparator implements Comparator<Request> {

    @Override
    public int compare(Request o1, Request o2) {
        long numbera = o1.getPriority();
        long numberb = o2.getPriority();
        if (numberb > numbera) {
            return -1;
        } else if (numberb < numbera) {
            return 1;
        } else {
            return 0;
        }
    }
}
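A quick sanity check of the ordering (a sketch; the class name and sample URLs are mine): with this comparator a PriorityQueue polls the smallest priority value first, which is why the seed gets priority 0 in the crawler below.
ReqComparatorDemo.java (sketch)
package com.test;
import java.util.PriorityQueue;
import com.model.Request;
public class ReqComparatorDemo {
    public static void main(String[] args) {
        PriorityQueue<Request> q = new PriorityQueue<Request>(10, new ReqComparator());
        q.add(make("seed", 0));
        q.add(make("share-page", 2));
        q.add(make("profile-page", 1));
        while (!q.isEmpty()) {
            System.out.println(q.poll().getUrl()); // prints: seed, profile-page, share-page
        }
    }
    private static Request make(String url, long priority) {
        Request r = new Request();
        r.setUrl(url);
        r.setPriority(priority);
        return r;
    }
}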
RenRen.java
package com.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

import org.apache.http.NameValuePair; // from the httpcore jar
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.http.RenRenHttp;
import com.model.Request;

public class RenRen {

    private static String URL = "http://www.renren.com/PLogin.do";

    Set<String> urls = new HashSet<String>(); // URLs already seen (deduplication)
    static ReqComparator OrderIsdn = new ReqComparator();
    final static PriorityQueue<Request> queue = new PriorityQueue<Request>(10, OrderIsdn);

    // Log in so that subsequent requests carry the session cookie.
    public void login(CloseableHttpClient httpClient) {
        HttpPost httpPost = new HttpPost(URL);
        httpPost.addHeader("Connection", "keep-alive");
        httpPost.addHeader("Host", "www.renren.com");
        httpPost.addHeader("Referer", "http://www.renren.com/SysHome.do");
        httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0");
        httpPost.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpPost.addHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
        List<NameValuePair> nvps = new ArrayList<NameValuePair>();
        nvps.add(new BasicNameValuePair("domain", "renren.com"));
        nvps.add(new BasicNameValuePair("key_id", "1"));
        nvps.add(new BasicNameValuePair("captcha_type", "web_login"));
        nvps.add(new BasicNameValuePair("email", "*******"));   // fill in your own account
        nvps.add(new BasicNameValuePair("password", "*****"));
        RenRenHttp rrh = new RenRenHttp();
        rrh.realizeHttpPost(httpClient, httpPost, nvps);
    }

    // The crawl loop: poll a URL, download it, and extract the next layer of URLs.
    public void process(CloseableHttpClient httpClient) {
        Request request = new Request();
        request.setUrl("http://share.renren.com/share/hotlist/v7?t=1");
        request.setPriority(0);
        queue.add(request);
        String param1 = "share.renren.com/share/";
        String param2 = "hot";
        String param3 = "www.renren.com/profile.do";
        String param4 = "*()!"; // a string that should never appear in a URL, so this filter passes everything
        // a real crawler would use a PriorityBlockingQueue so several threads can share the URLs
        while (queue.size() > 0) {
            Request requestCur = queue.poll();
            String str = requestCur.getUrl();
            System.out.println("Requesting URL: " + requestCur.getUrl());
            System.out.println(requestCur.getPriority());
            if (str.equals("http://share.renren.com/share/hotlist/v7?t=1")) {
                // seed page: collect links to individual share pages
                String content = downloadPage(httpClient, str);
                getUrl(content, param1, param2, 2);
            } else if (str.indexOf(param1) > 0 && str.indexOf(param2) < 0) {
                // share page: collect links to user profiles
                String content = downloadPage(httpClient, str);
                getUrl(content, param3, param4, 1);
            } else if (str.indexOf(param3) > 0 && str.indexOf(param4) < 0) {
                // profile page: extract the field we are after
                String content = downloadPage(httpClient, str);
                if (!content.isEmpty()) {
                    String address = analyzePage(content);
                    System.out.println(address);
                }
            }
        }
    }

    // Extract from the page every URL containing param1 but not param2,
    // dedup it against the HashSet, and enqueue it with the given priority.
    public boolean getUrl(String page, String param1, String param2, long priority) {
        Document document = Jsoup.parse(page);
        Elements elements = document.select("a");
        boolean flag = false;
        for (Element element : elements) {
            String url = element.attr("href");
            flag = url.indexOf(param1) > 0 && url.indexOf(param2) < 0;
            if (flag) {
                if (urls.add(url)) { // add() returns false if the URL was already seen
                    Request request = new Request();
                    request.setUrl(url);
                    request.setPriority(priority);
                    queue.add(request); // enqueue for crawling
                }
            }
        }
        return flag;
    }

    // Download a page via HttpClient.
    public String downloadPage(CloseableHttpClient httpClient, String url) {
        RenRenHttp rrh = new RenRenHttp();
        HttpGet getTitle = new HttpGet(url);
        return rrh.realizeHttpGet(httpClient, getTitle);
    }

    // Parse a downloaded page and pull out the address field.
    public String analyzePage(String page) {
        Document doc = Jsoup.parse(page);
        return doc.select(".address").text();
    }

    public static void main(String[] args) {
        // the cookie store keeps the login session across requests
        BasicCookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
        RenRen renren = new RenRen();
        renren.login(httpClient);
        renren.process(httpClient);
        try {
            httpClient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Comments and suggestions are very welcome.