Basic Implementation of a Web Crawler

I've been reading up on crawlers lately, in particular Yihua Huang's WebMagic. After working through the ideas behind his design, I tried writing a simple crawler of my own, and I'd like to share it here.

If we set out to fetch information from a web page, the two intuitive tasks are getting the current page and extracting the URLs it contains. A crawler builds on these two cores and grows into three main modules:

1. Describing or defining the crawl target (fetching the current page);
2. Analyzing and filtering pages or data (extracting URLs);
3. A URL search strategy (de-duplication, etc.).

Different tools suit the different modules:
        
I. Page downloading
1. Simulate an HTTP request, then receive and parse the response (Apache HttpClient)
2. Embed a browser and grab the page after it has finished loading (Selenium)
   - The option to reach for when the page content is generated dynamically by JavaScript; a minimal sketch follows this list
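Where the body is assembled by JavaScript, HttpClient only ever sees the empty HTML shell, so an embedded browser is needed. A minimal sketch, assuming the selenium-java dependency and a chromedriver binary on the PATH (the URL is just a placeholder):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumFetch {
	public static void main(String[] args) {
		// Launches a real browser, so JS-generated content ends up in the DOM
		WebDriver driver = new ChromeDriver();
		try {
			driver.get("http://example.com");		// placeholder URL
			String html = driver.getPageSource();	// the fully rendered page
			System.out.println(html.length());
		} finally {
			driver.quit();	// always release the browser process
		}
	}
}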
II. Page parsing
1. Jsoup
2. XPath
These are fairly straightforward to pick up on your own; a quick Jsoup example follows.
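A small taste of Jsoup, which the code later in this post also relies on: parse an HTML string and pull out every link (the HTML here is made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
	public static void main(String[] args) {
		String html = "<ul><li><a href='http://a.example/1'>one</a></li>"
				+ "<li><a href='http://a.example/2'>two</a></li></ul>";
		Document doc = Jsoup.parse(html);
		// select("a") returns every anchor element; attr("href") reads the link target
		for (Element a : doc.select("a")) {
			System.out.println(a.attr("href"));
		}
	}
}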
III. URL management
1. De-duplication (to keep memory usage down)
Stores the URLs that have already been crawled, and every newly discovered URL is checked against this store so nothing is fetched twice. There are two common choices, with a toy sketch of the second after this list: a HashSet gives exact answers but its memory grows with every URL it holds, while a Bloom filter uses far less space at the cost of a small false-positive probability.
- HashSet
- Bloom filter
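A Bloom filter can be hand-rolled in a few lines. This is only a toy sketch (two cheap hash functions over a BitSet); a production filter would derive the bit-array size and hash count from the expected URL volume and the target false-positive rate:

import java.util.BitSet;

public class SimpleBloomFilter {
	private static final int SIZE = 1 << 20;	// about one million bits
	private final BitSet bits = new BitSet(SIZE);

	// Two cheap hash functions derived from String.hashCode(); toy quality only.
	// Masking with 0x7fffffff keeps the index non-negative.
	private int h1(String s) { return (s.hashCode() & 0x7fffffff) % SIZE; }
	private int h2(String s) { return ((s.hashCode() * 31 + 17) & 0x7fffffff) % SIZE; }

	public void add(String url) {
		bits.set(h1(url));
		bits.set(h2(url));
	}

	// false means "definitely never seen"; true means "probably seen"
	public boolean mightContain(String url) {
		return bits.get(h1(url)) && bits.get(h2(url));
	}
}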
2. Priority blocking queue
PriorityBlockingQueue holds the URLs that have not yet been crawled and can be shared safely between multiple threads; a short example follows.
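A minimal sketch of the frontier as a PriorityBlockingQueue, reusing the Request and ReqComparator classes defined later in this post:

import java.util.concurrent.PriorityBlockingQueue;

import com.model.Request;
import com.test.ReqComparator;

public class FrontierDemo {
	public static void main(String[] args) throws InterruptedException {
		// Thread-safe, unbounded priority queue; compare with the plain
		// PriorityQueue used in the single-threaded demo further down
		PriorityBlockingQueue<Request> frontier =
				new PriorityBlockingQueue<Request>(10, new ReqComparator());

		Request seed = new Request();
		seed.setUrl("http://share.renren.com/share/hotlist/v7?t=1");
		seed.setPriority(0);
		frontier.put(seed);				// put() never blocks: the queue is unbounded

		Request next = frontier.take();	// take() blocks until a URL is available,
										// which lets several download threads share one frontier
		System.out.println(next.getUrl());
	}
}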

The de-duplication flow goes like this: first initialize the queue and the HashSet. When the seed URL enters the program, check whether it already exists in the HashSet; since the set was just initialized, it does not, so the URL is stored in the priority queue. poll() then takes the current URL off the queue and HttpClient downloads it; the downloaded page is parsed, each URL we want is extracted and checked against the HashSet in turn, and the cycle repeats.
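In code form the cycle looks roughly like this. It is only a schematic: downloadPage() and extractUrls() are stub placeholders for the HttpClient and Jsoup code in the full listing below:

import java.util.*;

public class CrawlLoopSketch {
	// Placeholders for the HttpClient download and Jsoup parsing shown later
	static String downloadPage(String url) { return ""; }
	static List<String> extractUrls(String page) { return Collections.emptyList(); }

	public static void main(String[] args) {
		Set<String> seen = new HashSet<String>();
		Queue<String> frontier = new LinkedList<String>();
		String seed = "http://share.renren.com/share/hotlist/v7?t=1";
		frontier.add(seed);
		seen.add(seed);
		while (!frontier.isEmpty()) {
			String url = frontier.poll();				// take the next URL
			String page = downloadPage(url);			// download via HttpClient
			for (String next : extractUrls(page)) {		// parse via Jsoup
				if (seen.add(next)) {					// add() is false if already crawled
					frontier.add(next);
				}
			}
		}
	}
}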


That covers the basic principles; now let's look at how to implement the crawler.

The minimal set of JARs required:

① httpclient-4.3.6.jar (HTTP communication)
② commons-logging-1.1.3.jar
③ httpcore-4.3.3.jar (name-value pair support for request parameters)
④ jsoup-1.7.2.jar (page parsing)

Each module is implemented with just a few simple methods; treat this as a starting point rather than a finished design.

RenRenHttp.java

package com.http;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenRenHttp {
	// Sends a POST (used for the login form below) and returns the response body
	public String realizeHttpPost(CloseableHttpClient httpclient,HttpPost httpPost,List<NameValuePair> nvps){
		String line;
		StringBuilder result = new StringBuilder();
		CloseableHttpResponse  response;
		httpPost.setEntity(new UrlEncodedFormEntity(nvps,Consts.UTF_8));
		try {
			response= httpclient.execute(httpPost);
			HttpEntity entity = response.getEntity();
			InputStream is = entity.getContent();
			BufferedReader br = new BufferedReader(new InputStreamReader(is, Consts.UTF_8));
			while((line=br.readLine()) != null){
				result.append(line);	// StringBuilder avoids quadratic string concatenation
			}
			EntityUtils.consume(entity);	// release the connection for reuse
			is.close();
			httpPost.abort();
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return result.toString();
	}
	// Sends a GET and returns the body; follows RenRen's redirect pages by hand
	public String realizeHttpGet(CloseableHttpClient httpclient,HttpGet httpGet){
		String result = "";
		InputStream is;
		String line;
		CloseableHttpResponse  response;
		try {
			response = httpclient.execute(httpGet);
			HttpEntity entity = response.getEntity();
			int statusCode = response.getStatusLine().getStatusCode();
//			System.out.println("status code: "+statusCode);
			if(statusCode==200){
				is = entity.getContent();
				BufferedReader br = new BufferedReader(new InputStreamReader(is,Consts.UTF_8));
				while((line=br.readLine()) != null){
					result += line;
				}
				EntityUtils.consume(entity);
				is.close();
				httpGet.abort();
			}else if(statusCode==302||statusCode==301){
				// RenRen's 3xx responses carry an HTML stub whose <a> tag points at
				// the real target, so the link is parsed out of the body here;
				// a generic client would read the Location header instead
				is = entity.getContent();
				BufferedReader br = new BufferedReader(new InputStreamReader(is,Consts.UTF_8));
				while((line=br.readLine()) != null){
					result += line;
				}
				Document document = Jsoup.parse(result);
				String url = document.select("a").attr("href");
				EntityUtils.consume(entity);
				is.close();
				httpGet.abort();
				// follow the redirect and return its content, not the stub page
				result = realizeHttpGet(httpclient, new HttpGet(url));
			}
//			System.out.println("status: "+response.getStatusLine());

		} catch (IllegalStateException e) {
			e.printStackTrace();
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return result;

	}
}
Request.java

package com.model;

public class Request {
	private String url;
	private long priority;
	public String getUrl() {
		return url;
	}
	public void setUrl(String url) {
		this.url = url;
	}
	public long getPriority() {
		return priority;
	}
	public void setPriority(long priority) {
		this.priority = priority;
	}
}

ReqComparator.java
package com.test;

import java.util.Comparator;

import com.model.Request;

public class ReqComparator implements Comparator<Request>{

	// Orders requests so that a smaller priority value is polled first
	@Override
	public int compare(Request o1, Request o2) {
		return Long.compare(o1.getPriority(), o2.getPriority());
	}
	
}

RenRen.java

package com.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

import org.apache.http.NameValuePair;	// from the httpcore JAR
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.http.RenRenHttp;
import com.model.Request;

public class RenRen {
	private static String URL = "http://www.renren.com/PLogin.do";
	Set<String> urls = new HashSet<String>();		// URLs already seen (de-duplication)
	static ReqComparator OrderIsdn = new ReqComparator();
	final static PriorityQueue<Request> queue = new PriorityQueue<Request>(10,OrderIsdn);	// crawl frontier
	public void login(CloseableHttpClient httpClient){
		HttpPost httpPost = new HttpPost(URL);
		httpPost.addHeader("Connection", "keep-alive");
		httpPost.addHeader("Host", "www.renren.com");
		httpPost.addHeader("Referer", "http://www.renren.com/SysHome.do"); 
		httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0");
		httpPost.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
		httpPost.addHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
		List<NameValuePair> nvps = new ArrayList<NameValuePair>();
		nvps.add(new BasicNameValuePair("domain", "renren.com"));
		nvps.add(new BasicNameValuePair("key_id", "1"));
		nvps.add(new BasicNameValuePair("captcha_type", "web_login"));
		nvps.add(new BasicNameValuePair("email", "*******"));
		nvps.add(new BasicNameValuePair("password", "*****"));
		RenRenHttp rrh = new RenRenHttp();
		rrh.realizeHttpPost(httpClient, httpPost, nvps);
	}
	// Crawl loop: poll URLs from the queue, download them, and extract the next round of links
	public void process(CloseableHttpClient httpClient){
		Request request = new Request();
		request.setUrl("http://share.renren.com/share/hotlist/v7?t=1");
		request.setPriority(0);
		queue.add(request);
		String param1 = "share.renren.com/share/";
		String param2 = "hot";
		String param3 = "www.renren.com/profile.do";
		String param4 = "*()!";
		RenRen rr = new RenRen();
		// A real crawler would use a PriorityBlockingQueue here so several threads can share the frontier (see the threaded sketch after this listing)
		while(queue.size()>0){
			Request requestCur = queue.poll();
			String str = requestCur.getUrl();
			System.out.println("请求URL:"+requestCur.getUrl());
			System.out.println(requestCur.getPriority());
			if(str.equals("http://share.renren.com/share/hotlist/v7?t=1")){
				String content = rr.downloadPage(httpClient, str);
				rr.getUrl(content, param1, param2,2);
			}else if(str.indexOf(param1)>0&&str.indexOf(param2)<0){
				String content = rr.downloadPage(httpClient, str);
				rr.getUrl(content, param3, param4,1);
			}else if(str.indexOf(param3)>0&&str.indexOf(param4)<0){
				String content = rr.downloadPage(httpClient, str);
				if(content!=""){
					String address = rr.analyzePage(content);
					System.out.println(address);
				}
			}
		}
	}
	// Extracts matching URLs from a downloaded page and enqueues the new ones
	public boolean getUrl(String page,String param1,String param2,long priority){
		Document document = Jsoup.parse(page);
		Elements elements = document.select("a");
		boolean flag = false;
//		a crawler must de-duplicate URLs so pages are not revisited (HashSet here)
		for (Element element : elements) {
			String url = element.attr("href");
			flag = url.indexOf(param1)>0&&url.indexOf(param2)<0;
			if(flag){
				Request request = new Request();
				if(urls.add(url)){ 		// add() returns false if the URL is already in the HashSet
					request.setUrl(url);
					request.setPriority(priority);
					queue.add(request);				//加入队列
				}
			}
		}
		return flag;
	}
	
	// Downloads a page and returns its HTML as a string
	public String downloadPage(CloseableHttpClient httpClient,String url){
				String content = "";
				RenRenHttp rrh = new RenRenHttp();
				HttpGet getTitle = new HttpGet(url);
				content = rrh.realizeHttpGet(httpClient, getTitle);
			return content;
	}
	
	// Parses a page and extracts the text of the .address element
	public String analyzePage(String page){
		Document doc = Jsoup.parse(page);
		String address = doc.select(".address").text();
		return address;
	}
	
	public static void main(String[] args) {
		BasicCookieStore cookieStore = new BasicCookieStore();
		CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
		RenRen renren = new RenRen();
		renren.login(httpClient);
		renren.process(httpClient);
		try {
			httpClient.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
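As the comment in process() notes, a real crawler would swap the single-threaded PriorityQueue for a PriorityBlockingQueue and run several download workers against it. A rough sketch of that shape, reusing the Request and ReqComparator classes above (the thread count and poll timeout are arbitrary choices, not values from the code above):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

import com.model.Request;
import com.test.ReqComparator;

public class ThreadedCrawlerSketch {
	// Thread-safe frontier and seen-set shared by all workers
	static final PriorityBlockingQueue<Request> frontier =
			new PriorityBlockingQueue<Request>(10, new ReqComparator());
	static final Set<String> seen = Collections.synchronizedSet(new HashSet<String>());

	public static void main(String[] args) throws InterruptedException {
		ExecutorService pool = Executors.newFixedThreadPool(4);	// 4 workers, arbitrary
		for (int i = 0; i < 4; i++) {
			pool.execute(new Runnable() {
				public void run() {
					try {
						while (true) {
							// poll with a timeout so idle workers eventually exit
							Request r = frontier.poll(5, TimeUnit.SECONDS);
							if (r == null) break;
							// download r.getUrl(), parse it, and for every link
							// that seen.add(link) accepts, put a new Request back
							// on the frontier (the same cycle as the sketch above)
						}
					} catch (InterruptedException e) {
						Thread.currentThread().interrupt();
					}
				}
			});
		}
		pool.shutdown();
		pool.awaitTermination(1, TimeUnit.MINUTES);
	}
}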

I'd welcome any feedback or suggestions.
