On the problem of CSDN blog articles refusing to be crawled

I once read an article by Robin on anti-crawling, in which he lists several ways to fend off crawlers:

1. Manual rejection: a crawler's concurrency tends to be unusually high, so sort the connections on port 80 by concurrency and ban the offending IPs by hand.
2. Rejection by User-Agent: a Java program that crawls without setting any headers announces a Java User-Agent, so simply drop every request whose User-Agent does not look like a browser's.
3. Blocking based on traffic statistics and log analysis: ban the crawlers that consume the most bandwidth.
4. Real-time blocking: if an IP issues requests unusually often within a short period, treat it as a crawler, put it on a blacklist, and stop answering its requests (a sketch of this idea follows the list).
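To make point 4 concrete, here is a minimal sketch of such a real-time blocker: count requests per IP inside a fixed time window and blacklist any IP that exceeds a threshold. The class name, threshold, and window length are my own illustrative choices, not from Robin's article:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RateLimitBlacklist {

	private static final int MAX_REQUESTS = 100;  // requests allowed per window (assumed)
	private static final long WINDOW_MS = 60000L; // window length: one minute (assumed)

	private final Map<String, Integer> counters = new HashMap<String, Integer>();
	private final Set<String> blacklist = new HashSet<String>();
	private long windowStart = System.currentTimeMillis();

	// Returns false if the request should be rejected. The method is
	// synchronized, so the plain HashMap/HashSet are safe here.
	public synchronized boolean allow(String ip) {
		long now = System.currentTimeMillis();
		if (now - windowStart > WINDOW_MS) {
			counters.clear();             // start a fresh window
			windowStart = now;
		}
		if (blacklist.contains(ip)) {
			return false;                 // already flagged as a crawler
		}
		Integer count = counters.get(ip);
		count = (count == null) ? 1 : count + 1;
		counters.put(ip, count);
		if (count > MAX_REQUESTS) {
			blacklist.add(ip);            // too frequent: treat as a crawler
			return false;
		}
		return true;
	}

	public static void main(String[] args) {
		RateLimitBlacklist limiter = new RateLimitBlacklist();
		for (int i = 0; i < 105; i++) {
			if (!limiter.allow("1.2.3.4")) {
				System.out.println("request " + (i + 1) + " rejected");
			}
		}
	}
}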

A high-concurrency crawler really can put heavy pressure on a site's servers, but sometimes we only want to fetch a few things from ITEYE or CSDN and still get turned away (crawling a CSDN blog returns 403 Forbidden).

Clearly, points 1, 3, and 4 cannot be what is stopping us, because our crawl is neither high-concurrency nor frequent; point 2, the User-Agent check, is the real reason we are being blocked. So the only fix is to add request headers that make our crawl look like a browser request. What, then, does the packet a browser sends actually look like?

We can write a small program that listens on port 10086 (pick whatever free port you like):

package com.JavaUtil.IESimilator;
import java.io.*;
import java.net.*;

public class IEHeaderTest {

	// Listen on port 10086 and dump the raw request that IE sends.
	public IEHeaderTest() {

		int port = 10086;
		ServerSocket serverSocket = null;
		Socket client = null;
		BufferedInputStream bis = null;

		try {
			serverSocket = new ServerSocket(port);
			// Block until the browser connects, then echo everything it sends.
			client = serverSocket.accept();
			bis = new BufferedInputStream(client.getInputStream());
			int index = -1;
			byte[] buffer = new byte[1024];

			while ((index = bis.read(buffer)) != -1) {
				System.out.println(new String(buffer, 0, index));
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		} finally {
			if (bis != null) {
				try {
					bis.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if (client != null) {
				try {
					client.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if (serverSocket != null) {
				try {
					serverSocket.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}

	public static void main(String[] args) {
		new IEHeaderTest();
	}
}

Then enter http://localhost:10086/ in the browser, and we get the data IE sends:

GET / HTTP/1.1
Accept: image/jpeg, application/x-ms-application, image/gif, application/xaml+xml, image/pjpeg, application/x-ms-xbap, application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint, */*
Accept-Language: zh-CN
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET4.0C)
Accept-Encoding: gzip, deflate
Host: localhost:10086
Connection: Keep-Alive
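
By way of contrast, if we point a bare Java client at the same listener without setting anything, the only identifying line the listener prints is something like "User-Agent: Java/1.6.0_45", which is exactly the signature that gets filtered out. A minimal sketch (run IEHeaderTest first; the class name and timeout are my own choices):

import java.net.URL;
import java.net.URLConnection;

public class DefaultUserAgentDemo {

	public static void main(String[] args) throws Exception {
		URLConnection conn = new URL("http://localhost:10086/").openConnection();
		conn.setReadTimeout(3000);  // the listener never answers, so time out quickly
		try {
			conn.getInputStream();  // sending the request is all we need
		} catch (Exception expected) {
			// a timeout is fine; the request headers were already sent and printed
		}
	}
}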

As you can see, as long as we set the User-Agent properly, the crawl is no longer rejected:

package com.JavaUtil.IESimilator;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.DefaultHttpParams;
/*
 * author:Tammy Pi
 */
public class IESimilatorFetchCSDN {

	private HttpClient httpClient = new HttpClient();
	private GetMethod getMethod = null;
	private BufferedReader bis = null;
	private String rtn = null;

	// Fetch a page while pretending to be IE 8.
	public String getPage(String url) {

		StringBuilder sb = new StringBuilder();

		getMethod = new GetMethod(url);
		// Copy the headers we sniffed from IE; User-Agent is the one that matters.
		List<Header> headers = new ArrayList<Header>();
		headers.add(new Header("Accept","image/jpeg, application/x-ms-application, image/gif, application/xaml+xml, image/pjpeg, application/x-ms-xbap, application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint, */*"));
		headers.add(new Header("User-Agent","Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET4.0C)"));
		headers.add(new Header("Connection","Keep-Alive"));

		// Use a lenient cookie policy to avoid "cookie rejected" warnings.
		DefaultHttpParams.getDefaultParams().setParameter("http.protocol.cookie-policy", CookiePolicy.BROWSER_COMPATIBILITY);

		// Attach the sniffed headers to every request on this host.
		httpClient.getHostConfiguration().getParams().setParameter("http.default-headers", headers);

		try {
			int status = httpClient.executeMethod(getMethod);
			System.out.println("status:" + status);

			// Decode the body with the charset the response declares.
			bis = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream(), getMethod.getResponseCharSet()));
			String line = null;
			while ((line = bis.readLine()) != null) {
				sb.append(line);
			}
			// The reader has already decoded the bytes, so the string is ready to use.
			rtn = sb.toString();
		} catch (HttpException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			if (bis != null) {
				try {
					bis.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if (getMethod != null) {
				getMethod.releaseConnection();
			}
		}

		return rtn;
	}

	public static void main(String[] args) {
		IESimilatorFetchCSDN similator = new IESimilatorFetchCSDN();
		String rtn = similator.getPage("http://blog.csdn.net/hdhtqq/article/details/6088461");
		System.out.println(rtn);
	}
}
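
For completeness, the same User-Agent trick works with the JDK's own HttpURLConnection, with no HttpClient dependency at all. A minimal sketch (the class name is my own, and it hard-codes utf-8, which happens to match CSDN's pages; in general the charset should be read from the Content-Type response header):

package com.JavaUtil.IESimilator;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PlainFetch {

	public static String getPage(String url) throws Exception {
		HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
		// The same IE 8 User-Agent we sniffed above.
		conn.setRequestProperty("User-Agent",
				"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)");
		conn.setRequestProperty("Accept", "*/*");

		BufferedReader in = new BufferedReader(
				new InputStreamReader(conn.getInputStream(), "utf-8"));
		StringBuilder sb = new StringBuilder();
		String line = null;
		while ((line = in.readLine()) != null) {
			sb.append(line);
		}
		in.close();
		return sb.toString();
	}

	public static void main(String[] args) throws Exception {
		System.out.println(getPage("http://blog.csdn.net/hdhtqq/article/details/6088461"));
	}
}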