Fetching NetEase Blog Data via POST (study notes and backup on web scraping and simulated login)

天心蓝

Article link: https://blog.csdn.net/zhong36060123/article/details/17714457

The blog at http://www.crifan.com/ has a category named "Category Archives: Crawl_emulatelogin":

http://www.crifan.com/category/work_and_job/web/crawl_emulatelogin/

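It collects a lot of study material on parsing and scraping web pages and on simulating logins, and it also offers a blog migration tool, BlogsToWordPress. The tool is very powerful, but precisely because it does so much, it takes a lot of time to get working; I mainly used its feature for downloading NetEase blog data. For details, look up the relevant posts there by title.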

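The post list of a NetEase blog (http://blog.163.com) is loaded dynamically. Take the post list of Xiao Ying (肖鹰) of Tsinghua University as an example:

http://xying1962.blog.163.com/blog/ (usually displayed with a trailing "#m=0": http://xying1962.blog.163.com/blog/#m=0)

As shown in the figure: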

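A single HttpClient request to "http://xying1962.blog.163.com/blog/" does not return the post data (the part in the red box in the figure). A separate POST request is needed, to

"http://api.blog.163.com/xying1962/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr".

The following post analyzes how to POST for NetEase's ".dwr" data: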

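【教程】以抓取网易博客帖子中的最近读者信息为例，手把手教你如何抓取动态网页中的内容 ([Tutorial] A step-by-step guide to scraping content from dynamic web pages, using the recent-reader data of a NetEase blog post as the example)

That post analyzes scraping the reader information of a NetEase blog, which requests VisitBeanNew.getBlogReaders.dwr; scraping the post list requests BlogBeanNew.getBlogs.dwr instead. Both are POST requests, the principle is the same, and the settings are nearly identical.

After the analysis, it is time for code. If you are interested, read the Python source of the whole BlogsToWordPress tool; if you only want the POST code, see this post:

【记录】用Python解析网易163博客的心情随笔FeelingCard返回的DWR-REPLY数据 ([Notes] Parsing the DWR-REPLY data returned for NetEase 163 blog FeelingCard entries with Python)

That one is actually rather long-winded; for a more concise version, see:

【记录】给BlogsToWordPress添加支持导出网易的心情随笔 ([Notes] Adding support for exporting NetEase FeelingCard entries to BlogsToWordPress)

These three posts cover how to set up and send the POST request for NetEase blog data; their code is written in Python. Below is the complete Java code I wrote, based on them, to request a user's blog data.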

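First, note that the post list of a NetEase blog is loaded dynamically and requires a POST request to a .dwr endpoint, whereas the post content is static and can be fetched with a plain GET on its URL. Take one of Xiao Ying's posts:

肖鹰：晚明文人为何发狂？ (Xiao Ying: Why Did the Late-Ming Literati Go Mad?)

Its address is: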

http://xying1962.blog.163.com/blog/static/138445490201310207320529/

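My goal is to get the content of this post, which takes only one GET request to its address. The address is also quite regular: once the trailing digit string "138445490201310207320529" has been parsed out, the full address can be assembled. The overall format is: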

http://[userName].blog.163.com/blog/static/[blogId]

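Xiao Ying's username, "xying1962", can be read from the entry address "http://xying1962.blog.163.com/blog/". The blogId, however, can only be obtained by parsing the post list data, which is why the POST request to the .dwr endpoint is needed.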
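As a minimal sketch of the address format above, the following joins a username and a blogId into a post URL. The class and method names are hypothetical and not part of the original code; the trailing slash follows the example address shown earlier.

```java
// Illustrative helper; PostUrlBuilder and buildPostUrl are hypothetical names.
final class PostUrlBuilder {

    /** Builds http://[username].blog.163.com/blog/static/[blogId]/ from its two parts. */
    static String buildPostUrl(String username, String blogId) {
        return "http://" + username + ".blog.163.com/blog/static/" + blogId + "/";
    }

    public static void main(String[] args) {
        // Values taken from the Xiao Ying example in the text.
        System.out.println(buildPostUrl("xying1962", "138445490201310207320529"));
    }
}
```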

A note on NetEase blog addresses: the post list address comes in two formats:

1. http://[username].blog.163.com/blog/

2. http://blog.163.com/[username]/blog/
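One possible way to read the username out of either address format is sketched below. It is an illustration only: the class name, the method name, and the exact regular expressions are my own assumptions, not taken from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; UsernameExtractor and its patterns are hypothetical.
final class UsernameExtractor {

    // Format 1: http://[username].blog.163.com/blog/
    private static final Pattern SUBDOMAIN =
            Pattern.compile("^http://([^./]+)\\.blog\\.163\\.com(/.*)?$");
    // Format 2: http://blog.163.com/[username]/blog/
    private static final Pattern PATH =
            Pattern.compile("^http://blog\\.163\\.com/([^/#]+)(/.*)?$");

    /** Returns the username, or null if the URL matches neither format. */
    static String extractUsername(String url) {
        Matcher m = SUBDOMAIN.matcher(url);
        if (m.matches()) {
            return m.group(1);
        }
        m = PATH.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractUsername("http://xying1962.blog.163.com/blog/#m=0"));
        System.out.println(extractUsername("http://blog.163.com/xying1962/blog/"));
    }
}
```

Trying the two formats in a fixed order keeps the patterns simple; a URL that matches neither yields null so the caller can decide how to handle it.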

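Before giving the Java code, I have to say that Google Chrome is a great product; even its request monitoring is well done, which makes it a good helper for web page analysis. Personally I find it easier to use than Wireshark. Here is how to use it:

1. Right-click anywhere on the page and choose the last item, "Inspect Element" (I believe the Chinese version calls it "审查元素"), as shown:

Once the "Inspect Element" panel opens, click "Network" (this should be "网络" in the Chinese version) and reload the page to see the monitored requests, as shown below:

For each HTTP request you can see its name (Name), method (Method), status (Status), and response type (Type). Click an entry in the leftmost Name column for details; for example, clicking "blog/" shows:

You can view the Headers, the returned result under "Response", and the Cookies. Simulating a login sometimes requires the Cookies, but in most cases Headers and Response are enough. To clear the current entries and start over, click the "Clear" button at the bottom (circled in red in the figure). If you have studied computer networking and done packet capture analysis, a quick look will make everything clear; if not, it will take some time to learn.

Now for how to set up the POST request in Java. First, the Java code, laid out similarly to the original Python: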

public Set<String> post163Blog(String username, String userId, int startIndex, int returnNumber){
		/**
		* entityBody holds the response as a string
		*/
		String entityBody = null;
		/**
		* Create an HttpPost for the dwr endpoint; username is the blogger's user name, e.g. "xying1962" for Xiao Ying
		*/
		HttpPost httppost = new HttpPost("http://api.blog.163.com/" + username + "/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr");
		
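		/*
		* Set the parameters. Everything except c0-param0, c0-param1 and c0-param2 stays the same for every request.
		* c0-param0: the blogger's userId, e.g. "138445490" for Xiao Ying
		* c0-param1: index of the first post to return, starting from 0
		* c0-param2: number of posts to return per request. The maximum seems to be 500; I have not tested the exact limit, but 600 and above return no data, so I usually use 500.
		* If a blogger has more than 500 posts, issue several requests with suitable c0-param1 and c0-param2 values.
		*/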
		List<NameValuePair> nvp = new ArrayList<NameValuePair>();
		nvp.add(new BasicNameValuePair("callCount", "1"));
		nvp.add(new BasicNameValuePair("scriptSessionId", "${scriptSessionId}187"));
		nvp.add(new BasicNameValuePair("c0-scriptName", "BlogBeanNew"));
		nvp.add(new BasicNameValuePair("c0-methodName", "getBlogs"));
		nvp.add(new BasicNameValuePair("c0-id", "0"));
		nvp.add(new BasicNameValuePair("c0-param0", "number:" + userId));
		nvp.add(new BasicNameValuePair("c0-param1", "number:" + startIndex));
		nvp.add(new BasicNameValuePair("c0-param2", "number:" + (returnNumber <= 500 ? returnNumber : 500)));
		nvp.add(new BasicNameValuePair("batchId", "1"));
		
		try{
			httppost.setEntity(new UrlEncodedFormEntity(nvp, "UTF8"));
			httppost.addHeader("Referer", "http://api.blog.163.com/crossdomain.html?t=20100205");
			httppost.addHeader("Content-Type", "text/plain");
			//httppost.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
			
			HttpResponse response = httpclient.execute(httppost);
			
			HttpEntity entity = response.getEntity();
			if(entity != null){
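				/**
				* Convert the response to a string. The encoding chosen here does not really matter, because only the blogIds are parsed out. The response text is unicode-escaped and would need decoding; I skipped that as too much trouble, and it is not needed here.
				*/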
				entityBody = EntityUtils.toString(entity, "UTF8");
			}
		} catch (Exception e){
			e.printStackTrace();
		} finally {
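			/**
			* The request is done; abort httppost to release its resources. Be sure to do this only after reading the result (response.getEntity()),
			* because once httppost is closed the response is closed too, and the result is released with it.
			*/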
			httppost.abort();
		}		
		
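		/**
		* blogIdSet stores the blogIds. In the POST response a blogId appears in three forms:
		* 1. permalink="blog/static/[blogId]"
		* 2. trackbackUrl="blog/[blogId].track"
		* 3. permaSerial="[blogId]"
		* In the third form, permaSerial= is always followed directly by the blogId, so it yields a clean blogId that is easy to extract. I have not tried the other two,
		* but they should work as well; print entityBody to inspect the response yourself. The code below parses the POST response and extracts the blogIds. A HashSet
		* removes the need to check whether a blogId has already been seen, which saves a few lines. Do not use an ArrayList, because permaSerial="[blogId]" appears twice for each blogId.
		* To extract more information, such as titles, consider a HashMap<String, InfoStruct> (HashMap<blogId, post data>).
		*/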
		Set<String> blogIdSet = new HashSet<String>();
		
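		/**
		* The regular expression to match. The "?" in \"[0-9]+?\" requests a lazy (minimal) match; since [0-9] cannot match the closing quote, a greedy [0-9]+ would capture the same blogId here.
		*/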
		Pattern pattern = Pattern.compile("permaSerial=\"[0-9]+?\"");
		
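		/**
		* Split the response into lines and match each line. Matching the whole string directly also works; splitting first is just my habit, to avoid matches that span lines.
		* If the request failed, entityBody is null and there is nothing to parse.
		*/
		if(entityBody == null)
			return blogIdSet;
		String[] sents = entityBody.split("(\n|\r\n)+");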
		for(int i = 0; i < sents.length; i++){
			Matcher matcher = pattern.matcher(sents[i]);
			while(matcher.find()){
				blogIdSet.add(matcher.group().replaceAll("permaSerial=|\"", ""));
			}
		}
		return blogIdSet;
	}

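With the blogIds in hand, you can assemble the post URLs and request the post content. [Sigh: to write this post I even reworked the code comments and added many new ones.]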
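The blogId extraction step of post163Blog can also be tried in isolation. The sketch below is a variant, not the original method: it uses a capture group instead of replaceAll, and a LinkedHashSet so that ids keep their order of first appearance. The class name and the sample reply text are made up for illustration (the second id in the sample is invented), and the real response contains much more data.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; BlogIdExtractor and the sample reply are hypothetical.
final class BlogIdExtractor {

    private static final Pattern PERMA_SERIAL = Pattern.compile("permaSerial=\"([0-9]+)\"");

    /** Collects every distinct permaSerial value, in order of first appearance. */
    static Set<String> extractBlogIds(String reply) {
        Set<String> ids = new LinkedHashSet<>();
        if (reply == null) {
            return ids; // nothing to parse when the request failed
        }
        Matcher m = PERMA_SERIAL.matcher(reply);
        while (m.find()) {
            ids.add(m.group(1)); // group 1 is the digit string between the quotes
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up reply fragment: the first id appears twice, as the author observed.
        String sample = "s0.permaSerial=\"138445490201310207320529\";\n"
                + "s1.permaSerial=\"138445490201310207320529\";\n"
                + "s2.permaSerial=\"138445490201309150000000\";";
        System.out.println(extractBlogIds(sample));
    }
}
```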

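Of the parameters of post163Blog(String username, String userId, int startIndex, int returnNumber), startIndex and returnNumber can be set as needed, while username and userId must be passed in. Given only a blog entry address, we can obtain the username but not the userId, so the userId has to be extracted separately.

The userId can be found in the response to a GET request on the blog entry address. In the Xiao Ying example, the response to a GET on "http://xying1962.blog.163.com/blog/" contains "userId:138445490", as shown below (you can see it in the Response tab of Chrome, the page analysis tool described above):

This userId is stored inside a <script>...</script> element. It can be extracted by parsing the page with HtmlCleaner, or directly with a regular expression on the string, like the one used to extract blogId in the post163Blog function above. The pattern is: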

Pattern pattern = Pattern.compile("userId:[0-9]+");
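A small sketch of the regex route follows. It adds a capture group to the pattern above so that only the digits are returned, and it takes the first match, which assumes the first "userId:" in the page belongs to the blogger. The class name and the sample script text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; UserIdExtractor and the sample page text are hypothetical.
final class UserIdExtractor {

    private static final Pattern USER_ID = Pattern.compile("userId:([0-9]+)");

    /** Returns the digits of the first "userId:<digits>" occurrence, or null if there is none. */
    static String extractUserId(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = USER_ID.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "<script>window.N = {\n  userId:138445490\n};</script>";
        System.out.println(extractUserId(sample));
    }
}
```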

Here is the code that GETs the post list address and parses the response to obtain the userId, for reference.

/**
	 * Get the html text through a GET request, the default encoding is "UTF8"
	 * */
	public String getText(String inputUrl){
		return getText(inputUrl, "UTF8");
	}
	public String getText(String inputUrl, String encoding){
		/**
		* Create a new HttpGet and add a header
		*/
		HttpGet httpget = new HttpGet();
		httpget.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
		String entityBody = null;
		try{
			/**
			* Set the address of the page to request
			*/
			httpget.setURI(new URI(inputUrl));
			HttpResponse response = httpclient.execute(httpget);
			/**
			* Read the response and convert it to a string
			*/	
			HttpEntity entity = response.getEntity();
			if(entity != null){
				entityBody = EntityUtils.toString(entity, encoding);
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			/**
			* Abort httpget to release its resources, even when the request failed. Releasing resources promptly is a good habit.
			*/
			httpget.abort();
		}
		/**
		* Return the response body. entityBody is usually the html source of a page, but not always; it depends on how the server responds.
		*/
		return entityBody;		
	}

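/**
	 * Parse the response of the GET request on the post list page to obtain the blogger's userId, which uniquely identifies the blogger.
	 * The userId is hidden in script code. This uses the HtmlCleaner library.
	 * The checks here are overly cautious, because I did not analyze in detail whether the response also contains other users' userIds,
	 * but they ensure that the extracted userId is the blogger's own.
	 * */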
	public String parseReturnHtml(String htmlText){
		if(htmlText == null)
			return null;
		TagNode rootNode = htmlcleaner.clean(htmlText);
		try {
			/**
			* Extract the <script>...</script> contents, iterating from last to first because the userId sits in one of the later script blocks.
			*/
			Object[] scriptNodes = rootNode.evaluateXPath("//script");
			for(int i = scriptNodes.length - 1; i >= 0; i--){
				TagNode scriptNode = (TagNode) scriptNodes[i];
				String text = scriptNode.getText().toString().trim();
				
				if(! text.startsWith("window.N"))
					continue;
				if(! text.contains("userId"))
					continue;
				/**
				* Split into lines
				*/
				String[] sents = text.split("\n|\r\n");
				for(int j = sents.length - 1; j >= 0; j--){
					if(! sents[j].contains("userId"))
						continue;
					sents[j] = sents[j].trim();
					
					String[] items = sents[j].split(":");
					if(items.length != 2)
						return null;
					String userId = items[1];
					/**
					 * userId is a string of digits
					 * */
					return userId;
				}
				break;
			}
		} catch (XPatherException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} finally {}
		return null;
	}

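Here, getText sends a GET request for a page and returns the response; it is a general-purpose function. parseReturnHtml only parses the response of a GET on a NetEase blog post list.

That is the key code for fetching NetEase blog data.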
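The paging described in the parameter comments of post163Blog (at most about 500 posts per request, more posts via several requests) can be sketched separately. The example assumes the total number of posts is known in advance, which the original code does not require; the class name, method name, and the constant are hypothetical, with 500 being the cap the author observed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; PagePlanner is a hypothetical helper, not part of the original code.
final class PagePlanner {

    // Upper bound per request as observed by the author (assumption, not a documented limit).
    static final int MAX_PER_REQUEST = 500;

    /** Returns {startIndex, count} pairs that together cover `total` posts, starting at index 0. */
    static List<int[]> plan(int total, int pageSize) {
        int size = Math.max(1, Math.min(pageSize, MAX_PER_REQUEST));
        List<int[]> pages = new ArrayList<>();
        for (int start = 0; start < total; start += size) {
            pages.add(new int[] {start, Math.min(size, total - start)});
        }
        return pages;
    }

    public static void main(String[] args) {
        // Each pair maps to the c0-param1 / c0-param2 values of one POST request.
        for (int[] p : plan(1200, 600)) {
            System.out.println("c0-param1=number:" + p[0] + ", c0-param2=number:" + p[1]);
        }
    }
}
```

When the total is unknown, the same start indices can be generated lazily: keep requesting the next page until a request returns fewer ids than were asked for.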

The complete runnable code follows. It requires two jar packages:

htmlcleaner

httpclient

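You may also need the httpcore jar; if the two above are not enough, add it as well. [Note: httpclient and httpcore are both part of the Apache HttpComponents project, and httpclient depends on httpcore.]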

import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HeaderElement;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class WangyiBlogCrawler {
	
	/**
	 * For http request and html cleaning and parsing
	 * */
	private HttpClient httpclient;
	private HtmlCleaner htmlcleaner;
	
	private int STARTINDEX;
	private int RETURNNUMBER;
	
	public WangyiBlogCrawler(){
		httpclient = new DefaultHttpClient();
		htmlcleaner = new HtmlCleaner();
		
		STARTINDEX = 0;
		RETURNNUMBER = 100;
	}

	public static void main(String[] args) {
		// TODO Auto-generated method stub
		
		String contentUrl = "http://xying1962.blog.163.com/blog/";
		WangyiBlogCrawler wyBlogCrawler = new WangyiBlogCrawler();
		
		wyBlogCrawler.run(contentUrl);

	}
	
	public void run(String contentUrl){
		
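		// Strip the scheme, the host part and the "blog/" path so that only the username remains (dots escaped so they match literally)
		String username = contentUrl.replaceAll("http://|\\.?blog\\.163\\.com/?|/?blog/|#m=0", "");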
		String returnEntity = getText(contentUrl);
		String userId = parseReturnHtml(returnEntity);
		
		int startIndex = STARTINDEX;
		int returnNumber = RETURNNUMBER;
		
		Set<String> blogIdSet = new HashSet<String>();
		
		Set<String> temIdSet = null;
		do{
			// Request the current page first, then advance; incrementing before the first request would skip the first page of posts
			temIdSet = post163Blog(username, userId, startIndex, returnNumber);
			blogIdSet.addAll(temIdSet);
			startIndex += returnNumber;
		}while(temIdSet.size() == returnNumber);
		
		processBlogIdSet(contentUrl, blogIdSet);
		
	}
	
	public void processBlogIdSet(String contentUrl, Set<String> blogIdSet){
		contentUrl = contentUrl.replaceAll("#m=0", "");
		
		for(Iterator<String> iter = blogIdSet.iterator(); iter.hasNext(); ){
			String blogId = iter.next();
			
			
			/**
			* Build the address of the post content
			*/
			String blogUrl = contentUrl + "static/" + blogId + "/";
			
			/**
			 * output the blog url
			 * */
			System.out.println(blogUrl);
			
			/**
			 * output the blog entity
			 * Uncomment the two lines below to request each post and print its full html text
			 * */
			//String blogEntity = getText(blogUrl, "gbk");
			//System.out.println(blogEntity);
		}
		
	}
	
	/**
	 * Parsing the entry html in order to extract the unique userId.
	 * The unique userId is hidden in the script codes.
	 * */
	public String parseReturnHtml(String htmlText){
		if(htmlText == null)
			return null;
		TagNode rootNode = htmlcleaner.clean(htmlText);
		try {
			Object[] scriptNodes = rootNode.evaluateXPath("//script");
			for(int i = scriptNodes.length - 1; i >= 0; i--){
				TagNode scriptNode = (TagNode) scriptNodes[i];
				String text = scriptNode.getText().toString().trim();
				
				if(! text.startsWith("window.N"))
					continue;
				if(! text.contains("userId"))
					continue;
				
				String[] sents = text.split("\n|\r\n");
				for(int j = sents.length - 1; j >= 0; j--){
					if(! sents[j].contains("userId"))
						continue;
					sents[j] = sents[j].trim();
					
					String[] items = sents[j].split(":");
					if(items.length != 2)
						return null;
					String userId = items[1];
					/**
					 * the userId is a sequence of digits.
					 * */
					return userId;
				}
				break;
			}
		} catch (XPatherException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} finally {}
		return null;
	}
	
	public Set<String> post163Blog(String username, String userId, int startIndex, int returnNumber){
		
		String entityBody = null;
		
		HttpPost httppost = new HttpPost("http://api.blog.163.com/" + username + "/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr");
		
		List<NameValuePair> nvp = new ArrayList<NameValuePair>();
		nvp.add(new BasicNameValuePair("callCount", "1"));
		nvp.add(new BasicNameValuePair("scriptSessionId", "${scriptSessionId}187"));
		nvp.add(new BasicNameValuePair("c0-scriptName", "BlogBeanNew"));
		nvp.add(new BasicNameValuePair("c0-methodName", "getBlogs"));
		nvp.add(new BasicNameValuePair("c0-id", "0"));
		nvp.add(new BasicNameValuePair("c0-param0", "number:" + userId));
		nvp.add(new BasicNameValuePair("c0-param1", "number:" + startIndex));
		nvp.add(new BasicNameValuePair("c0-param2", "number:" + (returnNumber <= 500 ? returnNumber : 500)));
		nvp.add(new BasicNameValuePair("batchId", "1"));
		
		try{
			httppost.setEntity(new UrlEncodedFormEntity(nvp, "UTF8"));
			httppost.addHeader("Referer", "http://api.blog.163.com/crossdomain.html?t=20100205");
			httppost.addHeader("Content-Type", "text/plain");
			httppost.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
			
			HttpResponse response = httpclient.execute(httppost);
			
			HttpEntity entity = response.getEntity();
			if(entity != null){
				entityBody = EntityUtils.toString(entity, "UTF8");
			}
		} catch (Exception e){
			e.printStackTrace();
		} finally {
			httppost.abort();
		}		
		
		Set<String> blogIdSet = new HashSet<String>();
		
		Pattern pattern = Pattern.compile("permaSerial=\"[0-9]+?\"");
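		// If the request failed, entityBody is null and there is nothing to parse
		if(entityBody == null)
			return blogIdSet;
		String[] sents = entityBody.split("(\n|\r\n)+");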
		for(int i = 0; i < sents.length; i++){
			Matcher matcher = pattern.matcher(sents[i]);
			while(matcher.find()){
				blogIdSet.add(matcher.group().replaceAll("permaSerial=|\"", ""));
			}
		}
		return blogIdSet;
	}
	
	/**
	 * Get the html text through a GET request
	 * */
	public String getText(String inputUrl){
		return getText(inputUrl, "UTF8");
	}
	public String getText(String inputUrl, String encoding){
		
		HttpGet httpget = new HttpGet();
		httpget.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
		String entityBody = null;
		try{
			httpget.setURI(new URI(inputUrl));
			HttpResponse response = httpclient.execute(httpget);
			HttpEntity entity = response.getEntity();
			if(entity != null){
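				/**
				 * To detect the charset automatically, uncomment the statements below.
				 * getCharset and getMeta detect the encoding and are called from getText. When fetching
				 * post content, EntityUtils.toString(entity, encoding) may yield garbled text, so these
				 * statements help when the encoding is unknown. Detection takes up to two passes.
				 * First, getCharset reads the charset parameter of the response's Content-Type, the same
				 * value that pages usually declare in <head> as
				 * <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				 * Sometimes that yields no charset, or the page only uses the short form
				 * <meta charset="utf-8">
				 * In that case getMeta extracts it from the html by string matching. That requires
				 * converting the response to a string first, and the entity can apparently be read only
				 * once, so a second GET request is needed, which is costly. This clumsy multi-pass,
				 * multi-request approach exists only to avoid garbled text. When scraping a single site,
				 * simply set the encoding directly.
				 */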
				/**
				String charset = getCharset(entity);
				if(charset == null){
					entityBody = EntityUtils.toString(entity);
					charset = getMeta(entityBody);
					 
					response = httpclient.execute(httpget);
					entity = response.getEntity();
				}
				if(charset != null)
					encoding = charset;
				*/
				entityBody = EntityUtils.toString(entity, encoding);
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// Release the request's resources even when the request failed
			httpget.abort();
		}
		return entityBody;		
	}
	
	public String getMeta(String htmlEntity){
		String charset = null;
		if(htmlEntity == null)
			return charset;
		Pattern pattern = Pattern.compile("charset=\"?.*?\"");
		String[] lines = htmlEntity.split("(\n|\r\n)+");
		for(int i = 0; i < lines.length; i++){
			Matcher matcher = pattern.matcher(lines[i]);
			if(matcher.find()){
				String[] items = matcher.group().split("=");
				charset = items[1].replaceAll("\"", "");
				break;
			}
		}
		return charset;
	}
	
	public String getCharset(HttpEntity entity){
		String charset = null;
		if(entity == null)
			return charset;
		if(entity.getContentType() != null){
			HeaderElement[] values = entity.getContentType().getElements();
			if(values != null && values.length > 0){
				for(HeaderElement value : values){
					NameValuePair param = value.getParameterByName("charset");
					if(param != null){
						charset = param.getValue();
						break;
					}
				}
			}
		}
		return charset;
	}
	
}