Bulk-Downloading Papers from Wanfang Data

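A few days ago in the lab I saw some senior labmates downloading documents; they told me their advisor wanted them to download tens of thousands of papers. Seriously, who gets put through that? I couldn't just stand by, so I started thinking about how to get a program to do this boring, tedious job, and that is how this downloader came about.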

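To make this post more useful, I won't post the code I wrote for this one job. Instead I'll share the approach and the concrete analysis steps, in the hope that you can reuse them when you run into a similar problem. Let's get started.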

Overall approach: assemble the request headers, study what the server sends back, and repeat until you reach the target file stream.

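Step 1: Use HttpFox for Firefox to analyze the requests. Don't use Firebug, because it may not list some of the URLs. Start HttpFox, then click the download link.
Below is the data I captured.
There is too much to paste in full, so here are only the key parts: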

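Find the URL that finally delivers the download: here it is the entry whose Content-Type is application/pdf. Paste it into the browser to verify it; it works. Then trace backwards from there: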

First request URL:

http://f.wanfangdata.com.cn/download/Periodical_zgqkyx2010z1006.aspx

Response:
(Status-Line) HTTP/1.1 302 Found
Date Fri, 13 Apr 2012 05:34:17 GMT
Server Microsoft-IIS/6.0
X-Powered-By ASP.NET
X-AspNet-Version 4.0.30319
Location http://tran.wanfangdata.com.cn/Transaction.aspx?webTransactionRequest=%7b%22DisplayInfo%22%3a%22http%3a%5c%2f%5c%2fd.wanfangdata.com.cn%5c%2fResourceDisplayInfo.aspx%3fid%3dPeriodical_zgqkyx2010z1006%26type%3dfulltext%22%2c%22Request%22%3a%7b%22AuthenticationContext%22%3anull%2c%22ExtraData%22%3a%5b%5d%2c%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22%2c%22TransferIn%22%3a%7b%22AccountType%22%3a%22Income%22%2c%22Key%22%3a%22PeriodicalFulltext%22%7d%2c%22TransferOut%22%3anull%2c%22Turnover%22%3a3.00000%7d%2c%22ReturnUrl%22%3a%22http%3a%5c%2f%5c%2ff.wanfangdata.com.cn%5c%2fDownload.aspx%22%7d
Set-Cookie WFKS.Auth=%7b%22AuthenticationContext%22%3a%7b%22AccountIds%22%3a%5b%7b%22AccountType%22%3a%22Group%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%2c%7b%22AccountType%22%3a%22GTimeLimit%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%5d%2c%22AuthenticationSign%22%3a%22jp1nHmR0SBcKMkcB4OeuHQdQcgGJFKeLfzRQIreuFR9s8ZfT3kS8jIdRCLXxlXQd%22%2c%22Data%22%3a%5b%7b%22Key%22%3a%22Group.gdgydxtsg.DisplayName%22%2c%22Value%22%3a%22%e5%b9%bf%e4%b8%9c%e5%b7%a5%e4%b8%9a%e5%a4%a7%e5%ad%a6%e5%9b%be%e4%b9%a6%e9%a6%86%22%7d%5d%2c%22SessionId%22%3a%2248b9071c-4bd2-4494-8444-aed2181fb7a7%22%7d%2c%22LastUpdate%22%3a%22%5c%2fDate(1334295257000%2b0800)%5c%2f%22%2c%22Sign%22%3a%22nPs4agDnJY%5c%2fVy%2bOSCMbwQQ%3d%3d%22%7d; domain=.wanfangdata.com.cn; path=/
Cache-Control private
Content-Type text/html; charset=utf-8
Content-Length 721

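Note the two key fields (highlighted in red in the original post): Location is the URL of the 302 redirect, i.e. the second request, and Set-Cookie is the cookie that must be sent along with the second request.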
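The Set-Cookie value above carries attributes (domain, path) after the first semicolon, and only the leading name=value pair belongs in the Cookie header of the next request. A minimal sketch of that trimming, assuming a single cookie per response; the class and method names are illustrative and not taken from the original downloader:

```java
// Illustrative helper (hypothetical name): turns a Set-Cookie response value
// into the value to send back in a Cookie request header.
class CookieUtil {
    static String toCookieHeader(String setCookie) {
        if (setCookie == null) {
            return null; // the response did not set a cookie
        }
        int semicolon = setCookie.indexOf(';');
        // Everything after the first ';' is attributes such as domain and path.
        String pair = semicolon >= 0 ? setCookie.substring(0, semicolon) : setCookie;
        return pair.trim();
    }
}
```

The result can be passed to connection.setRequestProperty("Cookie", ...) on the next request.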

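Step 2: Using what the server returned, I assembled and sent the second request. What I put together here was mainly the URL and the cookie; those two are usually enough. If they aren't, keep analyzing, and compare against the headers of the second request as HttpFox shows them.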

Second request URL, i.e. the Location returned by the first response:

http://tran.wanfangdata.com.cn/Transaction.aspx?webTransactionRequest=%7b%22DisplayInfo%22%3a%22http%3a%5c%2f%5c%2fd.wanfangdata.com.cn%5c%2fResourceDisplayInfo.aspx%3fid%3dPeriodical_zgqkyx2010z1006%26type%3dfulltext%22%2c%22Request%22%3a%7b%22AuthenticationContext%22%3anull%2c%22ExtraData%22%3a%5b%5d%2c%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22%2c%22TransferIn%22%3a%7b%22AccountType%22%3a%22Income%22%2c%22Key%22%3a%22PeriodicalFulltext%22%7d%2c%22TransferOut%22%3anull%2c%22Turnover%22%3a3.00000%7d%2c%22ReturnUrl%22%3a%22http%3a%5c%2f%5c%2ff.wanfangdata.com.cn%5c%2fDownload.aspx%22%7d
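The query string looks opaque, but it is only percent-encoded JSON, and decoding it makes it much easier to see which fields change between requests. A small sketch using an excerpt copied from the URL above; the class name is made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeParam {
    // Percent-decodes a query-string value so the JSON inside becomes readable.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Excerpt of the webTransactionRequest value shown above.
        String excerpt = "%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22";
        System.out.println(decode(excerpt));
        // prints "ProductDetail":"Periodical_zgqkyx2010z1006" (including the quotes)
    }
}
```

Keep in mind that URLDecoder also turns a literal + into a space; in the captured URLs the plus sign is already escaped as %2b, so it survives decoding.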

Response:
(Status-Line) HTTP/1.1 302 Found
Date Fri, 13 Apr 2012 05:34:18 GMT
Server Microsoft-IIS/6.0
X-Powered-By ASP.NET
X-AspNet-Version 4.0.30319
Location http://f.wanfangdata.com.cn/Download.aspx?transaction=%7b%22ExtraData%22%3a%5b%5d%2c%22Transaction%22%3a%7b%22DateTime%22%3a%22%5c%2fDate(1334295258451%2b0800)%5c%2f%22%2c%22Id%22%3a%22b3aa6b38-84ee-479a-98a5-a03200dfa800%22%2c%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22%2c%22SessionId%22%3a%2248b9071c-4bd2-4494-8444-aed2181fb7a7%22%2c%22Signature%22%3a%22FcoYhEyWToFJeoVgX4AIp612iOz%5c%2fRDUJPW9WS4QjsKWG9mnuHUY%2b3pz1s6ruZm4Z%22%2c%22TransferIn%22%3a%7b%22AccountType%22%3a%22Income%22%2c%22Key%22%3a%22PeriodicalFulltext%22%7d%2c%22TransferOut%22%3a%7b%22AccountType%22%3a%22GTimeLimit%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%2c%22Turnover%22%3a3.00000%2c%22User%22%3anull%7d%2c%22TransferOutAccountsStatus%22%3a%5b%5d%7d
Set-Cookie WFKS.Auth=%7b%22AuthenticationContext%22%3a%7b%22AccountIds%22%3a%5b%7b%22AccountType%22%3a%22Group%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%2c%7b%22AccountType%22%3a%22GTimeLimit%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%5d%2c%22AuthenticationSign%22%3a%22jp1nHmR0SBcKMkcB4OeuHQdQcgGJFKeLfzRQIreuFR9s8ZfT3kS8jIdRCLXxlXQd%22%2c%22Data%22%3a%5b%7b%22Key%22%3a%22Group.gdgydxtsg.DisplayName%22%2c%22Value%22%3a%22%e5%b9%bf%e4%b8%9c%e5%b7%a5%e4%b8%9a%e5%a4%a7%e5%ad%a6%e5%9b%be%e4%b9%a6%e9%a6%86%22%7d%5d%2c%22SessionId%22%3a%2248b9071c-4bd2-4494-8444-aed2181fb7a7%22%7d%2c%22LastUpdate%22%3a%22%5c%2fDate(1334295257000%2b0800)%5c%2f%22%2c%22Sign%22%3a%22nPs4agDnJY%5c%2fVy%2bOSCMbwQQ%3d%3d%22%7d; domain=.wanfangdata.com.cn; path=/
Cache-Control private
Content-Type text/html; charset=utf-8
Content-Length 31322

Same as before, this is another redirect (302). Extract the key fields from the response, assemble them, and send the third request:

http://f.wanfangdata.com.cn/Download.aspx?transaction=%7b%22ExtraData%22%3a%5b%5d%2c%22Transaction%22%3a%7b%22DateTime%22%3a%22%5c%2fDate(1334295258451%2b0800)%5c%2f%22%2c%22Id%22%3a%22b3aa6b38-84ee-479a-98a5-a03200dfa800%22%2c%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22%2c%22SessionId%22%3a%2248b9071c-4bd2-4494-8444-aed2181fb7a7%22%2c%22Signature%22%3a%22FcoYhEyWToFJeoVgX4AIp612iOz%5c%2fRDUJPW9WS4QjsKWG9mnuHUY%2b3pz1s6ruZm4Z%22%2c%22TransferIn%22%3a%7b%22AccountType%22%3a%22Income%22%2c%22Key%22%3a%22PeriodicalFulltext%22%7d%2c%22TransferOut%22%3a%7b%22AccountType%22%3a%22GTimeLimit%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%2c%22Turnover%22%3a3.00000%2c%22User%22%3anull%7d%2c%22TransferOutAccountsStatus%22%3a%5b%5d%7d
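The hops so far all follow one pattern: send a request with automatic redirects switched off, read Location and Set-Cookie, and repeat. The sketch below puts that pattern in a loop. It is one possible arrangement rather than the original downloader's code: the names RedirectChain, Hop, Fetcher, HttpFetcher and Result are hypothetical, and it assumes the server answers HEAD requests the same way it answers GET.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class RedirectChain {
    /** What one response tells us: status code plus the two headers of interest. */
    static class Hop {
        final int status;
        final String location;
        final String setCookie;
        Hop(int status, String location, String setCookie) {
            this.status = status;
            this.location = location;
            this.setCookie = setCookie;
        }
    }

    /** Where the chain ended and which cookie to send with the final request. */
    static class Result {
        final String url;
        final String cookie;
        Result(String url, String cookie) {
            this.url = url;
            this.cookie = cookie;
        }
    }

    /** "Send one request"; kept abstract so the loop can be tried without a network. */
    interface Fetcher {
        Hop fetch(String url, String cookie) throws IOException;
    }

    /** Real implementation on top of HttpURLConnection, redirects disabled. */
    static class HttpFetcher implements Fetcher {
        public Hop fetch(String url, String cookie) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD"); // assumption: HEAD behaves like GET here
            conn.setInstanceFollowRedirects(false);
            if (cookie != null) {
                conn.setRequestProperty("Cookie", cookie);
            }
            try {
                return new Hop(conn.getResponseCode(),
                        conn.getHeaderField("Location"),
                        conn.getHeaderField("Set-Cookie"));
            } finally {
                conn.disconnect();
            }
        }
    }

    /** Follows 302 responses by hand, carrying the most recent cookie along. */
    static Result resolveFinalUrl(Fetcher fetcher, String startUrl, int maxHops)
            throws IOException {
        String url = startUrl;
        String cookie = null;
        for (int i = 0; i < maxHops; i++) {
            Hop hop = fetcher.fetch(url, cookie);
            if (hop.setCookie != null) {
                // Keep only name=value; drop attributes such as domain and path.
                cookie = hop.setCookie.split(";", 2)[0].trim();
            }
            if (hop.status != 302 || hop.location == null) {
                return new Result(url, cookie); // first non-redirect response
            }
            // Location may be relative, so resolve it against the current URL.
            url = new URL(new URL(url), hop.location).toString();
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }
}
```

Calling RedirectChain.resolveFinalUrl(new RedirectChain.HttpFetcher(), firstUrl, 5) would walk a chain like the one above and hand back the URL of the first page that is not a redirect, together with the cookie to send when fetching it.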

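Step 3: At last a result page comes back (200). Copy the entire page into an .html file and open it in a text editor; I recommend Notepad++, which is very handy.
There are two pages here, so we have to go through them one at a time. After further analysis I finally found the key piece.
It is inside the first text/html response: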
iframe src="Fulltext.ashx?fileId=Periodical_zgqkyx2010z1006&type=download&transaction=%7b%22ExtraData%22%3a%5b%5d%2c%22Transaction%22%3a%7b%22DateTime%22%3a%22%5c%2fDate(1334295258451%2b0800)%5c%2f%22%2c%22Id%22%3a%22b3aa6b38-84ee-479a-98a5-a03200dfa800%22%2c%22ProductDetail%22%3a%22Periodical_zgqkyx2010z1006%22%2c%22SessionId%22%3a%2248b9071c-4bd2-4494-8444-aed2181fb7a7%22%2c%22Signature%22%3a%22FcoYhEyWToFJeoVgX4AIp612iOz%5c%2fRDUJPW9WS4QjsKWG9mnuHUY%2b3pz1s6ruZm4Z%22%2c%22TransferIn%22%3a%7b%22AccountType%22%3a%22Income%22%2c%22Key%22%3a%22PeriodicalFulltext%22%7d%2c%22TransferOut%22%3a%7b%22AccountType%22%3a%22GTimeLimit%22%2c%22Key%22%3a%22gdgydxtsg%22%7d%2c%22Turnover%22%3a3.00000%2c%22User%22%3anull%7d%2c%22TransferOutAccountsStatus%22%3a%5b%5d%7d" frameborder="0" width="0" height="0"

This is the target URL. Hard work pays off: found it at last.
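Finding the iframe by eye works once; in a program the same step can be done with a regular expression, followed by resolving the relative src against the address of the page it came from. A minimal sketch with hypothetical names (IframeSrc, extract, toAbsolute). It assumes the page has no base tag that would change how relative URLs resolve, and that the first iframe on the page is the one we want:

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IframeSrc {
    // Captures the src attribute of an <iframe> tag, single or double quoted.
    private static final Pattern IFRAME_SRC = Pattern.compile(
            "<iframe\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the src of the first iframe in the page, or null if there is none. */
    static String extract(String html) {
        Matcher m = IFRAME_SRC.matcher(html);
        // Raw HTML often writes '&' as "&amp;" inside attribute values; undo that.
        return m.find() ? m.group(1).replace("&amp;", "&") : null;
    }

    /** Resolves a possibly relative src against the URL of the containing page. */
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }
}
```

With the Download.aspx URL from the previous step as pageUrl, a src that starts with Fulltext.ashx resolves to an address under http://f.wanfangdata.com.cn/.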

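Step 4:
That is the analysis of a single download link. In practice it usually takes many rounds of analysis and experiment to get there. The next job is to work out the structure of the search request, which is simple. For example, mine was:
http://s.wanfangdata.com.cn/Paper.aspx?q=%E8%82%9D%E7%97%85&p=174
Here q=%E8%82%9D%E7%97%85 is q=肝病 (liver disease) in URL-encoded form; search Baidu for "URL解码" (URL decoding) to find a tool that converts these codes. p is the page number.
These URLs return the corresponding HTML pages. Extract every download link from them, repeat the procedure above for each one, and you can download everything a search keyword returns. It is just a for loop.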
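The two building blocks of that loop might look like the sketch below: one method builds the search URL for a given keyword and page, the other pulls download links out of a result page. The names are hypothetical, and the link pattern is an assumption: it presumes result pages link to downloads in the same form as the first request URL analysed earlier (http://f.wanfangdata.com.cn/download/....aspx), which should be checked against a real result page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchPages {
    // Assumed link shape, modelled on the first request URL in this post.
    private static final Pattern DOWNLOAD_LINK = Pattern.compile(
            "http://f\\.wanfangdata\\.com\\.cn/download/[^\"'\\s<>]+\\.aspx");

    /** Builds the search URL for one result page; the keyword is encoded as UTF-8. */
    static String buildSearchUrl(String keyword, int page) {
        try {
            return "http://s.wanfangdata.com.cn/Paper.aspx?q="
                    + URLEncoder.encode(keyword, "UTF-8") + "&p=" + page;
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Collects each distinct download link in the page, in order of appearance. */
    static List<String> extractDownloadLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = DOWNLOAD_LINK.matcher(html);
        while (m.find()) {
            if (!links.contains(m.group())) {
                links.add(m.group());
            }
        }
        return links;
    }
}
```

The outer loop then runs p over the result pages, fetches buildSearchUrl(keyword, p), and feeds every link from extractDownloadLinks through the redirect analysis described above. Whether page numbering starts at 1 is not stated in this post, so verify it in the browser first.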

Key classes and methods in Java:
Getting the Location header:
// Needs: import java.net.HttpURLConnection; import java.net.URL;
URL url = new URL(urlString);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("HEAD");
connection.setInstanceFollowRedirects(false); // do not follow redirects, so the Location header is returned to us
connection.connect();
String location = connection.getHeaderField("Location");
String cookie = connection.getHeaderField("Set-Cookie");
// URL encoding and decoding (java.net.URLEncoder, java.net.URLDecoder):
URLEncoder.encode(key, "UTF-8");   // encode
URLDecoder.decode(title, "UTF-8"); // decode
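The snippet above stops at reading headers. The last step of the approach, getting the file stream onto disk, could be sketched as follows. This is an illustration with hypothetical names (PdfSaver, isPdf, save, download), not the original downloader; it checks for the application/pdf Content-Type identified in Step 1 before writing anything.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PdfSaver {
    /** True when a Content-Type value denotes a PDF, ignoring parameters and case. */
    static boolean isPdf(String contentType) {
        return contentType != null
                && contentType.split(";", 2)[0].trim().equalsIgnoreCase("application/pdf");
    }

    /** Copies the whole stream to target, replacing an existing file; returns the byte count. */
    static long save(InputStream body, Path target) throws IOException {
        return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Fetches finalUrl with the given cookie and saves the body only if it is a PDF. */
    static boolean download(String finalUrl, String cookie, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(finalUrl).openConnection();
        if (cookie != null) {
            conn.setRequestProperty("Cookie", cookie);
        }
        try {
            if (conn.getResponseCode() != 200 || !isPdf(conn.getContentType())) {
                return false; // an HTML error or login page, not the paper
            }
            try (InputStream in = conn.getInputStream()) {
                save(in, target);
                return true;
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Returning false instead of saving whatever comes back avoids filling the output folder with HTML pages named .pdf when a session has expired.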

Reposted from: https://my.oschina.net/daxia/blog/53482
