Scraper Notes

gbstack08

Tags: webbrowser, Firefox extension, browser, firebug, google

Article link: https://blog.csdn.net/gbstack08/article/details/7332853

I needed the list of external links from webmaster, but the webmaster API offers no way to fetch that data, so the only option was to write a scraper.

I had never written a scraper that requires a user login, so I started with a few simple sites.

First I tried bccn. The POST data consists of username and password, and the login worked.
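The post does not show its request code, and its later wording (WebException, webbrowser) suggests .NET. Purely as an illustration, here is a minimal Python sketch of a form login POST. The URL `https://example.com/login` and the credentials are placeholders; only the field names `username` and `password` come from the text, and the helper name `build_login_request` is made up for this example.

```python
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build (but do not send) a form-encoded login POST."""
    # Field names follow the post; a real site may expect different ones.
    form = {"username": username, "password": password}
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(
        login_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL and credentials, not the real bccn endpoint.
    req = build_login_request("https://example.com/login", "alice", "secret")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(req)` would actually send it; the sketch stops short of that so it can be run without touching a real site.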

The Baidu and Google login pages both use https, and the POST fails with an error. The fix is described here:

http://stackoverflow.com/questions/560804/how-do-i-use-webrequest-to-access-an-ssl-encrypted-site-using-https

Even so, the POST to Baidu still fails with: The underlying connection was closed: The connection was closed unexpectedly.

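The POST to Google kept returning the login page itself. After reading this article (http://everydayscripting.blogspot.com/2009/10/python-fixes-to-google-login-script.html) I learned that two fields in Google's POST data have to be extracted from the login page first: dsh and GALX.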
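One possible way to pull such values out of a page is sketched below in Python. It assumes that dsh and GALX appear as hidden `<input>` elements, which the post does not state explicitly, and the sample markup is invented for the example. `HiddenInputParser` and `extract_hidden_fields` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    # Made-up markup; the real login page layout is not shown in the post.
    sample = """
    <form method="post">
      <input type="hidden" name="dsh" value="-123456">
      <input type="hidden" name="GALX" value="AbCdEf">
      <input type="text" name="Email">
    </form>
    """
    print(extract_hidden_fields(sample))
```

Collecting every hidden field, rather than only the two named ones, means the form can simply be posted back with the credentials added on top.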

After extracting them and sending the POST, the response was: Your browser's cookie functionality is turned off. Please turn it on.

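The Set-Cookie header of this response contains only GAPS, whereas the Set-Cookie of the POST response captured by Firebug contains NID, SID, LSID, SSID, HSID and APISID. Looking at the login POST request that Firebug captured, its cookies already include GAPS, while my request set no cookies at all. So the server probably saw that my request carried no GAPS cookie and concluded that cookies were turned off in my browser.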
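The comparison described here (cookies the browser received versus cookies the scraper received) can be automated. The Python sketch below is only an illustration: the helper names are hypothetical, and the cookie value is a placeholder. It relies on the fact that the first `name=value` pair of a Set-Cookie header names the cookie.

```python
def cookie_names(set_cookie_headers):
    """Return the set of cookie names found in a list of Set-Cookie values."""
    names = set()
    for header in set_cookie_headers:
        # The cookie itself is the first name=value pair; attributes follow.
        first_pair = header.split(";", 1)[0]
        name, sep, _value = first_pair.partition("=")
        if sep and name.strip():
            names.add(name.strip())
    return names


def missing_cookies(expected, set_cookie_headers):
    """Return the expected cookie names absent from the response, sorted."""
    return sorted(set(expected) - cookie_names(set_cookie_headers))


if __name__ == "__main__":
    # Names listed in the Firebug capture, per the post.
    browser_cookies = ["NID", "SID", "LSID", "SSID", "HSID", "APISID"]
    # What the scraper got back; the value is a placeholder.
    scraper_headers = ["GAPS=placeholder; Path=/; Secure"]
    print(missing_cookies(browser_cookies, scraper_headers))
```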

I then ran an experiment: clear all cookies and visit the login page. The response's Set-Cookie is GAPS, which means the GAPS cookie has to exist before the POST is sent.

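Based on this, I first GET the login page to obtain the cookie and then send that cookie along with the following POST. The login succeeded and I landed on the Accounts Overview page, but the response contained no cookies at all.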
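The GET-then-POST sequence can be sketched in Python with a shared cookie jar, so that whatever the first response sets is sent back automatically on the second request. This is an illustration under assumptions, not the author's code: `make_session` and `get_then_post` are hypothetical helpers, and the login URL and form fields are supplied by the caller.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener plus the cookie jar it reads from and writes to."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def get_then_post(opener, login_url, fields, timeout=30):
    """GET the login page to collect cookies, then POST the form."""
    # Step 1: cookies set by the GET response are stored in the jar.
    with opener.open(login_url, timeout=timeout) as page:
        page.read()
    # Step 2: the opener attaches the stored cookies to the POST.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=body, timeout=timeout) as resp:
        return resp.read()
```

After the call, iterating over the jar (`[c.name for c in jar]`) shows which cookies the server handed out. In a real flow the page fetched in step 1 would also be parsed for the hidden form fields before the POST is built.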

This time the response HTML contains:

You are using an old browser version which Google accounts no longer supports. Some features may not work correctly. Please upgrade to a modern browser, such as Google Chrome.

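So the complaint is no longer about an unsupported feature but about the browser being too old. The only thing I could think of that reveals the browser version is the user agent (in JS: navigator.userAgent).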

After adding a userAgent and sending the POST again, the response was the Account settings page, but still without any Set-Cookie.
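Setting the header is a one-liner in most HTTP clients. The Python sketch below shows one way to do it; the user agent string is just an example of a desktop browser string and not the value the author used, and `build_post_with_user_agent` is a hypothetical helper.

```python
import urllib.parse
import urllib.request

# Example desktop-browser string; an assumption, not the author's value.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)


def build_post_with_user_agent(url, fields, user_agent=DEFAULT_USER_AGENT):
    """Build (but do not send) a POST that carries a browser-like User-Agent."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    # Placeholder URL and field; nothing is sent over the network.
    req = build_post_with_user_agent("https://example.com/login", {"a": "1"})
    print(req.get_header("User-agent"))
```

Without an explicit header, urllib identifies itself as `Python-urllib/<version>`, which a server doing browser sniffing would not recognize as a current browser.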

I loaded the response HTML into a webbrowser (browser.navigateToString()) and got a JS error. The page's JS is minified, so there is no practical way to read it.

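Later I noticed that one POST parameter was still missing (checkConnection=youtube:1012:1). As soon as I added it, a WebException was thrown: Unable to connect the remote server. The inner exception was: {"A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 46.82.174.68:443"}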
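A missing parameter like this is easier to spot by diffing the POST body captured in the browser against the fields the scraper sends. The Python sketch below illustrates the idea; the captured body is invented for the example (only the `checkConnection` pair comes from the post, and `Email`/`Passwd` are made-up field names), and `missing_post_fields` is a hypothetical helper.

```python
import urllib.parse


def missing_post_fields(captured_body, own_fields):
    """List (name, value) pairs in the captured body that own_fields lacks."""
    captured = urllib.parse.parse_qsl(captured_body, keep_blank_values=True)
    return [(name, value) for name, value in captured if name not in own_fields]


if __name__ == "__main__":
    # Body as it might be copied from a browser capture (invented sample).
    captured = "Email=user&Passwd=pw&checkConnection=youtube%3A1012%3A1"
    mine = {"Email": "user", "Passwd": "pw"}
    print(missing_post_fields(captured, mine))
```

Only field names are compared, since values such as tokens are expected to differ between requests.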

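My plan now is to use a webbrowser to simulate the login and then scrape. The idea is much the same as the Firefox extension I wrote earlier for youku voting. I will write it up in Scraper (2).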
