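After several days of crawling, today I started on simulated login.
The idea behind crawling with a simulated login is actually simple: register an account, replace the browser's login step with a login request that you send yourself, keep the session once the login succeeds, and then crawl the links you need.
As I understand it, the flow is roughly as follows.
Send a request (containing the login information) -> the server verifies it and returns a response -> receive the response; if the login succeeded, continue crawling, otherwise track down the problem.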
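The flow above can be sketched without any crawling framework. This is only a minimal sketch using the Python standard library, not the approach used later in this post: the form field names `username` and `password`, the `/login` path, and the success marker are all assumptions that depend on the target site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies, so the login state is kept."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request that carries the login information.

    The field names 'username' and 'password' are assumptions; use the
    names found in the real login form.
    """
    data = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=data.encode("utf-8"), method="POST")


def login_succeeded(body, marker):
    """Decide success by looking for text that only a logged-in page shows."""
    return marker in body


def crawl_after_login(opener, login_request, marker, urls):
    """Send the login request, check the response, then fetch the target pages."""
    with opener.open(login_request) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    if not login_succeeded(body, marker):
        raise RuntimeError("login failed; check form fields, headers and cookies")
    pages = {}
    for url in urls:
        # The same opener is reused, so the session cookies are sent again.
        with opener.open(url) as resp:
            pages[url] = resp.read()
    return pages
```

Reusing one opener for both the login and the later page requests is what keeps the state here; nothing touches the network until `crawl_after_login` is called.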
This is the only example I found online, but when I tried it, it still did not work:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule


class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login'
    start_urls = ['http://www.domain.com/useful_page/',
                  'http://www.domain.com/another_useful_page/']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass
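By the way, this code is taken from:
http://outofmemory.cn/code-snippet/16528/scrapy-again-to-code
http://www.sharejs.com/codes/python/8544
The two are exactly identical. I am speechless. Sure, everybody copies from everybody, but at least credit the source.
Then I tried writing my own version, and after a long debugging session it finally worked.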
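# Import paths for Scrapy 1.0 and later
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class testSpier(CrawlSpider):
    name = 'abc'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    rules = (
        Rule(LinkExtractor(allow='abc', deny='detail'),),
        Rule(LinkExtractor(allow='id=',), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Fetch the page that contains the login form first.
        return [Request("http://www.abc.com", callback=self.post_login)]

    def post_login(self, response):
        # Fill in the form found in the response and submit it.
        return [FormRequest.from_response(response,
                    formdata={'username': 'abc', 'password': 'abc'},
                    callback=self.after_login,
                    dont_filter=True
                    )]

    def after_login(self, response):
        # Logged in: queue the start URLs again so the rules are applied to them.
        # Request(url, dont_filter=True) is what the older helper
        # self.make_requests_from_url(url) returned.
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

    def parse_item(self, response):
        ...  # omitted in the original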
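A few things to watch out for:
First, besides the username and password, the request parameters often include a key, token, or verification ID of some kind. You need to find the hidden value in the login page and add it to the request data, for example the _token in an input element.
Second, headers: sometimes you need to disguise the request headers.
Third, cookies: sometimes the cookies have to be kept across requests.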
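The first two points can be illustrated with a small sketch that pulls hidden inputs out of a login page and merges them with the credentials. It uses only the standard library; the field names `username` and `password`, the header values, and the sample markup in the usage note are assumptions for illustration, not taken from any real site.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_form(login_html, username, password):
    """Merge the hidden fields of the login page with the credentials."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    form = dict(parser.hidden)
    # 'username' and 'password' are assumed field names.
    form.update({"username": username, "password": password})
    return form


# Hypothetical browser-like headers to send along with the login request.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "http://www.abc.com",
}
```

For a page containing `<input type="hidden" name="_token" value="t0k3n">`, `build_login_form(page, "abc", "abc")` returns a dict that includes the `_token` entry next to the credentials, ready to be posted as form data.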
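Addendum:
When testing, you can send a FormRequest directly to check whether the login works.
In the real spider, however, you need to request the URL first and then pass the response on (FormRequest.from_response); that way the login state is kept. I still have to look into exactly why.
What a trap.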
References:
http://ju.outofmemory.cn/entry/105646
http://my.oschina.net/chengye/blog/124162?p=2#comments
http://www.tuicool.com/articles/3y6ba2
PS: This post is only for record-keeping and discussion, not for any other use.