Many websites require login and implement their own login logic. With Selenium you can simulate logging in to a site and clicking its controls, which makes this a crawling approach that is fairly hard to block.
Prerequisites:
(1) First, the selenium and requests packages are needed. Add both to requirements.txt, then run pip install inside the venv. Be sure to run it inside the venv, or the install may fail.
(2) Install the matching chromedriver: download it from http://npm.taobao.org/mirrors/chromedriver/. The version must match the Chrome version installed on your machine, otherwise you will get a version-mismatch error.
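To verify both prerequisites, you can start a browser once outside of Scrapy. This is a minimal sketch against the Selenium 3 API used throughout this article; the chromedriver path and the test URL are placeholders (if chromedriver is already on your PATH, the argument can be omitted):

from selenium import webdriver

# executable_path points at the chromedriver downloaded above;
# a version mismatch with Chrome will raise an error right here
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://www.example.com')
print(driver.title)
driver.quit()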
Since the downloader middlewares are responsible for downloading pages, we implement the simulated login by modifying the downloader middleware API. The concrete steps are as follows:
(1) Most logins rely on a cookie mechanism, so cookies must be enabled in settings.py:
COOKIES_ENABLED = True
If the crawl fails with errors such as robots.txt not found, you can turn off ROBOTSTXT_OBEY:
ROBOTSTXT_OBEY = False
Enable DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'projectName.middlewares.ProjectNameDownloaderMiddleware': 543,
}
Note that the exact value depends on your project name.
(2) Set the login entry point:
Use the login URL as the spider's entry point (see https://mp.csdn.net/postedit/103910538), as in the sketch below.
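A minimal spider sketch; the spider name and login URL here are placeholders to be replaced with your own:

import scrapy

class ProjectNameSpider(scrapy.Spider):
    name = 'projectName'  # must match the name checked in the middleware below
    # Start from the login page so the first request the middleware
    # sees is the login request
    start_urls = ['https://www.example.com/login']

    def parse(self, response):
        # the HtmlResponse assembled by the middleware arrives here
        pass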
(3) Modify the ****DownloaderMiddleware(object) class in middlewares.py to handle the login request:
Modify the process_request(self, request, spider) method:
Check whether the URL is the login URL. If it is, start a webdriver, locate the username and password boxes, trigger the click events, type in the username and password, and simulate clicking the login button to log in.
For non-login URLs, send the request through a requests session that carries the login cookies, so the login state is preserved.
The full code example:
import time

import requests
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class MaimaispiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Check which spider is running
        if spider.name == 'projectName':  # change to your own project name
            # Check whether this is the login request
            if request.url.find('login') != -1:  # adapt the condition to your login URL
                spider.driver = webdriver.Chrome()
                spider.driver.get(request.url)
                spider.driver.find_element_by_xpath(
                    '/html/body/div[@class="wrap"]/div[@class="matter clearfix"]/div[@class="content ft"]/div[@class="contactInfor loginBox"]/form[@id="form"]/div[@class="arrow clearfix"]/div[@class="loginPhone"]/input[@class="loginPhoneInput"]'
                ).click()
                time.sleep(2)
                # Simulate typing the username and password
                username = spider.driver.find_element_by_xpath(
                    '//*[@class="loginPhoneInput"]')
                password = spider.driver.find_element_by_xpath(
                    '//*[@id="login_pw"]')
                username.send_keys('*************')  # replace with your own username
                password.send_keys('*************')  # replace with your own password
                # Simulate clicking the "login" button
                spider.driver.find_element_by_xpath(
                    '//*[@class="loginBtn"]').click()
                time.sleep(3)
                # Keep the session cookies for the following requests
                spider.cookies = spider.driver.get_cookies()
                return HtmlResponse(
                    url=spider.driver.current_url,  # URL after login
                    body=spider.driver.page_source,  # HTML source
                    encoding='utf-8')
            # Not the login request
            else:
                # Replay the request with the login cookies attached
                req = requests.session()
                for cookie in spider.cookies:
                    req.cookies.set(cookie['name'], cookie['value'])
                req.headers.clear()
                newpage = req.get(request.url)
                print(request.url)
                print(newpage.text)
                # Optionally scroll the page to trigger lazy-loaded content, e.g.:
                # spider.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
                # time.sleep(2)
                # spider.driver.execute_script("window.scrollTo(0, 0)")
                return HtmlResponse(url=request.url,
                                    body=newpage.text,
                                    encoding='utf-8')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    # def closeSpider(self):
    #     self.driver.quit()
Tip:
To close Chrome once the crawl is finished, implement the closeSpider method as follows.
In the DownloaderMiddleware's from_crawler method, subscribe to the spider_closed signal:
crawler.signals.connect(s.closeSpider, signal=signals.spider_closed)
Then implement the closeSpider method:
def closeSpider(self, spider):
    spider.driver.quit()
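Putting the two subscriptions together, from_crawler in the middleware above would then look like this sketch:

@classmethod
def from_crawler(cls, crawler):
    s = cls()
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    # quit Chrome automatically when the spider finishes
    crawler.signals.connect(s.closeSpider, signal=signals.spider_closed)
    return s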