Learning Scrapy, Part 5: An Approach to Crawling Behind a Login: Selenium

Many websites require a login, each with its own login flow. Selenium can simulate logging in to a site and clicking through its UI events, which makes this a crawling approach that is relatively hard to block.

Prerequisites:

(1) First, pull in the selenium and requests packages: add both to requirements.txt and run pip install. Be sure to run it inside the venv, otherwise you may get errors.
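For reference, a minimal requirements.txt and install command might look like this (the version pins are illustrative, not from the original post; Selenium 3.x matches the find_element_by_xpath API used later):

    selenium==3.141.0
    requests==2.22.0

    # run inside the activated venv
    pip install -r requirements.txt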

(2) Install the matching chromedriver: download it from http://npm.taobao.org/mirrors/chromedriver/. The chromedriver version must match the version of Chrome installed on your machine, otherwise you will get a version-mismatch error.
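A quick way to verify the pairing is a minimal smoke test. This sketch assumes Selenium 3.x and an example chromedriver path; adjust both to your setup:

    from selenium import webdriver

    # example path; point this at the chromedriver you downloaded
    driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
    driver.get('https://example.com')
    print(driver.title)  # prints the page title if driver and Chrome match
    driver.quit()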

Because the downloader middlewares are responsible for downloading pages, we implement the simulated login by customizing the downloader middleware API. The steps are as follows:

(1) Most login flows rely on cookies, so enable cookies in settings.py:

         COOKIES_ENABLED = True

If the crawl fails with errors such as robots.txt not being found, you can disable ROBOTSTXT_OBEY:

        ROBOTSTXT_OBEY = False 

Enable DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'projectName.middlewares.ProjectNameDownloaderMiddleware': 543,
}

Note that the exact entry varies with your project name.
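Putting these together, the relevant excerpt of settings.py would read (with 'projectName' standing in for your own project):

    # settings.py (excerpt)
    COOKIES_ENABLED = True
    ROBOTSTXT_OBEY = False
    DOWNLOADER_MIDDLEWARES = {
        'projectName.middlewares.ProjectNameDownloaderMiddleware': 543,
    }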

(2) Set the login entry point:

As described in https://mp.csdn.net/postedit/103910538, you can use the login URL as the entry point of the site crawl.
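As a sketch (the spider name, domain, and login URL below are placeholders, not from the original post), the spider simply starts at the login page:

    import scrapy

    class ProjectNameSpider(scrapy.Spider):
        name = 'projectName'                        # placeholder spider name
        allowed_domains = ['example.com']           # placeholder domain
        start_urls = ['https://example.com/login']  # login URL as the entry point

        def parse(self, response):
            # by the time a response reaches here, the downloader
            # middleware below has already performed the Selenium login
            pass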

(3) Modify the ****DownloaderMiddleware(object) class in middlewares.py to handle the login request:

Modify the process_request(self, request, spider) method:

Check whether the URL is the login URL. If it is, start a webdriver, locate the username and password input boxes, fire a click event, type in the username and password, and simulate clicking the login button to complete the login.

For non-login URLs, carry the login cookies along by loading them into a requests session and sending the request through that session.

The full code example is as follows:

      

import time

import requests
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class MaimaispiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        #
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # check which spider the request belongs to
        if spider.name == 'projectName':  # change this to your own spider name
            # is this the login request?
            if request.url.find('login') != -1:  # adapt this check to your login URL
                spider.driver = webdriver.Chrome()
                spider.driver.get(request.url)
                # click the phone/username input box to give it focus
                spider.driver.find_element_by_xpath(
                    '/html/body/div[@class="wrap"]/div[@class="matter clearfix"]/div[@class="content ft"]/div[@class="contactInfor loginBox"]/form[@id="form"]/div[@class="arrow clearfix"]/div[@class="loginPhone"]/input[@class="loginPhoneInput"]'
                ).click()
                time.sleep(2)
                # type the account name and password
                username = spider.driver.find_element_by_xpath(
                    '//*[@class="loginPhoneInput"]')
                password = spider.driver.find_element_by_xpath(
                    '//*[@id="login_pw"]')
                username.send_keys('*************')  # replace with your own user name
                password.send_keys('*************')  # replace with your own password
                # simulate clicking the "log in" button
                spider.driver.find_element_by_xpath(
                    '//*[@class="loginBtn"]').click()
                time.sleep(3)
                # keep the session cookies for the non-login requests below
                spider.cookies = spider.driver.get_cookies()
                return HtmlResponse(
                    url=spider.driver.current_url,   # URL after login
                    body=spider.driver.page_source,  # HTML source
                    encoding='utf-8')

            # not a login request
            else:
                req = requests.session()
                for cookie in spider.cookies:
                    req.cookies.set(cookie['name'], cookie['value'])
                req.headers.clear()
                newpage = req.get(request.url)
                print(request.url)
                print(newpage.text)

                # to trigger lazy-loaded content you could scroll the page:
                # spider.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
                # time.sleep(2)
                # spider.driver.execute_script("window.scrollTo(0, 0)")

                return HtmlResponse(url=request.url,
                                    body=newpage.text,
                                    encoding="utf-8")

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        #
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        #
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    # def closeSpider(self):
    #     self.driver.quit()

 

Tip:

After the crawl finishes, close Chrome by adding a closeSpider method. In the DownloaderMiddleware's from_crawler method, subscribe to the spider_closed signal:

    crawler.signals.connect(s.closeSpider, signal=signals.spider_closed)

Then implement the closeSpider method:

    def closeSpider(self, spider):
        spider.driver.quit()
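Putting the two signal hookups together, from_crawler would then look like this sketch (assuming the signals import shown in the middleware above):

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        # quit the Selenium browser once the spider has finished
        crawler.signals.connect(s.closeSpider, signal=signals.spider_closed)
        return s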

 

 

 

 

 
