Embedding Selenium in Scrapy
Problems to solve: some pages may be rendered dynamically, but we do not want to drive every request through Selenium; each fetch of a dynamic site should be able to switch its IP proxy and request headers; and Selenium page loads are slow.
The spider file
1. Many tutorials online recommend initializing the driver inside your own spider file. That saves resources, but it cannot rotate the IP per request, so here we do not initialize the driver in the spider file:
# these two methods live inside the spider class
def __init__(self):
    self.driver = webdriver.Chrome()

def close(self, spider, reason):
    print('\033[0;31mSpider finished\033[0m')
    self.driver.quit()
__init__ runs once when the spider starts up.
close is the method called when the spider finishes.
A test spider:
import scrapy
from zhihu.items import ZhihuItem
from selenium import webdriver
import time


class zhihuSpider(scrapy.Spider):
    name = 'zhihu'

    def start_requests(self):
        url = 'https://www.zhihu.com/'
        yield scrapy.Request(
            url=url,
            callback=self.parse_first
        )
        # yield scrapy.Request(
        #     url='https://www.baidu.com/',
        #     callback=self.parse_baidu
        # )

    def parse_first(self, response):
        title = response.xpath('//ul[@class="Tabs AppHeader-Tabs"]/li/a/text()').extract()
        item = ZhihuItem()
        item['title'] = title
        print('!!!!!', title)
        yield item

    def parse_baidu(self, response):
        title = response.xpath('//span[@class="btn_wr s_btn_wr bg"]/input').extract()
        print('!!!!!', title, '!!!!!!!!!')
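With the spider written, the project runs like any other Scrapy project from the project root (the settings shown later enable the middleware and pipeline):

scrapy crawl zhihu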
The middlewares file
1. To connect the two parts we need a basic understanding of middlewares; the method we mainly modify here is process_response. Note that Scrapy already wires these methods up and documents what they may return, but we can insert our own code into them and change the data they return. (This is the downloader middleware, not the spider middleware, and it must be enabled in settings.)
For example, in process_response() we receive the response after the request has completed, but for JS-rendered pages much of the data will not be present in it. So we check the URL and, where necessary, replace those responses.
from scrapy import signals
from scrapy.http import HtmlResponse
import random
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "normal"


class ZhihuDownloaderMiddleware:
    def driver_init(self):
        driver = webdriver.Chrome()
        return driver

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    def process_request(self, request, spider):
        # self.user_agent must be a list of User-Agent strings defined in __init__
        request.headers['User-Agent'] = random.choice(self.user_agent)
        # request.meta['proxy'] = random.choice(self.proxy_list)
        # print('proxy IP:', request.meta['proxy'])
        return None

    def process_response(self, request, response, spider):
        # only render these sites with Selenium; pass everything else through untouched
        if 'zhihu' in request.url or 'baidu' in request.url:
            driver = self.driver_init()
            driver.get(url=request.url)
            # scroll to the bottom so lazily loaded content is rendered
            js = "window.scrollTo(0,document.body.scrollHeight)"
            driver.execute_script(js)
            row_response = driver.page_source
            driver.quit()
            return HtmlResponse(url=request.url, body=row_response, encoding="utf8", request=request)
        else:
            return response
Note: when replacing the response like this, HtmlResponse must be imported from scrapy.
The desired_capabilities["pageLoadStrategy"] = "normal" setting is covered in a separate blog post (link omitted).
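In newer Selenium releases the same page-load strategy can be set through Options instead of DesiredCapabilities; a minimal sketch, assuming Selenium 4, where 'eager' returns as soon as the DOM is ready and can ease the slow-page-load problem mentioned at the top:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# 'normal' waits for the full page load; 'eager' returns at DOMContentLoaded
chrome_options.page_load_strategy = 'eager'
driver = webdriver.Chrome(options=chrome_options)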
To set an IP proxy, modify the code as follows (proxy_list is defined on the middleware class):
from selenium.webdriver.chrome.options import Options  # add this import at the top of middlewares.py

# configure the driver's proxy etc. and initialize it
def __init__(self):
    self.proxy_list = [
        'http://223.242.223.132:4225', 'http://114.106.171.186:4242', 'http://117.69.185.190:4286',
        'http://171.112.89.237:4278', 'http://121.226.212.214:4264',
        'http://59.60.140.150:4242'
    ]

def driver_init(self):
    chrome_options = Options()
    # chrome_options.add_argument('--headless')  # run without opening a browser window
    ip = random.choice(self.proxy_list)
    chrome_options.add_argument('--proxy-server=' + ip)
    driver = webdriver.Chrome(chrome_options=chrome_options)  # use options= on Selenium 4
    return driver
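To confirm the proxy is actually applied, a quick standalone sanity check can fetch an IP echo endpoint through the middleware's driver (httpbin.org/ip is just one public echo service; any equivalent works):

middleware = ZhihuDownloaderMiddleware()
driver = middleware.driver_init()
driver.get('https://httpbin.org/ip')
print(driver.page_source)  # should show one of the proxy IPs, not your own
driver.quit()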
2. Adding cookies to the driver (using Zhihu as the example):
Adding cookies is not hard in itself, but first we have to capture our own cookies. Here is a simple script that grabs Zhihu cookies:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "normal"


def cookie_get(url):
    cookie_list = []
    while 1:
        choice = input('Grab cookies? (y/n)')
        if choice == 'y':
            driver = webdriver.Chrome()
            driver.get(url)
            time.sleep(30)  # log in manually within 30 s (third-party login works too)
            cookies = driver.get_cookies()  # grab the cookies
            print(cookies)
            cookie_list.append(cookies)
            driver = webdriver.Chrome()
            driver.get(url)  # try logging in directly with the cookies
            for cook in cookies:
                driver.add_cookie(cook)
            driver.refresh()
            driver.close()
        else:
            break
    print('Automatic login done, cookies captured')
    return cookie_list


if __name__ == '__main__':
    cookie_list = cookie_get('https://www.zhihu.com/signin?next=%2F')
    print(cookie_list)
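Rather than copying the printed list into the middleware by hand, the captured cookies can also be dumped to a file; a minimal sketch using the standard json module (the file name cookies.json is arbitrary):

import json

# append to the __main__ block above: save the captured cookies for the downloader middleware
with open('cookies.json', 'w', encoding='utf-8') as f:
    json.dump(cookie_list, f, ensure_ascii=False)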
Note that the captured cookies are printed as a list of dictionaries; when using them we loop over the list and add each dictionary to the driver:
for cook in cookies:
    driver.add_cookie(cook)
Still in the downloader middleware, initialize your own cookie_list, and after each driver request add the cookies and refresh the page. (Notice that every page fetched this way has to go through the login step again: each new page handed to the downloader middleware repeats the login, which hurts download speed considerably.)
# in __init__: one inner list of cookie dicts per account (real cookies omitted here)
self.cookie_list = [
    []
]

def process_response(self, request, response, spider):
    if 'zhihu' in request.url:
        driver = self.driver_init()
        driver.get(url=request.url)
        # add the cookies for a random account, then reload so the page is fetched logged in
        cookies = random.choice(self.cookie_list)
        for cookie in cookies:
            driver.add_cookie(cookie)
        driver.refresh()
        # time.sleep(10)
        js = "window.scrollTo(0,document.body.scrollHeight)"
        driver.execute_script(js)
        row_response = driver.page_source
        driver.quit()
        return HtmlResponse(url=request.url, body=row_response, encoding="utf8", request=request)
    else:
        return response
Note: cookie_list must be a list that contains lists, which in turn contain the cookie dictionaries; otherwise the login will fail.
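If the cookies were saved to a file as sketched earlier, the middleware's __init__ can read them back instead of hard-coding the nested list (cookies.json is the hypothetical file name used above):

import json  # add at the top of middlewares.py

# inside ZhihuDownloaderMiddleware
def __init__(self):
    self.proxy_list = [...]  # proxy list as before
    with open('cookies.json', 'r', encoding='utf-8') as f:
        # json.load gives back the same list-of-lists-of-dicts structure described above
        self.cookie_list = json.load(f)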
The settings file
BOT_NAME = 'zhihu'
SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'zhihu.middlewares.ZhihuDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}
The items file
import scrapy


class ZhihuItem(scrapy.Item):
    title = scrapy.Field()
The pipelines file
import xlwt


class ZhihuPipeline:
    def __init__(self):
        self.row = 1
        # 1. create the workbook object
        self.book = xlwt.Workbook(encoding='utf-8')
        self.sheet = self.book.add_sheet('CVE', cell_overwrite_ok=True)
        self.sheet.write(0, 0, 'title')

    def process_item(self, item, spider):
        self.sheet.write(self.row, 0, item['title'])
        self.row += 1
        self.book.save('output.xls')
        return item
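Calling book.save() for every item rewrites the whole .xls file each time; if that becomes a bottleneck, a common alternative is to drop the save from process_item and write the workbook once when the spider closes, using the standard close_spider pipeline hook:

# inside ZhihuPipeline
def close_spider(self, spider):
    # write the workbook once, after all items have been collected
    self.book.save('output.xls')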