Polishing Your Scrapy Project
You've probably all seen the recent news about a crawler crashing someone's server! Write well-behaved crawlers and keep the web a friendly place!
Today we'll polish up our spider by writing a User-Agent middleware and a proxy IP middleware.
The two middlewares are quite similar, so straight to the code! (This goes in your project's middlewares.py file.)
# Import the User-Agent list from settings.py
from .settings import MY_USER_AGENT
import random

# Middleware that picks a random User-Agent for each request
class GuaziMyUserAgentMiddleware(object):
    # Called once for every outgoing request
    def process_request(self, request, spider):
        UA = random.choice(MY_USER_AGENT)
        if UA:
            print(UA)  # debug: show which UA was picked
            # Set the User-Agent header (setdefault won't override one set explicitly on the request)
            request.headers.setdefault('User-Agent', UA)
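The import above assumes a MY_USER_AGENT list defined in settings.py. The post doesn't show it, but a minimal sketch would look like this (the UA strings here are just illustrative samples; use whatever pool you like):

# settings.py -- the name MY_USER_AGENT must match the import in middlewares.py
MY_USER_AGENT = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0',
]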
# Proxy IP middleware
class ProxyMiddleware(object):
    def __init__(self):
        # A pool of proxies to rotate through
        self.proxyList = ['36.250.69.4:80', '58.18.52.168:3128', '58.253.238.243:80',
                          '60.191.164.22:3128', '60.191.167.93:3128']

    def process_request(self, request, spider):
        # Pick a random proxy from the pool
        pro_adr = random.choice(self.proxyList)
        print(pro_adr)  # debug: show which proxy was picked
        request.meta['proxy'] = 'http://' + pro_adr
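Before pointing these at a real site, it's worth confirming they actually take effect; note also that public proxies like the ones in proxyList go stale quickly, so expect to refresh that list often. A quick sanity-check spider (my own sketch, not from the original project; httpbin.org simply echoes back the headers and origin IP it receives):

import scrapy

class CheckSpider(scrapy.Spider):
    name = 'check'
    # httpbin echoes the request headers / client IP, so the rotation is visible
    start_urls = ['http://httpbin.org/headers', 'http://httpbin.org/ip']

    def parse(self, response):
        # The echoed User-Agent and origin IP should vary across requests
        self.logger.info(response.text)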
With these middlewares in place, Scrapy gets quite a bit more robust!
Next, two middlewares for sites that render their content with JavaScript. Both boil down to driving a real browser (selenium and pyppeteer).
# Selenium middleware
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        # Only render requests coming from the target spider
        if spider.name == 'T_T':
            chrome_options = Options()
            chrome_options.add_argument('--headless')      # run Chrome without a window
            chrome_options.add_argument('--disable-gpu')
            chrome_options.add_argument('--no-sandbox')
            # Path to the chromedriver executable (Selenium 3.x style)
            self.driver = webdriver.Chrome(executable_path='chromedriver.exe', chrome_options=chrome_options)
            self.driver.get(request.url)
            time.sleep(3)  # crude wait for the page's JS to finish rendering
            html = self.driver.page_source
            self.driver.quit()
            # Returning an HtmlResponse short-circuits Scrapy's own downloader
            return scrapy.http.HtmlResponse(url=request.url, body=html.encode('utf-8'),
                                            encoding='utf-8', request=request)
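One thing to watch: the version above launches and quits a fresh Chrome for every single request, which is slow. A common refinement (my own sketch, not from the original post; it assumes the same Selenium 3.x setup and a chromedriver.exe on the path) is to keep one driver alive for the whole crawl and close it via the spider_closed signal:

import scrapy
import time
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumPoolMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # Close the shared browser when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        # One browser instance shared across all requests
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', chrome_options=chrome_options)

    def process_request(self, request, spider):
        if spider.name != 'T_T':
            return None  # fall through to Scrapy's normal downloader
        self.driver.get(request.url)
        time.sleep(3)  # crude wait for JS to render
        return scrapy.http.HtmlResponse(url=request.url, body=self.driver.page_source.encode('utf-8'),
                                        encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()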
# Pyppeteer middleware
import asyncio
import pyppeteer
from pyppeteer import launch

class PyppeteerMiddleware(object):
    def __init__(self):
        self.loop = asyncio.get_event_loop()

    def process_request(self, request, spider):
        # Run the async rendering coroutine to completion on our event loop
        content = self.loop.run_until_complete(self._process_request(request, spider))
        return scrapy.http.HtmlResponse(url=request.url, body=content.encode('utf-8'),
                                        encoding='utf-8', request=request)

    async def _process_request(self, request, spider):
        browser = await launch(headless=True, slowMo=2, logLevel=4,
                               ignoreHTTPSErrors=True, dumpio=True, args=['--no-sandbox'])
        # Reuse the blank tab the browser opens with
        pages = await browser.pages()
        page = pages[0]
        try:
            await page.goto(url=request.url, options={'timeout': 10000, 'waitUntil': ['networkidle2']})
        except pyppeteer.errors.TimeoutError:
            pass  # fall through and grab whatever has rendered so far
        content = await page.content()
        await asyncio.sleep(2)
        await browser.close()
        return content
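As written, this middleware renders every request of every spider through Chromium (which pyppeteer also downloads on the first launch() call). A small refinement, my own addition mirroring the spider.name check in the Selenium version, is to let non-target requests fall through; this is a drop-in replacement for the process_request method above:

    def process_request(self, request, spider):
        if spider.name != 'T_T':
            return None  # let Scrapy's normal downloader handle this request
        content = self.loop.run_until_complete(self._process_request(request, spider))
        return scrapy.http.HtmlResponse(url=request.url, body=content.encode('utf-8'),
                                        encoding='utf-8', request=request)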
Once the middlewares are written, the last step is enabling them in settings.py!
DOWNLOADER_MIDDLEWARES = {
    'guazi.middlewares.GuaziDownloaderMiddleware': 543,
    # Disable the built-in User-Agent middleware (default priority 500) and use our random one
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'guazi.middlewares.GuaziMyUserAgentMiddleware': 400,
    # Enable the proxy IP middleware (it runs before HttpProxyMiddleware at 110)
    'guazi.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Drive a browser for dynamic pages (for heavy JS rendering, scrapy_splash is arguably a better fit)
    'guazi.middlewares.SeleniumMiddleware': 300,
    # 'guazi.middlewares.PyppeteerMiddleware': 300,
}
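To tie it all together, here is a minimal spider sketch that exercises the whole chain. The name 'T_T' matches the check inside SeleniumMiddleware; the start URL is just a placeholder guess based on the project name (guazi), so swap in your own target:

import scrapy

class TTSpider(scrapy.Spider):
    name = 'T_T'  # must match the spider.name check in SeleniumMiddleware
    start_urls = ['https://www.guazi.com/']  # placeholder target, replace with yours

    def parse(self, response):
        # response.text is the browser-rendered HTML returned by the middleware
        self.logger.info('rendered page length: %d', len(response.text))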
That's it! Scraping data from here on will be a lot more pleasant!