Python Phase 2 Study Notes - Week 7 Summary
I. Introduction to the Scrapy Framework
1. Overview
Scrapy is a very popular Python-based web-crawling framework that can be used to crawl websites and extract structured data from their pages.
2. Basic architecture
Scrapy is built around an engine that coordinates a scheduler, a downloader, the spiders, and the item pipelines, with downloader middlewares and spider middlewares sitting in between.
3. Components
- Engine: controls the data flow between all the other components.
- Scheduler: queues the requests it receives from the engine.
- Downloader: fetches pages and hands the responses back to the engine.
- Spiders: user-written classes that parse responses and produce items or further requests.
- Item Pipelines: process (clean, validate, persist) the items the spiders produce.
- Downloader/Spider Middlewares: hooks between the engine and the downloader/spiders.
4. Data processing flow
The engine takes the start URLs from the spider and schedules them; the downloader fetches each request and returns the response to the spider's parse method, which yields items (passed on to the pipelines) and new requests (sent back to the scheduler). This repeats until no requests remain.
II. Opening a Scrapy Project in PyCharm
1. Creating a Scrapy project
- In the target directory, hold Shift and right-click, then open a command prompt there.
- Create the project by entering the following at the command prompt:
scrapy startproject project_name
(project_name is up to you; avoid Chinese characters.)
- Create a spider by entering:
scrapy genspider example example.com
(example is the spider's name; example.com is the domain the spider is allowed to crawl.)
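For orientation, the layout generated by the two commands above typically looks like this (directory and file names follow the commands as given):

```
project_name/
├── scrapy.cfg
└── project_name/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── example.py
```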
2. Opening the project in PyCharm
- Locate the Scrapy project you created, right-click it, and open it with PyCharm.
- Create a virtual environment: in PyCharm, open Settings, go to Python Interpreter, and choose Add to create a new virtual environment.
- In the PyCharm terminal, run
pip install scrapy
to install Scrapy into the new environment.
III. Writing the Spider and Running the Crawler
1. Writing the spider
a. Find the spider that was generated under the spiders directory.
b. Write the items file:
import scrapy


class DoubanMovieItem(scrapy.Item):
    # Fields that hold the scraped data (an Item is essentially a dict)
    title = scrapy.Field()
    rating = scrapy.Field()
    motto = scrapy.Field()
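To illustrate the comment that an Item is essentially a dict: below is a minimal stand-in (not Scrapy's actual implementation, which uses a metaclass) showing the behaviour it provides — a dict that only accepts the declared field names.

```python
# Simplified sketch of scrapy.Item semantics; Field/Item here are
# illustrative stand-ins, not the real Scrapy classes.
class Field(dict):
    pass


class Item(dict):
    fields = {}

    def __setitem__(self, key, value):
        # Only declared fields may be assigned, just like a scrapy.Item
        if key not in self.fields:
            raise KeyError(f'Unknown field: {key}')
        super().__setitem__(key, value)


class DoubanMovieItem(Item):
    fields = {'title': Field(), 'rating': Field(), 'motto': Field()}


item = DoubanMovieItem()
item['title'] = '肖申克的救赎'  # works like a plain dict for declared fields
```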
c. Write the page-parsing code:
import scrapy
from scrapy import Request
from scrapy.http import Response

from day32.items import DoubanMovieItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def parse(self, resp: Response):
        selectors = resp.css('#content > div > div.article > ol > li > div > div.info')
        for selector in selectors:  # type: scrapy.Selector
            # Yield an Item object (the scraped data) back to the engine;
            # the engine tells it apart from Request objects (further URLs)
            item = DoubanMovieItem()
            item['title'] = selector.css('div.hd > a > span:nth-child(1)::text').extract_first()
            item['rating'] = selector.css('div.bd > div > span.rating_num::text').extract_first()
            item['motto'] = selector.css('div.bd > p.quote > span::text').extract_first()
            # Use the yield keyword to produce the item
            yield item
        # Request object for the next URL
        selector = resp.css('#content > div > div.article > div.paginator > span.next')
        href = selector.xpath('./a/@href').extract_first()
        # Yield a Request; the new URL is parsed with this same callback.
        # On the last page there is no next link, so guard against None.
        if href:
            yield Request(
                url=f'https://movie.douban.com/top250{href}',
                callback=self.parse
            )
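The spider above builds the next-page URL by string concatenation. The standard library's urljoin (or, inside a spider, response.urljoin) does the same join more robustly for relative links; the href value below is an example of what the "next" link on the page carries:

```python
from urllib.parse import urljoin

base = 'https://movie.douban.com/top250'
href = '?start=25&filter='  # example relative href from the next-page link

# urljoin resolves the relative reference against the page URL
next_url = urljoin(base, href)
print(next_url)  # https://movie.douban.com/top250?start=25&filter=
```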
d. Write the settings file:
- Add a user agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
- Whether to obey the site's robots.txt rules:
ROBOTSTXT_OBEY = False
- Limit the number of concurrent requests:
CONCURRENT_REQUESTS = 4
- Set a download delay:
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
- Enable the classes defined in the middlewares file:
DOWNLOADER_MIDDLEWARES = {
    'day32.middlewares.DoubanDownloaderMiddleware': 543,
}
Requests pass through downloader middlewares in ascending order of these numbers, and responses come back through them in descending order (smaller numbers sit closer to the engine).
- Enable the classes defined in the pipelines file:
ITEM_PIPELINES = {
    'day32.pipelines.MovieItemPipeline': 300,
}  # with multiple pipelines, items go through the ones with smaller numbers first
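On the delay settings above: per the Scrapy settings documentation, when RANDOMIZE_DOWNLOAD_DELAY is enabled the actual wait between requests is a random value between 0.5× and 1.5× DOWNLOAD_DELAY. A sketch of that computation:

```python
import random

DOWNLOAD_DELAY = 3  # seconds, as in the settings above


def next_delay(base: float = DOWNLOAD_DELAY) -> float:
    # Random wait in [0.5 * base, 1.5 * base], mirroring
    # RANDOMIZE_DOWNLOAD_DELAY behaviour
    return random.uniform(0.5 * base, 1.5 * base)


delay = next_delay()
print(f'waiting {delay:.2f}s before the next request')
```

Randomizing the delay makes the request timing look less mechanical to the target site.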
e. Write the middlewares file:
- Attach cookies to outgoing requests (the key/value below are left blank as placeholders):
def process_request(self, request: Request, spider):
    if spider.name == 'douban':
        request.cookies[''] = ''
    return None
2. Running the crawler
- In the terminal, run
scrapy crawl spider_name
(spider_name must match the name attribute in the spider.)
- Run
scrapy crawl spider_name -o file_name.csv
to write the scraped data directly to a CSV file.
IV. Writing a Pipeline to Process Data
1. Data persistence
import openpyxl


class MovieItemPipeline:

    def open_spider(self, spider):
        if spider.name == 'douban':
            self.workbook = openpyxl.Workbook()
            self.sheet = self.workbook.active
            self.sheet.title = 'Top250'
            self.sheet.append(('标题', '评分', '名句'))

    def process_item(self, item, spider):
        if spider.name == 'douban':
            self.sheet.append((item['title'], item['rating'], item['motto']))
        return item

    def close_spider(self, spider):
        if spider.name == 'douban':
            self.workbook.save('movie_top250.xlsx')
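The openpyxl pipeline above needs a third-party dependency. As a hypothetical alternative with the same three hooks, the items could be written to a CSV file using only the standard library (class and file names here are illustrative, and the spider-name check is omitted for brevity):

```python
import csv


class MovieItemCsvPipeline:
    # Same open/process/close lifecycle as the openpyxl pipeline above,
    # but writing rows to a CSV file with the stdlib csv module.

    def open_spider(self, spider):
        self.file = open('movie_top250.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(('标题', '评分', '名句'))

    def process_item(self, item, spider):
        self.writer.writerow((item['title'], item['rating'], item['motto']))
        return item  # always pass the item on to later pipelines

    def close_spider(self, spider):
        self.file.close()
```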
V. Writing and Using Downloader Middlewares
1. Rendering dynamic content
- Scrapy's downloader cannot fetch dynamically rendered content, so a custom downloader middleware is needed that drives a browser via Selenium to fetch it.
- The key method to implement is process_request:
import time

from scrapy import signals, Request
from scrapy.crawler import Crawler
from scrapy.http import Response, HtmlResponse
from selenium import webdriver


class Image360DownloaderMiddleware:

    # Called automatically when the object is created
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.browser = webdriver.Chrome(options=options)

    # Called automatically when nothing references the object any more
    # (i.e. when the downloader middleware is destroyed)
    def __del__(self):
        self.browser.close()

    def process_request(self, request: Request, spider):
        if spider.name == 'image360':
            self.browser.get(request.url)
            # Scroll down step by step so lazy-loaded content gets rendered
            for y in range(500, 10001, 500):
                # Execute JavaScript in the page
                self.browser.execute_script(f'window.scrollTo(0, {y})')
                time.sleep(0.5)
            # The body argument matters most: it is the rendered page.
            # The encoding argument is required.
            return HtmlResponse(
                url=request.url, request=request, encoding='utf-8',
                headers=request.headers, body=self.browser.page_source
            )

    def process_response(self, request: Request, response: Response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
2. Configuring settings
DOWNLOADER_MIDDLEWARES = {
    'day30.middlewares.Image360DownloaderMiddleware': 500,
    'day30.middlewares.DoubanDownloaderMiddleware': 543,
}
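With two middlewares registered, the priority numbers decide the call order: process_request runs in ascending order of the values and process_response in descending order. A small sketch of that ordering using the dict above:

```python
DOWNLOADER_MIDDLEWARES = {
    'day30.middlewares.Image360DownloaderMiddleware': 500,
    'day30.middlewares.DoubanDownloaderMiddleware': 543,
}

# Requests flow through middlewares sorted by ascending priority number
request_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
# Responses come back through the same chain in reverse
response_order = request_order[::-1]
print(request_order)  # Image360 (500) sees the request before Douban (543)
```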
3. Writing the items file
class ImageItem(scrapy.Item):
    url = scrapy.Field()
4. The spider
import scrapy

from day30.items import ImageItem


class Image360Spider(scrapy.Spider):
    name = 'image360'
    allowed_domains = ['image.so.com']
    start_urls = ['https://image.so.com/z?ch=car']

    # response here is the HtmlResponse returned by the custom middleware
    def parse(self, response):
        sources = response.xpath('//img/@src').extract()
        for image_source in sources:  # type: str
            if not image_source.endswith('.gif'):
                item = ImageItem()
                item['url'] = image_source
                yield item
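A quick check of the filtering logic in parse: only non-GIF image URLs become items. The URLs below are made up for illustration:

```python
sources = [
    'https://p0.example.com/t01.jpg',
    'https://p1.example.com/t02.gif',
    'https://p2.example.com/t03.png',
]

# Same condition as the spider: drop animated-GIF sources
kept = [url for url in sources if not url.endswith('.gif')]
print(kept)  # the .gif entry is filtered out
```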