爬虫小白的scrapy初次尝试,有错的地方还请指正!
目录
安装python+pycharm
网上安装教程很多,我这里就不重复了。
pycharm破解参考:https://blog.csdn.net/qs17809259715/article/details/90115751
scrapy框架简介
一个快速、高层次的屏幕抓取和web抓取的Python框架,用于抓取web站点并从页面中提取结构化的数据,可以用于数据挖掘、监测和自动化测试,可根据具体需求个性化定制。
Scrapy架构图
各组件功能如下
- Scrapy Engine(引擎):用来处理整个系统的数据传递,是整个系统的核心部分。
- Scheduler(调度器):用来接受引擎发过来的Request请求, 压入队列中, 并在引擎再次请求的时候返回。
- Downloader(下载器):用于引擎发过来的Request请求对应的网页内容, 并将获取到的Responses返回给Spider。
- Spiders(爬虫):对Responses进行处理,从中获取所需的字段(即Item),也可以从Responses获取所需的链接,让Scrapy继续爬取。
- Item Pipeline(管道):负责处理Spider中获取的实体,对数据进行清洗,保存所需的数据。
- Downloader Middlewares(下载器中间件):主要用于处理Scrapy引擎与下载器之间的请求及响应。
- Spider Middlewares(爬虫中间件):主要用于处理Spider的Responses和Requests
Scrapy工作流程图
- SPIDERS的yeild将request发送给ENGIN
- ENGINE对request不做任何处理发送给SCHEDULER
- SCHEDULER( url调度器),生成request交给ENGIN
- ENGINE拿到request,通过MIDDLEWARE进行层层过滤发送给DOWNLOADER
- DOWNLOADER在网上获取到response数据之后,又经过MIDDLEWARE进行层层过滤发送给ENGIN
- ENGINE获取到response数据之后,返回给SPIDERS,SPIDERS的parse()方法对获取到的response数据进行处理,解析出items或者requests
将解析出来的items或者requests发送给ENGIN - ENGIN获取到items或者requests,将items发送给ITEMPIPELINES,将requests发送给SCHEDULER
注意!只有当调度器中不存在任何request了,整个程序才会停止,(也就是说,对于下载失败的URL,Scrapy也会重新下载。)
安装Scrapy
pip install pywin32
pip install zope.interface
pip install Twisted
pip install pyOpenSSL
pip install Scrapy
创建scrapy项目
pycharm 终端输入
scrapy startproject [项目名]
移动到创建项目文件夹
cd [项目名]
创建spider
scrapy genspider -t crawl [爬虫名] [爬取网站]
创建start.py(爬虫启动文件)
from scrapy import cmdline
cmdline.execute("scrapy crawl [爬虫名]".split())
创建完成
实战——爬取笔趣阁
小白笔记 大佬轻喷
爬取要求
爬取笔趣阁首页所有小说的的origin_url(url),title,author,class_(小说分类),introduction,并以json格式储存。
(正文太多不爬取)
网站分析
-
打开谷歌浏览器进入笔趣阁首页https://www.biqugex.com右键检查
发现所有小说的URL都是/book_.+ -
进入小说详情页
发现我们需要的所有信息都在div class='book’下
创建biqugex项目
pycharm终端输入
scrapy startproject biqugex
cd biqugex
scrapy genspider -t crawl bqg_spider biqugex.com
创建start.py(爬虫启动文件)
from scrapy import cmdline
cmdline.execute("scrapy crawl bqg_spider".split())
创建完成
settings.py(配置文件)
ROBOTSTXT_OBEY = False#不遵守机器协议
DOWNLOAD_DELAY = 1#下载延迟1秒
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}#设置User-Agent
DOWNLOADER_MIDDLEWARES = {
'bqg_sipder.middlewares.BqgSipderDownloaderMiddleware': 543,
}#取消注释DOWNLOADER_MIDDLEWARES
ITEM_PIPELINES = {
'bqg_sipder.pipelines.BqgSipderPipeline': 300,
}#取消注释ITEM_PIPELINES
items.py(设置获取字段)
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class BiqugexItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
origin_url = scrapy.Field()
title = scrapy.Field()
author = scrapy.Field()
class_ = scrapy.Field()
introduction = scrapy.Field()
content = scrapy.Field()
bqg_spider.py(爬虫)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from biqugex.items import BiqugexItem
class BqgSpider(CrawlSpider):
name = 'bqg_spider'
allowed_domains = ['biqugex.com']
start_urls = ['https://www.biqugex.com']
rules = (
Rule(LinkExtractor(allow=r'/book_.+'), callback='parse_item', follow=False),
)# 匹配符合要求链接,response到parse_item
def parse_item(self, response):#xpath匹配要需要的信息,最后yield
origin_url = response.url
title = response.xpath("//h2/text()").get().strip()
author = response.xpath("//div[@class='small']/span[1]/text()").get().lstrip('作者:').strip()
class_ = response.xpath("//div[@class='small']/span[2]/text()").get().lstrip('分类:').strip()
introduction = response.xpath("//div[@class='intro']/text()[1]").get().replace("\n","").replace("/r","").strip()
item = BiqugexItem(
origin_url = origin_url,
title = title,
author = author,
class_ = class_,
introduction = introduction
)
yield item
pipelines.py(保存为json文件)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json
class BiqugexPipeline(object):
def __init__(self):
self.file = open('novel.json', 'w', encoding='utf-8')
def start_spider(self, spider):
print("爬虫开始了" + "<==" * 10)
pass
def process_item(self, item, spider):
line = json.dumps(dict(item), ensure_ascii=False) + "\n"#ensure_ascii=False防止中文乱码
self.file.write(line)
return item
def close_spider(self, spider):
print("爬虫结束了" + "<==" * 10)
pass
运行爬虫
- start.py,等待一会得到结果
结束语
笔趣阁简单爬取意在熟悉scrapy基本流程, 如有错误欢迎指正!