1. Creating the Project
In a cmd window, enter:
scrapy startproject <project_name>
On success, Scrapy generates the project files in the corresponding directory.
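For a project named zongheng (the name used throughout this post), the generated layout typically looks like this:

zongheng/
    scrapy.cfg            # deploy configuration
    zongheng/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders go here
            __init__.py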
2. Creating the Spider
cd <project_name> to move into the folder just generated, then run:
scrapy genspider <spider_name> <target_domain>
This generates a <spider_name>.py file under the spiders directory.
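For the Zongheng example used in the rest of this post, that is:

cd zongheng
scrapy genspider zh zongheng.com

which creates spiders/zh.py.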
3. Inside the Spider File
import scrapy


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['zongheng.com']
    start_urls = ['http://zongheng.com/']

    def parse(self, response):
        pass
The generated <spider_name>.py file contains roughly the content above:
- A spider class that inherits from scrapy.Spider
- The name attribute: the spider's name, which must be unique within the project; later crawls are started by referring to it
- The allowed_domains attribute: the domain(s) being crawled; the spider will only follow links within them
- The start_urls attribute: the initial URL(s) to request; these are not restricted by allowed_domains
- The parse method: parses the crawled page's data from the response
From there you can yield items to the pipeline for persistence, or construct new requests.
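For instance, a minimal sketch (example.com and the XPaths are placeholders, not the real target) of a parse method that yields both an item and a follow-up request:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # yielding a dict (or an Item) sends it to the pipelines
        yield {'title': response.xpath('//title/text()').get()}
        # yielding a Request schedules another page for crawling
        next_href = response.xpath('//a/@href').get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse)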
4. Writing the Code
This time we will scrape a novel from the Zongheng site.
Target page:
http://book.zongheng.com/showchapter/1022044.html
zh.py
import scrapy

from zongheng.items import ZonghengItem


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['zongheng.com']
    start_urls = ['http://book.zongheng.com/showchapter/1022044.html']

    def parse(self, response):
        # the book's title, taken from the chapter-list page
        book_title = response.xpath('//div[@class="book-meta"]/h1/text()').get()
        # links to the individual chapter pages
        hrefs = response.xpath("//li[@class=' col-4']/a/@href").getall()
        for href in hrefs:
            item = ZonghengItem()
            item['book_title'] = book_title
            yield scrapy.Request(
                href,
                callback=self.detail_parse,
                meta={'item': item},
            )

    def detail_parse(self, response):
        # retrieve the partially filled item passed through meta
        item = response.meta['item']
        item['chapter_title'] = response.xpath("//div[@class='title_txtbox']/text()").get()
        item['info'] = response.xpath("//div[@class='content']/p/text()").getall()
        item['info'] = '\n'.join(item['info']).strip()
        yield item
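As an aside: on Scrapy 1.7+ the same hand-off can be done with cb_kwargs instead of meta, which delivers the data as a plain keyword argument. A sketch of how the two methods above would change:

    def parse(self, response):
        ...
        yield scrapy.Request(
            href,
            callback=self.detail_parse,
            cb_kwargs={'item': item},
        )

    def detail_parse(self, response, item):
        # item arrives directly as a keyword argument
        ...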
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZonghengItem(scrapy.Item):
    book_title = scrapy.Field()
    chapter_title = scrapy.Field()
    info = scrapy.Field()
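An Item behaves like a dict, except that only declared fields may be assigned; a quick sketch:

item = ZonghengItem()
item['book_title'] = 'Some Book'  # OK: declared field
item.get('info')                  # dict-style access works too
item['author'] = 'X'              # KeyError: 'author' is not a declared field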
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class ZonghengPipeline:
    def process_item(self, item, spider):
        book_title = item['book_title']
        # one folder per book
        if not os.path.exists(book_title):
            os.mkdir(book_title)
        chapter_title = item['chapter_title']
        info = item['info']
        # write each chapter to its own file inside the book's folder
        with open(os.path.join(book_title, chapter_title + '.txt'), 'w',
                  encoding='utf-8') as fp:
            fp.write(info)
        return item
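Opening a new file per item is fine here, since every chapter goes to its own file. For a resource shared across items (one combined output file, a database connection), pipelines also provide open_spider/close_spider hooks; a minimal sketch (SingleFilePipeline and all_chapters.txt are made-up names):

class SingleFilePipeline:
    def open_spider(self, spider):
        # runs once when the spider starts
        self.fp = open('all_chapters.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['info'] + '\n')
        return item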
Also, enable the pipeline by uncommenting ITEM_PIPELINES in settings.py (the integer, 300 here, sets the pipeline's order; lower values run first):
ITEM_PIPELINES = {
    'zongheng.pipelines.ZonghengPipeline': 300,
}
5. A Few Notes
- In the spider file, whether you construct the next request with scrapy.Request or send an item to the pipelines, you must use yield; the callbacks are generator functions
- In scrapy.Request, callback specifies the next parse function, and meta passes a dict of data along to it
- Inside a parse function, response supports xpath directly for locating elements, but you need get() or getall() to actually extract the values (see the example after this list)
- items.py defines the structure of the scraped data; the Item class can be referenced from both the spider and the pipelines
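For the get()/getall() point, a quick illustration with a standalone Selector:

from scrapy import Selector

sel = Selector(text='<ul><li>a</li><li>b</li></ul>')
sel.xpath('//li/text()').get()      # 'a' -- the first match, or None
sel.xpath('//li/text()').getall()   # ['a', 'b'] -- every match, as a list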
6. Running the Spider
# scrapy crawl <spider_name>
scrapy crawl zh
The result: one .txt file per chapter, saved under a folder named after the book.
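Alternatively, the spider can be launched from a Python script rather than the command line; a minimal sketch, assuming it is run from the project root so settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('zh')  # the spider's name attribute
process.start()      # blocks until the crawl finishes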