00
Installing Scrapy
In a Python environment, install it with pip:
pip install scrapy
If the installation fails with Twisted-related errors (usually because the C build toolchain is missing), you can download a prebuilt wheel and install it manually:
pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl
Once Twisted is installed, run the Scrapy install again.
01
Scrapy Commands
* The shell command is an interactive debugging tool. It is very useful: it lets you inspect and experiment with any page on the target site on the spot:
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000024830629A90>
[s]   item       {}
[s]   request    <GET http://www.cqgzfglj.gov.cn/gongzdt/>
[s]   response   <200 http://www.cqgzfglj.gov.cn/gongzdt/>
[s]   settings   <scrapy.settings.Settings object at 0x000002483063A2E8>
[s]   spider     <PubhouseSpider 'pubhouse' at 0x2483092c940>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
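Inside the shell, expressions like response.xpath('...').extract() return a list of matched strings. As a rough stand-in for that behavior using only the Python standard library (Scrapy itself uses the parsel/lxml selector library with full XPath support; the HTML snippet below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed HTML snippet so the stdlib XML parser accepts it
html = ('<html><body>'
        '<div class="book-info"><h1><em>Example Book</em></h1></div>'
        '</body></html>')

root = ET.fromstring(html)
# ElementTree supports a small XPath subset, including [@attr='value'] predicates
titles = [em.text for em in root.findall(".//div[@class='book-info']/h1/em")]
print(titles)  # a list of matches, like extract(); take [0] for the first one
```

This mirrors the pattern the spider below uses: select nodes with an XPath, get a list back, and index into it.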
02
Project Walkthrough
2.1 Create a New Project
scrapy startproject xxx
This command creates a directory named xxx under the current directory; its layout is the same as that of the books project shown in the figure below.
Enter the xxx directory and generate a spider:
cd xxx
scrapy genspider xxx xxx.com
2.2 Edit the Relevant Files
2.2.1 entrypoint.py
This file is created only to make debugging in PyCharm easier; it is just two lines (running it is equivalent to running scrapy crawl xxxxx on the command line):
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xxxxx'])
2.2.2 items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # book title (used to name the output file)
    desc = scrapy.Field()   # accumulated chapter content
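A scrapy.Item exposes a dict-style interface for its declared fields, and the spider below relies on item.get('desc') returning None before that field is first set. A plain dict stands in for BooksItem in this sketch so it runs without Scrapy installed:

```python
# A plain dict mimics the dict-style access pattern of scrapy.Item here;
# BooksItem itself behaves the same way for its declared fields.
item = {}
item['title'] = 'Example Book'

print(item.get('title'))  # 'Example Book'
print(item.get('desc'))   # None: the field is unset, which the spider
                          # uses to detect the first chapter
```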
2.2.3 pipelines.py
# -*- coding: utf-8 -*-
import codecs


class BooksPipeline(object):
    def process_item(self, item, spider):
        # The spider yields a single item holding the whole book, so write
        # it out in one go; the with-block closes the file reliably.
        with codecs.open(item.get('title') + '.txt', 'w', encoding='utf-8') as f:
            f.write(item.get('desc') + '\n\f')
        return item
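What process_item does can be exercised in isolation: write the item's desc field to a .txt file named after its title. A minimal runnable sketch, with a plain dict standing in for the item and a temporary directory standing in for the project directory:

```python
import codecs
import os
import tempfile

# Stand-in for the item the spider yields (made-up values for illustration)
item = {'title': 'Example Book', 'desc': 'chapter text'}

outdir = tempfile.mkdtemp()
path = os.path.join(outdir, item.get('title') + '.txt')

# Same write logic as the pipeline: UTF-8 file named <title>.txt
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(item.get('desc') + '\n\f')

with codecs.open(path, 'r', encoding='utf-8') as f:
    print(f.read())
```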
2.2.4 books.py
# -*- coding: utf-8 -*-
import scrapy
from books.items import BooksItem
from scrapy.http import Request


class FictionSpider(scrapy.Spider):
    name = 'fiction'
    allowed_domains = ['www.qidian.com']
    start_urls = ['https://book.qidian.com/info/1005209812#Catalog']

    def parse(self, response):
        hxs = response
        # Extract the book title
        names = hxs.xpath('//div[@class="book-info "]/h1/em/text()').extract()[0]
        item = BooksItem()
        item['title'] = names
        # URL of the first chapter
        charterurl = hxs.xpath('//div[@class="volume"]/ul/li/a/@href').extract()[0]
        print(charterurl)
        # Follow the first chapter's URL, passing the item along in meta
        yield Request("https:" + charterurl, meta={'item': item},
                      callback=self.parsecharter, dont_filter=True)

    def parsecharter(self, response):
        hxs = response
        # Extract the chapter title
        titles = hxs.xpath('//h3[@class="j_chapterName"]/text()').extract()[0]
        item = response.meta['item']
        content = '\n' + str(titles) + '\n'
        # Paragraphs of the chapter body
        s = hxs.xpath('//div[@class="read-content j_readContent"]//p/text()').extract()
        for srt in s:
            srt = srt.replace("\u3000", " ")  # replace full-width indentation spaces
            content = content + srt + '\n'
        # Append this chapter to the accumulated description
        desc = item.get('desc')
        if desc is None:
            item['desc'] = content
        else:
            item['desc'] = desc + content
        chapters = hxs.xpath('//a[@id="j_chapterNext"]/@href').extract()  # next-chapter URL
        Nextt = hxs.xpath('//a[@id="j_chapterNext"]/text()').extract()[0]  # text of the "next" link
        if Nextt == '书末页':  # "end of book": this was the last chapter, so emit the item
            yield item
            return
        for chapter in chapters:
            yield Request("https:" + chapter, meta={'item': item},
                          callback=self.parsecharter, dont_filter=True)
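The core of parsecharter, cleaning each chapter's text and accumulating it into item['desc'], can be tested in isolation without Scrapy. A sketch with made-up chapter data (the real paragraphs come from the XPath queries above):

```python
# Standalone sketch of parsecharter's text handling: strip the full-width
# space (\u3000) used for paragraph indentation, join paragraphs with
# newlines, and append each chapter to the accumulated description.
def build_chapter(title, paragraphs):
    content = '\n' + str(title) + '\n'
    for p in paragraphs:
        content += p.replace('\u3000', ' ') + '\n'
    return content

item = {'title': 'Example Book'}  # plain dict standing in for BooksItem
for title, paragraphs in [('Chapter 1', ['\u3000\u3000First line.']),
                          ('Chapter 2', ['\u3000\u3000Second line.'])]:
    content = build_chapter(title, paragraphs)
    desc = item.get('desc')
    item['desc'] = content if desc is None else desc + content

print(item['desc'])
```

Because the item travels through meta from request to request, desc keeps growing until the last chapter, where the spider finally yields the item to the pipeline.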
2.3 Run the Spider
scrapy crawl fiction
After the run finishes, the output .txt file is generated in the current directory.