Target URL: https://www.zwdu.com/book/7846/195393.html
1. Create the Scrapy project
Run the following on the command line:
scrapy startproject xiaoshuo
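For reference, the command should generate a project skeleton roughly like this (the default Scrapy layout):
xiaoshuo/
    scrapy.cfg
    xiaoshuo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py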
2. Generate the spider file
Scrapy does not allow a spider to share the project's name, so the spider is named zww here (this is also the name used by the crawl command in step 7):
scrapy genspider zww zwdu.com
3. Edit the generated spider and point start_urls at the first chapter:
start_urls = ['https://www.zwdu.com/book/7846/195393.html']
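Put together, the generated spider (xiaoshuo/spiders/zww.py) should look roughly like this after the edit; the parse method is filled in at step 5:

import scrapy

class ZwwSpider(scrapy.Spider):
    name = 'zww'  # this is the name passed to `scrapy crawl`
    allowed_domains = ['zwdu.com']
    start_urls = ['https://www.zwdu.com/book/7846/195393.html']

    def parse(self, response):
        pass  # see step 5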
4. Edit settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}  # enable the pipeline
5. Parse the page
Extract the chapter title, the chapter content, and the URL of the next chapter.
The code is as follows:
def parse(self, response):
    title = response.xpath('//h1/text()').extract_first()
    # extract() returns a list of text nodes, so join them into a single string
    content = ''.join(response.xpath('//div[@id="content"]/text()').extract())
    # yield pushes the item to the pipeline
    yield {
        'title': title,
        'content': content
    }
    next_url = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
    # keep following the "next chapter" link as long as it points to a .html page
    if next_url and next_url.find('.html') != -1:
        # response.urljoin() turns the relative link into an absolute URL,
        # so there is no need to build 'https://www.zwdu.com/book/7846/{}' by hand
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
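To verify the XPath expressions before running the full crawl, scrapy shell can be used interactively from the project directory (so the USER_AGENT setting is picked up); for example:

scrapy shell 'https://www.zwdu.com/book/7846/195393.html'
>>> response.xpath('//h1/text()').extract_first()
>>> ''.join(response.xpath('//div[@id="content"]/text()').extract())[:100]
>>> response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()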
6. Configure the pipeline to save the novel to a file
class XiaoshuoPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('wldf.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        info = title + '\n' + content + '\n'
        self.file.write(info)
        self.file.flush()
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.file.close()
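Because parse() yields only one follow-up Request per page (the next chapter), the chapters are fetched one after another, so they end up in wldf.txt in reading order even though Scrapy itself is asynchronous.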
7. Create a main file to launch the spider
The code is as follows:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'zww'])
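Running this file (for example with python main.py from the project root) is equivalent to typing scrapy crawl zww on the command line, which makes it easy to start the spider from an IDE.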