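Downloading the novel 《活着》 (To Live) with Scrapy
Yu Hua said: "People live for the sake of living itself, not for anything beyond living." The novel 《活着》 (To Live) is a book about the meaning of life. I am very fond of it, and if you like it too, you can download it and read it at your own pace. Today I will use the Scrapy framework to crawl the novel; the download takes only a few seconds.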
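(1) First, create a spider file for the novel, for example by running scrapy genspider csw 99csw.com inside the Myspider project
(2) In csw.py, extract the title and content of each chapter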
import scrapy


class CswSpider(scrapy.Spider):
    name = 'csw'
    allowed_domains = ['99csw.com']
    start_urls = ['https://www.99csw.com/book/2428/72909.htm']

    def parse(self, response):
        # Chapter title and body text from the content container
        title = response.xpath('//div[@id="content"]/h2/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/div/text()').extract())
        yield {
            "title": title,
            "content": content
        }
        # Follow the "next" link only if the page actually has one
        next_url = response.xpath('//div[@class="page"]/a[@id="next"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
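How the "next" link becomes a full URL can be sketched without Scrapy. As far as I know, response.urljoin resolves a link against the page URL in much the same way as urllib.parse.urljoin from the standard library. The two href values below are hypothetical, made up to show the two common link styles; they are not taken from the site.

```python
from urllib.parse import urljoin

# URL of the first chapter, taken from start_urls in the spider
page_url = "https://www.99csw.com/book/2428/72909.htm"

# Hypothetical href values, as an <a id="next"> element might carry them
absolute_path_href = "/book/2428/72910.htm"
relative_href = "72910.htm"

# Both styles resolve against the current page URL
next_from_absolute = urljoin(page_url, absolute_path_href)
next_from_relative = urljoin(page_url, relative_href)

print(next_from_absolute)
print(next_from_relative)
```

Either link style ends up as the same absolute URL, which is why the spider can pass whatever the href contains straight to response.urljoin.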
(3) Write the scraped chapters to a local file; this is implemented in pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyspiderPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('活着.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one chapter: the title line, then the chapter text
        title = item["title"]
        content = item["content"]
        info = title + '\n' + content + '\n'
        self.file.write(info)
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()
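The order in which Scrapy calls the three pipeline methods can be tried out without running a crawl. The following is a minimal sketch, not the project's pipeline: TextFilePipeline is a hypothetical stand-in with a configurable output path, the items are made-up placeholders rather than real chapters, and the calls that Scrapy would make are done by hand.

```python
import os
import tempfile


class TextFilePipeline:
    """Stand-in for the pipeline, with a configurable output path (an assumption)."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(item["title"] + "\n" + item["content"] + "\n")
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()


def run_demo():
    # Made-up placeholder items, not real chapters of the novel
    items = [
        {"title": "Chapter A", "content": "first placeholder text"},
        {"title": "Chapter B", "content": "second placeholder text"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        pipeline = TextFilePipeline(path)
        pipeline.open_spider(spider=None)       # called once at start-up
        for item in items:
            pipeline.process_item(item, spider=None)  # called once per item
        pipeline.close_spider(spider=None)      # called once at shutdown
        with open(path, encoding="utf-8") as f:
            return f.read()


demo_output = run_demo()
print(demo_output)
```

Each item ends up as a title line followed by its text, in the order the items were processed.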
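(4) Set a dynamic User-Agent. The signature def process_request(self, request, spider) can be taken from useragent.py (Scrapy's built-in User-Agent middleware)
1. The code in middlewares.py is as follows: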
from random import choice

from Myspider.settings import USER_AGENT


class UserAgentDownloadMiddleware(object):
    def process_request(self, request, spider):
        # Pick one User-Agent per request and print the one actually used
        user_agent = choice(USER_AGENT)
        print(user_agent)
        request.headers.setdefault(b'User-Agent', user_agent)
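2. The code in settings.py is as follows:
DOWNLOADER_MIDDLEWARES = {
    'Myspider.middlewares.UserAgentDownloadMiddleware': 243,
}
The priority value needs to be changed (243 here): lower values run earlier for process_request, so this middleware sets the header before Scrapy's built-in UserAgentMiddleware does.
ITEM_PIPELINES = {
    'Myspider.pipelines.MyspiderPipeline': 300,
}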
agent1 = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'
agent2 = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'
USER_AGENT = [
    agent1,
    agent2,
]
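Why the priority number matters can be shown with a small simulation. This is a sketch under stated assumptions, not Scrapy's code: a plain dict stands in for request.headers, the User-Agent strings are placeholders, and builtin_user_agent is a simplified stand-in for the built-in middleware, which as far as I know sits at priority 500 in Scrapy's default settings and also uses setdefault. The value 543 is the one commonly seen in generated project templates.

```python
from random import choice

# Placeholder strings standing in for the real User-Agent values
USER_AGENT = ["agent-one", "agent-two"]
BUILTIN_DEFAULT = "builtin-default-agent"  # stand-in for Scrapy's default value


def random_user_agent(headers):
    # Mirrors the custom middleware: choose once, set only if absent
    headers.setdefault("User-Agent", choice(USER_AGENT))


def builtin_user_agent(headers):
    # Simplified stand-in for the built-in middleware's setdefault call
    headers.setdefault("User-Agent", BUILTIN_DEFAULT)


def run_chain(priorities):
    """Apply the middlewares to empty headers in ascending priority order."""
    headers = {}
    for middleware in sorted(priorities, key=priorities.get):
        middleware(headers)
    return headers["User-Agent"]


# Custom middleware at 243 runs first, so its random choice sticks
custom_first = run_chain({random_user_agent: 243, builtin_user_agent: 500})
# At 543 it runs after the built-in one, and setdefault changes nothing
custom_last = run_chain({random_user_agent: 543, builtin_user_agent: 500})

print(custom_first)
print(custom_last)
```

Because setdefault never overwrites an existing header, whichever middleware runs first decides the User-Agent. That is the reason for picking a number below 500.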
(5) Run the program in PyCharm
(6) Check the result
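One common way to start a spider from PyCharm's Run button is a small launcher script. A minimal sketch, assuming the script (run.py is a hypothetical name) sits in the project root next to scrapy.cfg; crawl_args and run_spider are helper names introduced here for illustration, while scrapy.cmdline.execute is Scrapy's own command-line entry point.

```python
def crawl_args(spider_name="csw"):
    """Build the argument list equivalent to: scrapy crawl <spider_name>."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name="csw"):
    # Imported here so the helper above can be used without Scrapy installed
    from scrapy.cmdline import execute

    # Behaves like typing the command in a terminal; exits when the crawl ends
    execute(crawl_args(spider_name))
```

Adding a final line that calls run_spider() and running the script in PyCharm should behave like typing scrapy crawl csw in a terminal opened in the project root.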