Scraping full novels from Qidian with Scrapy, using an IP proxy
I. Overview
This post walks through using Scrapy to scrape a complete novel from Qidian (起点) and route the requests through an IP proxy. To install Scrapy, run pip install scrapy -i https://pypi.doubanio.com/simple to use the Douban mirror; the Aliyun or Tencent mirrors work just as well.
II. Creating the project
1. Create the project from the command line: scrapy startproject <project name>.
2. cd <project name>.
3. scrapy genspider <spider name> <target site's domain>.
III. Writing the project
1. Writing items.py
import scrapy

class QidianItem(scrapy.Item):
    # define the fields for your item here like:
    book_name = scrapy.Field()  # book title
    name = scrapy.Field()       # chapter title
    text = scrapy.Field()       # chapter content
2. Writing spider.py
import scrapy
from QiDian.items import QidianItem

class QidianSpiderSpider(scrapy.Spider):
    name = 'Qidian_spider'
    allowed_domains = ['qidian.com']
    start_urls = ['https://read.qidian.com/chapter/qOvyhrClna3hI-Ha6N4TBg2/dqt_qJVpkVC2uJcMpdsVgA2']

    def parse(self, response):
        texts = ''
        # The <title> of a chapter page starts with the book title; take its first 7 characters
        book_name = response.xpath('//title/text()').get()[:7]
        title = response.xpath('//span[@class="content-wrap"]/text()').extract()[0]
        text = response.xpath('//div[@class="read-content j_readContent"]/p/text()').getall()
        for i in text:
            texts += i.strip()
        item = QidianItem()
        item['name'] = title
        item['text'] = texts
        item['book_name'] = book_name
        yield item
        # The "next chapter" link is scheme-relative; check it exists before
        # prepending 'https:', otherwise str + None raises a TypeError on the last chapter
        next_href = response.xpath('//a[@id="j_chapterNext"]/@href').get()
        if next_href:
            yield scrapy.Request('https:' + next_href, callback=self.parse)
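The next-chapter href on the page is scheme-relative (it starts with //), which is why the spider prepends 'https:'. The standard library's urljoin handles this case as well and is a safer general-purpose choice; a small illustration with made-up chapter URLs:

```python
from urllib.parse import urljoin

current_url = 'https://read.qidian.com/chapter/aaa/bbb'  # hypothetical current page
next_href = '//read.qidian.com/chapter/ccc/ddd'          # hypothetical scheme-relative link

# urljoin fills in the scheme from the base URL
next_url = urljoin(current_url, next_href)
print(next_url)  # https://read.qidian.com/chapter/ccc/ddd
```

Inside the spider, response.urljoin(next_href) does the same thing relative to the current response's URL.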
3. Writing pipelines.py
import os

class QidianPipeline(object):
    def process_item(self, item, spider):
        title = str(item['name']) + '.txt'
        if str(item['book_name']):
            # One directory per book under F:\文件, one .txt file per chapter
            filepath = 'F:\\文件' + os.path.sep + str(item['book_name'])
            if not os.path.exists(filepath):
                os.mkdir(filepath)
            filepaths = filepath + os.path.sep + title
            with open(filepaths, 'w', encoding='utf-8') as f:
                f.write(str(item['text']))
        return item
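Chapter titles sometimes contain characters Windows forbids in filenames (?, :, quotes, and so on), which would make the open() call fail. A small helper you could add to the pipeline before building the filename; the function name is my own, not part of the original code:

```python
import re

def safe_filename(name: str) -> str:
    # Replace characters that are illegal in Windows filenames with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

print(safe_filename('第12章 你是谁?'))  # 第12章 你是谁_
```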
4. Writing settings.py
Change ROBOTSTXT_OBEY from True to False.
Add request headers (a browser-like User-Agent).
To use the IP proxy, the downloader middleware must be enabled:
uncomment the DOWNLOADER_MIDDLEWARES setting.
Add an IP proxy pool (IPPOOL).
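Put together, the settings changes above might look like the following sketch (the User-Agent string and proxy addresses are placeholders, not live values):

```python
# settings.py (relevant excerpt)

ROBOTSTXT_OBEY = False  # was True by default

# Browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

# Uncomment so the proxy middleware in middlewares.py actually runs
DOWNLOADER_MIDDLEWARES = {
    'QiDian.middlewares.QidianDownloaderMiddleware': 543,
}

# Hypothetical proxy pool; replace with working proxies
IPPOOL = [
    {'ipaddr': '61.135.217.7:80'},
    {'ipaddr': '122.114.31.177:808'},
]
```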
5. Writing middlewares.py
import random
from QiDian.settings import IPPOOL

class QidianDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy from the pool for each request
        thisip = random.choice(IPPOOL)
        print('this ip is:' + thisip['ipaddr'])
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from
        # request.meta['proxy'], not from a 'Proxy-Authorization' key
        request.meta['proxy'] = 'http://' + thisip['ipaddr']
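Outside of Scrapy, the proxy rotation itself is just a random pick from the pool; a minimal standalone sketch with made-up proxy addresses:

```python
import random

# Hypothetical pool in the same shape as IPPOOL in settings.py
IPPOOL = [
    {'ipaddr': '61.135.217.7:80'},
    {'ipaddr': '122.114.31.177:808'},
]

# Same logic as the middleware: choose one entry, build the proxy URL
thisip = random.choice(IPPOOL)
proxy_url = 'http://' + thisip['ipaddr']
print(proxy_url)
```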
Finally, run scrapy crawl <spider name> in the terminal to start the spider.