Python Scrapy: a short introduction, plus a hands-on crawl of the readnovel.com free-novel listings stored in a database (IP pool, UA settings)
Installation is not covered here; look up how to install Scrapy with pip (pip install scrapy).
https://www.readnovel.com/free — the free-novel page of the novel-reading site used in this walkthrough.
Scrapy project structure, explained file by file (a sketch of the generated layout follows this list):
items.py — defines the fields of the data to be scraped
middlewares.py — configures middlewares; this is where you set a random user-agent to masquerade as a browser and configure a proxy IP pool
pipelines.py — configures the item pipelines that post-process the data
settings.py — the settings file; middlewares.py and pipelines.py must be enabled here, since they are off by default
spiders/ — the directory holding the spider files
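For orientation, the layout that scrapy startproject generates looks roughly like this (the project is named book in the next step; resource.py is a file we add by hand later):

book/
    scrapy.cfg              # project deployment config
    book/
        __init__.py
        items.py            # item field definitions
        middlewares.py      # downloader / spider middlewares
        pipelines.py        # item pipelines
        settings.py         # project settings
        resource.py         # added by hand: USER_AGENTS and the proxy pool
        spiders/
            __init__.py
            getbookinfo.py  # generated by scrapy genspider below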
Hands-on:
Run scrapy startproject book on the command line to create the project.
cd into the project directory and run scrapy genspider getbookinfo <site URL> to generate the spider file (a sketch of the generated skeleton is shown right after this step).
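The freshly generated getbookinfo.py is only a bare skeleton, roughly like this (the exact allowed_domains/start_urls depend on what you pass to genspider); it gets filled in later in this post:

# -*- coding: utf-8 -*-
import scrapy

class GetbookinfoSpider(scrapy.Spider):
    name = 'getbookinfo'
    allowed_domains = ['www.readnovel.com']
    start_urls = ['https://www.readnovel.com/']

    def parse(self, response):
        pass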
resource.py holds USER_AGENTS and the IP pool; the code is as follows:
# random proxy IPs
PROXIES = [
    # put your proxy IPs here
]
#USER_AGENTS
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
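Note that PROXIES is left empty above. Since the RandomProxy middleware below builds the proxy URL as 'http://%s' % proxy, each entry should be a bare host:port string. A purely hypothetical example of the expected shape:

PROXIES = [
    "127.0.0.1:8888",   # placeholder only -- substitute real proxy addresses
    "10.0.0.2:3128",
]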
settings.py changes as follows: the cookie and retry lines have to be added yourself; the others only need to be uncommented and modified.
ROBOTSTXT_OBEY = False  # set to False: do not obey the site's robots.txt rules
COOKIES_ENABLED = True  # enable cookies
RETRY_ENABLED = False  # disable retries to speed up the crawl
DOWNLOADER_MIDDLEWARES = {  # enable the downloader middlewares
    'book.middlewares.RandomProxy': 300,  # project.module.ClassName
    'book.middlewares.RandomUser': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}
ITEM_PIPELINES = {  # enable the item pipeline
    'book.pipelines.BookPipeline': 300,
}
middlewares.py:
# -*- coding: utf-8 -*-
from book.resource import PROXIES, USER_AGENTS  # pull in the IP pool and UA list
import random

class RandomProxy(object):  # runs automatically once enabled in settings.py
    def process_request(self, request, spider):  # pick a random proxy IP for each request
        proxy = random.choice(PROXIES)
        print(proxy)
        request.meta['proxy'] = 'http://%s' % proxy

class RandomUser(object):
    def process_request(self, request, spider):  # likewise, pick a random User-Agent
        UA = random.choice(USER_AGENTS)
        request.headers.setdefault('User-Agent', UA)
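As a quick sanity check of both middlewares outside a full crawl (a minimal sketch; it assumes PROXIES has at least one entry), you can call process_request on a bare Request and inspect the result:

from scrapy.http import Request
from book.middlewares import RandomProxy, RandomUser

req = Request('https://www.readnovel.com/free')
RandomProxy().process_request(req, spider=None)   # sets req.meta['proxy']
RandomUser().process_request(req, spider=None)    # sets the User-Agent header
print(req.meta.get('proxy'), req.headers.get('User-Agent'))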
pipelines.py: the pipeline that receives items and writes them to the database; MongoDB is used here for testing.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")  # connect to MongoDB

class BookPipeline(object):
    def __init__(self):
        self.col = client['book']['bookinfo']  # the bookinfo collection in the book database

    def process_item(self, item, spider):
        data = dict(item)  # convert the scraped item to a plain dict
        self.col.insert_one(data)  # insert the document (insert() is deprecated in newer PyMongo)
        return item

    def close_spider(self, spider):
        client.close()  # close the MongoDB connection when the spider finishes
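After a crawl finishes, the stored documents can be inspected with a short PyMongo session, for example:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
col = client['book']['bookinfo']
print(col.count_documents({}))      # number of books stored
for doc in col.find().limit(3):     # peek at a few documents
    print(doc)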
items.py: the fields for the scraped data.
import scrapy
# fields describing one book
class BookItem(scrapy.Item):
    # define the fields for your item here like:
    # book title
    book_name = scrapy.Field()
    # book genre/type
    book_type = scrapy.Field()
    # book status, e.g. completed
    book_stat = scrapy.Field()
    # book author
    book_author = scrapy.Field()
    # word count
    book_fontNum = scrapy.Field()
getbookinfo.py, the spider itself:
The key points are writing the XPath selectors and handling the site's pagination scheme (a scrapy shell sketch for trying the XPaths follows).
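A convenient way to work out the XPaths before writing the spider is scrapy shell; roughly (the URL is the first listing page from start_urls, and the selectors are the ones used in the spider below):

# on the command line:
#   scrapy shell "https://www.readnovel.com/finish?pageSize=10&gender=1&catId=20001&isFinish=1&isVip=1&size=-1&updT=-1&orderBy=0&pageNum=1"
# then, inside the shell:
response.xpath("//div[@class='book-info']/h3/a/text()").extract_first()          # book title
response.xpath("//div[@class='book-info']/p[1]/span[1]/text()").extract_first()  # genre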
# -*- coding: utf-8 -*-
import scrapy
from book.items import BookItem
# crawl info about the free, completed fantasy novels on the novel-reading site
class GetbookinfoSpider(scrapy.Spider):
    name = 'getbookinfo'
    allowed_domains = ['www.readnovel.com']  # restrict the crawl to this domain
    pageNum = 1  # start from the first page
    # pagination URL; the page is selected via pageNum
    baseURL = 'https://www.readnovel.com/finish?pageSize=10&gender=1&catId=20001&isFinish=1&isVip=1&size=-1&updT=-1&orderBy=0&pageNum='
    start_urls = [baseURL + str(pageNum)]

    def parse(self, response):
        try:
            node_list = response.xpath("//div[@class='book-info']")  # select the book-info DOM nodes
            for node in node_list:
                item = BookItem()  # fill the item fields
                item['book_name'] = node.xpath("./h3/a/text()").extract()[0]
                item['book_type'] = node.xpath("./p[1]/span[1]/text()").extract()[0]
                item['book_stat'] = node.xpath("./p[1]/span[2]/text()").extract()[0]
                item['book_author'] = node.xpath("./p[1]/span[3]/text()").extract()[0]
                yield item  # hand the item to the pipeline, then continue the loop
            if self.pageNum != 16:  # cap the number of pages
                self.pageNum += 1
                yield scrapy.Request(self.baseURL + str(self.pageNum), callback=self.parse)  # request the next page; parse() handles it again
        except Exception as e:
            print(e)
main.py, a simple launch script (you can also just run scrapy crawl getbookinfo on the command line):
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'getbookinfo'])  # getbookinfo is the spider's name attribute, not the class name
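An equivalent approach that avoids shelling out through cmdline is Scrapy's CrawlerProcess API; a minimal sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('getbookinfo')  # spider name as registered in the project
process.start()  # blocks until the crawl finishes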
GitHub project: https://github.com/MRXKing/book