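This article is Chapter 8 of Python爬虫高手爬爬爬; because of its length it is published as a separate post.
For the earlier material, see ->…/ Python爬虫高手爬爬爬
The Scrapy framework ⭐⭐⭐
What is Scrapy? A popular, ready-made framework for writing crawlers. Features: high-performance persistent storage, asynchronous downloading, high-performance data parsing, and support for distributed crawling.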
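1 Installing the environment:
- macOS or Linux: pip install scrapy
- Windows:
  - pip install wheel
  - Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  - Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
  - pip install pywin32
  - pip install scrapy
Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded. Remember to install into the virtual environment you actually use for the project.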
2 Creating a project (in the terminal)
scrapy startproject xxxPro
cd xxxPro
Create a spider file in the spiders subdirectory:
scrapy genspider spiderName www.xxx.com
Run the spider:
scrapy crawl spiderName
3 Data parsing
In settings.py, set a spoofed User-Agent and turn off robots.txt compliance:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36 Edg/86.0.622.63'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Only log errors
LOG_LEVEL = 'ERROR'
Example spider file spiderone.py
import scrapy

class SpideroneSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider source file
    name = 'spiderone'
    # Allowed domains: restricts which URLs in start_urls may actually be requested
    # allowed_domains = ['www.baidu.com']
    # Start URLs: Scrapy automatically sends a request to every URL in this list
    start_urls = ['https://www.biedoul.com/wenzi/']

    # Used for data parsing: response is the response object of a successful request
    def parse(self, response):
        # Parse the author name and the joke content
        div_list = response.xpath('/html/body/div[4]/div[1]/div[1]/dl')
        # print(div_list)
        all_data = []  # stores all parsed data
        for div in div_list:
            # print(div)
            # xpath returns a list whose elements are always Selector objects
            # extract() pulls out the string stored in a Selector's data attribute
            author = div.xpath('./span/dd/a/strong/text()')[0].extract()
            # author = div.xpath('./span/dd/a/strong/text()').extract_first()
            # Calling extract() on the list extracts the data string of every Selector in it
            content = div.xpath('./dd//text()').extract()  # the text contains <br> breaks, so use // to get all of it
            content = ''.join(content)  # list to string
            print('author:', author)
            print('content:', content)
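4 Persistent storage
4.1 Via terminal command
- Requirement: only the return value of the parse method can be stored in a local text file
- Note: the output file type can only be one of 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
- Command: scrapy crawl xxx -o filePath
- Advantage: concise, efficient, and convenient
- Drawback: quite limited (data can only be stored in files with the listed extensions)
Modify the spider from section 3 as follows, then run scrapy crawl spiderone -o ./22222.csv in the terminal to save the data: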
import scrapy

class SpideroneSpider(scrapy.Spider):
    name = 'spiderone'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.biedoul.com/wenzi/']

    def parse(self, response):
        div_list = response.xpath('/html/body/div[4]/div[1]/div[1]/dl')
        all_data = []  # stores all parsed data
        for div in div_list:
            author = div.xpath('./span/dd/a/strong/text()')[0].extract()
            content = div.xpath('./dd//text()').extract()  # // gets all the text
            content = ''.join(content)
            # Wrap the values in a dict so they can be returned
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)
        return all_data
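As a rough mental model of what -o ./22222.csv does with the returned list of dicts, the sketch below writes the same kind of records with the standard library's csv module. It is not Scrapy's feed exporter, only an illustration; the function name export_csv and the sample rows are made up for this example.

```python
import csv
import io

def export_csv(rows, fp):
    """Write a list of dicts (like the list returned by parse) as CSV."""
    if not rows:
        return
    writer = csv.DictWriter(fp, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical sample data standing in for the parsed jokes.
all_data = [
    {'author': 'alice', 'content': 'first joke'},
    {'author': 'bob', 'content': 'second, with a comma'},
]

buffer = io.StringIO()  # an in-memory file; a real run would open a .csv file
export_csv(all_data, buffer)
print(buffer.getvalue())
```

The dict keys become the CSV header, which is why parse has to return dicts (or items) rather than bare strings.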
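4.2 Via item pipelines
- Coding workflow:
  - Parse the data
  - Define the relevant fields in the item class
  - Store the parsed data in an item object
  - Submit the item object to the pipeline for persistent storage
  - In the pipeline class's process_item, persist the data held by the item object it receives
  - Enable the pipeline in the settings file
Advantage: very general.
Example 1: a pipeline that stores to a txt file
Building on the parsing code above, submit items to the pipeline: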
import scrapy
from lesson8_scrapy.testpro.testpro.items import TestproItem

class SpideroneSpider(scrapy.Spider):
    name = 'spiderone'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.biedoul.com/wenzi/']

    def parse(self, response):
        div_list = response.xpath('/html/body/div[4]/div[1]/div[1]/dl')
        all_data = []  # stores all parsed data
        for div in div_list:
            author = div.xpath('./span/dd/a/strong/text()')[0].extract()
            content = div.xpath('./dd//text()').extract()  # // gets all the text
            content = ''.join(content)
            item = TestproItem()
            item['author'] = author
            item['content'] = content
            yield item  # submits the item to the pipeline
Define the fields in items.py; an item plays a role similar to a struct in C++:
import scrapy

class TestproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
pipelines.py receives the items and performs the storage:
# Store to a txt file
class TestproPipeline:
    fp = None

    # Overrides a parent-class hook: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./xiaohua.txt', 'w', encoding='utf-8')

    # Dedicated to handling item objects
    # Receives the item objects submitted by the spider file
    # Called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()
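In settings.py, enable the pipeline by listing the pipeline class with its priority; the smaller the number, the higher the priority:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'testpro.pipelines.TestproPipeline': 300,
}
Then simply run scrapy crawl spiderone in the terminal.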
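The priority rule and the role of return item can be sketched in plain Python. This is a toy driver, not Scrapy's own code: the two pipeline classes and the run_pipelines helper are hypothetical names used only to show the ordering, and the dict maps instances (rather than dotted paths) to priorities to keep the example self-contained.

```python
class UpperCasePipeline:
    """Hypothetical first stage: normalises the author field."""
    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        return item  # hand the item to the next pipeline

class CollectPipeline:
    """Hypothetical second stage: stores whatever reaches it."""
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item

def run_pipelines(pipelines, item, spider=None):
    """Call process_item in ascending priority order (smaller number first)."""
    for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = pipeline.process_item(item, spider)
    return item

collector = CollectPipeline()
# Declared out of order on purpose: 300 still runs before 301.
ITEM_PIPELINES = {collector: 301, UpperCasePipeline(): 300}
result = run_pipelines(ITEM_PIPELINES, {'author': 'alice', 'content': 'hi'})
print(collector.seen)
```

If a stage forgot to return the item, the next stage would receive None, which is why every process_item in this article ends with return item.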
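Example 2: a pipeline that stores to a database
Declare a new pipeline class in pipelines.py, then enable it in settings with a priority.
Pipeline classes. Note that process_item in the preceding pipeline class must return item so the item is passed on to the pipeline with the next priority: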
class TestproPipeline:
    fp = None

    # Overrides a parent-class hook: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./xiaohua.txt', 'w', encoding='utf-8')

    # Dedicated to handling item objects
    # Receives the item objects submitted by the spider file
    # Called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write('topic' + ':' + author + '\n')
        self.fp.write('content' + ':' + content + '\n')
        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()
import pymysql

class mysqlPileLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='10.1.218.**', port=3306, user='root', password='***', db='test', charset='utf8')
        # Create the cursor once so close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Let the driver fill in the values; building the SQL with % breaks on quotes in the text
            self.cursor.execute('insert into xiaohua values(%s, %s)', (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Declare the pipeline classes in settings.py:
ITEM_PIPELINES = {
    'testpro.pipelines.TestproPipeline': 300,
    'testpro.pipelines.mysqlPileLine': 301,
}
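The same open_spider / process_item / close_spider shape works with any DB-API driver. The sketch below uses the standard library's sqlite3 so it can be tried without a MySQL server; the class name SqlitePipeline and the db_path parameter are assumptions for illustration, while the table name xiaohua is reused from the example above.

```python
import sqlite3

class SqlitePipeline:
    """Same three hooks as a database pipeline, backed by sqlite3."""
    def __init__(self, db_path=':memory:'):
        self.db_path = db_path  # hypothetical parameter; ':memory:' keeps the demo self-contained
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute('CREATE TABLE IF NOT EXISTS xiaohua (author TEXT, content TEXT)')

    def process_item(self, item, spider):
        try:
            # Placeholders let the driver escape quotes inside the values.
            self.conn.execute('INSERT INTO xiaohua VALUES (?, ?)',
                              (item['author'], item['content']))
            self.conn.commit()
        except sqlite3.Error as e:
            print(e)
            self.conn.rollback()
        return item  # keep the chain going for any later pipeline

    def close_spider(self, spider):
        self.conn.close()

# Drive the hooks by hand, the way the framework would during a crawl.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'author': 'alice', 'content': 'he said "hi"'}, spider=None)
rows = pipeline.conn.execute('SELECT author, content FROM xiaohua').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Note that sqlite3 uses ? as its placeholder, whereas pymysql uses %s.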
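5 Whole-site crawling
This means crawling the page data for every page number under one section of a site, for example a paginated list.
Ways to implement it:
- Add the URLs of all pages to the start_urls list (not recommended)
- Send the requests manually (recommended). A manual request looks like this:
yield scrapy.Request(url,callback)
callback is the method used to parse the response
Crawling the image names from 30 pages of 校花网 (www.521609.com)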
import scrapy

class AlldatagirlSpider(scrapy.Spider):
    name = 'alldatagirl'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'
    page_num = 1
    count = 1

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for div in div_list:
            # A single path raises IndexError: list index out of range on some entries,
            # because the name is stored in different places; hence the two alternatives
            name = div.xpath('./a[2]/text() | ./a[2]/b/text()')[0].extract()
            print(self.count, ':', 'name:', name)
            self.count += 1
        if self.page_num <= 30:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(url=new_url, callback=self.parse)
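The pagination logic itself is just string templating plus a counter. A minimal sketch outside Scrapy, assuming a hypothetical helper named page_urls; in the spider each generated URL would be wrapped in a scrapy.Request instead of collected in a list:

```python
def page_urls(template, first_page, last_page):
    """Yield one URL per page number, first_page..last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield template % page

url = 'http://www.521609.com/daxuemeinv/list8%d.html'
urls = list(page_urls(url, 1, 3))
print(urls)
```

Whether the site's real page files follow exactly this numbering has to be checked in the browser; the template is taken as-is from the spider above.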
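6 The five core components
- Engine (Scrapy)
  Handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler
  Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.
- Downloader
  Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
- Spiders
  Spiders do the main work: they extract the information you need from specific pages, i.e. the items. You can also extract links from a page so that Scrapy goes on to crawl the next page.
- Item pipeline
  Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed in a defined order.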
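To make the scheduler's two jobs (queueing and de-duplication) concrete, here is a toy version in plain Python. It is far simpler than Scrapy's real scheduler, which works on request objects and fingerprints rather than bare URL strings; ToyScheduler and its method names are invented for this sketch.

```python
from collections import deque

class ToyScheduler:
    """A FIFO queue of URLs that silently drops URLs it has already seen."""
    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        """Queue the URL; return False if it was a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

scheduler = ToyScheduler()
for candidate in ['http://a.example/1', 'http://a.example/2', 'http://a.example/1']:
    scheduler.enqueue(candidate)

order = []
url = scheduler.next_url()
while url is not None:
    order.append(url)
    url = scheduler.next_url()
print(order)
```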
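7 Passing data between requests (deep crawling)
Use this when the data to parse is not all on one page, for example when a detail page has to be opened by clicking a link.
Crawling job titles and job descriptions from the Alibaba campus recruitment site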
import scrapy
from ..items import BossproItem

class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://campus.alibaba.com/positionList.htm']
    # url = 'https://www.zhipin.com/c100010000-p100101/?page={page1}&ka=page-{page2}'
    # page_num = 2

    # The callback receives the item through response.meta
    def parse_detail(self, response):
        item = response.meta['item']
        job_desc = response.xpath('//*[@id="J-jobs"]/div[1]/dl//text()').extract()
        job_desc = ''.join(job_desc)
        # print(job_desc)
        item['job_desc'] = job_desc
        yield item

    # Parse the job titles on the list page: *[@id="s_position_list"]/ul/li[5]/div[1]/div[1]/div[1]/a/h3
    # //*[@id="s_position_list"] //*[@id="s_position_list"]/ul
    def parse(self, response):
        li_list = response.xpath('//*[@id="J-filter-target"]/tbody/tr')
        print(li_list)
        for li in li_list:
            item = BossproItem()
            job_name = li.xpath('./th/a/text()').extract_first()
            item['job_name'] = job_name
            print(job_name)
            detail_url = li.xpath('./th/a/@href').extract_first()
            print(detail_url)
            # Request the detail page to get its page source
            # This is a manually sent request
            # Passing data: meta={} hands the meta dict to the callback of this request
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
        #
        # Pagination
        # if self.page_num <= 5:
        #     new_url = self.url.format(page1=self.page_num,page2=self.page_num)
        #     print(new_url)
        #     self.page_num += 1
        #     print(self.page_num)
        #
        #     yield scrapy.Request(new_url, callback=self.parse)
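One detail worth checking before yielding the detail request: an href taken from a page is often relative, and scrapy.Request needs an absolute URL with a scheme. The sketch below shows the resolution step with the standard library; the href values are hypothetical, not taken from the real site. Inside a callback, response.urljoin(href) performs the same kind of join against the response's own URL.

```python
from urllib.parse import urljoin

page_url = 'https://campus.alibaba.com/positionList.htm'

# Hypothetical href values in the three shapes commonly seen in list pages.
hrefs = [
    '/positionDetail.htm?id=1',      # root-relative
    'positionDetail.htm?id=2',       # relative to the current page
    'https://example.com/job/3',     # already absolute, left unchanged
]

detail_urls = [urljoin(page_url, href) for href in hrefs]
for u in detail_urls:
    print(u)
```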
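8 Crawling images
How does crawling string data with Scrapy differ from crawling image data?
- Strings: parse with xpath and submit to a pipeline for persistent storage
- Images: xpath only yields the value of the image's src attribute; a separate request to the image URL is needed to get the binary image data
ImagesPipeline:
You only parse the img src attribute value and submit it to the pipeline; the pipeline then requests the image src, obtains the binary image data, and also stores it for us.
Workflow: ⭐
- 1. Parse the data (the image URLs)
- 2. Submit the item holding the image URL to the designated pipeline class
- 3. In the pipeline file, define a custom pipeline class based on ImagesPipeline and override:
  get_media_requests
  file_path
  item_completed
- 4. In the settings file:
  - Set the image storage directory: IMAGES_STORE = './imgs_bobo'
  - Enable the pipeline: the custom pipeline class
Crawling images from 站长素材 (sc.chinaz.com)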
Item class
import scrapy

class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
A pipeline class based on ImagesPipeline:
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class imgsPileLine(ImagesPipeline):
    # Sends the request for the image data, based on the image URL
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Decides the path (file name) the image is stored under
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # returned to the next pipeline class to be executed
Spider file. Note the lazy-loading pseudo-attribute: on some pages the real src only appears once the image is actually displayed, so the URL has to be read from src2 instead.
import scrapy
from ..items import ImgsproItem

class ImagesSpider(scrapy.Spider):
    name = 'images'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Note: read the pseudo-attribute src2, not src
            src = div.xpath('./div/a/img/@src2').extract_first()
            src = 'https:' + src  # the attribute value is protocol-relative
            print(src)
            item = ImgsproItem()
            item['src'] = src
            yield item
In settings.py, set the storage directory and enable the pipeline.
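The two settings named in the workflow could look like the fragment below. The project name imgsPro in the dotted path is an assumption (the article does not show it); the directory and the class name come from the text above. ImagesPipeline also needs the Pillow package installed.

```python
# settings.py (fragment); 'imgsPro' is an assumed project name
IMAGES_STORE = './imgs_bobo'  # directory the downloaded images are written to

ITEM_PIPELINES = {
    'imgsPro.pipelines.imgsPileLine': 300,  # the custom ImagesPipeline subclass
}
```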
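9 Middleware
Downloader middleware
- Position: between the engine and the downloader
- Purpose: intercept all requests and responses of the whole project
- Intercepting requests (in middlewares.py):
  - UA spoofing: process_request
  - Proxy IP: process_exception: return request
- Intercepting responses:
  - Tamper with the response data or the response object
- Requirement: crawl news data (title and content) from NetEase News
  - 1. From the NetEase News home page, parse the URLs of the five section pages (not dynamically loaded)
  - 2. The news titles in each section are loaded dynamically
  - 3. Parse the URL of each news detail page, fetch its page source, and parse out the news content
Intercepting requests and changing the proxy IP
Spider file, requesting Baidu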
import scrapy

class MiddlewaresTestSpider(scrapy.Spider):
    name = 'middlewares_test'
    # allowed_domains = ['www.xxxx.com']115.24.230.154->
    start_urls = ['http://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        with open('./ip.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)
middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import random

class MiddleproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercepts every request
    def process_request(self, request, spider):
        # UA spoofing
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        # Set a fixed proxy here to verify that the proxy mechanism takes effect
        request.meta['proxy'] = 'http://119.119.116.252'
        return None

    # Intercepts every response
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercepts requests that raised an exception
    def process_exception(self, request, exception, spider):
        if request.url.split(':')[0] == 'http':
            # Switch to a proxy
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        return request  # the corrected request object is sent again
Enable the downloader middleware in settings.py:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}
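The proxy choice made in process_exception can be isolated as a small function, which makes it easy to test without running a crawl. This is a sketch: pick_proxy is a hypothetical helper, the proxy addresses are simply the sample values from the middleware above (they are unlikely to still be live), and it mirrors the article's convention of giving the proxy URL the same scheme as the request.

```python
import random
from urllib.parse import urlparse

PROXY_HTTP = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_HTTPS = ['120.83.49.90:9000', '95.189.112.214:35508']

def pick_proxy(url, rng=random):
    """Return a proxy URL whose scheme matches the scheme of the request URL."""
    scheme = urlparse(url).scheme
    pool = PROXY_HTTP if scheme == 'http' else PROXY_HTTPS
    return scheme + '://' + rng.choice(pool)

print(pick_proxy('http://www.baidu.com/s?wd=ip'))
print(pick_proxy('https://news.163.com/'))
```

In the middleware, the returned string would be assigned to request.meta['proxy'] before the request is returned for rescheduling.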
Crawling NetEase News
Use the selenium module to solve the dynamic-loading problem
- Spider module news.py
import scrapy
from selenium import webdriver
from ..items import WangyiproItem

class NewsSpider(scrapy.Spider):
    name = 'news'
    # allowed_domains = ['www.cccom']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the URLs of the five section pages

    # Instantiate a browser object
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Raw string, so the backslashes in the Windows path are not read as escapes
        self.bro = webdriver.Chrome(executable_path=r'D:\pycharm\PycharmProjects\pachong\lesson7_selenium\chromedriver')

    # Parse the URLs of the five section pages
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]
        # print('**********************************')
        # print(li_list)
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)
        # Request the page of each section in turn
        for url in self.models_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    # The news titles in each section page are loaded dynamically
    def parse_model(self, response):  # parse the title and detail-page URL of each news entry in a section page
        # response.xpath()
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        # print('div_list:')
        # print(div_list)
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            print(new_detail_url)
            item = WangyiproItem()
            item['title'] = title
            # Request the news detail page
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):  # parse the news content
        content = response.xpath('//*[@id="content"]/div[2]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    # Scrapy calls closed(reason) when the spider finishes; quit the browser here
    def closed(self, reason):
        self.bro.quit()
- items.py
import scrapy

class WangyiproItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
- middlewares.py: the middleware intercepts responses and replaces them ⭐
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from time import sleep
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse

class WangyiDownloaderMiddleware:
    def process_request(self, request, spider):
        return None

    # Intercepts the response objects of the five section pages and replaces them
    def process_response(self, request, response, spider):  # spider is the spider object
        bro = spider.bro  # the browser object defined in the spider class
        # Pick out the specific responses to replace:
        # the url identifies the request,
        # and the request identifies the response
        if request.url in spider.models_urls:
            bro.get(request.url)  # request the section page in the browser
            sleep(1)
            page_text = bro.page_source  # now contains the dynamically loaded news data
            # response here is the original response of a section page.
            # Build a new response object that meets the requirement (it contains the
            # dynamically loaded news data) to replace the old one.
            # selenium is a convenient way to obtain the dynamically loaded data.
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # response of any other request: pass it through unchanged
            return response
- settings.py configuration file
BOT_NAME = 'wangyi'
SPIDER_MODULES = ['wangyi.spiders']
NEWSPIDER_MODULE = 'wangyi.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 Edg/86.0.622.69'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wangyi.pipelines.WangyiPipeline': 300,
}
LOG_LEVEL = 'ERROR'
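10 CrawlSpider, a powerful tool for whole-site crawling
- Ways to crawl a whole site
  - Based on Spider: manual requests
  - Based on CrawlSpider
- Using CrawlSpider:
  - Create a project
  - cd XXX
  - Create the spider file (CrawlSpider):
    - scrapy genspider -t crawl xxx www.xxxx.com
  - Link extractor (LinkExtractor):
    - Purpose: extracts the links that match a given rule (allow)
  - Rule parser (Rule):
    - Purpose: parses the links extracted by the link extractor with the given rule (callback)
Whole-site crawl of Qiushibaike
Analysis: the data to crawl is not all on one page.
1. Use a link extractor to extract all pagination links
2. Use a link extractor to extract all detail-page links
Create the file: scrapy genspider -t crawl sun2 www.xxxx.com
Spider file sun2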
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DetailItem, SunproItem

class Sun2Spider(CrawlSpider):
    name = 'sun2'
    # allowed_domains = ['www.xxx.com']http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
    start_urls = ['https://www.qiushibaike.com/']
    # Link extractor: extracts the links matching the given rule (allow="regex"),
    # e.g. <a href="/8hr/page/2/" rel="nofollow">
    link = LinkExtractor(allow=r'/8hr/page/\d+/')
    link_detail = LinkExtractor(allow=r'/article/\d+')
    # print(link_detail.link_extractor)
    rules = (
        # Rule parser: parses the links extracted by the link extractor with the given callback
        Rule(link, callback='parse_item', follow=False),
        # follow=True: keeps applying the link extractor to the pages behind the links it extracted
        Rule(link_detail, callback='parse_detail')
    )
    # http://wz.sun0769.com/html/question/201907/421001.shtml
    # http://wz.sun0769.com/html/question/201907/420987.shtml

    # Parse the entry number and the entry title
    # Data cannot be passed between the two parse methods below (no request meta)!
    # Their results therefore cannot go into one item; store them in two separate items instead
    def parse_item(self, response):
        # Note: the xpath expression must not contain a tbody tag
        print(response)
        tr_list = response.xpath('//*[@id="content"]/div/div[2]/div/ul/li')
        for tr in tr_list:
            # //*[@id="qiushi_tag_123792375"]/div/div/a/span//*[@id="qiushi_tag_123792375"]/div/div/a
            new_num = tr.xpath('./div[1]/div[1]/a/span/text()').extract_first()
            new_title = tr.xpath('./div[1]/a[1]/text()').extract_first()
            # print('new_num',new_num)
            # print('new_title',new_title)
            item = SunproItem()
            item['title'] = new_title
            item['new_num'] = new_num
            yield item

    # Parse the entry content and the entry id: /html/body/div[3]/div[2]/div[2]/div[2]/pre
    # //*[@id="articleSideLeft"]/a/div[1]/span[1]
    def parse_detail(self, response):
        new_id = response.xpath('//*[@id="articleSideLeft"]/a/div[1]/span[1]/text()').extract_first()
        new_content = response.xpath('//*[@id="single-next-link"]/div[1]//text()').extract()
        new_content = ''.join(new_content)
        # print('new_id',new_id)
        # print('new_content',new_content)
        item = DetailItem()
        item['content'] = new_content
        item['new_id'] = new_id
        yield item
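The two allow patterns are ordinary regular expressions that are matched against each link's URL, so they can be tried out with the re module before being put into a LinkExtractor. A small sketch; the sample URLs are hypothetical and only shaped like the site's links:

```python
import re

page_pattern = re.compile(r'/8hr/page/\d+/')
detail_pattern = re.compile(r'/article/\d+')

# Hypothetical links of the kinds that could appear on a list page.
links = [
    'https://www.qiushibaike.com/8hr/page/2/',
    'https://www.qiushibaike.com/article/123792375',
    'https://www.qiushibaike.com/hot/',
]

pages = [u for u in links if page_pattern.search(u)]
details = [u for u in links if detail_pattern.search(u)]
print(pages)
print(details)
```

A link that matches neither pattern, such as the last one, is simply ignored by both rules.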
Persistence pipeline file pipelines.py
from itemadapter import ItemAdapter

class SunproPipeline:
    fp = None

    # Overrides a parent-class hook: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./111111111111.txt', 'w', encoding='utf-8')

    # Dedicated to handling item objects
    # Receives the item objects submitted by the spider file
    # Called once for every item received
    def process_item(self, item, spider):
        # Two item types arrive here; tell them apart by class name
        if item.__class__.__name__ == 'DetailItem':
            # print(item['new_id'],item['content'])
            self.fp.write('detail id' + ':' + item['new_id'] + '\n')
            self.fp.write('detail content' + ':' + item['content'] + '\n')
        else:
            # print(item['new_num'],item['title'])
            self.fp.write('id' + ':' + item['new_num'] + '\n')
            self.fp.write('title' + ':' + item['title'] + '\n')
        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()
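The branch on the item's class can also be written with isinstance, and formatting with %s avoids a TypeError when a field is None (extract_first returns None when nothing matches). The sketch below uses plain dict subclasses as stand-ins for the two scrapy.Item classes so it runs without Scrapy; format_item is a hypothetical helper name.

```python
class SunproItem(dict):
    """Stand-in for the list-page item (fields: title, new_num)."""

class DetailItem(dict):
    """Stand-in for the detail-page item (fields: new_id, content)."""

def format_item(item):
    """Return the text block a pipeline would write for this item."""
    if isinstance(item, DetailItem):
        return 'detail id: %s\ndetail content: %s\n' % (item['new_id'], item['content'])
    return 'id: %s\ntitle: %s\n' % (item['new_num'], item['title'])

print(format_item(DetailItem(new_id='42', content='some text')))
print(format_item(SunproItem(new_num=None, title='a title')))
```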
Item class file items.py
import scrapy

class SunproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    new_num = scrapy.Field()

class DetailItem(scrapy.Item):
    new_id = scrapy.Field()
    content = scrapy.Field()
Configuration file settings.py
BOT_NAME = 'sunPro'
SPIDER_MODULES = ['sunPro.spiders']
NEWSPIDER_MODULE = 'sunPro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 Edg/86.0.622.69'
LOG_LEVEL = 'ERROR'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,
}