Table of Contents
[Hands-on 1] High-resolution images from ChinaZ (sc.chinaz.com) (anti-crawling via image lazy loading, though I didn't actually run into it!)
I. Image scraping with ImagesPipeline
Strings: just parse them with XPath, submit them to the pipeline, and persist them.
Images: XPath only gives you the src attribute; a separate request to that image URL is needed to fetch the binary image data.
What ImagesPipeline does:
You only need to parse the img src value and submit it to the pipeline; the pipeline then sends a request to that src, fetches the binary image data, and persists it.
Usage flow:
1. Parse the data.
2. Submit an item holding the image URL to the designated pipeline class.
3. Write your own pipeline class derived from ImagesPipeline and override three methods:
get_media_requests: send a request for the image based on its URL
file_path: specify the storage path of the image
item_completed: return the item so it is passed on to the next pipeline class
4. In settings.py, specify the image storage directory (see the sketch below):
IMAGES_STORE = './imgs'  # the directory does not need to exist in advance
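A minimal settings.py sketch for registering the pipeline (the module path imgsPro is an assumption based on the class names below; use the real project name):
# settings.py (sketch)
ITEM_PIPELINES = {
    "imgsPro.pipelines.ImgsproPipeline": 300,  # the custom ImagesPipeline subclass
}
IMAGES_STORE = './imgs'  # storage directory; it does not need to exist in advance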
[Hands-on 1] High-resolution images from ChinaZ (sc.chinaz.com) (anti-crawling via image lazy loading, though I didn't actually run into it!)
img.py
import scrapy
from ..items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = "img"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://sc.chinaz.com/tupian/"]

    def parse(self, response):
        all_src = response.xpath('//div[@class="item"]/img/@data-original').extract()
        # https://scpic.chinaz.net/files/default/imgs/2023-07-10/8c21efcadf0e9a30_s.jpg
        all_name = response.xpath('//div[@class="item"]/img/@alt').extract()
        for i in range(len(all_src)):
            print('len(all_src)', len(all_src))
            print(i)
            download_url = 'https:' + all_src[i]
            name = all_name[i]
            # print(name, download_url)
            item = ImgsproItem()
            item['name'] = name
            item['download_path'] = download_url
            # Don't forget this: without it the pipeline receives nothing.
            # And it must be yield! return would exit parse() after the first
            # iteration, so the loop would be pointless.
            yield item
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# class ImgsproPipeline:
#     def process_item(self, item, spider):
#         return item
from scrapy.pipelines.images import ImagesPipeline
import scrapy
# Note: ImagesPipeline requires the Pillow package (pip install pillow).

# Subclass ImagesPipeline ourselves.
# Since the default pipeline class above is commented out, the custom class keeps
# the same name so the existing ITEM_PIPELINES entry still resolves; otherwise
# Scrapy raises an error when loading the pipeline.
class ImgsproPipeline(ImagesPipeline):
    # Send a request for the image data based on the image URL.
    def get_media_requests(self, item, info):
        print(item['download_path'])
        yield scrapy.Request(item['download_path'])

    # Specify the image storage path (relative to IMAGES_STORE).
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = item['name'] + '.png'
        return imgName

    def item_completed(self, results, item, info):
        return item  # pass the item on to the next pipeline class
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    download_path = scrapy.Field()
settings.py
LOG_LEVEL = 'WARNING'  # only show warnings and errors, to keep the console output readable
`yield item` and `return item` behave differently in Scrapy.
- `yield item`: the `yield` keyword turns the callback into a generator. When a spider callback returns a generator, Scrapy iterates it and processes each generated item in turn, which keeps memory usage low and lets the engine schedule the work asynchronously.
- `return item`: the `return` keyword ends the callback immediately and hands back a single item. Inside a loop this means only the first iteration ever runs, so only one item reaches the pipelines and the rest are never produced.
The recommended pattern in Scrapy is therefore `yield item`: every item is handed to the framework and processed, and the asynchronous machinery is used efficiently. Use `return item` only when you know the callback produces exactly one item.
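To make the difference concrete, here is a minimal hypothetical sketch (the callbacks and the dict items are illustrative only, not taken from the project above):
# Variant A (yield): the callback becomes a generator and every item is processed.
def parse(self, response):
    for i in range(3):
        yield {"index": i}   # items 0, 1 and 2 all reach the pipelines

# Variant B (return): the callback exits on the first loop iteration.
def parse(self, response):
    for i in range(3):
        return {"index": i}  # only item 0 is produced; the loop never continues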
II. Middleware
Spider middleware: sits between the engine and the spider.
Downloader middleware (the important one): sits between the engine and the downloader and can intercept every request and response in the project.
1. Intercepting requests (I did not manage to get this experiment working):
(1) UA spoofing: done in process_request
(2) Proxy IP: done in process_exception, followed by return request so the corrected request is resent
middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import random
# The spider middleware class is not used in this example.
class MiddleproSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
# The downloader middleware class below is the important part.
class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # Not essential here; can be removed.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    # Intercept every request.
    def process_request(self, request, spider):
        # UA spoofing (note the header name is User-Agent, with a hyphen).
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None

    # Intercept every response.
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercept requests whose download raised an exception.
    def process_exception(self, request, exception, spider):
        # Attach a proxy IP, e.g. request.meta['proxy'] = 'http://ip:port'
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        return request  # resend the corrected request

    # Not essential here; can be removed.
    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
middle.py
import scrapy


class MiddleSpider(scrapy.Spider):
    # Crawl a Baidu search for "ip".
    name = "middle"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://www.baidu.com/s?wd=ip"]
    # https://ip.900cha.com/

    def parse(self, response):
        print(response)
        print("____________")
        page_text = response.text
        with open('./ip.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)
        # print(page_text)

# Run with: scrapy crawl middle
settings.py: uncomment (enable) the downloader middleware.
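A sketch of the corresponding settings.py entry (the module path middlePro.middlewares is an assumption based on the class names above; use the real project name):
# settings.py (sketch): enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    "middlePro.middlewares.MiddleproDownloaderMiddleware": 543,
}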
2. Intercepting responses
(1) Tampering with the response data / replacing the response object
[Hands-on 2] Scraping NetEase News (news.163.com)
wangyi.py
import scrapy
from selenium import webdriver
from ..items import WangyiproItem


class WangyiSpider(scrapy.Spider):
    name = "wangyi"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://news.163.com/"]

    # Instantiate one browser object for the whole spider.
    def __init__(self):
        self.bro = webdriver.Chrome()

    # Stores the URLs of the selected section pages.
    model_urls = []

    # Parse the section URLs from the home page.
    def parse(self, response):
        all_name = response.xpath('//div[@class="index_head"]/div[@class="bd"]//ul/li/a/text()').extract()
        all_url = response.xpath('//div[@class="index_head"]/div[@class="bd"]//ul/li/a/@href').extract()
        # print(all_name, all_url)
        # ['首页', '国内', '国际', '数读', '军事', '航空', '传媒科技研究院', '政务', '公益', '媒体', '王三三']
        # Keep 国内 (Domestic) and 国际 (International).
        alist = [1, 2]
        for index in alist:
            model_url = all_url[index]
            self.model_urls.append(model_url)
        print(self.model_urls)
        # Send a request for each section page in turn.
        for url in self.model_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    # Parse the title and detail-page URL of every article in a section.
    def parse_model(self, response):
        # The content of each section page is loaded dynamically.
        all_title = response.xpath('//div[@class="news_title"]/h3/a/text()').extract()
        all_detail_url = response.xpath('//div[@class="news_title"]/h3/a/@href').extract()
        for i in range(len(all_title)):
            item = WangyiproItem()
            item['title'] = all_title[i]
            detail_url = all_detail_url[i]
            print("___________________")
            print(all_title[i], detail_url)
            # Request the news detail page, passing the item along via meta.
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    # Parse the news detail page.
    def parse_detail(self, response):
        content = response.xpath('//div[@class="post_body"]/p/text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        print("++++++++++++++++++++++++")
        # print(content)
        yield item  # hand the item to the pipeline for persistence

    # Quit the browser when the spider closes (Scrapy calls a method named `closed` automatically).
    def closed(self, reason):
        self.bro.quit()
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class WangyiproPipeline:
    def process_item(self, item, spider):
        # print(item)
        # Write each article to its own text file under ./wangyi/ (the directory must already exist).
        downLoad_path = './wangyi/' + item['title'] + '.txt'
        with open(downLoad_path, 'w', encoding='utf-8') as fp:
            fp.write(item['title'] + '\n' + item['content'])
        return item
middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
# Imports used to build the replacement response below.
from scrapy.http import HtmlResponse
from time import sleep
class WangyiproSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
class WangyiproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # @classmethod
    # def from_crawler(cls, crawler):
    #     # This method is used by Scrapy to create your spiders.
    #     s = cls()
    #     crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    #     return s

    def process_request(self, request, spider):
        return None

    # Intercept the responses of the section pages and tamper with them.
    def process_response(self, request, response, spider):  # spider is the spider instance
        # Pick out the responses that need replacing:
        #   the URL identifies the request, and the request identifies the response.
        # The news lists on the section pages are loaded dynamically, so we fetch
        # them conveniently with Selenium.
        # The browser is instantiated in the spider (not here, because this method
        # runs for every response); we grab it through the spider object.
        bro = spider.bro
        # model_urls also comes from the spider.
        if request.url in spider.model_urls:
            # This response belongs to one of the section pages.
            # Build a new response object that contains the dynamically loaded
            # news data and return it in place of the old one.
            bro.get(request.url)
            sleep(2)
            page_text = bro.page_source  # contains the dynamically loaded news list
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # Any other response is passed through unchanged.
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    # def spider_opened(self, spider):
    #     spider.logger.info("Spider opened: %s" % spider.name)
settings.py
Remember to enable the item pipeline and the downloader middleware in settings.py.
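A sketch of the entries to enable (the module path wangyiPro is an assumption based on the class names above; use the real project name):
# settings.py (sketch)
ITEM_PIPELINES = {
    "wangyiPro.pipelines.WangyiproPipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {
    "wangyiPro.middlewares.WangyiproDownloaderMiddleware": 543,
}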
Project file structure