Scraping dynamically loaded pages with the Scrapy framework
The target this time is images. Start by opening the site, browsing to the gallery listing, and viewing the page source.
Clearly, scraping the title links statically won't work. So open F12 / Network / XHR and look for the response that carries the listing links.
Find that request's URL. The usual approach is to design the spider around the page URL as the start URL; here the XHR request URL is used as the start URL directly.
The title information is buried inside that blob of text and can't be pulled out with a plain XPath, so I first extract everything with text() and then sift out the useful parts with the re module:
def parse(self, response):
    qqq = response.xpath('//text()').extract()[0]
    jieguo = re.findall(r'"id":.*?,', qqq)
    list = []
    for shuzu in jieguo:
        qwe = shuzu.strip('"id":,')  # second cleaning pass: leaves the bare id
        list.append(qwe)
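The extraction step can be sanity-checked outside the spider. A minimal sketch, assuming a payload shaped roughly like the site's JSON response (the sample data here is made up):

```python
import re

# Hypothetical XHR response body; the real payload's exact shape is an assumption.
payload = '[{"id":101,"title":"a"},{"id":102,"title":"b"}]'

# Same technique as above: grab every '"id":...,' fragment,
# then strip away the key characters to leave the bare id.
matches = re.findall(r'"id":.*?,', payload)
ids = [m.strip('"id":,') for m in matches]
print(ids)  # ['101', '102']
```

Since the response body is JSON, json.loads() would be the more robust route, but the regex works fine for a quick-and-dirty crawl.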
I only extracted the id here because, when I opened an image page, I found the image links are loaded dynamically as well; same old trick, check the XHR tab.
Looking at that request's URL, the plan is to iterate over the ids to reach every listing entry on the current page. Reference code:
for i in list:
    url = 'https://www.fml6.com/appapi/h5/getInforDetailById/{}/img'.format(i)
    yield scrapy.Request(url, callback=self.erci)
A callback is used here: for each url in the loop, the spider enters that page and extracts the image download links.
def erci(self, response):
    www = response.xpath('//img/@src').extract()
    list1 = []
    for image_url2 in www:
        aw = image_url2.strip('\\"/')  # the raw image links need filtering
        list1.append(aw)
    list2 = []
    for img in list1:
        url2 = 'https://img.fu3k7.com/' + img  # prepend the image host after filtering
        list2.append(url2)
    item = Eva02Item()
    item['image_url'] = list2  # save
    return item
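The cleaning logic can be verified on its own. The raw @src values below are assumptions, mimicking how escaped links might appear in the response text:

```python
# Hypothetical raw @src values as they might appear in the escaped response text.
raw_srcs = ['\\"/uploadimg/a.jpg\\"', '\\"/uploadimg/b.jpg\\"']

cleaned = [s.strip('\\"/') for s in raw_srcs]            # drop stray \ " / from both ends
full = ['https://img.fu3k7.com/' + s for s in cleaned]   # re-attach the image host
print(full[0])  # https://img.fu3k7.com/uploadimg/a.jpg
```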
Next up is paginating the listing. The page links, again, are not in the source, so I enumerated the page numbers with the range() function. See the code:
for a in range(1, 27):
    fanye = 'https://www.fml6.com/appapi/h5/getDetailByCategorySub/20/{}/20'.format(a)
    yield scrapy.Request(fanye, callback=self.parse)
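The URL construction above is easy to check in isolation; range(1, 27) produces page numbers 1 through 26:

```python
# Rebuild the pagination URLs outside the spider to verify the range.
template = 'https://www.fml6.com/appapi/h5/getDetailByCategorySub/20/{}/20'
pages = [template.format(a) for a in range(1, 27)]  # page numbers 1..26
print(len(pages))  # 26
print(pages[0])    # https://www.fml6.com/appapi/h5/getDetailByCategorySub/20/1/20
```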
This final callback completes the crawl of the first 26 pages. Note that this block also lives inside the parse function.
The full source is attached below.
spider.py
from eva02.items import Eva02Item
import scrapy
import re
class QweSpider(scrapy.Spider):
    name = 'qwe'
    allowed_domains = ['www.fml6.com']
    start_urls = ['https://www.fml6.com/appapi/h5/getDetailByCategorySub/20/1/20']

    def parse(self, response):
        qqq = response.xpath('//text()').extract()[0]
        jieguo = re.findall(r'"id":.*?,', qqq)
        list = []
        for shuzu in jieguo:
            qwe = shuzu.strip('"id":,')
            list.append(qwe)
        for i in list:
            url = 'https://www.fml6.com/appapi/h5/getInforDetailById/{}/img'.format(i)
            yield scrapy.Request(url, callback=self.erci)
        for a in range(1, 27):
            fanye = 'https://www.fml6.com/appapi/h5/getDetailByCategorySub/20/{}/20'.format(a)
            yield scrapy.Request(fanye, callback=self.parse)
        for e in range(20, 28):  # enumerate the category ids to crawl every gallery on the site
            mulu = 'https://www.fml6.com/appapi/h5/getDetailByCategorySub/{}/1/20'.format(e)
            yield scrapy.Request(mulu, callback=self.parse)

    def erci(self, response):
        www = response.xpath('//img/@src').extract()
        list1 = []
        for image_url2 in www:
            aw = image_url2.strip('\\"/')
            list1.append(aw)
        list2 = []
        for img in list1:
            url2 = 'https://img.fu3k7.com/' + img
            list2.append(url2)
        item = Eva02Item()
        item['image_url'] = list2
        return item
pipelines.py
import scrapy
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline

class Eva02Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for Image in item['image_url']:
            yield scrapy.Request(Image)

    def item_completed(self, results, item, info):
        return item
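For reference, item_completed receives results as a list of (success, detail) tuples, one per requested image. The simulation below (the detail dicts are illustrative, not real download output) shows how the storage paths of successful downloads could be collected:

```python
# Simulated `results` in the shape ImagesPipeline passes to item_completed:
# a (success, detail) tuple per image request.
results = [
    (True, {'url': 'https://img.fu3k7.com/a.jpg', 'path': 'full/a.jpg', 'checksum': 'abc'}),
    (False, Exception('download failed')),  # a failed download
]

# Keep only the storage paths of images that downloaded successfully.
paths = [detail['path'] for ok, detail in results if ok]
print(paths)  # ['full/a.jpg']
```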
settings.py
ITEM_PIPELINES = {
    'eva02.pipelines.Eva02Pipeline': 1,
}
DOWNLOAD_DELAY = 0.5
IMAGES_STORE = r'D:\SEELE\eva02\QQQ'  # raw string so the backslashes are taken literally
IMAGES_URLS_FIELD = 'image_url'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
items.py
import scrapy

class Eva02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_url = scrapy.Field()
    image = scrapy.Field()