Crawling a recipe site with Scrapy, and fetching fields loaded asynchronously via AJAX

I've recently been wanting to build a recipe app, so I needed a crawler to collect a large amount of recipe data. Along the way I learned a series of crawling-related topics, which I'd like to share here.

The domain of the site we are going to crawl is "home.meishichina.com".

The recipe section of this site has many categories. We'll crawl the "热菜" (hot dishes) category; crawling any other category only requires changing the category part of the URL:

'https://home.meishichina.com/recipe/recai/page/2/'

Since we need to crawl second-level (detail) pages, we use a CrawlSpider. The first page of a category uses a different URL pattern from the other pages, so for convenience we simply start from page 2. The site holds a huge amount of data, so skipping one page makes no real difference; if you want the first page as well, you can work out the extra regex rule yourself.

  • Run the following in a terminal:

scrapy startproject recipe

to create the recipe project, then cd into it.

  • Next run

scrapy genspider -t crawl eat home.meishichina.com

to generate the eat spider from the crawl template.
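The generated eat.py is roughly the following skeleton (the exact contents vary a little between Scrapy versions; the allow pattern is just the template placeholder). The rules tuple is what we fill in next:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EatSpider(CrawlSpider):
    name = 'eat'
    allowed_domains = ['home.meishichina.com']
    start_urls = ['http://home.meishichina.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item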

  • Fill in the LinkExtractor rule; for this site it is:

rules = (
    Rule(LinkExtractor(allow=r'/recipe/recai/page/\d+/'), callback='parse_item', follow=True),
)

That is, the number after page/ in the URL is incremented each time a new list page is followed.

  • On the first-level (list) page, use XPath to extract the dish name, the cover image URL and the link to the second-level page (the request that forwards them is shown right after the snippet):

li_list = response.xpath('//*[@id="J_list"]/ul/li')
for li in li_list:
    # dish name
    name = li.xpath('./div/a/@title').extract_first()
    # cover image URL
    src = li.xpath('.//img/@data-src').extract_first()
    if src is not None:
        src = src.split('?')[0]
    # second-level (detail) page URL
    url = li.xpath('./div[2]/h2/a/@href').extract_first()
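The name and cover image are then handed to the second-level request through meta so that parse_second can read them back; this is essentially the request yielded in the complete spider at the end of the post (which additionally passes the recipe id):

    yield scrapy.Request(url=url, callback=self.parse_second,
                         meta={'name': name, 'src': src})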
  • On the second-level page, again use XPath to extract the introduction, main ingredients, auxiliary ingredients, seasonings, tags, text steps, category, step images, and the like and favorite counts:

    def parse_second(self, response):
        # print(response.body.decode())
        # print(pq(response).text())
        # receive the dish name and cover image URL passed from the first-level page
        name = response.meta['name']
        img_src = response.meta['src']
        # introduction
        brief = response.xpath('//*[@id="block_txt1"]/text()').extract_first()
        if not brief:
            brief = ''
        # main ingredients
        main_mater_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[1]/div/ul/li')
        main_mater = ''
        for li in main_mater_list:
            mater = li.xpath('.//b/text()').extract_first()
            main_mater = main_mater + mater + '/'
        # auxiliary ingredients
        price_mater_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[2]/div/ul/li')
        price_mater = ''
        for li in price_mater_list:
            mater = li.xpath('.//b/text()').extract_first()
            price_mater = price_mater + mater + '/'
        # seasonings
        seasoning_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[3]/div/ul/li')
        seasoning = ''
        for li in seasoning_list:
            mater = li.xpath('.//b/text()').extract_first()
            seasoning = seasoning + mater + '/'
        # tags: taste, cooking method, time needed, difficulty
        tag_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[3]/ul/li')
        taste = tag_list[0].xpath('.//a/@title').extract_first()
        craft = tag_list[1].xpath('.//a/@title').extract_first()
        # the last two tags are not always present, so fall back to defaults
        time_cost = ''
        level = '一般'
        if len(tag_list) > 2:
            time_cost = tag_list[2].xpath('.//a/@title').extract_first()
        if len(tag_list) > 3:
            level = tag_list[3].xpath('.//a/@title').extract_first()
        # text steps
        steps_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[5]/ul/li')
        steps = ''
        for li in steps_list:
            step = li.xpath('./div[2]/text()').extract_first()
            steps = steps + step + '/'
        # category the recipe belongs to
        category_list = response.xpath('//*[@id="path"]/a')
        category = ''
        for i in range(2, len(category_list)):
            a_temp = category_list[i].xpath('./text()').extract_first()
            category = category + a_temp + '/'
        # step images
        img_steps_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[5]/ul/li')
        img_steps = ''
        for li in img_steps_list:
            if li.xpath('.//img/@src').extract_first() is not None:
                img_step = li.xpath('.//img/@src').extract_first()
                img_step = img_step.split('?')[0]
                img_steps = img_steps + img_step + ' '
            else:
                break

Getting the like and favorite counts, however, turned out differently: at first the XPath I used returned an empty result.

likes=response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[10]/ul/li[1]/a/span/text()').extract_first()

Since this was my first crawler, I instinctively assumed the XPath expression was wrong, so I rewrote it over and over with no luck. Then I remembered having read something about dynamically loaded content, so I went looking for the AJAX request that carries the like and favorite counts, and sure enough it exists.
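The request in question, the same endpoint used later in the code, looks like this (the id parameter is each recipe's numeric id):

https://home.meishichina.com/ajax/ajax.php?ac=user&op=getrecipeloadinfo&id=<recipe id>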

At that point I was brand new to crawling and had no patience to study how Scrapy actually works; I only cared that things ran. The request URL needs each recipe's id, so I wondered whether I could extract the id every time I collected a second-level link. I gave up after a brief attempt, though, because print(url) dumped a huge pile of URLs, and digging the current recipe's link out of that seemed laborious and far from guaranteed. I still couldn't be bothered to understand Scrapy's workflow; had I sat down and read about it then, I wouldn't have wasted so much time later. So instead I decided to render the page with Selenium or Splash and extract the values afterwards.

(Screenshot: the output of print(url), a long list of URLs.)

  • On the first day I tried Splash. Using Splash means getting Docker configured and installing the scrapy-splash package in PyCharm. Once all that was set up, rendering with Splash kept failing with http://localhost:8085/render.html 502 bad gateway. Far fewer people use Splash than Selenium, so community resources are scarce; hardly anyone had asked about this error and I found no solution anywhere. I tried setting a proxy and an IP pool, neither helped, and Docker broke down along the way. After a whole day of fiddling I gave up and switched to Selenium. The relevant snippet is below, together with a sketch of the scrapy-splash settings.

In the spider:

yield SplashRequest(url=i, callback=self.parse_second,
                    args={'lua_source': script, 'wait': 20, 'timeout': 20, 'http_method': 'POST'},
                    meta={'name': name, 'src': src})
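For reference, scrapy-splash also needs its middlewares and dupefilter enabled in settings.py. A minimal sketch following the scrapy-splash README, assuming Splash is listening on port 8085 as in the error message above:

SPLASH_URL = 'http://localhost:8085'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'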
  • On the second day I switched to Selenium, and in the process I slowly came to understand how Scrapy actually works. Selenium could render the page, but it was painfully slow, and even after rendering I still couldn't extract the like and favorite counts; sometimes it also threw errors such as "the target machine actively refused the connection". I spent another half day on proxies and IP issues, only to find they had nothing to do with it. The rendered HTML clearly showed a value inside the <span>6</span> element, yet I simply couldn't extract it. I had dug into Splash and Selenium rendering hoping to learn something extra, and I did learn a bit, but got no results, so I dropped Selenium as well.

In the Selenium spider:

chrome_options = Options()
# headless Chrome mode
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
# path to the chromedriver executable
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='chromedriver.exe')

In middlewares (the registration needed in settings.py is sketched after the code):

import time

from scrapy.http import HtmlResponse


class EatSpiderMiddleware(object):
    def process_request(self, request, spider):
        # reuse the headless Chrome driver created on the spider
        driver = spider.driver
        # only render second-level (detail) pages; the list pages contain 'page' in their URL
        if 'page' not in request.url:
            driver.get(request.url)
            time.sleep(1)
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            time.sleep(0.5)
            html = driver.page_source
            # hand the rendered HTML back to Scrapy instead of downloading the page again
            return HtmlResponse(url=request.url, body=html.encode('utf-8'), encoding='utf-8',
                                request=request)

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        return None

  • Having picked up some of Scrapy's internals while learning how to combine it with Selenium, I realized my initial understanding had been wrong. So I went straight back to my original idea of extracting the recipe id and composing the AJAX request URL, and within fifteen minutes or so everything worked, likes and favorites included.

How Scrapy works:

The spider takes the URLs to crawl and each request goes through the engine into the scheduler's queue; requests are dequeued one by one and handed to the downloader, which fetches the response from the network; the response then travels back through the downloader middlewares to the engine and on to the spider for parsing, and finally the items are persisted by the item pipeline. In other words, the second-level pages are crawled one request at a time.

With that picture in mind, getting each recipe's id out of its URL is easy; the id is then appended to the AJAX endpoint and the request is sent with urllib to fetch the dynamically loaded data.
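A minimal sketch of the idea (the detail URL below is made up purely to illustrate the digit filter; the real spider pulls it from the list page):

# hypothetical detail-page URL, only used to illustrate the digit filter
url = 'https://home.meishichina.com/recipe/123456789.html'
# keep only the digits: this is the recipe id the AJAX endpoint expects
num = ''.join(filter(str.isdigit, url))
url_ajax = 'https://home.meishichina.com/ajax/ajax.php?ac=user&op=getrecipeloadinfo&id=' + num

In the project the id is extracted in parse_item, passed along in request.meta, and used in parse_second like this: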

        # AJAX endpoint for the dynamically loaded like and favorite counts
        url_ajax = 'https://home.meishichina.com/ajax/ajax.php?ac=user&op=getrecipeloadinfo&id=' + response.meta['num']
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        }
        request = urllib.request.Request(url=url_ajax, headers=headers)
        # send the request with urllib
        ajax_response = urllib.request.urlopen(request)
        # read the response body
        content = ajax_response.read().decode('utf-8')
        # parse the JSON string into a dict
        content = json.loads(content)
        # number of likes
        likes = content['likenum']
        # number of favorites
        fav = content['ratnum']
        recipe_data = RecipeItem(name=name, img_src=img_src, brief=brief, main_mater=main_mater,
                                 price_mater=price_mater, seasoning=seasoning, taste=taste, craft=craft,
                                 time_cost=time_cost, level=level, steps=steps, img_steps=img_steps, category=category,
                                 likes=likes, fav=fav)
        yield recipe_data
  • Complete code

eat.py

import json
import re
import urllib
import urllib.request
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from recipe.items import RecipeItem
from scrapy import Spider, Request
from scrapy_splash import SplashRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time


class EatSpider(CrawlSpider):
    name = 'eat'
    allowed_domains = ['home.meishichina.com']
    start_urls = ['https://home.meishichina.com/recipe/recai/page/2/']

    rules = (
        Rule(LinkExtractor(allow=r'/recipe/recai/page/\d+/'), callback='parse_item', follow=True),
    )
    # chrome_options = Options()
    # chrome_options.add_argument('--headless')  # headless Chrome mode
    # chrome_options.add_argument('--disable-gpu')
    # chrome_options.add_argument('--no-sandbox')
    # # path to the chromedriver executable
    # driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='chromedriver.exe')

    def parse_item(self, response):
        li_list = response.xpath('//*[@id="J_list"]/ul/li')
        for li in li_list:
            # dish name
            name = li.xpath('./div/a/@title').extract_first()
            # cover image URL
            src = li.xpath('.//img/@data-src').extract_first()
            if src is not None:
                src = src.split('?')[0]
            # second-level (detail) page URL
            url = li.xpath('./div[2]/h2/a/@href').extract_first()
            print(url)
            # extract the numeric id of the recipe (used for the AJAX request later)
            num = ''.join(list(filter(str.isdigit, url)))
            # pass the first-level data on to the detail-page request
            #     yield SplashRequest(url=i,callback=self.parse_second, args={'lua_source': script,'wait': 20, 'timeout': 20,'http_method': 'POST'}, meta={'name': name, 'src': src})
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name, 'src': src, 'num': num})

    # on the second-level page, extract the introduction, ingredients, seasonings, tags and cooking steps
    def parse_second(self, response):
        # print(response.body.decode())
        # print(pq(response).text())
        # receive the dish name and cover image URL passed from the first-level page
        name = response.meta['name']
        img_src = response.meta['src']
        # introduction
        brief = response.xpath('//*[@id="block_txt1"]/text()').extract_first()
        if not brief:
            brief = ''
        # main ingredients
        main_mater_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[1]/div/ul/li')
        main_mater = ''
        for li in main_mater_list:
            mater = li.xpath('.//b/text()').extract_first()
            main_mater = main_mater + mater + '/'
        # auxiliary ingredients
        price_mater_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[2]/div/ul/li')
        price_mater = ''
        for li in price_mater_list:
            mater = li.xpath('.//b/text()').extract_first()
            price_mater = price_mater + mater + '/'
        # seasonings
        seasoning_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/fieldset[3]/div/ul/li')
        seasoning = ''
        for li in seasoning_list:
            mater = li.xpath('.//b/text()').extract_first()
            seasoning = seasoning + mater + '/'
        # tags: taste, cooking method, time needed, difficulty
        tag_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[3]/ul/li')
        taste = tag_list[0].xpath('.//a/@title').extract_first()
        craft = tag_list[1].xpath('.//a/@title').extract_first()
        # the last two tags are not always present, so fall back to defaults
        time_cost = ''
        level = '一般'
        if len(tag_list) > 2:
            time_cost = tag_list[2].xpath('.//a/@title').extract_first()
        if len(tag_list) > 3:
            level = tag_list[3].xpath('.//a/@title').extract_first()
        # text steps
        steps_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[5]/ul/li')
        steps = ''
        for li in steps_list:
            step = li.xpath('./div[2]/text()').extract_first()
            steps = steps + step + '/'
        # category the recipe belongs to
        category_list = response.xpath('//*[@id="path"]/a')
        category = ''
        for i in range(2, len(category_list)):
            a_temp = category_list[i].xpath('./text()').extract_first()
            category = category + a_temp + '/'
        # step images
        img_steps_list = response.xpath('/html/body/div[5]/div/div[1]/div[3]/div/div[5]/ul/li')
        img_steps = ''
        for li in img_steps_list:
            if li.xpath('.//img/@src').extract_first() is not None:
                img_step = li.xpath('.//img/@src').extract_first()
                img_step = img_step.split('?')[0]
                img_steps = img_steps + img_step + ' '
            else:
                break
        # AJAX endpoint for the dynamically loaded like and favorite counts
        url_ajax = 'https://home.meishichina.com/ajax/ajax.php?ac=user&op=getrecipeloadinfo&id=' + response.meta['num']
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        }
        request = urllib.request.Request(url=url_ajax, headers=headers)
        # send the request with urllib
        ajax_response = urllib.request.urlopen(request)
        # read the response body
        content = ajax_response.read().decode('utf-8')
        # parse the JSON string into a dict
        content = json.loads(content)
        # number of likes
        likes = content['likenum']
        # number of favorites
        fav = content['ratnum']
        recipe_data = RecipeItem(name=name, img_src=img_src, brief=brief, main_mater=main_mater,
                                 price_mater=price_mater, seasoning=seasoning, taste=taste, craft=craft,
                                 time_cost=time_cost, level=level, steps=steps, img_steps=img_steps, category=category,
                                 likes=likes, fav=fav)
        yield recipe_data
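Once the remaining project pieces below (items.py, pipelines.py and their registration in settings.py) are in place, the spider is started from the project root with:

scrapy crawl eat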

items.py

import scrapy


class RecipeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    img_src = scrapy.Field()
    brief = scrapy.Field()
    main_mater = scrapy.Field()
    price_mater = scrapy.Field()
    seasoning = scrapy.Field()
    taste = scrapy.Field()
    craft = scrapy.Field()
    time_cost = scrapy.Field()
    level = scrapy.Field()
    steps = scrapy.Field()
    img_steps = scrapy.Field()
    category = scrapy.Field()
    likes = scrapy.Field()
    fav = scrapy.Field()

pipelines.py

import json


class RecipePipeline:
    def open_spider(self, spider):
        self.fp = open('recipe.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line, with Chinese characters kept readable
        context = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.fp.write(context)
        return item

    def close_spider(self, spider):
        self.fp.close()
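For the pipeline to receive items it also has to be enabled in settings.py; a minimal sketch, assuming the default recipe/pipelines.py layout:

ITEM_PIPELINES = {
    'recipe.pipelines.RecipePipeline': 300,
}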
  • Summary

For any new topic it really pays to understand how it works under the hood before starting to build; otherwise you end up spending twice the effort for half the result. That was a hard lesson for me about not rushing for quick wins. If anyone can point out what I got wrong with Splash and Selenium, I'd be glad to hear it.

Follow me~
