Scraping Taobao Product Listings with Scrapy and Splash

1. Introduction to Splash

Splash is a lightweight, scriptable browser-rendering service with an HTTP API. Scrapy itself cannot execute JavaScript, so pages that build their content client-side (such as Taobao search results) are fetched through Splash instead.

2. Installing scrapy-splash

  • Install Docker
  • Run Splash through Docker:
    docker run -d -p 8050:8050 scrapinghub/splash
  • Install the Python library:
    pip3 install scrapy-splash
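With the container started as above, Splash serves its HTTP API on port 8050. A quick sanity check (assuming the default port mapping from the `docker run` command) might look like:

```shell
# Verify the Splash container is up
docker ps --filter "ancestor=scrapinghub/splash"

# Ask Splash to render a page; the response is the post-JavaScript HTML
curl 'http://localhost:8050/render.html?url=https://example.com&wait=2'
```

If the `curl` call returns rendered HTML, Splash is ready to receive requests from Scrapy.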

3. Wiring Scrapy to Splash

For connecting Scrapy to Splash, see: https://github.com/scrapy-plugins/scrapy-splash#configuration
For free proxies, see: http://www.66ip.cn/areaindex_35/index.html
For setting a proxy inside Splash, see:
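Following the scrapy-splash README linked above, the wiring boils down to a handful of additions to settings.py (SPLASH_URL here assumes the Docker port mapping from the install step):

```python
# settings.py -- scrapy-splash wiring, per the scrapy-plugins/scrapy-splash README

# Address of the running Splash instance (assumes docker -p 8050:8050)
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and handle cookies set during rendering
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate Splash requests that share the same arguments
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware request fingerprinting and HTTP caching
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

The middleware order numbers come straight from the README; SplashMiddleware must sit below HttpCompressionMiddleware so compressed Splash responses are decoded correctly.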

4. Writing the Spider

from scrapy import Spider
from urllib.parse import quote
from taobao.items import TaobaoItem
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    -- request headers sent with the initial page load
    local headers = {
    ["authority"] = "s.taobao.com",
    ["method"] = "GET",
    ["scheme"] = "https",
    ["accept"] = "*/*",
    ["accept-language"] = "zh-CN,zh;q=0.9",
    ["cookie"] = "",
    ["referer"] = "https://s.taobao.com/search?q=%E8%A3%A4%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306",
    ["sec-fetch-mode"] = "no-cors",
    ["sec-fetch-site"] = "same-origin",
    ["user-agent"] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    }

  assert(splash:go{args.url,headers=headers})
  assert(splash:wait(5))

  -- type the target page number into the pagination box, then submit
  local js = string.format('document.querySelector(".J_Input").value=%d', args.page)
  splash:evaljs(js)
  assert(splash:wait(1))
  splash:evaljs('document.querySelector(".J_Submit").click()')
  assert(splash:wait(5))

  return splash:html()
end
"""


class TaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['s.taobao.com']
    base_url = 'https://s.taobao.com/search?ie=utf8&q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 7})

    def parse(self, response):
        products = response.xpath('//div[@class="items"]/div')
        for product in products:
            item = TaobaoItem()
            item['shop'] = "".join(product.xpath('.//div[@class="shop"]/a//span/text()').extract()).strip()
            item['price'] = "".join(product.xpath('.//div[contains(@class, "price")]/strong/text()').extract()).strip()
            item['location'] = "".join(product.xpath('.//div[@class="location"]/text()').extract()).strip()
            item['deal'] = "".join(product.xpath('.//div[@class="deal-cnt"]/text()').extract()).strip()
            item['image'] = "".join(product.xpath('.//div[@class="pic"]//img/@src').extract()).strip()
            item['title'] = "".join(product.xpath('.//a[@class="J_ClickStat"]//text()').extract()).strip()
            yield item
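In start_requests, quote() percent-encodes the keyword before it is appended to base_url, which is why the referer header in the Lua script contains q=%E8%A3%A4%E5%AD%90. A minimal illustration:

```python
from urllib.parse import quote

base_url = 'https://s.taobao.com/search?ie=utf8&q='

# quote() percent-encodes the UTF-8 bytes of the keyword, so the
# Chinese word for "trousers" (裤子) becomes %E8%A3%A4%E5%AD%90
url = base_url + quote('裤子')
print(url)  # https://s.taobao.com/search?ie=utf8&q=%E8%A3%A4%E5%AD%90
```

KEYWORDS and MAX_PAGE are read from settings, so the same spider can sweep any list of search terms across any page range without code changes.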

5. Results

(Screenshots of the scraped product data.)
