Scraping Taobao Product Listings with Scrapy and Splash

1. Introduction to Splash

Splash is a lightweight, scriptable browser-rendering service with an HTTP API. Scrapy itself cannot execute JavaScript, so pages that build their content client-side (such as Taobao search results) are fetched through Splash instead.

2. Installing scrapy-splash

  • Install Docker
  • Run Splash through Docker:
    docker run -d -p 8050:8050 scrapinghub/splash
  • Install the Python library:
    pip3 install scrapy-splash
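With the container started as above, Splash serves its HTTP API on port 8050. A quick sanity check (assuming the default port mapping from the `docker run` command) might look like:

```shell
# Verify the Splash container is up
docker ps --filter "ancestor=scrapinghub/splash"

# Ask Splash to render a page; the response is the post-JavaScript HTML
curl 'http://localhost:8050/render.html?url=https://example.com&wait=2'
```

If the `curl` call returns rendered HTML, Splash is ready to receive requests from Scrapy.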

3. Wiring Scrapy to Splash

For connecting Scrapy to Splash, see: https://github.com/scrapy-plugins/scrapy-splash#configuration
For free proxies, see: http://www.66ip.cn/areaindex_35/index.html
For setting a proxy inside Splash, see:
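Following the scrapy-splash README linked above, the wiring boils down to a handful of additions to settings.py (SPLASH_URL here assumes the Docker port mapping from the install step):

```python
# settings.py -- scrapy-splash wiring, per the scrapy-plugins/scrapy-splash README

# Address of the running Splash instance (assumes docker -p 8050:8050)
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and handle cookies set during rendering
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate Splash requests that share the same arguments
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware request fingerprinting and HTTP caching
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

The middleware order numbers come straight from the README; SplashMiddleware must sit below HttpCompressionMiddleware so compressed Splash responses are decoded correctly.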

4. Writing the Spider

from scrapy import Spider
from urllib.parse import quote
from taobao.items import TaobaoItem
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    -- request headers sent with the initial page load
    local headers = {
    ["authority"] = "s.taobao.com",
    ["method"] = "GET",
    ["scheme"] = "https",
    ["accept"] = "*/*",
    ["accept-language"] = "zh-CN,zh;q=0.9",
    ["cookie"] = "",
    ["referer"] = "https://s.taobao.com/search?q=%E8%A3%A4%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306",
    ["sec-fetch-mode"] = "no-cors",
    ["sec-fetch-site"] = "same-origin",
    ["user-agent"] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    }

  assert(splash:go{args.url,headers=headers})
  assert(splash:wait(5))

  -- type the target page number into the pagination box, then submit
  local js = string.format('document.querySelector(".J_Input").value=%d', args.page)
  splash:evaljs(js)
  assert(splash:wait(1))
  splash:evaljs('document.querySelector(".J_Submit").click()')
  assert(splash:wait(5))

  return splash:html()
end
"""


class TaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['s.taobao.com']
    base_url = 'https://s.taobao.com/search?ie=utf8&q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 7})

    def parse(self, response):
        products = response.xpath('//div[@class="items"]/div')
        for product in products:
            item = TaobaoItem()
            item['shop'] = "".join(product.xpath('.//div[@class="shop"]/a//span/text()').extract()).strip()
            item['price'] = "".join(product.xpath('.//div[contains(@class, "price")]/strong/text()').extract()).strip()
            item['location'] = "".join(product.xpath('.//div[@class="location"]/text()').extract()).strip()
            item['deal'] = "".join(product.xpath('.//div[@class="deal-cnt"]/text()').extract()).strip()
            item['image'] = "".join(product.xpath('.//div[@class="pic"]//img/@src').extract()).strip()
            item['title'] = "".join(product.xpath('.//a[@class="J_ClickStat"]//text()').extract()).strip()
            yield item
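In start_requests, quote() percent-encodes the keyword before it is appended to base_url, which is why the referer header in the Lua script contains q=%E8%A3%A4%E5%AD%90. A minimal illustration:

```python
from urllib.parse import quote

base_url = 'https://s.taobao.com/search?ie=utf8&q='

# quote() percent-encodes the UTF-8 bytes of the keyword, so the
# Chinese word for "trousers" (裤子) becomes %E8%A3%A4%E5%AD%90
url = base_url + quote('裤子')
print(url)  # https://s.taobao.com/search?ie=utf8&q=%E8%A3%A4%E5%AD%90
```

KEYWORDS and MAX_PAGE are read from settings, so the same spider can sweep any list of search terms across any page range without code changes.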

5. Results

(Screenshots of the scraped product data.)
