Design and Implementation of a Full-Site Taobao Spider Based on Scrapy

My day job is data analysis, but I have never given up on web crawling, and I try to keep pursuing and respecting the craft.

This post implements a targeted full-site crawler on top of the Scrapy framework. When crawling Taobao I could not find any material covering the whole site, so the overall approach was pieced together from scattered articles combined with the book by 崔大 (Cui). Given the author's limited skill, this post only offers the crawling approach and the source code; suggestions for improvement are welcome.

Environment:

    Python 3

    Scrapy 1.3.3

    Scrapy-Splash

    Splash

    Nginx

    MySQL

Core Anti-Scraping Strategy

Because the author's budget and resources are limited, buying or maintaining an effective IP pool and cookie pool to get around Taobao's anti-scraping measures was not an option. Instead, this post takes the brute-force route of rendering the JavaScript with Splash. (Splash is implemented in Python on top of Twisted and Qt; Twisted and Qt give the service asynchronous processing, so combined with Scrapy's concurrency it keeps crawl throughput up.) Splash is deployed on several servers, and Nginx dispatches requests to them in round-robin fashion, relieving the pressure on any single Splash instance and providing load balancing, which ultimately makes efficient full-site crawling possible.
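
A minimal sketch of the Nginx side, assuming two hosts that each expose Splash on port 8050 (the addresses and ports below are placeholders); the default round-robin policy of an upstream block is what provides the balancing:

upstream splash_cluster {
    server 10.0.0.1:8050;   # Splash instance 1 (placeholder address)
    server 10.0.0.2:8050;   # Splash instance 2 (placeholder address)
}

server {
    listen 8050;
    location / {
        proxy_pass http://splash_cluster;   # round-robin across the Splash instances
    }
}

SPLASH_URL in the Scrapy settings then points at this Nginx listener instead of at any single Splash instance.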

Crawl strategy: breadth-first

Crawl entry point: https://www.taobao.com/tbhome/page/market-list

 

The crawler uses Taobao's category directory page as its entry point, renders the JavaScript with Splash to obtain the full markup, and then filters out every valid category-entry URL with a regular expression.

pattern = r'.*//s\.taobao\.com/list\?.*'  # "." and "?" are escaped so they match literally

The filter yields nearly 2,000 valid first-level category URLs. A custom Lua script is then driven through SplashRequest so that each page's JavaScript is rendered (the script appears in the spider below).
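
A quick sanity check of the filter pattern; the sample URL below is illustrative only, not taken from a real crawl:

import re

# same pattern the LinkExtractor uses below, with "." and "?" escaped so they match literally
pattern = r".*//s\.taobao\.com/list\?.*"
sample = "https://s.taobao.com/list?q=%E5%A5%B3%E8%A3%85&cat=16"  # hypothetical category list URL

print(bool(re.match(pattern, sample)))  # True -> the link would be kept as a category entry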

The overall full-site crawl flow is as follows:

      

 

          

Category directory page → product list page → product detail page → product review page

Spider:

#!/usr/bin/env python3

# -*- coding:utf-8 -*-

# Author:CCCCCold_kl

 

import os, sys

import scrapy

import json

import urllib.parse

import re

import datetime

from scrapy.linkextractors import LinkExtractor

from scrapy import Spider, Request

from urllib.parse import quote

from scrapysplashtest.items import ScrapysplashtestItem

from scrapy_splash import SplashRequest

 

 

lua_script = """

function main(splash)

    splash:go(splash.args.url)

    splash.images_enabled = false

    splash:wait(0.5)

    return splash:html()

end

"""

 

 

class TaobaoSpider(Spider):

    name = "ALLtaobao"

    allowed_domains = [

        "www.taobao.com",

        "detail.tmall.com",

        "rate.taobao.com",

        "s.taobao.com",

    ]

    base_url = "https://www.taobao.com/tbhome/page/market-list"

    header = {

        "Host": "s.taobao.com",

        "Connection": "keep-alive",

        "Cookie": "_uab_collina=152384731481486255434221; _umdata=70CF403AFFD707DF1A85F005364DC10B4EC91835EB1330570EBBEE6ED83206C5D22033AB874D1A3DCD43AD3E795C914C27AEEB05138BEBFEE2EBDC2C9B919A87",

        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",

        "Upgrade-Insecure-Requests": "1",

        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",

        "Accept-Encoding": "gzip, deflate, sdch",

        "Accept-Language": "zh-CN,zh;q=0.8",

    }

 

    def start_requests(self):  # override the default start URL

        url = self.base_url

        yield scrapy.Request(url, self.parse, dont_filter=True)

 

    def parse(self, response):

        # extract links to the category list pages

        pattern = r".*//s\.taobao\.com/list\?.*"  # "." and "?" escaped so they match literally

        le = LinkExtractor(allow=pattern)

        links = le.extract_links(response)

        print("Found %s list pages" % len(links))

        for i in links:

            print("-------------------->%s" % i.url)

            yield SplashRequest(

                i.url,

                callback=self.next_page,

                endpoint="execute",

                args={"lua_source": lua_script},

                dont_filter=True,

            )

 

    def next_page(self, response):

        # read the total page count, then paginate

        dirty_total = response.xpath(

            '//*[@id="listsrp-pager"]/div/div/div/div[1]/text()'

        ).extract_first()

        if dirty_total is not None:

            page_total = int(re.findall(r"\d+\.?\d*", dirty_total)[0])

        else:

            page_total = 5  # fall back to 5 pages when the total cannot be parsed

        print("开始获取下一页")

        for page in range(page_total + 1):

            page_url = response.url + "&s=" + str(page * 60)

            print("获取list:【%s】,第【%s】页。" % (response.url, page))

            yield SplashRequest(

                page_url,

                callback=self.parse_shop,

                endpoint="execute",

                args={"lua_source": lua_script},

                dont_filter=True,

            )

 

    def parse_shop(self, response):

        print(response.url)

        print("开始全量商品页")

        classification = re.findall(r"&q=(.*?)&", response.url)

        if classification:

            classification = urllib.parse.unquote(classification[0])

        else:

            classification = "无分类"

        products = response.xpath(

            '//div[@id="listsrp-itemlist"]//div[@class="items"][1]//div[contains(@class, "item")]'

        )

        print("解析列表页商品信息")

        for product in products:

            price = "".join(

                product.xpath('.//div[contains(@class, "price")]//text()').extract()

            ).strip()

            title = "".join(

                product.xpath('.//div[contains(@class, "title")]//text()').extract()

            ).strip()

            shop = "".join(

                product.xpath('.//div[contains(@class, "shop")]//text()').extract()

            ).strip()

            image = "".join(

                product.xpath(

                    './/div[@class="pic"]//img[contains(@class, "img")]/@data-src'

                ).extract_first()

            ).strip()

            deal_preson = product.xpath(

                './/div[contains(@class, "deal-cnt")]//text()'

            ).extract_first()

            location = product.xpath(

                './/div[contains(@class, "location")]//text()'

            ).extract_first()

            shop_id = product.css("div .pic a::attr('data-nid')").extract_first()

            shop_url = "https://detail.tmall.com/item.htm?id=" + str(shop_id)

            shop_info = {

                "classification": classification,

                "shop_url": shop_url,

                "shop_id": shop_id,

                "title": title,

                "shop": shop,

                "image": image,

                "price": price,

                "deal_preson": deal_preson,

                "location": location,

            }

            # print("price %s, title %s, shop %s" % (price, title, shop))

            print("商品url是:%s" % shop_url)

            yield SplashRequest(

                shop_url,

                callback=self.shop_info_parse,

                meta=shop_info,

                endpoint="execute",  # use the same Lua-driven rendering as the other requests

                args={"images": 0, "lua_source": lua_script},

                cache_args=["lua_source"],

                dont_filter=True,

            )

 

    def shop_info_parse(self, response):

        print("开始解析商品详情页")

        shop_id = response.meta.get("shop_id")

        shop_url = response.meta.get("shop_url")

        title = response.meta.get("title")

        shop = response.meta.get("shop")

        image = response.meta.get("image")

        price = response.meta.get("price")

        deal_preson = response.meta.get("deal_preson")

        location = response.meta.get("location")

        classification = response.meta.get("classification")

        comment_num = response.xpath(

            '//*[@id="J_ItemRates"]/div/span[2]/text()'

        ).extract_first()  # review count (Tmall layout)

        # If the review count comes back empty, the page uses the Taobao layout, so fall back

        # to the alternative XPaths below; whether a field is empty decides which logic applies.

        if comment_num is None:

            comment_num = response.xpath(

                '//*[@id="J_TabBar"]/li[2]/a/em/text()'

            ).extract_first()

            deal_30 = response.xpath(

                '//*[@id="J_Counter"]/div/div[2]/a/@title'

            ).extract_first()  # select @title directly; an attribute node has no text()

            original_preice = response.xpath(

                '//*[@id="J_StrPrice"]/em[2]/text()'

            ).extract_first()

        else:


            deal_30 = response.xpath(

                '//*[@id="J_DetailMeta"]/div[1]/div[1]/div/ul/li[1]/div/span[2]/text()'

            ).extract_first()  # 30-day sales volume

            original_preice = response.xpath(

                '//*[@id="J_DetailMeta"]/div[1]/div[1]/div/div[2]/dl[1]/dd/span/text()'

            ).extract_first()  # original (list) price

        store_describe = response.xpath(

            '//*[@id="shop-info"]/div[2]/div[1]/div[2]/span/text()'

        ).extract_first()  # shop "description" rating

        """淘宝店铺与天猫店铺抓取逻辑不一致,如抓取天猫店铺为None值,则改为淘宝抓取逻辑"""

        if store_describe is None:

            store_describe = response.xpath(

                '//*[@id="J_ShopInfo"]/div/div[2]/div/dl[1]/dd/a/text()'

            ).extract_first()

            store_service = response.xpath(

                '//*[@id="J_ShopInfo"]/div/div[2]/div/dl[2]/dd/a/text()'

            ).extract_first()

            store_logistics = response.xpath(

                '//*[@id="J_ShopInfo"]/div/div[2]/div/dl[3]/dd/a/text()'

            ).extract_first()

        else:


            store_service = response.xpath(

                '//*[@id="shop-info"]/div[2]/div[2]/div[2]/span/text()'

            ).extract_first()  # shop "service" rating

            store_logistics = response.xpath(

                '//*[@id="shop-info"]/div[2]/div[3]/div[2]/span/text()'

            ).extract_first()  # shop "logistics" rating

        store_time = response.xpath(

            '//*[@id="ks-component1974"]/div/div/div/div[2]/ul/li[3]/div/span[2]/text()'

        ).extract_first()  # how long the shop has been open

        inventory = response.xpath('//*[@id="J_EmStock"]/text()').extract_first()  # stock

        """判断库存抓取逻辑是天猫还是淘宝"""

        if inventory is None:

            inventory = response.xpath('//*[@id="J_SpanStock"]/text()').extract_first()


        """每页评论为20条,获取所需翻页数"""

        if comment_num is None:

            comment_num = 1

        comment_num = int(comment_num)

        shop_info = {

            "classification": classification,

            "shop_id": shop_id,

            "title": title,

            "shop": shop,

            "shop_url": shop_url,

            "image": image,

            "price": price,

            "deal_preson": deal_preson,

            "location": location,

            "comment_num": comment_num,

            "deal_30": deal_30,

            "original_preice": original_preice,

            "store_describe": store_describe,

            "store_service": store_service,

            "store_logistics": store_logistics,

            "store_time": store_time,

            "inventory": inventory,

        }

        if comment_num <= 20:

            page = 1

        else:

            page = round(comment_num / 20)

        if page > 251:  # in practice Taobao serves at most 251 pages of reviews

            page = 251

        print("Buyers in the last 30 days: %s" % deal_preson)

        print("Original price: %s" % original_preice)

        print("Review count: %s" % comment_num)

        print("30-day sales: %s" % deal_30)

        print("Review pages to fetch: %s" % page)

        for k in range(1, page + 1):  # assume review pages are numbered from 1

            comment_url = "https://rate.taobao.com/feedRateList.htm?auctionNumId={shop_id}&currentPageNum={page}".format(

                shop_id=shop_id, page=k

            )

            yield scrapy.Request(

                comment_url, callback=self.comment_parse, meta=shop_info

            )

 

    def comment_parse(self, response):

        print("开始解析评论信息")

        print(response.url)

        ALLtaobao = ScrapysplashtestItem()

        """将商品详细信息传入item"""

        ALLtaobao["shop_url"] = response.meta.get("shop_url")

        ALLtaobao["shop_id"] = response.meta.get("shop_id")

        ALLtaobao["title"] = response.meta.get("title")

        ALLtaobao["shop"] = response.meta.get("shop")

        ALLtaobao["image"] = response.meta.get("image")

        ALLtaobao["price"] = response.meta.get("price")

        ALLtaobao["deal_preson"] = response.meta.get("deal_preson")

        ALLtaobao["location"] = response.meta.get("location")

        ALLtaobao["comment_num"] = response.meta.get("comment_num")

        ALLtaobao["deal_30"] = response.meta.get("deal_30")

        ALLtaobao["original_preice"] = response.meta.get("original_preice")

        ALLtaobao["store_describe"] = response.meta.get("store_describe")

        ALLtaobao["store_service"] = response.meta.get("store_service")

        ALLtaobao["store_logistics"] = response.meta.get("store_logistics")

        ALLtaobao["store_time"] = response.meta.get("store_time")

        ALLtaobao["inventory"] = response.meta.get("inventory")

        ALLtaobao["spider_datetime"] = datetime.datetime.now().strftime(

            "%Y-%m-%d %H:%M:%S"

        )

        # The endpoint returns JSONP wrapped in parentheses; keep only the JSON body instead of

        # stripping every parenthesis, which would also mangle parentheses inside review text.

        text = response.text.strip()

        if "(" in text:

            text = text[text.find("(") + 1 : text.rfind(")")]

        json_data = json.loads(text)

        if json_data["comments"] is not None:

            for i in range(len(json_data["comments"])):

                ALLtaobao["comment_date"] = json_data["comments"][i]["date"]

                ALLtaobao["content"] = json_data["comments"][i]["content"]

                ALLtaobao["rateId"] = json_data["comments"][i]["rateId"]

                ALLtaobao["sku"] = json_data["comments"][i]["auction"]["sku"]

                ALLtaobao["nick"] = json_data["comments"][i]["user"]["nick"]

                ALLtaobao["vipLevel"] = json_data["comments"][i]["user"]["vipLevel"]

                ALLtaobao["rank"] = json_data["comments"][i]["user"]["rank"]

                print(

                    json_data["comments"][i]["date"],

                    json_data["comments"][i]["content"],

                    json_data["comments"][i]["rateId"],

                    json_data["comments"][i]["auction"]["sku"],

                    json_data["comments"][i]["user"]["nick"],

                    json_data["comments"][i]["user"]["vipLevel"],

                    json_data["comments"][i]["user"]["rank"],

                )

                print("yield全量数据")

                yield ALLtaobao

Taobao results also include Tmall products, so remember to add the Tmall domains to the spider's allowed_domains:

allowed_domains = [

        "www.taobao.com",

        "detail.tmall.com",

        "rate.taobao.com",

        "s.taobao.com",

]

Item:

import scrapy

from scrapy import Item, Field

 

 

class ScrapysplashtestItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    collection = table = 'new_taobao'

    shop_url = scrapy.Field()

    shop_id = scrapy.Field()

    title = scrapy.Field()

    shop = scrapy.Field()

    image = scrapy.Field()

    price = scrapy.Field()

    deal_preson = scrapy.Field()

    location = scrapy.Field()

    comment_num = scrapy.Field()

    deal_30 = scrapy.Field()

    original_preice = scrapy.Field()

    store_describe = scrapy.Field()

    store_service = scrapy.Field()

    store_logistics = scrapy.Field()

    store_time = scrapy.Field()

    inventory = scrapy.Field()

    spider_datetime = scrapy.Field()

    comment_date = scrapy.Field()

    content = scrapy.Field()

    rateId = scrapy.Field()

    sku = scrapy.Field()

    nick = scrapy.Field()

    vipLevel = scrapy.Field()

    rank = scrapy.Field()

Pipelines:

import pymysql

from scrapy import Request

from scrapysplashtest.items import ScrapysplashtestItem

 

class MysqlPipeline():

    def __init__(self, host, database, user, password, port):

        self.host = host

        self.database = database

        self.user = user

        self.password = password

        self.port = port

 

    @classmethod

    def from_crawler(cls, crawler):

        return cls(

            host=crawler.settings.get('MYSQL_HOST'),

            database=crawler.settings.get('MYSQL_DATABASE'),

            user=crawler.settings.get('MYSQL_USER'),

            password=crawler.settings.get('MYSQL_PASSWORD'),

            port=crawler.settings.get('MYSQL_PORT'),

        )

 

    def open_spider(self, spider):

        # keyword arguments work across PyMySQL versions (recent releases dropped positional arguments)

        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,

                                  database=self.database, charset='utf8mb4', port=self.port)

        self.cursor = self.db.cursor()

 

    def close_spider(self, spider):

        self.db.close()

 

    def process_item(self, item, spider):

        print(item['title'])

        data = dict(item)

        keys = ', '.join(data.keys())

        values = ', '.join(['%s'] * len(data))

        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)  # column names must match the item field names

        self.cursor.execute(sql, tuple(data.values()))

        self.db.commit()

 

        return item
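
The dynamic INSERT above only works if the new_taobao table has a column for every item field, with matching names. A minimal sketch of such a table; the column types here are assumptions rather than the schema actually used:

CREATE TABLE new_taobao (
    shop_url        VARCHAR(255),
    shop_id         VARCHAR(64),
    title           VARCHAR(255),
    shop            VARCHAR(255),
    image           VARCHAR(512),
    price           VARCHAR(64),
    deal_preson     VARCHAR(64),
    location        VARCHAR(64),
    comment_num     INT,
    deal_30         VARCHAR(64),
    original_preice VARCHAR(64),
    store_describe  VARCHAR(64),
    store_service   VARCHAR(64),
    store_logistics VARCHAR(64),
    store_time      VARCHAR(64),
    inventory       VARCHAR(64),
    spider_datetime DATETIME,
    comment_date    VARCHAR(64),
    content         TEXT,
    rateId          VARCHAR(64),
    sku             VARCHAR(255),
    nick            VARCHAR(64),
    vipLevel        VARCHAR(16),
    `rank`          VARCHAR(16)   -- rank is a reserved word in MySQL 8+, hence the backticks
) DEFAULT CHARSET = utf8mb4;

Note that on MySQL 8+ the pipeline's dynamic INSERT would also need to quote the rank column with backticks.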

 

Settings:

 

JOBDIR='restart'

LOG_FILE = "mySpider.log"

LOG_LEVEL = "DEBUG"

DOWNLOAD_DELAY = 0.25                    # wait 0.25 s between requests to slow the crawl down

RETRY_ENABLED = True

RETRY_TIMES = 30

BOT_NAME = 'scrapysplashtest'

SPIDER_MODULES = ['scrapysplashtest.spiders']

NEWSPIDER_MODULE = 'scrapysplashtest.spiders'

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {

    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,

}

DOWNLOADER_MIDDLEWARES = {

    'scrapy_splash.SplashCookiesMiddleware': 723,

    'scrapy_splash.SplashMiddleware': 725,

    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,

    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,

    'scrapysplashtest.middlewares.MyUserAgentMiddleware': 400,

}

ITEM_PIPELINES = {

    'scrapysplashtest.pipelines.MysqlPipeline': 300,

}

REDIRECT_ENABLED = False

SPLASH_URL = '<your Nginx address>'  # point this at the Nginx listener that fronts the Splash instances

HTTPERROR_ALLOWED_CODES = [500, 502, 503, 504, 400, 403, 404, 408]

 

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

 

MYSQL_HOST = 'localhost'

MYSQL_DATABASE = 'all_taobao'

MYSQL_USER = 'root'

MYSQL_PASSWORD = ''

MYSQL_PORT = 3306
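
The DOWNLOADER_MIDDLEWARES above reference scrapysplashtest.middlewares.MyUserAgentMiddleware, which is not shown in this post. A minimal sketch of what such a middleware might look like; the class body and the user-agent list are assumptions, not the author's actual code:

import random


class MyUserAgentMiddleware:
    """Rotate the User-Agent header on every outgoing request (sketch only)."""

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)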

 

Warnings

  1. Splash is a lightweight service and tends to fall over under heavy request volume; deploy several Splash instances across multiple servers and let Nginx load-balance between them.
  2. This implementation used two servers, a student-tier Tencent Cloud instance and a student-tier Alibaba Cloud instance, running three Splash services on the Alibaba box and two on the Tencent box. Even so, Splash still crashed easily when Scrapy's speed was not throttled; tune and optimize the setup yourself, as this post only sketches one implementation.
  3. While running, the spider pushes a large volume of requests through Splash, which can die at any moment, so the servers restart Splash on a schedule. A long-running Splash also consumes a lot of memory inside Docker, so Docker and Nginx are restarted on a schedule as well, as a precaution (a crontab sketch follows this list); better approaches are welcome.
  4. Taobao review URLs come from Ajax requests. The review URL is constructed dynamically based on captured traffic and fetched with a plain scrapy.Request, then parsed (the review Ajax endpoint has no anti-scraping protection, so a plain Request is enough and keeps load off Splash).
  5. The product list page reached from a first-level category URL may belong to either Taobao or Tmall, so two sets of XPath rules are used to extract the data.
  6. Splash runs on Docker, and on Linux you may hit a bug where Docker refuses to start; my fix was rather crude: reinstalling the system (Docker is hard to uninstall cleanly; better solutions are welcome).
  7. I am not very fluent at switching between XPath, CSS selectors, BeautifulSoup, regular expressions and PyQuery, so the XPaths were mostly copied straight out of Chrome's developer tools.
  8. The pipeline builds its INSERT statement dynamically from the item fields; dynamic inserts can carry risk when reused in other crawlers.
  9. Logging and pause/resume (JOBDIR) are configured so the crawler is easier to debug while running.
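
A minimal crontab sketch of the scheduled restarts mentioned in point 3; the container name, service names and schedules are assumptions to adapt to your own deployment:

# restart the Splash container (assumed to be named "splash") every 6 hours to reclaim memory
0 */6 * * * /usr/bin/docker restart splash

# restart Docker and Nginx once a day as a precaution (assumes systemd-managed services)
30 4 * * * /usr/bin/systemctl restart docker
40 4 * * * /usr/bin/systemctl restart nginx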

 

The spider in this post has been verified to run end to end. Given my limited skill it inevitably has many rough edges; you are welcome to write a more efficient and more stable full-site spider and share it so we can learn from each other.

 

That is all.

      

 
