Scrapy学习笔记（4）分布式爬取京东商品详情，评论和评论总结

最新推荐文章于 2021-12-21 23:37:51 发布

浅零半泣

最新推荐文章于 2021-12-21 23:37:51 发布

阅读量2k

点赞数 1

分类专栏： Scrapy 爬虫文章标签： python redis 分布式京东

本文链接：https://blog.csdn.net/sinat_34200786/article/details/78770356

版权

爬虫同时被 2 个专栏收录

5 篇文章 1 订阅

订阅专栏

Scrapy

4 篇文章 0 订阅

订阅专栏

目标：分布式爬取京东商品详情，评论和评论总结

Power by:

Python 3.6
Scrapy 1.4
pymysql
json
redis

项目地址：https://github.com/Dengqlbq/JDSpider

Step 1——相关简介

本文将注意力放在代码实现上，代码思路的描述将另开一文
代码思路：http://blog.csdn.net/sinat_34200786/article/details/78954617

Step 2——总体框架

分析目标后可以发现有如下需求：

指定关键词并爬取关键词商品的id
爬取商品详情
爬取商品评论

如果将所有需求的实现放在同一个Spider中代码难免显得臃肿，所以决定将整个项目分为四部分

JDSpider
ProjectStart
JDUrlsSpider
JDDetailSpider
JDCommentSpider

ProjectStart		 指定关键词并抛出指定数量页面的url
JDUrlsSpider		 提取页面中所有商品id并形成detail-url 和comment-url
JDDetailSpider		 根据detail-url提取商品详情
JDCommentSpider	     根据comment-url提取商品评论

Spider之间通过服务器端redis进行通信，主要就是detail-url和comment-url的传递

Step 3——ProjectStart

指定关键词并抛出指定数量页面的url
页面指在京东浏览商品时某一页

# JDSpider/ProjectStart/Test.py

import redis
from urllib import parse

# Redis configuration
r = redis.Redis(host='HOST', port=6379, password='PASS')

# 改写keywords和page_count
keywords = '手机'
page_count = 100

keywords = parse.quote(keywords)
current_page = 1
start_index = 1

url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt' \
      '=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0'

for i in range(page_count):
	# 提供给JDUrlsSpider		
    r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
    current_page += 2
    start_index += 60

Step4——JDUrlsSpider

提取页面中所有商品id并形成detail-url 和comment-url

创建项目：

cd JDSpider
scrapy startproject JDUrls

浏览商品的某一页时，京东先返回一半的商品信息，另一半采用异步加载只有在滚动条到尾时才加载
所以在提取某页所有商品id时还需要构造异步请求才能真正获得所有商品id

# JDSpider/JDUrls/spiders/JDUrlsSpider.py

from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re


class JDUrlsSpider(RedisSpider):
    # 获取指定页面中所有商品编号并整合成detail-relate url 和comment-relate url
    name = 'JDUrlsSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDUrlsSpider'
    
    settings = get_project_settings()
    hide_url = settings['HIDE_URL']

   def parse(self, response):
       # 页面中未隐藏的所有商品编号
       nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]
				           [@data-sku]/@data-sku').extract()
				           
       keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]

       # 虽然是同一个页面的商品编号，但异步加载请求隐藏的商品编号时请求的页面编号不同
       page = re.findall(r'page=(\d+)', response.url)[0]
       page = int(page) + 1

       s = ''
       for i in nums:
           s += str(i) + ','
       s = s[0:len(s)-1:]

       item = JDUrlsItem()
       item['num_list'] = nums
       yield item

       yield scrapy.Request(self.hide_url.format(keyword, page, s), callback=self.get_hidden)

   def get_hidden(self, response):
       # 页面中隐藏的所有商品编号
       nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()

       item = JDUrlsItem()
       item['num_list'] = nums
       yield item

提取出商品id后构造出detail-url和comment-url并存入服务器端redis

# JDSpider/JDUrls/pipelines.py

import redis
from scrapy.utils.project import get_project_settings


class JDUrlsPipeline(object):

   def __init__(self):
       self.settings = get_project_settings()
       self.detail_url = self.settings['GOODS_DETAIL_URL']
       self.comment_url = self.settings['COMMENT_URL']

       self.r = redis.Redis(host=self.settings['REDIS_HOST'], port=self.settings['REDIS_PORT'],
                            password=self.settings['REDIS_PARAMS']['password'])

   def process_item(self, item, spider):
       # 将商品编号整合成detail-relate url 和comment-relate url后存到服务器redis数据库
       for n in item['num_list']:
           self.r.lpush('JDDetailSpider', self.detail_url.format(n))
           self.r.lpush('JDCommentSpider', self.comment_url.format(n))

Step 5——JDDetailSpider

根据detail-url提取商品详情
JDUrlsSpider已经将detail-url存入服务器端redis，JDDetailSpider只需从redis获取url爬取商品详情

创建项目：

cd JDSpider
scrapy startproject JDDetail

要爬取的商品详情具体项如下：

# JDSpider/JDDetail/items.py

import scrapy


class JDDetailItem(scrapy.Item):
    # define the fields for your item here like:

    # TINYTEXT
    name = scrapy.Field()
    # FLOAT
    price = scrapy.Field()
    # TINYTEXT
    owner = scrapy.Field()
    # TINYINT
    jd_sel = scrapy.Field()
    # TINYINT
    global_buy = scrapy.Field()
    # TINYINT
    flag = scrapy.Field()
    # INT
    comment_count = scrapy.Field()
    # INT
    good_count = scrapy.Field()
    # INT
    default_good_count = scrapy.Field()
    # INT
    general_count = scrapy.Field()
    # INT
    poor_count = scrapy.Field()
    # INT
    after_count = scrapy.Field()
    # FLOAT
    good_rate = scrapy.Field()
    # FLOAT
    general_rate = scrapy.Field()
    # FLOAT
    poor_rate = scrapy.Field()
    # FLOAT
    average_score = scrapy.Field()
    # TINYTEXT
    num = scrapy.Field()

爬取详情时，价格数据和评论总结数据是异步加载的，所以需要另外构造异步请求

# JDSpider/JDDetail/JDDetailSpider

from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json


class JDDetailSpider(RedisSpider):
    # 获取指定商品的商品详情
    name = 'JDDetailSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDDetailSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_EXCERPT_URL']
    price_url = settings['PRICE_URL']

   def parse(self, response):
       item = JDDetailItem()

       # 全球购
       if 'hk' in response.url:
           global_buy = True
       else:
           global_buy = False

       # 商品名
       raw_name = re.findall(r'<div class="sku-name">(.*?)</div>', response.text, re.S)[0].strip()
       if '京东精选' in raw_name:
           jd_sel = True
       else:
           jd_sel = False

       # 确保商品名无多余字符，如可能出现的 "京东精选"
       name_list = raw_name.split('>')
       name = name_list[len(name_list) - 1].strip()

       # 全球购商铺名提取方法不同
       if not global_buy:
           owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown 
						             fr"]/div[@class="item"]/div[@class="name"]'
                                       '/a/text()').extract()
       else:
           owner_list = response.xpath('//div[@class="shopName"]/strong/span/a/text()').extract()

       # 是否自营
       if len(owner_list) == 0:
           owner = '自营'
           flag = True
       else:
           owner = owner_list[0]
           if '自营' in owner:
               flag = True
           else:
               flag = False
               
       num = re.findall(r'(\d+)', response.url)[0]

       item['name'] = name
       item['owner'] = owner
       item['flag'] = flag
       item['global_buy'] = global_buy
       item['jd_sel'] = jd_sel
       item['num'] = num

       # 请求价格json数据
       price_request = scrapy.Request(self.price_url.format(num), callback=self.get_price)
       price_request.meta['item'] = item
       yield price_request

   def get_price(self, response):
       item = response.meta['item']

       price_json = json.loads(response.text)
       item['price'] = price_json[0]['p']
       num = item['num']

       # 请求评论总结json数据
       comment_request = scrapy.Request(self.comment_url.format(num), callback=self.get_comment)
       comment_request.meta['item'] = item
       yield comment_request

   def get_comment(self, response):
       item = response.meta['item']

       comment_json = json.loads(response.text)
       comment_json = comment_json['CommentsCount'][0]

       item['comment_count'] = comment_json['CommentCount']
       item['good_count'] = comment_json['GoodCount']
       item['default_good_count'] = comment_json['DefaultGoodCount']
       item['general_count'] = comment_json['GeneralCount']
       item['poor_count'] = comment_json['PoorCount']
       item['after_count'] = comment_json['AfterCount']
       item['good_rate'] = comment_json['GoodRate']
       item['general_rate'] = comment_json['GeneralRate']
       item['poor_rate'] = comment_json['PoorRate']
       item['average_score'] = comment_json['AverageScore']

       yield item

Step 6——JDCommentSpider

根据comment-url提取商品评论

JDUrlsSpider已经将comment-url存入服务器端redis，JDCommentSpider只需从redis获取url爬取评论

创建项目：

cd JDSpider
scrapy startproject JDComment

要爬取的商品评论具体项如下：

# JDSpider/JDComment/items.py

class JDCommentItem(scrapy.Item):
	
	# TINYTEXT
	good_num = scrapy.Field()
	# TEXT
	content = scrapy.Field()

初始comment-url返回的json数据中只有10条评论，但是maxPage指明了可以获取评论的次数，加个循
环即可获取其他评论数据

# JDSpider/JDComment/JDCommentSpider.py

from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re


class JDCommentSpider(RedisSpider):
    # 获取指定商品的评论（完整评论，非摘要）
    name = 'JDCommentSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDCommentSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_URL']

   def parse(self, response):
       comment_json = json.loads(response.text)
       good_number = re.findall(r'productId=(\d+)', response.url)[0]
       max_page_num = comment_json['maxPage']

       for com in comment_json['comments']:
           item = JDCommentItem()
           item['good_num'] = good_number
           item['content'] = com['content']
           yield item

       for i in range(2, max_page_num):
           yield scrapy.Request(self.comment_url.format(good_number, i), callback=self.get_leftover)

   def get_leftover(self, response):
       comment_json = json.loads(response.text)
       good_number = re.findall(r'productId=(\d+)', response.url)[0]

       for com in comment_json['comments']:
           item = JDCommentItem()
           item['good_num'] = good_number
           item['content'] = com['content']
           yield item

Step 7——启动爬虫

cd ProjectStart
python Test.py

cd JDUrlsSpider
scrapy crawl JDUrlsSpider

cd JDDetailSpider
scrapy crawl JDDetailSpider
(This is distributed crawler, you can run more than one JDDetailSpider)

cd JDCommentSpider
scrapy crawl JDCommentSpider
(This is distributed crawler, you can run more than one JDCommentSpider)

成果展示

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1AB82g0T-1615194057496)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/detail.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-qJ6ubYJv-1615194057501)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/partial.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-yvDgCVV5-1615194057503)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/comment.png)]

参考资料

####总体框架参考

浅零半泣

关注

1
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
Scrapy学习笔记（4）分布式爬取京东商品详情，评论和评论总结

目标：分布式爬取京东商品详情，评论和评论总结Power by:Python 3.6Scrapy 1.4pymysqljsonredis项目地址：https://github.com/Dengqlbq/JDSpiderStep 1——相关简介本文将注意力放在代码实现上，代码思路的描述将另开一文代码思路：http://blog.csdn.net/si
复制链接

扫一扫