Scrapy Study Notes (4): Distributed Crawling of JD Product Details, Comments, and Comment Summaries

Goal: crawl JD product details, comments, and comment summaries in a distributed fashion

Powered by:

  1. Python 3.6
  2. Scrapy 1.4
  3. pymysql
  4. json
  5. redis
Project repository: https://github.com/Dengqlbq/JDSpider

Step 1——Overview

This post focuses on the code itself; the reasoning behind the design is covered in a separate post:
Design notes: http://blog.csdn.net/sinat_34200786/article/details/78954617


Step 2——Overall Architecture

Analyzing the goal reveals the following requirements:

Take a keyword and crawl the ids of the products matching it
Crawl the product details
Crawl the product comments

Putting all of this into a single spider would make the code bloated, so the project is split into four parts:

  • JDSpider
      • ProjectStart
      • JDUrlsSpider
      • JDDetailSpider
      • JDCommentSpider

ProjectStart        takes a keyword and pushes the URLs of the requested number of search pages
JDUrlsSpider        extracts all product ids from each page and builds detail-urls and comment-urls
JDDetailSpider      extracts product details from the detail-urls
JDCommentSpider     extracts product comments from the comment-urls

The spiders communicate through a server-side redis instance, mainly to pass detail-urls and comment-urls between them.
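
All four sub-projects pull their start URLs from this redis instance through scrapy_redis. As a minimal sketch, the relevant settings each sub-project would need look like this (host, port, and password are placeholders):

# JDSpider/<project>/settings.py (sketch; the same idea applies to every sub-project)

# Schedule and deduplicate requests through redis instead of in memory
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Keep the queue between runs so several crawler processes can share it
SCHEDULER_PERSIST = True

# Server-side redis used for communication between the spiders
REDIS_HOST = 'HOST'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': 'PASS'}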


Step 3——ProjectStart

Takes a keyword and pushes the URLs of the requested number of pages.
A "page" here is one page of JD search results for that keyword.

# JDSpider/ProjectStart/Test.py

import redis
from urllib import parse

# Redis configuration
r = redis.Redis(host='HOST', port=6379, password='PASS')

# Set keywords and page_count as needed
keywords = '手机'   # e.g. '手机' = mobile phones
page_count = 100

keywords = parse.quote(keywords)
current_page = 1
start_index = 1

url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt' \
      '=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0'

for i in range(page_count):
    # Feed the URL to the JDUrlsSpider queue in redis
    r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
    # JD's page parameter advances by 2 per visible page (each page is served
    # as two halves, see Step 4) and s advances by 60 items per page
    current_page += 2
    start_index += 60


Step 4——JDUrlsSpider

Extracts all product ids from each page and builds detail-urls and comment-urls.

Create the project:

cd JDSpider
scrapy startproject JDUrls

When you browse a page of search results, JD returns only half of the products up front; the other half is loaded asynchronously once you scroll to the bottom of the page.
So to get every product id on a page, the spider also has to construct that asynchronous request itself.
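
The spider reads the URL template for this hidden request from its settings. The post does not show the actual value; a plausible template with the three slots the code fills in (keyword, half-page number, ids already seen) might look like this sketch, where the endpoint itself is an assumption:

# JDSpider/JDUrls/settings.py (sketch; the exact HIDE_URL is an assumption)

# {0} = keyword, {1} = half-page number, {2} = comma-separated ids already fetched
HIDE_URL = 'https://search.jd.com/s_new.php?keyword={0}&enc=utf-8&page={1}&show_items={2}'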

# JDSpider/JDUrls/spiders/JDUrlsSpider.py

from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re


class JDUrlsSpider(RedisSpider):
    # Extract all product ids from the given page and build detail-urls and comment-urls
    name = 'JDUrlsSpider'
    allowed_domains = ['jd.com']
    redis_key = 'JDUrlsSpider'

    settings = get_project_settings()
    hide_url = settings['HIDE_URL']

    def parse(self, response):
        # Product ids in the visible (non-hidden) half of the page
        nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]'
                              '[@data-sku]/@data-sku').extract()

        keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]

        # The hidden products belong to the same visible page, but the async
        # request for them uses the next (incremented) page number
        page = re.findall(r'page=(\d+)', response.url)[0]
        page = int(page) + 1

        # Comma-separated list of the ids already fetched
        s = ','.join(str(i) for i in nums)

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item

        yield scrapy.Request(self.hide_url.format(keyword, page, s), callback=self.get_hidden)

    def get_hidden(self, response):
        # Product ids in the hidden half of the page
        nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
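
The items.py for this sub-project is not shown in the post; a minimal sketch consistent with the spider above:

# JDSpider/JDUrls/items.py (sketch)

import scrapy


class JDUrlsItem(scrapy.Item):
    # List of product ids extracted from one half of a page
    num_list = scrapy.Field()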

After the ids are extracted, the pipeline builds a detail-url and a comment-url for each one and stores them in the server-side redis:

# JDSpider/JDUrls/pipelines.py

import redis
from scrapy.utils.project import get_project_settings


class JDUrlsPipeline(object):

    def __init__(self):
        self.settings = get_project_settings()
        self.detail_url = self.settings['GOODS_DETAIL_URL']
        self.comment_url = self.settings['COMMENT_URL']

        self.r = redis.Redis(host=self.settings['REDIS_HOST'], port=self.settings['REDIS_PORT'],
                             password=self.settings['REDIS_PARAMS']['password'])

    def process_item(self, item, spider):
        # Build a detail-url and a comment-url for every product id and
        # push them to the redis queues of the two downstream spiders
        for n in item['num_list']:
            self.r.lpush('JDDetailSpider', self.detail_url.format(n))
            self.r.lpush('JDCommentSpider', self.comment_url.format(n))
        return item
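
The two URL templates come from this project's settings and are not shown in the post. Plausible values (both are assumptions) would be the product page for details and JD's comment JSON endpoint, fixed to its first page, for comments:

# JDSpider/JDUrls/settings.py (sketch; both templates are assumptions)

# {0} = product id
GOODS_DETAIL_URL = 'https://item.jd.com/{0}.html'
COMMENT_URL = 'https://club.jd.com/comment/productPageComments.action?' \
              'productId={0}&score=0&sortType=5&page=0&pageSize=10'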


Step 5——JDDetailSpider

Extracts product details from the detail-urls.
JDUrlsSpider has already pushed the detail-urls into the server-side redis, so JDDetailSpider only needs to pop URLs from redis and crawl the details.

Create the project:

cd JDSpider
scrapy startproject JDDetail

The detail fields to crawl are listed below; the comment above each field is its intended MySQL column type:

# JDSpider/JDDetail/items.py

import scrapy


class JDDetailItem(scrapy.Item):
    # define the fields for your item here like:

    # TINYTEXT
    name = scrapy.Field()
    # FLOAT
    price = scrapy.Field()
    # TINYTEXT
    owner = scrapy.Field()
    # TINYINT
    jd_sel = scrapy.Field()
    # TINYINT
    global_buy = scrapy.Field()
    # TINYINT
    flag = scrapy.Field()
    # INT
    comment_count = scrapy.Field()
    # INT
    good_count = scrapy.Field()
    # INT
    default_good_count = scrapy.Field()
    # INT
    general_count = scrapy.Field()
    # INT
    poor_count = scrapy.Field()
    # INT
    after_count = scrapy.Field()
    # FLOAT
    good_rate = scrapy.Field()
    # FLOAT
    general_rate = scrapy.Field()
    # FLOAT
    poor_rate = scrapy.Field()
    # FLOAT
    average_score = scrapy.Field()
    # TINYTEXT
    num = scrapy.Field()

When crawling the details, the price and the comment-summary data are loaded asynchronously, so the spider has to construct two extra requests.
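
Both endpoints come from this project's settings and are not shown in the post. Plausible templates (assumptions, chosen to match the JSON shapes the code below parses):

# JDSpider/JDDetail/settings.py (sketch; both URLs are assumptions)

# {0} = product id; returns a JSON array like [{"id": "J_...", "p": "4999.00", ...}]
PRICE_URL = 'https://p.3.cn/prices/mgets?skuIds=J_{0}'
# {0} = product id; returns {"CommentsCount": [{"CommentCount": ..., "GoodRate": ..., ...}]}
COMMENT_EXCERPT_URL = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={0}'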

# JDSpider/JDDetail/spiders/JDDetailSpider.py

from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json


class JDDetailSpider(RedisSpider):
    # Crawl the detail fields of a given product
    name = 'JDDetailSpider'
    allowed_domains = ['jd.com']
    redis_key = 'JDDetailSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_EXCERPT_URL']
    price_url = settings['PRICE_URL']

    def parse(self, response):
        item = JDDetailItem()

        # Global-buy (JD Worldwide) products have 'hk' in their URL
        global_buy = 'hk' in response.url

        # Product name; a '京东精选' (JD selection) tag may be embedded in it
        raw_name = re.findall(r'<div class="sku-name">(.*?)</div>', response.text, re.S)[0].strip()
        jd_sel = '京东精选' in raw_name

        # Keep only the last segment so the name carries no extra tokens,
        # such as the possible '京东精选' prefix
        name = raw_name.split('>')[-1].strip()

        # The shop name is extracted differently for global-buy products
        if not global_buy:
            owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown fr"]'
                                        '/div[@class="item"]/div[@class="name"]'
                                        '/a/text()').extract()
        else:
            owner_list = response.xpath('//div[@class="shopName"]/strong/span/a/text()').extract()

        # Self-operated (自营) products carry no shop name on the page
        if len(owner_list) == 0:
            owner = '自营'
            flag = True
        else:
            owner = owner_list[0]
            flag = '自营' in owner

        num = re.findall(r'(\d+)', response.url)[0]

        item['name'] = name
        item['owner'] = owner
        item['flag'] = flag
        item['global_buy'] = global_buy
        item['jd_sel'] = jd_sel
        item['num'] = num

        # Request the price JSON
        price_request = scrapy.Request(self.price_url.format(num), callback=self.get_price)
        price_request.meta['item'] = item
        yield price_request

    def get_price(self, response):
        item = response.meta['item']

        price_json = json.loads(response.text)
        item['price'] = price_json[0]['p']
        num = item['num']

        # Request the comment-summary JSON
        comment_request = scrapy.Request(self.comment_url.format(num), callback=self.get_comment)
        comment_request.meta['item'] = item
        yield comment_request

    def get_comment(self, response):
        item = response.meta['item']

        comment_json = json.loads(response.text)
        comment_json = comment_json['CommentsCount'][0]

        item['comment_count'] = comment_json['CommentCount']
        item['good_count'] = comment_json['GoodCount']
        item['default_good_count'] = comment_json['DefaultGoodCount']
        item['general_count'] = comment_json['GeneralCount']
        item['poor_count'] = comment_json['PoorCount']
        item['after_count'] = comment_json['AfterCount']
        item['good_rate'] = comment_json['GoodRate']
        item['general_rate'] = comment_json['GeneralRate']
        item['poor_rate'] = comment_json['PoorRate']
        item['average_score'] = comment_json['AverageScore']

        yield item
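
pymysql appears in the dependency list and the item comments name MySQL column types, so the detail pipeline presumably writes each item to MySQL. The post does not show it; below is a minimal sketch (the MYSQL_* settings keys and the jd_detail table are assumptions, and the count/rate columns are omitted for brevity):

# JDSpider/JDDetail/pipelines.py (sketch; table and settings names are assumptions)

import pymysql
from scrapy.utils.project import get_project_settings


class JDDetailPipeline(object):

    def __init__(self):
        settings = get_project_settings()
        # Connection parameters are read from hypothetical MYSQL_* settings
        self.conn = pymysql.connect(host=settings['MYSQL_HOST'], user=settings['MYSQL_USER'],
                                    password=settings['MYSQL_PASS'], db=settings['MYSQL_DB'],
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one row per product; boolean flags are stored as TINYINT 0/1
        self.cursor.execute(
            'INSERT INTO jd_detail (name, price, owner, jd_sel, global_buy, flag, num) '
            'VALUES (%s, %s, %s, %s, %s, %s, %s)',
            (item['name'], item['price'], item['owner'], int(item['jd_sel']),
             int(item['global_buy']), int(item['flag']), item['num']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()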

Step 6——JDCommentSpider

Extracts product comments from the comment-urls.

JDUrlsSpider has already pushed the comment-urls into the server-side redis, so JDCommentSpider only needs to pop URLs from redis and crawl the comments.

Create the project:

cd JDSpider
scrapy startproject JDComment

The comment fields to crawl are as follows:

# JDSpider/JDComment/items.py

import scrapy


class JDCommentItem(scrapy.Item):

    # TINYTEXT
    good_num = scrapy.Field()
    # TEXT
    content = scrapy.Field()

The JSON returned by the initial comment-url contains only 10 comments, but its maxPage field gives the number of pages that can be requested, so a simple loop fetches the remaining pages.

# JDSpider/JDComment/spiders/JDCommentSpider.py

from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re


class JDCommentSpider(RedisSpider):
    # Crawl the full comments (not the summary) of a given product
    name = 'JDCommentSpider'
    allowed_domains = ['jd.com']
    redis_key = 'JDCommentSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_URL']

    def parse(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]
        max_page_num = comment_json['maxPage']

        # The first page of comments is already in this response
        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item

        # Request the remaining comment pages
        for i in range(2, max_page_num):
            yield scrapy.Request(self.comment_url.format(good_number, i), callback=self.get_leftover)

    def get_leftover(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item
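
Note that this project's COMMENT_URL takes two slots (product id and page number), unlike the single-slot template used in JDUrls. A plausible value (an assumption):

# JDSpider/JDComment/settings.py (sketch; the URL is an assumption)

# {0} = product id, {1} = page number
COMMENT_URL = 'https://club.jd.com/comment/productPageComments.action?' \
              'productId={0}&score=0&sortType=5&page={1}&pageSize=10'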

Step 7——Launching the Crawlers

cd ProjectStart
python Test.py

cd JDUrls
scrapy crawl JDUrlsSpider

cd JDDetail
scrapy crawl JDDetailSpider
(This is a distributed crawler, so you can run more than one JDDetailSpider instance)

cd JDComment
scrapy crawl JDCommentSpider
(This is a distributed crawler, so you can run more than one JDCommentSpider instance)

Results

![Product details](https://github.com/Dengqlbq/JDSpider/raw/master/Image/detail.png)

![Partial data](https://github.com/Dengqlbq/JDSpider/raw/master/Image/partial.png)

![Comments](https://github.com/Dengqlbq/JDSpider/raw/master/Image/comment.png)

