目标:分布式爬取京东商品详情,评论和评论总结
Power by:
- Python 3.6
- Scrapy 1.4
- pymysql
- json
- redis
项目地址:https://github.com/Dengqlbq/JDSpider
Step 1——相关简介
本文将注意力放在代码实现上,代码思路的描述将另开一文
代码思路:http://blog.csdn.net/sinat_34200786/article/details/78954617
Step 2——总体框架
分析目标后可以发现有如下需求:
指定关键词并爬取关键词商品的id
爬取商品详情
爬取商品评论
如果将所有需求的实现放在同一个Spider中代码难免显得臃肿,所以决定将整个项目分为四部分
- JDSpider
- ProjectStart
- JDUrlsSpider
- JDDetailSpider
- JDCommentSpider
ProjectStart 指定关键词并抛出指定数量页面的url
JDUrlsSpider 提取页面中所有商品id并形成detail-url 和comment-url
JDDetailSpider 根据detail-url提取商品详情
JDCommentSpider 根据comment-url提取商品评论
Spider之间通过服务器端redis进行通信,主要就是detail-url和comment-url的传递
Step 3——ProjectStart
指定关键词并抛出指定数量页面的url
页面指在京东浏览商品时某一页
# JDSpider/ProjectStart/Test.py
import redis
from urllib import parse
# Redis configuration
r = redis.Redis(host='HOST', port=6379, password='PASS')
# 改写keywords和page_count
keywords = '手机'
page_count = 100
keywords = parse.quote(keywords)
current_page = 1
start_index = 1
url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt' \
'=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0'
for i in range(page_count):
# 提供给JDUrlsSpider
r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
current_page += 2
start_index += 60
Step4——JDUrlsSpider
提取页面中所有商品id并形成detail-url 和comment-url
创建项目:
cd JDSpider
scrapy startproject JDUrls
浏览商品的某一页时,京东先返回一半的商品信息,另一半采用异步加载只有在滚动条到尾时才加载
所以在提取某页所有商品id时还需要构造异步请求才能真正获得所有商品id
# JDSpider/JDUrls/spiders/JDUrlsSpider.py
from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
class JDUrlsSpider(RedisSpider):
# 获取指定页面中所有商品编号并整合成detail-relate url 和comment-relate url
name = 'JDUrlsSpider'
allow_domains = ['www.jd.com']
redis_key = 'JDUrlsSpider'
settings = get_project_settings()
hide_url = settings['HIDE_URL']
def parse(self, response):
# 页面中未隐藏的所有商品编号
nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]
[@data-sku]/@data-sku').extract()
keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]
# 虽然是同一个页面的商品编号,但异步加载请求隐藏的商品编号时请求的页面编号不同
page = re.findall(r'page=(\d+)', response.url)[0]
page = int(page) + 1
s = ''
for i in nums:
s += str(i) + ','
s = s[0:len(s)-1:]
item = JDUrlsItem()
item['num_list'] = nums
yield item
yield scrapy.Request(self.hide_url.format(keyword, page, s), callback=self.get_hidden)
def get_hidden(self, response):
# 页面中隐藏的所有商品编号
nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()
item = JDUrlsItem()
item['num_list'] = nums
yield item
提取出商品id后构造出detail-url和comment-url并存入服务器端redis
# JDSpider/JDUrls/pipelines.py
import redis
from scrapy.utils.project import get_project_settings
class JDUrlsPipeline(object):
def __init__(self):
self.settings = get_project_settings()
self.detail_url = self.settings['GOODS_DETAIL_URL']
self.comment_url = self.settings['COMMENT_URL']
self.r = redis.Redis(host=self.settings['REDIS_HOST'], port=self.settings['REDIS_PORT'],
password=self.settings['REDIS_PARAMS']['password'])
def process_item(self, item, spider):
# 将商品编号整合成detail-relate url 和comment-relate url后存到服务器redis数据库
for n in item['num_list']:
self.r.lpush('JDDetailSpider', self.detail_url.format(n))
self.r.lpush('JDCommentSpider', self.comment_url.format(n))
Step 5——JDDetailSpider
根据detail-url提取商品详情
JDUrlsSpider已经将detail-url存入服务器端redis,JDDetailSpider只需从redis获取url爬取商品详情
创建项目:
cd JDSpider
scrapy startproject JDDetail
要爬取的商品详情具体项如下:
# JDSpider/JDDetail/items.py
import scrapy
class JDDetailItem(scrapy.Item):
# define the fields for your item here like:
# TINYTEXT
name = scrapy.Field()
# FLOAT
price = scrapy.Field()
# TINYTEXT
owner = scrapy.Field()
# TINYINT
jd_sel = scrapy.Field()
# TINYINT
global_buy = scrapy.Field()
# TINYINT
flag = scrapy.Field()
# INT
comment_count = scrapy.Field()
# INT
good_count = scrapy.Field()
# INT
default_good_count = scrapy.Field()
# INT
general_count = scrapy.Field()
# INT
poor_count = scrapy.Field()
# INT
after_count = scrapy.Field()
# FLOAT
good_rate = scrapy.Field()
# FLOAT
general_rate = scrapy.Field()
# FLOAT
poor_rate = scrapy.Field()
# FLOAT
average_score = scrapy.Field()
# TINYTEXT
num = scrapy.Field()
爬取详情时,价格数据和评论总结数据是异步加载的,所以需要另外构造异步请求
# JDSpider/JDDetail/JDDetailSpider
from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json
class JDDetailSpider(RedisSpider):
# 获取指定商品的商品详情
name = 'JDDetailSpider'
allow_domains = ['www.jd.com']
redis_key = 'JDDetailSpider'
settings = get_project_settings()
comment_url = settings['COMMENT_EXCERPT_URL']
price_url = settings['PRICE_URL']
def parse(self, response):
item = JDDetailItem()
# 全球购
if 'hk' in response.url:
global_buy = True
else:
global_buy = False
# 商品名
raw_name = re.findall(r'<div class="sku-name">(.*?)</div>', response.text, re.S)[0].strip()
if '京东精选' in raw_name:
jd_sel = True
else:
jd_sel = False
# 确保商品名无多余字符,如可能出现的 "京东精选"
name_list = raw_name.split('>')
name = name_list[len(name_list) - 1].strip()
# 全球购商铺名提取方法不同
if not global_buy:
owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown
fr"]/div[@class="item"]/div[@class="name"]'
'/a/text()').extract()
else:
owner_list = response.xpath('//div[@class="shopName"]/strong/span/a/text()').extract()
# 是否自营
if len(owner_list) == 0:
owner = '自营'
flag = True
else:
owner = owner_list[0]
if '自营' in owner:
flag = True
else:
flag = False
num = re.findall(r'(\d+)', response.url)[0]
item['name'] = name
item['owner'] = owner
item['flag'] = flag
item['global_buy'] = global_buy
item['jd_sel'] = jd_sel
item['num'] = num
# 请求价格json数据
price_request = scrapy.Request(self.price_url.format(num), callback=self.get_price)
price_request.meta['item'] = item
yield price_request
def get_price(self, response):
item = response.meta['item']
price_json = json.loads(response.text)
item['price'] = price_json[0]['p']
num = item['num']
# 请求评论总结json数据
comment_request = scrapy.Request(self.comment_url.format(num), callback=self.get_comment)
comment_request.meta['item'] = item
yield comment_request
def get_comment(self, response):
item = response.meta['item']
comment_json = json.loads(response.text)
comment_json = comment_json['CommentsCount'][0]
item['comment_count'] = comment_json['CommentCount']
item['good_count'] = comment_json['GoodCount']
item['default_good_count'] = comment_json['DefaultGoodCount']
item['general_count'] = comment_json['GeneralCount']
item['poor_count'] = comment_json['PoorCount']
item['after_count'] = comment_json['AfterCount']
item['good_rate'] = comment_json['GoodRate']
item['general_rate'] = comment_json['GeneralRate']
item['poor_rate'] = comment_json['PoorRate']
item['average_score'] = comment_json['AverageScore']
yield item
Step 6——JDCommentSpider
根据comment-url提取商品评论
JDUrlsSpider已经将comment-url存入服务器端redis,JDCommentSpider只需从redis获取url爬取评论
创建项目:
cd JDSpider
scrapy startproject JDComment
要爬取的商品评论具体项如下:
# JDSpider/JDComment/items.py
class JDCommentItem(scrapy.Item):
# TINYTEXT
good_num = scrapy.Field()
# TEXT
content = scrapy.Field()
初始comment-url返回的json数据中只有10条评论,但是maxPage指明了可以获取评论的次数,加个循
环即可获取其他评论数据
# JDSpider/JDComment/JDCommentSpider.py
from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re
class JDCommentSpider(RedisSpider):
# 获取指定商品的评论(完整评论,非摘要)
name = 'JDCommentSpider'
allow_domains = ['www.jd.com']
redis_key = 'JDCommentSpider'
settings = get_project_settings()
comment_url = settings['COMMENT_URL']
def parse(self, response):
comment_json = json.loads(response.text)
good_number = re.findall(r'productId=(\d+)', response.url)[0]
max_page_num = comment_json['maxPage']
for com in comment_json['comments']:
item = JDCommentItem()
item['good_num'] = good_number
item['content'] = com['content']
yield item
for i in range(2, max_page_num):
yield scrapy.Request(self.comment_url.format(good_number, i), callback=self.get_leftover)
def get_leftover(self, response):
comment_json = json.loads(response.text)
good_number = re.findall(r'productId=(\d+)', response.url)[0]
for com in comment_json['comments']:
item = JDCommentItem()
item['good_num'] = good_number
item['content'] = com['content']
yield item
Step 7——启动爬虫
cd ProjectStart
python Test.py
cd JDUrlsSpider
scrapy crawl JDUrlsSpider
cd JDDetailSpider
scrapy crawl JDDetailSpider
(This is distributed crawler, you can run more than one JDDetailSpider)
cd JDCommentSpider
scrapy crawl JDCommentSpider
(This is distributed crawler, you can run more than one JDCommentSpider)
成果展示
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1AB82g0T-1615194057496)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/detail.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-qJ6ubYJv-1615194057501)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/partial.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-yvDgCVV5-1615194057503)(https://github.com/Dengqlbq/JDSpider/raw/master/Image/comment.png)]
参考资料
####总体框架参考