python+scrapy+selenium爬京东零食数据+简单分析

最新推荐文章于 2024-05-12 12:52:39 发布

小学鸡_陈三五

最新推荐文章于 2024-05-12 12:52:39 发布

阅读量1.7k

点赞数

分类专栏：数据挖掘与数据分析文章标签：爬虫数据分析 python scrapy selenium

本文链接：https://blog.csdn.net/sisi_chen_3285/article/details/86218675

版权

该项目使用Python的Scrapy和Selenium库爬取京东零食的商品编号、名称、评论数、价格和好评度等信息。由于部分数据异步加载，通过Fiddler进行抓包分析获取。数据存入MySQL数据库，总计5828条记录。分析发现价格、好评度和评论数的分布特点，并通过图表展示。尽管好评度区分度不高，但价格和评论数间存在负相关关系，提示可能存在的刷单现象。

摘要由CSDN通过智能技术生成

一、工具

win8.1

python3.7

pycharm

fiddle

firefox、chrome

MySQL

二、项目简介

本项目爬取了京东的商品编号、商品名称、评论数、价格、好评度等信息。其中评论数、价格、好评度这三个参数是异步加载的，不能直接从页面源代码获取，在这里我采用了fiddle进行抓包分析。

当第一版爬虫从京东获取数据后，发现近半数的商品价格没有获取到。于是针对没有获取到价格的商品重新抓包分析，得到了另外一个存放价格的js，顺利从中得到了价格。

京东商品页存在着懒加载机制，先加载30个商品信息，下拉滚动条，触发ajax请求再加载30个商品信息。这里通过selenium来解决。

数据库表结构如下（商品id,零食名字，评论数，价格，店名，链接，好评度）：

注：其实好评度那边本来想写praise_degree的，结果也不知道怎么就写成price_degree，后来为了偷懒也就没改。(ｐ・・q）

三、项目实战

item.py

import scrapy
class JingdongItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    comment_amount = scrapy.Field()
    price = scrapy.Field()
    shop_name = scrapy.Field()
    link = scrapy.Field()
    price_degree = scrapy.Field()

jdls.py

# -*- coding: utf-8 -*-
import scrapy
from jingdong.items import JingdongItem
from scrapy.http import Request
import urllib.request
import re

class JdlsSpider(scrapy.Spider):
    name = 'jdls'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    def parse(self, response):
        try:
            key = "零食"
            search_url = "https://search.jd.com/Search?keyword=" + key + "&enc=utf-8&wq=" + key
            for i in range(1,101):
                page_url = search_url + "&page=" + str(i*2-1)
                #print("现在第"+str(i)+"页了",end="")
                yield Request(url=page_url,callback=self.next,meta={'proxy': '58.210.94.242:50514'})
        except Exception as e:
            print(e,end="")
            print("失败---")
    def next(self,response):
        id = response.xpath('//ul[@class="gl-warp clearfix"]/li/@data-sku').extract()
        for j in range(0,len(id)):
            ture_url = "https://item.jd.com/" + str(id[j]) + ".html"
            yield Request(url=ture_url,callback=self.next2,meta={'proxy': '58.210.94.242:50514'})
    def next2(self,response):
        item = JingdongItem()
        item["name"] = response.xpath('//head/title/text()').extract()[0].replace('【图片 价格 品牌 报价】-京东','').replace('【行情 报价 价格 评测】-京东','')
        tshopname = response.xpath('//div[@class="name"]/a/text()').extract()
        item["shop_name"] = tshopname[0] if tshopname else None
        item["link"] = response.url
        thisid = re.findall('https://item.jd.com/(.*?).html',item['link'])[0]
        item["id"] = thisid
        priceurl = "https://p.3.cn/prices/mgets?callback=jQuery3630170&type=1&area=1_72_4137_0&pdtk=&pduid=575222872&pdpin=&pin=null&pdbp=0&skuIds=J_" + thisid + "%2CJ_7931130%2CJ_8461830%2CJ_6774075%2CJ_6774155%2CJ_6218367&ext=11100000&source=item-pc"
        commenturl = "https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv3575&productId=" + thisid + "&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1"
        pricedata = urllib.request.urlopen(priceurl).read().decode("utf-8", "ignore")
        commentdata = urllib.request.urlopen(commenturl).read().decode("utf-8", "ignore")
        pricepat = '"p":"(.*?)"'
        commentpat = '"goodRateShow":(.*?),'
        comment_amountpat = '"goodCount":(.*?),'
        tprice = re.compile(pricepat).findall(pricedata)
        item["price"] = tprice[0] if tprice else None
        if item["price"] ==None:
            newpriceurl = "https://c0.3.cn/stock?skuId="+thisid+"&cat=1320,1583,1591&venderId=88149&area=1_72_4137_0&buyNum=1&choseSuitSkuIds=&extraParam={%22originid%22:%221%22}&ch=1&fqsp=0&pdpin=&callback=jQuery9513488"
            newpricedata = urllib.request.urlopen(newpriceurl).read().decode("utf-8", "ignore")
            newtprice = re.compile(pricepat).findall(newpricedata)
            item["price"] = newtprice[0] if newtprice else None
        tdegree = re.compile(commentpat).findall(commentdata)
        item["price_degree"] = tdegree[0] if tdegree else None
        item["comment_amount"] = re.compile(com

最低0.47元/天解锁文章

小学鸡_陈三五

关注

0
点赞
踩
19

收藏

觉得还不错? 一键收藏
0
评论
python+scrapy+selenium爬京东零食数据+简单分析

一、工具win8.1python3.7pycharmfiddlefirefox、chromeMySQL二、项目简介本项目爬取了京东的商品编号、商品名称、评论数、价格、好评度等信息。其中评论数、价格、好评度这三个参数是异步加载的，不能直接从页面源代码获取，在这里我采用了fiddle进行抓包分析。当第一版爬虫从京东获取数据后，发现近半数的商品价格没有获取到。于是针对...
复制链接

扫一扫