JD.com: Scraping Integrated Stove Review Data (demjson, lxml, and requests)

Latest recommended article published on 2023-12-29 18:00:23

不会飞的仔

Category: Web scraping. Tags: python, xpath, json, data mining

Article link: https://blog.csdn.net/weixin_42362456/article/details/111557019

Scraping JD.com Review Data with Python

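Notes on the difficulties I ran into while scraping product reviews from JD.com:
1. Parsing the JSON data often raises decode errors.
2. Anti-scraping on JD's PC review endpoint is very aggressive: the IP gets blocked after anywhere from a few to a few dozen requests.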

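Solutions:
1. Parse the JSON review data with Python's demjson library, and add time.sleep(n) to every request so that requests are not sent too frequently.
2. After many attempts, I found that the anti-scraping on the PC review data was more than I could get past: neither time.sleep() nor proxy IPs worked. In the end I switched to requesting the mobile endpoint, and with time.sleep() added, the data could be scraped normally.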
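
The pause between requests can be pulled out into a small helper. The following is only a minimal sketch, not the code used in this article: the name fetch_with_backoff and its parameters are hypothetical, and fetch stands for any callable that returns a response body or raises on failure (for example a thin wrapper around requests.get). It waits before every attempt and doubles the wait after each failure. As noted above, delays alone were not enough against the PC endpoint, so this only spaces out requests and does not bypass any blocking.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and backing off on failure.

    `fetch` is a hypothetical callable that returns the response body or
    raises an exception; `sleep` is injectable so the delays can be tested.
    """
    last_error = None
    for attempt in range(retries):
        # Wait base_delay * 2**attempt seconds, plus up to 1 s of random jitter
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        try:
            return fetch(url)
        except Exception as exc:  # deliberately broad: this is only a sketch
            last_error = exc
    raise RuntimeError(f'giving up on {url} after {retries} attempts') from last_error
```

With the defaults, the waits are roughly 2, 4, and 8 seconds. A call might look like fetch_with_backoff(lambda u: requests.get(u, headers=headers).text, some_url), where headers and some_url are whatever the surrounding script defines.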

The complete code follows:

import time

import demjson
import pandas as pd
import requests
from lxml import etree

# Search results for the keyword "集成灶" (integrated stove), filtered to the brand 火星人 (marssenger)
SEARCH_URL = 'https://search.jd.com/search?keyword=%E9%9B%86%E6%88%90%E7%81%B6&qrst=1&wq=%E9%9B%86%E6%88%90%E7%81%B6&stock=1&ev=exbrand_%E7%81%AB%E6%98%9F%E4%BA%BA%EF%BC%88marssenger%EF%BC%89%5E&page={}&s=61&click=0'
# Mobile comment endpoint; takes the product ID (sku) and a page index
COMMENT_URL = 'https://wq.jd.com/commodity/comment/getcommentlist?&version=v2&fold=1&pagesize=10&sceneval=2&score=0&sku={}&sorttype=5&page={}'

def jd_comment_new():
    headers = {
        'Cookie': 'paste your own cookie here',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36',
        'referer': 'https://item.m.jd.com/',
    }
    # Request the first search page to read the maximum page count
    first = requests.get(SEARCH_URL.format(1), headers=headers)
    time.sleep(2)
    html1 = etree.HTML(first.text)
    page_num = html1.xpath('//*[@id="J_topPage"]/span/i/text()')[0]  # maximum page count
    page_size = 2 * int(page_num)  # iterate over twice the displayed page count
    for page in range(1, page_size + 1):
        resp = requests.get(SEARCH_URL.format(page), headers=headers)
        time.sleep(2)
        htmllist = etree.HTML(resp.text)
        ul = htmllist.xpath('//*[@id="J_goodsList"]/ul/li')  # product list
        for li in ul:
            data_sku = li.xpath('./@data-sku')[0]  # product ID
            maxpage = 100
            for page1 in range(maxpage):
                res = requests.get(COMMENT_URL.format(data_sku, page1), headers=headers)
                time.sleep(2)
                # Drop the 10-character callback prefix and the final character before decoding
                js = demjson.decode(res.text.strip()[10:-1])['result']['comments']
                if not js:
                    break  # no more comments for this product
                df = pd.DataFrame({
                    'content': [c['content'] for c in js],            # comment text
                    'creationTime': [c['creationTime'] for c in js],  # comment time
                    'referenceId': [c['referenceId'] for c in js],    # product ID
                })
                print(df)

jd_comment_new()
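
The slice [10:-1] in the code above assumes that the response always starts with a wrapper of exactly ten characters, which may be one source of the decode errors mentioned earlier if the wrapper ever changes. The sketch below shows one way to strip such a wrapper without hard-coding its length. It is an illustration under assumptions, not the author's method: the helper names unwrap_jsonp and comments_to_rows are hypothetical, the response is assumed to look like callback({...}), and the standard json module is swapped in for demjson so the example runs without extra dependencies. If the payload is not strict JSON, demjson.decode can be used in place of json.loads.

```python
import json
import re

# Optional callback name, "(", the payload, ")", and an optional ";"
_JSONP = re.compile(r'^\s*[\w$.]*\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Return the payload of a JSONP-style response, or the stripped text."""
    match = _JSONP.match(text)
    return match.group(1) if match else text.strip()


def comments_to_rows(text):
    """Parse a response body and return one dict per comment."""
    payload = json.loads(unwrap_jsonp(text))
    # Missing or empty 'result' / 'comments' yields an empty list
    comments = (payload.get('result') or {}).get('comments') or []
    return [
        {
            'content': c.get('content'),            # comment text
            'creationTime': c.get('creationTime'),  # comment time
            'referenceId': c.get('referenceId'),    # product ID
        }
        for c in comments
    ]
```

The returned list of dicts can be passed straight to pd.DataFrame, and an empty list signals that a product has no further comment pages.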

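I have only recently started exploring web scraping with Python and am not yet familiar with many of its frameworks and techniques. Feel free to get in touch and share your scraping experience.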
