Scraping JD.com Product Comments with Python (Part 2)

        In the previous post I implemented a simple scraper for JD.com product comments (Python爬取京东商品评论). Since that scraper is part of my graduation project, the program needs some additions: the previous version had quite a few shortcomings, which I work through one by one below.

        1. First, the number of comment pages. In the previous program I controlled the crawl by manually entering the number of pages to fetch, but the page count varies from product to product, so the program should first learn to determine it automatically.

        I pasted the contents of the JSON file into an online JSON parser and looked at its structure:
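        The relevant part of the response looks roughly like this (maxPage, comments, and content are the real field names; the values and everything omitted here are illustrative):

{
    "maxPage": 100,
    "comments": [
        {"content": "运行流畅,拍照清晰"},
        {"content": "物流很快,包装完好"}
    ]
}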

        The maxPage field holds the total number of comment pages. So I handle the first page separately: write its comments to the file, pull out maxPage, and let maxPage drive the crawl of the remaining pages. The code for this part:

def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk')
    # str.strip() removes a *set of characters*, not a literal prefix/suffix,
    # so slice the JSON payload out of the JSONP wrapper fetchJSON_comment98(...);
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']   # total number of comment pages
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            fp.write(str(i['content']) + '\n')
    print('First page done!')
    return max_page

        Here start_url is the URL of the JSON file for the first page of comments, and the returned max_page is the total number of comment pages.


        2. Next, the amount of data.

        JD's comment counts look huge, easily in the hundreds of thousands, but only 100 pages of comments are actually retrievable, and there is no way around that cap. You can fetch 100 pages under the all-comments view, and likewise 100 pages each under 好评 (positive), 中评 (neutral), and 差评 (negative), so crawling the three rating categories separately roughly triples the amount of data. Compare the URLs of the three categories:

差评 (negative): https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=1&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

中评 (neutral): https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=2&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

好评 (positive): https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=3&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

        The only difference is the score parameter: 3 for 好评, 2 for 中评, 1 for 差评. With that pattern, all three categories can be crawled automatically. The code:

    for i in range(3, 0, -1):   # score: 3 = positive, 2 = neutral, 1 = negative
        start_url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
        start_url = start_url.format(i)
        page = start_deal(start_url)

        In Python, {} is a placeholder; str.format() fills the braces in with its arguments.
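        A minimal illustration (the URL here is a stand-in, not JD's; on Python 3.6+ an f-string is an equivalent alternative):

template = "https://example.com/?score={}&page={}"
print(template.format(3, 0))   # https://example.com/?score=3&page=0

# the same with an f-string:
score, page = 3, 0
print(f"https://example.com/?score={score}&page={page}")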


        3. Data storage. Comments often contain newline characters, so a single comment written to the file can span several lines and be mistaken for multiple comments. Before writing, we replace the newlines with spaces, modifying the file-writing code like this:

    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            text = str(i['content']).replace('\n', ' ')   # replace in-comment newlines with spaces
            fp.write(text + '\n')

        The comments can also be saved to an xlsx file and a CSV file, which is convenient for the natural language processing that comes later:

def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "内容")
    n = 1
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for i in f.readlines():
            sheet.write(n, 0, i.rstrip('\n'))   # drop the trailing newline
            n += 1
    # note: xlwt writes the legacy .xls format regardless of the file extension
    book.save('京东.xlsx')
    data_xls = pd.read_excel('京东.xlsx')
    data_xls.to_csv('京东.csv', encoding='utf-8')
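        If the xlsx file is only an intermediate step, pandas can also build the CSV straight from the text file and skip the xlwt round trip. A minimal sketch, assuming the one-comment-per-line layout used above (csv_save is my name for it, not part of the original code):

import pandas as pd

def csv_save():
    with open('京东.txt', 'r', encoding='utf-8') as f:
        comments = [line.rstrip('\n') for line in f]
    pd.DataFrame({'内容': comments}).to_csv('京东.csv', encoding='utf-8', index=False)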


        4. The complete program:

import urllib.request
import json
import random
import sys
import xlwt
import pandas as pd

user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
    ]
header = {
    'User-Agent': random.choice(user_agents),   # note the hyphen: 'User_Agent' is not the real header name
    'Referer': 'https://item.jd.com/100009177424.html'
}

def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk')
    # slice the JSON payload out of the JSONP wrapper fetchJSON_comment98(...);
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            fp.write(str(i['content']).replace('\n', ' ') + '\n')
    print('First page done!')
    return max_page

def deal(score, max_page):
    for i in range(1, max_page):   # page 0 was already handled by start_deal
        url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1"
        url = url.format(score, i)
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('Fetch failed!')
            sys.exit(0)
        content = content.read().decode('gbk')
        content = content[content.find('(') + 1:content.rfind(')')]
        text = json.loads(content)
        comment = text['comments']
        with open('京东.txt', 'a', encoding='utf-8') as fp:
            for j in comment:
                fp.write(str(j['content']).replace('\n', ' ') + '\n')
        print('Page %s done!' % (i + 1))

def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "内容")
    n = 1
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for i in f.readlines():
            sheet.write(n, 0, i.rstrip('\n'))   # drop the trailing newline
            n += 1
    # note: xlwt writes the legacy .xls format regardless of the file extension
    book.save('京东.xlsx')
    data_xls = pd.read_excel('京东.xlsx')
    data_xls.to_csv('京东.csv', encoding='utf-8')

if __name__ == '__main__':
    for i in range(3, 0, -1):   # score: 3 = positive, 2 = neutral, 1 = negative
        start_url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
        start_url = start_url.format(i)
        page = start_deal(start_url)
        deal(i, page)
    book_save()   # the txt file accumulates everything, so one conversion at the end is enough


Update:

        5. Data volume again. My advisor said 3,000 comments were too few and asked me to find a way to get more. I'm analyzing comments on a smartphone sold on JD, and this kind of product is special: it comes in a number of different configurations and colors.

        I also noticed that each color variant has its own product code, visible in the page URL. The six color variants of the 8GB+128GB configuration have these product codes:

['100009177424','100009177400','100005185609','100009177422','100005185613','100009177428']

        Then I remembered the toggle on the comment page that shows only the comments for the currently selected item:

        With that toggle enabled, I inspected the comments of all six colors in a JSON parser and found that they are all different. For my thesis, the six colors are essentially the same phone, so I can crawl the comments for every color and treat them all as raw comment data for a single product.

        Crawling the six colors is easy to implement. First, the JSON URL changes slightly. When crawling a single product, the comment JSON lived at:

https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

        With the only-current-item toggle selected, the URL becomes the following; the difference is small (productPageComments becomes skuProductPageComments):

https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

        Second, the productId parameter in the URL is now a variable, to be filled in turn with each of the six product codes:

['100009177424','100009177400','100005185609','100009177422','100005185613','100009177428']

        After adding the six-color crawl and a few small optimizations, the code looks like this; skip it if you don't need that much data:

import urllib.request
import json
import random
import sys
import xlwt
import pandas as pd

start_url="https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId={}&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1"
# if this is still not enough, the product codes of the six 8GB+256GB color variants can be added as well
id_list=['100009177424','100009177400','100005185609','100009177422','100005185613','100009177428']
user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
    ]
header = {
    'User-Agent': random.choice(user_agents),   # note the hyphen: 'User_Agent' is not the real header name
    'Referer': 'https://item.jd.com/100009177424.html'
}

def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk', 'replace')   # 'replace' keeps characters outside GBK from crashing the decode
    # slice the JSON payload out of the JSONP wrapper fetchJSON_comment98(...);
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            fp.write(str(i['content']).replace('\n', ' ') + '\n')
    print('First page done!')
    return max_page

def deal(product_id, score, max_page):   # renamed from id, which shadows the built-in
    for i in range(1, max_page):   # page 0 was already handled by start_deal
        url = start_url.format(product_id, score, i)
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('Fetch failed!')
            sys.exit(0)
        content = content.read().decode('gbk', 'replace')
        content = content[content.find('(') + 1:content.rfind(')')]
        text = json.loads(content)
        comment = text['comments']
        with open('京东.txt', 'a', encoding='utf-8') as fp:
            for j in comment:
                fp.write(str(j['content']).replace('\n', ' ') + '\n')
        print('Page %s done!' % (i + 1))

def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "内容")
    n = 1
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for i in f.readlines():
            sheet.write(n, 0, i.rstrip('\n'))   # drop the trailing newline
            n += 1
    # note: xlwt writes the legacy .xls format regardless of the file extension
    book.save('京东.xlsx')
    data_xls = pd.read_excel('京东.xlsx')
    data_xls.to_csv('京东.csv', encoding='utf-8')

if __name__ == '__main__':
    for i in range(3, 0, -1):     # score: 3 = positive, 2 = neutral, 1 = negative
        for j in id_list:         # one pass per color variant
            getpage_url = start_url.format(j, i, 0)
            page = start_deal(getpage_url)
            deal(j, i, page)      # was deal(j, i, 3), which capped every crawl at two extra pages
    book_save()   # one conversion at the end covers everything in the txt file
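
        One small robustness tweak worth considering (my addition, not part of the original run): pausing briefly between requests makes the crawl look less like a burst of automated traffic. A minimal sketch; polite_pause is a hypothetical helper to call at the end of each loop iteration in deal():

import random
import time

def polite_pause(low=1.0, high=3.0):
    # sleep for a random interval so requests do not arrive in a burst
    time.sleep(random.uniform(low, high))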

