In the previous post I put together a simple crawler for JD.com product comments (Python爬取京东商品评论). Since this crawler is part of my graduation project, it needs some additions; the previous version had quite a few shortcomings, which I will address one by one below.
1. First, the number of comment pages. In the previous program I controlled the crawl by manually entering the number of pages to fetch, but since the page count varies from product to product, the first step is to make the program determine it automatically.
I pasted the contents of the JSON file into an online JSON viewer and examined its structure:
The maxPage field is the number of comment pages. So I handle the first page separately: write its comments to the file, pull out maxPage, and let maxPage drive the crawl of the remaining pages. The code for this part:
def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk')
    # The response is JSONP; slice off the fetchJSON_comment98(...) wrapper.
    # (str.strip would remove a character set, not the literal prefix/suffix.)
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            fp.write(str(i['content']) + '\n')
    print('First page done!')
    return max_page
Here start_url is the URL of the JSON file for the first comment page, and the returned max_page is the total number of comment pages.
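Because the response is JSONP rather than plain JSON, the callback wrapper has to be removed before json.loads can parse it. A minimal, self-contained sketch of just that unwrapping step (the sample payload below is a mock that only mimics the real response's field names):

```python
import json

def unwrap_jsonp(payload: str) -> dict:
    """Cut the fetchJSON_comment98(...) padding off a JSONP response."""
    start = payload.find('(') + 1   # first character after the opening parenthesis
    end = payload.rfind(')')        # last closing parenthesis
    return json.loads(payload[start:end])

# Mock of the response shape described above
sample = 'fetchJSON_comment98({"maxPage": 100, "comments": [{"content": "不错"}]});'
data = unwrap_jsonp(sample)
print(data['maxPage'])  # → 100
```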
2. Next, the amount of data. JD's comment page looks like this:
JD's comment counts look huge, easily in the hundreds of thousands, but only 100 pages of comments are actually accessible, and there is no way around that. "All comments" yields 100 pages, and likewise "positive", "neutral" and "negative" each yield up to 100 pages, so we can crawl 100 pages from each of the three rating categories to increase the data volume. Compare the URLs of the positive, neutral and negative lists:
The only difference is the score parameter: 3 for positive, 2 for neutral, 1 for negative. With this pattern we can crawl all three categories automatically. The code for this part:
for i in range(3, 0, -1):
    start_url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
    start_url = start_url.format(i)
    page = start_deal(start_url)
In Python, {} is a placeholder that the format method fills in.
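A quick illustration of how format fills the placeholders, using the URL template from the post with both score and page left as placeholders:

```python
# URL template from the post; score and page are placeholders
template = ("https://club.jd.com/comment/productPageComments.action"
            "?callback=fetchJSON_comment98&productId=100009177424"
            "&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1")

# score: 3 = positive, 2 = neutral, 1 = negative; page numbering starts at 0
urls = [template.format(score, 0) for score in range(3, 0, -1)]
print(urls[0])
```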
3. Data storage. Comments often contain newline characters, so when saved to the file one comment can be split across several lines and mistaken for several comments. Before writing, we therefore replace newlines in each comment with spaces, modifying the file-writing code as follows:
fp = open('京东.txt', 'a', encoding='utf-8')
for i in comment:
    text = str(i['content']).replace('\n', ' ')
    fp.write(text + '\n')
The data can also be saved to an Excel file and a CSV file for later natural-language processing:
def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "content")
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for n, line in enumerate(f, start=1):
            sheet.write(n, 0, line.rstrip('\n'))
    # xlwt writes the legacy .xls format, so save with a matching extension
    book.save('京东.xls')
    data_xls = pd.read_excel('京东.xls')
    data_xls.to_csv('京东.csv', encoding='utf-8')
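The Excel-then-pandas round trip works, but if only the CSV is needed, the standard-library csv module can write it directly from the text file. A sketch, keeping the file names used above:

```python
import csv

def txt_to_csv(txt_path='京东.txt', csv_path='京东.csv'):
    """Write one comment per CSV row, with a header row, straight from the text file."""
    with open(txt_path, 'r', encoding='utf-8') as src, \
         open(csv_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(['content'])          # header row
        for line in src:
            comment = line.rstrip('\n')
            if comment:                       # skip blank lines
                writer.writerow([comment])
```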
4. The complete program:
import urllib.request
import json
import random
import sys
import xlwt
import pandas as pd
user_agents = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
]
header = {
    # Must be spelled 'User-Agent'; with an underscore the server will not recognize it
    'User-Agent': random.choice(user_agents),
    'Referer': 'https://item.jd.com/100009177424.html'
}
def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk')
    # Slice off the JSONP wrapper fetchJSON_comment98(...)
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            text = str(i['content']).replace('\n', ' ')
            fp.write(text + '\n')
    print('First page done!')
    return max_page
def deal(score, max_page):
    for i in range(1, max_page):
        url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1"
        url = url.format(str(score), str(i))
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('Fetch failed!')
            sys.exit(0)
        content = content.read().decode('gbk')
        # Slice off the JSONP wrapper
        content = content[content.find('(') + 1:content.rfind(')')]
        text = json.loads(content)
        comment = text['comments']
        with open('京东.txt', 'a', encoding='utf-8') as fp:
            for j in comment:
                text = str(j['content']).replace('\n', ' ')
                fp.write(text + '\n')
        print('Page %s done!' % (i + 1))
def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "content")
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for n, line in enumerate(f, start=1):
            sheet.write(n, 0, line.rstrip('\n'))
    # xlwt writes the legacy .xls format, so save with a matching extension
    book.save('京东.xls')
    data_xls = pd.read_excel('京东.xls')
    data_xls.to_csv('京东.csv', encoding='utf-8')
if __name__ == '__main__':
    for i in range(3, 0, -1):
        start_url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score={}&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
        start_url = start_url.format(i)
        page = start_deal(start_url)
        deal(i, page)
    book_save()
Update:
5. Data volume again. My advisor said 3,000 comments is too little data and asked me to find more. I am analyzing comments on a smartphone on JD, and this product is somewhat special: it comes in many different configurations and colors, as shown:
I also noticed that each color has its own product code, which can be found in the page URL. The product codes of the six colors of the 8GB+128GB configuration are:
['100009177424','100009177400','100005185609','100009177422','100005185613','100009177428']
Then I remembered this button on the comment page:
With "only show comments on the current item" selected, I inspected the comments of all six colors in a JSON viewer and found that they are all different. For my project the six colors are essentially the same phone, so I can crawl the comments of every color and treat them all as raw comment data for a single phone.
Crawling the six colors is easy to implement. First, the URL of the JSON file has to change. When crawling only one item's comments, the comment JSON URL was:
https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
With "only show comments on the current item" selected, it becomes the following, only slightly different:
https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId=100009177424&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
Second, the productId parameter in the URL is now also a variable, and each of the six product codes must be filled in:
['100009177424','100009177400','100005185609','100009177422','100005185613','100009177428']
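Since each of the six product codes is crawled under each of the three score values, the start URLs form a Cartesian product. A small sketch of generating the page-0 URL for every (score, productId) pair:

```python
from itertools import product

url_tpl = ("https://club.jd.com/comment/skuProductPageComments.action"
           "?callback=fetchJSON_comment98&productId={}&score={}"
           "&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1")
ids = ['100009177424', '100009177400', '100005185609',
       '100009177422', '100005185613', '100009177428']

# One start URL (page 0) per (score, productId) pair: 3 scores x 6 ids = 18 crawls
first_pages = [url_tpl.format(pid, score, 0)
               for score, pid in product(range(3, 0, -1), ids)]
print(len(first_pages))  # → 18
```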
After adding the six-color crawl and a few small optimizations, the code is as follows (skip this if you don't need that much data):
import urllib.request
import json
import random
import sys
import xlwt
import pandas as pd
start_url = "https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId={}&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1"
# If this is still not enough, add the six product codes of the 8GB+256GB configuration as well
id_list = ['100009177424', '100009177400', '100005185609', '100009177422', '100005185613', '100009177428']
user_agents = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
]
header = {
    # Must be spelled 'User-Agent'; with an underscore the server will not recognize it
    'User-Agent': random.choice(user_agents),
    'Referer': 'https://item.jd.com/100009177424.html'
}
def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("Failed to fetch the first page!")
        sys.exit(0)
    content = content.read().decode('gbk', 'replace')
    # Slice off the JSONP wrapper fetchJSON_comment98(...)
    content = content[content.find('(') + 1:content.rfind(')')]
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    with open('京东.txt', 'a', encoding='utf-8') as fp:
        for i in comment:
            text = str(i['content']).replace('\n', ' ')
            fp.write(text + '\n')
    print('First page done!')
    return max_page
def deal(id, score, max_page):
    for i in range(1, max_page):
        url = start_url.format(id, str(score), str(i))
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('Fetch failed!')
            sys.exit(0)
        content = content.read().decode('gbk', 'replace')
        # Slice off the JSONP wrapper
        content = content[content.find('(') + 1:content.rfind(')')]
        text = json.loads(content)
        comment = text['comments']
        with open('京东.txt', 'a', encoding='utf-8') as fp:
            for j in comment:
                text = str(j['content']).replace('\n', ' ')
                fp.write(text + '\n')
        print('Page %s done!' % (i + 1))
def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "content")
    with open('京东.txt', 'r', encoding='utf-8') as f:
        for n, line in enumerate(f, start=1):
            sheet.write(n, 0, line.rstrip('\n'))
    # xlwt writes the legacy .xls format, so save with a matching extension
    book.save('京东.xls')
    data_xls = pd.read_excel('京东.xls')
    data_xls.to_csv('京东.csv', encoding='utf-8')
if __name__ == '__main__':
    for i in range(3, 0, -1):
        for j in id_list:
            getpage_url = start_url.format(j, i, '0')
            page = start_deal(getpage_url)
            deal(j, i, page)  # pass the real page count (was hard-coded to 3)
    book_save()