This post rounds off the previous two and resolves the last remaining problem.
In the first post I covered the basic syntax of a Python crawler, showed how to analyze the page and the JSON responses, and implemented a first working version of the comment scraper. Link: Python爬取京东商品评论(一)
In the second post I reworked that program substantially, improving the amount of data collected and the degree of automation, and tidied up the code; by then the program was fairly complete and stable. Link: Python爬取京东商品评论(二)
This post closes the gap left open in the second one: automatically obtaining the product codes for a phone's different color variants. In the previous version you had to type in every product code by hand before the crawler could fetch the corresponding comments. Here the process becomes more automatic: enter the product page URL on JD once, and the program collects the codes of the other color variants and crawls their comments by itself.
Note that this only works within one fixed storage variant; that is, it can extract all the 64GB colors, or all the 128GB colors, but not both at once.
Page analysis:
I won't repeat how to analyze the page itself; here is the HTML that stores the product codes. As the screenshot shows, the codes live inside a div block with id "choose-attr-1", so the plan is to first extract that "choose-attr-1" div block, then pull the product codes out of it with a regular expression.
So we start by fetching the page's HTML:
url = str(input("请输入网址:"))
req = urllib.request.Request(url=url)
content = urllib.request.urlopen(req)
content = content.read().decode('utf-8')
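To see the fetch-and-decode pattern in isolation, here is a self-contained sketch. The `data:` URL below is a stand-in of my own (not a JD address), so the snippet runs without any network access:

```python
import urllib.request

# A data: URL stands in for the real product page URL, purely for illustration.
url = 'data:text/html;charset=utf-8,<div id="choose-attr-1">demo</div>'
req = urllib.request.Request(url=url)
content = urllib.request.urlopen(req)
html = content.read().decode('utf-8')
print(html)  # -> <div id="choose-attr-1">demo</div>
```

In practice you would pass the item page URL (https://item.jd.com/<sku>.html) and, ideally, a User-Agent header, since requests without one are more likely to be rejected.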
The BeautifulSoup library:
This library parses HTML and extracts information from it. The program only uses its find_all function, so I won't walk through the rest of the library here. First we use it to extract the div block with id "choose-attr-1":
soup = BeautifulSoup(content, 'html.parser')
choose = soup.find_all(id='choose-attr-1')
find_all returns a list-like ResultSet; its elements are the Tag objects for every div block whose id is "choose-attr-1", and each element supports find_all itself, so we can keep drilling down.
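As a self-contained illustration, the sketch below runs the same two-level find_all on a made-up HTML skeleton. The markup (including the data-sku attributes) is my own approximation of the structure described above, not copied from JD's actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical skeleton: a "choose-attr-1" div containing "item" divs
# whose attributes carry the product codes.
html = '''
<div id="choose-attr-1">
  <div class="item" data-sku="100009177424">黑色</div>
  <div class="item" data-sku="100009177460">白色</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
choose = soup.find_all(id='choose-attr-1')  # list-like ResultSet
for block in choose:
    items = block.find_all(class_='item')   # each Tag supports find_all again
    for item in items:
        print(item['data-sku'])
```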
From the page analysis above, the product codes sit inside div blocks with class "item", so we call find_all a second time:
for i in choose:
    j = i.find_all(class_='item')
j is now a ResultSet holding all the div blocks that contain product codes.
Regular expressions:
Regular expressions are another way to extract information from text. Once we have the HTML of the product-code div blocks, one short expression pulls out each code, which we append to a list:
web_regex = re.compile(r"[0-9]+")
for i in choose:
    j = i.find_all(class_='item')
    for one in j:
        text = web_regex.findall(str(one))
        id_list.append(text[0])
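To see what the expression actually matches, here is a quick check against a hypothetical string form of one "item" div (invented for illustration, not JD's real markup):

```python
import re

web_regex = re.compile(r"[0-9]+")
# Hypothetical str(one) for a single "item" div.
one = '<div class="item" data-sku="100009177424">黑色 64GB</div>'
print(web_regex.findall(one))     # -> ['100009177424', '64']
print(web_regex.findall(one)[0])  # first run of digits = the product code
```

Note that text[0] is only correct as long as the product code is the first run of digits in the tag; if the markup ever put another number first, the regex would pick that up instead.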
Full program:
After merging this part into the program from the second post, the tidied-up code is as follows:
import urllib.request
from bs4 import BeautifulSoup
import re
import json
import random
import sys
import xlwt
import pandas as pd
start_url = "https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId={}&score={}&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1"
id_list = []
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
]
header = {
    'User-Agent': random.choice(user_agents),
    'Referer': 'https://item.jd.com/100009177424.html'
}
def get_id_list():
    url = str(input("请输入网址:"))
    req = urllib.request.Request(url=url)
    content = urllib.request.urlopen(req)
    content = content.read().decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    choose = soup.find_all(id='choose-attr-1')
    web_regex = re.compile(r"[0-9]+")
    for i in choose:
        j = i.find_all(class_='item')
        for one in j:
            text = web_regex.findall(str(one))
            id_list.append(text[0])
def start_deal(start_url) -> int:
    req = urllib.request.Request(url=start_url, headers=header)
    content = urllib.request.urlopen(req)
    if content.getcode() != 200:
        print("爬取起始页失败!")
        sys.exit(0)
    content = content.read().decode('gbk', 'replace')
    content = content.strip("fetchJSON_comment98();")
    text = json.loads(content)
    max_page = text['maxPage']
    comment = text['comments']
    fp = open('京东.txt', 'a', encoding='utf-8')
    for i in comment:
        text = str(i['content']).replace('\n', ' ')
        fp.write(text + '\n')
    print('起始页完成!')
    fp.close()
    return max_page
def deal(id, score, max_page):
    for i in range(1, max_page):
        url = start_url.format(id, str(score), str(i))
        req = urllib.request.Request(url=url, headers=header)
        content = urllib.request.urlopen(req)
        if content.getcode() != 200:
            print('爬取失败!')
            sys.exit(0)
        else:
            content = content.read().decode('gbk', 'replace')
            content = content.strip('fetchJSON_comment98();')
            text = json.loads(content)
            comment = text['comments']
            fp = open('京东.txt', 'a', encoding='utf-8')
            for j in comment:
                text = str(j['content']).replace('\n', ' ')
                fp.write(text + '\n')
            print('第%s页完成!' % (i + 1))
            fp.close()
def book_save():
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('京东商品评论', cell_overwrite_ok=True)
    sheet.write(0, 0, "内容")
    n = 1
    with open('京东.txt', 'r', encoding='utf-8') as f:
        li = f.readlines()
    for i in li:
        sheet.write(n, 0, i)
        n += 1
    # xlwt writes the legacy .xls format, so save with a matching extension
    book.save('京东.xls')
    data_xls = pd.read_excel('京东.xls')
    data_xls.to_csv('京东.csv', encoding='utf-8')
if __name__ == '__main__':
    get_id_list()
    for i in range(3, 0, -1):  # score parameter: 3, 2, 1
        for j in id_list:
            getpage_url = start_url.format(j, i, '0')
            page = start_deal(getpage_url)  # page holds maxPage for this item/score
            deal(j, i, 3)  # capped at 3 pages here; pass page instead to crawl them all
    book_save()
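One fragile spot worth knowing about: str.strip('fetchJSON_comment98();') removes any of those individual characters from both ends of the string, not the literal prefix and suffix; it only happens to work because the JSON body begins with { and ends with }. A more explicit unwrapping could look like the sketch below (a regex-based approach of my own, not code from these posts; the sample response is a made-up miniature of the API's shape):

```python
import json
import re

def unwrap_jsonp(raw):
    """Extract the JSON payload from a JSONP response like fetchJSON_comment98({...});"""
    match = re.search(r'\{.*\}', raw, re.S)  # grab the outermost {...}
    if match is None:
        raise ValueError('no JSON object found in response')
    return json.loads(match.group(0))

# Hypothetical miniature of the response shape:
raw = 'fetchJSON_comment98({"maxPage": 3, "comments": [{"content": "不错"}]});'
data = unwrap_jsonp(raw)
print(data['maxPage'])  # -> 3
```

This fails loudly when the response is not what we expect (for example, an HTML error page), instead of handing json.loads a half-stripped string.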