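Overview
The hardest part of scraping Suning (苏宁易购) is that much of the page content is loaded dynamically by JavaScript, so finding and constructing the request URLs takes some work. If you skip reverse-engineering the JS and drive a browser with selenium or a similar tool instead, scraping is noticeably slower. Below I first show screenshots of the dynamically loaded URLs I found, then explain how those URLs are constructed, and finally attach the code. I hope it helps; corrections and suggestions are welcome. Thanks!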
Constructing the pagination URL for the search results page
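import urllib.parse

# URL-encode the search keyword ("Huawei phone") so it can go in the URL path
a = urllib.parse.quote("华为手机")
for i in range(50):
    print("Page %s" % i)
    # cp carries the page index
    url = "https://search.suning.com/" + a + "/&iy=0&isNoResult=0&cp=" + str(i)
    html = self.get_html(url)
    self.get_phone_data(html)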
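To find this pattern, click "next page", look at the link that is requested, and locate the parameter that changes with the page number (here, cp).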
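The paging pattern above can be wrapped in a small helper. This is only a minimal sketch: the name build_search_url is hypothetical, and it assumes the URL layout shown above (keyword in the path, cp as a zero-based page index) is still what the site uses.

```python
import urllib.parse

SEARCH_BASE = "https://search.suning.com/"


def build_search_url(keyword, page):
    """Build a search-results URL for `keyword` at zero-based page index `page`.

    Hypothetical helper: it mirrors the string concatenation in the loop above.
    """
    if page < 0:
        raise ValueError("page must be non-negative")
    # quote() percent-encodes the UTF-8 bytes of the keyword
    encoded = urllib.parse.quote(keyword)
    return SEARCH_BASE + encoded + "/&iy=0&isNoResult=0&cp=" + str(page)


if __name__ == "__main__":
    for page in range(3):
        print(build_search_url("华为手机", page))
```

Keeping URL construction in a pure function makes it easy to check the generated links before sending any requests.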
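Fetching data from the detail page
The data needed from the detail page are the phone's price, the review statistics, and the text of buyers' reviews.
Constructing the URL for the phone price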
import re

# Get the phone's price; the price URL has to be pieced together by hand
def get_price_html(self, goods_src):
    try:
        # Take the path between "com/" and ".html", e.g. "<key0>/<key1>"
        src_args = re.findall(r"com/(.*?)\.html", goods_src)[0]
        key0 = src_args.split("/")[0]
        key1 = src_args.split("/")[-1]
        price_src = "https://pas.suning.com/nspcsale_0_0000000" + key1 + "_0000000" + key1 + "_" + key0 + "_250_029_0290199_20089_1000257_9254_12006_Z001___R1901001_0.5_0___000060864___.html?callback=pcData&_=1581050220963"
        html = self.get_html(price_src)
        price = re.compile(r'"netPrice":"(.*?)"', re.S)
        price_ret = price.findall(html)
        return price_ret[0]
    except Exception:
        # -1 signals that the URL or the price could not be parsed
        return -1
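The price request carries callback=pcData, which suggests a JSONP response: a JSON payload wrapped in a function call. As an alternative to running a regex over the raw text, the wrapper can be stripped and the payload parsed as JSON. The following is a sketch under that assumption. The helper names split_goods_src, parse_jsonp and find_key are hypothetical, and the goods URL and sample response are invented for illustration; the real payload may be structured differently.

```python
import json
import re


def split_goods_src(goods_src):
    """Return (key0, key1): the first and last path segments before '.html'."""
    match = re.search(r"com/(.*?)\.html", goods_src)
    if match is None:
        raise ValueError("unexpected goods URL: %r" % goods_src)
    parts = match.group(1).split("/")
    return parts[0], parts[-1]


def parse_jsonp(text):
    """Strip a callback(...) wrapper and parse the JSON payload inside it."""
    match = re.search(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def find_key(obj, key):
    """Depth-first search for the first value stored under `key`, or None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    # Invented inputs, for illustration only
    print(split_goods_src("https://example.com/0000000000/12345.html"))
    sample = 'pcData({"data": {"price": {"saleInfo": [{"netPrice": "1999.00"}]}}});'
    print(find_key(parse_jsonp(sample), "netPrice"))
```

Searching for the key recursively avoids hard-coding a nesting path that has not been verified against the real response.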
URL for the phone's review statistics
# Get the review counts
def get_comment_num(self, clsid, goods_src):
    src_args = re.findall(r"com/(.*?)\.html", goods_src)[0]
    key1 = src_args.split("/")[-1]
    # With a cluster id use the "cluster" endpoint, otherwise the "general" one
    if clsid:
        url = "https://review.suning.com/ajax/review_count/cluster-" + str(clsid) + \
              "-0000000" + str(key1) + "-0000000000-----satisfy.htm?callback=satisfy"
    else:
        url = "http://review.suning.com/ajax/review_count/general--0000000" + str(key1) + "-0000000000-----satisfy.htm?callback=satisfy"
    html = self.get_html(url)
    # print(html)
    oneStarCount = re.findall(r'"oneStarCount":(.*?),', html)[0]
    twoStarCount = re.findall(r'"twoStarCount":(.*?),', html)[0]
    threeStarCount = re.findall(r'"threeStarCount":(.*?),', html)[0]
    fourStarCount = re.findall(r'"fourStarCount":(.*?),', html)[0]
    fiveStarCount = re.findall(r'"fiveStarCount":(.*?),', html)[0]
    picFlagCount = re.findall(r'"picFlagCount":(.*?),', html)[0]
    totalCount = re.findall(r