This post is mainly an exercise in web scraping; for the full walkthrough, see this link:
[爬虫]python最新爬取京东评论+词云图+LDA模型分析_哔哩哔哩_bilibili
The Scrapy framework needs no introduction; the project layout is as follows:
The main logic lives in JDSpiders. Next, inspect the network traffic behind JD's comment section: search for a comment string in the browser's DevTools, and you will find that all the target data sits inside `comments`:
Open that API URL:
The response is JSON; the fields to extract (comment text, location, time, nickname) are shown above.
You can also see that only the `page` parameter changes when flipping pages, so the request URLs can be built by simply looping `page` from 0 to 99 (JD only exposes the first 100 pages of comments).
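Since only `page` varies between requests, the paged URLs reduce to string concatenation. A minimal sketch (the query string is abbreviated here for readability; the real spider keeps the full parameter list):

```python
# Build the comment-API URLs for pages 0..99 (JD only serves the first 100 pages).
# The query string below is shortened; only productId and page matter for the sketch.
url_head = 'https://api.m.jd.com/?functionId=pc_club_productPageComments&productId=100083659339'
url_middle = '&page='
url_end = '&pageSize=10'

urls = [url_head + url_middle + str(page) + url_end for page in range(0, 100)]

print(len(urls))   # 100
print(urls[0])
print(urls[99])
```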
Now for the code:
1. In items.py, define the fields to collect:
class JdGoodsCommitItem(scrapy.Item):
    shop_id = scrapy.Field()
    content = scrapy.Field()
    creationTime = scrapy.Field()
    nickname = scrapy.Field()
    score = scrapy.Field()
    location = scrapy.Field()
2. The JDSpiders spider itself:
import scrapy
import json
from jd.items import JdGoodsCommitItem

class JDSpiders(scrapy.Spider):
    name = 'JDSpiders'
    # the comment API is served from api.m.jd.com, so allow the whole jd.com domain
    # (restricting to www.jd.com would make the offsite middleware drop every request)
    allowed_domains = ['jd.com']
    url_head = 'https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1710822302658&loginType=3&uuid=181111935.1006305175.1710410734.1710766111.1710822119.19&productId=100083659339&score=0&sortType=5'
    url_middle = '&page='
    url_end = '&pageSize=10&isShadowSku=0&fold=1&bbtf=&shield='

    def start_requests(self):
        # JD only exposes the first 100 pages of comments
        for i in range(0, 100):
            # build the URL for page i
            url = self.url_head + self.url_middle + str(i) + self.url_end
            print("Current page:", url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # the response body is JSON, not HTML, so parse it with json rather than BeautifulSoup
        data = json.loads(response.text)
        for comment in data.get('comments', []):
            item = JdGoodsCommitItem()
            item['content'] = comment.get('content')
            item['creationTime'] = comment.get('creationTime')
            item['nickname'] = comment.get('nickname')
            item['score'] = comment.get('score')
            item['location'] = comment.get('location')
            yield item
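To sanity-check the extraction logic without running Scrapy, the same JSON handling can be exercised on a mock payload (field names as observed in the API response above; the sample values are made up):

```python
import json

# Mock of the comment API's JSON body, using the fields the spider extracts.
payload = json.dumps({
    "comments": [
        {"content": "很好用", "creationTime": "2024-03-19 10:00:00",
         "nickname": "j***n", "score": 5, "location": "北京"}
    ]
})

data = json.loads(payload)
rows = [
    {k: c.get(k) for k in ("content", "creationTime", "nickname", "score", "location")}
    for c in data.get("comments", [])
]
print(rows[0]["nickname"])  # j***n
print(rows[0]["score"])     # 5
```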
3. Set the save path in the pipeline:
# in pipelines.py; requires `import pandas as pd` at the top of the file
def process_item(self, item, spider):
    if spider.name == 'JDSpiders':
        # wrap the item's fields in a one-row DataFrame and append it to the CSV
        row = pd.DataFrame({'nickname': item['nickname'], 'content': item['content'], 'score': item['score'],
                            'creationTime': item['creationTime'], 'location': item['location']}, index=[0])
        row.to_csv('dataset/jd_goods.csv', index=False, encoding='utf-8', mode='a', header=False)
    return item
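If pandas feels heavy for appending one row at a time, the stdlib csv module does the same job. A sketch, keeping the same column order as the DataFrame above (`append_comment_row` is a helper name introduced here, and the sample values are made up):

```python
import csv
import os

def append_comment_row(path, item):
    """Append one comment record to a CSV file, creating parent dirs as needed."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([item['nickname'], item['content'], item['score'],
                         item['creationTime'], item['location']])

# usage, with a plain dict standing in for the scrapy item
append_comment_row('dataset/jd_goods.csv',
                   {'nickname': 'j***n', 'content': '很好用', 'score': 5,
                    'creationTime': '2024-03-19 10:00:00', 'location': '北京'})
```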
Also remember to update settings.py: register the pipeline from step 3 under ITEM_PIPELINES, and enable the downloader middleware used in step 4:
DOWNLOADER_MIDDLEWARES = {
'jd.middlewares.JdDownloaderMiddleware': 543,
}
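For completeness, the pipeline registration looks like this; the class name below is an assumption, so substitute whatever your pipeline class in pipelines.py is actually called:

```python
# settings.py -- enable the item pipeline (class name assumed, adjust to yours)
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
}
```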
4. My middleware drives the Edge browser (you can skip the middleware entirely, since only page flipping is involved and no browser login is needed):
# in middlewares.py; requires `from selenium import webdriver` at the top of the file
def spider_opened(self, spider):
    spider.logger.info('Spider opened: %s' % spider.name)

# class attribute holding the lazily created driver; named `driver` so it
# does not shadow the selenium `webdriver` module
driver = None

def get_driver(self):
    if self.driver is None:
        options = webdriver.EdgeOptions()
        options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
        # create the Edge driver (msedgedriver must be on PATH)
        self.driver = webdriver.Edge(options=options)
    return self.driver
The scraped data looks like this:
Writing this up wasn't easy, so remember to leave a like before you go~