I had seen App Store review scrapers online before, but because they rely on the official API, they can fetch at most 500 reviews per app, which is nowhere near enough for data analysis. So after some analysis, I wrote a scraper that can fetch far more reviews.
1 Configuration file (config_api.json)
{
    "max_page": 5,
    "ids": ["id of the app to crawl", "id of another app to crawl"],
    "headers": {
        "User-Agent": "your own",
        "Authorization": "your own"
    },
    "intervals": 2
}
First, a quick explanation of the configuration file (a filled-in example follows this list):
max_page: the maximum number of review pages to crawl; each page holds 10 reviews;
ids: the list of app ids to crawl;
headers: the request headers the browser sends with its requests;
intervals: how many seconds to wait between fetching successive pages of reviews.
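For reference, a filled-in config might look like the sketch below. The app ids and header values here are placeholders, not working credentials; in my experience the Authorization value is a Bearer token that, together with the User-Agent, can be copied from the request headers shown in the browser's developer tools while viewing an app's review page on apps.apple.com.

{
    "max_page": 50,
    "ids": ["123456789", "987654321"],
    "headers": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Authorization": "Bearer eyJhbGciOiJFUzI1NiIs..."
    },
    "intervals": 2
}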
2 Code (spider.py)
import os
import csv
import json
import time
import requests
next_url = None
review_path = 'reviews'
if not os.path.exists(review_path):
    os.mkdir(review_path)
with open('config_api.json', 'r') as file:
    config = json.load(file)
pending_queue = config['ids']
max_page = config['max_page']
headers = config['headers']
intervals = config['intervals']
# Send a request and return the parsed JSON response
def get_response(app_id, page):
    time.sleep(intervals)
    url = ('https://amp-api.apps.apple.com/v1/catalog/cn/apps/' + app_id
           + '/reviews?l=zh-Hans-CN&offset=' + str(page * 10)
           + '&platform=web&additionalPlatforms=appletv%2Cipad%2Ciphone%2Cmac')
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        return r.json()
    except requests.exceptions.HTTPError:
        # Returning None lets the caller stop paginating instead of crashing on an error string
        return None
# Parse the response and yield one review at a time
def parse_response(r):
    global next_url
    # The "next" field signals that more pages are available
    if "next" in r:
        next_url = r['next']
    else:
        next_url = None
    for item in r['data']:
        yield {
            "id": item['id'],
            "type": item['type'],
            "title": item['attributes']['title'],
            "userName": item['attributes']['userName'],
            "isEdited": item['attributes']['isEdited'],
            "review": item['attributes']['review'],
            "rating": item['attributes']['rating'],
            "date": item['attributes']['date']
        }
# Append one review to the app's CSV file
def write_to_file(app_id, item):
    file_path = f'{review_path}/{app_id}.csv'
    write_header = not os.path.exists(file_path)
    with open(file_path, 'a', encoding='utf-8-sig', newline='') as csv_file:
        fieldnames = ['id', 'type', 'title', 'userName', 'isEdited', 'review', 'rating', 'date']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(item)
# Main loop: crawl every app id in the queue
def main():
    while pending_queue:
        cur_id = pending_queue.pop()
        print(f'Started crawling {cur_id}')
        for i in range(max_page):
            r = get_response(cur_id, i)
            if r is None:
                print(f'Failed to fetch page {i + 1}, stopping')
                break
            print(f'Fetched page {i + 1} of reviews')
            for item in parse_response(r):
                write_to_file(cur_id, item)
            print(f'Saved page {i + 1} of reviews')
            # Stop when the API reports no further pages
            if not next_url:
                break
        print(f'Finished crawling {cur_id}')

if __name__ == '__main__':
    main()
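Before kicking off a long run with `python spider.py`, it can be worth checking that the copied headers still work. The snippet below is a minimal sketch, assuming the same config_api.json and the same endpoint used in spider.py; it requests only the first page of reviews for the first configured app id and reports how many items came back.

import json
import requests

with open('config_api.json', 'r') as file:
    config = json.load(file)

app_id = config['ids'][0]
url = ('https://amp-api.apps.apple.com/v1/catalog/cn/apps/' + app_id
       + '/reviews?l=zh-Hans-CN&offset=0&platform=web'
       + '&additionalPlatforms=appletv%2Cipad%2Ciphone%2Cmac')
r = requests.get(url, headers=config['headers'])
print(r.status_code)   # a 401 here usually means the token has expired
if r.ok:
    print(len(r.json()['data']), 'reviews on the first page')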
3 Results preview
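Each app id ends up with its own reviews/<app_id>.csv containing the id, type, title, userName, isEdited, review, rating, and date of every scraped review. Since the whole point of fetching more reviews is analysis, here is a minimal sketch (assuming pandas is installed and a run has already produced a result file with the header row written by write_to_file; the file name is a placeholder) that loads one CSV and looks at the rating distribution.

import pandas as pd

df = pd.read_csv('reviews/123456789.csv', encoding='utf-8-sig')  # replace with a real app id
print(df.shape)                                   # (number of reviews, 8 columns)
print(df['rating'].value_counts().sort_index())   # counts of 1- to 5-star reviews
print(df[['date', 'title', 'rating']].head())     # a quick peek at the data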
4 Closing remarks
If you have questions or suggestions, feel free to leave a comment. If this post helped you, you are also welcome to follow my WeChat official account. Thanks.