1. Where the scraped data is stored:
cmts: [
    {
        comment time: time
        nickname: nickName
        gender: gender
        city: cityName
        content: content
        Maoyan user level: userLevel
        score: score
    }
    {}
    {}
    ...
]
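To make the layout above concrete, here is a minimal sketch that pulls those seven fields out of a response shaped like it. The sample payload is made up for illustration (it is not real Maoyan data), and extract_rows and FIELDS are hypothetical names, not part of the original script.

```python
# Made-up sample shaped like the structure above; not real Maoyan data.
sample = {
    "cmts": [
        {
            "time": "2019-07-14 10:00",
            "nickName": "user_a",
            "gender": 1,
            "cityName": "Beijing",
            "content": "Nice movie",
            "userLevel": 2,
            "score": 4.5,
        },
        {"nickName": "user_b", "content": "OK"},  # missing keys become None
    ]
}

# The seven keys listed above, in the order they are written to the file.
FIELDS = ["time", "nickName", "gender", "cityName", "content", "userLevel", "score"]


def extract_rows(payload):
    """Return one list per comment, in FIELDS order (hypothetical helper)."""
    comments = payload.get("cmts") or []  # tolerate a missing or null 'cmts'
    return [[c.get(field) for field in FIELDS] for c in comments]


rows = extract_rows(sample)
for row in rows:
    print(row)
```

Using .get() for every field means a comment with missing keys still yields a full-width row, which keeps the CSV columns aligned.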
2. Implementation:
import csv
import os
import requests
from fake_useragent import UserAgent
import pandas as pd
# Fetch one page of comments
def get_comment(id, offset=0):
    url = 'http://m.maoyan.com/mmdb/comments/movie/%s.json?_v_=yes' % id
    params = {
        'offset': offset
    }
    headers = {
        'User-Agent': str(UserAgent(verify_ssl=False).random)
    }
    req = requests.get(url=url, params=params, headers=headers).json()
    # comment = req.text
    # comment = json.loads(comment)
    comment = req.get('cmts')
    print(comment)
    # Fields kept from each comment:
    #   comment time: time
    #   nickname: nickName
    #   gender: gender
    #   city: cityName
    #   content: content
    #   Maoyan user level: userLevel
    #   score: score
    list_info = []
    for comment_singe in comment:
        time = comment_singe.get('time')
        nickName = comment_singe.get('nickName')
        gender = comment_singe.get('gender')
        cityName = comment_singe.get('cityName')
        content = comment_singe.get('content')
        userLevel = comment_singe.get('userLevel')
        score = comment_singe.get('score')
        list_one = [time, nickName, gender, cityName, content, userLevel, score]
        list_info.append(list_one)
    return list_info
# Write the rows to the CSV file
def writer_file(list_info):
    path = '/home/kiosk/PycharmProjects/Scrapy/爬取猫眼评论/data.csv'
    # Treat a missing file like an empty one (os.path.getsize raises if the file does not exist)
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # Header row: comment time, nickname, gender, city, content, Maoyan level, score
        name = ['评论时间', '评论昵称', '性别', '所在城市', '内容', '猫眼等级', '评分']
        # Build the DataFrame
        file_test = pd.DataFrame(columns=name, data=list_info)
        # Write the data without the index column
        file_test.to_csv(path, encoding='utf-8', index=False)
    else:
        # Append to the existing file, using the same encoding as the first write
        with open(path, 'a+', newline='', encoding='utf-8') as file_test:
            writer = csv.writer(file_test)
            # Write the rows
            writer.writerows(list_info)
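if __name__ == '__main__':
    # Change this to another movie's id to scrape a different film
    id = '1189879'
    try:
        for offset in range(1, 2000):
            # Fetch one page and write it out
            list_info = get_comment(id, offset=offset)
            writer_file(list_info)
    except TypeError:
        # Iterating over None raises TypeError once the response has no 'cmts'
        # The message below means "crawl finished"
        print('-----------------爬取完成----------------------')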
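The loop above detects the last page through a TypeError. As a minimal alternative sketch, assuming the endpoint simply returns no cmts list once the comments run out (the TypeError suggests this, but it is not confirmed here), the loop can stop explicitly on an empty page. crawl_all and fake_fetch are hypothetical names; fake_fetch stands in for get_comment so the sketch runs without network access.

```python
def crawl_all(fetch_page, max_pages=2000):
    """Collect rows page by page until a page comes back empty (hypothetical helper).

    fetch_page(offset) is assumed to return a list of rows, or None / []
    when there is nothing left.
    """
    all_rows = []
    for offset in range(max_pages):
        rows = fetch_page(offset)
        if not rows:  # None or empty list: treat as the end of the data
            break
        all_rows.extend(rows)
    return all_rows


def fake_fetch(offset):
    """Stand-in for get_comment: three pages of made-up rows, then None."""
    pages = [[["a"], ["b"]], [["c"]], [["d"]]]
    return pages[offset] if offset < len(pages) else None


result = crawl_all(fake_fetch)
print(len(result))
```

To plug in the real crawler, get_comment would first need to return an empty list when cmts is absent instead of raising, after which it could be passed in as crawl_all(lambda offset: get_comment(id, offset=offset)).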
Problem encountered: the time written to the file is xxx
Fix: process the time value so that it is stored in 2019-07-14 format:
time = (comment_singe.get('time')).split()[0]
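The split above keeps only the part before the first whitespace. Here is a minimal sketch of that step, assuming the time field is a string such as '2019-07-14 12:30:45' (the exact format returned by the API is not shown here) and that it may be missing. date_only is a hypothetical helper name.

```python
def date_only(raw_time):
    """Return the date part of a 'YYYY-MM-DD HH:MM:SS' style string (hypothetical helper)."""
    if not raw_time:  # None or empty string: nothing to split
        return None
    parts = raw_time.split()  # split on whitespace
    return parts[0] if parts else None


print(date_only("2019-07-14 12:30:45"))  # prints 2019-07-14
```

Guarding against None matters because comment_singe.get('time') returns None when the key is absent, and calling .split() on None would raise an AttributeError.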
PS: I ran the scraper 3 times without getting my IP blocked. Maoyan seems to let you scrape on purpose, to boost its traffic.