Scraping all answers to the Zhihu question "即将步入研究生,有什么忠告?" with Python, writing the data to Excel, and generating an .html file

I've been learning Python web scraping for a little over a week, so today I practiced on a Zhihu question that interests me: 即将步入研究生,有什么忠告? ("About to start graduate school, any advice?"). It has 272 answers in all. The goal of this scrape is to collect every answerer's nickname, headline (their one-line bio), upvote count, and the full answer content.

First, a quick look at the page:

[Screenshot from Zhihu]

I naively assumed it would all be sitting in the "Elements" panel and happily banged out a pile of code. Heh. Far too naive.

Since the answers aren't in "Elements", the page must be requesting the data with parameters.

So let's check "XHR" under the "Network" tab. Every time I scroll down and new answers load, an "answers?..." request appears. My gut says that's the thing. Let's take a look:

[Screenshot from Zhihu]

That's it. Let's look closer:

[Screenshots from Zhihu]

Well, crap. There's no pattern at all!!!

A site this big can't possibly be patternless. Keep looking...

[Screenshots from Zhihu]

Sure enough, from the third "answers?..." request onward there is a pattern: "limit" is always 5 and "offset" increases by 5 each time.
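
To convince ourselves, here is a quick sketch that regenerates the observed request sequence (the question ID 64270965 and the offset/limit values come straight from the captured XHRs; everything else is just illustration):

base = 'https://www.zhihu.com/api/v4/questions/64270965/answers'
# request 1: offset 0, limit 3; request 2: offset 3, limit 5;
# from request 3 on: limit stays 5 and offset climbs 8, 13, 18, ...
pages = [(0, 3), (3, 5)] + [(offset, 5) for offset in range(8, 278, 5)]
for offset, limit in pages[:5]:
    print('%s?offset=%d&limit=%d' % (base, offset, limit))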

So: one small piece of code for each of the first two requests, and one big for loop for everything after that. (The first two don't really need for loops, but to match the later one I used them anyway.)

Here's the code:

# Load the required modules
import time
import requests
import openpyxl

time1 = time.time()

lists = []
lists.append(['answer_kname', 'headline', 'voteup_count', 'content'])

url = 'https://www.zhihu.com/api/v4/questions/64270965/answers'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
# The long "include" value is copied verbatim from the captured XHR request
include = 'data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics'

def grab(offset, limit):
    """Fetch one page of answers and append each answer's fields to lists."""
    params = {
        'include': include,
        'offset': str(offset),
        'limit': str(limit),
        'sort_by': 'default',
        'platform': 'desktop'
    }
    res = requests.get(url, headers=headers, params=params)
    for item in res.json()['data']:
        answer_kname = item['author']['name']   # answerer's nickname
        headline = item['author']['headline']   # one-line bio
        content = item['content']               # answer body (HTML)
        voteup_count = item['voteup_count']     # upvote count
        lists.append([answer_kname, headline, voteup_count, content])

##################
# First request: offset 0, limit 3
for i in range(0, 1, 1):
    grab(i, 3)

##################
# Second request: offset 3, limit 5
for i in range(3, 4, 1):
    grab(i, 5)

##################
# Third request onward: limit is always 5, offset grows by 5
for i in range(8, 278, 5):
    grab(i, 5)

##################
# Write everything into an Excel workbook
file = openpyxl.Workbook()
sheet = file.active
sheet.title = 'answers'
for i in lists:
    sheet.append(i)
file.save('即将步入研究生,有什么忠告.xlsx')

##################
# Concatenate the answer bodies (already HTML fragments) into one .html file;
# skip lists[0], which is the header row
file_html = open('知乎:即将步入研究生,有什么忠告.html', 'w', encoding='utf-8')
for i in lists[1:]:
    file_html.write(i[3])
file_html.close()

##################
time2 = time.time()
print('Scraping took %.3f seconds' % (time2 - time1))
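
One thing I'd improve next time: range(8, 278, 5) is hardcoded for exactly 272 answers. The JSON responses also carry a paging object next to data, and assuming it exposes an is_end flag (that's an assumption from eyeballing the captured responses, not documented behavior), the loop could stop by itself. A minimal sketch, reusing url, headers, include, and lists from the script above:

# Sketch: let the API's own paging info decide when to stop,
# instead of hardcoding range(8, 278, 5).
# Assumes res_json['paging']['is_end'] exists -- check a real response first.
offset, limit = 0, 5
while True:
    params = {'include': include, 'offset': str(offset), 'limit': str(limit),
              'sort_by': 'default', 'platform': 'desktop'}
    res_json = requests.get(url, headers=headers, params=params).json()
    for item in res_json['data']:
        lists.append([item['author']['name'], item['author']['headline'],
                      item['voteup_count'], item['content']])
    if res_json['paging']['is_end']:   # assumed field name
        break
    offset += limit
    time.sleep(1)   # pause between requests to be polite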
