"Thanks for the invite. Currently in the US, just got off the plane!"
The line above is a meme that every regular Zhihu lurker knows by heart. At some point Zhihu turned into one big collection of jokes, low-effort posts, and bait threads, and the overall quality is nowhere near what it used to be. Add in the mysterious moderation that deletes posts and bans accounts for no stated reason, and Zhihu just isn't the Zhihu it once was!
"Bihu: share the story you just made up." And of course there are plenty of the bait threads that a certain thirsty crowd loves. Begging for real answers here; manual dog-head emoji to cover myself!!
Target link to scrape: https://www.zhihu.com/question/328457531
Using one of those bait threads as the example, this post walks through a Python scraper for Zhihu Q&A: a demo that grabs both the text and the images from the answers, with no cookies and no login. All you need is the question link or its id.
A certain crowd's favorite!!!
There are three ways to specify a Zhihu question:
First, a link that includes an answer: https://www.zhihu.com/question/328457531/answer/855549300
Second, a plain question link: https://www.zhihu.com/question/328457531
Third, the bare id: 328457531
Reference source:
#Get the Zhihu question id
#20201208 @author:WX:huguo00289
#@WeChat official account: 二爷记
# -*- coding: UTF-8 -*-
import re

def get_id(url):
    if "question" in url and "answer" in url:
        print("You entered a full answer URL, extracting the id..")
        id = re.search(r'question/(.+?)/answer', url).group(1)
    elif "question" in url:
        print("You entered a question URL, extracting the id..")
        id = url.split('/')[-1]
    else:
        print("You entered a bare question id, using it as-is..")
        id = url
    print(f'>> The Zhihu question id is: {id}')
    return id
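A quick sanity check of the three input forms, with the id-extraction logic restated inline (without the prints) so the snippet runs standalone:

```python
import re

def get_id(url):
    # Full answer URL -> take the segment between question/ and /answer
    if "question" in url and "answer" in url:
        return re.search(r'question/(.+?)/answer', url).group(1)
    # Plain question URL -> the id is the last path segment
    elif "question" in url:
        return url.split('/')[-1]
    # Otherwise assume a bare id was passed in
    return url

print(get_id("https://www.zhihu.com/question/328457531/answer/855549300"))  # 328457531
print(get_id("https://www.zhihu.com/question/328457531"))                   # 328457531
print(get_id("328457531"))                                                  # 328457531
```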
Since nearly all of Zhihu's data endpoints return JSON, you can just request the API and parse the response; the only things to watch are the pagination scheme and its parameters!
Three parameters matter here:
question id — the id from the question link
limit — number of answers returned per request; generally 5, but for the initial page (which I treat as page 0) I set it to 3
offset — the paging offset
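The long query string in the requests below is just URL-encoded parameters; a minimal sketch of assembling it with the standard library (the real request also carries a long `include` field list, abbreviated away here — this is my own helper, not code from the post):

```python
from urllib.parse import urlencode

def build_answers_url(qid, limit=5, offset=0):
    # The production URL also has a huge `include=` field list; omitted for brevity
    params = {"limit": limit, "offset": offset,
              "platform": "desktop", "sort_by": "default"}
    return f"https://www.zhihu.com/api/v4/questions/{qid}/answers?{urlencode(params)}"

print(build_answers_url(328457531, limit=3, offset=0))
```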
Reference source for fetching a single page:
#Fetch a single page of answers (a method on the scraper class;
#requires `import requests, time, json` at module level)
def get_content(self, page):
    url = f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=5&offset={page}&platform=desktop&sort_by=default"
    response = requests.get(url, headers=self.headers, timeout=5)
    time.sleep(2)
    print(response.status_code)
    html = response.content.decode('utf-8')
    req = json.loads(html)
    json_datas = req['data']
    self.get_data(page, json_datas)
Reference source for fetching page 0 along with the total answer count:
#Fetch page 0 and read the total answer count from the paging metadata
def get_pagenum(self):
    page = 0
    url = f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=3&offset=&platform=desktop&sort_by=default"
    response = requests.get(url, headers=self.headers, timeout=5)
    time.sleep(2)
    print(f">> Fetching page {page}..")
    print(response.status_code)
    html = response.content.decode('utf-8')
    req = json.loads(html)
    totals = req['paging']['totals']
    print(f'Total answers: {totals}')
    self.get_page(totals)
    json_datas = req['data']
    self.get_data(page, json_datas)
The interesting part is how Zhihu groups the answers into pages; here is my take on the paging math, which may not be entirely accurate!
#Compute the number of remaining answer pages
def get_page(self, totals):
    pagenum = (int(totals) - 4) / 5
    # A fractional result means one extra, partially filled page
    if pagenum > int(pagenum):
        pagenum = int(pagenum) + 1
    else:
        pagenum = int(pagenum)
    self.pagenum = pagenum
    print(f'>> {self.pagenum} answer pages in total')
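The get_data method called in the snippets above is never shown in this post. Here is a minimal standalone sketch of what it could look like, under the assumption that each answer dict carries an HTML `content` field and an `author.name` field (which is what the `include` list requests); the regex-based image extraction and tag stripping are my own choices, not the post's actual implementation:

```python
import re

def get_data(page, json_datas):
    """Pull plain text and image URLs out of one page of answer JSON."""
    results = []
    for data in json_datas:
        author = data.get('author', {}).get('name', 'anonymous')
        content = data.get('content', '')  # answer body is an HTML fragment
        # Collect image URLs from <img src="..."> tags
        imgs = re.findall(r'<img[^>]+src="(https?://[^"]+?)"', content)
        # Crude tag stripper; fine for a demo, use an HTML parser for real work
        text = re.sub(r'<[^>]+>', '', content)
        results.append({'page': page, 'author': author,
                        'text': text, 'imgs': imgs})
    return results
```

In the original class this would be a method (`self.get_data(page, json_datas)`) that also writes the text and downloads the images to disk; the sketch only does the parsing step.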
Run output:
Scraping results:
Getting the full Python source
Follow my WeChat official account: 二爷记
and reply "知乎问答" in the chat.
I've also packaged an exe tool for you all, for learning and exchange only!
Link: https://pan.baidu.com/s/1rZh6aQ8fPY5TJKqlkS4eEQ
Extraction code: d5n2
Note:
The exe was built on 64-bit Windows 7.
Windows 7 is recommended; other systems may run into compatibility issues.
To get the exe authorization code,
follow the WeChat official account: 二爷记
and reply "知乎问答授权" in the chat.