Python Crawler: Scraping Pretty-Girl Pictures and Text from Zhihu Q&A

Latest recommended article published on 2021-08-23 11:56:05

二爷记


Category column: Python crawler. Article tags: python, machine learning, artificial intelligence, mini program, html

Article link: https://blog.csdn.net/minge89/article/details/110914268


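Thanks for the invite. I'm in the US, just got off the plane!

The line above is a meme that nearly every regular of the Zhihu community knows by heart. At some point Zhihu turned into the biggest gathering place for memes, filler posts, and bait threads, and the quality is nowhere near what it used to be. On top of that the "river crab" censors are on the prowl, with moderators deleting posts and banning accounts for no apparent reason. Zhihu is not the Zhihu it once was!

逼乎 (a mocking nickname for Zhihu): share the story you just made up. And of course there are all the bait threads that LSPs love. Too real, doge emoji for self-protection!!

Target link: https://www.zhihu.com/question/328457531

Using one of those bait threads as the example, this post presents a Python crawler demo for Zhihu Q&A that scrapes answer text and images without a Cookie. No login is needed to fetch the Q&A data; all you need is the question link or its id.

An LSP favorite!!!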

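A Zhihu question can be specified in three ways:

First, a link that includes an answer: https://www.zhihu.com/question/328457531/answer/855549300
Second, a link without an answer: https://www.zhihu.com/question/328457531
Third, the bare id: 328457531

Reference source code: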

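# -*- coding: UTF-8 -*-
# Get the Zhihu question id
# 20201208 @author: WX: huguo00289
# @WeChat official account: 二爷记

import re

def get_id(url):
    # Accepts an answer URL, a question URL, or a bare id.
    # Both substrings must be tested explicitly; `"question" and "answer" in url`
    # only checks for "answer".
    if "question" in url and "answer" in url:
        print("You entered a full answer URL, extracting the id..")
        question_id = re.search(r'question/(.+?)/answer', url).group(1)
    elif "question" in url:
        print("You entered a question URL, extracting the id..")
        # Drop a trailing slash so the last path segment is the id.
        question_id = url.rstrip('/').split('/')[-1]
    else:
        print("You entered a question id, id obtained..")
        question_id = url
    print(f'>> The Zhihu question id you entered is: {question_id}')
    return question_id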
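
As an alternative sketch, the three input forms can also be handled with a single regular expression. `extract_question_id` is a hypothetical helper, not part of the original script, and it assumes question ids are purely numeric, as in the example links above.

```python
import re

def extract_question_id(text):
    """Return the question id from an answer URL, a question URL, or a bare id."""
    text = text.strip()
    # Bare id: digits only (assumption: ids are always numeric).
    if text.isdigit():
        return text
    # URL: take the digits that directly follow "question/".
    match = re.search(r"question/(\d+)", text)
    if match is None:
        raise ValueError(f"no question id found in: {text!r}")
    return match.group(1)
```

Because the pattern stops at the first non-digit, a trailing slash or a query string after the id should not affect the result.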

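Almost all of Zhihu's data endpoints return JSON, so you can simply request the API and parse the response. The one thing to watch is the pagination scheme and its parameters!

Three parameters matter here:

Question ID: the ID in the Zhihu question link
limit: the number of answers returned per request, usually 5; the initial answer page, which I treat as page 0, uses 3
offset: the pagination position, passed as the page number in this demo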
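
To make the three parameters concrete, here is a minimal sketch that assembles the request URL from them. `build_answers_url` and `API_ROOT` are hypothetical names, and the default `include` value is a shortened stand-in for the much longer field list used in the source code of this post.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.zhihu.com/api/v4/questions"

def build_answers_url(question_id, limit=5, offset=0,
                      include="data[*].content,voteup_count"):
    """Build the answers URL; urlencode percent-encodes the include list."""
    params = {
        "include": include,      # assumption: shortened field list
        "limit": limit,          # answers per request
        "offset": offset,        # pagination position
        "platform": "desktop",
        "sort_by": "default",
    }
    return f"{API_ROOT}/{question_id}/answers?{urlencode(params)}"

print(build_answers_url("328457531", limit=3, offset=0))
```

Letting `urlencode` do the escaping avoids hand-maintaining sequences such as `%5B%2A%5D` in the URL string.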

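Reference source for fetching a single page of data (a method of the crawler class; it needs `requests`, `time`, and `json` imported at module level):

    # Fetch a single page of data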
    def get_content(self,page):
        url=f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=5&offset={page}&platform=desktop&sort_by=default"
        response=requests.get(url,headers=self.headers,timeout=5)
        time.sleep(2)
        print(response.status_code)
        html=response.content.decode('utf-8')
        req=json.loads(html)
        json_datas=req['data']
        self.get_data(page,json_datas)

Reference source for fetching page 0 and the answer count:

    # Fetch page 0 and the total number of answers
    def get_pagenum(self):
        page=0
        url = f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=3&offset=&platform=desktop&sort_by=default"
        response = requests.get(url, headers=self.headers, timeout=5)
        time.sleep(2)
        print(f">> Fetching page {page}..")
        print(response.status_code)
        html = response.content.decode('utf-8')
        req = json.loads(html)
        totals=req['paging']['totals']
        print(f'Total number of answers: {totals}')
        self.get_page(totals)
        json_datas = req['data']
        self.get_data(page,json_datas)
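
The `get_data` method called above is not shown in this post. As a rough sketch of what parsing one entry of `data` might look like, the function below pulls the author name, plain text, and image links out of a single answer. `parse_answer` is a hypothetical helper, and it assumes each entry carries an HTML string under `content` and an `author` dict with a `name` key; the real response may differ.

```python
import html
import re

def parse_answer(item):
    """Extract author, plain text, and image URLs from one answer dict."""
    content = item.get("content", "")              # assumption: HTML string
    author = item.get("author", {}).get("name", "")
    # Collect http(s) image URLs from src="..." attributes, in order, without duplicates.
    images = []
    for src in re.findall(r'<img[^>]*?\ssrc="(https?://[^"]+)"', content):
        if src not in images:
            images.append(src)
    # Strip the tags, then decode entities such as &amp; to get plain text.
    text = html.unescape(re.sub(r"<[^>]+>", "", content)).strip()
    return {"author": author, "text": text, "images": images}
```

A regular expression is enough for a demo; for messier markup an HTML parser would be the safer choice.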

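The interesting part is how the pagination of Zhihu answers is put together. The calculation below is offered as a reference and may not be exactly right!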

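    # Compute the number of pages
    def get_page(self,totals):
        # Round (totals - 4) / 5 up to a whole number of pages
        pagenum=(int(totals)-4)/5
        #print(pagenum)
        if pagenum>int(pagenum):
            pagenum=int(pagenum)+1
        else:
            # Always store an int, including when totals is below 4
            pagenum=int(pagenum)

        self.pagenum=pagenum
        print(f'>> {self.pagenum} answer pages in total')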
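
The rounding in `get_page` is a ceiling division. A minimal standalone sketch of the same arithmetic follows; it keeps the constants 4 and 5 from the code above, which the author notes may not be exact. `page_count` is a hypothetical helper name.

```python
import math

def page_count(totals, first_page=4, per_page=5):
    """Number of follow-up pages after the initial page."""
    # Never go below zero when the question has very few answers.
    remaining = max(int(totals) - first_page, 0)
    return math.ceil(remaining / per_page)

print(page_count(9))    # 5 remaining answers -> 1
print(page_count(10))   # 6 remaining answers -> 2
```

Clamping at zero keeps the result a non-negative integer even when `totals` is smaller than 4.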

Result of a run: