Background
I came across a question on Zhihu asking for recommendations of good-looking anime-style (二次元) avatars.
Clicking through the answers one by one felt tedious, so I wanted to grab every image under that question into a folder and pick through them at my leisure.
Analysis
- 1 Find the API endpoint that serves the answers page by page.
- 2 The JSON has a field named content, which is each answer's HTML; parse the avatar links out of its img tags (a minimal check of this step follows the list).
- 3 Zhihu has anti-scraping checks, so remember to send request headers.
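Step 2 can be tried in isolation before wiring everything together. Below is a minimal sketch using a made-up content snippet (the real value comes from the API response) to show how lxml pulls the image links out of an answer's HTML:

from lxml import html

# Made-up example of what one answer's "content" field can look like.
snippet = (
    '<p>some text</p>'
    '<img src="https://pic1.zhimg.com/v2-abc123.jpg">'
    '<img src="data:image/svg+xml;utf8,...">'
)

doc = html.fromstring(snippet)
srcs = doc.xpath("//img/@src")
# Keep only real http(s) links; answers also embed data: URI placeholders.
print([s for s in srcs if s.startswith("http")])
# -> ['https://pic1.zhimg.com/v2-abc123.jpg']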
Main code
import json

import requests
from lxml import html

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
}


def download_picture(url):
    # Name the file after the last path segment of the image URL.
    pic_name = url.split("/")[-1]
    content = requests.get(url, headers=headers).content
    with open("/Users/furuiyang/Desktop/zhihu/{}".format(pic_name), "wb") as f:
        f.write(content)


def crawl(offset):
    # Paged answers endpoint for question 363414427, 20 answers per request.
    url = 'https://www.zhihu.com/api/v4/questions/363414427/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics' \
          '&offset={}&limit=20&sort_by=updated'.format(offset)
    # verify=False sidesteps local certificate problems but triggers an
    # InsecureRequestWarning from urllib3.
    r = requests.get(url, verify=False, headers=headers)
    ret = json.loads(r.content.decode("utf-8"))
    for answer in ret.get("data", []):
        # "content" holds the answer body as an HTML fragment.
        doc = html.fromstring(answer.get("content"))
        imgs = doc.xpath("//img/@src")
        # Keep only real links; answers also embed data: URI placeholders.
        imgs = [img for img in imgs if img.startswith("http")]
        for img in imgs:
            download_picture(img)


def main():
    # Adjust the page count to the number of answers under the question.
    for i in range(0, 4):
        crawl(i * 20)


if __name__ == "__main__":
    main()
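Two small hardenings may be worth adding; the sketch below is a suggestion layered on top of the script, not part of it. Image URLs can carry query strings (a hypothetical .../x.jpg?source=1 would otherwise yield the file name "x.jpg?source=1"), and open() fails if the target folder does not exist yet:

import os
from urllib.parse import urlparse

SAVE_DIR = "/Users/furuiyang/Desktop/zhihu"  # the folder the script writes to

def safe_pic_name(url):
    # Drop the query string before taking the last path segment,
    # e.g. "https://pic1.zhimg.com/x.jpg?source=1" -> "x.jpg".
    return os.path.basename(urlparse(url).path)

os.makedirs(SAVE_DIR, exist_ok=True)  # create the folder on the first run

The hardcoded range(0, 4) in main() could likewise be replaced by reading the response's paging information (v4 responses generally carry a paging.is_end flag, though that is worth confirming against a live response) so the loop stops exactly when the answers run out.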
Result
The downloaded avatars all land in the zhihu folder on the desktop.
GitHub
https://github.com/furuiyang0715/examples101/blob/master/zhihu_pics/pic_get.py