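Scraping the answers to a single Zhihu question with Python

Background

I came across a question on Zhihu: a request for recommendations of good-looking anime-style avatars.

Going through the answers one by one felt tedious, so I wanted to download every image under the question into a folder and pick from them at my leisure.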

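Analysis

  • 1 Find the API endpoint that returns the answers
  • 2 The JSON has a field called content, which holds the HTML of each answer. Parse the image links out of the img tags in it
  • 3 Zhihu has anti-scraping measures, so remember to send request headers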
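
Step 2 can be tried offline before touching the network. The sketch below is only an illustration: it uses the standard library's html.parser rather than lxml so it runs without third-party packages, and the names ImgSrcCollector and extract_image_links, as well as the sample HTML, are made up for this example and are not part of the original script.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract_image_links(content_html):
    """Return the absolute http(s) image links found in an HTML fragment."""
    parser = ImgSrcCollector()
    parser.feed(content_html)
    parser.close()
    # Same filter as the script: keep only values that start with "http"
    return [src for src in parser.srcs if src.startswith("http")]


# Hypothetical stand-in for the "content" field of one answer
sample = (
    '<p>Some text</p>'
    '<img src="https://example.com/a.jpg">'
    '<img src="data:image/svg+xml;utf8,placeholder">'
    '<img src="https://example.com/b.png"/>'
)
print(extract_image_links(sample))
# ['https://example.com/a.jpg', 'https://example.com/b.png']
```

The startswith("http") filter drops any src value that is not an absolute link, such as the inline data: URI in the sample.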

Main code

import json
import requests
from lxml import html

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
}


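def download_picture(url):
    # Use the last path segment of the URL as the file name
    pic_name = url.split("/")[-1]
    content = requests.get(url, headers=headers).content
    # The target directory must already exist; change the path to suit your machine
    with open("/Users/furuiyang/Desktop/zhihu/{}".format(pic_name), "wb") as f:
        f.write(content)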


def crawl(offset):
    url = 'https://www.zhihu.com/api/v4/questions/363414427/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics' \
          '&offset={}&limit=20&sort_by=updated'.format(offset)
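    # Keep TLS certificate verification on (the requests default)
    r = requests.get(url, headers=headers)
    ret = json.loads(r.content.decode("utf-8"))
    # "data" holds the list of answers on this page
    answers = ret.get("data") or []
    for answer in answers:
        # "content" is the HTML body of one answer
        content = answer.get("content")
        if not content:
            # html.fromstring() raises an error on an empty document
            continue
        doc = html.fromstring(content)
        imgs = doc.xpath("//img/@src")
        # Keep only absolute links; skip anything that does not start with http
        imgs = [img for img in imgs if img.startswith("http")]
        for img in imgs:
            download_picture(img)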


def main():
    # Adjust the range to the number of answers under the question (20 per page)
    for i in range(0, 4):
        crawl(i*20)


if __name__ == "__main__":
    main()
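
main() hard-codes four pages, which covers up to 80 answers. As a rough sketch, the offsets could instead be derived from the answer count shown on the question page. The helper name page_offsets and its total_answers parameter are hypothetical and not part of the original script.

```python
def page_offsets(total_answers, limit=20):
    """Return the offset of each page needed to cover total_answers answers."""
    # range() steps through 0, limit, 2 * limit, ... and stops before total_answers
    return list(range(0, total_answers, limit))


print(page_offsets(75))
# [0, 20, 40, 60]
```

With this helper, the loop in main() would become `for offset in page_offsets(75): crawl(offset)`, where 75 is a made-up answer count.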

Result

[Image: screenshot of the result]

GitHub

https://github.com/furuiyang0715/examples101/blob/master/zhihu_pics/pic_get.py
