Scraping a Single Zhihu Question with Python

Background

I came across a question on Zhihu asking people to recommend nice anime-style avatars.

Flipping through the answers one by one felt tedious, so I wanted to scrape all the images under this question into a local folder and browse them at my leisure.

Analysis

  • 1 Find the answers API endpoint (it shows up in the browser's network panel as the question page loads more answers).
  • 2 In the returned JSON, each answer has a field called content, which holds that answer's HTML; parse the avatar links out of its img tags (see the small sketch right after this list).
  • 3 Zhihu has anti-scraping measures, so remember to send request headers (at least a browser-like User-Agent).
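As a quick illustration of step 2, here is a minimal sketch of pulling image links out of one answer's content field with lxml. The HTML snippet is made up for the example; real answers also embed data-URI placeholders, which is why the filter keeps only http(s) links.

from lxml import html

# Hypothetical stand-in for the "content" field of a single answer.
sample_content = (
    '<p>sample text</p>'
    '<img src="https://pic1.zhimg.com/v2-abc123.jpg">'
    '<img src="data:image/png;base64,xxxx">'
)

doc = html.fromstring(sample_content)
links = [src for src in doc.xpath("//img/@src") if src.startswith("http")]
print(links)  # ['https://pic1.zhimg.com/v2-abc123.jpg']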

Main code

import json
import os

import requests
from lxml import html

# Zhihu rejects requests without a browser-like User-Agent, so always send one.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
}


def download_picture(url):
    # Use the last path segment of the image URL as the local file name.
    pic_name = url.split("/")[-1]
    content = requests.get(url, headers=headers).content
    # Save into a local folder (adjust the path for your own machine).
    save_dir = "/Users/furuiyang/Desktop/zhihu"
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, pic_name), "wb") as f:
        f.write(content)


def crawl(offset):
    # Answers API of the question. The long "include" query parameter asks the API
    # to return extra fields for each answer, among them "content" (the answer's HTML).
    url = 'https://www.zhihu.com/api/v4/questions/363414427/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics' \
          '&offset={}&limit=20&sort_by=updated'.format(offset)
    # verify=False skips TLS certificate verification; requests will print a warning.
    r = requests.get(url, verify=False, headers=headers)
    content = r.content.decode("utf-8")
    ret = json.loads(content)
    answers = ret.get("data")
    for answer in answers:
        content = answer.get("content")  # HTML fragment of one answer
        doc = html.fromstring(content)
        imgs = doc.xpath("//img/@src")
        # Keep only real http(s) links; data-URI placeholders are dropped.
        imgs = [img for img in imgs if img.startswith("http")]
        if not imgs:
            continue
        for img in imgs:
            download_picture(img)


def main():
    # Adjust the range to the number of answers on the question
    # (each request returns up to 20 answers); see the sketch after
    # the code for a version that stops automatically.
    for i in range(0, 4):
        crawl(i * 20)


if __name__ == "__main__":
    main()
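
Instead of hard-coding four pages in main(), the loop could stop when the API reports that there are no more answers. The sketch below is not part of the original script: it assumes crawl() is changed to return the parsed JSON (return ret at the end), and that the response carries a paging object with an is_end flag, which Zhihu's v4 endpoints commonly expose; both are assumptions.

def crawl_all():
    # Paging sketch: stop when the API says the answer list is exhausted.
    # Assumes crawl() ends with "return ret" and that ret contains
    # ret["paging"]["is_end"]; if the flag is missing we stop after one page.
    offset = 0
    while True:
        ret = crawl(offset)
        if ret.get("paging", {}).get("is_end", True):
            break
        offset += 20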

Result

(Screenshot of the downloaded images omitted.)

GitHub

https://github.com/furuiyang0715/examples101/blob/master/zhihu_pics/pic_get.py
