Web scraping in practice: a Q&A site (Zhihu)

云温叶叶叶

Last modified 2022-05-03 09:35:30


Tags: web scraping, python, Zhihu

First published 2021-08-29 00:25:27

Article link: https://blog.csdn.net/weixin_50238287/article/details/119974388


A very detailed walkthrough of scraping answer data from a Q&A site (Zhihu)

Site analysis
- Finding the API endpoint
Code
Closing remarks

Site analysis

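The link I scrape here is:
https://www.zhihu.com/question/353341563/answer/903740226
With this URL, only a few answers are visible unless you click "view all answers". If you delete /answer/903740226 from the URL, the page shows all the answers.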
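
The URL change described above can also be done in code. This is a minimal sketch, assuming the answer URL always has the form .../question/<id>/answer/<id>; the helper names `question_url` and `question_id` are hypothetical and not part of the original script.

```python
import re

def question_url(answer_url):
    """Drop a trailing /answer/<id> segment, leaving the question URL."""
    return re.sub(r'/answer/\d+/?$', '', answer_url)

def question_id(url):
    """Return the numeric id that follows /question/, or None if absent."""
    match = re.search(r'/question/(\d+)', url)
    return match.group(1) if match else None

link = 'https://www.zhihu.com/question/353341563/answer/903740226'
print(question_url(link))  # https://www.zhihu.com/question/353341563
print(question_id(link))   # 353341563
```

The question id is the same number that appears after questions/ in the API URL used later.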

Finding the API endpoint

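Right-click the page and choose "View page source". The source does not contain the data we want, which means the data is rendered dynamically. So we need to find the data API, that is, the URL that actually delivers the data we are after.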

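Press F12 to open the developer tools.
There we find that the data we want is in the request named answers?include… (the box labeled 3 in the screenshot is drawn in the wrong place, sorry).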

Click on that request and you can see the part we want. The content is in JSON format.
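
To make the JSON layout concrete, here is a rough sketch of the parts of the response that the script in this post reads. The key names are the ones used by the code below; every value is an invented placeholder, and a real response contains many more fields.

```python
import json

# Illustrative payload only: keys match what the scraper reads,
# values (including the "next" URL) are made-up placeholders.
sample_text = '''
{
  "paging": {"is_start": true, "is_end": false,
             "next": "https://example.invalid/next-page"},
  "data": [
    {
      "author": {"name": "example user", "url_token": "example-user"},
      "content": "answer body",
      "comment_count": 3,
      "voteup_count": 42,
      "updated_time": 1630000000
    }
  ]
}
'''

payload = json.loads(sample_text)
first = payload['data'][0]
print(payload['paging']['is_end'])   # False
print(first['author']['name'])       # example user
print(first['voteup_count'])         # 42
```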

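Each API URL returns only 5 answers, so we need to analyze a bit further.
Based on the paging fields in the response, my crawling strategy is:
Find the URL of the very first group of answers (identified by is_start: True) and use it as the first URL to crawl;
then read its is_end value:
if it is True, we have reached the last group and the crawl is finished;
if it is False, there are more answers and the crawl continues.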
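
The strategy above can be sketched without touching the network. In this sketch, `crawl_all` and `FAKE_PAGES` are hypothetical names, and the `fetch` argument stands in for whatever function downloads a URL and returns the decoded JSON; the only assumption taken from the real API is that each page carries `paging.is_end`, `paging.next` and `data`.

```python
def crawl_all(first_url, fetch):
    """Follow paging.next until paging.is_end is true and collect all answers.

    `fetch` is any callable that takes a URL and returns the decoded JSON dict.
    """
    answers = []
    url = first_url
    while True:
        page = fetch(url)
        answers.extend(page['data'])
        if page['paging']['is_end']:
            break
        url = page['paging']['next']
    return answers

# Stand-in for the network: two fake pages keyed by made-up URLs.
FAKE_PAGES = {
    'page-1': {'paging': {'is_end': False, 'next': 'page-2'}, 'data': ['a1', 'a2']},
    'page-2': {'paging': {'is_end': True, 'next': 'page-2'}, 'data': ['a3']},
}

all_answers = crawl_all('page-1', FAKE_PAGES.__getitem__)
print(all_answers)  # ['a1', 'a2', 'a3']
```

Passing the fetch function in as a parameter keeps the paging logic testable on its own; the real script can supply a function built on requests.get.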

Code

Fetching and parsing the data


import requests
from icecream import ic
# icecream prints data in a more readable way, which is especially handy for JSON
# Install it from a terminal with: pip install icecream
import time
import csv
"""
    The question id (the number after questions/ in base_url) is all you need to scrape.
    When I ran this on 2021-08-28 no cookie was required, so I did not add one.
"""
# base_url is the URL of the very first group of answers
base_url = 'https://www.zhihu.com/api/v4/questions/25038841/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=0&limit=5&sort_by=default&platform=desktop'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    }
# Counts how many groups of answers have been scraped
num = 0
def parse_page(url):
    global num
    # Loop instead of recursing, so a question with many pages
    # cannot run into Python's recursion limit
    while True:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        data = response.json()  # decode the JSON body once and reuse it
        # ic(data)
        paging = data['paging']
        next_url = paging['next']
        is_end = paging['is_end']
        answers = data['data']
        result = []
        for answer in answers:
            name = answer['author']['name']
            url_token = answer['author']['url_token']
            profile_url = 'https://www.zhihu.com/people/' + url_token
            content = answer['content']
            comment_count = answer['comment_count']
            voteup_count = answer['voteup_count']
            updated_time = answer['updated_time']

            # Keys are the CSV column names: name, profile page, content,
            # comment count, upvote count, edit time
            result.append({
                '名称': name,
                '个人中心': profile_url,
                '内容': content,
                '评论数': comment_count,
                '点赞数': voteup_count,
                '编辑时间': time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(updated_time))
                # The raw value is a Unix timestamp; the time module converts it to a readable string
            })
        # Call the custom function that saves the data to the CSV file
        write_csv(result)

        # If this was the last group of answers, stop
        if is_end:
            print('All answers scraped!')
            break
        # Otherwise count the group just finished and move on to the next one
        num += 1
        ic(result)
        print(f'API page {num} scraped'.center(30, '='))
        # Pause 1 second after each group to reduce the risk of an IP ban
        time.sleep(1)
        url = next_url
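
The timestamp conversion used for the edit time can be tried in isolation. This is a small sketch; the helper name `format_timestamp` and its `use_utc` switch are hypothetical additions for illustration. The scraper itself uses local time, so its output depends on the time zone of the machine running it.

```python
import time

def format_timestamp(seconds, use_utc=False):
    """Format a Unix timestamp (in seconds) as 'YYYY-MM-DD HH:MM:SS'."""
    parts = time.gmtime(seconds) if use_utc else time.localtime(seconds)
    return time.strftime('%Y-%m-%d %H:%M:%S', parts)

print(format_timestamp(0, use_utc=True))  # 1970-01-01 00:00:00
print(format_timestamp(1630000000))       # result depends on the local time zone
```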

Storage

import os

file_name = '知乎回答.csv'
# Use utf-8-sig as the encoding, otherwise the written text shows up garbled
def write_csv(result):
    # Write the header only when the file is new or empty, so it is never repeated
    need_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, 'a', encoding='utf-8-sig', newline="") as fp:
        csv_writer = csv.DictWriter(fp, fieldnames=['名称','个人中心','内容','评论数','点赞数','编辑时间'])
        if need_header:
            csv_writer.writeheader()
        csv_writer.writerows(result)
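
As a quick check that appending in several batches still yields a single header row, the file can be read back with csv.DictReader. This is a sketch under stated assumptions: `append_rows` and `read_rows` are hypothetical helpers, the two columns are throwaway examples, and the file is written to a temporary directory rather than to the real output file.

```python
import csv
import os
import tempfile

def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only if the file is empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8-sig', newline='') as fp:
        writer = csv.DictWriter(fp, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)

def read_rows(path):
    """Read the CSV back as a list of dicts keyed by the header row."""
    with open(path, 'r', encoding='utf-8-sig', newline='') as fp:
        return list(csv.DictReader(fp))

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
fields = ['name', 'votes']
append_rows(demo_path, fields, [{'name': 'a', 'votes': 1}])
append_rows(demo_path, fields, [{'name': 'b', 'votes': 2}])
rows = read_rows(demo_path)
print(len(rows))                   # 2
print([r['name'] for r in rows])   # ['a', 'b']
```

Values come back as strings, so numeric columns need converting if you want to sort or sum them.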

Main function

def main():
    parse_page(base_url)

if __name__ == '__main__':
    main()

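Closing remarks

This is my first blog post. If anything is poorly written or wrong, I hope readers will point it out.
My original intention was to treat blog posts as a way of taking notes. I also hope this helps some of you, so that we can all improve together!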
