Scraping the Zhihu Topic Square


轻风远扬


Category: python. Tags: crawler, Zhihu, topic square, title

Article link: https://blog.csdn.net/weixin_44535544/article/details/101438199


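Scraping Zhihu's Topic Square Without Logging In

This post shares a way to scrape every topic in Zhihu's topic square, together with each topic's link, without logging in.
The code comes first, followed by a walkthrough of the analysis behind it.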

from faker import Faker
import requests
import json
import time
from lxml import etree
import threading
from threading import Thread
from pymongo import MongoClient


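# Create the client once and reuse it; MongoClient is thread-safe and pools connections.
client = MongoClient('localhost')
collection = client.Spider.Zhihu


def insert_mongo(data):
    """Insert one topic document unless an identical one is already stored."""
    if collection.find_one(data):
        print('already exists')
    else:
        collection.insert_one(data)
        print('inserted')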


def seplist(start_urls, cut_number):
    """Distribute the items round-robin into cut_number sublists."""
    return [start_urls[i::cut_number] for i in range(cut_number)]


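def all_topic_id(url):
    """Return the data-id of every topic category listed on the start page."""
    headers = {
        # A random User-Agent generated by Faker
        'user-agent': Faker().user_agent()
    }
    # A timeout keeps a stalled connection from hanging the script
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    # Each category is an <li class="zm-topic-cat-item"> carrying its id in data-id
    return html.xpath('//li[@class="zm-topic-cat-item"]/@data-id')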


def all_topic_urls(topic_ids):
    """Page through every category in topic_ids and store each topic's name and link."""
    middle_url = 'https://www.zhihu.com/node/TopicsPlazzaListV2'
    for item in topic_ids:
        offset = 0
        while True:
            headers = {
                'user-agent': Faker().user_agent()
            }
            data = {'method': 'next',
                    "params": json.dumps({"topic_id": int(item), "offset": offset, "hash_id": ""})}
            res = requests.post(middle_url, data=data, headers=headers, timeout=10)
            # 'msg' holds a list of HTML fragments, one per topic; an empty list means no more pages
            fragments = json.loads(res.text)['msg']
            if not fragments:
                break
            for one_topic in fragments:
                htm = etree.HTML(one_topic)
                one_topic_url = htm.xpath('//a[1]/@href')[0]
                topic_name = htm.xpath('//strong/text()')[0]
                link = 'https://www.zhihu.com' + one_topic_url
                topic_dict = {'topic_name': topic_name,
                              'topic_link': link}
                # rlock is the module-level lock created in the __main__ block
                with rlock:
                    insert_mongo(data=topic_dict)
            offset += 20


if __name__ == '__main__':
    s = time.time()
    rlock = threading.RLock()
    start_url = 'https://www.zhihu.com/topics'
    id_lists = all_topic_id(start_url)
    # One thread per two categories, and always at least one thread
    number = max(1, len(id_lists) // 2)
    cut_lists = seplist(id_lists, number)
    threadlist = []
    for i in range(number):
        t = Thread(target=all_topic_urls, args=(cut_lists[i],))
        t.start()
        threadlist.append(t)
    for thd in threadlist:
        thd.join()
    print('Time taken to scrape all topics:', time.time() - s)

The script stores its results in MongoDB. To save them some other way, only the line insert_mongo(data=topic_dict) needs to change.
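As one illustration of swapping the storage step, here is a minimal sketch that writes to SQLite from Python's standard library instead of MongoDB. The names open_store and insert_sqlite and the table layout are assumptions made for this example; they are not part of the original script. Duplicates are rejected by making topic_link the primary key.

```python
import sqlite3


def open_store(path=':memory:'):
    """Open (or create) a database with one table keyed on the topic link."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS topics ('
        'topic_link TEXT PRIMARY KEY, topic_name TEXT NOT NULL)'
    )
    return conn


def insert_sqlite(conn, data):
    """Store one topic; return True if it was new, False if the link already existed."""
    cur = conn.execute(
        'INSERT OR IGNORE INTO topics (topic_link, topic_name) VALUES (?, ?)',
        (data['topic_link'], data['topic_name']),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_store()
# Hypothetical record, in the same shape as topic_dict in the script
topic = {'topic_name': 'Example', 'topic_link': 'https://www.zhihu.com/topic/1'}
first = insert_sqlite(conn, topic)   # new row
second = insert_sqlite(conn, topic)  # same link again, ignored
print(first, second)
```

If this were dropped into the threaded script, note that a sqlite3 connection can by default only be used from the thread that created it, so each thread would need its own connection, or the connection would have to be opened with check_same_thread=False and used only while holding the existing lock.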

First, open Zhihu's topic square page. It can be browsed in full without logging in.


Step 1: Get the links of the top-level topic categories

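Clicking a category changes the page content, but the URL stays the same, so the browser's network traffic has to be inspected to find where the data comes from.
Watching the requests made when the category switches reveals the one that loads the new data.

That request is a POST, and it carries a number. Searching the page source for that number confirms that it is the id of the topic category.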

With that, an ordinary requests call to the start page is enough to extract the numeric id of each top-level category.
The result is:
ids = ['1761', '3324', '833', '99', '69', '113', '304', '13908', '570', '2955', '988', '388', '285', '686', '444', '1537', '19800', '253', '4196', '8437', '2253', '4217', '2143', '1538', '1740', '237', '112', '445', '1027', '215', '68', '75', '395']
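To make the threading setup in the script concrete, the sketch below works out how these 33 ids would be divided among threads. It mirrors the script's own arithmetic (half as many threads as categories, filled round-robin), using the id list exactly as printed above; the variable name buckets is introduced only for this example.

```python
ids = ['1761', '3324', '833', '99', '69', '113', '304', '13908', '570', '2955',
       '988', '388', '285', '686', '444', '1537', '19800', '253', '4196', '8437',
       '2253', '4217', '2143', '1538', '1740', '237', '112', '445', '1027', '215',
       '68', '75', '395']

# Same thread count the main block computes: half the number of categories
number = len(ids) // 2

# Round-robin split, equivalent to what seplist returns
buckets = [ids[i::number] for i in range(number)]

print(number)                         # 16
print(buckets[0])                     # ['1761', '19800', '395']
print([len(b) for b in buckets[1:]])  # fifteen buckets of 2 ids each
```

With an odd number of categories the first thread picks up the leftover id, so the work is slightly uneven, though how long each thread runs depends far more on how many topics each category contains.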

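Step 2: Scrape every topic under each category

Within a category, more topics appear only as you keep scrolling down or clicking "more".
Analyzing the dynamic loading shows that only offset changes between requests. It works like the page parameter in ordinary pagination.
So looping with offset += 20 until no more data comes back retrieves every subtopic.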
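The offset loop can be tried without touching the network. The sketch below builds the same form fields the script sends and pages until an empty list comes back. build_payload, iter_items and fake_post are names invented for this example, and fake_post is a stand-in that serves 45 made-up items 20 at a time; it does not reproduce Zhihu's real responses.

```python
import json


def build_payload(topic_id, offset):
    """Form fields for one POST, mirroring the request body used in the script."""
    return {
        'method': 'next',
        'params': json.dumps({'topic_id': int(topic_id), 'offset': offset, 'hash_id': ''}),
    }


def iter_items(post, topic_id, step=20):
    """Yield items page by page until the server returns an empty list.

    `post` is any callable that takes the payload dict and returns the
    list found under 'msg' in the response.
    """
    offset = 0
    while True:
        items = post(build_payload(topic_id, offset))
        if not items:
            return
        yield from items
        offset += step


# Stand-in for the network call: 45 fake items, 20 per page
FAKE_ITEMS = [f'item-{n}' for n in range(45)]
seen_offsets = []


def fake_post(payload):
    offset = json.loads(payload['params'])['offset']
    seen_offsets.append(offset)
    return FAKE_ITEMS[offset:offset + 20]


collected = list(iter_items(fake_post, '1761'))
print(len(collected), seen_offsets)  # 45 [0, 20, 40, 60]
```

The last request, at offset 60, is the one that returns nothing and ends the loop, so every category costs one extra request beyond its final page.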

The responses are fairly regular HTML. I use XPath here to pull out each topic's link and name directly; other extraction methods work just as well.
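As an example of one such alternative, this sketch does the same extraction with html.parser from the standard library rather than lxml. It takes the first a href and the first strong text in a fragment, which is what the script's two XPath expressions select. The class TopicItemParser, the function parse_topic and the sample fragment (including its topic id) are all hypothetical; the fragment is shaped only to match those XPath expressions and is not copied from a real response.

```python
from html.parser import HTMLParser


class TopicItemParser(HTMLParser):
    """Collect the first <a href> and the first <strong> text in a fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.name = None
        self._in_strong = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self.href is None:
            self.href = dict(attrs).get('href')
        elif tag == 'strong' and self.name is None:
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == 'strong':
            self._in_strong = False

    def handle_data(self, data):
        if self._in_strong and self.name is None:
            self.name = data.strip()


def parse_topic(fragment):
    """Return the same dict shape the script stores, or raise if a field is missing."""
    parser = TopicItemParser()
    parser.feed(fragment)
    parser.close()
    if parser.href is None or parser.name is None:
        raise ValueError('fragment has no link or no name')
    return {'topic_name': parser.name,
            'topic_link': 'https://www.zhihu.com' + parser.href}


# Hypothetical fragment for illustration only
sample = ('<div class="item"><a href="/topic/19550517">'
          '<img src="x.png"><strong>Internet</strong></a></div>')
print(parse_topic(sample))
```

Raising on a missing field is a choice made for this sketch; the script itself would fail with an IndexError in the same situation.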

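Step 3: Save the data

At this point the extraction is essentially done. What remains is saving the data, which depends on your own needs, so I won't go into it.

Zhihu's anti-scraping measures are not especially strict. The data set here is a little over 10,000 records, and no CAPTCHA was triggered even without proxy IPs.
Next time, scraping the full content and answers of each topic will involve far more data and will certainly trigger anti-scraping measures; I'll cover how to deal with that then.

If anything here could be better, corrections are welcome.
That is how to fetch every topic in the Zhihu topic square. Leave a comment if you have questions.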
