Crawling Zhihu Through Its Data API

Latest recommended article published on 2024-05-25 09:37:50

BRUIN.



Category column: Python Crawlers
Article tags: python, multithreading, json

Article link: https://blog.csdn.net/I_I___LO_VE___YA/article/details/104465730


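I previously built projects that crawled dynamic sites with selenium. This time the data comes straight from Zhihu's data API, which both avoids logging in and speeds up the crawl: two birds with one stone.
The work is split across two threads. One calls Zhihu's data API; because there is a lot of data, it saves each response to a JSON file rather than passing it through a queue. The other thread parses the saved data and extracts the hyperlinks embedded in each result's content, which mostly point to Python tutorials or books.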
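Because both threads read and write the same file, they have to be synchronized. Picture two people with upset stomachs and a single toilet stall: whoever gets in first locks the door and the other has to wait; when the first is done, the second takes the stall. A little later the first person's stomach acts up again, so now they wait for the stall to free up, and the cycle repeats.
Access to the stall, that is, to the file, is coordinated by signalling between the two threads, which is what Condition in the threading module is for.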
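The turn-taking described above can be sketched in isolation. This is a minimal illustration, not the article's crawler: the `slot` dictionary is a hypothetical stand-in for the shared JSON file, and the `full` flag is an added assumption so that each thread waits on an explicit predicate via `wait_for` instead of relying on the order in which `notify()` and `wait()` happen to run.

```python
import threading

condition = threading.Condition()
# Hypothetical shared state standing in for the JSON file on disk
slot = {"value": None, "full": False}
consumed = []


def producer(items):
    for item in items:
        with condition:
            # Wait until the consumer has emptied the slot
            condition.wait_for(lambda: not slot["full"])
            slot["value"], slot["full"] = item, True
            condition.notify_all()


def consumer(count):
    for _ in range(count):
        with condition:
            # Wait until the producer has filled the slot
            condition.wait_for(lambda: slot["full"])
            consumed.append(slot["value"])
            slot["full"] = False
            condition.notify_all()


items = ["page-1", "page-2", "page-3"]
threads = [
    threading.Thread(target=producer, args=(items,)),
    threading.Thread(target=consumer, args=(len(items),)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(consumed)
```

Because each side checks a flag before proceeding, a notification that arrives before the other thread starts waiting is not lost, so the start order of the two threads does not matter in this sketch.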
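Looking at the Condition source, it defines the two magic methods __enter__ and __exit__; any class that defines both can be used with the with context manager. See the source for the details of how it is used.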
[Screenshots: excerpts from the Condition source code]
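To make the context-manager point concrete, here is a small sketch. `Stall` is a hypothetical toy class invented for this illustration (it is not part of the threading module); it shows that defining `__enter__` and `__exit__` is what lets an object be used in a `with` statement, which is the same protocol `Condition` relies on.

```python
import threading


class Stall:
    """Hypothetical toy class: one 'stall' guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []

    def __enter__(self):
        # Called when the with block is entered
        self._lock.acquire()
        self.log.append("locked")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block is left, even if an exception was raised
        self.log.append("unlocked")
        self._lock.release()
        return False  # do not suppress exceptions


stall = Stall()
with stall:
    stall.log.append("in use")
print(stall.log)

# Condition follows the same protocol: the with block acquires and releases its lock
cond = threading.Condition()
with cond:
    cond.notify()  # notify() is only allowed while the lock is held
```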

So here is the code, without further ado.

import requests
import json
import time
import re
import threading
from threading import Condition
from bs4 import BeautifulSoup
import csv


class ZhiHuSpider(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'}
        self.condition = Condition()

    def get_json(self):
        q = 'python'
        limit = 20
        offset = 20
        lc_idx = 26

        while True:
            # Inspecting the requests reveals how this API pages: offset and lc_idx both advance by limit
            url = 'https://www.zhihu.com/api/v4/search_v3?' \
                  't=general' \
                  '&q={}' \
                  '&correction=1' \
                  '&offset={}' \
                  '&limit={}' \
                  '&lc_idx={}' \
                  '&show_all_topics=0' \
                  '&search_hash_id=28c1b0ac521c72c41398295fdcd9b6d8' \
                  '&vertical_info=0%2C1%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1'.format(q, offset, limit, lc_idx)

            response = requests.get(url, headers=self.headers)
            response.encoding = 'utf-8'
            result = json.loads(response.text)

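            # Write the data file under the condition lock so reads and writes never interleave
            with self.condition:
                # Wake the parser; it can only proceed once wait() below releases the lock
                self.condition.notify()
                # Throttle the crawl
                print("Downloading...")
                time.sleep(10)

                with open('7-zhihu.json', 'w') as f:
                    json.dump(result, f)
                # Release the lock and block until the parser has read the file
                self.condition.wait()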

            offset += limit
            lc_idx += limit

    def parse(self):
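        # Parse the data that was saved locally
        num = 0
        while True:
            # Read the saved JSON file once the crawler signals that it is ready
            with self.condition:
                self.condition.wait()
                print("Reading...")
                with open('7-zhihu.json', 'r') as f:
                    j = json.load(f)
                # Tell the crawler the file has been read and may be overwritten
                self.condition.notify()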

            # Extract the data
            result = []
            datas = j['data']
            for data in datas:
                # Get the title, stripping the <em> highlight tags
                pattern = re.compile('<em>|</em>')
                title = pattern.sub('', data['highlight']['title'])
                num += 1

                # Get the hyperlinks
                content = data['object']['content']
                html = BeautifulSoup(content, 'lxml')
                # print(html.prettify())

                hrefs = html.find_all('a')
                for href in hrefs:
                    if href:
                        result.append([title, href.string, href.get('href')])

            # Save the data (uncomment the next line to append to the CSV)
            # self.save(result)
            print('Saved: %d items\n' % num)

    def save(self, data):
        with open(r'C:\Users\xiongrenyi\Desktop\python_links.csv', 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerows(data)

    def run(self):
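        # Create two threads: one crawls the API, the other parses the saved JSON.
        # Starting the parser first matters: notify() has no effect unless a thread is already waiting.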
        crawl_thread = threading.Thread(target=self.get_json)
        parse_thread = threading.Thread(target=self.parse)
        parse_thread.start()
        crawl_thread.start()


if __name__ == '__main__':
    zhihu_spider = ZhiHuSpider()
    zhihu_spider.run()
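The request URL in get_json is assembled by hand with str.format. As an alternative sketch, the same query string can be built with `urllib.parse.urlencode`, which also handles the percent-encoding of `vertical_info`. The helper names `build_search_url` and `page_urls` are hypothetical; the parameter values are copied from the code above, and whether the API still accepts them (in particular the fixed `search_hash_id`) is an assumption that has not been verified.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.zhihu.com/api/v4/search_v3'


def build_search_url(q, offset, limit, lc_idx):
    """Build one search URL; parameter values mirror the article's hand-written URL."""
    params = {
        't': 'general',
        'q': q,
        'correction': 1,
        'offset': offset,
        'limit': limit,
        'lc_idx': lc_idx,
        'show_all_topics': 0,
        'search_hash_id': '28c1b0ac521c72c41398295fdcd9b6d8',
        # urlencode turns each comma into %2C, matching the original URL
        'vertical_info': '0,1,0,0,0,0,0,0,0,1',
    }
    return BASE_URL + '?' + urlencode(params)


def page_urls(q, pages, limit=20, offset=20, lc_idx=26):
    """Yield URLs for consecutive pages; offset and lc_idx both advance by limit."""
    for _ in range(pages):
        yield build_search_url(q, offset, limit, lc_idx)
        offset += limit
        lc_idx += limit


urls = list(page_urls('python', pages=3))
for url in urls:
    print(url)
```

With requests, passing the same dictionary as `params=` to `requests.get` would have a similar effect; the standalone version here just makes the paging pattern easy to inspect without sending any request.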

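Running the script produces the output below; each result item can contain several hrefs:
[Screenshot: console output]
The crawl results are as follows: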
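For illustration only, the sketch below shows how a single result item can turn into several `[title, text, href]` rows. The sample `item` is made-up data, not the article's actual output; it only mimics the `highlight.title` and `object.content` fields that the parse method reads. To keep the example free of third-party dependencies it swaps BeautifulSoup for the standard library's `html.parser`, and `LinkCollector` is a hypothetical helper written for this sketch.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect [text, href] pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Anchors without an href leave _href as None and are skipped
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append([''.join(self._text), self._href])
            self._href = None


# Made-up sample item with the same shape the parse method expects
item = {
    'highlight': {'title': 'Learning <em>Python</em>'},
    'object': {
        'content': '<p>See <a href="https://example.com/a">Tutorial A</a> and '
                   '<a href="https://example.com/b">Book B</a>.</p>'
    },
}

title = re.sub('<em>|</em>', '', item['highlight']['title'])
collector = LinkCollector()
collector.feed(item['object']['content'])
rows = [[title, text, href] for text, href in collector.links]
for row in rows:
    print(row)
```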
