Python 网络爬虫实战:爬取知乎一个话题下的全部问题

此前分享过一篇知乎的爬虫《Python网络爬虫实战:爬取知乎话题下 18934 条回答数据》,这篇爬虫主要是用来爬取知乎中一个问题下的全部回答数据。

而在后续跟读者朋友们的交流中发现,大家关心的其实不仅局限在爬取一个问题的回答,而是爬取某一个话题下,全部问题的全部回答。

所以这篇爬虫我会教大家,如何使用爬虫,爬取知乎中某一个话题下的全部问题。

 


 0x00 制定爬取策略

我们任取一个话题(如:https://www.zhihu.com/topic/20192351/),进入话题的主页,如截图所示。

事实上,知乎话题下的所有 “问题” 没有想象中那么容易爬取。

为什么这么说呢?其实不是说知乎的反爬机制有多难搞,最大的困难在于它的数据,并不完整

我们发现, 网站上并没有任何关于话题下所有问题的列表数据,所有的数据都藏在了 “讨论”,“精华”,“等待回答” 三个页签。

在等待回答页签下,是一些新发布的或者回答数较少的问题列表;在讨论和精华页签下,是一些精彩回答或者专栏文章的列表。

那么,那些 热度高、关注多的问题列表 呢?抱歉并没有。

所以我的爬取策略是:

1. 等待回答页签的问题列表,直接爬取问题ID,进行存储。

2. 讨论和精华页签的回答数据,提取每条回答对应的问题 ID,然后去重存储。

 

0x01 抓取数据接口

基本的策略制定好了,那么接下来的事儿,就是分别抓取 “讨论”,“精华”,“等待回答”,三个页签下的数据接口。

在浏览器中按 F12 打开开发者工具,切换到 Network,过滤器选择 XHR。

向下滑动网页滚动条,让网页加载新的数据。于是在开发者工具中,我们就抓取到了数据的接口。

知乎并没有设置特别严格的反爬措施,所以不需要在反反爬方面做特殊处理,设置好参数,直接访问接口就可以了。

下面简单说一下数据接口,看着很长的一大串,其实只需要关注三个地方就可以了,其他不用管。

1. 第一处(图中的 20192351 )是话题的 ID,如果后续大家想爬其他话题的话,只需要更改这个参数即可。

2. 第二处(图中的 top_activity )表示是 “讨论”页签的数据接口,同理,“精华”页签是 essence ,“等待回答”是 top_question。

2. 第三处(图中的 20.00000 )是表示该请求的数据是第20条到第30条(从第20条开始的,每页10条数据),可以用它来控制页码,第一页就设置它为 0 (第0 - 9条),第二页就是 10(第10 - 19条)......以此类推

具体分析网站和抓包过程,这里不再赘述了,感兴趣的话,可以参考我之前的爬虫文章。

1. Python网络爬虫实战:爬取知乎话题下 18934 条回答数据

2. Python爬虫基础:使用 Python 爬虫时经常遇到的问题合集

3. Python 爬虫实战系列专栏

 

0x02 爬虫代码准备

根据上述的分析,我们可以简单编写一下代码,进行爬取。

import requests
import json
 
def fetchHotel(url):
    # 发起网络请求,获取数据
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }
 
    # 发起网络请求
    r = requests.get(url,headers=headers)
    r.encoding = 'Unicode'
    return r.text

def parseJson(text):
    json_data = json.loads(text)
    lst = json_data['data']
    nextUrl = json_data['paging']['next']
    
    if not lst:
        return;

    for item in lst:
        type = item['target']['type']
        
        if type == 'answer':
            # 回答
            question = item['target']['question']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)

        elif type == 'article':
            #专栏
            zhuanlan = item['target']
            id = zhuanlan['id']
            title = zhuanlan['title']
            url = zhuanlan['url']
            vote = zhuanlan['voteup_count']
            cmts = zhuanlan['comment_count']
            auth = zhuanlan['author']['name']
            print("专栏:",id,title)

        elif type == 'question':
            # 问题
            question = item['target']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)

    return nextUrl

if __name__ == '__main__':
    topicID = '20192351'
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_activity?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&after_id=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)

下面是运行结果:

注:

  1. 爬虫核心部分都在上面了,后面的部分是在此基础上做一些优化,如用数据库,多线程等,不关心的话跳过也没关系。
  2. 上述代码为 “讨论” 页签的爬虫代码,其他两个页签的数据,只需要仿照这个写就可以了)
  3. 由于页签下混杂着回答,专栏文章,甚至还有广告,所以在解析函数中针对每种类型的数据做了相应的处理。
  4.  由于后续我使用数据库进行数据存储,所以如果需要存储在本地 csv 中的话,请参考我此前的爬虫文章,并注意,自行做好去重工作。

0x03 数据库准备

由于数据量较大,并且需要数据去重,所以为了省事儿,我决定使用 mysql 数据库来进行数据存储。

首先要安装 mysql 数据库,和 pymysql 库。

pip install pymysql

1. 创建数据库

在 MySQL 数据库中创建数据库 zhuhuTopic 和数据表 questions,下面是我创建的表的结构。

2. 连接数据库

import pymysql

db = pymysql.connect(host='xxx.xxx.xxx.xxx', port=3306, user='zhihuTopic', passwd='xxxxxx', db='zhihuTopic', charset='utf8')
cursor = db.cursor()

将自己的数据库的 IP、端口,用户名、密码等参数填进去,运行即可完成数据库的连接。

3. 增删改查

通过 pymysql 库操作数据库,实际上也是通过执行 SQL 语句。也就是说,自行构造 SQL 语句的字符串 sql,然后通过 cursor.excute(sql) 执行。

def queryDB():
    # 查询数据库
    try:
        sql = "select * from questions"
        cursor.execute(sql)
        data = cursor.fetchone()
        print(data)
    except Exception as e:
        print(e)

查询数据时,执行 cursor.fetchone() 每次取一条数据,执行 cursor.fetchall() 一次性取全部数据,执行 cursor.fetchmany(n) 每次取 n 条数据。

def insertDB(id, title, url):
    #插入数据
    try:
        sql = "insert into questions values (%s,'%s','%s')"%(id, title, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)

 更新数据,删除数据等操作的写法跟插入数据的写法一样,只要修改对应 sql 语句即可。

如果数据库操作报错了。调试的话,首先把 sql 字符串打印出来,复制到 MySQL 数据库中执行看是否有误。

大部分的情况下,都是 SQL 语句有误引起的。 

所以,我们可以将爬虫代码优化成如下:

import requests
import json
import pymysql

db = pymysql.connect(host='xxx.xx.xx.xxx', port=3306, user='zhihuTopic', passwd='xxxxxx', db='zhihuTopic', charset='utf8')
cursor = db.cursor()

def saveQuestionDB(id, title, url):
    #插入数据
    try:
        sql = "insert into questions values (%s,'%s','%s')"%(id, title, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)

def saveArticleDB(id, title, vote, cmts, auth, url):
    #插入数据
    try:
        sql = "insert into article values (%s,'%s',%s,%s,'%s', '%s')"%(id, title, vote, cmts, auth, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)

def fetchHotel(url):
    # 发起网络请求,获取数据
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }
 
    # 发起网络请求
    r = requests.get(url,headers=headers)
    r.encoding = 'Unicode'
    return r.text

def parseJson(text):
    json_data = json.loads(text)
    lst = json_data['data']
    nextUrl = json_data['paging']['next']
    
    if not lst:
        return;

    for item in lst:
        type = item['target']['type']
        
        if type == 'answer':
            # 回答
            question = item['target']['question']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)
            # 保存到数据库
            saveQuestionDB(id,title,url)

        elif type == 'article':
            #专栏
            zhuanlan = item['target']
            id = zhuanlan['id']
            title = zhuanlan['title']
            url = zhuanlan['url']
            vote = zhuanlan['voteup_count']
            cmts = zhuanlan['comment_count']
            auth = zhuanlan['author']['name']
            print("专栏:",id,title)
            # 保存到数据库
            saveArticleDB(id, title, vote, cmts, auth, url)

        elif type == 'question':
            # 问题
            question = item['target']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)
            # 保存到数据库
            saveQuestionDB(id,title,url)

    return nextUrl

if __name__ == '__main__':
    topicID = '20192351'
    # 讨论
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_activity?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&after_id=0'
    while url:
       text = fetchHotel(url)
       url = parseJson(text)

    # 精华
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/essence?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
       text = fetchHotel(url)
       url = parseJson(text)

    # 等待回答
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_question?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
       text = fetchHotel(url)
       url = parseJson(text)

,“讨论”,“精华” 页签中除了问题的回答之外,还有一些专栏文章,顺手也爬了存数据库吧。下面是文章的数据表结构。

 运行程序之后,数据库中确实爬取到了数据。

 

0x04 多线程准备

 前面的代码是单线程爬取。如果不考虑效率的话,大家完全可以等待一个页签中的数据爬取完成之后,继续爬取下一个页签。

但是嘛,那样毕竟比较浪费时间。这里我打算用多线程来爬取,每一个线程爬取一个页签的数据,提高爬虫效率。 

import threading

def crawl_1(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_activity?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&after_id=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_讨论")

def crawl_2(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/essence?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_精华")

def crawl_3(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_question?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_等待回答")

我们将爬取讨论,精华,等待回答三个页签的代码,分别放到 crawl_1,crawl_2,crawl_3 这 3 个函数中。

然后在主函数中启动 3 条线程,每条线程执行一个爬虫函数。

def main():

    topicID = '20192351'
    try:
        t1 = threading.Thread(target=crawl_1, args=(topicID,))
        t2 = threading.Thread(target=crawl_2, args=(topicID,))
        t3 = threading.Thread(target=crawl_3, args=(topicID,))
        t1.start()
        t2.start()
        t3.start()
    except:
       print ("Error: 无法启动线程")

但是使用多线程的话,要注意一件事情,就是有些操作是可以多条线程同时做的,比如说发起网络请求,解析数据等,但是有些操作同一时间只能允许一条线程使用,比如说文件读写,数据库操作等。

所以,在数据库操作环节需要上锁,每次只放一个线程进去,避免多个线程同时操作数据库,造成错误。

lock = threading.Lock()

def saveQuestionDB(id, title, url):
    #插入数据
    lock.acquire()
    try:
        sql = "insert into questions values (%s,'%s','%s')"%(id, title, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)
    lock.release()

在数据库操作之前 lock.acquire() 上锁,操作完成之后,lock.release() 解锁。

 

最后,此爬虫的完整代码如下:

import requests
import json
import pymysql
import threading

lock = threading.Lock()
db = pymysql.connect(host='xxx.xx.xx.xxx', port=3306, user='zhihuTopic', passwd='xxxxxx', db='zhihuTopic', charset='utf8')
cursor = db.cursor()

def saveQuestionDB(id, title, url):
    #插入数据
    lock.acquire()
    try:
        sql = "insert into questions values (%s,'%s','%s')"%(id, title, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)
    lock.release()

def saveArticleDB(id, title, vote, cmts, auth, url):
    #插入数据
    lock.acquire()
    try:
        sql = "insert into article values (%s,'%s',%s,%s,'%s', '%s')"%(id, title, vote, cmts, auth, url)
        cursor.execute(sql)
        db.commit()
    except Exception as e:
        db.rollback()
        print(e)
    lock.release()

def fetchHotel(url):
    # 发起网络请求,获取数据
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }
 
    # 发起网络请求
    r = requests.get(url,headers=headers)
    r.encoding = 'Unicode'
    return r.text

def parseJson(text):
    json_data = json.loads(text)
    lst = json_data['data']
    nextUrl = json_data['paging']['next']
    
    if not lst:
        return;

    for item in lst:
        type = item['target']['type']
        
        if type == 'answer':
            # 回答
            question = item['target']['question']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)
            # 保存到数据库
            saveQuestionDB(id,title,url)

        elif type == 'article':
            #专栏
            zhuanlan = item['target']
            id = zhuanlan['id']
            title = zhuanlan['title']
            url = zhuanlan['url']
            vote = zhuanlan['voteup_count']
            cmts = zhuanlan['comment_count']
            auth = zhuanlan['author']['name']
            print("专栏:",id,title)
            # 保存到数据库
            saveArticleDB(id, title, vote, cmts, auth, url)

        elif type == 'question':
            # 问题
            question = item['target']
            id = question['id']
            title = question['title']
            url = 'https://www.zhihu.com/question/' + str(id)
            print("问题:",id,title)
            # 保存到数据库
            saveQuestionDB(id,title,url)
    return nextUrl

def crawl_1(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_activity?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&after_id=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_1")

def crawl_2(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/essence?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_2")

def crawl_3(topicID):
    url = 'https://www.zhihu.com/api/v4/topics/' + topicID + '/feeds/top_question?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Canswer_type%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.paid_info%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.annotation_detail%2Ccontent%2Chermes_label%2Cis_labeled%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.annotation_detail%2Ccomment_count%3B&limit=10&offset=0'
    while url:
        text = fetchHotel(url)
        url = parseJson(text)
        print("crawl_3")

if __name__ == '__main__':
    topicID = '20192351'
    try:
        t1 = threading.Thread(target=crawl_1, args=(topicID,))
        t2 = threading.Thread(target=crawl_2, args=(topicID,))
        t3 = threading.Thread(target=crawl_3, args=(topicID,))
        t1.start()
        t2.start()
        t3.start()
    except:
       print ("Error: 无法启动线程")

 

0x05 后记

之前的爬虫里我基本很少使用数据库,也很少使用多线程。

不使用数据库,一是因为爬取的数据量较小,保存到本地使用 excel 处理可能更方便;二是作为面向新手的实战教程,使用数据库会增加额外的学习成本(需要安装软件 MySQL,安装额外的库 pymysql,学习额外的语言 SQL)。

不使用多线程,一是没必要,因为数据量太小了;二是减少对方网站服务器的访问压力;三是减少新手学习难度。

但是说实话,在数据量大一些的情况下,使用数据库和多线程的优势立马出来了,数据的存取,爬取的效率等方面都很强悍。

 

爬取完成之后,总计获取到了 17375 条问题,403 篇专栏文章,共计 17375 条数据。

跟话题主页上显示的 23498 条问题相比,显然还是有较大差距的。

不过缺失的那部分数据去了哪儿,是漏爬了?还是被删除了?还是被折叠了?还是有什么其他原因,我就不得而知了。

 

 


 如果文章中有哪里没有讲明白,或者讲解有误的地方,欢迎在评论区批评指正,或者扫描下面的二维码,加我微信,大家一起学习交流,共同进步。

©️2020 CSDN 皮肤主题: 书香水墨 设计师:CSDN官方博客 返回首页