Python: How to Fetch Zhihu User Information and Store It in a Local Excel File

Latest recommended article published on 2021-01-28 16:20:29

Behersve

Category: Python crawlers. Tags: python, Zhihu, big data

Article link: https://blog.csdn.net/YiXiao1997/article/details/86776465

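At first, the hard part was figuring out how to collect information on a large number of users without all of them coming from similar industries, since statistics drawn from one cluster of industries would be meaningless. The approach I settled on: start from one user's follower list. Each of those followers has followers of their own, so walking the lists yields a large number of users and also keeps them well spread out. One problem remains: mutual follows. If two users follow each other, a naive recursion bounces between them and never returns, which was quite frustrating.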
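The mutual-follow problem is a cycle in a graph, and a common way to handle it is to remember which users have already been seen. The sketch below is not the flag-based approach used in this post. It swaps in an iterative breadth-first walk with a `visited` set. The names `crawl_followers`, `get_followers`, `max_users` and the toy graph are made up for illustration; in a real crawler `get_followers` would call the follower-list API.

```python
from collections import deque


def crawl_followers(start, get_followers, max_users=100):
    """Breadth-first walk over followers that never visits a user twice."""
    visited = {start}       # every user already queued or processed
    queue = deque([start])
    order = []              # users in the order they were processed
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for follower in get_followers(user):
            if follower not in visited:  # skips mutual follows and longer cycles
                visited.add(follower)
                queue.append(follower)
    return order


# Toy follower graph (made-up data): "a" and "b" follow each other.
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
print(crawl_followers("a", lambda user: toy_graph.get(user, [])))
```

Because the walk is a loop rather than a recursion, it also avoids Python's recursion depth limit, and `max_users` gives a simple way to stop early.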

① Import the libraries

from bs4 import BeautifulSoup
import re
import json
import requests
import math
import xlwt

② Get the follower count

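def getFllowersnum(headurl):  # Get the follower count from a user's profile page
    try:
        response = requests.get(headurl, headers=kv)
        html = response.text
        # The page carries two NumberBoard values; the second one is the follower count
        number = re.findall(r'class="NumberBoard-itemValue" title="(.*?)">', html)
        # concerned = int(number[0])
        followers = int(number[1])
        return followers
    except Exception:
        print("This account is disabled or its profile is not public")
        return 0  # treat an unreadable profile as having no followers to crawl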
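The parsing step can be tried without touching the network. The following is a minimal sketch that applies the same regular expression to a string. The helper name `parse_follower_count` and the sample markup are hypothetical (the markup is only shaped the way the pattern expects, not copied from Zhihu), and stripping commas from the value is an assumption in case the count is formatted with separators.

```python
import re

# Same pattern as in getFllowersnum above
NUMBER_BOARD = re.compile(r'class="NumberBoard-itemValue" title="(.*?)">')


def parse_follower_count(html):
    """Return the follower count found in the page, or None if it is absent."""
    values = NUMBER_BOARD.findall(html)
    if len(values) < 2:  # restricted or deactivated profiles lack the number board
        return None
    try:
        return int(values[1].replace(",", ""))  # assumption: commas may appear
    except ValueError:
        return None


# Hypothetical markup, only shaped like the pattern expects
sample = (
    '<strong class="NumberBoard-itemValue" title="120">120</strong>'
    '<strong class="NumberBoard-itemValue" title="3456">3,456</strong>'
)
print(parse_follower_count(sample))
print(parse_follower_count("<html></html>"))
```

Returning `None` instead of printing lets the caller decide how to treat a profile that cannot be read.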

The function that fetches the user information is commented in detail:

def getuserContent(listurl, headurl, namelist, count, flag1):  # Fetch user information
    try:
        for i in range(math.ceil(getFllowersnum(headurl) / 20)):
            response = requests.get(listurl, headers=kv)
            data = json.loads(response.text)["data"]  # renamed from `dict`, which shadowed the builtin
            for j in range(len(data)):  # the last page may hold fewer than 20 entries
                if flag1 == 0:  # flag1 becomes nonzero when a repeated user shows up, so that level of recursion is skipped
                    name = data[j]["name"]  # the "name" field of the returned JSON
                    url_token = data[j]["url_token"]  # the unique user identifier in the returned JSON
                    # Build this follower's profile URL (kept separate from the current level's headurl)
                    next_headurl = "https://www.zhihu.com/people/" + url_token + "/activities"
                    print(name, "followers: " + str(getFllowersnum(next_headurl)), next_headurl)

                    # Build the URL of this follower's own follower list
                    next_listurl = "https://www.zhihu.com/api/v4/members/"+url_token+"/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20"
                    namelist.insert(count, name)  # record the user reached at this depth

                    if url_token == "excited-vczh":  # compare with the entry user (by url_token, not display name)
                        print("------------->" + name + " and excited-vczh follow each other<-------------")
                    for k in range(count):  # if the current user repeats any user already in namelist, change flag1 to break out of the endless recursion
                        if namelist[k] == namelist[count]:
                            print("------------->" + namelist[k+1] + " and " + namelist[count] + " follow each other<-------------")
                            flag1 = flag1 + 1

                    count = count + 1  # one level deeper
                    getuserContent(next_listurl, next_headurl, namelist, count, flag1)  # recurse level by level
                    flag1 = 0  # after leaving the blocked recursion, reset the flag so the program can continue
                    namelist.pop()  # once the deepest level finishes, drop the last entry of namelist
                    count = count - 1  # one recursion finished, so go back up one level
            # Move on to the next page of the current user's follower list
            listurl = re.sub(r"offset=\d+", "offset=" + str((i + 1) * 20), listurl)
    except Exception:
        print("Failed to fetch the URL")

if __name__ == "__main__":
    kv = {'user-agent': 'Mozilla/5.0'}
    headurl = "https://www.zhihu.com/people/excited-vczh/activities"
    listurl = "https://www.zhihu.com/api/v4/members/excited-vczh/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20"
    namelist = []
    count = 0  # current recursion depth (index into namelist)
    flag1 = 0
    getuserContent(listurl, headurl, namelist, count, flag1)
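The long follower-list URL appears several times with only the user and the `offset` changing, so it can be assembled from parts. This is a rough sketch using the standard library only. The helper names `followers_url` and `page_offsets` are invented here, and `INCLUDE` is simply the decoded form of the `include` parameter from the URLs above. The encoded output may percent-encode a few more characters (such as `*`) than the hand-written URL, but it decodes to the same value.

```python
from urllib.parse import urlencode, quote, urlparse, parse_qs

# Decoded form of the `include` query parameter used in the URLs above
INCLUDE = ("data[*].answer_count,articles_count,gender,follower_count,"
           "is_followed,is_following,badge[?(type=best_answerer)].topics")


def followers_url(url_token, offset=0, limit=20):
    """Build the follower-list API URL for one page of one user."""
    query = urlencode({"include": INCLUDE, "offset": offset, "limit": limit})
    return ("https://www.zhihu.com/api/v4/members/"
            + quote(url_token) + "/followers?" + query)


def page_offsets(follower_count, limit=20):
    """Offsets of every page needed to cover follower_count entries."""
    return list(range(0, follower_count, limit))


url = followers_url("excited-vczh", offset=40)
print(url)
print(parse_qs(urlparse(url).query)["offset"])
print(page_offsets(45))
```

With these two helpers, paging through any user's followers becomes a loop over `page_offsets(...)` instead of string surgery on a fixed URL.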

Here is a demo that uses the xlwt library to store data in Excel

import xlwt
excel = xlwt.Workbook()
# cell_overwrite_ok avoids an error when the same cell is written more than once
sheet = excel.add_sheet('Test1', cell_overwrite_ok=True)
sheet.write(0, 0, 'This is a test1')  # write "This is a test1" into row 1, column 1
sheet.write(0, 1, 'This is a test')  # write "This is a test" into row 1, column 2
excel.save('C:/Users/23504/Desktop/Test1.xls')
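To connect the demo to the crawler, the collected users have to become rows. The sketch below is one possible way to do that, not code from the original program. The record layout (`name`, `url_token`, `followers`), the helper names `to_rows` and `save_rows`, and the sample users are all assumptions for illustration. Only `xlwt.Workbook`, `add_sheet`, `write` and `save` from the demo above are used.

```python
def to_rows(users, fields=("name", "url_token", "followers")):
    """Turn a list of user dicts into a header row plus one row per user."""
    rows = [list(fields)]
    for user in users:
        # A missing field becomes an empty cell instead of raising KeyError
        rows.append([user.get(field, "") for field in fields])
    return rows


def save_rows(rows, path, sheet_name="Users"):
    """Write the rows to an .xls file with xlwt."""
    import xlwt  # imported here so to_rows works even without xlwt installed
    workbook = xlwt.Workbook()
    sheet = workbook.add_sheet(sheet_name)
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            sheet.write(r, c, value)
    workbook.save(path)


# Made-up sample records
users = [
    {"name": "Alice", "url_token": "alice-example", "followers": 12},
    {"name": "Bob", "url_token": "bob-example"},
]
rows = to_rows(users)
print(rows)
# save_rows(rows, "users.xls")  # uncomment to write the file
```

Keep in mind that the .xls format holds at most 65,536 rows per sheet, so a large crawl would need several sheets or another format.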

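Tip1: The tricky part is working out what to do after the deepest level of recursion finishes so that the program can carry on.

Tip2: Two URLs need to be analysed: the user's profile page (used to get the user's information) and the user's follower list (used to reach more users).

Tip3: Some users restrict who can see their information, which can cause errors, so use try to guard against them.

Tip4: After a lot of crawling, Zhihu triggers a security check. It is best to use an IP pool so that this problem does not interrupt the program.

Tip5: The program is pretty messy, and I got a bit dizzy writing it...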
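For the IP pool mentioned in Tip4, a very small round-robin rotation might look like the sketch below. The class name `ProxyPool` and the addresses are placeholders, not working proxies. The returned dict has the shape that the `proxies` argument of `requests.get` accepts.

```python
from itertools import cycle


class ProxyPool:
    """Hand out proxy settings in round-robin order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one proxy address is required")
        self._cycle = cycle(addresses)

    def next_proxies(self):
        address = next(self._cycle)
        # Same proxy for both schemes
        return {"http": address, "https": address}


# Placeholder addresses for illustration only
pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
print(pool.next_proxies())
print(pool.next_proxies())
# In the crawler: requests.get(url, headers=kv, proxies=pool.next_proxies())
```

A fuller version would also drop proxies that keep failing, but the rotation alone already spreads requests over several addresses.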