A Study of Zhihu User Distribution

Preface

Although Zhihu is long past what it was in its early days, it still has a very broad user base. My plan was to write a crawler that scrapes users' locations, education, majors and similar information, persist the data to a database, and finally build a web service that presents it as charts.

The ECharts map part still needs work, though. Even after some debugging, the result is not quite what I hoped for. (⊙﹏⊙)b

Project Setup

As mentioned in the preface, quite a few pieces of technology are involved here.
Let me briefly show the project directory.

C:\Users\biao\Desktop\network\code\zhihu-range>tree . /f
Folder PATH listing
Volume serial number is E0C6-0F15
C:\USERS\BIAO\DESKTOP\NETWORK\CODE\ZHIHU-RANGE
│  dbhelper.py
│  scheduler.py
│  spider.py
│  zhihu.db
│  __init__.py
│
├─web
│  │  service.py
│  │  __init__.py
│  │
│  ├─static
│  │      china.js
│  │      echarts.js
│  │      echarts.min.js
│  │      jquery-2.2.4.min.js
│  │
│  └─templates
│          index.html
│
└─__pycache__
        dbhelper.cpython-36.pyc
        spider.cpython-36.pyc

Modularization

Next, let's implement each of the small modules one by one.

Crawler

There are a few points to pay attention to in the crawler part:

  • the authorization field in the request headers
  • request-rate control: adding a random delay noticeably eases the anti-crawling limits (see the sketch after this list)

  • endpoint for fetching the people who follow me:

https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

  • endpoint for fetching the people I follow:

https://www.zhihu.com/api/v4/members/zhi-ai-89-18/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

  • endpoint for fetching a single user's profile information:

https://www.zhihu.com/api/v4/members/zhi-ai-89-18?include=locations%2CemploymentsXXXXXXXXXXXX
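As a minimal sketch of the rate-limiting idea (the `polite_get` helper below is hypothetical and not part of the project code), a random delay can simply be slipped in before every request:

# Minimal sketch of the rate-limiting idea; polite_get is a hypothetical helper,
# not part of the original project code.
import random
import time

import requests

HEADERS = {
    # copy the authorization value out of the browser's request headers
    'authorization': 'Bearer <your token here>',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}


def polite_get(url, headers=HEADERS, min_delay=1.0, max_delay=3.0):
    # Sleep a random interval first so requests are not fired at a fixed rate,
    # which makes the anti-crawling limits much less likely to trigger.
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=headers)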

Once these points are clear, the crawler itself poses no real problems. See the code below for the details.

# coding: utf8
# @Author: 郭 璞
# @File: spider.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: crawler that fetches users' location data
import requests
import json
import re
import math


class Spider(object):
    def __init__(self):
        """
        Initialize the request headers. The authorization field is mandatory;
        without it no data can be fetched.
        """
        self.headers = {
            'authorization': 'Bearer Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'Host': 'www.zhihu.com',
            'x-udid': 'ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=',
        }
        self.cookie = {
            'Cookie': 'q_c1=cbf69b836d4645b29f057b71be86c00e|1493896915000|1493896915000; r_cap_id="NWY3YjIzYzlmOTg0NDVhM2FmMzdjNzA1YzY5NTBlYmU=|1494146108|664527b0598db30d7734ff56ea5ac12b17cbe2d8"; cap_id="MWRhOTIzNGYzZDdjNDA3MjhiNTg1MGQ3ZDJlMjQ5NWE=|1494146108|94fc913a73ce89aeb3b60439fdcc69687baf438d"; d_c0="ABAC0r-WuAuPTsVSA2wl0bXj3UZqixKgbPE=|1494146110"; _zap=c27db1fb-911e-48bd-babe-3b6e66c3e558; _xsrf=55d8c6a475335b06ee3e848612afdd80; aliyungf_tc=AQAAAJ+R5xghJQIAlnF1b59VTAruEEc9; acw_tc=AQAAAGxlvy3TLgIAlnF1bxgpA2LSD8+W; s-q=%E6%A2%81%E5%8B%87; s-i=1; sid=p74htbkp; z_c0=Mi4wQUFEQWRCUTdBQUFBRUFMU3Y1YTRDeGNBQUFCaEFsVk5SMmsyV1FEeC11Uy03U2Zmc0pmSG8wTm55V2RSdjBSd3hn|1495413191|2fac9f462ad7607baaea9fca2a64abe72134af4a; __utma=155987696.1489589582.1495414813.1495414813.1495414813.1; __utmb=155987696.0.10.1495414813; __utmc=155987696; __utmz=155987696.1495414813.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
        }

    def parse_homepage(self, username):
        """
        Return (following_count, follower_count) for `username`.
        :param username:
        :return:
        """
        # Approach 1: scrape the counts from the profile HTML with regexes
        # homeurl = "https://www.zhihu.com/people/{}".format(username)
        # response = requests.get(url=homeurl, headers=self.headers)
        # if response.status_code == 200:
        #     followees_number = int(re.findall(re.compile('followingCount":(\d+),'), response.text)[0])
        #     followers_number = int(re.findall(re.compile('se,"followerCount":(\d+),'), response.text)[0])
        #     print("following", followees_number)
        #     print("followers", followers_number)
        #     return (followees_number, followers_number)
        # else:
        #     print(response.status_code)
        # -------------------------------------------------
        # Approach 2: read the counts from the v4 API JSON
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            return (data['following_count'], data['follower_count'])
        else:
            print(response.status_code)

    def get_location_edu(self, username):
        """
        Return the location, school name and major for `username`.
        :param username:
        :return:
        """
        tempurl = 'https://www.zhihu.com/api/v4/members/{}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'.format(username)
        response = requests.get(url=tempurl, headers=self.headers)
        if response.status_code == 200:
            data = json.loads(response.text)
            try:
                location = data['locations'][0]['name']
            except:
                location = "未填写"
            # handle school and major
            try:
                school = data['educations'][0]['school']['name']
                major = data['educations'][0]['major']['name']
            except:
                school = "未填写"
                major = "未填写"
            return (username, location, school, major)
        else:
            print(response.status_code)

    def get_followees(self, username):
        """
        Return the list of people that `username` follows.
        :param username:
        :return:
        """
        # First get the total number of followees to determine the paging range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followees_number = homeparsed[0]
        pages = math.ceil(followees_number / 20)
        # collect results in a list; duplicates are stripped with set() at the end
        followee_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followees?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ":  ", followee['url_token'])
                    followee_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # return the deduplicated list of people that username follows
        return list(set(followee_result))

    def get_followers(self, username):
        """
        Return the list of people who follow `username`.
        :param username:
        :return:
        """
        # First get the total number of followers to determine the paging range
        homeparsed = self.parse_homepage(username=username)
        print(homeparsed)
        followers_number = homeparsed[1]
        pages = math.ceil(followers_number / 20)
        # collect results in a list; duplicates are stripped with set() at the end
        follower_result = []
        counter = 1
        for offset in range(pages):
            tempurl = 'https://www.zhihu.com/api/v4/members/{username}/followers?offset={offset}&limit=20'.format(username=username, offset=offset * 20)
            response = requests.get(url=tempurl, headers=self.headers)
            if response.status_code == 200:
                data = json.loads(response.text)
                followees = data['data']
                for followee in followees:
                    # print(counter, ":  ", followee['url_token'])
                    follower_result.append(followee['url_token'])
                    counter += 1
            else:
                print(response.status_code)
        # return the deduplicated list of followers
        return list(set(follower_result))


if __name__ == '__main__':
    spider = Spider()
    # spider.get_followees(username='tianshansoft')
    # spider.parse_homepage(username='zhi-ai-89-18')
    # location = spider.get_location_edu(username='zhi-ai-89-18')
    # print(location)
    # print(spider.parse_homepage(username='tianshansoft'))
    # followee_result = spider.get_followees(username='tianshansoft')
    # print(followee_result)
    # print(len(followee_result))
    followers_result = spider.get_followers(username='tianshansoft')
    print(len(followers_result))
    print(followers_result[:100])

Database

To keep things simple and convenient, sqlite3 is used as the database. Since the requirements this time are very modest, a single table is enough.

create table user(
    id INTEGER not null primary key autoincrement,
    username varchar(36) not null,
    location varchar(255),
    school varchar(255),
    major varchar(255)
);

We also need a database helper class; otherwise we would keep writing the same repetitive code over and over, which is pointless.

# coding: utf8
# @Author: 郭 璞
# @File: dbhelper.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: helper class for the database operations
import sqlite3


class DbConfig(object):
    DATABASE_FILE_PATH = 'zhihu.db'


class DbHelper(object):
    def __init__(self):
        self.conn = sqlite3.connect(DbConfig.DATABASE_FILE_PATH)

    def create_table(self):
        # AUTOINCREMENT is the keyword for an auto-incrementing column
        sql = """
        create table user(
        id INTEGER not null primary key autoincrement,
        username varchar(36) not null,
        location varchar(255),
        school varchar(255),
        major varchar(255)
        );
        """
        cursor = self.conn.cursor()
        cursor.execute(sql)
        cursor.close()

    def add(self, data=()):
        cursor = self.conn.cursor()
        # The column is `username`, not `name`; use a parameterized query so
        # quotes in the scraped data cannot break the statement.
        sql = "insert into user(username, location, school, major) values(?, ?, ?, ?);"
        cursor.execute(sql, (data[0], data[1], data[2], data[3]))
        self.conn.commit()
        cursor.close()

    def get_data(self):
        cursor = self.conn.cursor()
        sql = "select location, count(location) as numbers from user group by location"
        cursor.execute(sql)
        resultset = cursor.fetchall()
        print(resultset)
        return resultset


if __name__ == '__main__':
    dbhelper = DbHelper()
    # dbhelper.create_table()
    # data = {
    #     'username': 'zhi-ai-89-18',
    #     'location': '大连',
    #     'school': '大连理工大学',
    #     'major': '软件',
    # }
    # data = ('tianshansoft', '上海', 'weizhi', 'software')
    # dbhelper.add(data=data)
    dbhelper.get_data()

This is simply need-driven development: all I need is to store data and query data, so the helper class is written very plainly, but functionally it is quite sufficient.

Finally, let's look at how Zhihu users are distributed by region. The SQL statement used is as follows:

select location, count(location) as numbers
from user
group by location
order by numbers desc;
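As a quick sketch (using nothing beyond the zhihu.db file and the user table defined above), the same query can be run straight from Python:

# Quick sketch: run the aggregation above against zhihu.db and print the counts.
# Assumes the `user` table has already been filled by the scheduler.
import sqlite3

conn = sqlite3.connect('zhihu.db')
cursor = conn.cursor()
cursor.execute(
    "select location, count(location) as numbers "
    "from user group by location order by numbers desc"
)
for location, numbers in cursor.fetchall():
    print(location, numbers)
conn.close()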

The result is as follows:

[Figure: geographic distribution of Zhihu users]

Scheduler

The scheduler is only a conceptual name here. Its job is to glue the crawler and the persistence layer together. According to the six degrees of separation theory, a social network is one enormous connected graph, so the crawler can never really cover every user; we have to settle for crawling a subset. Even so, this amounts to something close to random sampling, so the subset should not differ too much from the whole.

Below is a brief scheduler (brief because it does no deduplication; a sketch of how deduplication could be added follows the code).

# coding: utf8
# @Author: 郭 璞
# @File: scheduler.py
# @Time: 2017/5/22
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: scheduler that glues the modules together so they cooperate
import spider
import dbhelper
import time, random

sp = spider.Spider()
entrance = 'ghostcomputing'
queue = [entrance]
container = []
LEVEL = 3
counter = 0
dbhelper = dbhelper.DbHelper()

while queue:
    if counter >= 10000:
        break
    else:
        temp = queue.pop(0)
        followees = sp.get_followees(username=temp)
        queue.extend(followees)
        counter += (len(followees) - 1)
        # random sleep
        timeseed = random.randint(1, 5)
        print('Sleeping randomly for {} seconds!'.format(timeseed))
        time.sleep(timeseed)
        # fetch the details of the people that username follows
        for index, followee in enumerate(followees):
            # container.append(sp.get_location_edu(username=followee))
            data = sp.get_location_edu(username=followee)
            dbhelper.add(data=data)
            print('{} fetched'.format(followee))
            # random sleep
            if index % 28 == 0:
                timeseed = random.randint(1, 3)
                print('Sleeping randomly for {} seconds!'.format(timeseed))
                time.sleep(timeseed)

print(container)
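For completeness, one straightforward way to add the missing deduplication (a sketch of my own, not in the original scheduler) is to keep a set of url_tokens that have already been expanded:

# Sketch of the missing deduplication: remember which url_tokens have already
# been expanded and only enqueue users that have not been seen before.
import spider

sp = spider.Spider()
entrance = 'ghostcomputing'
queue = [entrance]
visited = set()
counter = 0

while queue and counter < 10000:
    temp = queue.pop(0)
    if temp in visited:
        continue
    visited.add(temp)
    followees = sp.get_followees(username=temp)
    # only enqueue people we have not expanded yet
    queue.extend(f for f in followees if f not in visited)
    counter += len(followees)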

Web Service

ECharts works best with the front end and back end separated, so serving data to the front-end charts through an API is a good choice. I have previously done this with a PHP back end providing the data, and the same works here; jQuery also makes it convenient.

This time, however, I want to try Flask, which is more lightweight. One thing to watch out for: with the template engine, the HTML is no longer plain HTML. The paths to JavaScript and CSS files have to be handled explicitly, otherwise they cannot be resolved correctly.

Function: url_for("the endpoint of the static folder, usually 'static'", filename="the value you want to appear in src, usually the file's path inside static")

For example, if I want <script src="echarts.js">, then in the template write:

    <script src="{{ echarts_path }}">

and in the back end:

    echarts_path = url_for('static', filename='echarts.js')
    return render_template('index.html', echarts_path=echarts_path)

With this understood, we can wire the scripts and styles into our own template.
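The post does not show web/service.py itself, so here is a minimal sketch of what it might look like under a few assumptions: a hypothetical /api/distribution endpoint that returns the per-location counts as JSON for the map, a '../zhihu.db' path relative to the web package, and the template variable names from the url_for example above.

# Minimal sketch of web/service.py; the real file is not shown in the post.
# The /api/distribution endpoint, the '../zhihu.db' path and the template
# variable names are assumptions.
import sqlite3

from flask import Flask, jsonify, render_template, url_for

app = Flask(__name__)


@app.route('/')
def index():
    # Hand the resolved static paths to the template, as described above.
    return render_template(
        'index.html',
        echarts_path=url_for('static', filename='echarts.js'),
        china_path=url_for('static', filename='china.js'),
        jquery_path=url_for('static', filename='jquery-2.2.4.min.js'),
    )


@app.route('/api/distribution')
def distribution():
    # Same aggregation as the SQL shown earlier, returned as
    # [{"name": location, "value": count}, ...] for the ECharts map series.
    conn = sqlite3.connect('../zhihu.db')
    cursor = conn.cursor()
    cursor.execute(
        "select location, count(location) as numbers "
        "from user group by location order by numbers desc"
    )
    rows = cursor.fetchall()
    conn.close()
    return jsonify({'data': [{'name': name, 'value': value} for name, value in rows]})


if __name__ == '__main__':
    app.run(debug=True)

The front end can then fetch /api/distribution with jQuery and feed the returned array into the map series of the ECharts example linked below.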

http://echarts.baidu.com/echarts2/doc/example/map15.html

But all I have managed to draw so far is a single map of China... ...
[Figure: the rendered result]

To do... ...

Summary

Looking back, the main points were fetching data from the API in the crawler, working with sqlite3, and serving static resources from the web service. The rest of the graphical presentation still needs more effort.

           
