Visualizing papi酱's Weibo data (as of December 9, 2020)


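1. Scraping Weibo data

When scraping a website, try the mobile site (m-site) first, then the WAP site, and only then the PC site, because the PC site has the most verification checks. The PC site does, however, carry the most complete information and supports advanced search, so it can be scraped for a specific time range and keyword. This post scrapes and visualizes the Weibo data of a single user.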

(1) Required modules

import os  # used by savepic() below to create the image directory
import urllib
import urllib.request
import time
import json
import xlwt

(2) Scraping a specific user

papi酱 has more than 30 million followers, and most of the account's posts carry real content. Visualizing its Weibo data gives a picture of how the community responds and of the account's posting patterns.

Get the user's uid: right-click the profile page, view the page source, and look for the uid, which is 2714280233.

id = "2714280233"  # note: this name shadows the built-in id()

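(3) Setting a proxy IP

Scraping through proxy IPs is one way to get around anti-scraping measures. Many sites count how often a single IP accesses them within a given period and block the IP once the count gets too high.

So a scraper can be configured with several proxies and switch between them at intervals; even if one is blocked, the others can still be used to finish the job.

In urllib.request, a proxy server is configured through ProxyHandler. There are many free proxy IP pools online (https://www.kuaidaili.com/free/); pick one as needed. Free proxies are generally only suitable for personal scraping: the same IP may be used by many people at once, so it tends to be short-lived, slow, and not very anonymous. Professional scraping engineers and companies need higher-quality private proxies, which are usually bought from a dedicated provider and accessed with username/password authorization.

# Set the proxy IP
proxy_addr = "122.241.72.191:808"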

Example: fetching the Douban home page

import urllib.request
import random

url = "https://www.douban.com/"

header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}

# Pool of proxies to pick from at random.
# Note: ProxyHandler matches on the URL scheme, so these "http" entries only apply to
# http:// URLs; an https:// URL such as the one above also needs an "https" entry.
proxy_list = [{"http" : "220.168.52.245:55255"},
              {"http" : "124.193.135.242:54219"},
              {"http" : "36.7.128.146:52222"},
              ]

# Pick one proxy at random
proxy = random.choice(proxy_list)
print(proxy)

# Build a proxy handler from the chosen proxy and open the page through it
httpproxy_handler = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(httpproxy_handler)
request = urllib.request.Request(url, headers=header)
response = opener.open(request)
data = response.read().decode('utf-8', 'ignore')
print(data)
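The idea of falling back to another proxy when one is blocked can be sketched as follows. This is a minimal sketch, not the author's code: fetch_with_rotation, fake_fetch, and the proxy addresses are hypothetical names and values, and the fetcher is passed in as a parameter so the rotation logic can be tried without network access.

```python
import random


def fetch_with_rotation(url, proxies, fetch, max_attempts=3):
    """Try up to max_attempts different proxies; return the first successful result."""
    candidates = random.sample(proxies, k=min(max_attempts, len(proxies)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error


# Stand-in fetcher (hypothetical); a real one would open the URL through the proxy.
def fake_fetch(url, proxy):
    if proxy == "10.0.0.1:8080":
        raise OSError("proxy blocked")
    return "fetched " + url


result = fetch_with_rotation(
    "http://example.com/", ["10.0.0.1:8080", "10.0.0.2:8080"], fake_fetch
)
```

In real use, the fetch argument could be a small wrapper around a function like use_proxy that takes the URL and a proxy address.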

(4) headers

Right-click -> Inspect -> Network -> Headers -> user-agent

(5) Defining the page-fetching function

# Fetch a page through the proxy and return the decoded body
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent",
                   "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Mobile Safari/537.36")
    # Note: this mapping only covers http:// URLs; https:// requests need an
    # 'https' entry as well to go through the proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data

(6) Getting the containerid of the user's Weibo tab, which is needed to scrape the posts

def get_containerid(url):
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    containerid = None  # stays None if no tab of type 'weibo' is found
    for tab in content.get('tabsInfo').get('tabs'):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return containerid
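To make the lookup easier to follow, here is a minimal offline sketch of the same extraction. The sample payload is hypothetical and heavily trimmed (only the keys read by get_containerid are kept, and the containerid values are made up), and find_weibo_containerid is an illustrative name.

```python
import json

# Hypothetical, heavily trimmed response body; real responses contain many more fields.
sample = '''
{"data": {"tabsInfo": {"tabs": [
    {"tab_type": "profile", "containerid": "230283"},
    {"tab_type": "weibo", "containerid": "107603"}
]}}}
'''


def find_weibo_containerid(raw_json):
    """Return the containerid of the tab whose tab_type is 'weibo', or None."""
    tabs = json.loads(raw_json).get("data", {}).get("tabsInfo", {}).get("tabs", [])
    for tab in tabs:
        if tab.get("tab_type") == "weibo":
            return tab.get("containerid")
    return None


containerid = find_weibo_containerid(sample)
```

Returning None when no matching tab exists lets the caller decide how to handle a missing tab instead of failing on an unbound variable.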

(7) Getting the user's basic information: screen name, profile URL, avatar, following count, follower count, gender, level, and so on

def get_userInfo(id):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    data = use_proxy(url, proxy_addr)
    user = json.loads(data).get('data').get('userInfo')
    profile_image_url = user.get('profile_image_url')
    description = user.get('description')
    profile_url = user.get('profile_url')
    verified = user.get('verified')
    guanzhu = user.get('follow_count')  # number of accounts followed
    name = user.get('screen_name')
    fensi = user.get('followers_count')  # number of followers
    gender = user.get('gender')
    urank = user.get('urank')
    print("Screen name: " + name + "\n" + "Profile URL: " + profile_url + "\n" + "Avatar URL: " + profile_image_url + "\n" + "Verified: " + str(
        verified) + "\n" + "Description: " + description + "\n" + "Following: " + str(guanzhu) + "\n" + "Followers: " + str(
        fensi) + "\n" + "Gender: " + gender + "\n" + "Level: " + str(urank) + "\n")
    return name

(8) Saving images

def savepic(id, pic_urls, created_at, page, num):
    srcpath = 'weibo/weibo_img/' + id + '/'
    os.makedirs(srcpath, exist_ok=True)  # needs "import os"
    # File name prefix: publish time, page number, and index of the post on the page
    picpath = str(created_at) + 'page' + str(page) + 'num' + str(num) + 'pic'
    for i, pic_url in enumerate(pic_urls):
        path = srcpath + picpath + str(i) + ".jpg"
        urllib.request.urlretrieve(pic_url, path)

(9) Fetching the posts and saving them to a text file

# Fetch the posts and append them to a text file. Each record holds the post text,
# the URL of the post's detail page, and the like, comment, and repost counts, among other fields.
def get_weibo(id, file):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    containerid = get_containerid(url)  # the same for every page, so fetch it once
    i = 1
    while True:
        weibo_url = url + '&containerid=' + containerid + '&page=' + str(i)
        try:
            data = use_proxy(weibo_url, proxy_addr)
            content = json.loads(data).get('data')
            cards = content.get('cards')
            if len(cards) > 0:
                for j in range(len(cards)):
                    print("----- Scraping page " + str(i) + ", post " + str(j) + " -----")
                    card_type = cards[j].get('card_type')
                    if card_type == 9:
                        mblog = cards[j].get('mblog')
                        attitudes_count = mblog.get('attitudes_count')  # likes
                        comments_count = mblog.get('comments_count')  # comments
                        created_at = mblog.get('created_at')  # publish time
                        reposts_count = mblog.get('reposts_count')  # reposts
                        scheme = cards[j].get('scheme')  # post URL
                        text = mblog.get('text')
                        source = mblog.get('source')
                        pictures = mblog.get('pics')  # images attached to the post, returned as a list
                        pic_urls = []  # image URLs
                        if pictures:
                            for picture in pictures:
                                pic_urls.append(picture.get('large').get('url'))

                        # Posts from 2020 come without a year, so a publish time shorter
                        # than 6 characters has no year part; prefix it with 2020
                        if len(str(created_at)) < 6:
                            created_at = "2020-" + str(created_at)

                        # Save the text. Fields: page, index, post URL, publish time, content,
                        # likes, comments, reposts, source, image URLs
                        with open(file, 'a', encoding='utf-8') as fh:
                            fh.write(str(i) + '\t' + str(j) + '\t' + str(scheme) + '\t' + str(
                                created_at) + '\t' + text + '\t' + str(attitudes_count) + '\t' + str(
                                comments_count) + '\t' + str(reposts_count) + '\t' + str(source) + '\t' + str(
                                pic_urls) + '\n')
                            print(text)

                        # Save the images (only for actual posts, i.e. card_type 9)
                        savepic(id, pic_urls, created_at, i, j)
                i += 1
                # Sleep for 1 s to avoid putting a heavy load on the server
                time.sleep(1)
            else:
                break
        except Exception as e:
            # Note: the page counter is not advanced here, so a persistent error
            # makes the loop retry the same page indefinitely
            print(e)
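The year-prefix rule used when saving each post can be isolated into a small helper, sketched below. normalize_created_at and its default_year parameter are hypothetical names; the rule itself (a publish time shorter than 6 characters has no year) is the one described in the code comments.

```python
def normalize_created_at(created_at, default_year="2020"):
    """Prefix default_year when the timestamp looks like 'MM-DD' (no year part)."""
    created_at = str(created_at)
    if len(created_at) < 6:
        return default_year + "-" + created_at
    return created_at


short = normalize_created_at("12-09")
full = normalize_created_at("2019-05-01")
```

Keeping the rule in one function also makes it easy to change the default year when the scraper is run in a later year.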

(10) Converting the text file to a spreadsheet

def txt_csv(filename, csvname):
    """
    Convert the tab-separated text file into a spreadsheet.
    :param filename: name of the input text file
    :param csvname: name of the output file
    Note: xlwt always writes the binary .xls format, whatever extension csvname has.
    """
    with open(filename, 'r', encoding='utf-8') as f:
        book = xlwt.Workbook()
        sheet = book.add_sheet('sheet1', cell_overwrite_ok=True)
        # Header row
        headers = ['Page', 'Index on page', 'Post URL', 'Published at', 'Content',
                   'Likes', 'Comments', 'Reposts', 'Source', 'Image URLs']
        for col, title in enumerate(headers):
            sheet.write(0, col, title)
        x = 1
        # Read the text file line by line; each line becomes one row
        for line in f:
            for i, item in enumerate(line.rstrip('\n').split('\t')):
                sheet.write(x, i, item)  # x is the row, i is the column
            x += 1  # move on to the next row
        book.save(csvname)  # save the workbook
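If a real comma-separated file is wanted (for example, to load it later with pd.read_csv), the standard csv module is one option. The sketch below is an alternative under that assumption, not the author's method; tsv_to_csv, the file names, and the sample row are hypothetical. The output is written as utf-8-sig, an encoding that spreadsheet tools on Windows generally detect correctly.

```python
import csv
import os
import tempfile


def tsv_to_csv(txt_path, csv_path, header):
    """Copy tab-separated lines from txt_path into a comma-separated file with a header row."""
    with open(txt_path, "r", encoding="utf-8") as src, \
            open(csv_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        rows = 0
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            writer.writerow(line.split("\t"))
            rows += 1
    return rows


# Small round trip on a temporary file to show the behaviour.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "posts.txt")
    csv_path = os.path.join(tmp, "posts.csv")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("1\t0\thello, world\n")
    row_count = tsv_to_csv(txt_path, csv_path, ["page", "index", "text"])
    with open(csv_path, encoding="utf-8-sig", newline="") as f:
        written = list(csv.reader(f))
```

The csv writer quotes fields that contain commas, so post text with commas in it still ends up in a single column.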

(11) Running the scraper

if __name__ == "__main__":
    name = get_userInfo(id)
    # Raw strings keep the backslashes in the Windows paths literal
    file = r"G:\爬虫内容\papi-weibo-1211\weibo" + str(name) + id + ".txt"
    get_weibo(id, file)

    txtname = file
    csvname = r"G:\爬虫内容\papi-weibo-1211\weibo" + str(name) + id + ".csv"
    txt_csv(txtname, csvname)

Start scraping, then check the downloaded images and the saved data. Before viewing the data, change the file encoding from ANSI to UTF-8.

Next, preprocess the scraped data: strip HTML links and other symbols unrelated to the content from the post text, add year, month, and day columns, and save the processed file.

2. Data preprocessing

(1) Reading the data

import pandas as pd
df = pd.read_csv("weibopapi.csv")
df.head()

Select columns 3 through 7 (the end of the slice 3:8 is exclusive), i.e. publish time, content, likes, comments, and reposts:

df = df.iloc[:, 3:8]

(2) Renaming the columns

df.columns = ["date", "content", "likes", "comments", "forwards"]

(3) Removing HTML tags and other symbols

import re
def delete_html(text_context):
    # HTML tags: "<", an optional "/", a tag name, then anything up to ">"; this also removes the emoji
    re_tag = re.compile(r'</?\w+[^>]*>')
    # re.sub replaces every match of the pattern; here the tags are replaced with ''
    new_text = re.sub(re_tag, '', text_context)
    # Remove the brackets 【 】 and the hashtag marker #
    new_text = re.sub('【', '', new_text)
    new_text = re.sub('】', '', new_text)
    new_text = re.sub('#', '', new_text)
    new_text = re.sub('//@', '', new_text)  # remove the repost marker
    new_text = re.sub(",+", ",", new_text)  # collapse repeated commas into one
    new_text = re.sub("/+", ",", new_text)  # replace each run of slashes with a comma
    new_text = re.sub(" +", " ", new_text)  # collapse repeated spaces into one
    # The character class matches any run of ".", "|", "…" or "。" and replaces it with a single "."
    new_text = re.sub("[...|…|。。。]+", ".", new_text)
    new_text = re.sub("-+", "--", new_text)  # replace each run of hyphens with "--"
    text_content = re.sub("———+", "———", new_text)  # collapse runs of three or more "—" into "———"
    return text_content

# Apply the cleaning function to every post
df["content-clean"] = df["content"].apply(delete_html)
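As a quick worked example of the tag-stripping step, the sketch below applies the same tag pattern to a made-up string. The sample text is hypothetical; actual post text returned by the API may be shaped differently.

```python
import re

TAG_PATTERN = re.compile(r"</?\w+[^>]*>")

# Hypothetical post text with HTML mixed into it.
raw = '<a href="https://example.com">#papi酱#</a> new video<br /> out now'

without_tags = TAG_PATTERN.sub("", raw)  # drops <a ...>, </a> and <br />
cleaned = without_tags.replace("#", "")  # drops the hashtag markers
```

Only the markup is removed; the text between an opening and a closing tag is kept.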

(4) Adding year, month, and day columns

# Split each date string of the form YYYY-MM-DD into its parts
date_parts = df["date"].astype(str).str.split("-")
df["year"] = date_parts.str[0]
df["month"] = date_parts.str[1]
df["day"] = date_parts.str[2]

(5) Saving the data

df.to_csv("weibopapi-yuchuli.csv")

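3. Data visualization

Looking at the total number of posts papi酱 published each year, 2016 and 2017 have the most posts. Looking at likes, reposts, and comments as well, the 2016 values are clearly higher than those of the other years. The calendar chart for 2016-2017 shows how the posts are distributed over time.

(1) Number of posts per month for each year

A Sankey diagram (also called a Sankey energy flow diagram or Sankey energy balance diagram) is a specific type of flow diagram in which the width of each branch corresponds to the size of the flow. It is commonly used to visualize data in energy, material composition, finance, and similar fields.
It is made up of triples (node, edge, flow).
It obeys conservation: however the flow is split, the totals at the start and at the end are always equal.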
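The (node, edge, flow) structure described above can be sketched in plain Python. This is a minimal sketch under assumptions: the sample pairs are made up, build_sankey is a hypothetical helper, and the name/source/target/value dictionary keys follow the node and link shape that ECharts-style Sankey series commonly use.

```python
from collections import Counter

# Hypothetical (year, month) pairs, one per post.
posts = [("2016", "01"), ("2016", "01"), ("2016", "02"), ("2017", "01")]


def build_sankey(pairs):
    """Return (nodes, links); each link carries the number of posts for one (year, month)."""
    counts = Counter(pairs)
    years = sorted({year for year, _ in pairs})
    months = sorted({month for _, month in pairs})
    nodes = [{"name": y} for y in years] + [{"name": "month " + m} for m in months]
    links = [
        {"source": year, "target": "month " + month, "value": n}
        for (year, month), n in sorted(counts.items())
    ]
    return nodes, links


nodes, links = build_sankey(posts)
# Conservation: the total flow leaving the year nodes equals the number of posts.
total_flow = sum(link["value"] for link in links)
```

Prefixing the month names keeps node names unique, so a month label cannot collide with a year label.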


(2) Posts per year with likes, reposts, and comments

Series 1: bar chart of likes, reposts, and comments

Series 2: line chart of the number of posts per year

Set the value range of each axis separately!

from pyecharts import options as opts
from pyecharts.charts import Bar, Line
x_data = ["2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]
c = (
    Bar()
    .add_xaxis(xaxis_data=x_data)
    .add_yaxis("Likes", [   71753,    71270,   218127,  3395972, 34584627, 21158158,
        8244047, 11768999,  8409407],label_opts=opts.LabelOpts(is_show=False))
    .add_yaxis("Reposts", [   1683,    3042,   12509,  428865, 4462778, 3190882, 1512825,
        941783,  439016],label_opts=opts.LabelOpts(is_show=False))
    .add_yaxis("Comments", [  15059,    9074,   30705,  448485, 4464258, 3134553, 1858660,
        995640,  530994],label_opts=opts.LabelOpts(is_show=False))

    # Second y-axis (on the right) for the number of posts
    .extend_axis(
        yaxis=opts.AxisOpts(
            name="Posts per year",
            type_="value",
            min_=0,
            max_=200,
            interval=50,
            axislabel_opts=opts.LabelOpts(formatter="{value}"),
        )
    )
    .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True, trigger="axis", axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True, type_="shadow"),
        ),
        # First y-axis (on the left) for likes, reposts, and comments
        yaxis_opts=opts.AxisOpts(
            name="Count",
            type_="value",
            min_=0,
            max_=35000000,
            interval=5000000,
            axislabel_opts=opts.LabelOpts(formatter="{value}"),
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
    )
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))  # hide the value labels on the chart
)

line = (
    Line()
    .add_xaxis(x_data)
    .add_yaxis(
        "Posts per year",
        yaxis_index=1,  # plot against the second (right-hand) axis
        y_axis=[ 93,  70,  71, 122, 153, 159,  97, 102,  19],
        label_opts=opts.LabelOpts(is_show=False),
    )
)

c.overlap(line).render_notebook()

(3) Calendar chart

The calendar chart shows the number of posts published each day, which can also be read per month or per week.
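The post does not show how the per-day counts were prepared. One way to get them, sketched here with the standard library only, is to count the posts per date and emit [date, count] pairs, the shape calendar charts typically take. The sample dates and the daily_counts helper are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical post dates (YYYY-MM-DD), one entry per post.
post_dates = ["2016-01-04", "2016-01-04", "2016-01-05", "2017-03-01"]


def daily_counts(dates, year):
    """Return sorted [date_string, count] pairs for the posts published in the given year."""
    counts = Counter(d for d in dates if date.fromisoformat(d).year == year)
    return [[day, n] for day, n in sorted(counts.items())]


calendar_data = daily_counts(post_dates, 2016)
```

Days without posts are simply absent from the result, which a calendar chart would normally render as empty cells.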


(4) Number of posts per year: rose chart

Monday turns out to be the day with relatively more posts.

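(5) Timeline: Timeline()

Prepare the number of posts for each weekday (Monday to Sunday) of each year from 2012 to 2020, and make it available through week_count_year for later use.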
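The post does not include the code behind week_count_year, so the following is only a guess at what such a helper could look like. Unlike the call in the article, which passes only the year, this hypothetical version also takes the list of dates so that it stays self-contained; the weekday labels and sample dates are made up.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def week_count_year(year, dates):
    """Return [(weekday_label, post_count), ...] for posts published in the given year."""
    counts = Counter(
        date.fromisoformat(d).weekday()  # Monday == 0 ... Sunday == 6
        for d in dates
        if date.fromisoformat(d).year == year
    )
    return [(label, counts.get(i, 0)) for i, label in enumerate(WEEKDAYS)]


# Hypothetical dates: 2016-01-04 was a Monday, 2016-01-05 a Tuesday.
sample_dates = ["2016-01-04", "2016-01-04", "2016-01-05", "2017-03-01"]
pairs_2016 = week_count_year(2016, sample_dates)
```

A list of (label, value) pairs like this is the kind of data a pie or rose chart usually takes as input.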

from pyecharts import options as opts
from pyecharts.charts import Pie, Timeline

attr = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
tl = Timeline()
for i in range(2012, 2021):
    pie = (
        Pie()
        .add(
            "",
            week_count_year(i),  # (weekday, count) pairs for year i, prepared beforehand
            rosetype="area",
            radius=["30%", "55%"],
        )
        .set_global_opts(title_opts=opts.TitleOpts("Number of Weibo posts by papi酱 in {}".format(i)),legend_opts=opts.LegendOpts(orient="vertical", pos_right="2%", pos_top="20%"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    tl.add(pie, "{}".format(i))
# tl.render("timeline_pie.html")
tl.render_notebook()

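4. Topic extraction

The topics of Weibo posts can be obtained by analyzing the subject matter of the post text. A topic is a high-level abstraction of a text message; once the topics are understood, scattered and unordered text data can be used efficiently. Topic modeling is the technique that extracts topics from text data. The topic information mined from Weibo posts can be used for emergency detection, event trend prediction, precision marketing, and so on. However, the result of topic extraction is usually a pile of word clusters, which is hard to read. Text visualization addresses this problem: it presents large amounts of text data, or complex content and patterns, as visual symbols, so that people can use their visual perception to quickly grasp the core information in the data.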

(1) Preparing the text with HTML links and other noise removed

# Take the cleaned content column as a list
content = df["content-clean"].values.tolist()

(2) Word segmentation with jieba

Segment each post and store the result:

import jieba
content_S = []
for line in content:
    current_segment = jieba.lcut(line)
    # Drop tokens that are a single space
    content_S.append([word for word in current_segment if word != " "])

(3) Removing stop words

stopwords=pd.read_csv("./stopwords.txt",index_col=False,sep="\t",quoting=3,names=['stopword'], encoding='utf-8')
stopwords.head(5)

def drop_stopwords(contents, stopwords):
    stopwords = set(stopwords)  # set membership tests are faster than list lookups
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:  # skip stop words
                continue
            line_clean.append(word)  # kept words, one list per post
            all_words.append(str(word))  # every kept word, for the word cloud
        contents_clean.append(line_clean)
    return contents_clean, all_words

# Remove the stop words

# drop_stopwords expects plain lists, so convert the data first
contents = content_S  # the segmented posts from the previous step
stopwords = stopwords.stopword.values.tolist()
# Run the function defined above to remove the stop words
contents_clean, all_words = drop_stopwords(contents, stopwords)

(4) Collecting all the words

# All remaining words
df_all_words = pd.DataFrame({'all_words': all_words})

# Group by word, count the occurrences, and sort by count in descending order
words_count = df_all_words.groupby('all_words').size().reset_index(name='count')
words_count = words_count.sort_values(by='count', ascending=False)
words_count.head()

A word cloud shows the distribution of words across the posts: it makes the frequently used words visible, which helps characterize the author's writing style.

(5) Storing the text with stop words removed

# Store the stop-word-free content, contents_clean
df_content = pd.DataFrame({'contents_clean': contents_clean})

# contents_clean is a list of lists; join each inner list into one space-separated string
s = [" ".join(words) for words in df_content["contents_clean"]]
s

(6) Topic extraction

from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer produces raw term counts: each entry is how many times a word occurs in a post.
# Note: the default token_pattern only keeps tokens of two or more characters.
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                max_features=1000,  # keep the 1000 most frequent words
                                max_df=0.99,
                                min_df=0.001)  # drop words that occur in too many or too few posts
tf = tf_vectorizer.fit_transform(s)  # term-count matrix; LDA works on term counts, not TF-IDF

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=50,  # 10 topics
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

# Fit the model and get the document-topic distribution in one step
docres = lda.fit_transform(tf)

def print_top_words(model, feature_names, n_top_words):
    # For each topic, print the n_top_words words with the largest weights
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

n_top_words = 8
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

import numpy as np
# Probability of each topic for each post (lda is already fitted, so transform is enough)
docres = lda.transform(tf)
LDA_corpus = np.array(docres)
print('Topic probabilities:\n', LDA_corpus)
# Topic-word matrix: one row per topic
# print('Topic-word matrix:\n', lda.components_)
# For each post, take the index of the most probable topic (axis=1 works along each row)
LDA_corpus_one = np.argmax(LDA_corpus, axis=1)
print('Topic of each post:', LDA_corpus_one)
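To illustrate what the argmax step does, here is a small pure-Python sketch on a made-up probability matrix. The numbers and the dominant_topics helper are hypothetical; with NumPy available, np.argmax(..., axis=1) gives the same indices.

```python
# Hypothetical document-topic probabilities: one row per post, one column per topic.
doc_topic = [
    [0.10, 0.70, 0.20],
    [0.55, 0.25, 0.20],
    [0.05, 0.15, 0.80],
]


def dominant_topics(rows):
    """Return, for each row, the index of the largest probability (the first one on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


labels = dominant_topics(doc_topic)
```

Each post is assigned to exactly one topic this way, even when two topics have nearly the same probability, so the labels are a simplification of the full distribution.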


Store each post's topic in the original table:

df["category"] = LDA_corpus_one

(7) Preparing the theme river chart

def day_num_category(year):
    # Count the posts of the given year in each category
    df_year = df[df["year"] == year]
    counts = df_year.groupby("category")["date"].count()
    # One [year, count, category] row per category
    return [[str(int(year)), counts[i], str(i)] for i in counts.index]

data = []
# group_date is not defined in this post; group_date[0] is taken to hold the years to iterate over
for j in group_date[0]:
    data.extend(day_num_category(j))
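The rows collected above have the shape [year, count, category]. A minimal standard-library sketch of the same aggregation, on made-up data, may make the structure clearer; records and theme_river_rows are hypothetical names.

```python
from collections import Counter

# Hypothetical (year, category) pairs, one per post.
records = [("2016", 0), ("2016", 0), ("2016", 3), ("2017", 3)]


def theme_river_rows(pairs):
    """Return [year, count, category_name] rows, sorted by year and then by category."""
    counts = Counter(pairs)
    return [[year, n, str(cat)] for (year, cat), n in sorted(counts.items())]


rows = theme_river_rows(records)
```

Each row gives the width of one category's band at one point in time.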

How the topics of the posts evolved from 2012 to 2020:

# Plot the topic trend over time as a theme river

from pyecharts.charts import ThemeRiver
import pyecharts.options as opts
from pyecharts.globals import ThemeType
# Series names, used as legend entries; they should match the category names
# in the third field of the y_data rows
x_data = [str(i) for i in range(0, 16)]
y_data =[['2012', 17, '0.0'],...]
(
    ThemeRiver(init_opts=opts.InitOpts(width="1400px", height="800px"))
    .add(
        series_name=x_data,
        data=y_data,
        singleaxis_opts=opts.SingleAxisOpts(
            pos_top="50", pos_bottom="50", type_="time"
        ),
    )
    .set_global_opts(
        tooltip_opts=opts.TooltipOpts(trigger="axis", axis_pointer_type="line"),
        title_opts=opts.TitleOpts(title="",pos_bottom = "80%", pos_right = "50%"),
        legend_opts=opts.LegendOpts(is_show=True),
    )
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .render("theme_river.html")
)

5. Visualizing the topic content

The visualizations above are combined into a single dashboard for display.