Public Opinion / Hot-Topic Clustering Research (3): Optimizing the Algorithm with the GTE Text Embedding Model and an Inverted Index

Article link: https://blog.csdn.net/qq_34101233/article/details/136045692

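Summary (auto-generated by CSDN): This article describes in detail how to obtain public opinion data with a Python crawler and then optimize the public opinion clustering algorithm with the GTE text embedding model and an inverted index, improving clustering accuracy and the speed of processing large datasets.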

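Previous Articles

Public Opinion / Hot-Topic Clustering Research (1): Preparing Public Opinion / Hot-Topic Data with a Python Crawler

Public Opinion / Hot-Topic Clustering Research (2): Public Opinion Clustering with word2vec, TF-IDF, and Single-Pass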

1. Current Problems

2. Optimization Approach

2.1 Replacing word2vec with the GTE Text Embedding Model

2.2 Using an Inverted Index

3. Complete Code

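1. Current Problems

As noted in the previous article, vectorizing text with TF-IDF-weighted word2vec is insensitive to word order and to sentence-level semantics, which limits the accuracy of the final clustering. In addition, the running time of Single-Pass clustering rises sharply as the dataset grows: the original method becomes very slow on datasets of more than 20,000 items, and on datasets of more than 50,000 items the estimated processing time exceeds one day, which is clearly unacceptable for practical use.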
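To make the word-order point concrete, here is a minimal sketch of a weighted average of word vectors. The three-dimensional vectors and the weights are made-up values for illustration, not output from a trained word2vec model or a real TF-IDF computation.

```python
# Toy 3-dimensional word vectors and TF-IDF-style weights (made-up values).
word_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
}
weights = {"dog": 0.5, "bites": 0.2, "man": 0.3}


def weighted_average(tokens):
    """Sentence vector = weighted mean of word vectors; token order is never used."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        w = weights[token]
        weight_sum += w
        for i, value in enumerate(word_vectors[token]):
            total[i] += w * value
    # Round to hide floating-point noise from summing in a different order.
    return [round(x / weight_sum, 6) for x in total]


vec_a = weighted_average(["dog", "bites", "man"])
vec_b = weighted_average(["man", "bites", "dog"])
print(vec_a == vec_b)  # True: both sentences get the same vector
```

The two sentences mean opposite things, yet they map to the same vector, so no similarity threshold can separate them.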

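This leads to two optimization ideas:

The first is to replace the word2vec and TF-IDF combination with a strong current text embedding model, which can improve clustering accuracy to some extent. This article uses the GTE text embedding model developed by Tongyi Lab.

The second is to improve the Single-Pass algorithm. In the previous version, every incoming item is compared against all existing cluster centroids, so the comparison time keeps growing as the number of clusters grows. The optimization therefore aims to reduce the number of clusters that must be compared. The solution is already mature in search engines: introduce an inverted index.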
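The growth in comparison cost can be seen in a simplified sketch of the full-scan Single-Pass loop. It assumes centroids are never updated and uses synthetic one-hot vectors as a worst case. The function names `cosine` and `naive_single_pass` are illustrative and not taken from the previous article's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def naive_single_pass(vectors, threshold):
    """Assign each vector to its most similar centroid, scanning every centroid."""
    centroids, labels, comparisons = [], [], 0
    for vec in vectors:
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            comparisons += 1
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_sim < threshold:
            centroids.append(vec)  # no centroid is close enough: open a new cluster
            best_idx = len(centroids) - 1
        labels.append(best_idx)
    return labels, comparisons


# Worst case: n mutually orthogonal vectors, so every item opens a new cluster.
n = 200
vectors = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
labels, comparisons = naive_single_pass(vectors, threshold=0.8)
print(len(set(labels)), comparisons)  # 200 19900
```

With n items and no merges, the loop performs n(n-1)/2 comparisons, so at 50,000 items the full scan would need about 1.25 billion similarity computations in this worst case.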

With both optimizations, the algorithm also performs well on large datasets: in a test on a 50,000-item dataset, clustering took less than two minutes.

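2. Optimization Approach

2.1 Replacing word2vec with the GTE Text Embedding Model

For details on the model, see the model page: GTE文本向量-中文-通用领域 (GTE Text Embedding, Chinese, General Domain)

Figure 1: Introduction to the GTE text embedding model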

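2.2 Using an Inverted Index

In general, a search task uses a forward index, which maps from documents to their content. An inverted index works in the opposite direction: it starts from the content and finds every document that contains it. In our setting, a forward search has to traverse all clusters, and once the number of clusters becomes large the time cost grows considerably, so we need to reduce the number of clusters that are traversed.

So we now start from the words that each text contains. If the current item shares no feature words with a cluster, we assume the two cannot be similar (there are exceptions, but this holds in most cases). We therefore take the feature words extracted from the item and compare it only against clusters that contain those words, which greatly reduces the number of clusters to traverse and significantly cuts the running time.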
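As a toy illustration of this filtering step, the sketch below builds a keyword-to-cluster mapping and looks up the candidate clusters for a new text. The cluster ids and keywords are hypothetical, hand-picked values. In the article's code the keywords come from jieba's keyword extraction.

```python
from collections import defaultdict

# keyword -> set of cluster ids whose texts contained that keyword
index = defaultdict(set)

# Hypothetical clusters with hand-picked keywords.
cluster_keywords = {
    0: ["tunnel", "accident", "Shanghai"],
    1: ["market", "vendor", "scale"],
    2: ["ferry", "queue", "detention"],
}
for cluster_id, keywords in cluster_keywords.items():
    for keyword in keywords:
        index[keyword].add(cluster_id)


def candidate_clusters(keywords):
    """Union of the posting sets of every keyword in the new text."""
    candidates = set()
    for keyword in keywords:
        candidates |= index.get(keyword, set())
    return candidates


print(candidate_clusters(["Shanghai", "tunnel", "traffic"]))  # {0}
print(candidate_clusters(["weather", "forecast"]))  # set()
```

Only the clusters returned by `candidate_clusters` need a similarity computation. An empty result means the text shares no keyword with any cluster and can open a new cluster directly.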

Figure 2: Inverted index diagram

The inverted index code is shown below:

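import jieba.analyse


class InvertedIndex:
    def __init__(self):
        self.index = {}  # keyword -> list of ids whose text contains the keyword

    def add_document(self, doc_id, sentence):
        # Extract up to 12 keywords from the sentence with jieba
        words = jieba.analyse.extract_tags(sentence, topK=12, withWeight=False, allowPOS=())
        for word in words:
            if word not in self.index:
                self.index[word] = []
            # Record each id at most once per keyword
            if doc_id not in self.search(word):
                self.index[word].append(doc_id)

    def search(self, word):
        # Return the ids recorded for this keyword, or an empty list if it is unseen
        if word in self.index:
            return self.index[word]
        else:
            return []

    def show_index(self):
        print(self.index)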

3. Complete Code

import os
import re
import json
import math
import numpy as np
import pandas as pd
import jieba.analyse

from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

import time

model_id = "damo/nlp_gte_sentence-embedding_chinese-base"
pipeline_se = pipeline(Tasks.sentence_embedding,
                       model=model_id,
                       sequence_length=512
                       )  # sequence_length is the maximum text length; the default is 128

# sentences = [  
#     "1月29日一早，上海中环外圈上中路隧道不到300米，单车撞护栏后与另外两车相撞，事故占据4号车道，3辆事故车都需要牵引，后方通行缓慢。",  
#     "在1月29日清晨，上海中环外圈上中路隧道附近发生了一起交通事故。一辆单车在撞上护栏后，又与另外两辆车发生了碰撞。这起事故占据了4号车道，导致三辆事故车辆都需要牵引清除。受此影响，后方的交通通行速度变得缓慢。",  
#     "太气了，在乎的不是这点钱】1月28日，四川德阳，一女子爆料视频：老公在农贸市场买红萝卜8.9斤，被收了15元。女子在家复称只有4.4斤，少了一半。随后女子和老公找到商贩，商贩夫妻称看错了退了7元。",  
#     "在1月29日清晨，上海中环外圈上中路隧道附近发生了一起交通事故" ,
#     "1月31日，徐闻县公安局发布通报，依法对插队砸车的奔驰车主王某作出行政拘留10日并罚款500元的处理",
#     "网传插队砸车的奔驰车主系河北高校教师 校方回应",
#     "经查，1月29日15时许，一辆白色小轿车与一辆黑色商务车在徐闻港排队购票上船期间，因通行顺序问题引发纠纷。黑色商务车乘客王某(男，40岁，河北省人) 下车站到白色小轿车前，拦车辱骂并用拳头打砸白色小轿车引擎盖，导致引擎盖凹陷。随后，双方各自驾车前行离开。",
#     "#男子当小三破坏军婚获刑10个月#"
# ]  

sentences = []

# Load one text per line, skipping blank lines and very short lines
with open('./testdata.txt','r',encoding='utf-8') as file:
    # next(file)
    for sen in file:
        # print(sen)
        if sen not in ('','\n'):
            if len(sen)>5:
                # print(f"{sen}, sentence length {len(sen)}")
                sentences.append(sen)

# Compute the embedding vector of a sentence
def cal_sentence2vec(sentence):
    inputs2 = {
        "source_sentence": [sentence]
    }
    result = pipeline_se(input=inputs2)
    return result['text_embedding']

# Compute the cosine similarity between two sentence vectors
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

class InvertedIndex:  
    def __init__(self):  
        self.index = {}
  
    def add_document(self, doc_id, sentence): 
        words = jieba.analyse.extract_tags(sentence, topK=12, withWeight=False, allowPOS=())
        for word in words:
            if word not in self.index:  
                self.index[word] = [] 
            if doc_id not in self.search(word):
                self.index[word].append(doc_id)  
  
    def search(self, word):  
        if word in self.index:  
            return self.index[word]  
        else:  
            return []

    def show_index(self):
        print(self.index)

# Incremental Single-Pass clustering
class SinglePassClusterV2:
    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # cluster centroids
        self.count = []  # number of items in each cluster
        self.Index = InvertedIndex()  # inverted index: keyword -> cluster ids

    def assign_cluster(self, vector, sentence):
        # The first item always opens cluster 0
        if not self.centroids:
            self.centroids.append(vector)
            self.count.append(1)
            self.Index.add_document(0, sentence)
            return 0

        # Build the candidate set: clusters sharing at least one keyword with the sentence
        candidate_set = set()
        words = jieba.analyse.extract_tags(sentence, topK=12, withWeight=False, allowPOS=())
        for word in words:
            candidate_set.update(self.Index.search(word))
        candidate_list = list(candidate_set)

        max_sim = -1
        cluster_idx = -1

        if candidate_list != []:
            # Find the most similar centroid among the candidates only
            for idx in candidate_list:
                sim = cosine_similarity(vector, self.centroids[idx])
                if sim > max_sim:
                    max_sim = sim
                    cluster_idx = idx

            if max_sim < self.threshold:
                # No candidate is similar enough: open a new cluster
                cluster_idx = len(self.centroids)
                self.centroids.append(vector)
                self.Index.add_document(cluster_idx, sentence)
                self.count.append(1)
            else:
                # Update the centroid as a weighted average (0.9 old centroid, 0.1 new vector)
                self.centroids[cluster_idx] = 0.1*vector + 0.9*self.centroids[cluster_idx]
                self.Index.add_document(cluster_idx, sentence)
                self.count[cluster_idx] += 1
        else:
            # No cluster shares a keyword with the sentence: open a new cluster
            cluster_idx = len(self.centroids)
            self.centroids.append(vector)
            self.Index.add_document(cluster_idx, sentence)
            self.count.append(1)

        return cluster_idx

    def fit(self, doc_vectors, sentences):
        clusters = []
        for vector, sentence in zip(doc_vectors, sentences):
            start_time = time.perf_counter()  # start time (high resolution)
            cluster_id = self.assign_cluster(vector, sentence)
            end_time = time.perf_counter()  # end time (high resolution)
            # Log the time of a single assignment when the cluster id is a non-zero multiple of 2000
            if cluster_id%2000==0 and cluster_id!=0:
                print("Elapsed time: ", end_time - start_time, "s", '| current cluster id:', cluster_id)
            clusters.append(cluster_id)
        return clusters,self.count
    
print('Vectorization started')
start_time = time.perf_counter()  # start time (high resolution)
doc_vectors = np.vstack([cal_sentence2vec(doc) for doc in sentences])
end_time = time.perf_counter()  # end time (high resolution)
print("Vectorization time: ", end_time - start_time)

# Cluster and output the results
print('Clustering started')
sp_cluster = SinglePassClusterV2(threshold=0.8)
start_time = time.perf_counter()  # start time (high resolution)
clusters, count = sp_cluster.fit(doc_vectors, sentences)
end_time = time.perf_counter()  # end time (high resolution)
print("Clustering time: ", end_time - start_time)

# # Output grouped by topic id
# with open('聚类结果.txt','w',encoding = 'utf-8') as file:
#     # Write the clustering result for every document
#     for i in range(0,max(clusters)+1):
#         # print(f"-----------Topic: {i}-------------")
#         file.write(f"-----------Topic: {i}-------------\n")
#         j = 0
#         for doc, cluster_id in zip(sentences,  clusters):
#             if cluster_id==i:
#                 # print(f"[{j}]--> {doc}")
#                 file.write(f"[{j}]--> {doc}")
#                 j+=1
#         # print("\n")
#         file.write("\n")

# # Output ordered by cluster size, largest first
# with open('聚类结果.txt','w',encoding = 'utf-8') as file:
#     # Write the clustering result for every document
#     for c in range(200,0,-1):
#         for i in range(0,max(clusters)+1):
#             if count[i]==c:
#                 # print(f"-----------Topic: {i}-------------")
#                 file.write(f"-----------Topic: {i}-------------\n")
#                 j = 0
#                 for doc, cluster_id in zip(sentences,  clusters):
#                     if cluster_id==i:
#                         # print(f"[{j}]--> {doc}")
#                         file.write(f"[{j}]--> {doc}")
#                         j+=1
#                 # print("\n")
#                 file.write("\n")

This concludes the public opinion clustering project.