Building Text TF-IDF Vectors from Scratch and Cosine-Similarity-Based Text Classification

Latest recommended article published on 2025-08-19 19:08:42

License: CC 4.0 BY-SA

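I. Task Requirements

1. The database holds N rows, each representing one article, with the attributes [id, title, summuary, content]. Fetch the data from the MySQL database and build a DataFrame with two columns, id and content. id is the id from the table; content is filled from the table's columns in order of importance, content > summuary > title.

2. Segment the content in the DataFrame into words, remove stop words, and build a TF-IDF vector for each article.

3. For a new article, build its TF-IDF vector, compare it with the existing articles by similarity, and return the ids of the five most similar articles.

Definition of a text's TF-IDF vector: an upgraded version of the text's one-hot encoding in which each position holds the TF-IDF value of the corresponding word. In essence it is still a sparse vector.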
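Written out, the definition can be stated as follows. This uses the same conventions as the handwritten implementation later in this post (raw counts normalized by document length, and an unsmoothed logarithmic IDF). The symbols $n_{t,d}$ (count of word $t$ in document $d$), $|D|$ (number of documents), $\mathrm{df}(t)$ (number of documents containing $t$) and $V$ (vocabulary size) are notation introduced here for illustration only.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{|D|}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

\mathbf{v}_d = \bigl(\mathrm{tfidf}(t_1,d),\ \mathrm{tfidf}(t_2,d),\ \dots,\ \mathrm{tfidf}(t_V,d)\bigr)
```

Most entries of $\mathbf{v}_d$ are zero, because a single article contains only a small part of the vocabulary.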

II. Task Workflow

1. Fetching the database content and building the DataFrame

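import pymysql
import pandas as pd

# Placeholder: fill in your own connection settings
# (newer PyMySQL releases expect keyword arguments such as host=, user=, password=, database=).
db = pymysql.connect('fill in the connection settings')
cursor = db.cursor()
# SQL query; the id bound is passed as a parameter instead of being formatted into the string
sql = "SELECT * FROM item WHERE item_id < %s"
# Execute the SQL statement
cursor.execute(sql, (1811151105292680200222,))
# Fetch all rows
results = cursor.fetchall()
db.close()

# Collect the rows in a list and build the DataFrame once.
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
records = []
for row in results:
    records.append({
        'item_id': row[0],
        'title': row[2],
        'summuary': row[4],
        'content': row[8]
    })
df = pd.DataFrame(records)

# Choose the text of each article by priority: content > summuary > title
train_records = []
for index, row in df.iterrows():
    if row['content'] is not None and len(row['content']) > 10:
        text = row['content']
    elif row['summuary'] is not None and len(row['summuary']) > 5:
        text = row['summuary']
    else:
        text = row['title']
    train_records.append({'item_id': row['item_id'], 'content': text})
train = pd.DataFrame(train_records)
train = train[:4698]
train.to_csv('corpora.csv', encoding='utf-8')

Issues encountered here: 1. There is no need to define column names in advance; the dictionary keys become the DataFrame's column names. 2. When rows are added one at a time with append(), ignore_index=True must be passed. That method was removed in pandas 2.0, which is why the code above collects the rows in a list and builds the DataFrame in one call. 3. When a DataFrame is built from a single dictionary, the values must be lists; with plain scalar values pd.DataFrame raises "ValueError: If using all scalar values, you must pass an index". A list of dictionaries avoids the problem. 4. Taking the first n rows of a DataFrame works the same way as slicing a list.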
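The column-priority rule can also be isolated in a small function, which makes it easy to check without a database. The following is a minimal sketch: the function names `pick_content` and `build_train` and the sample rows are made up for illustration, while the length thresholds 10 and 5 are the ones used in the code above.

```python
import pandas as pd

def pick_content(title, summuary, content):
    """Return the best available text: content > summuary > title."""
    if content is not None and len(content) > 10:
        return content
    if summuary is not None and len(summuary) > 5:
        return summuary
    return title

def build_train(rows):
    """rows: iterable of (item_id, title, summuary, content) tuples."""
    records = [
        {'item_id': item_id, 'content': pick_content(title, summuary, content)}
        for item_id, title, summuary, content in rows
    ]
    return pd.DataFrame(records, columns=['item_id', 'content'])

# Made-up rows standing in for the output of cursor.fetchall()
rows = [
    (1, 'Title one', 'A short summary', 'This is a long enough article body.'),
    (2, 'Title two', 'Summary used here', None),
    (3, 'Title three', None, 'short'),
]
train = build_train(rows)
print(train)
```

Here article 1 keeps its content, article 2 falls back to the summary, and article 3 falls back to the title because its content is too short and its summary is missing.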

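2. Generating TF-IDF

At first I used gensim's TfidfModel, but its output is not a matrix over the whole vocabulary. gensim returns each document as a sparse bag-of-words list that only covers the words occurring in that document, whereas the task requires a vector over the full vocabulary for every article. I then tried the TF-IDF module in sklearn and ran into a problem during use (my impression at the time was that sklearn did not handle Chinese text well). As for jieba's built-in TF-IDF, so far I only know how to use it to extract the top n keywords, not how to obtain the TF-IDF vector of a given text. So I had to write it by hand. (P.S., edited 2018-12-28: sklearn solves all of the problems above, so the handwritten version can be fully replaced by sklearn, and the sklearn version is what I ended up using in practice.)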
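For reference, the sklearn route mentioned in the P.S. might look roughly like the sketch below. This is not the code used in the original project; it assumes the texts have already been segmented into token lists, and the sample documents are made up. Note also that sklearn's defaults differ from the handwritten version: rows are documents rather than columns, the IDF is smoothed, and each row is L2-normalized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already segmented documents
docs = [
    ['机器', '学习', '文本', '分类'],
    ['文本', '相似度', '计算'],
    ['机器', '学习', '模型'],
]

# A callable analyzer makes the vectorizer use the token lists exactly as given,
# so no English-oriented tokenization is applied to the Chinese text.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf = vectorizer.fit_transform(docs)   # sparse matrix, shape (n_docs, n_words)

# Words that were not seen during fitting are ignored when transforming a new text
new_vec = vectorizer.transform([['文本', '分类', '未登录词']])
print(tfidf.shape, new_vec.shape)
```

The result is a SciPy sparse matrix, so only the non-zero entries are stored.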

Reference article: https://mlln.cn/2018/08/18/%E6%96%87%E6%9C%AC%E5%90%91%E9%87%8F%E7%B3%BB%E5%88%97-%E5%A6%82%E4%BD%95%E5%9F%BA%E4%BA%8E%E8%AF%8D%E9%A2%91%E7%9F%A9%E9%98%B5%E5%92%8CTF-IDF%E6%9D%83%E9%87%8D%E6%9E%84%E5%BB%BA%E8%AF%8D%E5%90%91%E9%87%8F/

1. First, build the dictionary and the term-count matrix

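from collections import Counter
from itertools import chain

import numpy as np


def word_matrix(documents):
    '''Compute the term-count matrix. Each document is a list of words.'''
    docs = [list(d) for d in documents]
    # Collect all distinct words (sorted so that the ids are reproducible)
    words = sorted(set(chain(*docs)))
    # Map each word to an id
    dictionary = dict(zip(words, range(len(words))))
    # Create an empty matrix: one row per word, one column per document
    matrix = np.zeros((len(words), len(docs)), dtype='float32')
    # Count the words document by document
    for col, d in enumerate(docs):
        count = Counter(d)
        for word in count:
            # The word's id is its row index in the matrix
            word_id = dictionary[word]
            # Store the count in the matrix
            matrix[word_id, col] = count[word]

    return matrix, dictionary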

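This is the most important step of the whole task: it produces the dictionary and the term-count matrix. The dictionary maps every word to a unique id. It also defines which words are known: when a new text has been segmented, words that are not in the dictionary have to be dropped. Each row of the term-count matrix corresponds to one word and gives its count in every article; each column corresponds to one article and gives the count of every word in that article.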
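A toy example may make the layout of the dictionary and the matrix easier to see. It is a minimal sketch with made-up English tokens, following the same rows-are-words, columns-are-documents convention described above.

```python
from collections import Counter
import numpy as np

docs = [
    ['apple', 'banana', 'apple'],
    ['banana', 'cherry'],
]

# Sorted vocabulary so that the word ids are reproducible
words = sorted(set(w for d in docs for w in d))
dictionary = {w: i for i, w in enumerate(words)}

# Rows are words, columns are documents
matrix = np.zeros((len(words), len(docs)), dtype='float32')
for col, d in enumerate(docs):
    for word, n in Counter(d).items():
        matrix[dictionary[word], col] = n

print(dictionary)
print(matrix)

# A new text keeps only the words that exist in the dictionary
new_doc = ['apple', 'durian']
known = [w for w in new_doc if w in dictionary]
print(known)
```

The row for 'apple' reads 2, 0 (twice in the first document, absent from the second), and 'durian' is dropped from the new text because it has no id.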

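2. Compute each word's TF-IDF value from the term-count matrix

# Compute TF
def tf(matrix):
    # Total number of words in each document (column sums)
    sm = np.sum(matrix, axis=0)
    # Divide each word's count by the total word count of its document.
    # A document with no words has a zero column sum and would produce NaN here.
    return matrix / sm


# Compute IDF = log(|D| / df)
# |D|: total number of documents in the corpus. df (the denominator): number of documents containing the word
def idf(matrix):
    '''Compute IDF'''
    # Total number of documents
    D = matrix.shape[1]
    # Number of documents containing each word
    j = np.sum(matrix > 0, axis=1)
    return np.log(D / j)


# Compute the TF-IDF value of every word
def tf_idf(matrix):
    return tf(matrix) * idf(matrix).reshape(matrix.shape[0], 1)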

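This step relies on element-wise array arithmetic. One point worth noting: when a 2-D array is divided by a 1-D array, every value in a column of the 2-D array is divided by the value of the 1-D array at that column's position (the length of the 1-D array must equal the number of columns). I worked on complex networks throughout graduate school and used matrices all the time, but this was the first time I really felt how convenient matrix operations are.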
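The broadcasting behaviour described above can be checked on a small array. The sketch below uses made-up counts and the same formulas as the tf, idf and tf_idf functions; the variable names are chosen for illustration.

```python
import numpy as np

# Term-count matrix: 3 words (rows) x 2 documents (columns)
matrix = np.array([[2., 0.],
                   [1., 1.],
                   [0., 1.]])

col_totals = matrix.sum(axis=0)            # shape (2,): total words per document
tf = matrix / col_totals                   # column j is divided by col_totals[j]

doc_freq = (matrix > 0).sum(axis=1)        # shape (3,): documents containing each word
idf = np.log(matrix.shape[1] / doc_freq)   # shape (3,): one value per word

# idf has one value per row, so it is reshaped to (3, 1) to be applied row by row
tfidf = tf * idf.reshape(-1, 1)
print(tf)
print(tfidf)
```

Each column of tf sums to 1. The second word occurs in both documents, so its IDF is log(2/2) = 0 and its whole TF-IDF row is zero, which is the intended effect for words that do not help to tell documents apart.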

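The result is the set of TF-IDF vectors we need, returned as an array in which each column is the vector of one article.

3. Saving the generated model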

import pickle

# Save tfidf_matrix
tfidf_matrix = tf_idf(matrix)
np.save('./doc/numpy.npy', tfidf_matrix)
# Save dictionary
with open('./doc/dictionary.pickle', 'wb') as list3_file:
    pickle.dump(dictionary, list3_file)

Two storage formats are used here, npy and pickle. Questions about saving and loading data are to be continued.
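As a quick check that both formats give back exactly what was saved, a round trip can be run in a temporary directory. This is a minimal sketch with a made-up 2 x 2 matrix and dictionary; the file names are placeholders and not the paths used in the post.

```python
import os
import pickle
import tempfile

import numpy as np

tfidf_matrix = np.array([[0.5, 0.0], [0.0, 0.25]], dtype='float32')
dictionary = {'apple': 0, 'banana': 1}

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, 'tfidf.npy')
    dict_path = os.path.join(tmp, 'dictionary.pickle')

    np.save(matrix_path, tfidf_matrix)     # binary .npy keeps dtype and shape
    with open(dict_path, 'wb') as f:
        pickle.dump(dictionary, f)

    loaded_matrix = np.load(matrix_path)
    with open(dict_path, 'rb') as f:
        loaded_dict = pickle.load(f)

print(loaded_matrix.dtype, loaded_matrix.shape, loaded_dict)
```

The matrix and the dictionary must always be saved and loaded as a pair, since the dictionary is what ties each row of the matrix to a word.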

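Problem encountered: at first I used 10,000 articles and saved the matrix as a txt file, which turned out to be 30 GB, and the computer froze while computing tf. I later reduced the corpus to 5,000 articles.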
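The size of a dense words x documents matrix can be estimated before building it. The sketch below is only a back-of-the-envelope calculation: the post does not state the vocabulary size, so the value of `n_words` is a hypothetical figure chosen to show the order of magnitude, and `dense_matrix_bytes` is a helper name made up here.

```python
def dense_matrix_bytes(n_words, n_docs, bytes_per_value=4):
    """Size in bytes of a dense n_words x n_docs array (float32 uses 4 bytes per value)."""
    return n_words * n_docs * bytes_per_value

# Hypothetical vocabulary size, for illustration only
n_words = 200_000
for n_docs in (5_000, 10_000):
    gib = dense_matrix_bytes(n_words, n_docs) / 1024 ** 3
    print(f'{n_docs} docs: {gib:.1f} GiB as float32')
```

Memory grows linearly with the number of documents, and a text dump generally needs more bytes per value than the binary representation, because every number is written out in decimal. Since most entries are zero, a sparse representation is the usual way around this.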

4. Loading the model

import pickle

import numpy as np

# Load dictionary
with open('./doc/dictionary.pickle', 'rb') as f:
    word_dict = pickle.load(f)

# Load the idf dictionary: one "word idf" pair per line, separated by a space
dic = {}
with open('./doc/newidf.txt', 'r', encoding='utf-8') as f:
    for line in f:
        b = line.strip().split(' ')
        # Skip blank or malformed lines
        if len(b) == 2:
            dic[b[0]] = b[1]
# Load matrix
matrix = np.load('./doc/numpy.npy')

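5. Generating the TF-IDF vector of a new article

from collections import Counter

# Create a 1-D vector whose length is the dictionary size, then fill in the positions
# of the words that occur in the new article. The custom idf model is used here.
# readfile and chinese_word_cut are the author's own helper functions (not shown in this post).
newdoc = np.zeros(len(word_dict))
data = readfile('./doc/re0.txt')
doc = chinese_word_cut(data)
k = len(doc)  # total number of words
result = Counter(doc)  # count every word once instead of calling doc.count() repeatedly
for word, count in result.items():
    # Words that are not in the dictionary have no position in the vector
    if word not in word_dict:
        continue
    tf_value = count / k
    # Words missing from the idf dictionary get idf 0
    idf_value = float(dic.get(word, 0.0))
    newdoc[word_dict[word]] = tf_value * idf_value
print(np.sum(newdoc))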

How the custom idf model is generated was covered in an earlier blog post.
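The same computation can be wrapped in a function so that it is easy to try on small inputs. This is a minimal sketch under stated assumptions: `vectorize` and `idf_table` are names introduced here, and the dictionary and idf values are made up. The idf values are strings because, as in the post, they are read from a text file.

```python
from collections import Counter
import numpy as np

def vectorize(tokens, word_dict, idf_table):
    """TF-IDF vector of one segmented text over the vocabulary in word_dict."""
    vec = np.zeros(len(word_dict))
    total = len(tokens)
    if total == 0:
        return vec
    for word, count in Counter(tokens).items():
        if word in word_dict:                      # skip out-of-vocabulary words
            vec[word_dict[word]] = (count / total) * float(idf_table.get(word, 0.0))
    return vec

# Made-up dictionary and idf values for illustration
word_dict = {'apple': 0, 'banana': 1, 'cherry': 2}
idf_table = {'apple': '1.0', 'banana': '0.5'}

vec = vectorize(['apple', 'apple', 'cherry', 'durian'], word_dict, idf_table)
print(vec)
```

In this example 'apple' makes up 2 of the 4 tokens and has idf 1.0, so its entry is 0.5; 'cherry' has no idf entry and stays 0; 'durian' is not in the dictionary and is ignored.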

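6. Computing the cosine similarity between articles

import heapq

# Cosine similarity between the new article and an existing article
def sim(array1, array2):
    num = float(np.matmul(array1, array2))
    s = np.linalg.norm(array1) * np.linalg.norm(array2)
    # A zero vector has no direction; treat its similarity as 0
    return num / s if s else 0.0

matrix = np.load('./doc/numpy.npy')
print(matrix.shape[1])
# doc2id[i] is the similarity between the new article and the i-th article
doc2id = []
for i in range(matrix.shape[1]):
    doc2id.append(sim(newdoc, matrix[:, i]))
# Indices of the five largest similarities, as required by the task.
# Ranking the indices directly avoids list.index(), which returns the
# first match only and gives repeated indices when two scores are equal.
max_num_index_list = heapq.nlargest(5, range(len(doc2id)), key=doc2id.__getitem__)
# These are column indices of the matrix, not article ids
print(max_num_index_list)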

Cosine similarity: https://www.jianshu.com/p/ec834ec0d51f
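With NumPy the similarities to all existing articles can also be computed in one step instead of a Python loop. The following is a sketch rather than the code used in the post: `top_k_similar` is a name made up here, and the toy matrix follows the same rows-are-words, columns-are-documents layout.

```python
import numpy as np

def top_k_similar(new_vec, matrix, k=5):
    """Indices of the k columns of matrix most similar to new_vec (cosine similarity)."""
    dots = new_vec @ matrix                                   # one dot product per column
    norms = np.linalg.norm(new_vec) * np.linalg.norm(matrix, axis=0)
    # Columns with zero norm keep similarity 0 instead of causing a division by zero
    sims = np.divide(dots, norms, out=np.zeros_like(dots, dtype=float), where=norms > 0)
    order = np.argsort(-sims)[:k]                             # highest similarity first
    return order, sims[order]

# Toy data: 3 words x 4 documents
matrix = np.array([[1., 0., 1., 0.],
                   [0., 1., 1., 0.],
                   [0., 0., 0., 0.]])
new_vec = np.array([1., 0., 0.])

idx, scores = top_k_similar(new_vec, matrix, k=2)
print(idx, scores)
```

The returned values are column indices. Assuming the columns are in the same order as the rows of the train DataFrame, the article ids can then be looked up with something like `train['item_id'].iloc[idx]`.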