Text Vectorization | An Overview and Hands-On Summary of 8 Vectorization Tools

Latest recommended article published on 2025-03-24 16:27:07

大耳朵爱学习

Article link: https://blog.csdn.net/2401_85379281/article/details/142379498


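With the arrival of BERT's dynamic (contextual) embeddings, text can be fed directly into BERT and encoded into a sentence vector, but this approach does not model long texts effectively. Another common approach is to sum pretrained word vectors to obtain a text vector.

In practice, there are other unsupervised vectorization methods as well, such as the LDA model introduced in the previous article, along with one-hot, weighted one-hot, and topic vector representations.

For anyone working in natural language processing, being familiar with vectorization is essential. sklearn and gensim both provide very convenient vectorization utilities. Based on practical experience, this article introduces some of the vectorization tools they offer, as a basis for discussion.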

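Tool	Purpose
sklearn-CountVectorizer	one-hot (count) vector representation
gensim-Dictionary	one-hot (count) vector representation
sklearn-TfidfVectorizer	TF-IDF vector representation
gensim-TfidfModel	TF-IDF vector representation
sklearn-HashingVectorizer	hash vector representation
gensim-LsiModel	topic vector representation
sklearn-PCA	reduced-dimension vector representation
gensim-dense2vec	conversion between dense and sparse vectors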

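1. sklearn-CountVectorizer: one-hot vector representation

The one-hot representation is the simplest and crudest vector representation: the vector dimension equals the vocabulary size, and each word's frequency is used as its weight. It is intuitive and interpretable, but with a large vocabulary it easily runs into the curse of dimensionality. The CountVectorizer class converts the words in a text into a term-frequency matrix.

The sklearn CountVectorizer module provides a count-based one-hot representation: fit_transform counts how often each word occurs, get_feature_names_out() returns all the terms in the bag-of-words vocabulary, and toarray() shows the resulting term-frequency matrix.

Usage: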

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
## Each dimension corresponds to one word
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the',
       'third', 'this'], ...)
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
## Use the fitted vectorizer to produce the one-hot vector of a new input text
>>> vectorizer.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
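
The matrix above holds counts rather than strict 0/1 indicators. As a minimal sketch of the difference, the snippet below compares the default behaviour with `binary=True`, which records only whether a word is present. The variable names (`count_vec`, `binary_vec`, `col`) are illustrative and not from the original article.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Default behaviour: each cell holds the raw term count.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus).toarray()

# binary=True clips every non-zero count to 1 (presence/absence only).
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(corpus).toarray()

# Column index of the word "second" in the learned vocabulary.
col = count_vec.vocabulary_["second"]
print(counts[1, col], presence[1, col])
```

In the second document the word "second" appears twice, so the count matrix and the binary matrix differ in that cell.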

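2. gensim-Dictionary: one-hot vector representation

Like sklearn, gensim also provides a one-hot representation. However, gensim produces sparse vectors, which need to be converted further with matutils. The implementation is as follows: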

>>> import gensim
>>> import gensim.downloader as api
>>> from gensim.corpora import Dictionary
>>> dataset = api.load("text8")
>>> len(list(dataset))
1701
>>> list(dataset)[0][:10]
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
## Build the dictionary (word <-> id mapping) from the dataset
>>> dct = Dictionary(dataset)
## Use dct to give each document in the corpus a one-hot representation; the result is a sparse vector
>>> corpus = [dct.doc2bow(line) for line in dataset]
## (0, 184) means the word with id 0 occurs 184 times
>>> corpus[0][:8]
[(0, 184), (1, 1), (2, 1), (3, 3), (4, 7), (5, 1), (6, 1), (7, 1)]
## Use matutils to obtain the full one-hot vector
>>> gensim.matutils.sparse2full(corpus[0], len(dct))
array([184.,   1.,   1., ...,   0.,   0.,   0.], dtype=float32)

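3. sklearn-TfidfVectorizer: TF-IDF vector representation

The one-hot representation does not take the relative importance of words into account. For that reason we usually also compute each word's weight in context, that is, its TF-IDF value, and build a TF-IDF vector from it.

TF-IDF (term frequency-inverse document frequency) is a text weighting scheme based on a statistical idea: a word's importance is computed from how often it occurs in a text and from its document frequency across the whole corpus, which filters out words that are common but unimportant.

sklearn's TfidfVectorizer provides a quick implementation: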

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third','this'], dtype=object)
>>> X.toarray()
array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])
>>> vectorizer.transform(['Something completely new.']).toarray()
array([[0., 0., 0., 0., 0., 0., 0., 0., 0.]])
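
To make the weighting concrete, here is a sketch that recomputes the matrix by hand and compares it with TfidfVectorizer. It assumes the vectorizer's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Raw term counts (tf), one row per document.
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
df = (counts > 0).sum(axis=0)

# Smoothed idf, as used by TfidfVectorizer with its default parameters.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# tf * idf, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```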

4. gensim-TfidfModel: TF-IDF vector representation

Like sklearn, gensim also provides a TF-IDF representation, TfidfModel, which is mainly used for feature extraction. Usage:

>>> import gensim
>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>> dataset = api.load("text8")
>>> len(list(dataset))
1701
## The data is a collection of lists; each list is the bag of words of one document
>>> list(dataset)[0][:10]
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
## Build the dictionary (word <-> id mapping) from the dataset
>>> dct = Dictionary(dataset)
## Use dct to give each document in the corpus a one-hot representation; the result is a sparse vector
>>> corpus = [dct.doc2bow(line) for line in dataset]
## (0, 184) means the word with id 0 occurs 184 times
>>> corpus[0][:10]
[(0, 184), (1, 1), (2, 1), (3, 3), (4, 7), (5, 1), (6, 1), (7, 1), (8, 12), (9, 2)]
## Fit a TF-IDF model and convert a document to its tfidf representation
>>> model = TfidfModel(corpus)
>>> vector = model[corpus[0]]
## (1, 0.006704047545684609) means the word with id 1 has a tfidf weight of 0.006704047545684609
>>> vector[:6]
[(1, 0.006704047545684609), (2, 0.0030255603220721273), (3, 0.003156168449586299), (4, 0.0036673470201144674), (5, 0.004575122435127926), (6, 0.0028052608258295926)]
## Use matutils to obtain the dense TF-IDF vector
>>> gensim.matutils.sparse2full(vector, len(dct))
array([0.        , 0.00670405, 0.00302556, ..., 0.        , 0.        ,
       0.        ], dtype=float32)

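5. sklearn-HashingVectorizer: hash vector representation

When the vocabulary of a corpus is very large, say hundreds of thousands or millions of terms, an ordinary CountVectorizer uses a lot of memory. The hashing trick, combined with storing the encoded result as a sparse matrix, handles this well. HashingVectorizer drops the vocabulary mapping altogether (both CountVectorizer and TfidfVectorizer keep a word2id mapping in memory).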

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> hv = HashingVectorizer(n_features=10)
>>> X = hv.transform(corpus)
>>> X.toarray()
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -0.57735027,  0.57735027, -0.57735027,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.81649658,  0.        ,  0.40824829, -0.40824829,  0.        ],
       [ 0.        ,  0.5       ,  0.        ,  0.        , -0.5       ,
        -0.5       ,  0.        ,  0.        , -0.5       ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -0.57735027,  0.57735027, -0.57735027,  0.        ]])
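
The signed, normalized values above come from HashingVectorizer's defaults. As a sketch for seeing the hashing trick more plainly, the snippet below turns off normalization and sign alternation so that each cell is a plain count of the tokens hashed into that bucket. The choice of `n_features=16` is arbitrary and only for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# norm=None and alternate_sign=False are set here so that each cell is a
# plain bucket count; the defaults are norm="l2" and alternate_sign=True.
hv = HashingVectorizer(n_features=16, norm=None, alternate_sign=False)

# No fit step is needed: the column of a token is computed by hashing it,
# so nothing has to be learned or stored.
X = hv.transform(corpus)

# Different tokens may collide in one bucket, but every token still lands
# somewhere, so each row sums to the number of tokens in that document.
row_totals = X.toarray().sum(axis=1)
print(X.shape, row_totals)
```

Because no vocabulary is stored, there is also no way to map a column index back to a word, which is the price paid for the memory savings.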

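6. gensim-LsiModel: topic vector representation

For high-dimensional vectors, the LSI model can be used. LSA (latent semantic analysis), also known as LSI (latent semantic indexing), is an indexing and retrieval method proposed by Scott Deerwester, Susan T. Dumais and colleagues in 1990.

LSA maps words and documents into a latent semantic space. Representing a document in this space amounts to a singular value decomposition (SVD) followed by dimensionality reduction. Compared with the traditional vector space, the latent semantic space has fewer dimensions and clearer semantic relationships, which removes some of the "noise" in the original vector space.

gensim provides a quick interface for this.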

>>> import numpy as np
>>> import gensim
>>> import gensim.downloader as api
>>> from gensim.corpora import Dictionary
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)
>>> corpus = [dct.doc2bow(line) for line in dataset]
## The one-hot sparse representation of the corpus
>>> corpus[0][:10]
[(0, 184), (1, 1), (2, 1), (3, 3), (4, 7), (5, 1), (6, 1), (7, 1), (8, 12), (9, 2)]
>>> from gensim.models import LsiModel
>>> model = LsiModel(corpus, id2word=dct, num_topics=200)
>>> vectorized_corpus = model[corpus]
## The LSI sparse representation of the corpus
>>> vectorized_corpus[0][:10]
[(0, 845.0072222053356), (1, 183.94615498027318), (2, 82.20552099282916), (3, 71.3810602349439), (4, 10.293558851876867), (5, -67.18668365319287), (6, 56.10662446207876), (7, -77.90946372354908), (8, 6.5202521849796655), (9, 69.65334514740843)]
## Convert the LSI sparse representation into a dense one
>>> vecs = np.array([gensim.matutils.sparse2full(doc, model.num_topics) for doc in vectorized_corpus])
>>> vecs.shape
(1701, 200)
## Inspect one of the vectors
>>> vecs[0]
array([ 8.45007202e+02,  1.83946152e+02,  8.22055206e+01,  7.13810577e+01,
        1.02935591e+01, -6.71866837e+01,  5.61066246e+01, -7.79094620e+01,
        6.52025223e+00,  6.96533432e+01, -1.86651154e+01, -1.74752796e+00,
        ...                  ...
       
       -1.33720660e+00,  1.16774054e+01, -1.43264065e+01, -3.71589518e+00,
        1.14572453e+00,  3.82190704e+00,  6.41467988e-01, -1.08295698e+01,
        1.56089294e+00, -1.00550718e+01,  1.39018860e+01, -1.04908431e+00],
      dtype=float32)

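7. sklearn-PCA: reduced-dimension vector representation

sklearn provides a rich set of dimensionality reduction interfaces, including PCA and NMF, all of which are useful for reducing feature dimensions.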

>>> import numpy as np
>>> import gensim
>>> from sklearn.decomposition import PCA
>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)
>>> corpus = [dct.doc2bow(line) for line in dataset]
## Fit a TF-IDF model and convert the documents to their tfidf representation
>>> model = TfidfModel(corpus)
>>> vecs = np.array([gensim.matutils.sparse2full(model[corpus[i]], len(dct)) for i in range(len(corpus))])
>>> vecs
array([[845.0072  , 183.94615 , -82.20549 , ...,   0.      ,   0.      ,
          0.      ],
       [985.2497  , 151.77135 ,  19.771235, ...,   0.      ,   0.      ,
          0.      ],
       [950.9654  , 106.46165 , -95.182686, ...,   0.      ,   0.      ,
          0.      ],
       ...,
       [970.2823  ,  11.336755,   4.147303, ...,   0.      ,   0.      ,
          0.      ],
       [915.1458  ,  30.193762, -30.925209, ...,   0.      ,   0.      ,
          0.      ],
       [536.71436 ,  25.485744, -17.785593, ...,   0.      ,   0.      ,
          0.      ]], dtype=float32)
## The current shape of vecs is (1701, 253854)
>>> vecs.shape
(1701, 253854)
## Reduce the dimensionality with PCA, setting the target dimension to 200
>>> pca = PCA(n_components=200)
>>> new_vec = pca.fit_transform(vecs)
>>> new_vec
array([[-184.58066   ,  153.96861   ,  -49.6336    , ...,    1.8536146 ,
          19.330368  ,    1.3660431 ],
       [-151.02054   ,    7.800228  ,   31.350155  , ...,    3.245814  ,
          -0.29225713,   -0.4923439 ],
       [-105.97502   ,  -27.774136  , -113.11984   , ...,    2.6157355 ,
           2.400637  ,   -4.75448   ],
       ...,
       [ -10.722229  ,    7.9685206 ,   11.58477   , ...,   -2.8036292 ,
          -4.413674  ,   -2.474709  ],
       [ -30.114237  ,   54.270386  ,  -20.074902  , ...,    3.5624058 ,
           4.967129  ,   -5.3770394 ],
       [ -28.956291  ,  279.285     ,    9.95085   , ...,   -7.473279  ,
           1.5352868 ,   79.82677   ]], dtype=float32)
## After reduction, the shape is (1701, 200)
>>> new_vec.shape
(1701, 200)
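
The same reduction can be tried on a much smaller scale. The sketch below is an illustrative variant, not the article's text8 pipeline: it applies PCA to the TF-IDF matrix of the four-sentence corpus used earlier and reduces it to 2 dimensions. The matrix is converted to a dense array first, which PCA accepts in any sklearn version.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Dense TF-IDF matrix of shape (4 documents, 9 terms).
tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

# Keep the 2 directions of largest variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(tfidf)

print(reduced.shape)
# Fraction of the total variance retained by the 2 components.
print(pca.explained_variance_ratio_.sum())
```

The first and fourth sentences contain the same words, so their TF-IDF rows are identical and they land on the same point after reduction.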

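8. gensim-dense2vec: converting between dense and sparse vectors

Vector representations come in two forms, dense and sparse. In text processing it is common to build a model with gensim and then pass the result to sklearn for classification, clustering, and similar tasks.

sklearn, however, works with dense arrays rather than gensim's sparse (id, value) lists, so converting between the two is important. gensim provides this functionality.

Reference: https://radimrehurek.com/gensim/matutils.html

1) Convert a sparse vector to a dense vector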

>>> import gensim
>>> doc = [(0, 1.0), (1, 2.0), (3, 3.0), (4, 4.0), (6, 8.0), (7, 4.0), (8, 5.0)]
>>> length = 9
>>> gensim.matutils.sparse2full(doc, length)
array([1., 2., 0., 3., 4., 0., 8., 4., 5.], dtype=float32)

2) Convert a dense vector to a sparse vector

>>> gensim.matutils.dense2vec([1,2,3,4,5])
[(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)]
>>> gensim.matutils.dense2vec([1,2,0,3,4,0,8,4,5])
[(0, 1.0), (1, 2.0), (3, 3.0), (4, 4.0), (6, 8.0), (7, 4.0), (8, 5.0)]