Text Vectorization

Latest recommended article published on 2024-08-06 13:49:14

zhuxiaohai68

CountVectorizer
Tf–idf term weighting

CountVectorizer

The simplest example:

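from sklearn.feature_extraction.text import CountVectorizer

X_test = ['I sed about sed the lack',
          'of any Actually']
# The default token_pattern drops one-character tokens such as 'I'. The output
# below includes 'i', so a pattern that keeps single characters is assumed here.
count_vec = CountVectorizer(stop_words=None, token_pattern=r"(?u)\b\w+\b")
print(count_vec.fit_transform(X_test).toarray())

[[1 0 0 1 1 0 2 1]
 [0 1 1 0 0 1 0 0]]

# vocabulary_ maps each term to its column index; sorted here for readability
print('\nvocabulary list:\n')
print(dict(sorted(count_vec.vocabulary_.items())))

vocabulary list:

{'about': 0, 'actually': 1, 'any': 2, 'i': 3, 'lack': 4, 'of': 5, 'sed': 6, 'the': 7}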
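To make the counting step concrete, here is a minimal pure-Python sketch of what a bag-of-words count matrix is. It is not how scikit-learn is implemented; the variable names (`tokenized`, `vocabulary`, `counts`) are illustrative, and it assumes lowercasing plus a pattern that keeps single-character tokens, which is what the output above suggests.

```python
import re

docs = ['I sed about sed the lack', 'of any Actually']

# Lowercase each document and split it into word tokens.
# Assumption: single-character tokens are kept, so 'i' is counted.
tokenized = [re.findall(r"\b\w+\b", doc.lower()) for doc in docs]

# Vocabulary: every distinct token gets a column index, in sorted order.
all_terms = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# Count matrix: one row per document, one column per vocabulary term.
counts = [[0] * len(vocabulary) for _ in docs]
for row, tokens in zip(counts, tokenized):
    for token in tokens:
        row[vocabulary[token]] += 1

print(vocabulary)
print(counts)
```

Each row is one document and each column one vocabulary term, so 'sed' appearing twice in the first sentence shows up as a 2 in column 6.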

An example that drops single-character words, much as a stop-word list would:

from sklearn.feature_extraction.text import CountVectorizer

# The sentences are already segmented into words separated by spaces
X_test = ['没有 你 的 地方 都是 他乡', '没有 你 的 旅行 都是 流浪']
count_vec = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b")
print(count_vec.fit_transform(X_test).toarray())
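token_pattern determines what counts as a token (feature). By default a token must consist of at least two word characters; punctuation is always ignored and acts as a token separator.
The regular expression given here breaks down as follows:
(?u) makes the pattern Unicode-aware.
\b matches a word boundary, where a word is made of Unicode word characters: letters, digits, and the underscore.
\w matches a single Unicode word character (as defined above).
\w+ matches one or more Unicode word characters.
So the whole expression matches runs of at least two Unicode word characters, and only such runs become tokens. This is why the single-character words 你 and 的 do not appear in the vocabulary.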
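The effect of the pattern can be checked with Python's `re` module alone. This is a small sketch; the second pattern, `\b\w+\b`, is a hypothetical alternative included only for comparison and does not come from the original example.

```python
import re

# Pattern from the example: a token needs at least two word characters.
two_plus = re.compile(r"(?u)\b\w\w+\b")
# Hypothetical alternative that also keeps single-character tokens.
one_plus = re.compile(r"(?u)\b\w+\b")

text = '没有 你 的 地方 都是 他乡'

tokens_two_plus = two_plus.findall(text)
tokens_one_plus = one_plus.findall(text)

print(tokens_two_plus)  # single-character words are dropped
print(tokens_one_plus)  # every space-separated word is kept
```

With the two-character pattern, 你 and 的 never become tokens, so they are filtered without supplying an explicit stop-word list.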

[[1 1 0 1 0 1]
[0 0 1 1 1 1]]

print('\nvocabulary list:\n')
# Print each term with its column index, ordered by index
for key, value in sorted(count_vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(key, value)

vocabulary list:

他乡 0
地方 1
旅行 2
没有 3
流浪 4
都是 5

Tf–idf term weighting

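from sklearn.feature_extraction.text import TfidfTransformer

# The transformer's definition is missing from the original snippet.
# smooth_idf=False is assumed because it reproduces the values shown below.
transformer = TfidfTransformer(smooth_idf=False)

# Each row is a document, each column the raw count of one term
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf

<6x3 sparse matrix of type '<… 'numpy.float64'>'
with 9 stored elements in Compressed Sparse … format>

tfidf.toarray()

array([[0.81940995, 0.        , 0.57320793],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.47330339, 0.88089948, 0.        ],
       [0.58149261, 0.        , 0.81355169]])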
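The numbers above can be reproduced by hand, which shows what the weighting does. The sketch below is pure Python and rests on two assumptions that happen to match the values shown: an unsmoothed inverse document frequency, idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) the number of documents containing term t, followed by L2 normalization of each row. The variable names are illustrative only.

```python
import math

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Unsmoothed idf (an assumption, see above): idf(t) = ln(n / df(t)) + 1.
idf = [math.log(n_docs / d) + 1 for d in df]

tfidf = []
for row in counts:
    # Weight each raw count by the idf of its term ...
    weighted = [tf * w for tf, w in zip(row, idf)]
    # ... then scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

The first term occurs in every document, so its idf is 1 and it carries little weight next to the rarer second and third terms. That is why the first row leans toward the third column even though the raw count of the first term is three times larger.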