1.Count Vector
Count vectorization represents a sentence by how many times each word occurs in it, mapping a sentence to a numeric vector.
Pros: simple and easy to understand.
Cons: when the corpus is large, the resulting vectors are high-dimensional and mostly zeros (a sparse matrix), which complicates downstream computation; see the sparse-format sketch after this list.
Optimization: build the vocabulary from only the most frequently occurring words (scikit-learn's max_features, shown in the sketch after the example code, is one way to do this).
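To make the sparsity point concrete, here is a minimal sketch (assuming SciPy is available; the vocabulary size is made up): a 10,000-word vocabulary gives every sentence a 10,000-dimensional vector that is almost entirely zeros, so sparse formats store only the nonzero entries.
from scipy.sparse import csr_matrix
# Made-up illustration: a 10,000-word vocabulary and a sentence that
# contains only two of those words. csr_matrix stores just the nonzeros.
dense = [[0] * 9998 + [1, 2]]
sparse = csr_matrix(dense)
print(sparse.nnz)  # 2 nonzero entries out of 10,000 positions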
Example:
Sentence 1: 这只皮靴号码大了,那只号码合适 (this boot is a size too big; that one fits)
Sentence 2: 这只皮靴号码不小,那只更合适 (this boot is not small; that one fits better)
import jieba
# Tokenize both sentences with jieba
textA = '这只皮靴号码大了,那只号码合适'
textB = '这只皮靴号码不小,那只更合适'
bowA = list(jieba.cut(textA))
print('/'.join(bowA))
bowA.remove(',')  # drop the punctuation token
bowB = list(jieba.cut(textB))
print('/'.join(bowB))
bowB.remove(',')
list_ = [bowA, bowB]
# Build the vocabulary from both sentences
word_set = set(bowA).union(set(bowB))
print(word_set)
# Map each vocabulary word to a fixed index
word_index_dict = {}
for index, word in enumerate(word_set):
    word_index_dict[word] = index
# Compute the count vector of each sentence
count_vector = []
for text in list_:
    vector_list = [0] * len(word_set)
    for word in text:
        vector_list[word_index_dict[word]] += 1
    count_vector.append(vector_list)
print('count vector:', count_vector)
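For comparison, scikit-learn's CountVectorizer implements the same scheme, and its max_features parameter is one way to apply the optimization above (keep only the most frequent words). A minimal sketch, assuming scikit-learn >= 1.0 and reusing bowA and bowB from the code above; the token_pattern keeps single-character Chinese tokens that the default pattern would drop.
from sklearn.feature_extraction.text import CountVectorizer
# Space-join the jieba tokens so CountVectorizer can split on whitespace
docs = [' '.join(bowA), ' '.join(bowB)]
# max_features caps the vocabulary at the most frequent words
vectorizer = CountVectorizer(max_features=1000, token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())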
2.TF-IDF
TF-IDF (term frequency-inverse document frequency): if a word occurs frequently in a document but rarely across the rest of the corpus, it is likely to be important to that document. TF-IDF is therefore commonly used to find the keywords of a document.
Pros: simple, fast, and easy to understand.
Cons: it measures importance by word frequency alone, ignoring the effect of where a word appears in the document, so it is not a complete picture.
2.1.Computation Method
TF (term frequency), two common variants:
(1) number of times the term occurs in the document / total number of terms in the document (most common)
(2) number of times the term occurs in the document / number of occurrences of the document's most frequent term
IDF (inverse document frequency):
IDF = log( total number of documents in the corpus / (number of documents containing the term + 1) )
TF-IDF:
TF-IDF = TF * IDF
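As a quick sanity check of these formulas with made-up numbers (a 100-word document in which a term appears 3 times, in a corpus of 10,000,000 documents of which 1,000 contain the term):
import math
tf = 3 / 100                                  # variant (1): 0.03
idf = math.log10(10_000_000 / (1_000 + 1))    # ~4.0, per the formula above
print(tf * idf)                               # ~0.12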
Example (the same two sentences as before):
Sentence 1: 这只皮靴号码大了,那只号码合适
Sentence 2: 这只皮靴号码不小,那只更合适
import pandas as pd
import math
import jieba
# Tokenize both sentences with jieba
textA = '这只皮靴号码大了,那只号码合适'
textB = '这只皮靴号码不小,那只更合适'
bowA = list(jieba.cut(textA))
bowA.remove(',')  # drop the punctuation token
bowB = list(jieba.cut(textB))
bowB.remove(',')
# Build the vocabulary
word_set = set(bowA).union(set(bowB))
# Count how often each vocabulary word occurs in each sentence
word_dictA = dict.fromkeys(word_set, 0)
word_dictB = dict.fromkeys(word_set, 0)
for word in bowA:
    word_dictA[word] += 1
for word in bowB:
    word_dictB[word] += 1
print('Counts:\n', pd.DataFrame([word_dictA, word_dictB]))
# Compute TF: occurrences of the term / total terms in the document
def computeTF(word_dict, bow):
    tf_dict = {}
    bow_count = len(bow)
    for key, value in word_dict.items():
        tf_dict[key] = value / float(bow_count)
    return tf_dict
tfbowa = computeTF(word_dictA, bowA)
tfbowb = computeTF(word_dictB, bowB)
print('TF:\n', pd.DataFrame([tfbowa, tfbowb]))
# Compute IDF: log10(total documents / (documents containing the term + 1))
def computeIDF(dolist):
    # Total number of documents in the corpus
    n = len(dolist)
    # Initialize the IDF dictionary
    idf_dict = dict.fromkeys(dolist[0].keys(), 0)
    # Count how many documents contain each term
    for doc in dolist:
        for word, val in doc.items():
            if val > 0:
                idf_dict[word] += 1
    # Apply the IDF formula. Note: with only 2 documents, the +1 smoothing
    # makes every IDF zero or negative; this artifact fades as the corpus grows.
    for word, val in idf_dict.items():
        idf_dict[word] = math.log10(n / (float(val) + 1))
    print('Total documents:', n)
    return idf_dict
idf = computeIDF([word_dictA, word_dictB])
print('IDF:\n', pd.DataFrame([idf]))
# Compute TF-IDF = TF * IDF
def computeTF_IDF(tfbow, idf):
    tfidf = {}
    for word, val in tfbow.items():
        tfidf[word] = idf[word] * val
    return tfidf
tfidf_A = computeTF_IDF(tfbowa, idf)
tfidf_B = computeTF_IDF(tfbowb, idf)
print('TF-IDF:\n', pd.DataFrame([tfidf_A, tfidf_B]))
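For reference, scikit-learn's TfidfVectorizer packages the whole pipeline; a minimal sketch reusing bowA and bowB from the code above. Its default formula (a smoothed natural-log IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) differs from the log10 variant used here, so the numbers will not match exactly.
from sklearn.feature_extraction.text import TfidfVectorizer
# Space-join the jieba tokens; the token_pattern keeps single-character tokens
docs = [' '.join(bowA), ' '.join(bowB)]
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tfidf_matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())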