Text representation is the task of turning text data into numerical data, capturing as much of the text's information as possible without losing anything essential.
Bag of Words
The bag-of-words approach represents a text by the frequency of each word in the vocabulary, or by features derived from those frequencies. Word order is ignored: every word that appears becomes its own feature column, and a document is treated simply as a collection of words. In essence it is a frequency-weighted refinement of one-hot representation:
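As a toy illustration (the two-sentence corpus below is made up), every distinct word becomes one feature column and each document becomes a row of counts:
# Toy example: counts only, word order is ignored
corpus = ["the cat sat on the mat", "the dog sat"]
vocab = sorted(set(" ".join(corpus).split()))    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
vectors = [[s.split().count(w) for w in vocab] for s in corpus]
print(vectors)                                   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]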
- Count Vector
The simplest bag-of-words model: count encoding.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word',        # tokenize by word
    max_features=4000,      # keep only the 4000 most frequent terms
)
vec.fit(x_train)            # learn the vocabulary, i.e. the word-to-column mapping
vec.transform(x)            # the simplest input features for a text classifier
An improved count encoding adds 2-gram and 3-gram statistics. Once the text features have been extracted, a traditional machine-learning classifier (naive Bayes, support vector machine) can be applied directly, and cross-validation can be added during training.
# Improvement: richer features by adding 2-gram and 3-gram statistics and enlarging the vocabulary
# Feature extraction with CountVectorizer
from sklearn.naive_bayes import MultinomialNB   # naive Bayes
vec = CountVectorizer(
    analyzer='word',        # tokenize by word
    ngram_range=(1, 3),     # unigrams, bigrams and trigrams
    max_features=10000,     # keep only the 10000 most frequent ngrams
)
vec.fit(x_train)            # learn the ngram vocabulary
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
classifier.score(vec.transform(x_test), y_test)
# Cross-validation
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score
seed = 10
def kfold_cv(x, y, clf_class, n_folds=5, **kwargs):
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    y_pred = y.copy()       # copy so the original labels are not overwritten
    for fold, (train_index, test_index) in enumerate(folds.split(x, y)):
        x_train, x_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(x_train, y_train)
        y_pred[test_index] = clf.predict(x_test)   # out-of-fold predictions
    return y_pred
NB = MultinomialNB
print(precision_score(y, kfold_cv(vec.transform(x), np.array(y), NB), average='macro'))
- TF-IDF Vector
TF-IDF is the product of TF (Term Frequency) and IDF (Inverse Document Frequency). It captures how important a word is to a particular document, but it carries no information about word position or about a word's context.
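In the classic formulation the weight of term t in document d is tf(t, d) · idf(t), with idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A minimal sketch of this classic formula follows (the three-document corpus is made up; note that sklearn's TfidfVectorizer additionally smooths the idf and L2-normalizes each document vector):
import math
docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]   # hypothetical corpus
N = len(docs)
def tfidf(term, doc):
    tf = doc.count(term) / len(doc)              # how frequent the term is in this document
    df = sum(1 for d in docs if term in d)       # how many documents contain the term
    return tf * math.log(N / df)                 # terms common to every document get weight 0
print(tfidf("cat", docs[1]), tfidf("ran", docs[1]))   # "cat" weighs more than "ran" in docs[1]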
Extract features with TF-IDF and train an SVM:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
vec = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 4),      # unigrams up to 4-grams
    max_features=20000       # keep only the 20000 most frequent ngrams
)
vec.fit(x_train)
classifier = SVC(kernel='linear')
classifier.fit(vec.transform(x_train), y_train)
print(classifier.score(vec.transform(x_test), y_test))
Wrapping this up as a class makes it easier to maintain:
class TextClassifierSVM:
    def __init__(self, classifier=SVC(kernel='linear')):   # the kernel can be swapped out
        self.classifier = classifier
        self.vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), max_features=20000)
    def features(self, X):
        return self.vectorizer.transform(X)
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)
    def predict(self, x):
        return self.classifier.predict(self.features([x]))   # x is a single string
    def score(self, X, y):
        return self.classifier.score(self.features(X), y)
text_classifier = TextClassifierSVM()
text_classifier.fit(x_train, y_train)
print(text_classifier.predict('我是一个mooc的学习者'))
print(text_classifier.score(x_test, y_test))
- Co-Occurrence Vector
This rests on an important research result: Harris proposed in 1954 that words with similar contexts have similar meanings, and Firth elaborated and sharpened this distributional hypothesis in 1957: a word's meaning is determined by its context.
Context window: with a context window of 2, the context of a given word is its two neighbouring words on each side (four words in total).
Co-occurrence: two words appearing together within the context window.
If the corpus contains n distinct words, the co-occurrence matrix is n × n, so each word's co-occurrence vector has n dimensions, as sketched below.
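A minimal sketch of building such a matrix, assuming a tiny made-up corpus and a symmetric context window of 2:
# Build an n x n co-occurrence matrix over a toy corpus with a context window of 2
corpus = [["i", "like", "deep", "learning"], ["i", "like", "nlp"]]
window = 2
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
cooc = [[0] * len(vocab) for _ in vocab]
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[w]][idx[sent[j]]] += 1
print(vocab)                 # ['deep', 'i', 'learning', 'like', 'nlp']
print(cooc[idx["like"]])     # the co-occurrence vector of "like": [1, 2, 1, 0, 1]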
Word Embedding
The co-occurrence approach gives words a distributed representation and captures contextual information, but it inflates the dimensionality. Word embedding first replaces the integer entries of the vector with floating-point values, so the representation ranges over the reals, and then compresses the huge, sparse dimensionality into a much smaller embedding space, achieving dimensionality reduction.
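A minimal sketch of such a lookup table using PyTorch's nn.Embedding (the vocabulary size of 5000 and the dimension of 10 are arbitrary choices here): each integer word index maps to a dense, trainable vector of floats.
import torch
import torch.nn as nn
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=10)   # 5000 words, 10 dimensions each
word_idx = torch.LongTensor([42])     # the integer index of some word
print(embedding(word_idx))            # a 1 x 10 float tensor, learned during training
print(embedding(word_idx).shape)      # torch.Size([1, 10])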
- Ngram
An n-gram word-embedding model in PyTorch:
#-*- coding:utf-8 -*-
'''
@project: Exuding
@author: txd
@time: 2019-12-05 09:29:32
'''
import torch
import torch.nn as nn
import torch.nn.functional as F
CONTEXT_SIZE = 2        # the two preceding words are used as context
EMBEDDING_DIM = 100     # dimensionality of the word vectors
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# Group the words into triples: the first two words of each triple are the input, the last is the target to predict
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
           for i in range(len(test_sentence)-2)]
# Give every word an integer index so it can be fed to the embedding layer to obtain its word vector
vocb = set(test_sentence)   # set removes duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}   # a concise way to build the vocabulary
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}
### Define the model
class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        self.n_word = vocb_size
        self.embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size*n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)
    def forward(self, x):
        emb = self.embedding(x)
        emb = emb.view(1, -1)       # concatenate the context embeddings into one row
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)
        log_prob = F.log_softmax(out, dim=1)
        return log_prob
ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, EMBEDDING_DIM)
criterion = nn.NLLLoss()
optimizer = torch.optim.SGD(ngrammodel.parameters(), lr=1e-3)
for epoch in range(100):
    print('epoch: {}'.format(epoch+1))
    print('*'*10)
    running_loss = 0
    for data in trigram:
        word, label = data
        word = torch.LongTensor([word_to_idx[i] for i in word])
        label = torch.LongTensor([word_to_idx[label]])
        # forward
        out = ngrammodel(word)
        loss = criterion(out, label)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))
word, label = trigram[3]
word = torch.LongTensor([word_to_idx[i] for i in word])
out = ngrammodel(word)                    # the predicted log-probabilities
_, predict_label = torch.max(out, 1)      # index of the most probable word
print(_, predict_label)
predict_word = idx_to_word[predict_label.item()]   # map the index back to the word
print('real word is {}, predict word is {}'.format(label, predict_word))
- Word2Vec
word2vec is a concrete method for learning word embeddings, open-sourced by Google in 2013.
For the underlying mathematics, see 《word2vec 中的数学原理详解》 (www.cnblogs.com).
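As a rough sketch of how word2vec is used in practice, the gensim library (not used elsewhere in this article) provides an implementation; the toy corpus and parameters below are only illustrative, and the parameter names follow gensim 4.x:
from gensim.models import Word2Vec
sentences = [["i", "like", "deep", "learning"], ["i", "like", "nlp"], ["deep", "learning", "is", "fun"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)   # sg=1: skip-gram
print(model.wv["nlp"])                  # the dense 100-dimensional vector for "nlp"
print(model.wv.most_similar("deep"))    # nearest neighbours by cosine similarity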