ML9 Self-Study Notes

十九岁的花季少女

Published 2022-06-28 20:49:09

Article link: https://blog.csdn.net/xiaomi5410/article/details/125508846

Text Features

import pandas as pd
import numpy as np
import re
import nltk #pip install nltk

Basic Preprocessing

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

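The task is to classify each document by topic: animals or weather.
Running the following call opens the NLTK downloader window; scroll down in the rightmost column and install the required package, stopwords.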

nltk.download()

Remove words that do not help distinguish the topic (stop words).

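# Load the tokenizer and the English stop-word list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove special characters (flags must be passed by keyword;
    # the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the tokens into a document
    doc = ' '.join(filtered_tokens)
    return doc

# Vectorize the function so it can be applied to the whole corpus array
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus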

The normalized output can be compared with the original documents above.
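
To see what the normalization does to a single sentence without installing NLTK, here is a minimal sketch. The stop-word set `TOY_STOP_WORDS` and the function name `normalize_text` are made up for illustration and are much smaller than NLTK's real English list, so the output on other sentences may differ from the NLTK-based version.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the post itself uses nltk.corpus.stopwords.words('english').
TOY_STOP_WORDS = {"the", "is", "and", "this", "over", "very", "but"}

def normalize_text(doc, stop_words=TOY_STOP_WORDS):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    tokens = doc.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(normalize_text("The sky is blue and beautiful."))
print(normalize_text("The quick brown fox jumps over the lazy dog."))
```

Only the content words survive, which is what the later vectorizers count.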

Bag-of-Words Model

The bag-of-words model counts how often each word occurs in each document.

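from sklearn.feature_extraction.text import CountVectorizer
print(norm_corpus)
# Instantiate the vectorizer
cv = CountVectorizer(min_df=0., max_df=1.)

cv.fit(norm_corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
cv_matrix = cv.fit_transform(norm_corpus)
# Convert the sparse matrix to a dense array
cv_matrix = cv_matrix.toarray()
cv_matrix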

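The printed feature names are the vocabulary built from the words in these sentences.
The matrix is the encoded result: a word that occurs once is encoded as 1, twice as 2, and a word that does not occur as 0.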

The matrix is easier to read as a labeled table, with one column per vocabulary word.
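
A minimal sketch of such a labeled table, written with the standard library only so it runs without scikit-learn or pandas. The three documents are in the style of the normalized corpus, and sorting the vocabulary alphabetically is an assumption made here to keep the columns in a predictable order.

```python
from collections import Counter

# Three already-normalized documents in the style of the example corpus.
docs = ["sky blue beautiful",
        "love blue beautiful sky",
        "sky blue sky beautiful today"]

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document, one column per vocabulary word.
counts = []
for doc in docs:
    word_counts = Counter(doc.split())
    counts.append([word_counts[word] for word in vocab])

# Print a labeled table: header row, then one row per document.
print(" ".join(f"{word:>9}" for word in vocab))
for row in counts:
    print(" ".join(f"{count:>9}" for count in row))
```

With pandas available, the same view can be obtained by wrapping the count matrix in a DataFrame whose columns are the vocabulary, as the later sections of these notes do.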

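N-Grams Model

This compensates for the bag-of-words model's lack of context information.
It counts combinations of adjacent words; ngram_range=(2,2) means two-word combinations (bigrams).
In practice two words is the usual choice, because longer combinations make the matrix larger and sparser.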

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)
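
To make the idea of a word combination concrete, here is a small sketch that lists the bigrams of one normalized sentence. The helper `word_ngrams` is a hypothetical name introduced for this example; it only illustrates what is being counted and is not how scikit-learn implements it.

```python
def word_ngrams(text, n=2):
    """Return the list of n-word sequences in a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("quick brown fox jumps lazy dog"))
```

A sentence with k words yields k - 1 bigrams, and most bigrams appear in only one document, which is why the bigram matrix grows wide and sparse so quickly.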


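TF-IDF Model

TF: term frequency; IDF: inverse document frequency.
If a word is rare across the corpus (high IDF) but occurs several times in the current document (high TF), its TF-IDF weight is large: the word is important for that document and helps tell it apart from the others.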
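
The weighting can be worked through by hand. The sketch below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document vector, which is the variant documented for scikit-learn's default settings; the function names `idf` and `tfidf` and the toy documents are introduced here for illustration.

```python
import math

# Toy tokenized documents in the style of the normalized corpus.
docs = [["sky", "blue", "beautiful"],
        ["love", "blue", "beautiful", "sky"],
        ["quick", "brown", "fox", "jumps", "lazy", "dog"]]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times IDF per term, then L2-normalize the vector."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

print(round(idf("sky"), 3), round(idf("fox"), 3))
print(tfidf(docs[1]))
```

In the second document, "love" appears in no other document, so it receives a larger weight than "sky", "blue", or "beautiful", which are shared with the first document.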

from sklearn.feature_extraction.text import TfidfVectorizer 
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)


Similarity Features

The pairwise similarity between documents can also be used as a feature.

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
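
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following sketch computes it by hand on toy count vectors; the four-word vocabulary and the vector values are invented for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy count vectors over a hypothetical vocabulary [sky, blue, fox, dog].
weather_a = [1, 1, 0, 0]
weather_b = [2, 1, 0, 0]
animals = [0, 0, 1, 1]

print(cosine(weather_a, weather_b))  # close to 1: shared words
print(cosine(weather_a, animals))    # 0: no words in common
```

Documents on the same topic share vocabulary and score close to 1, while documents with no words in common score 0, which is why each row of the similarity matrix can serve as a feature vector.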


Clustering Features

This tends not to work very well.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)

Topic Model

Not commonly used.

from sklearn.decomposition import LatentDirichletAllocation

# n_topics was renamed to n_components in newer scikit-learn releases
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features

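Word Embedding Model: word2vec

This addresses the limitation mentioned earlier, that the relationship between a word and its context is ignored. Words used in similar contexts, such as apple and banana or keyboard and mouse, should end up close to each other in the vector space.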

from gensim.models import word2vec

wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

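# Set the model parameters
feature_size = 10    # word-vector dimensionality
window_context = 10  # context window size
min_word_count = 1   # minimum word frequency

# gensim 4.x renamed the `size` argument to `vector_size`
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                          window=window_context, min_count=min_word_count)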

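For a single sentence: in each dimension, add up the values of the word vectors of all words in the sentence and divide by the number of words; the result is the sentence's value in that dimension.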
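
A small numeric sketch of this averaging, using made-up three-dimensional vectors instead of a trained model. The dictionary `toy_vectors` and the function `average_vector` are hypothetical names for this example; words missing from the dictionary are skipped, and a sentence with no known words returns an empty list here rather than a zero vector.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration;
# a trained word2vec model would supply real ones.
toy_vectors = {
    "sky": [0.2, 0.4, 0.0],
    "blue": [0.4, 0.0, 0.6],
    "beautiful": [0.6, 0.2, 0.3],
}

def average_vector(words, vectors):
    """Average, dimension by dimension, the vectors of the known words."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return []
    return [sum(dim) / len(known) for dim in zip(*known)]

print(average_vector(["sky", "blue", "beautiful", "unknown"], toy_vectors))
```

The first dimension, for instance, is (0.2 + 0.4 + 0.6) / 3, so every sentence ends up with a vector of the same length as a single word vector regardless of how many words it contains.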

def average_word_vectors(words, model, vocabulary, num_features):
    # Sum the vectors of all in-vocabulary words, then divide by their count
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # gensim 4.x: word vectors are looked up through model.wv
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    # gensim 4.x renamed wv.index2word to wv.index_to_key
    vocabulary = set(model.wv.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

Build the averaged word-vector features for each document.

w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm