Text Classification: Embedding / Word2Vec + DNN

Latest recommended article published on 2024-09-06 19:51:15

Python风控模型与数据分析

Categories: Deep Learning, Natural Language Processing. Tags: embedding, dnn, artificial intelligence

Article link: https://blog.csdn.net/a7303349/article/details/132391801

I. Introduction to Embedding

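An embedding is a technique that maps discrete symbols (words, categories, labels, and so on) into a continuous vector space. It is typically used to represent high-dimensional, discrete data as low-dimensional, continuous vectors that machine learning models can process easily. Compared with one-hot encoding, which is high-dimensional and sparse, an embedding lets you choose the dimensionality freely and packs a large amount of information into a short vector.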
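To make the one-hot versus embedding comparison concrete, here is a minimal pure-Python sketch. The four-word vocabulary and the vector values are made up for illustration; a real embedding table would be learned from data.

```python
# Toy vocabulary; the words and vector values are made up for illustration.
vocab = ["loan", "credit", "movie", "video"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector as long as the vocabulary with a single 1."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

# An embedding table has one dense row per word; its width is a free choice.
embedding_dim = 2
embedding_table = [
    [0.9, 0.1],  # loan
    [0.8, 0.2],  # credit
    [0.1, 0.9],  # movie
    [0.2, 0.8],  # video
]

def embed(word):
    """Look up the dense vector for a word (a plain row lookup)."""
    return embedding_table[word_to_id[word]]

print(one_hot("credit"))  # [0, 1, 0, 0]
print(embed("credit"))    # [0.8, 0.2]
```

The one-hot vector grows with the vocabulary, while the width of the embedding row stays at whatever dimension you pick.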

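The word2vec model covered in the previous post is a classic embedding method. Other common choices include static word vectors such as GloVe and fastText, and contextual (dynamic) word vectors such as GPT and BERT. The core idea of an embedding is to train a model that learns the semantic relationships between symbols, so that symbols with similar meanings lie close together in the embedding space. Such a continuous representation captures relatedness and semantic information between symbols better than a discrete encoding, which tends to improve results on many machine learning tasks.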
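One common way to measure how "close" two embedding vectors are is cosine similarity. The sketch below uses hand-written toy vectors (not taken from any trained model) to show that related words score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, chosen by hand for illustration.
vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(vectors["king"], vectors["queen"])
sim_far = cosine_similarity(vectors["king"], vectors["banana"])
print(sim_close > sim_far)  # True
```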

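An embedding can be trained with the following steps:

1. Initialize the embedding space: assign each symbol (for example, each word) a randomly initialized embedding vector.
2. Prepare the training data: map each symbol to its embedding vector and feed these vectors to the model as input.
3. Train the model: an optimizer (such as stochastic gradient descent) adjusts the embedding vectors to minimize the loss on the training data. The loss usually measures how well the embeddings serve the model's task, such as classification, regression, or clustering.
4. Learn the embeddings: over many iterations the model gradually adjusts the vectors until they capture the semantic relationships between symbols. After training, the vectors can be reused for tasks such as classification, sentiment analysis, and text generation.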
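The training steps above can be sketched in a few lines of pure Python. This is a toy setup, not the model used later in this article: each sample is a single token id with a binary label, the classifier is a logistic regression on top of the embedding row, and all sizes and the learning rate are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

# Toy data (made up): token ids 0 and 1 belong to class 1, ids 2 and 3 to class 0.
samples = [(0, 1), (1, 1), (2, 0), (3, 0)]
vocab_size, dim, lr = 4, 2, 0.5

# Step 1: randomly initialize the embedding table and the classifier weights.
E = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
w = [random.uniform(-0.1, 0.1) for _ in range(dim)]
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss():
    """Average binary cross-entropy over the toy samples."""
    total = 0.0
    for token, y in samples:
        p = sigmoid(sum(wi * ei for wi, ei in zip(w, E[token])) + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(samples)

loss_before = mean_loss()

# Steps 2-4: look up the vector, compute gradients, update with SGD, repeat.
for epoch in range(200):
    for token, y in samples:
        e = E[token]
        p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b)
        g = p - y  # gradient of the loss with respect to the logit
        grad_w = [g * ei for ei in e]
        grad_e = [g * wi for wi in w]
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        E[token] = [ei - lr * ge for ei, ge in zip(e, grad_e)]  # the embedding row is trained too
        b -= lr * g

loss_after = mean_loss()
print(loss_before, loss_after)
```

The point to notice is that the embedding rows receive gradients just like any other weight, which is what an Embedding layer does inside a neural network.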

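In natural language processing, word embedding is widely used: it maps words into a continuous vector space so that semantically similar words lie close together, and the resulting vectors plug conveniently into downstream tasks such as text classification and similarity computation.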

II. Text Classification in Practice

Chinese Word2Vec: https://github.com/Embedding/Chinese-Word-Vectors

Google English Word2Vec: https://code.google.com/archive/p/word2vec/

1. Data loading and preprocessing

# 1. Imports
import re
import warnings

import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve,roc_auc_score

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

warnings.filterwarnings('ignore')
# 2. Load and preprocess the data
data=pd.read_excel('Inshorts Cleaned Data.xlsx')

def data_preprocess(data):
    # Drop unused columns ('Time ' and 'Source ' carry a trailing space in the raw file)
    df=data.drop(['Publish Date','Time ','Headline'],axis=1).copy()
    df.rename(columns={'Source ':'Source'},inplace=True)
    # Binary task: YouTube (y=1) vs. India Today (y=0)
    df=df[df.Source.isin(['YouTube','India Today'])].reset_index(drop=True)
    df['y']=np.where(df.Source=='YouTube',1,0)
    df=df.drop(['Source'],axis=1)
    return df

df=data.pipe(data_preprocess)
print(df.shape)
df.head()

# 3. Load the English stop words (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_english=stopwords.words('english')
stop_english

# 4. Text preprocessing: expand contractions, lowercase, remove stop words, lemmatize
# (WordNetLemmatizer requires a one-time nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
 
def replace_abbreviation(text):
    # (contraction, expansion) pairs, applied to the lowercased text
    rep_list=[
        ("it's", "it is"),
        ("i'm", "i am"),
        ("he's", "he is"),
        ("she's", "she is"),
        ("we're", "we are"),
        ("they're", "they are"),
        ("you're", "you are"),
        ("that's", "that is"),
        ("this's", "this is"),
        ("can't", "can not"),
        ("don't", "do not"),
        ("doesn't", "does not"),
        ("we've", "we have"),
        ("i've", "i have"),
        ("isn't", "is not"),
        ("won't", "will not"),
        ("hasn't", "has not"),
        ("wasn't", "was not"),
        ("weren't", "were not"),
        ("let's", "let us"),
        ("didn't", "did not"),
        ("hadn't", "had not"),
        ("what's", "what is"),
        ("couldn't", "could not"),
        ("you'll", "you will"),
        ("i'll", "i will"),
        ("you've", "you have")
    ]
    result = text.lower()
    for word_replace in rep_list:
        result=result.replace(word_replace[0],word_replace[1])
#     result = result.replace("'s", "")

    return result
 
def drop_char(text):
    result=text.lower()
    result=re.sub(r'[^\w\s]',' ',result) # replace punctuation and special characters with a space
    result=re.sub(r'\s+',' ',result) # collapse runs of whitespace into a single space
    return result
 
def stemed_words(text,stop_words,lemma):
    # Drop stop words, then lemmatize each remaining word as a verb (pos='v')
    word_list = [lemma.lemmatize(word, pos='v') for word in text.split() if word not in stop_words]
    result=" ".join(word_list)
    return result
 
def text_preprocess(text_seq):
    stop_words = stopwords.words("english")
    lemma = WordNetLemmatizer()
    
    result=[]
    for text in text_seq:
        if pd.isnull(text):
            result.append(None)
            continue
        text=replace_abbreviation(text)
        text=drop_char(text)
        text=stemed_words(text,stop_words,lemma)
        result.append(text)
    return result
 
df['short']=text_preprocess(df.Short)
df[['Short','short']]

# 5. Split into training and test sets (2000 randomly sampled rows form the test set)
test_index=list(df.sample(2000).index)
df['label']=np.where(df.index.isin(test_index),'test','train')
df['label'].value_counts()

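2. Encoding text as sequences

Build a frequency-ranked vocabulary with num_words=6000 and use it to encode each text as a sequence of integer ids.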

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_dict_fit(train_text_list,num_words):
    '''
        train_text_list: ['some thing today ','some thing today2']
    '''
    tok_params={
        'num_words':num_words,  # vocabulary cap: only the most frequent words (ids below num_words) are kept when encoding
        'filters':'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
        'lower':True,
        'split':' ',
        'char_level':False,
        'oov_token':None, # token used for out-of-vocabulary words; None drops them
    }
    tok = Tokenizer(**tok_params) # build the tokenizer
    tok.fit_on_texts(train_text_list) # learn the word index from the training texts

    return tok

def word_dict_apply_sequences(tok_model,text_list,len_vec):
    '''
        text_list: ['some thing today ','some thing today2']
    '''
    list_tok = tok_model.texts_to_sequences(text_list) # map each word to its id

    pad_params={
        'sequences':list_tok,
        'maxlen':len_vec,  # sequence length after padding
        'padding':'pre', # 'pre' or 'post': pad at the start or at the end
        'truncating':'pre', # 'pre' or 'post': cut overlong sequences at the start or at the end
        'value':0, # pad with 0
    }
    seq_tok = pad_sequences(**pad_params) # pad the encoded sequences; returns a 2-D array
    return seq_tok

num_words,len_vec = 6000,40
tok_model= word_dict_fit(df[df.label=='train'].short,num_words)
tok_train = word_dict_apply_sequences(tok_model,df[df.label=='train'].short,len_vec)
tok_test = word_dict_apply_sequences(tok_model,df[df.label=='test'].short,len_vec)
tok_test
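
To see what this encoding does without TensorFlow, here is a rough pure-Python sketch of the same idea. The helpers fit_vocab and encode are hypothetical names written for this illustration; they mimic the frequency-ranked index, the num_words cutoff, and 'pre' padding/truncation, but they are not the Keras implementation.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency; id 1 is the most frequent word, id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words, maxlen):
    """Map words to ids, drop unknown words and ids >= num_words, then pre-truncate and pre-pad."""
    ids = [word_index[w] for w in text.lower().split()
           if w in word_index and word_index[w] < num_words]
    ids = ids[-maxlen:]                     # 'pre' truncation keeps the tail of the sequence
    return [0] * (maxlen - len(ids)) + ids  # 'pre' padding adds zeros in front

train_texts = ["cat sat mat", "cat sat", "cat"]
word_index = fit_vocab(train_texts)
print(word_index)                                       # {'cat': 1, 'sat': 2, 'mat': 3}
print(encode("cat mat dog", word_index, 3, 4))          # [0, 0, 0, 1]
print(encode("sat cat sat cat sat", word_index, 3, 4))  # [1, 2, 1, 2]
```

In the first encoded example, "mat" is dropped because its id (3) is not below num_words, and "dog" is dropped because it never appeared in the training texts.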

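3. Embedding + DNN

In a neural network, an Embedding layer initializes an embedding matrix and trains it together with the rest of the model. Its input is the integer sequence data produced in the previous step. Here the Embedding layer turns each sequence into a matrix of embedding vectors, so its output is a 3-D tensor (batch, sequence length, embedding dimension); this is then flattened to 2-D and passed to the fully connected layers.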
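The shape changes described here can be traced with plain Python lists. The sizes below are hypothetical and much smaller than the 6000 x 128 matrix and length-40 sequences used in this article, and the matrix values are placeholders rather than trained weights.

```python
# Hypothetical sizes, chosen small so the result is easy to inspect.
vocab_size, embed_dim, maxlen = 5, 3, 4

# Embedding matrix: one row per token id (row 0 belongs to the padding id).
embedding = [[float(10 * i + j) for j in range(embed_dim)] for i in range(vocab_size)]

# A batch of two padded id sequences.
batch = [[0, 0, 2, 1],
         [0, 4, 3, 2]]

# Lookup: (batch, maxlen) -> (batch, maxlen, embed_dim)
looked_up = [[embedding[token] for token in seq] for seq in batch]

# Flatten: (batch, maxlen, embed_dim) -> (batch, maxlen * embed_dim)
flattened = [[x for vec in seq for x in vec] for seq in looked_up]

print(len(looked_up), len(looked_up[0]), len(looked_up[0][0]))  # 2 4 3
print(len(flattened), len(flattened[0]))                        # 2 12
```

Flattening concatenates the per-position vectors in order, so the dense layers see a fixed-length input of maxlen * embed_dim values per sample.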

from tensorflow.keras import models,layers,optimizers

'''
    Embedding layer parameters
        input_dim: vocabulary size, i.e. the largest input index + 1
        output_dim: positive integer, the dimension of the dense embedding
        input_length: length of the input sequences
'''
def dnn_model(x_train,y_train,x_test,y_test,maxlen):
    model = models.Sequential([
        layers.Embedding(input_dim=6000,output_dim=128,input_length=maxlen),
        layers.Flatten(),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(1,activation="sigmoid")
    ])

    model.compile( # compile the model
    #     optimizer = "rmsprop",
        optimizer = optimizers.RMSprop(learning_rate=0.001), # `lr` is deprecated in favor of `learning_rate`
        loss = "binary_crossentropy",
        metrics = ["accuracy"],
    )

    history=model.fit(
        x_train,y_train,
        batch_size=2000,
        epochs=5,  # number of epochs
        validation_data=(x_test,y_test),
    )
    return model,history

model,history=dnn_model(tok_train,df[df.label=='train'].y,tok_test,df[df.label=='test'].y,40)
model.summary()

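def ks_auc_value(y_value,df,model):
    # df is the encoded input array; predict returns shape (n, 1), so flatten it to 1-D
    y_pred=model.predict(df).ravel()
    fpr,tpr,thresholds= roc_curve(list(y_value),y_pred)
    ks=max(tpr-fpr) # KS statistic: largest gap between TPR and FPR
    auc= roc_auc_score(list(y_value),y_pred)
    return ks,auc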

ks_auc_value(df[df.label=='train'].y,tok_train,model)
'''
output:
    (0.7835286320884429, 0.938647027371186)

'''

ks_auc_value(df[df.label=='test'].y,tok_test,model)
'''
output:
    (0.7034280775878625, 0.905442793355099)

'''
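
The two numbers returned by ks_auc_value are the KS statistic and the AUC. As a rough illustration of what they measure, the sketch below computes both by hand on six made-up scores; ks_and_auc is a hypothetical helper for this example, not a replacement for the scikit-learn functions used above.

```python
def ks_and_auc(y_true, y_score):
    """KS = max(TPR - FPR) over thresholds; AUC = share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]

    ks = 0.0
    for threshold in sorted(set(y_score)):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        fpr = sum(s >= threshold for s in neg) / len(neg)
        ks = max(ks, tpr - fpr)

    # A tie between a positive and a negative score counts as half a correct pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return ks, auc

# Made-up labels and scores for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
ks, auc = ks_and_auc(y_true, y_score)
print(round(ks, 3), round(auc, 3))  # 0.667 0.889
```

A higher KS means some threshold separates the two classes well, while the AUC summarizes ranking quality across all thresholds.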

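4. Pretrained Word2Vec + DNN

word2vec can be trained from scratch, or an already trained model can be used directly or fine-tuned. Here we train our own word2vec model, build an embedding matrix aligned with the tokenizer's vocabulary, load it as the weights of the Embedding layer, and freeze that layer (its weights stay unchanged during backpropagation).

(1) Training the word2vec model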

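import gensim
def word2vec_train(sentences):
    # When the vectors will be used as Embedding weights, leave max_vocab_size as None
    '''
        sentences: a list of tokenized sentences (stop words, punctuation and line breaks already removed),
        e.g. [["only", "you", "can", "prevent", "forest", "fires"], ...]
    '''
    params={
        'sg':1, # 1 = skip-gram, 0 = CBOW
        'cbow_mean':1, # CBOW only: 1 averages the context vectors, 0 sums them
        'min_count':1, # minimum word frequency; rarer words are dropped
        'vector_size':128, # word vector dimension, typically tens to a few hundred
        'window':5, # maximum distance between the current word and a context word
        'workers':1, # number of worker threads
        'hs':1, # 1 uses hierarchical softmax; with 0 and a non-zero `negative`, negative sampling is used
        'negative':3, # number of negative samples
        'seed':1,
        'max_vocab_size':None, # cap on the vocabulary size
        'shrink_windows':False, # if True, the effective window for each target word is sampled uniformly from [1, window]
        'ns_exponent':1,
        'sample':0.001, # threshold for randomly downsampling high-frequency words
        'epochs':5, # number of passes over the corpus
        'alpha':0.025, # initial learning rate
        'corpus_file':None, # path to a corpus file; pass either this or sentences, not both
    }
    model = gensim.models.word2vec.Word2Vec(sentences=sentences, **params)

    return model

word2vec_model=word2vec_train(list(df.short.str.split()))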

(2) Building the embedding matrix from the vocabulary

def get_embedding_matrix(word_index,word2vec_model,num_word,vec_len):
    embedding_matrix=np.zeros((num_word,vec_len))
    for word,index in word_index.items():
        # word_index ids start at 1; id 0 is the padding value from pad_sequences, so row 0 stays all zeros.
        # The tokenizer only emits ids below num_word, so larger ids are skipped.
        if index<num_word and word in word2vec_model.wv:
            embedding_matrix[index]=word2vec_model.wv[word]
    return embedding_matrix


embedding_matrix=get_embedding_matrix(tok_model.word_index,word2vec_model,6000,128)
embedding_matrix
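
The key detail in this step is index alignment: row i of the matrix must hold the vector of the word whose tokenizer id is i. The pure-Python sketch below shows that alignment on toy inputs; build_embedding_matrix, the word index, and the vectors dict are all hypothetical stand-ins for the tokenizer and the word2vec model.

```python
def build_embedding_matrix(word_index, vectors, num_words, dim):
    """Row i holds the vector of the word with tokenizer id i; rows without a vector stay zero."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, idx in word_index.items():
        if idx < num_words and word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

# Toy inputs: a tokenizer-style word index and a dict standing in for trained word vectors.
word_index = {"cat": 1, "sat": 2, "mat": 3, "rare": 4}
vectors = {"cat": [0.1, 0.2], "mat": [0.5, 0.6], "rare": [0.9, 0.9]}

matrix = build_embedding_matrix(word_index, vectors, num_words=4, dim=2)
for row in matrix:
    print(row)
```

Row 0 stays zero because id 0 is the padding value, row 2 stays zero because "sat" has no vector in the toy dict, and "rare" is skipped because its id falls outside the num_words cutoff.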

(3) Pretrained word2vec + DNN

'''
    Embedding layer parameters
        input_dim: vocabulary size, i.e. the largest input index + 1
        output_dim: positive integer, the dimension of the dense embedding
        input_length: length of the input sequences
        weights: initial values of the weight matrix
        trainable: whether the Embedding layer is updated during training
'''
def dnn_model(x_train,y_train,x_test,y_test,maxlen):
    model = models.Sequential([
        layers.Embedding(input_dim=6000,output_dim=128,input_length=maxlen,weights=[embedding_matrix],trainable=False),
        layers.Flatten(),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(1,activation="sigmoid")
    ])

    model.compile( # compile the model
    #     optimizer = "rmsprop",
        optimizer = optimizers.RMSprop(learning_rate=0.001), # `lr` is deprecated in favor of `learning_rate`
        loss = "binary_crossentropy",
        metrics = ["accuracy"],
    )

    history=model.fit(
        x_train,y_train,
        batch_size=2000,
        epochs=5,  # number of epochs
        validation_data=(x_test,y_test),
    )
    return model,history

model,history=dnn_model(tok_train,df[df.label=='train'].y,tok_test,df[df.label=='test'].y,40)

(4) Model evaluation

ks_auc_value(df[df.label=='train'].y,tok_train,model)
'''
output:
    (0.7041902248769555, 0.9172368082724525)

'''

ks_auc_value(df[df.label=='test'].y,tok_test,model)
'''
output:
    (0.690816849113373, 0.9032042202803917)

'''

5. Word2Vec + DNN fine-tuning

Starting from the pretrained word2vec vectors, fine-tune them on the downstream task by setting the trainable parameter to True.

'''
    Embedding layer parameters
        input_dim: vocabulary size, i.e. the largest input index + 1
        output_dim: positive integer, the dimension of the dense embedding
        input_length: length of the input sequences
        weights: initial values of the weight matrix
        trainable: whether the Embedding layer is updated during training
'''
def dnn_model(x_train,y_train,x_test,y_test,maxlen):
    model = models.Sequential([
        layers.Embedding(input_dim=6000,output_dim=128,input_length=maxlen,weights=[embedding_matrix],trainable=True),
        layers.Flatten(),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(32,activation="relu"),
        layers.Dropout(rate=0.4),
        layers.Dense(1,activation="sigmoid")
    ])

    model.compile( # compile the model
    #     optimizer = "rmsprop",
        optimizer = optimizers.RMSprop(learning_rate=0.001), # `lr` is deprecated in favor of `learning_rate`
        loss = "binary_crossentropy",
        metrics = ["accuracy"],
    )

    history=model.fit(
        x_train,y_train,
        batch_size=2000,
        epochs=5,  # number of epochs
        validation_data=(x_test,y_test),
    )
    return model,history

model,history=dnn_model(tok_train,df[df.label=='train'].y,tok_test,df[df.label=='test'].y,40)

ks_auc_value(df[df.label=='train'].y,tok_train,model)
'''
output:
    (0.709054116574492, 0.9192694771708021)

'''

ks_auc_value(df[df.label=='test'].y,tok_test,model)
'''
output:
    (0.7110140195890148, 0.9032112220728508)

'''