NLP Learning Notes: Chinese Text Classification

1. Encoding Issues in Chinese Text Processing

Chinese text ultimately has to be handled as Unicode.
Python implicitly decodes the data first and then re-encodes it.
Python 2.7's default encoding is ASCII; Python 3's default is Unicode.
Two ways to fix garbled Chinese (mojibake) in Python 2.7:
1). Declare the encoding at the top of the file:

#encoding:utf-8

2). Set sys.defaultencoding in Python 2.7 (for reference)
By default, data is decoded with the codec named by sys.getdefaultencoding(); when the data is not ASCII, this raises a decode error.
When converting encodings, first decode the data from its own encoding into Unicode, then encode that Unicode object as UTF-8 (a small example of the two steps follows).
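A tiny sketch of that two-step conversion (my own illustration, assuming Python 2.7 and a simulated GBK-encoded input):

gbk_bytes = u'\u4e2d\u6587'.encode('gbk')   # simulated GBK input (the string "中文")
unicode_text = gbk_bytes.decode('gbk')      # step 1: decode with the data's own encoding
utf8_bytes = unicode_text.encode('utf-8')   # step 2: re-encode the Unicode object as UTF-8
print repr(utf8_bytes)                      # '\xe4\xb8\xad\xe6\x96\x87'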

import sys  # "import sys" only creates a reference; the module must be reloaded to get the full API back
stdi,stdo,stde=sys.stdin,sys.stdout,sys.stderr
reload(sys)  # setdefaultencoding() is removed by site.py at startup, so reload() is needed to restore it
sys.stdin,sys.stdout,sys.stderr=stdi,stdo,stde  # restore the std streams that reload() resets
sys.setdefaultencoding('utf-8')

2. Removing Punctuation (Non-text Parts)

In most cases, useless punctuation should be stripped from Chinese text, covering both English and Chinese punctuation marks.

import re
## Strip punctuation (both Chinese and English marks)
def remove_punctuation(line):
    # Chinese punctuation: !?。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.
    # English punctuation: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
    try:
      line = re.sub("[!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+".decode("utf-8"), "",line.decode("utf-8"))
    except Exception as e:
      print "remove_punctuation error:", e
    return line

3. Chinese Word Segmentation

Only one common segmentation tool, jieba, is shown here.

import jieba
## Segment a line of text with jieba
def cutline(line):
    line=str(line) # guard against purely numeric values being read as float
    words = jieba.cut(line, cut_all=False) # accurate mode
    result=" ".join(words) # renamed from "re" to avoid shadowing the re module
    return result

4. Removing Stopwords

In some scenarios stopwords should be kept for the analysis.
Commonly used Chinese stopword lists can be downloaded online.

# Load the stopword list
def get_stopwords(path):
    f= open(path)
    stopwords=[]
    for line in f:
        stopwords.append(line.strip())
    f.close()
    return stopwords
stopwords=get_stopwords("./stopwords.txt")
final=[]
for seg in seg_list:           # seg_list: the words produced by jieba.cut
    seg=seg.encode("utf8")
    if seg not in stopwords:
        final.append(seg)

5. Text Vectorization

Common vectorization methods are listed below.

5.1 Bag-of-Words Model

See the sklearn text feature extraction documentation.
Bag-of-Words model (BoW): each document is represented by its word counts.
Set-of-Words model (SoW): a one-hot, presence/absence variant of the bag of words.

## Bag-of-words vectorization with sklearn
import pandas as pd
import re
import jieba
import cPickle as pickle
import numpy as np
# Load the corpus: a pickled DataFrame
path='./data/nlpmaildata.pkl'
f2 = open(path, 'rb')
d = pickle.load(f2)
f2.close()
# Load the stopword list
def get_stopwords(path):
    f= open(path)
    stopwords=[]
    for line in f:
        stopwords.append(line.strip().decode("utf-8"))
    f.close()
    return stopwords
stopwords=get_stopwords("./data/stopwords.txt")

# Bag-of-words vectorization
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(stop_words=stopwords)
# The input is a list of whitespace-separated, pre-segmented strings
# d_x=vectorizer.fit_transform(d["title"]).toarray()  # fit and transform in one step
vectorizer.fit(d["title"])
d_x2=vectorizer.transform(d["title"]).toarray()
# np.where returns the indices of the non-zero counts
# print np.where(d_x[0]>0)

# Get the fitted vocabulary
vocab_dir=vectorizer.get_feature_names()
d_y=list(d["label2"])

Normalizing the bag-of-words with TF-IDF:
replace each word's raw count with its TF-IDF weight.

from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer(stop_words=stopwords)
vector.fit(d["title"])
weightlist=vector.transform(d["title"]).toarray()
wordlist = vector.get_feature_names()# all words in the TF-IDF vocabulary

5.2 word2vec

Recommended reference on the theory: https://www.cnblogs.com/f-young/p/7906451.html
Word2Vec learns distributed representations and comes in two variants: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
gensim的Word2vector使用
pip install gensim
输入数据要求是:分词后数据,以空格为单词的分隔符
"""
from gensim.models import Word2Vec
import pandas as pd
import cPickle as pickle
path='./data/nlpmaildata2.pkl'
f2 = open(path, 'rb')
d = pickle.load(f2)
f2.close()


modelpath="./data/w2c_model"
sentences=list(d["title"])
sentences= [s.split() for s in sentences]  # titles are already utf-8 byte strings; the decode/encode round trip was a no-op

model = Word2Vec(sentences, sg=1, size=64,  window=5,  min_count=1,  negative=3, sample=0.001, hs=1, workers=4)
# 1. sg=1 selects the skip-gram algorithm (more sensitive to rare words); the default sg=0 is CBOW.
# 2. size is the dimensionality of the output word vectors: too small causes collisions that hurt quality, too large costs memory and slows training; 100-200 is typical.
# 3. window is the maximum distance between the current word and the target word; window=3 means looking at 3-b words before and b words after (b drawn randomly from 0-3).
# 4. min_count filters out words whose frequency is below the threshold; the default is 5.
# 5. negative and sample can be tuned on the results; sample down-samples words more frequent than the threshold, default 1e-3.
#    The paper suggests 5-20 negative words for small corpora and 2-5 when the corpus is large.
# 6. hs=1 enables hierarchical softmax; by default hs=0, and negative sampling is used when negative is non-zero.
# 7. workers controls training parallelism; it only takes effect if the Cython extension is compiled, otherwise training runs on a single core.
# model["英文"]  # look up the vector of a word
model.save(modelpath)
# model = Word2Vec.load(modelpath)

# Using the model (word similarity queries, etc.)
# model.most_similar(positive=['woman', 'king'], negative=['man'])
# # -> [('queen', 0.50882536), ...]

# model.doesnt_match("breakfast cereal dinner lunch".split())
# # -> 'cereal'

# model.similarity('woman', 'man')
# # -> 0.73723527

# model['computer']  # raw numpy vector of a word
# # -> array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

5.3 ngram2vec
5.4 Hash Trick

Vectorizing a large text corpus with the hashing trick.

Principle by example: hash(word1) = position 5 and hash(word2) = position 5, so the value at position 5 becomes 1 + 1 (or a second hash function is used to resolve the collision).
Reference 1
Reference 2: Feature Hashing for Large-Scale Multitask Learning.

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer2=HashingVectorizer(n_features = 100,norm = None,stop_words=stopwords)
vectorizer2.fit(d["title"])
hashlist=vectorizer2.transform(d["title"]).toarray()

5.5 Others

Randomly initialized word vectors, learned together with the downstream model; a minimal sketch follows.
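A minimal sketch (my own illustration, not from the original post) of a randomly initialized, trainable Keras Embedding layer; vocab_size, embedding_dims, and maxlen are placeholder values:

from keras.models import Sequential
from keras.layers import Embedding

vocab_size = 20000     # placeholder vocabulary size
embedding_dims = 64    # placeholder vector length
maxlen = 100           # placeholder padded sequence length

model = Sequential()
# No pretrained weights are passed in: the layer starts from Keras's default random
# initializer and the vectors are learned during training (trainable=True).
model.add(Embedding(vocab_size + 1, embedding_dims, input_length=maxlen, trainable=True))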

6. Common Models for Text Classification

This section covers some common traditional machine-learning models and popular deep-learning models for text classification.

6.1 Naive Bayes

## Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y)   # X: vectorized training texts (e.g. the bag-of-words matrix above), y: training labels

pre_reduce = clf.predict(test)   # test: vectorized test texts
# Evaluation metrics (label: the true test labels)
from sklearn import metrics
print "Accuracy : %.2f" % metrics.accuracy_score(label, pre_reduce)
print "recall : %.2f" % metrics.recall_score(label, pre_reduce)
print "F1 : %.2f" % metrics.f1_score(label, pre_reduce)

6.2 fastText

Text classification with fastText.
Method 1: implement it yourself (see the sketch after this list).
Method 2: Facebook's open-source tool
https://github.com/facebookresearch/fastText#text-classification
paper: https://arxiv.org/pdf/1607.01759.pdf
The core idea of fastText: average the word and n-gram vectors of the whole document to get a document vector, then run a softmax multi-class classifier on that document vector.
Its key ingredients are character-level n-gram features and hierarchical softmax classification.
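A minimal sketch of method 1 (my own illustration): an averaged-embedding classifier in Keras, in the spirit of fastText; vocab_size, maxlen, and the binary output are placeholder assumptions:

from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size = 20000    # placeholder vocabulary size
maxlen = 100          # placeholder padded sequence length
embedding_dims = 64

model = Sequential()
model.add(Embedding(vocab_size, embedding_dims, input_length=maxlen))
model.add(GlobalAveragePooling1D())        # average the word vectors of the document
model.add(Dense(1, activation='sigmoid'))  # binary case; use softmax + categorical_crossentropy for multi-class
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=32, epochs=5)  # x_train/y_train: padded word ids and labels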

# Method 2: fastText builds word vectors that take context into account, based on hierarchical softmax
# Input format: space-separated words followed by the label, e.g. 英媒 称 威 __label__affairs
import pandas as pd
import re
import jieba
import cPickle as pickle
import numpy as np

## Read the data
path='./data/nlpmaildatasample2.csv'
d = pd.read_csv(path,header=None)
d.columns=['title','lable']

dtrain=d[0:d.shape[0]/5*3]
dtest=d[d.shape[0]/5*3:d.shape[0]]

# Write the train/test files in fastText format
def w2file(data,filename):
    f = open(filename,"w")
    for i in range(data.shape[0]):
        # use the data argument with .iloc so the function also works for dtest (its index does not start at 0)
        outline = data['title'].iloc[i] + "\t__label__" + str(data['lable'].iloc[i]) + "\n"
        f.write(outline)
    f.close()

w2file(dtrain,"./data/fasttext_train.txt")
w2file(dtest,"./data/fasttext_test.txt")

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import fastText
# Train the model
classifier = fastText.FastText.train_supervised("./data/fasttext_train.txt",lr=0.1, dim=100,wordNgrams=1,label=u"__label__")
# Parameters
# train_supervised(input, lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss=u'softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label=u'__label__', verbose=2, pretrainedVectors=u'')
# input_file     training file path (required)
# output         output file path (required)
# lr             learning rate [0.05]
# lr_update_rate change the rate of updates for the learning rate [100]
# dim            size of word vectors [100]
# ws             size of the context window [5]
# epoch          number of epochs [5]
# min_count      minimal number of word occurences [5]
# neg            number of negatives sampled [5]
# word_ngrams    max length of word ngram [1]
# loss           loss function {ns, hs, softmax} [ns]
# bucket         number of buckets [2000000]
# minn           min length of char ngram [3]
# maxn           max length of char ngram [6]
# thread         number of threads [12]
# t              sampling threshold [0.0001]
# silent         disable the log output from the C++ extension [1]
# encoding       specify input_file encoding [utf-8]
# Evaluate the model (see help(classifier) for details)
result = classifier.test("./data/fasttext_test.txt")
print result
texts=[str(t).decode("utf-8") for t in dtest["title"]] # prediction input must use the same encoding as training
## predict output format: ((u'__label__0',), array([ 0.77616984]))
y_pred = [int(e[0].replace("__label__","")) for e in classifier.predict(texts)[0]] # predict returns a tuple (labels, probabilities)
y_test=list(dtest["lable"])
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
print("Accuracy: %0.2f" % accuracy_score(y_test, y_pred))
print("F1: %0.2f" % f1_score(y_test, y_pred))

6.3 TextCNN

Text classification with TextCNN.
A question I had: how is MaxPooling1D computed, and how does it differ from GlobalMaxPooling1D? MaxPooling1D pools over a local window with a configurable pool size and stride, while GlobalMaxPooling1D takes the maximum over the entire time dimension (see the small comparison sketch after this overview).
At the output layer, match the activation to the loss: sigmoid with binary_crossentropy (log loss) for binary classification, softmax with categorical_crossentropy for multi-class.
Paper: Convolutional Neural Networks for Sentence Classification
Paper walkthrough: http://www.jeyzhang.com/cnn-apply-on-modelling-sentence.html
[Figure: TextCNN architecture from the paper]
The example uses 6 convolution kernels: 2 kernels per filter size and 3 filter sizes, 2 x 3 = 6 kernels.
Each kernel convolves over the sentence. For a kernel of size 4 on a 7-word sentence, 7 - 4 + 1 = 4 positions give a 4 x 1 feature map; max-pooling reduces each feature map to a single number; the 6 kernels are concatenated into a 6-dimensional vector, which goes through softmax to produce the output.
Input layer: a (number of words) x (embedding dimension) matrix. The matrix can be static (word vectors stay fixed) or non-static (word vectors are treated as trainable parameters, a process called fine-tuning).
Convolution layer: several feature maps from filters of different sizes; each kernel is n x k where k is the embedding dimension (a 1D convolution spans the full embedding width by default).
Pooling layer: max-over-time pooling; the output is the maximum of each feature map, i.e. a one-dimensional vector.
Fully connected + softmax layer: the pooled one-dimensional vector is fed through a fully connected layer into a softmax.
Dropout: applied to the penultimate fully connected layer, together with L2 regularization, to reduce overfitting.
Word vector variants:
CNN-rand: word vectors are randomly initialized and adjusted by backpropagation; the embedding layer's weights (the initial values can be viewed as ordinary layer weights) take part in backpropagation.
CNN-static: use word vectors pre-trained with word2vec, fastText, or GloVe and keep them fixed.
CNN-non-static: start from pre-trained word2vec/fastText/GloVe vectors and fine-tune them during training (my reading: 1. train word2vec on a large external corpus and then continue training on the task corpus; 2. let the embedding weights initialized from word2vec also take part in backpropagation).
CNN-multichannel: analogous to RGB channels in images; combine a static channel and a non-static channel.
Conclusions from the paper:
CNN-static beats CNN-rand, showing that pre-trained word vectors bring a sizable improvement (unsurprising, since they exploit information from much larger corpora);
CNN-non-static beats CNN-static in most cases, showing that moderate fine-tuning also helps, because it makes the vectors fit the specific task more closely;
CNN-multichannel outperforms the single-channel variants on small datasets; it embodies a compromise: keep the fine-tuned vectors from drifting too far from their original values while still leaving them some room to change.
github: https://github.com/yoonkim/CNN_sentence
Code references:
http://blog.csdn.net/diye2008/article/details/53105652?locationNum=11&fps=1
GloVe embedding reference: http://blog.csdn.net/sscssz/article/details/53333225
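A quick sketch (my own check, assuming Keras 2 with the TensorFlow backend) of the difference between the two pooling layers:

import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling1D, GlobalMaxPooling1D

x = np.random.rand(1, 8, 4).astype("float32")   # 1 sample, 8 time steps, 4 channels

local_pool = Sequential([MaxPooling1D(pool_size=2, input_shape=(8, 4))])   # max over windows of 2 steps, stride 2
global_pool = Sequential([GlobalMaxPooling1D(input_shape=(8, 4))])         # max over all 8 steps

print(local_pool.predict(x).shape)    # (1, 4, 4): the time dimension is halved
print(global_pool.predict(x).shape)   # (1, 4): the time dimension is removed entirely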

from __future__ import print_function

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Flatten,GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding,Dropout
from keras.models import Model
from keras.optimizers import *
from keras.models import Sequential
from keras.layers import Merge  # Keras 1 API used below; in Keras 2 use Concatenate with the functional API
import pandas as pd
import cPickle as pickle
import numpy as np
import gensim

## Load the data
print('Loading data...')
path='./data/nlpmaildatasample2.csv'
d = pd.read_csv(path,header=None)
d.columns=['title','lable']

all_data=set()
for line in d["title"]:
   ws=line.split(" ")
   for w in ws:
     if w == ' ' or w == '' or w=="\t":
        continue
     all_data.add(w)
words=list(all_data)
word_to_id = dict(zip(words, range(len(words))))
dx=[]
for line in d["title"]:
    ws=line.split(" ")
    dx.append([word_to_id[w] for w in ws if w in word_to_id])
# dy=list(d['lable'])
dy=d['lable']


print('Average  sequence length: {}'.format(np.mean(list(map(len, dx)), dtype=int)))

# set parameters:
maxlen=np.max(list(map(len, dx))) #maxlen = 400  length in words of the longest document
max_features = 20000  # maximum vocabulary size
batch_size = 32
embedding_dims = 64  # word vector dimension
epochs = 2
w2vpath="./data/w2c_model"

x_train, y_train, x_test, y_test = dx[0:len(dx)/5*3],dy[0:len(dx)/5*3],dx[len(dx)/5*3:len(dx)],dy[len(dx)/5*3:len(dx)]
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


print('Indexing word vectors.')
embeddings_index = {}
model = gensim.models.Word2Vec.load(w2vpath)
for word in words:
    embeddings_index[word]=model[word]
print('Found %s word vectors.' % len(embeddings_index))

print('Preparing embedding matrix.')
nb_words = min(max_features, len(word_to_id))
embedding_matrix = np.zeros((nb_words + 1, embedding_dims))
for word, i in word_to_id.items():
    if i > max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector # word_index to word_embedding_vector ,<20000(nb_words)


# First layer of the network: the embedding layer. Pre-trained word2vec vectors are used here, so trainable can be set to False.
embedding_layer = Embedding(nb_words+1,
                            embedding_dims,
                            input_length=maxlen,
                            weights=[embedding_matrix],
                            trainable=False)
print('Build model...')
## Simplest CNN baseline (kept commented out for reference)
# model = Sequential()
# model.add(Embedding(nb_words + 1,
#                     embedding_dims,
#                     input_length=maxlen))
# model.add(Dropout(0.2))
# model.add(Conv1D(250,#filters
#                  3,#kernel_size
#                  padding='valid',
#                  activation='relu',
#                  strides=1))
# model.add(GlobalMaxPooling1D())
# model.add(Dense(250))#hidden layer:
# model.add(Dropout(0.2))
# model.add(Activation('relu'))
# model.add(Dense(1))
# model.add(Activation('sigmoid'))
# model.compile(loss='binary_crossentropy',
#               optimizer='adam',
#               metrics=['accuracy'])
# model.fit(x_train, y_train,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(x_test, y_test))

### Three parallel conv branches merged into one model. Text encoded as word vectors is one-dimensional along time, so TextCNN uses 1D convolutions.
#left model
model_left = Sequential()
#https://keras.io/layers/embeddings/
# model.add(Embedding(max_features,embedding_dims,input_length=maxlen))
model_left.add(embedding_layer)
model_left.add(Conv1D(128, 5, activation='relu')) # 128 output filters, kernel size 5
model_left.add(MaxPooling1D())#5
model_left.add(Conv1D(128, 5, activation='relu'))
model_left.add(MaxPooling1D())#5
model_left.add(Conv1D(128, 5, activation='relu'))
model_left.add(MaxPooling1D()) #35 #model_left.add(GlobalMaxPooling1D())
model_left.add(Flatten())

model_right = Sequential()
model_right.add(embedding_layer)
model_right.add(Conv1D(128, 4, activation='relu'))
model_right.add(MaxPooling1D())#4
model_right.add(Conv1D(128, 4, activation='relu'))
model_right.add(MaxPooling1D())#4
model_right.add(Conv1D(128, 4, activation='relu'))
model_right.add(MaxPooling1D())#28
model_right.add(Flatten())

model_3 = Sequential()
model_3.add(embedding_layer)
model_3.add(Conv1D(128, 6, activation='relu'))
model_3.add(MaxPooling1D())#3
model_3.add(Conv1D(128, 6, activation='relu'))
model_3.add(MaxPooling1D())#3
model_3.add(Conv1D(128, 6, activation='relu'))
model_3.add(MaxPooling1D())#30
model_3.add(Flatten())

merged = Merge([model_left, model_right,model_3], mode='concat') # Combine the three branches with different convolution window sizes; any single branch alone also works reasonably well, but the three-branch structure follows the paper. (Merge is the Keras 1 API; in Keras 2 use Concatenate with the functional API.)
model = Sequential()
model.add(merged) # add merge
model.add(Dense(128, activation='relu')) # fully connected layer
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid')) # sigmoid for binary classification; for multi-class use a softmax output (one unit per class) and change the loss accordingly

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

score = model.evaluate(x_train, y_train, verbose=0) # evaluate on the training set (accuracy around 99% in my runs)
print('train score:', score[0])
print('train accuracy:', score[1])
score = model.evaluate(x_test, y_test, verbose=0)  # evaluate on the test set (around 97%; more epochs push it higher)
print('Test score:', score[0])
print('Test accuracy:', score[1])

6.4 TextRNN

The output at time t depends not only on the earlier inputs in the sequence but also on the later ones, hence the bidirectional LSTM.
embedding -> bi-directional LSTM -> concat outputs -> average -> softmax
In the LSTM, x_{t-1}, x_t are the successive words of a single sample; the recurrence runs only within one sample.
The TimeDistributed wrapper applies a layer to every time step of the input: http://keras-cn.readthedocs.io/en/latest/layers/wrapper/
Ideas to explore (a sketch of the second one follows this list):
Instead of classifying on only the last hidden state, apply k-max pooling over all hidden states before classifying.
Add a single convolution layer before the bidirectional GRU to extract n-gram features first (C-GRU).
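A minimal sketch of that C-GRU idea (my own illustration, not from the post); the hyper-parameters are placeholders with the same meaning as in the code below:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GRU, Bidirectional, Dense, Dropout

max_features = 20000
embedding_dims = 128
maxlen = 100

model = Sequential()
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(Conv1D(64, 3, activation='relu'))   # extract 3-gram features before the recurrence
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(GRU(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])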

from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb

import pandas as pd
import cPickle as pickle
import numpy as np
import gensim

## Load the data
print('Loading data...')
path='./data/nlpmaildatasample2.csv'
d = pd.read_csv(path,header=None)
d.columns=['title','lable']

all_data=set()
for line in d["title"]:
   ws=line.split(" ")
   for w in ws:
     if w == ' ' or w == '' or w=="\t":
        continue
     all_data.add(w)
words=list(all_data)
word_to_id = dict(zip(words, range(len(words))))
dx=[]
for line in d["title"]:
    ws=line.split(" ")
    dx.append([word_to_id[w] for w in ws if w in word_to_id])
# dy=list(d['lable'])
dy=d['lable']

# set parameters:
maxlen=np.max(list(map(len, dx))) #maxlen = 400  length in words of the longest document
max_features = len(word_to_id)+1
batch_size = 32
embedding_dims=128

x_train, y_train, x_test, y_test = dx[0:len(dx)/5*3],dy[0:len(dx)/5*3],dx[len(dx)/5*3:len(dx)],dy[len(dx)/5*3:len(dx)]
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(Bidirectional(LSTM(64))) ### output dimension 64; a GRU can be used instead
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
# try using different optimizers and different optimizer configs
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
# common LSTM settings: model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# a stateful LSTM model
#lahead: the input sequence length that the LSTM
# https://github.com/keras-team/keras/blob/master/examples/lstm_stateful.py
# model = Sequential()
# model.add(LSTM(20,input_shape=(lahead, 1),
#               batch_size=batch_size,
#               stateful=stateful))
# model.add(Dense(1))
# model.compile(loss='mse', optimizer='adam')


print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4,
          validation_data=[x_test, y_test])

6.5 TextRCNN

Word2vec provides the word-embedding matrix.
recurrent structure (convolutional layer):
the word-embedding matrix
left context: pad a meaningless token at the front and drop the last word; the max_token index maps to an all-zero vector
right context: drop the first word and pad the meaningless token at the end
lstm(left) + word embedding + lstm(right) === previous word + current word + next word
structure: 1) recurrent structure (convolutional layer) 2) max pooling 3) fully connected layer + softmax
Recurrent Convolutional Neural Networks for Text Classification
http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745
TensorFlow version: https://github.com/brightmart/text_classification/blob/master/a04_TextRCNN/p71_TextRCNN_model.py

import pandas as pd
import cPickle as pickle
import numpy as np
import gensim
from keras.preprocessing import sequence
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

## Load the data
print('Loading data...')
path='./data/nlpmaildatasample2.pkl'
f2 = open(path, 'rb')
d = pickle.load(f2)
f2.close()
# path='./data/nlpmaildatasample2.csv'
# d = pd.read_csv(path,header=None)
# d.columns=['title','lable']

all_data=set()
for line in d["title"]:
   ws=line.split(" ")
   for w in ws:
     if w == ' ' or w == '' or w=="\t":
        continue
     all_data.add(w)
words=list(all_data)
word_to_id = dict(zip(words, range(len(words))))
dx=[]
for line in d["title"]:
    ws=line.split(" ")
    dx.append([word_to_id[w] for w in ws if w in word_to_id])
# dy=list(d['lable'])
dy=d['lable']


print('Average  sequence length: {}'.format(np.mean(list(map(len, dx)), dtype=int)))

# set parameters:
maxlen=np.max(list(map(len, dx))) #maxlen = 400  length in words of the longest document
max_features = 20000  # maximum vocabulary size
batch_size = 32
embedding_dims = 64  # word vector dimension
epochs = 2
hidden_dim_1 = 200
hidden_dim_2 = 100
w2vpath="./data/w2c_model"

x_train, y_train, x_test, y_test = dx[0:len(dx)/5*3],dy[0:len(dx)/5*3],dx[len(dx)/5*3:len(dx)],dy[len(dx)/5*3:len(dx)]
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


print('Indexing word vectors.')
embeddings_index = {}
model = gensim.models.Word2Vec.load(w2vpath)
for word in words:
    embeddings_index[word]=model[word]
print('Found %s word vectors.' % len(embeddings_index))

print('Preparing embedding matrix.')
max_token = min(max_features, len(word_to_id))
embedding_matrix = np.zeros((max_token + 1, embedding_dims))
for word, i in word_to_id.items():
    if i > max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector # word_index to word_embedding_vector ,<20000(max_token)

print('Build model...')
document = Input(shape = (None, ), dtype = "int32")
left_context = Input(shape = (None, ), dtype = "int32")
right_context = Input(shape = (None, ), dtype = "int32")

embedder = Embedding(max_token + 1, embedding_dims, weights = [embedding_matrix], trainable = False)
doc_embedding = embedder(document)
l_embedding = embedder(left_context)
r_embedding = embedder(right_context)

# I use LSTM RNNs instead of vanilla RNNs as described in the paper.
forward = LSTM(hidden_dim_1, return_sequences = True)(l_embedding) # See equation (1).
backward = LSTM(hidden_dim_1, return_sequences = True, go_backwards = True)(r_embedding) # See equation (2).
together = concatenate([forward, doc_embedding, backward], axis = 2) # See equation (3).

semantic = TimeDistributed(Dense(hidden_dim_2, activation = "tanh"))(together) # See equation (4).

# Keras provides its own max-pooling layers, but they cannot handle variable length input
# (as far as I can tell). As a result, I define my own max-pooling layer here.
pool_rnn = Lambda(lambda x: backend.max(x, axis = 1), output_shape = (hidden_dim_2, ))(semantic) # See equation (5).

output = Dense(1, input_dim = hidden_dim_2, activation = "sigmoid")(pool_rnn) # See equations (6) and (7).NUM_CLASSES=1

model = Model(inputs = [document, left_context, right_context], outputs = output)
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

## Build the left and right contexts
print('Build left and right data')
doc_x_train = np.array(x_train)
# We shift the document to the right to obtain the left-side contexts.
left_x_train = np.array([[max_token]+t_one[:-1].tolist() for t_one in x_train])
# We shift the document to the left to obtain the right-side contexts.
right_x_train = np.array([t_one[1:].tolist()+[max_token] for t_one in x_train])

doc_x_test = np.array(x_test)
# We shift the document to the right to obtain the left-side contexts.
left_x_test = np.array([[max_token]+t_one[:-1].tolist() for t_one in x_test])
# We shift the document to the left to obtain the right-side contexts.
right_x_test = np.array([t_one[1:].tolist()+[max_token] for t_one in x_test])


# history = model.fit([doc_x_train, left_x_train, right_x_train], y_train, epochs = 1)
# loss = history.history["loss"][0]
model.fit([doc_x_train, left_x_train, right_x_train], y_train,
          batch_size=batch_size,
          epochs=4,
          validation_data=[[doc_x_test, left_x_test, right_x_test], y_test])

6.6 Hierarchical Attention Network (HAN)

The HAN model adds attention on top of (bi)directional LSTM/GRU layers.
paper: Hierarchical Attention Networks for Document Classification
The biggest benefit of adding attention is that the contribution of each sentence and each word to the predicted class can be read off directly.
Structure:
1. embedding
2. Word Encoder: a word-level bidirectional GRU to obtain rich word representations
3. Word Attention: word-level attention to pick out the important information within each sentence
4. Sentence Encoder: a sentence-level bidirectional GRU to obtain rich sentence representations
5. Sentence Attention: sentence-level attention to pick out the key sentences of the document
6. FC + Softmax
HierarchicalAttention: 1. Word Encoder 2. Word Attention 3. Sentence Encoder 4. Sentence Attention 5. linear classifier. 2017-06-13
The attention layer itself is an MLP + softmax mechanism (a sketch follows the reference list below).
Code references: https://github.com/richliao/textClassifier
https://github.com/philipperemy/keras-attention-mechanism
https://github.com/codekansas/keras-language-modeling/blob/master/keras_models.py
https://github.com/codekansas/keras-language-modeling
https://github.com/EdGENetworks/attention-networks-for-classification
https://github.com/brightmart/text_classification/tree/master/a05_HierarchicalAttentionNetwork
Theory explanation: https://www.zhihu.com/question/68482809/answer/268320399
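A minimal single-level sketch (my own illustration, assuming Keras 2 with the TensorFlow backend) of the MLP + softmax attention used as a pooling step over a word-level bidirectional GRU; it is not the full two-level HAN, and maxlen, max_features, and embedding_dims are placeholders:

from keras import backend as K
from keras.models import Model
from keras.layers import Input, Embedding, GRU, Bidirectional, Dense, Lambda

maxlen, max_features, embedding_dims = 100, 20000, 128   # placeholder hyper-parameters

words_in = Input(shape=(maxlen,), dtype='int32')
h = Embedding(max_features, embedding_dims)(words_in)
h = Bidirectional(GRU(64, return_sequences=True))(h)      # one hidden state per word: (batch, maxlen, 128)

# "MLP + softmax" attention: score each word, normalize, take the weighted sum of hidden states.
u = Dense(64, activation='tanh')(h)                        # one-layer MLP on each hidden state
scores = Dense(1)(u)                                       # unnormalized score per word: (batch, maxlen, 1)
alphas = Lambda(lambda s: K.softmax(K.squeeze(s, -1)))(scores)   # attention weights: (batch, maxlen)
context = Lambda(lambda t: K.sum(t[0] * K.expand_dims(t[1], -1), axis=1),
                 output_shape=lambda s: (s[0][0], s[0][2]))([h, alphas])  # weighted sum: (batch, 128)

out = Dense(1, activation='sigmoid')(context)              # binary classifier head
model = Model(inputs=words_in, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])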

7. Summary

All the code from this post is at
https://github.com/lytforgood/TextClassification
