Preprocessing Raw Text Data with TensorFlow

大雄没有叮当猫

WeChat official account: 数据挖掘与分析学习

There are still too few TensorFlow tutorials, and the ones that exist were mostly written by non-Chinese authors. For example:
TensorFlow-Examples
tensorflow_tutorials
TensorFlow-Tutorials
Tensorflow-101
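In my opinion these tutorials do not explain things in enough detail for beginners. In almost all of them the author has put finished code into an IPython notebook; you download it, run it locally, and happily get a result, without really understanding why the model is built that way or what each step produces. Or you genuinely want to understand the experts' code, but the official API documentation is not friendly enough for newcomers and reading it does not make things much clearer. At that point you really wish someone would guide you through it.
That is why I came up with the idea of writing a "hands-on, zero-prerequisite TensorFlow tutorial in Chinese". I hope more people get to know deep learning and TensorFlow. Comments and discussion are very welcome!
The code walked through today again implements text classification with a CNN. A very important step in this problem is reading and preprocessing the raw data; for the detailed code, see the original project (the link is not preserved here).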
(1) load data and labels
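The data used in the experiment is the movie reviews from Rotten Tomatoes. Let us first look at what the provided data looks like.
(Sorry, the image is missing.)
As you can see, each line is one review. The data has been through some initial processing, but contractions such as "doesn't" and "it's" have not been split. We will come back to this issue below.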

import itertools
import re
from collections import Counter

import numpy as np
import tensorflow as tf

def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    # Replace every character outside this whitelist with a space.
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # Split contractions off the preceding word, e.g. "it's" -> "it 's".
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    # Surround punctuation with spaces so that split(" ") isolates it.
    # Only the pattern needs escaping; a backslash in the replacement
    # string would be copied into the output.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # Collapse the runs of whitespace left behind by the substitutions.
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def load_data_and_labels():
    """
    Load the movie-review text data and split every sentence into words.
    Returns [x_text, y]: the tokenized sentences and their one-hot labels.
    """
    # Load the data from local files.
    with open("/data/machine_learning/分类数据/rt-polaritydata/rt-polarity.pos", encoding='utf8') as f:
        positive_examples = list(f.readlines())  # positive reviews
    positive_examples = [s.strip() for s in positive_examples]
    with open("/data/machine_learning/分类数据/rt-polaritydata/rt-polarity.neg", encoding='utf8') as f:
        negative_examples = list(f.readlines())  # negative reviews
    negative_examples = [s.strip() for s in negative_examples]
    # Clean each sentence, then tokenize it by splitting on single spaces.
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    x_text = [s.split(" ") for s in x_text]
    # Generate the labels: [0, 1] = positive, [1, 0] = negative.
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]

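This function loads the positive and negative data from files, combines them, and tokenizes every sentence, so x_text is a two-dimensional list that stores every word of every review. The corresponding labels are combined as well. Because the labels correspond to the two neurons of a binary-classification output layer, they are one-hot encoded as [0, 1] (positive) and [1, 0] (negative) and returned as y.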
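
As a minimal sketch of the label construction only, the snippet below uses three made-up reviews instead of the real files (the example sentences are assumptions for illustration). It shows that the concatenated y has one row per sentence and one column per output neuron.

```python
import numpy as np

# Toy stand-ins for the two review files (hypothetical data).
positive_examples = ["a great film", "truly moving"]
negative_examples = ["a dull mess"]

# Column 1 marks "positive", column 0 marks "negative".
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]

# Stack along axis 0 so that row i of y labels sentence i of the
# combined list positive_examples + negative_examples.
y = np.concatenate([positive_labels, negative_labels], 0)

print(y.shape)           # (3, 2)
print(y.argmax(axis=1))  # [1 1 0]
```

Because the positive examples are concatenated first in both x_text and y, the row order of the labels matches the order of the sentences.
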
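Note that f.readlines() already returns a list in which each element is one line of text (a str ending in "\n"), so wrapping it in another list() is actually unnecessary.
s.strip() removes the newline and any other whitespace at the ends of each sentence.
After the newlines are removed, each sentence still needs some work because of the issue mentioned earlier (see clean_str() for the details): punctuation, contractions and the like have to be separated from the words. The simplest way to tokenize an English str is to split on spaces, so we only need to insert spaces at every position that should be split and then call split(" ") on the whole str.
The labels are generated in a similar way.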
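
The "insert spaces, then split" idea can be sketched on a single sentence. The helper name tokenize and the example sentence are assumptions for illustration, and only a subset of the clean_str() rules is reproduced.

```python
import re

def tokenize(sentence):
    """Reduced version of clean_str() followed by split(" ")."""
    s = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", sentence)  # drop other characters
    s = re.sub(r"'s", " 's", s)      # it's    -> it 's
    s = re.sub(r"n't", " n't", s)    # doesn't -> does n't
    s = re.sub(r",", " , ", s)       # isolate commas
    s = re.sub(r"!", " ! ", s)       # isolate exclamation marks
    s = re.sub(r"\s{2,}", " ", s)    # collapse the extra spaces
    return s.strip().lower().split(" ")

tokens = tokenize("It's a film that doesn't work, sadly!")
print(tokens)
# ['it', "'s", 'a', 'film', 'that', 'does', "n't", 'work', ',', 'sadly', '!']
```

Collapsing repeated whitespace before the split matters: without it, split(" ") would produce empty strings wherever two spaces meet.
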

(2) padding sentence

def pad_sentence(sentences, padding_word="<PAD/>"):
    """
    Pad every sentence to the length of the longest sentence in the samples.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        # Number of padding words needed to reach the common length.
        num_padding = sequence_length - len(sentence)
        new_sequence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sequence)
    return padded_sentences

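Why do the sentences need padding?

In the TextCNN model, input_x is a tf.placeholder, that is, a tensor whose shape is fixed in advance, for example [batch, sequence_len]. The rows of a tensor cannot have different lengths, so we find the length of the longest sentence in the whole dataset and append padding words to every shorter sentence until all input sentences have the same length.

load_data_and_labels returns the sentences as a two-dimensional list, and pad_sentence returns the data in the same form. The only difference is that padding words may have been appended to the end of each sentence's list.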
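
A small sketch of the effect on toy data (the helper name pad_to_longest and the three sentences are assumptions for illustration; the logic mirrors the padding function above):

```python
import numpy as np

def pad_to_longest(sentences, padding_word="<PAD/>"):
    """Right-pad every token list to the length of the longest one."""
    sequence_length = max(len(s) for s in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]

sentences = [["a", "great", "film"], ["dull"], ["not", "bad"]]
padded = pad_to_longest(sentences)

for row in padded:
    print(row)
# ['a', 'great', 'film']
# ['dull', '<PAD/>', '<PAD/>']
# ['not', 'bad', '<PAD/>']

# Every row now has the same length, so the data is rectangular.
print(np.array(padded).shape)  # (3, 3)
```

The input lists are left unchanged because `s + [...]` builds a new list for each sentence.
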

(3) build vocabulary

def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary_inv = list(sorted(vocabulary_inv))
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]

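Counter accepts an iterable as its argument, but here we have many sentence lists. How can all the words in those lists be produced by a single, efficient iterator?

This is where itertools.chain(*iterables) comes in; see the Python itertools documentation for usage details.

It takes several iterables as arguments but returns a single iterator, which yields the contents of all of them as if they came from one sequence.

This gives us word_counts, the word-frequency statistics over the whole dataset.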
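
As a short illustration on made-up sentences (the toy data is an assumption), chaining the sentence lists and counting in one pass looks like this:

```python
import itertools
from collections import Counter

sentences = [["a", "good", "film"], ["a", "bad", "film"], ["good"]]

# chain(*sentences) walks the words of every sentence in order,
# without first building one big concatenated list.
all_words = itertools.chain(*sentences)
word_counts = Counter(all_words)

print(word_counts["a"], word_counts["good"], word_counts["bad"])  # 2 2 1
print(len(word_counts))  # 4
```

itertools.chain.from_iterable(sentences) yields the same sequence and avoids unpacking the outer list into separate arguments.
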

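To build the vocabulary, we extract the first element of each pair in word_counts, i.e. the word itself (Counter effectively deduplicates the words here). The vocabulary is ordered lexicographically rather than by frequency, so sorting vocabulary_inv yields a word list in dictionary order, with smaller leading characters first.

Finally we build a dict that stores the index of each word; this is the vocabulary variable.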
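
The two mappings can be sketched on a tiny, made-up word list (an assumption for illustration). Note that Python compares strings by code point, so a token such as "<PAD/>" sorts before lowercase words.

```python
from collections import Counter

word_counts = Counter(["film", "a", "good", "film", "a", "bad", "<PAD/>"])

# The keys of the Counter are the distinct words; sort them as strings.
vocabulary_inv = sorted(word_counts)
# Invert the list: word -> position in the sorted list.
vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}

print(vocabulary_inv)      # ['<PAD/>', 'a', 'bad', 'film', 'good']
print(vocabulary["film"])  # 3
```

The two structures are inverses of each other: vocabulary_inv[vocabulary[w]] gives back w for every word w.
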

(4) build input data

def build_input_data(sentences, labels, vocabulary):
    """Map every word to its vocabulary index and convert to NumPy arrays."""
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
    y = np.array(labels)
    return [x, y]

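From the functions above we now have the two-dimensional list of tokenized sentences, the labels that belong to them, and the vocabulary dict that maps each word to its index.
But consider this: the sentences currently store word strings, which takes a lot of memory when the dataset is large. It is better to store the index of each word instead, since an index is an int and takes less space.
So we use the vocabulary we just built to look up every word in the two-dimensional sentence list, which produces a two-dimensional list of word indices. Finally this list is converted into a two-dimensional NumPy array.
The corresponding labels are already a two-dimensional list of 0s and 1s, so they can be converted to an array directly.
Once converted to arrays, they can be used directly as the input and labels of the CNN.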
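
To make the final step concrete, here is a minimal sketch with hand-written inputs (the padded sentences, labels and vocabulary below are assumptions standing in for the outputs of the earlier steps):

```python
import numpy as np

padded = [["a", "good", "film"], ["bad", "<PAD/>", "<PAD/>"]]
labels = [[0, 1], [1, 0]]
vocabulary = {"<PAD/>": 0, "a": 1, "bad": 2, "film": 3, "good": 4}

# Look up every word, then turn the nested lists into arrays.
x = np.array([[vocabulary[word] for word in sentence] for sentence in padded])
y = np.array(labels)

print(x)
# [[1 4 3]
#  [2 0 0]]
print(x.shape, y.shape)  # (2, 3) (2, 2)
```

Here x has shape [number of sentences, sequence_len], which is the fixed rectangular shape that padding was meant to guarantee.
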