Text Classification with CNNs in PyTorch

“Deep Learning is more than adding layers”

The objective of this blog is to develop a step by step text classifier by implementing convolutional neural networks. So, this blog is divided into the following sections:

  • Introduction

  • Preprocessing

  • The model

  • Training

  • Evaluation

So, let’s get started!

Introduction

The text classification problem can be addressed with different approaches, for example, by considering the frequency with which words occur in a given text relative to their frequency in the complete corpus.
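
As a rough illustration of this frequency-based idea (it is not the approach used in the rest of this post), here is a minimal sketch that scores words with TF-IDF using scikit-learn; the two toy sentences are invented for the example.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: two short documents
corpus = [
    "forest fire near la ronge sask canada",
    "all residents asked to shelter in place",
]

# TF-IDF weighs a word's frequency within a document against
# how common that word is across the whole corpus
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(features.toarray())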

On the other hand, there exist other approaches in which the text is modeled as a sequence of words or characters; this type of approach relies mainly on models based on Recurrent Neural Network architectures.

If you want to know more about text classification with LSTM recurrent neural networks, take a look at this blog: Text Classification with LSTMs in PyTorch

However, there is another approach where the text is modeled as a distribution of words in a given space. This is achieved through the use of Convolutional Neural Networks (CNNs).

So, we are going to start from the last-mentioned approach: we are going to build a model that classifies a text by considering the spatial distribution of the set of words that make it up, using an architecture based on CNNs.
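
Before diving in, here is a minimal sketch of that idea (not the model built later in the post): a 1D convolution slides over the sequence of word embeddings, treating the embedding dimension as channels. The tensor sizes are arbitrary, chosen only for illustration.

import torch
import torch.nn as nn

# Toy batch: 2 sentences, each padded to 10 tokens,
# each token represented by an 8-dimensional embedding
embeddings = torch.randn(2, 10, 8)

# Conv1d expects (batch, channels, length), so the embedding
# dimension becomes the channel dimension
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3)
features = conv(embeddings.transpose(1, 2))

print(features.shape)  # torch.Size([2, 16, 8])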

Let’s start!

Preprocessing

The data used in this model was obtained from the Kaggle contest: Real or Not? NLP with Disaster Tweets

The first lines of the dataset look like Figure 1:

Figure 1. Head of dataset | Image by author

As we can see, it is necessary to create a preprocessing pipeline to load the text, clean it, tokenize it, pad it, and split it into train and test sets.

Load text. Since the text we are going to work with is already in our repository, we only need to call it locally and remove some columns that will not be useful.

import pandas as pd

def load_data(self):
  # Read the raw csv file and split it into
  # sentences (x) and target (y)
  df = pd.read_csv(self.data)
  df.drop(['id', 'keyword', 'location'], axis=1, inplace=True)

  self.x_raw = df['text'].values
  self.y = df['target'].values

Clean text. In this case, we will need to remove special symbols and numbers from the text. We are only going to work with lowercase words.

import re

def clean_text(self):
  # Lowercase the text, then replace anything that is not
  # a letter (numbers, punctuation, special symbols) with a space
  self.x_raw = [x.lower() for x in self.x_raw]
  self.x_raw = [re.sub(r'[^A-Za-z]+', ' ', x) for x in self.x_raw]
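
A quick check of what this cleaning step does to a single made-up tweet:

import re

raw = "Forest fire near La Ronge, Sask. 13,000 people affected!"
clean = re.sub(r'[^A-Za-z]+', ' ', raw.lower())
print(clean)  # 'forest fire near la ronge sask people affected '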

Word tokenization. For tokenization, we are going to make use of the word_tokenize function from the nltk library (a very simple way to tokenize a sentence). After this, we will need to build a dictionary with the “x” most frequent words in the dataset (this is done in order to reduce the complexity of the problem). So, as you can see in line 3 of code 3, tokenization is applied; in line 14 the most common “x” words are selected, and in line 16 the dictionary of words is built (as you can see, the dictionary begins at index 1, because we are reserving index 0 for padding).
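
A minimal sketch of this step, assuming nltk's word_tokenize is available and that self.x_raw holds the cleaned sentences; the attribute names self.num_words and self.vocabulary are illustrative placeholders rather than the exact names used in the original code.

from collections import Counter
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def build_vocabulary(self):
  # Tokenize every cleaned sentence into a list of words
  self.x_raw = [word_tokenize(x) for x in self.x_raw]

  # Count word frequencies over the whole corpus
  word_counts = Counter(word for sentence in self.x_raw for word in sentence)

  # Keep only the "x" most frequent words to limit the vocabulary size
  most_common = word_counts.most_common(self.num_words)

  # Map each word to an index, starting at 1 (index 0 is reserved for padding)
  self.vocabulary = {word: i + 1 for i, (word, _) in enumerate(most_common)}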
