Which is the best NLP library?


Ever since I started working on NLP (Natural Language Processing), I have been wondering which is the best NLP library that can meet most of our common NLP requirements. Although it is true that there is no one-size-fits-all, and the choice of library would depend on the task at hand, I was still curious as to how different libraries would compare if they were benchmarked against a very simple task.


With that in mind, I put on my developer hat and set out writing Python code using various libraries, to evaluate them against a very common task. To keep things simple, I decided to use the Twitter text-classification problem for the evaluation. The most common NLP libraries today are NLTK, spaCy, TextBlob, and Gensim, and of course there are deep neural network architectures using LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells.


The problem statement

The dataset I am using consists of a collection of Twitter tweets. Some of the tweets are labeled as racist while others are not. This is a classic supervised-learning binary-classification problem. Our job is to create models based on different libraries, and use them to classify previously unseen text as racist or not.


Here is a look at some of the available tweets:



The label 1 means the tweet is racist and the label 0 means it’s not.


For the sake of brevity, I will only be focusing on the key sections of the code. For the full code, please feel free to visit my GitHub Machine-learning repository. Since I have already cleaned up the dataset and performed the EDA (Exploratory Data Analysis), I will not be covering those details here either.


I will be focusing on five different approaches here: NLTK, TextBlob, spaCy with CNN, spaCy with document vectors, and finally a deep neural network model with bidirectional LSTM cells.


I will be using ROC-AUC Score and F1-Score as my evaluation metrics.


So let’s get started.


NLTK

Let’s start with the NLTK package. Here we have used the NLTK library for tokenization and lemmatization of the tweets, and Gensim’s Word2Vec model for creating word vectors. Finally, the XGBoost classifier model is used for training and inference. Below is a snippet containing the relevant sections of the code.


Using this technique, I was able to obtain an ROC-AUC Score of 0.80 and an F1-Score of 0.68.


TextBlob

TextBlob is a beginner-friendly NLP library and provides a lot of cool features, including built-in text classifiers. However, I found TextBlob a lot slower than the other libraries I have used.


For our purpose, we will use the built-in DecisionTreeClassifier of TextBlob. Below is the code snippet:


Using this technique, I was able to get an ROC-AUC Score of 0.68 and an F1-Score of 0.46.


spaCy with CNN

spaCy is one of the most popular and widely used libraries for NLP, and offers very powerful features. It offers a built-in text classifier named textcat, which is available as a default pipeline component. The textcat component supports BOW (bag of words), simple-CNN (convolutional neural network), and ensemble architectures.


In our example, we will use the simple-cnn architecture.
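
A minimal sketch of training a textcat component (toy data; this uses the spaCy 3 API with the default textcat model for brevity — the simple-cnn variant is selected through the component's model config):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# Default textcat model; the CNN architecture can be chosen via config
textcat = nlp.add_pipe("textcat")
textcat.add_label("RACIST")
textcat.add_label("NOT_RACIST")

# Toy training data in spaCy's "cats" annotation format
train_data = [
    ("what a lovely day for everyone", {"cats": {"RACIST": 0.0, "NOT_RACIST": 1.0}}),
    ("some hateful tweet goes here", {"cats": {"RACIST": 1.0, "NOT_RACIST": 0.0}}),
]

def make_examples():
    return [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]

optimizer = nlp.initialize(make_examples)
for _ in range(10):
    losses = {}
    nlp.update(make_examples(), sgd=optimizer, losses=losses)

# doc.cats holds a score per label for unseen text
doc = nlp("have a lovely day everyone")
```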


For this scenario, I got an ROC-AUC Score of 0.78 and an F1-Score of 0.66.


spaCy with document vectors

In this example, we will continue to use spaCy, but instead of its built-in text classifier, we will use spaCy to generate document vectors and then feed those vectors to an XGBoost classifier. Let’s see if that improves our score.


Here we get an ROC-AUC Score of 0.80 and an F1-Score of 0.64.


Deep Neural Network with LSTM cells

Finally, we will create a neural network model using bidirectional LSTM cells. We will use TensorFlow’s Keras library, along with its utilities for tokenization and padding of sequences.
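
A minimal sketch of such a model (toy data; this version uses Keras’s TextVectorization layer, which handles both the tokenization and the padding to a fixed sequence length in one step):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy stand-ins for the cleaned tweets and their 0/1 labels
tweets = np.array(["what a lovely day for everyone", "some hateful tweet goes here"])
labels = np.array([0, 1])

# Tokenize and pad every tweet to 40 integer token ids
vectorizer = layers.TextVectorization(max_tokens=10000, output_sequence_length=40)
vectorizer.adapt(tweets)
X = vectorizer(tweets).numpy()

# Embedding -> bidirectional LSTM -> sigmoid output for the 0/1 label
model = tf.keras.Sequential([
    layers.Embedding(10000, 64),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
model.fit(X, labels, epochs=2, verbose=0)
probs = model.predict(X, verbose=0)
```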


Here we get an ROC-AUC Score of 0.82 and an F1-Score of 0.41.


Conclusion

Let’s compare the scores now:


Model                               ROC-AUC Score   F1-Score
NLTK + Word2Vec + XGBoost           0.80            0.68
TextBlob (DecisionTreeClassifier)   0.68            0.46
spaCy (simple-cnn textcat)          0.78            0.66
spaCy doc vectors + XGBoost         0.80            0.64
Bidirectional LSTM (Keras)          0.82            0.41

The ROC-AUC scores of the models are pretty similar. Looking at the F1 scores, it seems like NLTK and spaCy are best suited for the job. However, it is definitely possible to further improve the models using various optimization techniques and hyperparameter tuning, especially the LSTM-based model.


The complete code for all the above examples is available in my GitHub Machine-learning repository.


Pre-trained models have been creating a lot of buzz lately. With the recent release of OpenAI’s GPT-3, excitement is at an all-time high. I am curious how they would perform on our humble text classification task.


That’s what I am going to try next. Keep watching this space for more.


Translated from: https://towardsdatascience.com/which-is-the-best-nlp-d7965c71ec5f
