Sentiment Analysis of a Book through Supervised Learning

In this tutorial, I will explain how to calculate the sentiment of a book through a Supervised Learning technique based on Support Vector Machines (SVM).

This tutorial performs sentiment analysis of Saint Augustine's Confessions, which can be downloaded from the Project Gutenberg page. The masterpiece is split into 13 books (chapters). We have stored each book in a separate file, named number.txt (e.g. 1.txt and 2.txt). Each line of every file contains just one sentence.

The Jupyter notebook can be downloaded from my GitHub repository: https://github.com/alod83/papers/tree/master/aiucd2021

A similar experiment on the sentiment analysis of a book can be found at the following link: https://towardsdatascience.com/sentiment-analysis-of-a-book-through-unsupervised-learning-df876563dd1b. In that tutorial I explain how to exploit unsupervised learning techniques to perform sentiment analysis.

Getting Started

Supervised Learning needs some annotated text to train the model. Thus, the first step consists in reading the annotations file and storing it into a dataframe. The annotations file contains, for each sentence, the associated score: a positive, negative, or null (zero) number.

import pandas as pd

df = pd.read_csv('sources/annotations.csv')
df
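For reference, annotations.csv is a CSV file with two columns, sentence and score (the column names used throughout this tutorial); a hypothetical excerpt, with purely illustrative rows:

sentence,score
"Great are you O Lord and greatly to be praised",1
"I was tossed about and wasted in my inconstancy",-1
"And what is this",0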

We can calculate some statistics regarding the annotations, such as the number of positive, negative and neutral scores, as well as the total number of annotations. To do so, we can apply the count() method to the dataframe.

df.count()                     # total number of annotations
df[df['score'] > 0].count()    # number of positive scores
df[df['score'] < 0].count()    # number of negative scores
df[df['score'] == 0].count()   # number of neutral scores
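As a side note, the same breakdown can be obtained in a single call by grouping the scores by their sign; a minimal equivalent sketch (not in the original notebook):

import numpy as np

# -1, 0 and 1 identify negative, neutral and positive annotations respectively
np.sign(df['score']).value_counts()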

Prepare the training and test sets

In order to calculate the sentiment of each sentence, we will exploit a Supervised Learning technique based on a binary classification model. This model takes a sentence as input and returns 1 or 0, depending on whether the sentence is positively rated or not. Since our model is binary, we must remove all the annotations with a neutral score.

import numpy as np
# Remove any 'neutral' ratings equal to 0
df = df[df['score'] != 0]
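As a quick sanity check (not in the original notebook), we can verify that no neutral annotation survived the filter:

# Should pass silently: every remaining score is strictly positive or negative
assert (df['score'] != 0).all()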

Now we can add a column to the dataframe, called Positively Rated, containing 1 or 0, depending on whether the score is positive or negative. We use the np.where() function to assign the appropriate value to this new column of the dataframe.

df['Positively Rated'] = np.where(df['score'] > 0, 1, 0)
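Since the new column contains only 0s and 1s, its mean directly gives the fraction of positively rated sentences, which is a quick way to check the class balance; a small sketch (not in the original notebook):

# Fraction of sentences labelled as positive (class balance)
df['Positively Rated'].mean()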

Calculate the sentiment

We define an auxiliary function, called calculate_indexes(), which receives as input the supervised learning model and the CountVectorizer vectorizer (both described later).

Within the function, we open the file corresponding to each book through the open() function, read all its lines through file.readlines() and, for each line, calculate the score by applying the model's predict() function.

Then, we can define three indexes to calculate the sentiment of a book: the positive sentiment index (pi), the negative sentiment index (ni) and the neutral sentiment index (nui). The pi of a book corresponds to the number of positive sentences in a book divided by the total number of sentences of the book. Similarly, we can calculate the ni and nui of a book.

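In formulas, for a book with N sentences, P of which are classified as positive and G as negative (the rest being neutral):

pi = P / N
ni = G / N
nui = (N - P - G) / N

Note that the binary model used here classifies every sentence as either positive or negative, so nui is always zero; this is why the function below only computes pos_index and neg_index.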

def calculate_indexes(model, vect):
    pos_index = []
    neg_index = []
    for book in range(1, 14):
        # Read all the sentences of the current book
        with open('sources/' + str(book) + '.txt') as file:
            lines = file.readlines()
        pos = 0
        neg = 0
        for line in lines:
            # Predict whether the sentence is positive (1) or negative (0)
            score = model.predict(vect.transform([line]))
            if score == 1:
                pos += 1
            else:
                neg += 1
        # Positive and negative sentiment indexes of the current book
        n = len(lines)
        pos_index.append(pos / n)
        neg_index.append(neg / n)
    return pos_index, neg_index

Now we can train the algorithm. We define two different cases: in the first we do not consider n-grams; in the second we do. We define a function called train_algorithm(), which can be invoked specifying whether to use n-grams.

In the function, we firstly split the dataset into two parts, the training set and the test set, through the scikit-learn function train_test_split(). The training set will be used to train the algorithm; the test set will be used to test its performance.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sentence'],
                                                    df['Positively Rated'],
                                                    random_state=0)
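By default, train_test_split reserves 25% of the data for the test set. To see how many sentences end up in each split, a quick check (illustrative, not in the original notebook):

print('Training sentences:', len(X_train))
print('Test sentences:', len(X_test))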

Then we build the matrix of token counts, i.e. the matrix which contains, for each sentence, which tokens appear in it. This can be done through the CountVectorizer class, which receives as input the range of n-grams to consider (ngram_range). In this tutorial, at most 2-grams (i.e. unigrams and bigrams) are considered. We keep only tokens with a minimum document frequency of 5 (min_df=5).

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=5).fit(X_train)
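To get a feel for the vocabulary retained after the min_df=5 filter, the fitted vectorizer can be inspected; a minimal sketch (get_feature_names_out() requires scikit-learn >= 1.0; older versions expose get_feature_names() instead):

# Number of distinct tokens kept by the vectorizer
print(len(vect.get_feature_names_out()))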

Then we can build the model: we use the LinearSVC class provided by scikit-learn and train it on the training set through model.fit(X_train_vectorized, y_train), where X_train_vectorized is the document-term matrix obtained by transforming the training sentences.

from sklearn.svm import LinearSVC

# Transform the training sentences into the document-term matrix
X_train_vectorized = vect.transform(X_train)

model = LinearSVC()
model.fit(X_train_vectorized, y_train)
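As an optional inspection step, not part of the original tutorial, the weights learned by the linear SVM show which tokens push a sentence towards the positive or the negative class; a sketch:

import numpy as np

# Sort tokens by their learned weight: most negative first, most positive last
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()
print('Most negative tokens:', feature_names[sorted_coef_index[:5]])
print('Most positive tokens:', feature_names[sorted_coef_index[-5:]])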

Finally, we test the performance of the algorithm by predicting the output for the test set and comparing the results with the real outputs contained in the test set. As a metric, we measure the AUC, but we could also calculate other metrics.

from sklearn.metrics import roc_auc_score

predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
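For instance, accuracy and F1 score can be computed on the same predictions; a minimal sketch:

from sklearn.metrics import accuracy_score, f1_score

print('Accuracy:', accuracy_score(y_test, predictions))
print('F1:', f1_score(y_test, predictions))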

Once the model is trained, we can calculate the indexes through the calculate_indexes() function. Finally, we plot the results.

Here is the full code of the train_algorithm() function.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

def train_algorithm(df, ngrams=False):
    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(df['sentence'],
                                                        df['Positively Rated'],
                                                        random_state=0)
    if ngrams:
        # Fit the CountVectorizer to the training data specifying a minimum
        # document frequency of 5 and extracting 1-grams and 2-grams
        vect = CountVectorizer(min_df=5, ngram_range=(1, 2)).fit(X_train)
        print('NGRAMS')
    else:
        # Fit the CountVectorizer to the training data
        vect = CountVectorizer(min_df=5).fit(X_train)
        print('WITHOUT N-GRAMS')
    # Transform the documents in the training data to a document-term matrix
    X_train_vectorized = vect.transform(X_train)
    # Train the model
    model = LinearSVC()
    model.fit(X_train_vectorized, y_train)
    # Predict the transformed test documents
    predictions = model.predict(vect.transform(X_test))
    print('AUC: ', roc_auc_score(y_test, predictions))
    # Calculate the sentiment indexes of each book and plot them
    pos_index, neg_index = calculate_indexes(model, vect)
    X = np.arange(1, 14)
    plt.plot(X, pos_index, '-.', label='pos')
    plt.plot(X, neg_index, '--', label='neg')
    plt.legend()
    plt.xticks(X)
    plt.xlabel('Books')
    plt.ylabel('Indexes')
    plt.grid()
    if ngrams:
        plt.savefig('plots/svm-ngram-bsi.png')
    else:
        plt.savefig('plots/svm-1gram-bsi.png')
    plt.show()

Run experiments

Now we can run the experiments with n-grams enabled and disabled.

train_algorithm(df, ngrams=False)
train_algorithm(df, ngrams=True)
[Figure: Experiment without n-grams]

[Figure: Experiment with n-grams]

Lessons learned

In this tutorial I have shown you how to calculate the sentiment of every chapter of a book through a Supervised Learning technique:

  • Supervised Learning needs a training set, i.e. some manually annotated data. So before starting to play with models, you should do the boring task of annotating a (small) part of your data. The bigger the training set, the better the performance of the algorithm;

  • To calculate the sentiment of a sentence, you should use a (binary) classification algorithm, such as Support Vector Machines (SVM). The scikit-learn library provides many algorithms for classification;

  • Once you have chosen the model, you can train it with the training set;

  • Don’t forget to test the performance of the model :)

Translated from: https://towardsdatascience.com/sentiment-analysis-of-a-book-through-supervised-learning-d56ba8946c64
