Exploring the Link Between COVID-19 and Depression Using Neural Networks

The drastic changes in our lifestyles, coupled with the restrictions, quarantines, and social distancing measures introduced to combat the coronavirus outbreak, have led to an alarming rise in mental health issues all over the world. Social media is a powerful indicator of the mental state of people at a given location and time. To study the link between the coronavirus pandemic and the accelerating pace of depression and anxiety in the general population, I decided to explore tweets related to the coronavirus.

How is this blog organized?

In this blog post, I will first use Keras to train a neural network to recognize depressive tweets. For this, I will use a data set of 10,314 tweets divided into depressive tweets (labelled 1) and non-depressive tweets (labelled 0). This data set was compiled by Viridiana Romero Martinez. Here is the link to her GitHub profile: https://github.com/viritaromero

Once the network is trained, I will use it to classify tweets scraped from Twitter. To establish the link between COVID-19 and depression, I will obtain two different data sets. The first data set will consist of tweets with coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, ‘pandemic’, and ‘virus’. The second data set will consist of random tweets searched using neutral keywords such as ‘and’, ‘I’, ‘the’ etc. The second data set will serve as a control to check the percentage of depressive tweets in a random sample of tweets. This will allow us to measure the difference between the percentage of depressive tweets in a random sample and in a sample of COVID-19 specific tweets.

Preprocessing the data

Image source: https://xaltius.tech/why-is-data-cleaning-important/

Before we can get started with training the neural networks, we need to collect and clean the data.

Importing the libraries

To get started with the project, we will first need to import all the necessary libraries and modules.

Once we have all the libraries in place, we need to get the data and pre-process it. You can download the data set from this link: https://github.com/viritaromero/Detecting-Depression-in-Tweets/blob/master/sentiment_tweets3.csv

Quick examination of the data

We can quickly check the structure of the data set by reading it into a pandas data frame.

Now we will store the text of the tweets in an array called text. The corresponding labels of the tweets will be stored in a separate array called labels.
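
The step above can be sketched as follows. This is a minimal, dependency-free illustration, not the code from the original post: it uses Python's standard csv module in place of pandas, reads from a tiny in-memory stand-in for the file, and the column names are an assumption that should be checked against the header of the downloaded CSV. The helper name load_tweets is hypothetical.

```python
import csv
import io

# Tiny stand-in for sentiment_tweets3.csv. The column names are assumed,
# so compare them with the header row of the real file before relying on them.
SAMPLE_CSV = """Index,message to examine,label (depression result)
0,just had a great coffee with friends,0
1,i feel so empty and tired of everything,1
2,"sunny day, going for a run",0
"""


def load_tweets(handle,
                text_col="message to examine",
                label_col="label (depression result)"):
    """Read tweet texts and integer labels from an open CSV file handle."""
    text, labels = [], []
    for row in csv.DictReader(handle):
        text.append(row[text_col])
        labels.append(int(row[label_col]))
    return text, labels


text, labels = load_tweets(io.StringIO(SAMPLE_CSV))
print(len(text), labels)
```

For the real file, the same function could be called on a handle opened with open("sentiment_tweets3.csv", newline="", encoding="utf-8").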

Apologies for printing out a rather huge data set, but I did it so that we can quickly examine the overall structure. The first thing I notice is that in the labels array, there are many more zeros than ones. This means that we have roughly 3.5 times more non-depressive tweets than depressive tweets in the data set. In an ideal situation, I would like to train my neural network on a data set with an equal number of depressive and non-depressive tweets. However, to obtain an equal number of depressive and non-depressive tweets, I would have to substantially truncate my data. I think a larger, imbalanced data set is better than a very small, balanced one, so I am going to go ahead and use the data set in its original state.
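
A quick way to check the balance described above is to count the labels. The sketch below uses a made-up toy list, not the real data set, and the helper name class_ratio is hypothetical.

```python
from collections import Counter


def class_ratio(labels):
    """Return how many non-depressive (0) tweets there are per depressive (1) tweet."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# Toy labels for illustration only: 7 non-depressive, 2 depressive.
toy_labels = [0] * 7 + [1] * 2
print(Counter(toy_labels), class_ratio(toy_labels))
```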

Cleaning the data

The second thing you’ll notice is that the tweets contain a lot of so-called ‘stopwords’ such as ‘a’, ‘the’, ‘and’ etc. These words are not important for classifying a tweet as depressive or non-depressive, so we will remove them. We also need to remove the punctuation, as it is likewise unnecessary and will only decrease the performance of our neural network.
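
A minimal sketch of this cleaning step, under stated assumptions: the stopword list here is a short illustrative one (the original post most likely used a fuller list from a library), and the function name clean_tweet is hypothetical.

```python
import string

# Short illustrative stopword list (an assumption, not the list used in the post).
STOPWORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of", "in", "i", "it"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(tweet):
    """Lowercase a tweet, strip punctuation, and drop stopwords."""
    no_punct = tweet.lower().translate(_PUNCT_TABLE)
    return " ".join(word for word in no_punct.split() if word not in STOPWORDS)


print(clean_tweet("I feel sad, and the day is SO long!"))
```

Note that stripping all punctuation also removes characters such as ‘#’ and ‘@’, so hashtags and mentions are reduced to plain words.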

After cleaning, I did a quick visualization of the data using the amazing WordCloud library; the result is shown below. Quite unsurprisingly, the most common word in depressive tweets is depression.

Visualization of tweets using WordCloud

Tokenization of the data

What on earth is tokenization?

Basically, neural networks do not understand raw text as we humans do. Therefore, to make the text more palatable to our neural network, we convert it into a series of ones and zeroes.

Image source: inboundhow.com

To tokenize text in Keras, we import the Tokenizer class. This class builds a dictionary that maps a set number of unique words in our overall text to integer indices. Using this dictionary, Keras lets us create vectors that replace each word with its index value.

We also pad the shorter tweets and truncate the longer ones so that every vector has a length of exactly 100.
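
To make the idea concrete, here is a plain-Python sketch of what tokenization and padding do conceptually. It is an illustration written for this explanation, not the Keras implementation: the function names are hypothetical, the input is assumed to be already cleaned, and the length is set to 5 instead of 100 to keep the output readable. Padding and truncating on the left mirrors the default behaviour of Keras's pad_sequences.

```python
from collections import Counter


def build_word_index(texts, num_words=None):
    """Map each word to an integer index, most frequent word first.

    Index 0 is left unused so that it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.split())
    vocab = [word for word, _ in counts.most_common(num_words)]
    return {word: i for i, word in enumerate(vocab, start=1)}


def to_sequences(texts, word_index):
    """Replace every known word with its index and skip unknown words."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]


def pad_left(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


tweets = ["feel sad today", "feel happy", "sad sad feel tired today really"]
word_index = build_word_index(tweets)
sequences = to_sequences(tweets, word_index)
padded = pad_left(sequences, maxlen=5)
print(word_index)
print(padded)
```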

You might be wondering, ‘huh, we only converted words to numbers, not ones and zeroes!’ You are right. There are two ways we can take care of that: we can either convert the numbers into one-hot-encoded vectors or create an embedding matrix. One-hot-encoded vectors are usually very high dimensional and sparse, whereas embedding matrices are lower dimensional and dense. If you are interested, you can read more about it in the book ‘Deep Learning with Python’ by Francois Chollet. In this blog, I will be using embedding matrices, but before we initialize them, we need to take care of a few other things first.
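
The difference between the two representations can be sketched in a few lines. Everything here is illustrative: the vocabulary size of 10,000 and the embedding dimension of 8 are assumptions rather than values from the post, and the random numbers stand in for weights that a network would actually learn.

```python
import random


def one_hot(index, vocab_size):
    """Sparse representation: a vector of zeros with a single 1."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


def make_embedding_matrix(vocab_size, dim, seed=0):
    """Dense representation: one short row of floats per word index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(sequence, matrix):
    """Look up the embedding row for every index in a padded sequence."""
    return [matrix[i] for i in sequence]


vocab_size, dim = 10_000, 8
sparse = one_hot(3, vocab_size)
matrix = make_embedding_matrix(vocab_size, dim)
dense = embed([0, 0, 1, 2, 3], matrix)

print(len(sparse), sum(sparse))   # length of one one-hot vector, number of non-zeros
print(len(dense), len(dense[0]))  # words in the sequence, floats per word
```

With one-hot encoding, each word costs as many numbers as there are words in the vocabulary; with an embedding, each word costs only as many numbers as the chosen dimension.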

Shuffling the data

Photo by Sergi Viladesau on Unsplash

Another issue with the data that you might have identified earlier is that the text array contains all the non-depressive tweets first, followed by all the depressive ones. We therefore need to shuffle the data so that random samples of tweets go into the training, validation, and test sets.
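
The important detail when shuffling is that tweets and labels must be reordered with the same permutation, otherwise the labels no longer match their tweets. A minimal sketch, with a hypothetical function name and a fixed seed chosen only for reproducibility:

```python
import random


def shuffle_together(text, labels, seed=42):
    """Shuffle tweets and labels with one shared random permutation."""
    order = list(range(len(text)))
    random.Random(seed).shuffle(order)
    return [text[i] for i in order], [labels[i] for i in order]


# Toy data laid out like the real set: all 0-labelled tweets first, then the 1s.
text = ["ok one", "ok two", "ok three", "low one", "low two"]
labels = [0, 0, 0, 1, 1]

shuffled_text, shuffled_labels = shuffle_together(text, labels)
print(list(zip(shuffled_text, shuffled_labels)))
```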

Splitting the data

Now we need to split the data into the training, validation, and test sets.
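
A minimal sketch of such a split on already shuffled data. The 60/20/20 proportions are an assumption made for illustration, since the split sizes are not stated here, and the function name is hypothetical.

```python
def split_data(items, train_frac=0.6, val_frac=0.2):
    """Slice a shuffled list into training, validation, and test parts."""
    n = len(items)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return items[:train_end], items[train_end:val_end], items[val_end:]


items = list(range(10))  # stand-in for the shuffled, padded tweets
train, val, test = split_data(items)
print(len(train), len(val), len(test))
```

The same function has to be applied to the labels so that both lists are cut at the same positions.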

Phew! Finally done with all the data munging!

Making a neural network

Image source: extremetech.com

Now we can start making the model architecture.

I will be trying two different models: one with a pre-trained word embeddings layer and one with a trainable word embeddings layer.

In order to define the neural network architecture, you need to understand how word embeddings work. There is a wealth of information online about word embeddings. This blog post is one of my favorites:

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

Now that you hopefully have an idea of the function of the embeddings layer, I will go ahead and create it in code.

First model

For the first model, the architecture consists of a pre-trained word embeddings layer followed by two dense layers. The code for training the model is as follows:

Figure: Accuracy and loss for training and validation sets in model 1

Here we can see that the model performs very well, with an accuracy of 98% on the test set. Overfitting is unlikely to be an issue because the validation accuracy and loss are almost the same as the training accuracy and loss.

The second model

For the second model, I decided to drop the pre-trained embeddings and use a trainable word embeddings layer instead. The code is as follows.

Figure: Accuracy and loss for training and validation sets in model 2

Both models are equally accurate on the test set. However, since the second model is less complex, I will use it to predict whether a tweet is depressive or not.

Obtaining data from Twitter for COVID-19 related tweets

To obtain my data sets of tweets, I used twint, an amazing web-scraping tool for Twitter. I prepared two different data sets of 1000 tweets each. The first one consisted of tweets containing coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, and ‘pandemic’.

To get a control sample to compare against, I searched for tweets containing neutral keywords such as ‘the’, ‘a’, ‘and’ etc. Using 1000 tweets from this search, I made up the second data set, which serves as the control.

WordCloud of COVID-related tweets

I cleaned the data sets using a procedure similar to the one I used for cleaning the training set. After cleaning the data, I fed it to my neural network to predict the percentage of depressive tweets. The results I obtained were surprising.

One run of the code is shown below. I repeated it with different batches of data obtained using the same procedure as described above and calculated the average results.
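
Turning the network's outputs into a percentage of depressive tweets can be sketched as follows. This assumes the model outputs one probability per tweet (for example from a sigmoid unit) and that a tweet counts as depressive at 0.5 or above; both are assumptions, and the numbers below are made up for illustration, not results from the model.

```python
def percent_depressive(probabilities, threshold=0.5):
    """Percentage of tweets whose predicted probability reaches the threshold."""
    flagged = sum(1 for p in probabilities if p >= threshold)
    return 100.0 * flagged / len(probabilities)


# Made-up model outputs for eight tweets (illustration only).
fake_outputs = [0.91, 0.12, 0.55, 0.08, 0.47, 0.73, 0.30, 0.66]
print(percent_depressive(fake_outputs))
```

Averaging this percentage over several scraped batches gives the kind of summary figure reported next.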

On average, my model predicted 35% depressive and 65% non-depressive tweets in the data sets obtained using neutral keywords. For a randomly obtained sample, 35% depressive tweets is an alarmingly high number. However, the share of depressive tweets among those with COVID-related keywords was even higher: 55% depressive vs. 45% non-depressive. That is a 57% relative increase in depressive tweets (20 percentage points)!
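
The 57% figure is a relative increase, computed from the two averages quoted above (35% for the control sample, 55% for the COVID-related sample). A short worked check, with a hypothetical helper name:

```python
def relative_increase(baseline, observed):
    """Relative change from baseline to observed, in percent."""
    return 100.0 * (observed - baseline) / baseline


control, covid = 35, 55  # average percentages of depressive tweets from the text
print(covid - control)                               # absolute gap in percentage points
print(round(relative_increase(control, covid), 1))   # relative increase in percent
```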

This leads to the conclusion that there is indeed a correlation between COVID-19 and depressive sentiments in tweets on Twitter.

Conclusion

I hope this post helped you learn a bit more about sentiment analysis using machine learning and I hope you will try out a similar project yourself as well.

Happy coding!

Credits: Slater on Giphy

Translated from: https://towardsdatascience.com/exploring-the-link-between-covid-19-and-depression-using-neural-networks-469030112d3d
