Day 7. Towards Preemptive Detection of Depression and Anxiety in Twitter

Article link: https://blog.csdn.net/weixin_37996254/article/details/109679555

Title:
Towards Preemptive Detection of Depression and Anxiety in Twitter

Abstract:
Depression and anxiety are psychiatric disorders that are observed in many areas of everyday life. For example, these disorders manifest themselves somewhat frequently in texts written by nondiagnosed users in social media. However, detecting users with these conditions is not a straightforward task as they may not explicitly talk about their mental state, and if they do, contextual cues such as immediacy must be taken into account. When available, linguistic flags pointing to probable anxiety or depression could be used by medical experts to write better guidelines and treatments. In this paper, we develop a dataset designed to foster research in depression and anxiety detection in Twitter, framing the detection task as a binary tweet classification problem. We then apply state-of-the-art classification models to this dataset, providing a competitive set of baselines alongside qualitative error analysis. Our results show that language models perform reasonably well, and better than more traditional baselines. Nonetheless, there is clear room for improvement, particularly with unbalanced training sets and in cases where seemingly obvious linguistic cues (keywords) are used counter-intuitively.
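To make the abstract's last point concrete, here is a minimal sketch of a keyword-only labeler for the binary tweet classification task. It is not the authors' method: the keyword list, the example tweets, and their gold labels are all invented for illustration, since this excerpt gives none of them. It only shows how a seemingly obvious cue can misfire in both directions.

```python
import re

# Hypothetical keyword list; the paper's actual keywords are not given in this excerpt.
KEYWORDS = {"depressed", "depression", "anxiety", "anxious"}


def keyword_flag(tweet: str) -> int:
    """Return 1 if the tweet contains any keyword, else 0."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    return int(bool(tokens & KEYWORDS))


# Invented tweets with invented gold labels (1 = signals depression or anxiety).
EXAMPLES = [
    ("I have been feeling depressed for weeks and cannot sleep", 1),
    ("great talk on the economics of the great depression", 0),
    ("nothing feels worth doing anymore", 1),
    ("exam anxiety is a myth, I loved that test", 0),
]


def evaluate(examples):
    """Return the (tweet, gold, predicted) triples that the keyword rule gets wrong."""
    errors = []
    for tweet, gold in examples:
        pred = keyword_flag(tweet)
        if pred != gold:
            errors.append((tweet, gold, pred))
    return errors


if __name__ == "__main__":
    for tweet, gold, pred in evaluate(EXAMPLES):
        print(f"gold={gold} pred={pred}: {tweet}")
```

On these toy examples the rule fires on a keyword used in an unrelated sense and misses a tweet that contains no keyword at all, which is the kind of case a contextual model would need to handle.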

Highlight:
In this paper, we build a classification dataset (footnote 2) to assist in the detection of depression and anxiety in Twitter, and compare several text classification baselines. The results show that state-of-the-art language models (LMs henceforth) like BERT (Devlin et al., 2019) unsurprisingly outperform competing baselines. However, when the dataset shows an unbalanced distribution, linear models perform on par. Finally, alongside quantitative results, we also provide a qualitative analysis through which we aim to better understand the strengths and limitations of the models under study. Further, we identify the linguistic patterns alluding to the presence of depression and anxiety that elude all of the classifiers, and consider how we might improve performance against such patterns in the future.
Footnote 2: https://bitbucket.org/nlpcardiff/preemptive-depression-anxiety-twitter
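Since the highlight singles out unbalanced distributions, a small sketch of why class balance matters when comparing classifiers may help. This excerpt does not say which metrics the authors report, so the choice of accuracy, precision, recall and F1 here is an assumption, and the label counts and both "classifiers" are invented.

```python
def binary_metrics(gold, pred):
    """Compute accuracy plus precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    accuracy = (tp + tn) / len(gold)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented unbalanced test set: 90 negative tweets followed by 10 positive ones.
GOLD = [0] * 90 + [1] * 10

# Trivial baseline that never predicts the positive class.
ALWAYS_NEGATIVE = [0] * 100

# Hypothetical classifier: 4 false alarms on negatives, 6 of 10 positives found.
MODEST = [0] * 86 + [1] * 4 + [1] * 6 + [0] * 4

if __name__ == "__main__":
    print("always negative:", binary_metrics(GOLD, ALWAYS_NEGATIVE))
    print("modest model:   ", binary_metrics(GOLD, MODEST))
```

The always-negative baseline scores high on accuracy while its positive-class F1 is zero, so on an unbalanced split the two systems look much closer on accuracy than they do on F1.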

3 Dataset Construction
3.1 Tweet collection
First, we used Twitter’s Stream API to compile a large corpus of tweets. All tweets were in English and were published between May 2018 and August 2019. 3 We only cons