Twitter Data Cleaning and Preprocessing for Data Science

weixin_26746401

Original article: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd

In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share the opinions and sentiments people have about what is going on in the world around them.

Opinion mining (also known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media, and to healthcare materials, for applications that range from marketing to customer service to clinical medicine.

Both lexicon-based and machine-learning-based approaches will be used for emoticon-based sentiment analysis. We start with machine-learning-based clustering. In the machine-learning-based approach we use both supervised and unsupervised learning methods. The Twitter data are collected and given as input to the system. The system classifies each tweet as positive, negative, or neutral, and also reports the number of positive, negative, and neutral tweets for each emoticon separately in the output. In addition, the polarity of each tweet is determined.
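
The output described above (a label per tweet plus per-emoticon counts) can be illustrated with a minimal lexicon-based sketch. The emoticon lexicon, its scores, and the function names `classify_tweet` and `count_by_emoticon` are assumptions made for illustration and do not come from the original article; the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter, defaultdict

# Hypothetical emoticon lexicon: +1 positive, -1 negative, 0 neutral.
EMOTICON_POLARITY = {
    ":)": 1, ":-)": 1, ":D": 1,
    ":(": -1, ":-(": -1, ":'(": -1,
    ":|": 0,
}


def classify_tweet(text):
    """Label a tweet by summing the polarity of the emoticons it contains."""
    score = sum(EMOTICON_POLARITY.get(token, 0) for token in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def count_by_emoticon(tweets):
    """For each emoticon, count how many tweets containing it got each label."""
    counts = defaultdict(Counter)
    for text in tweets:
        label = classify_tweet(text)
        # Each emoticon is counted once per tweet, even if it is repeated.
        for emoticon in EMOTICON_POLARITY.keys() & set(text.split()):
            counts[emoticon][label] += 1
    return counts


tweets = ["great day :)", "missed my flight :(", "meh :|", "mixed :) :("]
labels = [classify_tweet(t) for t in tweets]
per_emoticon = count_by_emoticon(tweets)
```

A tweet with no known emoticon, or with emoticons that cancel out, falls into the neutral class here. A real system would need a tokenizer that also finds emoticons attached to words or punctuation.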

Collection of Data

To collect the Twitter data, we carried out a data mining process. For this, we built our own application with the help of the Twitter API and used it to collect a large number of tweets. This requires creating a developer account and registering the app. On registration we received a consumer key and a consumer secret, which are used in the application settings. From the app's configuration page we also need an access token and an access token secret, which give the application access to Twitter on behalf of the account. The process is divided into two sub-processes, discussed in the next subsection.
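
One way to keep the four credentials out of source code is to read them from environment variables and fail early if any is missing. The sketch below is an illustration only: the environment variable names, the `TwitterCredentials` class, and `load_credentials` are hypothetical and not part of the Twitter API or the original article.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TwitterCredentials:
    """The four values issued for a registered Twitter app."""
    consumer_key: str
    consumer_secret: str
    access_token: str
    access_token_secret: str


# Hypothetical environment variable names, one per credential field.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=os.environ):
    """Build TwitterCredentials from a mapping, raising if any value is absent."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return TwitterCredentials(
        **{field: env[name] for field, name in ENV_NAMES.items()}
    )
```

The resulting object can then be handed to whichever OAuth client library the application uses to talk to the Twitter API.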
