The uptick in Twitter user activity during the recent lockdown made the platform seem like a good place to look for a quarantine project that would build my machine learning skills. Specifically, as misinformation and baffling conspiracy theories took hold of the U.S.’s online population, finding new ways to identify bad actors seemed like an increasingly relevant task.
In this post, I’ll demonstrate how to build a model that predicts whether Twitter users are humans or bots, using only a minimum viable graph representation of each user and a handful of Python network graphing and machine learning packages.
Outline
1. Preliminary Research
2. Data Collection
3. Data Conversion
4. Training the Classification Model
5. Closing Thoughts / Room for Improvement
Technical Notes
All programming, data collection, etc. was done in a Jupyter Notebook. Libraries used:
tweepy
pandas
igraph
networkx
numpy
json
csv
ast
itemgetter (from operator)
re
Graph2Vec (from karateclub)
xgboost
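Since the notebook depends on several third-party packages, a quick preflight check can save a failed run partway through. The following is only a minimal sketch, not part of the original notebook: the helper name missing_packages and the two list constants are hypothetical, and the entries are the import names from the list above (which can differ from the names used to install them).

```python
import importlib.util

# Import names of the third-party packages listed above (assumed names;
# the pip package name can differ from the import name).
THIRD_PARTY = ["tweepy", "pandas", "igraph", "networkx", "numpy", "karateclub", "xgboost"]

# Standard-library modules from the same list; these ship with Python.
STANDARD_LIBRARY = ["json", "csv", "ast", "operator", "re"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(THIRD_PARTY + STANDARD_LIBRARY)
    if missing:
        print("Install before running the notebook:", ", ".join(missing))
    else:
        print("All libraries are available.")
```

Once everything is available, Graph2Vec is imported from karateclub and itemgetter from operator, as noted in the list; the remaining libraries are imported by the names shown.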
Finally, four resources were key to this task, each of which I will discuss later in this writeup:
The Indiana University Network Science Institute’s Bot Repository,
Jacob Moore’s tutorial on identifying Twitter influencers, using Eigenvector Centrality as a metric,
Karate Club, an extension of NetworkX,
and the XGBoost gradient boosting library.
Let’s get to it!
Preliminary Research
While bot detection as a goal is nothing new, to the extent that a project like this would have been impossible without drawing on