Fake News Detection Using NLP

Nothing New Under The Sun

Fake News is being talked about by everyone from your best friend to your parents; perhaps even your goldfish are whispering about it in the corners of the tank. It’s even being covered by Real News at an alarming clip. Dictionary.com listed ‘misinformation’ as its Word of the Year in 2018. However, this isn’t a particularly new problem, right? After all, Jonathan Swift wrote in 1710, “Falsehood flies, and the Truth comes limping after it”. And then there is the more famous line, often attributed to Mark Twain, “A lie can travel halfway around the world before the truth can get its boots on.”

So, if this isn’t a new problem, why is it one of the most talked-about topics today?

Mo Data, Mo Problems

We won’t dive into all the intricate reasons here. However, one of the most obvious reasons this problem has proliferated and permeated every crevice of our lives is easy to understand. In a word: accessibility.

It isn’t hard to see how quickly things can get out of hand. The ability for almost anyone, anywhere in the world to publish or share an article, video or podcast comes at a cost. It would take me 20 minutes at least to verify that this story is true! Nah, I’ll just retweet it because it looks true enough.

So, let’s review a project, then discuss a bit more about where things are and where they could be going.

Detect Fake News Using NLP

We will be using two datasets for this project: one of real news and one of fake news. Let’s take a look at the first five observations for each.

Image for post

We can see the datasets are not that different. Which I feel is a metaphor for this entire issue… Well, anyway, we have a title, text, subject, and date for each observation — in both datasets.

But, I’m sure you can already see some of the differences. Take a look at the text for the real articles. Each text string starts with the location of the story, then it’s followed by the name of the news outlet. Before running this through an algorithm, we can already see that there is a key difference between the datasets.
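
Because that lead-in appears only in the real articles, a model could end up keying on it rather than on the writing itself. Below is a minimal sketch of stripping such a prefix before modeling. It assumes the prefix looks like `LOCATION (Outlet) - text`; the regex, the function name `strip_dateline`, and the sample string are illustrative assumptions, not taken from the original project.

```python
import re

# Assumed shape: "<LOCATION> (<Outlet>) - <body>"; adjust to the real data.
DATELINE = re.compile(r"^\s*[A-Z][A-Za-z .,/]*\s*\([^)]+\)\s*-\s*")


def strip_dateline(text: str) -> str:
    """Remove a leading 'LOCATION (Outlet) - ' prefix if one is present."""
    return DATELINE.sub("", text, count=1)


# Made-up sample; the outlet name is a placeholder.
sample = "SPRINGFIELD (Example Wire) - Lawmakers met on Tuesday."
print(strip_dateline(sample))
```

Text without such a prefix passes through unchanged, so the function could be applied to both datasets.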

Image for post

After we add a new column to each dataframe to distinguish if it is a real or fake article, we concatenate them into a single dataframe. Looking at a sample of the new dataframe we see another difference. None of the real articles have ALL CAPS in the title. However, in this cross-section, we can see that three of the five fake articles have all caps in the title. Also, we can see that the subjects are similar, but not the same.

Image for post

We have 21,417 and 23,481 observations for real and fake news, respectively. So, we have a decently balanced dataset, with no null values in the 44,898 observations.
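
The labeling, concatenation, and balance check described above might look roughly like the sketch below. The two tiny frames and the `caps_in_title` column are hypothetical stand-ins; only the `category` column, with 1 meaning real, matches what the word-cloud code later in this article filters on.

```python
import pandas as pd

# Tiny stand-in frames; the real project loads the two datasets instead.
real = pd.DataFrame({
    "title": ["Senate passes budget bill", "Court hears appeal"],
    "text": ["...", "..."],
})
fake = pd.DataFrame({
    "title": ["YOU WON'T BELIEVE THIS", "Shocking claim goes viral", "WATCH NOW"],
    "text": ["...", "...", "..."],
})

# Label each frame, then stack them into a single frame.
real["category"] = 1  # 1 = real
fake["category"] = 0  # 0 = fake
df = pd.concat([real, fake], ignore_index=True)

# Flag titles containing at least one fully upper-case word of 3+ letters.
df["caps_in_title"] = df["title"].str.contains(r"\b[A-Z]{3,}\b", regex=True)

print(df["category"].value_counts())                   # class balance
print(df.isnull().sum().sum())                         # total null cells
print(df.groupby("category")["caps_in_title"].mean())  # share of caps titles
```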

Chart-attack

Now that we’re done wrangling and doing some basic prep/exploration with the data, let’s bust out some of those visualization libraries.

Image for post

Like we’ve already discussed, the data set we’re dealing with is decently balanced. Even with the difference in total counts being this close, we will still stratify the data when we do our train-test split.
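
As a rough illustration of what stratifying buys us, here is a sketch using scikit-learn's `train_test_split`. The toy frame, the 50/50 split size, and the random seed are assumptions made for the example, not the settings used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined frame: 6 real (1) and 4 fake (0) rows.
df = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "category": [1] * 6 + [0] * 4,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["category"],
    test_size=0.5,            # illustrative; not the split used in the article
    stratify=df["category"],  # keep the real/fake ratio in both halves
    random_state=42,
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With `stratify` set, both halves keep the same 60/40 mix as the full frame, so neither the training nor the test set drifts toward one class by chance.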

After reviewing the dataframe I started to wonder about the subject column. My initial hypothesis was that it might be a useful column to include in our analysis. With that in mind, we did a value count on the totals per category.

Image for post

Even though there is a low representation in a few of the categories I still thought this might be something meaningful to consider when doing our analysis. However, taking it one step further…

Image for post

Ah, there it is. Look at us continuing to find patterns! Who needs those Machine Learning algorithms anyway? We do. We all do. But, mostly me.

It does seem odd that the real news articles only have two different categories; it’s the most unbalanced thing about our data. Perhaps all the articles were pulled from just a few sources, and from a few categories within those sources. All the more reason to use as much data as possible for this type of work.
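
The per-subject counts, and the same counts split by real versus fake, can be sketched with pandas as follows. The subject labels and rows here are made up for illustration and are not the categories in the actual data.

```python
import pandas as pd

# Toy stand-in; the subject labels are invented for illustration.
df = pd.DataFrame({
    "subject": ["politics", "world", "politics", "opinion", "local", "opinion"],
    "category": [1, 1, 1, 0, 0, 0],
})

# Totals per subject across the whole frame.
print(df["subject"].value_counts())

# The same counts broken out by real (1) vs fake (0).
by_class = pd.crosstab(df["subject"], df["category"])
print(by_class)

# Subjects that occur in only one class give the label away on their own.
one_sided = by_class[(by_class == 0).any(axis=1)].index.tolist()
print(one_sided)
```

If every subject turns out to be one-sided like this, the column says more about how the data was collected than about the articles, which is a reasonable argument for leaving it out of the model.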

This points to the bigger picture of what makes exploratory data analysis fun and insightful. We’ve found some good insights, even about the data scraping process used.

That cloud is in the shape of a duck

Before we get to everyone’s favorite NLP graphic, the Word Cloud, we’ll need to clean up the dataframe a bit.

Image for post

Each cleaning step has been broken out into its own function. When you’re writing functions, you have the option to create a One Function To Rule Them All, Mother Of All Functions. These are not necessarily bad. However, I find that approach more difficult to debug and implement. In this example, if we wanted to change the language of the stop-words, or if any individual function didn’t work, we’d be able to change just that function without other issues potentially cropping up.
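
The original cleaning code is only shown as an image, so the following is a sketch of the one-function-per-step idea rather than the project's actual functions. The function names, the tiny stop-word set, and the choice to drop bracketed content entirely are all assumptions.

```python
import re

# Small stand-in stop-word list; swap in a full list (or another language) here.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}


def remove_brackets(text: str) -> str:
    """Drop anything inside square brackets, brackets included."""
    return re.sub(r"\[[^\]]*\]", "", text)


def remove_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Keep only the tokens that are not in the stop-word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


def clean_text(text: str) -> str:
    """Compose the small steps; each one can be changed on its own."""
    return remove_stopwords(remove_brackets(text))


print(clean_text("The senator [video] is in the capital"))
```

Changing the stop-word language then means touching only `STOP_WORDS` or `remove_stopwords`, and a bug in bracket handling can be tested against `remove_brackets` in isolation.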

We’ve removed English stop-words, removed some bracket punctuation, etc. and now we can make a few word-clouds. We can use this little piece of code to make them.

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# Text from the real news articles (df is the combined, cleaned dataframe)
plt.figure(figsize=(20, 20))
wc = WordCloud(max_words=2000, width=1600, height=800, stopwords=STOPWORDS).generate(" ".join(df[df.category == 1].text))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
Image for post
Real News Word Cloud
Image for post
Fake News Word Cloud

We can certainly see some differences between these two word-clouds. I prefer a more custom approach when making word-clouds. In the case of these two data sets, I went with these:

Image for post
Image for post

Some of the words have similar representations. This isn’t too surprising, since most of the articles were listed as being of a political nature. A few standout differences for me are Hillary Clinton and Obama in the fake news word-cloud.

Accurate, but does it matter?

If you want to see all the code used during the modeling process, head over to GitHub. Here are the results:

Image for post

So, we’re left with a fairly accurate model using basic NLP libraries and techniques.
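
The article does not spell out which model produced these results, so the sketch below is only one plausible example of "basic NLP libraries and techniques": TF-IDF features fed into a logistic regression via scikit-learn. The model choice, the toy corpus, and the labels are all assumptions, and the corpus is far too small to say anything about accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; 1 = real, 0 = fake. Invented purely for illustration.
texts = [
    "senate passes budget after long debate",
    "court hears appeal on trade ruling",
    "ministers meet to discuss border policy",
    "shocking secret they do not want you to see",
    "you will not believe this shocking video",
    "secret video exposes shocking truth",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features weighted by TF-IDF, fed to a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

preds = model.predict(["shocking secret video", "senate debate on budget"])
print(preds)
```

On the real data, the same pipeline would be fit on the training split and scored on the held-out test split.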

At the start of this article, we talked briefly about how fake news, misinformation, and propaganda are nothing new. We can see that some basic techniques can produce decent results without much tuning. So, what makes it so difficult for platforms like Facebook to root out fake news shared on their platform?

Well, a tremendous number of things. In Facebook’s own words,

[false] news is harmful to our community, it makes the world less informed, and it erodes trust. It’s not a new phenomenon, and all of us — tech companies, media companies, newsrooms, teachers — have a responsibility to do our part in addressing it. At Facebook, we’re working to fight the spread of false news in three key areas:

disrupting economic incentives because most false news is financially motivated;

building new products to curb the spread of false news; and

helping people make more informed decisions when they encounter false news.

Across all of their platforms, they are attempting to find a suitable solution with a multi-pronged approach, though there may be more prongs that they aren’t sharing publicly. Even with world-class data teams at tech companies around the world, fake news remains elusive, and it will continue to challenge what seems logically possible: its eradication. Or, at the very least, the ability to find as much of it as possible and remove it from their platforms.

The fairly straightforward nature of this project can lead to a misguided understanding of the problem and, more importantly, of the solution.

Watching from the sidelines to see what strategies get implemented to curtail this major issue is exciting. Joining one of these amazing teams to help bring those strategies to life, though, would be a bit more exciting.

LinkedIn

Connect with me on LinkedIn: https://www.linkedin.com/in/wchasethompson

Translated from: https://medium.com/swlh/fake-news-detection-using-nlp-e744a6909276
