Using Bayesian Classifiers to Detect Fake News

So much fake news is in circulation that it is difficult to find sources of accurate, unfabricated news. This article uses a Naive Bayes classifier to classify news headlines as real or fake.

What is the Naive Bayes Classifier?

The Naive Bayes classifier is a probabilistic algorithm that uses Bayes' theorem to classify data. Let’s look at an example:

Suppose you want to predict the probability that it will rain today. Over the last few days, you have collected data by looking at the clouds in the sky. Here is the table of your data:

[Table: number of days with grey clouds and with white clouds, split by whether it rained or not]

This table records how many times each feature appeared on days when it rained and on days when it did not. In effect, it gives us the probability of seeing grey clouds or white clouds, given that it rained or given that it did not.

Now, armed with data, let’s make a prediction. Today we have seen grey clouds and no white clouds. Is it more likely to be a rainy day or a sunny day? To answer this question, we have to use Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)

This theorem uses past data to update the probability of an event (rain) once some evidence (grey clouds) has been observed.

Here A is the event that it rains and B is the event that grey clouds appear. The probability of rain, given that grey clouds appeared, equals the probability of grey clouds appearing, given that it rained, multiplied by the probability of rain, divided by the probability of grey clouds appearing.

Based on our data:

P(B|A) (probability of grey clouds, given that it rained) = 10/11

P(A) (probability of raining) = 11/(50+11) = 11/61

P(B) (probability of grey clouds) = 1 (grey clouds have been confirmed to have appeared)

Strictly speaking, P(B) is the overall fraction of days with grey clouds. Setting it to 1 gives an unnormalised score rather than a true probability, which is enough for comparing rain against no rain because both share the same denominator.

P(A|B) = P(B|A) * P(A) / P(B)

P(A|B) = 10/11 * 11/61 / 1

P(A|B) = 10/61

This is our result! Given that grey clouds appeared, the score for rain is 10/61: of the 61 recorded days, 10 were rainy days with grey clouds. To decide between rain and no rain, compute the same score for "no rain" and pick the larger of the two.
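
The arithmetic of this worked example can be checked with a few lines of Python. This is only a sketch using the figures quoted in the text (11 rainy days, 50 dry days, 10 rainy days with grey clouds, so 50 + 11 = 61 days in total); the variable names are illustrative, and P(B) is set to 1 as in the example.

```python
from fractions import Fraction

# Figures quoted in the worked example; the original table image is not available.
rainy_days = 11
dry_days = 50
rainy_days_with_grey_clouds = 10

p_grey_given_rain = Fraction(rainy_days_with_grey_clouds, rainy_days)  # P(B|A)
p_rain = Fraction(rainy_days, rainy_days + dry_days)                   # P(A)
p_grey = Fraction(1)                                                   # P(B), taken as 1 in the text

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
score_rain = p_grey_given_rain * p_rain / p_grey
print(score_rain)  # 10/61
```

Using Fraction keeps the result exact, which makes it easy to compare against the hand calculation.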

The Project:

With that brief introduction to Naive Bayes classifiers, let’s talk about using them for fake news detection.

We will count the number of times each word appears in headlines, given that the news is fake, and do the same for real news. We then convert those counts to probabilities and calculate a score for the headline being fake, which we compare with the score for the headline being real.

The dataset I used has over 21,000 instances of real news and 23,000 instances of fake news. For a typical dataset this might seem unbalanced, but the imbalance is what lets us calculate the initial (prior) probability: the probability that a headline is fake before considering any of its words. You can contact me for the dataset at victorwtsim@gmail.com.
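
As a rough illustration, the initial probability can be estimated from the instance counts alone. The figures below are the approximate counts quoted above, not exact values from the CSV files, and the variable names are assumptions made for this sketch.

```python
# Approximate instance counts quoted in the text; exact values depend on the CSV files.
n_true = 21_000
n_fake = 23_000

# Initial (prior) probability of each class, before looking at any words.
prior_fake = n_fake / (n_true + n_fake)
prior_true = n_true / (n_true + n_fake)

print(f"P(fake) = {prior_fake:.3f}, P(true) = {prior_true:.3f}")
```

Note that the program below derives its initial probabilities from total word counts rather than from the number of headlines, so its values will differ somewhat from these.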

The Code:

import pandas as pd
import string

These are the two dependencies for the program: pandas reads the csv files, and string supplies the punctuation characters that are stripped from a headline before scoring.

true_text = {}
fake_text = {}

true = pd.read_csv('/Users/XXXXXXXX/Desktop/True.csv')
fake = pd.read_csv('/Users/XXXXXXXX/Desktop/Fake.csv')

This script reads the two datasets, which contain the instances of fake and real news.

def extract_words(category, dictionary):
    for entry in category['title']:
        words = entry.split()
        for word in words:
            lower_word = word.lower()
            # Look up the lower-cased form, which is the key that gets stored
            if lower_word in dictionary:
                dictionary[lower_word] += 1
            else:
                dictionary[lower_word] = 1
    return dictionary

This function counts how many times each word appears in the headlines of one category (fake or real news). Each word is lower-cased, and its entry in the dictionary is incremented by one every time it occurs.
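
The same counting step can also be written with collections.Counter from the standard library. This is an alternative sketch, not the author's code: count_words is a hypothetical name, and a plain dict with a 'title' key stands in for the DataFrame so the example runs without the csv files.

```python
from collections import Counter

# Hypothetical stand-in for the fake news data: only the 'title' column is used.
fake_sample = {'title': ['Aliens Land in Ohio', 'Aliens endorse candidate']}

def count_words(category):
    """Return a Counter of lower-cased words across all titles in category."""
    counts = Counter()
    for title in category['title']:
        counts.update(title.lower().split())
    return counts

fake_counts = count_words(fake_sample)
print(fake_counts['aliens'])  # 2
```

A Counter returns 0 for words it has never seen, which removes the need for the if/else branch.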

def count_to_prob(dictionary, length):
    for term in dictionary:
        dictionary[term] = dictionary[term] / length
    return dictionary

This function converts each count into a probability by dividing it by the total number of words in the fake news headlines or in the real news headlines.
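
The script further down calls count_total, which the article never defines. A plausible definition, assuming it simply sums the word counts in a dictionary, is sketched here together with count_to_prob on a toy dictionary. The body of count_total is an assumption, not the author's code.

```python
def count_total(dictionary):
    """Hypothetical helper: total number of word occurrences in the dictionary."""
    return sum(dictionary.values())

def count_to_prob(dictionary, length):
    # Same logic as the function above, repeated so this sketch runs on its own.
    for term in dictionary:
        dictionary[term] = dictionary[term] / length
    return dictionary

counts = {'aliens': 2, 'land': 1, 'ohio': 1}  # toy counts, not from the dataset
total = count_total(counts)
probs = count_to_prob(counts, total)
print(probs)  # {'aliens': 0.5, 'land': 0.25, 'ohio': 0.25}
```

Because count_to_prob overwrites the dictionary in place, the total must be computed before the conversion, as the script below does.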

def calculate_probability(dictionary, X, initial):
    # Strip punctuation; str.translate returns a new string, so reassign it
    X = X.translate(str.maketrans('', '', string.punctuation))
    X = X.lower()
    split = X.split()
    probability = initial
    for term in split:
        if term in dictionary:
            probability *= dictionary[term]
            print(term, dictionary[term])
    return probability

This function multiplies the relevant probabilities to compute a “score” for the headline. Words that are not in the dictionary are skipped. To make a prediction, compare the score obtained with the fake news dictionary against the score obtained with the real news dictionary. If the fake news dictionary returns the higher score, the model predicts that the headline is fake news.
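
Multiplying many small probabilities can underflow to zero on long inputs, so a common variation is to add logarithms instead, which preserves the ordering of the scores. The sketch below is that variation, not the author's function: log_score is a hypothetical name, and the word probabilities are invented for illustration.

```python
import math
import string

def log_score(dictionary, headline, initial):
    """Sum of log-probabilities for the words of headline found in dictionary."""
    cleaned = headline.translate(str.maketrans('', '', string.punctuation)).lower()
    score = math.log(initial)
    for term in cleaned.split():
        if term in dictionary:  # unknown words are skipped, as in the function above
            score += math.log(dictionary[term])
    return score

# Toy word probabilities, invented for illustration only.
fake_probs = {'aliens': 0.5, 'land': 0.25, 'ohio': 0.25}
true_probs = {'aliens': 0.01, 'senate': 0.69, 'ohio': 0.3}

headline = 'Aliens land in Ohio!'
is_fake = log_score(fake_probs, headline, 1) > log_score(true_probs, headline, 1)
print(is_fake)  # True
```

Skipping unknown words means a dictionary that recognises fewer words of the headline multiplies in fewer factors, which can bias the comparison in its favour.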

true_text = extract_words(true, true_text)
fake_text = extract_words(fake, fake_text)

true_count = count_total(true_text)
fake_count = count_total(fake_text)

true_text = count_to_prob(true_text, true_count)
fake_text = count_to_prob(fake_text, fake_count)

total_count = true_count + fake_count
fake_initial = fake_count/total_count
true_initial = true_count/total_count

This script uses the functions above to build a dictionary of word probabilities for each class and to compute the initial probabilities fake_initial and true_initial. The dictionaries are used later to calculate the “score” for a headline.

X = 'Hillary Clinton eats Donald Trump'
calculate_probability(fake_text,X,1)>calculate_probability(true_text,X,1)

This final script tests the model on the headline “Hillary Clinton eats Donald Trump”. Note that it passes 1 as the initial value for both classes, so fake_initial and true_initial are not applied here.

True

The model outputs True: the fake news score is higher than the real news score, so the headline is classified as fake, as expected for such an obviously fabricated headline.

Where you can improve my program:

I created this program as a framework for others to improve upon. Here are a few things you could consider:

  • Consider phrases, as well as words

A single word carries little context, whereas a phrase can give more insight into whether the news is fake.

  • Gain a larger dataset, by web scraping

There are plenty of sources of real news and fake news online; you just need to find them.
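
The first suggestion, counting phrases as well as words, could start from pairs of adjacent words (bigrams). The following is a minimal sketch under that assumption: count_bigrams is a hypothetical helper and the headlines are invented for illustration.

```python
from collections import Counter

def count_bigrams(titles):
    """Count adjacent lower-cased word pairs across an iterable of headlines."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        # zip pairs each word with the one that follows it
        counts.update(zip(words, words[1:]))
    return counts

# Invented headlines, for illustration only.
sample_titles = ['Breaking news from Ohio', 'Breaking news tonight']
bigrams = count_bigrams(sample_titles)
print(bigrams[('breaking', 'news')])  # 2
```

The resulting counts could be converted to probabilities and scored in the same way as the single-word dictionaries.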

Thank you for reading my article. I hope you learnt something!

Translated from: https://towardsdatascience.com/using-bayesian-classifiers-to-detect-fake-news-3022c8255fba
