Finding Duplicate Quora Questions Using Machine Learning

Quora is a platform where questions are asked, answered, followed, and edited by internet users. This lets people learn from each other and better understand the world. About 100 million people visit Quora every month, so it is no surprise that many of them ask similarly worded questions. Asking users to write answers to the same question over and over is a poor experience. It would be better to have a system that can detect when a new question is similar to one that has already been answered.

So our problem statement is to predict whether a pair of questions is a duplicate or not. We will use various machine learning techniques to solve it. This blog is not a complete code walkthrough, but I will explain the approaches I used. You can have a look at the code in my GitHub repository.

Some business constraints

  • The cost of misclassification can be very high. If a user asks a particular question and we serve the answer to a different one, that is a bad experience and it will affect the business. This is the most important constraint.
  • We want the probability that a pair of questions is a duplicate, so that any threshold can be chosen. The threshold can then be changed depending on the use case.
  • We don't have any latency requirements.
  • Interpretability is partially important. We do not need to explain to users why a pair of questions was flagged as a duplicate, but it is better if we can understand the reason ourselves.
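
To make the threshold constraint concrete, here is a minimal sketch. The function name `classify_pairs` and the probability values are made up for illustration and are not taken from the original code.

```python
def classify_pairs(probabilities, threshold=0.5):
    """Turn predicted duplicate probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs for five question pairs.
probs = [0.05, 0.40, 0.55, 0.80, 0.97]

# Default threshold of 0.5.
default_labels = classify_pairs(probs)

# A stricter threshold flags fewer pairs as duplicates, which lowers the
# risk of showing a user the answer to a different question.
strict_labels = classify_pairs(probs, threshold=0.9)

print(default_labels)
print(strict_labels)
```

Because misclassification is costly here, a higher threshold is the cautious choice: fewer pairs are merged, but those that are merged are more likely to be true duplicates.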

Performance metric

Here we have a binary classification task: predict whether a pair of questions is a duplicate or not. Since we are predicting a probability value, log loss is a natural metric, and it is our primary KPI (key performance indicator). We will also use the confusion matrix to measure performance.
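
As a rough illustration of the second metric, the sketch below counts the four cells of a binary confusion matrix by hand. The helper name and the labels are made up; in practice a library routine would normally be used instead.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = not duplicate, 1 = duplicate)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]


# Made-up ground truth and predictions for five question pairs.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```

The false-positive cell (a non-duplicate pair predicted as duplicate) is the one the business constraint cares about most.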

Log loss is nothing but the negative of the log of the product of the likelihoods.
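
A small sketch of that definition follows. It assumes the usual convention of averaging over the N pairs, so the value is the negative log of the product of likelihoods divided by N. The clipping constant `eps` is an assumption added to avoid log(0), and the labels and probabilities are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Made-up labels and predicted duplicate probabilities.
y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.6]

# Likelihood of each true label under the prediction.
likelihoods = [p if y == 1 else 1 - p for y, p in zip(y_true, y_prob)]
product = math.prod(likelihoods)  # requires Python 3.8+

print(log_loss(y_true, y_prob))
print(-math.log(product) / len(y_true))  # same value, computed from the product
```

A confident wrong prediction (for example probability 0.99 for a pair that is not a duplicate) is penalised much more heavily than an uncertain one, which matches the high cost of misclassification.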

Exploratory Data Analysis

Figure: data

We have 404290 data points and 6 columns. The 6 columns/features are:

  • id: a unique id for the question pair
  • qid1: id of the first question
  • qid2: id of the second question
  • question1: the first question
  • question2: the second question
  • is_duplicate: whether the two questions are duplicates or not

As an initial step, I checked for missing values. I found 3 rows with missing values (in question1 and question2), so I dropped those rows. Next, I checked for duplicate rows, but there were none.
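
The two checks can be sketched with plain Python on a few made-up rows that use the same column names. What counts as a "duplicate row" is an assumption here: a repeated (qid1, qid2) pair. With pandas, `DataFrame.dropna()` and `DataFrame.duplicated()` cover the same ground.

```python
# Toy rows with the same six columns as the dataset (values are made up).
rows = [
    {"id": 0, "qid1": 1, "qid2": 2, "question1": "How do I learn Python?",
     "question2": "What is the best way to learn Python?", "is_duplicate": 1},
    {"id": 1, "qid1": 3, "qid2": 4, "question1": "What is machine learning?",
     "question2": None, "is_duplicate": 0},
    {"id": 2, "qid1": 5, "qid2": 6, "question1": "Why is the sky blue?",
     "question2": "How do planes fly?", "is_duplicate": 0},
]


def has_missing(row):
    """True if any column in the row is missing (None)."""
    return any(value is None for value in row.values())


# Drop rows with missing values.
clean_rows = [row for row in rows if not has_missing(row)]

# Assumption: a duplicate row repeats an already seen (qid1, qid2) pair.
seen = set()
duplicate_rows = []
for row in clean_rows:
    key = (row["qid1"], row["qid2"])
    if key in seen:
        duplicate_rows.append(row)
    seen.add(key)

print(len(rows) - len(clean_rows), "rows dropped;", len(duplicate_rows), "duplicate rows")
```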

Analysis On Target Variable

Figure: distribution of target

Clearly we have imbalanced data: the non-duplicate pairs outnumber the duplicate pairs.
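
The class balance can be checked by counting the values of is_duplicate. The sketch below uses made-up labels, not the real data, so the printed shares are only illustrative.

```python
from collections import Counter

# Made-up is_duplicate labels for ten question pairs.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

counts = Counter(labels)
total = len(labels)

# Share of each class among all pairs.
shares = {label: counts[label] / total for label in sorted(counts)}

for label, share in shares.items():
    print(f"is_duplicate={label}: {counts[label]} pairs ({share:.1%})")
```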

Analysis On Questions

After analyzing how often each question appears, I came up with the following observations:

Figure: unique vs repeated questions
Total number of unique questions is 537929
Number of questions that appear more than once is 111778, which is about 20.78% of the unique questions
The maximum number of times a question occurred is 157

After that, I plotted a histogram of the number of occurrences per question, with the number of questions on a log scale. Most questions occur fewer than roughly 60 times. A handful occur far more often: 1 question occurs 157 times, 1 question occurs 120 times, 1 question occurs 80 times, and so on.
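
The counts behind these observations can be sketched as follows. The (qid1, qid2) pairs are made up, and the histogram is kept as a plain mapping rather than a plot. When the mapping is plotted, a log scale on the count axis keeps the few heavily repeated questions visible next to the many questions that occur once.

```python
from collections import Counter

# Made-up (qid1, qid2) pairs; in the dataset these come from the qid1 and qid2 columns.
pairs = [(1, 2), (1, 3), (4, 2), (1, 5), (6, 7)]

# Count how many times each question id appears in either column.
occurrences = Counter()
for qid1, qid2 in pairs:
    occurrences[qid1] += 1
    occurrences[qid2] += 1

unique_questions = len(occurrences)
repeated_questions = sum(1 for n in occurrences.values() if n > 1)
max_occurrences = max(occurrences.values())

# Histogram: number of occurrences -> number of questions with that many occurrences.
histogram = Counter(occurrences.values())

print("unique questions:", unique_questions)
print("questions appearing more than once:", repeated_questions,
      f"({repeated_questions / unique_questions:.2%})")
print("maximum occurrences of a single question:", max_occurrences)
print("histogram:", dict(sorted(histogram.items())))
```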
