Finding Duplicate Quora Questions Using Machine Learning
Quora is an amazing platform where questions are asked, answered, followed, and edited by internet users. This empowers people to learn from each other and to better understand the world. About 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. It is not ideal for Quora to ask its followers to write answers to essentially the same question again and again, so it would be better to have a system capable of detecting that a new question is similar to questions that have already been answered.
So our problem statement is to predict whether a pair of questions are duplicates or not. We will use various machine learning techniques to come up with a solution. This blog is not a complete code walkthrough, but I will explain the various approaches I used to solve the problem. You can have a look at the code in my Github repository.
Some business constraints
The cost of misclassification can be very high: if a user asks a particular question and we surface the answer to a different one, that is bad for the user and for the business. This is the most important constraint.
- We want the probability that a pair of questions is a duplicate, so that we can choose whatever threshold suits us; depending on the use case, we can change it.
- We don’t have any latency requirements.
- Interpretability is partially important: we do not need to show users why a pair of questions is considered a duplicate, but it is better if we ourselves can understand why.
Performance metric
Here we have a binary classification task: we want to predict whether a pair of questions is a duplicate or not. Since we are predicting a probability value, it makes sense to use log loss as the metric; it is our primary KPI (key performance indicator). We will also use the confusion matrix for measuring performance.
Log loss is simply the negative log of the product of the likelihoods.
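For N question pairs with true labels y_i ∈ {0, 1} and predicted duplicate probabilities p_i, this works out to the standard binary cross-entropy (written out here for completeness; the formula is not shown in the original post):

\[
\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
\]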
Exploratory Data Analysis
We have about 404290 data points and 6 columns. The 6 columns/features are:
- id: a unique id for the question pair
- qid1: id of the first question
- qid2: id of the second question
- question1: the text of the first question
- question2: the text of the second question
- is_duplicate: whether the two questions are duplicates or not
As an initial step, I checked for missing values. I found missing values in 3 rows (in question1 and question2), so I dropped those rows. Next, I checked for duplicate rows; there were none.
Analysis On Target Variable
Clearly we have imbalanced data: the non-duplicate pairs outnumber the duplicates.
Analysis On Questions
After analyzing the number of questions, I came up with the following observations:
- The total number of unique questions is 537929.
- 111778 questions (about 20.78%) appear more than once.
- The maximum number of times a single question occurs is 157.
After that, I plotted a histogram of how many times each question occurs, with the number of questions on a log scale. We can see that most questions occur fewer than about 60 times, while a handful of questions occur far more often (one question occurs 157 times, one about 120 times, one about 80 times, and so on).
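A minimal sketch of how such a plot could be produced with pandas and matplotlib; the column names assume the standard Quora question-pairs train.csv, and the exact binning in my notebook may differ:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

# Count how many times each question id appears across both columns.
occurrences = pd.concat([df["qid1"], df["qid2"]]).value_counts()

plt.hist(occurrences.values, bins=160)
plt.yscale("log")  # number of questions on a log scale
plt.xlabel("Number of occurrences of a question")
plt.ylabel("Number of questions (log scale)")
plt.title("Distribution of question occurrence counts")
plt.show()
```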
Now we have a broad understanding of the data. Next, I created some basic features before preprocessing the data.
Basic Feature Engineering
I created the following features:
freq_qid1 = frequency of qid1 (number of times question1 occurs)
freq_qid2 = frequency of qid2 (number of times question2 occurs)
q1len = length of question1 in characters
q2len = length of question2 in characters
q1_n_words = number of words in question1
q2_n_words = number of words in question2
word_Common = number of common unique words in question1 and question2
word_Total = total number of words in question1 + total number of words in question2
word_share = word_Common / word_Total
freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
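A hedged sketch of how these basic features could be computed with pandas (illustrative, not the exact code from my repository; word_Total here counts unique words per question, consistent with word_Common):

```python
import pandas as pd

df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

# Frequency of each question id across both columns.
qid_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()
df["freq_qid1"] = df["qid1"].map(qid_counts)
df["freq_qid2"] = df["qid2"].map(qid_counts)

df["q1len"] = df["question1"].str.len()
df["q2len"] = df["question2"].str.len()
df["q1_n_words"] = df["question1"].str.split().str.len()
df["q2_n_words"] = df["question2"].str.split().str.len()

def word_sets(row):
    w1 = set(str(row["question1"]).lower().split())
    w2 = set(str(row["question2"]).lower().split())
    return w1, w2

df["word_Common"] = df.apply(lambda r: len(set.intersection(*word_sets(r))), axis=1)
df["word_Total"] = df.apply(lambda r: sum(len(s) for s in word_sets(r)), axis=1)
df["word_share"] = df["word_Common"] / df["word_Total"]
df["freq_q1+q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()
```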
Analysis on engineered features
We can see that as word_share increases, there is a higher chance that the questions are duplicates. We also know that the more the class-conditional PDFs overlap, the less information a feature carries for distinguishing the classes. From the histogram, we can see that word_share does carry some information that separates the duplicate and non-duplicate classes.
In contrast, word_Common does not carry enough information to separate the classes: the histograms of word_Common for duplicate and non-duplicate questions overlap heavily, so not much can be gained from this feature on its own.
Advanced Feature Engineering
Now we will create some advanced features from the data. Before that, we will clean the text. As part of text preprocessing, I removed stopwords, punctuation, and special characters like “₹”, “$”, “€”, and applied stemming for better generalization. Next, I engineered the following features.
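A minimal sketch of this kind of text cleaning, using regular expressions and NLTK's Porter stemmer (the exact symbol replacements and cleaning steps in my code may differ slightly):

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    text = str(text).lower()
    # Expand a few currency symbols into words, then strip remaining punctuation/special chars.
    text = text.replace("₹", " rupee ").replace("$", " dollar ").replace("€", " euro ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Stemming for better generalization.
    return " ".join(stemmer.stem(tok) for tok in text.split())

print(preprocess("What's the best way to invest ₹10,000 in India?"))
```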
Note: In the features below, token means a word obtained by splitting the text, and word means a token that is not a stopword.
cwc_min : ratio of common_word_count to the minimum word count of Q1 and Q2. cwc_min = common_word_count / min(len(q1_words), len(q2_words))
cwc_max : ratio of common_word_count to the maximum word count of Q1 and Q2. cwc_max = common_word_count / max(len(q1_words), len(q2_words))
csc_min : ratio of common_stop_count to the minimum stopword count of Q1 and Q2. csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
csc_max : ratio of common_stop_count to the maximum stopword count of Q1 and Q2. csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
ctc_min : ratio of common_token_count to the minimum token count of Q1 and Q2. ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
ctc_max : ratio of common_token_count to the maximum token count of Q1 and Q2. ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
last_word_eq : whether the last words of both questions are equal. last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
first_word_eq : whether the first words of both questions are equal. first_word_eq = int(q1_tokens[0] == q2_tokens[0])
abs_len_diff : absolute difference in token length. abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
mean_len : average token length of both questions. mean_len = (len(q1_tokens) + len(q2_tokens)) / 2
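A hedged sketch of how these token-based features could be computed for one question pair; SAFE_DIV guards against division by zero, and the function mirrors the definitions above rather than the exact repository code (requires nltk.download("stopwords") once):

```python
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
SAFE_DIV = 1e-4  # avoid division by zero on degenerate questions

def token_features(q1, q2):
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return [0.0] * 10

    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}

    common_word_count = len(q1_words & q2_words)
    common_stop_count = len(q1_stops & q2_stops)
    common_token_count = len(set(q1_tokens) & set(q2_tokens))

    return [
        common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_min
        common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV),    # cwc_max
        common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_min
        common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),    # csc_max
        common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV), # ctc_min
        common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV), # ctc_max
        int(q1_tokens[-1] == q2_tokens[-1]),                                   # last_word_eq
        int(q1_tokens[0] == q2_tokens[0]),                                     # first_word_eq
        abs(len(q1_tokens) - len(q2_tokens)),                                  # abs_len_diff
        (len(q1_tokens) + len(q2_tokens)) / 2,                                 # mean_len
    ]
```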
fuzz_ratio : Here comes the interesting part. The fuzz ratio is based on the Levenshtein distance. Intuitively, if many edits are required to turn one sentence into the other, the fuzz ratio will be small; for very similar strings it will be high.
e.g., s1 = "mumbai is a great place", s2 = "mumbai is a nice place"
fuzz ratio = 91
fuzz_partial_ratio : In certain cases the fuzz ratio alone cannot solve the issue.
fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75
Both strings refer to the same team, but their plain fuzz ratio is low. If we instead compute the ratio against the best matching partial substring, it will be high; this is the fuzz partial ratio.
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
token_sort_ratio: In some other cases even the fuzz partial ratio will fail.
For example:
fuzz.partial_ratio("MI vs RCB","RCB vs MI")
Both strings have the same meaning, but the partial ratio gives a low result. A better approach is to sort the tokens first and then apply the fuzz ratio.
fuzz.token_sort_ratio("MI vs RCB","RCB vs MI") ⇒ 100
token_set_ratio: There is another type of fuzz ratio that helps even in cases where all of the above fail: the token set ratio.
For that we have to first find the following:
t0 -> the sorted intersection of the tokens of sentence1 and sentence2
t1 -> t0 + the rest of the tokens in sentence1
t2 -> t0 + the rest of the tokens in sentence2
token_set_ratio = max(fuzz_ratio(t0, t1), fuzz_ratio(t0, t2), fuzz_ratio(t1, t2))
longest_substr_ratio : ratio of the length of the longest common substring to the minimum token count of Q1 and Q2.
s1 -> hai, today is a good day
s2 -> No, today is a bad day
Here the longest common substring is "today is a", so longest_substr_ratio = 3 / min(6, 6) = 0.5. In general, longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens)).
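The fuzzy features can be computed with the fuzzywuzzy package, and the longest common substring with difflib from the standard library; a rough sketch of the definitions above (not necessarily the exact repository code):

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

def fuzzy_features(q1, q2):
    q1_tokens, q2_tokens = q1.split(), q2.split()

    # Longest common substring, measured in tokens as in the example above.
    m = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
    lcs_token_len = len(q1[m.a : m.a + m.size].split())
    longest_substr_ratio = lcs_token_len / (min(len(q1_tokens), len(q2_tokens)) + 1e-4)

    return {
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
        "longest_substr_ratio": longest_substr_ratio,
    }

print(fuzzy_features("mumbai is a great place", "mumbai is a nice place"))
```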
More data analysis
Now we will plot word clouds for duplicate and non-duplicate question pairs. I plotted them after removing stopwords to get a clearer picture.
The larger a word appears, the more frequently it occurs in the corpus. We can see that words like "Donald Trump", "rupee", and "best way" mostly appear in duplicate pairs, while words like "difference", "India", and "use" appear most often in non-duplicate pairs.
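A small sketch of the word-cloud step with the wordcloud package (shown for the duplicate pairs; the same code with is_duplicate == 0 gives the non-duplicate cloud):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

# Pool all text from duplicate pairs into one corpus.
dup = df[df["is_duplicate"] == 1]
dup_text = " ".join(dup["question1"].astype(str) + " " + dup["question2"].astype(str))

wc = WordCloud(width=800, height=400, stopwords=set(STOPWORDS), background_color="white")
plt.imshow(wc.generate(dup_text), interpolation="bilinear")
plt.axis("off")
plt.show()
```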
Pair plot for ‘ctc_min’, ‘cwc_min’, ‘csc_min’, ‘token_sort_ratio’
From the pair plot, we can see that all of these features carry some useful information for separating duplicate and non-duplicate pairs; among them, token_sort_ratio and ctc_min do a better job than the others.
The absolute difference in the number of words between questions
We can see that most question pairs differ in length by only one word; only a very small number of pairs show a large difference.
t-SNE Visualization
Next, I tried a lower-dimensional visualization of the data. I randomly sampled 5000 data points and used t-SNE to project them to two dimensions, using only the features we engineered above, to see their impact. We can see that in some regions the classes are well separated, so our features now carry enough information to support a good classification.
Note: You can always experiment with more data points and different perplexity values (if you have enough computing power), as that can reveal much more structure.
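A hedged sketch of the t-SNE step with scikit-learn; features_df is a hypothetical dataframe holding the engineered features plus the is_duplicate label, and perplexity/sample size are free to tune:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# features_df: engineered features + is_duplicate label (assumed built earlier).
sample = features_df.sample(5000, random_state=42)
X = MinMaxScaler().fit_transform(sample.drop(columns=["is_duplicate"]))
y = sample["is_duplicate"].values

X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="coolwarm")
plt.title("t-SNE of engineered features (5000 sampled pairs)")
plt.show()
```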
Train test split
We did a 70:30 split, i.e., 70% of the data points for training and the remaining 30% for testing.
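The split itself is a one-liner with scikit-learn; X and y stand for the assembled feature matrix and the is_duplicate labels, and the stratify argument is an optional addition not stated in the post (it keeps the class ratio identical in both sets):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```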
Vectorizing text data
Before building models on the data, we have to vectorize the text. For that, we used two approaches:
- TFIDF vectorization
- TFIDF weighted GloVe vectorization
We merged these vectors with the features we created earlier and saved the results as separate files.
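A rough sketch of the two schemes. The TF-IDF weighted GloVe variant averages pre-trained word vectors weighted by each word's IDF; here spaCy's en_core_web_lg vectors stand in for GloVe, which is an assumption; the original code may load the GloVe vectors differently:

```python
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# df: the question-pairs dataframe prepared earlier.
questions = list(df["question1"]) + list(df["question2"])

# 1) Plain TF-IDF features.
tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(questions)
q1_tfidf = tfidf.transform(df["question1"])
q2_tfidf = tfidf.transform(df["question2"])

# 2) TF-IDF weighted GloVe-style vectors: average each question's word vectors,
#    weighting every word by its IDF so rarer words count more.
word2idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
nlp = spacy.load("en_core_web_lg")  # ships 300-d word vectors; download the model first

def tfidf_weighted_vector(text):
    doc = nlp(str(text))
    vec = np.zeros(nlp.vocab.vectors_length)
    total_weight = 0.0
    for token in doc:
        w = word2idf.get(token.text.lower(), 0.0)
        vec += token.vector * w
        total_weight += w
    return vec / total_weight if total_weight > 0 else vec

q1_glove = np.vstack([tfidf_weighted_vector(q) for q in df["question1"]])
q2_glove = np.vstack([tfidf_weighted_vector(q) for q in df["question2"]])
```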
Machine Learning Models
Let us get into the most interesting part of this blog: building machine learning models. We now have two data frames for training, one using TFIDF features and the other using TFIDF weighted GloVe vectors.
Logistic regression
TFIDF features:
Training logistic regression on the TFIDF data, we end up with a log loss of about 0.43 on train and 0.53 on test. Our confusion, precision, and recall matrices look as follows:
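A minimal sketch of this step: scikit-learn's SGDClassifier with logistic loss, tuned over the regularization strength alpha and wrapped in a calibrator to get clean probabilities (the exact grid and calibration in my repository may differ; X_train/X_test come from the split above):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

best_alpha, best_loss = None, float("inf")
for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]:
    # loss="log_loss" is logistic regression (use loss="log" on older scikit-learn).
    clf = SGDClassifier(loss="log_loss", alpha=alpha, penalty="l2", random_state=42)
    model = CalibratedClassifierCV(clf, method="sigmoid")  # well-behaved probabilities
    model.fit(X_train, y_train)
    # In practice, tune on a separate validation set rather than the test set.
    loss = log_loss(y_test, model.predict_proba(X_test))
    if loss < best_loss:
        best_alpha, best_loss = alpha, loss

print(f"best alpha = {best_alpha}, test log loss = {best_loss:.4f}")
```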
Most readers will know how to interpret an ordinary confusion matrix, so I will not go into it. Let us instead see how to interpret the precision matrix; in the figure above, the second matrix is the precision matrix.
Intuitively, precision means: of all the points predicted as a given class, how many actually belong to that class. Here we can see that of all the labels predicted as class 1, 24.6% belong to class 2 and the remaining 75.4% belong to class 1. Similarly, of all the points predicted as class 2, 69.2% belong to class 2 and 30.8% belong to class 1. So the precision for class 1 is 0.754 and for class 2 is 0.692.
The third matrix is the recall matrix. Intuitively, recall means: of all the points that actually belong to a class, how many of them are predicted correctly. In the recall matrix, of all the labels belonging to class 1, 86.5% were predicted as class 1 and 13.5% as class 2. Similarly, of all the points that actually belong to class 2, 59.9% were predicted as class 2 and the rest as class 1.
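The precision and recall matrices discussed above are just the confusion matrix normalized by its columns and by its rows respectively; a short sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

C = confusion_matrix(y_test, model.predict(X_test))  # rows = true class, columns = predicted

# Precision matrix: each column sums to 1
# (of everything predicted as a class, what fraction truly belongs to each class).
precision_matrix = C / C.sum(axis=0, keepdims=True)

# Recall matrix: each row sums to 1
# (of everything truly in a class, what fraction was predicted as each class).
recall_matrix = C / C.sum(axis=1, keepdims=True)

print(np.round(precision_matrix, 3))
print(np.round(recall_matrix, 3))
```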
TFIDF weighted GloVe:
After hyperparameter tuning, we end up with a log loss of about 0.38 on train and 0.39 on test.
From both recall matrices, we can see that the recall is low, particularly for the duplicate class. Let us see how a linear SVM performs.
Linear SVM
TFIDF features:
Train log loss: 0.3851
Test log loss: 0.3942
Confusion matrix:
TFIDF weighted GloVe:
Train log loss: 0.3876
Test log loss: 0.395
Confusion matrix:
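The linear SVM follows the same pattern with hinge loss instead of logistic loss; calibration is required because a hinge-loss SGDClassifier has no predict_proba (a sketch under the same assumptions as above, with an illustrative alpha):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

svm = SGDClassifier(loss="hinge", alpha=1e-4, penalty="l2", random_state=42)
model = CalibratedClassifierCV(svm, method="sigmoid")  # wraps the SVM to output probabilities
model.fit(X_train, y_train)

print("train log loss:", log_loss(y_train, model.predict_proba(X_train)))
print("test log loss:", log_loss(y_test, model.predict_proba(X_test)))
```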
For both logistic regression and the linear SVM we end up with a low recall, but neither model overfits much, so this looks like a high-bias problem. If we opt for a boosting method, we may overcome this and get a better recall. With this in mind, I tried XGBoost on both the TFIDF and the TFIDF weighted GloVe features.
XGBoost
TFIDF features:
Train log loss: 0.457
Test log loss: 0.516
Confusion matrix:
TFIDF weighted GloVe features:
Train log loss: 0.183
Test log loss: 0.32
Confusion matrix:
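A hedged sketch of the XGBoost step; the hyperparameters shown are illustrative, not the tuned values from my repository:

```python
import xgboost as xgb
from sklearn.metrics import log_loss

model = xgb.XGBClassifier(
    n_estimators=400,
    max_depth=6,
    learning_rate=0.1,
    objective="binary:logistic",
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print("train log loss:", log_loss(y_train, model.predict_proba(X_train)))
print("test log loss:", log_loss(y_test, model.predict_proba(X_test)))
```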
We can see that with XGBoost the recall of both classes improved slightly. Of all these models, XGBoost on the GloVe vectors performs best.
For a complete code walkthrough, you can visit my GitHub repository.
Translated from: https://towardsdatascience.com/finding-duplicate-quora-questions-using-machine-learning-249475b7d84d