https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur
TL;DR : I achieved near state-of-the-art accuracy by using a very deep neural net. The code is available here: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question
Quora released its first ever dataset publicly on 24th Jan, 2017. This dataset consists of question pairs which are either duplicate or not. Duplicate questions mean the same thing.
For example, each consecutive pair of questions below is a duplicate pair (from the Quora dataset):
- How does Quora quickly mark questions as needing improvement?
- Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
- Why did Trump win the Presidency?
- How did Donald Trump win the 2016 Presidential Election?
- What practical applications might evolve from the discovery of the Higgs Boson?
- What are some practical benefits of discovery of the Higgs Boson?
Some examples of non-duplicate questions are as follows:
- Who should I address my cover letter to if I'm applying for a big company like Mozilla?
- Which car is better from safety view? "swift or grand i10". My first priority is safety?
- Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
- What mistakes are made when depicting hacking in "Mr. Robot" compared to real-life cybersecurity breaches or just a regular use of technologies?
- How can I start an online shopping (e-commerce) website?
- Which web technology is best suitable for building a big E-Commerce website?
In this article, we discuss methods which can be used to detect duplicate questions using the Quora dataset. Of course, these methods can be used for other similar datasets.
The methods discussed in this article range from simple TF-IDF, Singular Value Decomposition, fuzzy features, Word2Vec features and GloVe features to LSTMs and 1D CNNs. We provide a comparison of the performance of these algorithms on the Quora dataset.
Let’s take a look at the data first.
Data
The data consisted of 404351 question pairs with 255045 negative samples (non-duplicates) and 149306 positive samples (duplicates). That is approximately 37% positive samples.
First few rows of the data:
Label distribution:
Average number of characters in question1: 59.57
Minimum number of characters in question1: 1
Maximum number of characters in question1: 623
Average number of characters in question2: 60.14
Minimum number of characters in question2: 1
Maximum number of characters in question2: 1169
Since Quora Engineering chose accuracy to evaluate their models (https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning), I did the same.
Basic Feature Engineering
I started with some very basic features. These features included:
- Length of question1
- Length of question2
- Difference in the two lengths
- Character length of question1 without spaces
- Character length of question2 without spaces
- Number of words in question1
- Number of words in question2
- Number of common words in question1 and question2
These features can be created easily using pandas’ apply method together with lambda functions.
Let’s call this basic set of features “fs-1”.
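As a rough illustration, the basic features listed above could be computed along these lines. This is a minimal sketch, not the exact code from the repo: the two example rows are made up, and the output column names (len_q1, common_words and so on) are hypothetical names chosen for this example.

```python
import pandas as pd

# Two made-up rows standing in for the real question pairs (assumption).
df = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is machine learning?"],
    "question2": ["How can I learn Python quickly?", "Who won the world cup?"],
})

# Character lengths and their difference.
df["len_q1"] = df["question1"].apply(lambda x: len(str(x)))
df["len_q2"] = df["question2"].apply(lambda x: len(str(x)))
df["diff_len"] = df["len_q1"] - df["len_q2"]

# Character lengths without spaces.
df["len_char_q1"] = df["question1"].apply(lambda x: len(str(x).replace(" ", "")))
df["len_char_q2"] = df["question2"].apply(lambda x: len(str(x).replace(" ", "")))

# Word counts (whitespace tokenization).
df["len_word_q1"] = df["question1"].apply(lambda x: len(str(x).split()))
df["len_word_q2"] = df["question2"].apply(lambda x: len(str(x).split()))

# Number of distinct lowercased words shared by both questions.
df["common_words"] = df.apply(
    lambda row: len(
        set(str(row["question1"]).lower().split())
        & set(str(row["question2"]).lower().split())
    ),
    axis=1,
)

print(df.drop(columns=["question1", "question2"]))
```

Note that this simple whitespace tokenization treats "python?" and "python" as different words; stripping punctuation first would be a reasonable refinement.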
Next I created some fuzzy features using the fuzzywuzzy package (https://github.com/seatgeek/fuzzywuzzy). Fuzzywuzzy uses Levenshtein Distance to calculate differences between sequences.
The fuzzy features I used were:
- QRatio
- WRatio
- Partial ratio
- Partial token set ratio
- Partial token sort ratio
- Token set ratio
- Token sort ratio
This set of features will be called “fs-2”.
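To give a feel for what sits underneath these ratios, here is a small pure-Python sketch of Levenshtein distance and two similarity scores built on it. It is only an illustration of the idea: the functions simple_ratio and token_sort_ratio below are hypothetical helpers written for this example, and their scores will not match fuzzywuzzy's exactly. In the library itself the corresponding scorers live in the fuzz module (for example fuzz.QRatio and fuzz.token_sort_ratio).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def simple_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]; 100 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def token_sort_ratio(a: str, b: str) -> float:
    """Sort the lowercased words first so that word order is ignored."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(normalize(a), normalize(b))


print(levenshtein("kitten", "sitting"))
print(simple_ratio("Why did Trump win?", "How did Trump win?"))
print(token_sort_ratio("did Trump win", "win Trump did"))
```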
TF-IDF and SVD Features
TF-IDF stands for Term Frequency - Inverse Document Frequency. It’s one of the most basic methods people use in information retrieval. One can read more about TF-IDF here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
SVD stands for Singular Value Decomposition (https://en.wikipedia.org/wiki/Singular_value_decomposition). I used a variation of SVD called Truncated SVD, which is implemented in scikit-learn.
I calculated TF-IDF and SVD features in a few different ways. The following pipelines were implemented and evaluated:
I’ll denote these features as “fs3-1”, “fs3-2”, “fs3-3”, “fs3-4” and “fs3-5”. Pretty easy ha! ;)
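The five pipelines themselves are not reproduced in this text, so the snippet below is only a sketch of one plausible arrangement: fit a single TF-IDF vocabulary on both question columns, transform each column, and reduce the result with Truncated SVD. The n-gram range and the number of components are assumptions picked to keep the toy example small.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# A few question pairs taken from the examples earlier in the article.
question1 = [
    "How can I start an online shopping (e-commerce) website?",
    "Why did Trump win the Presidency?",
    "What are some practical benefits of discovery of the Higgs Boson?",
]
question2 = [
    "Which web technology is best suitable for building a big E-Commerce website?",
    "How did Donald Trump win the 2016 Presidential Election?",
    "What practical applications might evolve from the discovery of the Higgs Boson?",
]

# One shared vocabulary, fitted on both columns (assumed setup).
tfidf = TfidfVectorizer(ngram_range=(1, 2))
tfidf.fit(question1 + question2)
q1_tfidf = tfidf.transform(question1)
q2_tfidf = tfidf.transform(question2)

# Truncated SVD works directly on the sparse TF-IDF matrix.
# n_components=2 is only for this toy data; real data would use far more.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(tfidf.transform(question1 + question2))
q1_svd = svd.transform(q1_tfidf)
q2_svd = svd.transform(q2_tfidf)

print(q1_tfidf.shape, q1_svd.shape, q2_svd.shape)
```

Other variants are easy to derive from this, for example fitting separate vectorizers per column or stacking the two TF-IDF matrices side by side before the SVD step.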
Let’s move to some complicated features from here.
Word2Vec Features
Word2Vec creates a multi-dimensional vector for every word in the English vocabulary (or the corpus it has been trained on). Word2Vec embeddings are very popular in natural language processing and often provide useful insights. Wikipedia provides a good explanation of what these embeddings are and how they are generated (https://en.wikipedia.org/wiki/Word2vec).
Word2Vec can be used to represent words, and words which have similar meanings will be very close to each other in the word2vec space. An example is shown in the following figure:
We can also represent sentences using word2vec.
For the word2vec model, I used gensim (https://radimrehurek.com/gensim/) and a pre-trained word2vec model trained on the Google News corpus.
For sentences, I generated vectors using the following function:
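The original function is not reproduced in this text. As a stand-in, here is a minimal sketch of one common way to build a sentence vector: sum the vectors of the known words and scale the result to unit length. The tiny 3-dimensional toy_vectors dictionary is an assumption that replaces the 300-dimensional Google News model, and the name sent2vec is used here only for illustration.

```python
import numpy as np

# Toy 3-dimensional embeddings standing in for the real model (assumption).
toy_vectors = {
    "how": np.array([1.0, 0.0, 0.0]),
    "learn": np.array([0.0, 1.0, 0.0]),
    "python": np.array([0.0, 0.0, 1.0]),
}


def sent2vec(sentence, vectors, dim=3):
    """Sum the vectors of known alphabetic words and L2-normalize the sum."""
    words = [w for w in str(sentence).lower().split() if w.isalpha()]
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        # No known word: fall back to a zero vector.
        return np.zeros(dim)
    total = np.sum(found, axis=0)
    norm = np.sqrt((total ** 2).sum())
    return total / norm if norm > 0 else total


vec = sent2vec("How do I learn Python", toy_vectors)
print(vec)
```

With a real model, a gensim KeyedVectors object should support the same membership test and lookup as the dictionary used here, with dim set to 300.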
To calculate similarity between the questions, another feature that I created was word mover’s distance. Word mover’s distance uses word2vec embeddings and works on a principle similar to that of earth mover’s distance to give a distance between two text documents. In simple words, word mover’s distance is the minimum cumulative distance that the words of one document need to “travel” in the embedding space to reach the words of the other document. (From word embeddings to document distances: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf).
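To make the idea concrete, word mover’s distance can be written as a small transportation problem and solved with a generic linear programming routine. The sketch below does this with scipy's linprog on made-up 2-dimensional word vectors; it is not how the features in this article were computed (a word2vec library would normally provide this), and the function name and toy vectors are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up 2-dimensional embeddings (assumption, for illustration only).
toy_vectors = {
    "obama": np.array([0.0, 0.0]),
    "president": np.array([0.0, 1.0]),
    "speaks": np.array([3.0, 0.0]),
    "greets": np.array([3.0, 1.0]),
}


def word_movers_distance(doc1, doc2, vectors):
    """Exact WMD between two token lists; every token must be in vectors."""
    words1, words2 = sorted(set(doc1)), sorted(set(doc2))
    # Normalized bag-of-words weights: how much "mass" each word carries.
    d1 = np.array([doc1.count(w) for w in words1], dtype=float)
    d2 = np.array([doc2.count(w) for w in words2], dtype=float)
    d1, d2 = d1 / d1.sum(), d2 / d2.sum()
    # cost[i, j] = Euclidean distance between word i of doc1 and word j of doc2.
    cost = np.array([[np.linalg.norm(vectors[a] - vectors[b]) for b in words2]
                     for a in words1])
    n, m = cost.shape
    constraints = []
    for i in range(n):  # everything leaving word i of doc1 sums to d1[i]
        row = np.zeros((n, m))
        row[i, :] = 1.0
        constraints.append(row.ravel())
    for j in range(m):  # everything arriving at word j of doc2 sums to d2[j]
        col = np.zeros((n, m))
        col[:, j] = 1.0
        constraints.append(col.ravel())
    result = linprog(cost.ravel(), A_eq=np.array(constraints),
                     b_eq=np.concatenate([d1, d2]), bounds=(0, None))
    return result.fun


wmd = word_movers_distance(["obama", "speaks"], ["president", "greets"], toy_vectors)
print(wmd)
```

In this toy setup the cheapest plan moves "obama" to "president" and "speaks" to "greets", each carrying half of the mass over a distance of 1.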
Final word2vec features included:
- Word mover distance
- Normalized word mover distance
- Cosine distance between vectors of question1 and question2
- Manhattan distance between vectors of question1 and question2
- Jaccard similarity between vectors of question1 and question2
- Canberra distance between vectors of question1 and question2
- Euclidean distance between vectors of question1 and question2
- Minkowski distance between vectors of question1 and question2
- Braycurtis distance between vectors of question1 and question2
- Skew of vector for question1
- Skew of vector for question2
- Kurtosis of vector for question1
- Kurtosis of vector for question2
All the Word2Vec features are denoted by fs4.
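Most of the vector-based features in this list map directly onto functions in scipy. The sketch below, assuming scipy is available, computes them for two short made-up vectors in place of the 300-dimensional question vectors; the order p=3 for the Minkowski distance is an arbitrary choice for the example. Jaccard is left out because scipy's jaccard function is defined for boolean vectors.

```python
import numpy as np
from scipy.spatial.distance import (
    braycurtis, canberra, cityblock, cosine, euclidean, minkowski,
)
from scipy.stats import kurtosis, skew

# Short made-up vectors standing in for the two question vectors (assumption).
v1 = np.array([1.0, 0.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 2.0, 3.0])

features = {
    "cosine_distance": cosine(v1, v2),
    "cityblock_distance": cityblock(v1, v2),  # Manhattan distance
    "canberra_distance": canberra(v1, v2),
    "euclidean_distance": euclidean(v1, v2),
    "minkowski_distance": minkowski(v1, v2, p=3),
    "braycurtis_distance": braycurtis(v1, v2),
    "skew_q1": skew(v1),
    "skew_q2": skew(v2),
    "kurtosis_q1": kurtosis(v1),
    "kurtosis_q2": kurtosis(v2),
}

for name, value in features.items():
    print(f"{name}: {value:.4f}")
```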
A separate set of w2v features consisted of the vectors themselves.
- Word2vec vector for question1
- Word2vec vector for question2
These will be represented by fs5.
A snapshot of data after all the features (except tf-idf and svd features):
Now, we have everything available and we can start creating machine learning models on top of these features.
Machine Learning Models
I evaluated two of my favorite models: logistic regression and xgboost. For logistic regression the data was first normalized using z-score scaling.
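For reference, z-score scaling followed by logistic regression can be chained in scikit-learn as shown below. This is a sketch on synthetic data, not the experiment from this article: the three random columns, their scales and the label rule are all made up to show why scaling matters when features live on very different ranges.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on wildly different scales (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = (X[:, 0] > 0).astype(int)  # label depends on the first column only

# StandardScaler applies z-score scaling: (x - mean) / standard deviation.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

accuracy = model.score(X, y)
scaled = model.named_steps["standardscaler"].transform(X)
print(f"training accuracy: {accuracy:.3f}")
print("column means after scaling:", scaled.mean(axis=0).round(6))
```

Putting the scaler inside the pipeline means the mean and standard deviation are learned from the training data only, which avoids leaking information from a validation split.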
The following table gives the performance of logistic regression and xgboost on different sets of features that were created:
The xgboost model on basic features, fuzzy features, w2v vectors and w2v features already beats a few deep learning techniques, such as the Siamese network discussed here: http://www.erogol.com/duplicate-question-detection-deep-learning/ .
To be honest, I didn't spend much time on tuning the hyperparameters of these models. I believe the score can be improved further if we use some hyperparameter tuning techniques. I wanted to dive into deep neural networks as soon as possible and that's what I did next!!!
Deep Learning Models
I tried many different deep learning models, from a simple network with dense layers only to LSTM, GRU and 1D CNN. These models gave an accuracy of around 0.80.
Finally, I was able to get an accuracy of 0.85 with a deep neural network comprising two translation layers, one for each question, initialized by GloVe embeddings; two LSTMs whose embeddings were not initialized with GloVe; and two 1D convolutional branches, which were also initialized by GloVe embeddings. This was followed by a series of dense layers with dropout and batch normalization. The final network summary is provided below:
Layer (type)                      Output Shape      Param #   Connected to
====================================================================================================
embedding_7 (Embedding)           (None, 40, 300)   28683300
timedistributed_3 (TimeDistribut  (None, 40, 300)   90300
lambda_3 (Lambda)                 (None, 300)       0
embedding_8 (Embedding)           (None, 40, 300)   28683300
timedistributed_4 (TimeDistribut  (None, 40, 300)   90300
lambda_4 (Lambda)                 (None, 300)       0
embedding_9 (Embedding)           (None, 40, 300)   28683300
convolution1d_3 (Convolution1D)   (None, 36, 64)    96064
dropout_10 (Dropout)              (None, 36, 64)    0
convolution1d_4 (Convolution1D)   (None, 32, 64)    20544
globalmaxpooling1d_3 (GlobalMaxP  (None, 64)        0
dropout_11 (Dropout)              (None, 64)        0
dense_13 (Dense)                  (None, 300)       19500
dropout_12 (Dropout)              (None, 300)       0
batchnormalization_9 (BatchNorma  (None, 300)       1200
embedding_10 (Embedding)          (None, 40, 300)   28683300
convolution1d_5 (Convolution1D)   (None, 36, 64)    96064
dropout_13 (Dropout)              (None, 36, 64)    0
convolution1d_6 (Convolution1D)   (None, 32, 64)    20544
globalmaxpooling1d_4 (GlobalMaxP  (None, 64)        0
dropout_14 (Dropout)              (None, 64)        0
dense_14 (Dense)                  (None, 300)       19500
dropout_15 (Dropout)              (None, 300)       0
batchnormalization_10 (BatchNorm  (None, 300)       1200
embedding_11 (Embedding)          (None, 40, 300)   28683300
lstm_3 (LSTM)                     (None, 300)       721200
embedding_12 (Embedding)          (None, 40, 300)   28683300
lstm_4 (LSTM)                     (None, 300)       721200
batchnormalization_11 (BatchNorm  (None, 1800)      7200      merge_2[0][0]
dense_15 (Dense)                  (None, 300)       540300    batchnormalization_11[0][0]
prelu_6 (PReLU)                   (None, 300)       300       dense_15[0][0]
dropout_16 (Dropout)              (None, 300)       0         prelu_6[0][0]
batchnormalization_12 (BatchNorm  (None, 300)       1200      dropout_16[0][0]
dense_16 (Dense)                  (None, 300)       90300     batchnormalization_12[0][0]
prelu_7 (PReLU)                   (None, 300)       300       dense_16[0][0]
dropout_17 (Dropout)              (None, 300)       0         prelu_7[0][0]
batchnormalization_13 (BatchNorm  (None, 300)       1200      dropout_17[0][0]
dense_17 (Dense)                  (None, 300)       90300     batchnormalization_13[0][0]
prelu_8 (PReLU)                   (None, 300)       300       dense_17[0][0]
dropout_18 (Dropout)              (None, 300)       0         prelu_8[0][0]
batchnormalization_14 (BatchNorm  (None, 300)       1200      dropout_18[0][0]
dense_18 (Dense)                  (None, 300)       90300     batchnormalization_14[0][0]
prelu_9 (PReLU)                   (None, 300)       300       dense_18[0][0]
dropout_19 (Dropout)              (None, 300)       0         prelu_9[0][0]
batchnormalization_15 (BatchNorm  (None, 300)       1200      dropout_19[0][0]
dense_19 (Dense)                  (None, 300)       90300     batchnormalization_15[0][0]
prelu_10 (PReLU)                  (None, 300)       300       dense_19[0][0]
dropout_20 (Dropout)              (None, 300)       0         prelu_10[0][0]
batchnormalization_16 (BatchNorm  (None, 300)       1200      dropout_20[0][0]
dense_20 (Dense)                  (None, 1)         301       batchnormalization_16[0][0]
activation_2 (Activation)         (None, 1)         0         dense_20[0][0]
====================================================================================================
Total params: 174,913,917
Trainable params: 60,172,917
Non-trainable params: 114,741,000
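The parameter counts in the summary can be reproduced with a little arithmetic, which also helps in reading the architecture. The sketch below rests on a few assumptions inferred from the numbers rather than stated in the text: a vocabulary of 28683300 / 300 = 95611 rows per embedding, a convolution kernel width of 5 (the sequence length drops from 40 to 36), and four of the six embeddings being frozen, which is what the non-trainable total suggests.

```python
def embedding(vocab, dim):
    return vocab * dim

def dense(n_in, n_out):
    return n_in * n_out + n_out                    # weights + biases

def conv1d(kernel, channels_in, filters):
    return kernel * channels_in * filters + filters

def lstm(n_in, units):
    return 4 * (n_in * units + units * units + units)  # four gates

def batchnorm(n):
    return 4 * n           # gamma, beta, moving mean, moving variance

def prelu(n):
    return n               # one learned slope per unit

vocab = 28683300 // 300    # rows per embedding matrix (inferred)

# One convolutional branch: conv -> conv -> dense -> batch norm.
# Kernel width 5 is an inferred assumption.
conv_branch = (conv1d(5, 300, 64) + conv1d(5, 64, 64)
               + dense(64, 300) + batchnorm(300))

total = (
    6 * embedding(vocab, 300)      # embedding_7 .. embedding_12
    + 2 * dense(300, 300)          # the two TimeDistributed dense layers
    + 2 * conv_branch
    + 2 * lstm(300, 300)
    + batchnorm(1800)              # after merging six 300-d outputs
    + dense(1800, 300)
    + 4 * dense(300, 300)          # dense_16 .. dense_19
    + 5 * prelu(300)
    + 5 * batchnorm(300)           # batchnormalization_12 .. _16
    + dense(300, 1)
)

# Frozen embeddings plus the moving statistics (half of every batch norm).
batchnorm_total = 2 * batchnorm(300) + batchnorm(1800) + 5 * batchnorm(300)
non_trainable = 4 * embedding(vocab, 300) + batchnorm_total // 2

print(f"total: {total:,}")
print(f"non-trainable: {non_trainable:,}")
print(f"trainable: {total - non_trainable:,}")
```

The three printed numbers agree with the totals reported in the summary, so the merged 1800-dimensional vector is consistent with six 300-dimensional branch outputs being concatenated.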
And the network architecture:
The network was trained on an NVIDIA TitanX. Each epoch took approximately 300 seconds, and the full training run took 10-15 hours. This network achieved an accuracy of 0.848 (~0.85). I tried over 10 different architectures to come up with this one :)
I'm still training a few configurations and will update this article as soon as the results improve. Code is available on my git repo: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question
Major Python libraries that I used:
- scikit-learn
- keras
- tensorflow
- pandas
I would like to thank Alexey Grigorev (https://github.com/alexeygrigorev/) for providing great pointers on word2vec features and Bradley Pallen for his similar work which is available on his github (https://github.com/bradleypallen/).