CHAPTER 6 Deep Learning for NLP

From Unlocking Text Data with Machine Learning and Deep Learning Using Python

In this chapter, we will implement deep learning for NLP:
Recipe 1. Information retrieval using deep learning
Recipe 2. Text classification using CNN, RNN, LSTM
Recipe 3. Predicting the next word/sequence of words using LSTM for Emails

Introduction to Deep Learning

Deep learning is a subfield of machine learning that is inspired by the function of the brain. Just as neurons are interconnected in the brain, artificial neural networks are built from interconnected units. Each neuron takes inputs, performs some computation, and produces an output that moves closer to the expected output (in the case of labeled data). What happens within the neuron is what we are interested in, because that is what drives accuracy. In very simple terms, each neuron assigns a weight to every input, combines the weighted inputs through a function, and passes the result on to the next layer, which can eventually be the output layer.
The functions can be of different types based on the problem or the data. These are also called activation functions. Below are the common types (a small numerical sketch follows the list).
• Linear activation function: A linear neuron takes a linear combination of the weighted inputs, and the output can take any value from -infinity to infinity.
• Nonlinear activation functions: These are the most commonly used, and they restrict the output to some range:
  • Sigmoid or logit activation function: It squashes the output into the range 0 to 1 by applying the logistic function, which makes classification problems easier.
  • Softmax function: Softmax is similar to sigmoid, but it calculates the probabilities of the event over 'n' different classes, which is useful for determining the target in multiclass classification problems.
  • Tanh function: The range of the tanh function is (-1 to 1); otherwise it behaves like sigmoid.
  • Rectified Linear Unit (ReLU) activation function: ReLU converts anything less than zero to zero, so the range becomes 0 to infinity.
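Here is a minimal NumPy sketch of these activation functions; the function names and the test values are only illustrative, not part of the recipes.

import numpy as np

def linear(x):
    # identity: output can be any real value
    return x

def sigmoid(x):
    # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # probabilities over n classes; subtract max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def tanh(x):
    # squashes values into (-1, 1)
    return np.tanh(x)

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))   # ~[0.119, 0.5, 0.953]
print(softmax(x))   # values sum to 1
print(tanh(x))      # ~[-0.964, 0.0, 0.995]
print(relu(x))      # [0.0, 0.0, 3.0]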

RNN and LSTM are better suited for text-related solutions

CNNs are mainly used for computer vision problems and do not handle sequence models well. Sequence models are those where the order of the entities matters; in text, for example, the order of the words matters to create meaningful sentences. This is where RNNs come into the picture: they are useful with sequential data because each neuron can use its memory to remember information about the previous step.

LSTMs are a kind of RNN with improvements to the cell equations and backpropagation, which help them perform better, especially on longer sequences.
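To make this concrete, below is a minimal sketch of how an LSTM consumes a sequence of words for text classification, assuming TensorFlow/Keras is installed; the vocabulary size, layer sizes, and random data are placeholders, not values from the recipes.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # placeholder vocabulary size
max_len = 50        # placeholder number of words per document

model = Sequential([
    # map each word index to a dense vector
    Embedding(input_dim=vocab_size, output_dim=64),
    # the LSTM reads the word vectors in order, keeping a memory of earlier steps
    LSTM(32),
    # binary classification head
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# dummy data just to show the expected shapes
X = np.random.randint(0, vocab_size, size=(100, max_len))
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=1, batch_size=16)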

Recipe 6-1. Retrieving Information

Let's take a simple example and see how to build a document retrieval system using a query as input. Let's say we have four documents in our database, as below. (This just showcases how it works; a real-world application would have many more documents.)
Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders."]
Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]
Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."]
Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]
Assume we have numerous documents like this, and you want to retrieve the ones most relevant to the query "cricket." Let's see how to build it.
query = "cricket"

How It Works
Step 1-1 Import the libraries
Here are the libraries:
import gensim
from gensim.models import Word2Vec
import numpy as np
import nltk
import itertools
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
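If you have never used NLTK's stop words or tokenizers on this machine, you may need to download the corpora once before the imports above will work:

import nltk
nltk.download('stopwords')   # stop word lists used by stopword_list
nltk.download('punkt')       # tokenizer models used by word_tokenize/sent_tokenize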

Put all the documents in one list

fin = Doc1 + Doc2 + Doc3 + Doc4
As mentioned earlier, we are going to use word embeddings to solve this problem. Download the pretrained word2vec model from the link below:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
#load the model
model = gensim.models.KeyedVectors.load_word2vec_format('/GoogleNews-vectors-negative300.bin', binary=True)
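The GoogleNews vectors file is several gigabytes, so loading it is slow and memory-hungry. If you only want to experiment, load_word2vec_format accepts a limit argument to read just the first N vectors; the path below is simply wherever you saved the downloaded file.

# optional: load only the 500,000 most frequent words to save memory
model = gensim.models.KeyedVectors.load_word2vec_format(
    '/GoogleNews-vectors-negative300.bin', binary=True, limit=500000)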

#Preprocessing
def remove_stopwords(text, is_lower_case=False):
    # strip out anything that is not a letter, digit, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
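A quick sanity check of the preprocessing; the sample sentence is only an illustration, and the exact output depends on the tokenizer:

print(remove_stopwords("The fine payable in court has been enhanced to Rs 10,000!"))
# expected: 'fine payable court enhanced Rs 10000'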

Function to get the embedding vector; the pretrained model used here has 300 dimensions

def get_embedding(word):
    # return the word's 300-dimensional vector, or a zero vector for unknown words
    if word in model.wv.vocab:
        return model[word]
    else:
        return np.zeros(300)
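The function above follows the older gensim 3.x API (model.wv.vocab). If you are running gensim 4.x, the vocabulary lookup has changed; a minimal equivalent sketch for gensim 4.x would be:

def get_embedding(word):
    # gensim 4.x: KeyedVectors exposes the vocabulary via key_to_index
    if word in model.key_to_index:
        return model[word]
    else:
        return np.zeros(300)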

For every document, we will get many vectors, one per word present. We need to calculate the average vector for the document by taking the mean of all the word vectors.

Getting average vector for each document

out_dict = {}
for sen in fin:
    average_vector = np.mean(
        np.array([get_embedding(x) for x in nltk.word_tokenize(remove_stopwords(sen))]),
        axis=0)
    out_dict[sen] = average_vector

Function to calculate the similarity between the query vector and the document vector

def get_sim(query_embedding, average_vector_doc):
    sim = [(1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc))]
    return sim
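get_sim returns the cosine similarity, i.e., 1 minus the cosine distance: vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0. A tiny illustrative check:

print(get_sim([1.0, 0.0], [1.0, 0.0]))   # [1.0]
print(get_sim([1.0, 0.0], [0.0, 1.0]))   # [0.0]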

Rank all the documents based on the similarity to get the relevant docs

def Ranked_documents(query):
    query_words = np.mean(
        np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())], dtype=float),
        axis=0)
    rank = []
    for k, v in out_dict.items():
        rank.append((k, get_sim(query_words, v)))
    rank = sorted(rank, key=lambda t: t[1], reverse=True)
    print('Ranked Documents :')
    return rank

Step 1-5 Results and applications
Let's see how the information retrieval system we built works, with a couple of examples.

Call the IR function with a query

Ranked_documents("cricket")
Result :
[('But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.',
  [0.44954327116871795]),
 ('He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.',
  [0.23973446569030055]),
 ('With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.',
  [0.18323712012013349]),
 ('Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.',
  [0.17995060855459855])]
