Basics of Natural Language Processing in 10 Minutes

Hello, there

You are here because, like me, you want to learn natural language processing as quickly as possible.

Let’s start

The first thing we need to do is install some dependencies.

  1. Python > 3.7

  2. Download an IDE or install Jupyter Notebook

To install Jupyter Notebook, just open your cmd (terminal) and type pip install notebook. After that, type jupyter notebook to run it; you will then see that your notebook is open at http://127.0.0.1:8888/ with a token appended to the URL.

  3. Install packages

pip install nltk

NLTK: a Python library that we can use to perform all the NLP tasks (stemming, lemmatization, etc.).
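
Before going further, it can help to verify the setup. A minimal sanity check (assuming Python and NLTK were installed as above):

import sys
import nltk

print(sys.version_info >= (3, 7))  ## should print True for the version requirement above
print(nltk.__version__)            ## prints the installed NLTK version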

In this blog, we are going to learn about:

  1. Tokenization
  2. Stopwords
  3. Stemming
  4. Lemmatizing
  5. WordNet
  6. Part of Speech Tagging
  7. Bag of Words

Before learning anything, let’s first understand NLP.

Natural language refers to the way we humans communicate with each other, and processing basically means handling that data in an understandable form. So we can say that NLP (Natural Language Processing) is a way of helping computers communicate with humans in their own language.

It is one of the broadest fields of research, because there is a huge amount of data out there, and a large portion of that data is text. With so much data available, we need techniques through which we can process it and retrieve useful information from it.

Now that we have an understanding of what NLP is, let’s start going through each topic one by one.

1. Tokenization

Tokenization is the process of dividing the whole text into tokens.

It is mainly of two types:

  • Word Tokenizer (separated by words)
  • Sentence Tokenizer (separated by sentences)

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  ## tokenizer models; only needed once per environment

example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. python is awsome"
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

In the above code, we first import nltk; in the second line, we import our tokenizers sent_tokenize and word_tokenize from the library nltk.tokenize. Then, to use a tokenizer on a text, we just need to pass the text as a parameter to the tokenizer.

The output will look something like this:

##sent_tokenize (Separated by sentence)
['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'python is awsome']

##word_tokenize (Separated by words)
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'python', 'is', 'awsome']

2. Stopwords

In general, stopwords are the words in any language that do not add much meaning to a sentence. In NLP, stopwords are those words which are not important for analyzing the data. Examples: he, she, hi, etc. Our main task is to remove all the stopwords from the text before doing any further processing.

There are a total of 179 stopwords in English, and using NLTK we can see all of them. We just need to import stopwords from the library nltk.corpus.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  ## stopword lists; only needed once per environment

print(stopwords.words('english'))

######################
######OUTPUT##########
######################
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
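
If you want to verify that count yourself, just measure the length of the list (the exact number may vary slightly between NLTK versions):

from nltk.corpus import stopwords

print(len(stopwords.words('english')))  ## 179 in the NLTK release used here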

To remove stopwords from a particular text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = 'he is a good boy. he is very good in coding'
text = word_tokenize(text)
text_with_no_stopwords = [word for word in text if word not in stopwords.words('english')]
print(text_with_no_stopwords)

##########OUTPUT##########
['good', 'boy', '.', 'good', 'coding']

3. Stemming

Stemming is the process of reducing a word to its word stem (the form that suffixes and prefixes attach to) or to the root of the word, known as a lemma. In simple words, we can say that stemming is the process of stripping plural endings and other affixes from a word. Example: loved → love, learning → learn.

In Python, we can implement stemming by using PorterStemmer, which we can import from the library nltk.stem.

One thing to remember about stemming is that it works best with single words.

from nltk.stem import PorterStemmer

ps = PorterStemmer()  ## Creating an object for PorterStemmer
example_words = ['earn', 'earning', 'earned', 'earns']  ## Example words
for w in example_words:
    print(ps.stem(w))  ## Stemming each word using the ps object

##########OUTPUT##########
earn
earn
earn
earn

Here we can see that earning, earned, and earns are all stemmed to their root word earn.
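
Since stemming works best with single words, a whole sentence is usually tokenized first and each token stemmed individually. A small sketch of that pattern, using a made-up example sentence:

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "he loved learning new things"  ## hypothetical example sentence
stemmed = [ps.stem(w) for w in word_tokenize(sentence)]
print(stemmed)  ## e.g. ['he', 'love', 'learn', 'new', 'thing']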

4. Lemmatizing

Lemmatization usually refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. In simple words, lemmatization does the same work as stemming; the difference is that lemmatization returns a meaningful word. Example: Stemming: history → histori. Lemmatizing: history → history.

It is mostly used when designing chatbots, Q&A bots, text prediction, etc.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  ## WordNet data used by the lemmatizer; only needed once

lemmatizer = WordNetLemmatizer()  ## Create an object for the lemmatizer
example_words = ['history', 'formality', 'changes']
for w in example_words:
    print(lemmatizer.lemmatize(w))

#########OUTPUT############
----Lemmatizer-----
history
formality
change

-----Stemming------
histori
formal
chang

(The stemming results are shown for comparison, produced with the PorterStemmer from the previous section.)
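
One detail worth knowing: lemmatize treats every word as a noun unless told otherwise, and passing a part-of-speech hint (for example pos='v' for verbs) often gives better results:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           ## running (treated as a noun by default)
print(lemmatizer.lemmatize('running', pos='v'))  ## run (treated as a verb)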

5. WordNet

WordNet is a lexical database (i.e., a dictionary) for the English language, designed specifically for natural language processing. We can use WordNet to find synonyms and antonyms.

In Python, we can import wordnet from nltk.corpus. Code for finding the synonyms and antonyms of a given word:

from nltk.corpus import wordnet

synonyms = []  ## Creating an empty list for all the synonyms
antonyms = []  ## Creating an empty list for all the antonyms
for syn in wordnet.synsets("happy"):  ## Looking up the synsets for the given word
    for i in syn.lemmas():  ## Iterating over the lemmas in each synset
        synonyms.append(i.name())  ## appending all the synonyms
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())  ## antonyms
print(set(synonyms))  ## Converting them into sets for unique values
print(set(antonyms))

#########OUTPUT##########
{'felicitous', 'well-chosen', 'happy', 'glad'}
{'unhappy'}
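
Beyond synonyms and antonyms, WordNet can also score how semantically close two words are. A small sketch using Wu-Palmer similarity, which returns a value between 0 and 1:

from nltk.corpus import wordnet

w1 = wordnet.synset('ship.n.01')  ## 'ship' as a noun, first sense
w2 = wordnet.synset('boat.n.01')  ## 'boat' as a noun, first sense
print(w1.wup_similarity(w2))      ## around 0.9; the higher, the more similar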

6. Part of Speech Tagging

It is the process of converting a sentence into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech Tag List

CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, non-3rd person singular present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

In Python, we can do POS tagging using nltk.pos_tag.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')  ## POS tagger model; only needed once

sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit. Mrs all projecting favourable now unpleasing. Son law garden chatty temper. Oh children provided to mr elegance marriage strongly. Off can admiration prosperous now devonshire diminution law.
'''

words = word_tokenize(sample_text)
print(nltk.pos_tag(words))

################OUTPUT############
[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), ('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), ('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), ('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', '.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), ('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), ('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
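
Once the words are tagged, filtering by tag is straightforward. For example, continuing from the code above, a quick sketch that keeps only the nouns (tags starting with 'NN'):

nouns = [word for word, tag in nltk.pos_tag(words) if tag.startswith('NN')]
print(nouns)  ## e.g. ['sincerity', 'extremity', 'truth', 'merit', 'Mrs', ...]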

7. Bag of Words

Till now, we have learned about tokenizing, stemming, and lemmatizing. All of these are part of text cleaning. After cleaning the text, we need to convert it into some kind of numerical representation, called vectors, so that we can feed the data to a machine learning model for further processing.

To convert the data into vectors, we make use of some predefined libraries in Python.

Let’s see how vector representation works:

sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good
   |
   | ### After removal of stopwords, lemmatization or stemming
   |
sent1 = good boy
sent2 = good girl
sent3 = boy girl good
   |
   | ### Now we calculate the frequency of each word by
   |     counting its occurrences across the sentences
   |
word   frequency
good   3
boy    2
girl   2
   |
   | ### Then, for each sentence, we assign 0 or 1 per word:
   |     1 if the word is present and 0 if it is not
   |
        f1     f2     f3
        girl   good   boy
sent1   0      1      1
sent2   1      0      1
sent3   1      1      1

### After this we pass the vector form to the machine learning model

The above process can be done using a CountVectorizer in Python; we can import it from sklearn.feature_extraction.text.

Code to implement CountVectorizer in Python:

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()
sent = pd.DataFrame(['he is a good boy', 'she is a good girl', 'boy and girl are good'], columns=['text'])
corpus = []
for i in range(0, 3):
    words = sent['text'][i]
    words = word_tokenize(words)
    texts = [lemmatizer.lemmatize(word) for word in words if word not in set(stopwords.words('english'))]
    text = ' '.join(texts)
    corpus.append(text)
print(corpus)  #### Cleaned data

cv = CountVectorizer()  ## Creating an object for CountVectorizer
X = cv.fit_transform(corpus).toarray()
X  ## Vectorized form

############OUTPUT##############
['good boy', 'good girl', 'boy girl good']
array([[1, 0, 1],
       [0, 1, 1],
       [1, 1, 1]], dtype=int64)
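
To see which column of the array corresponds to which word (CountVectorizer orders its features alphabetically), we can inspect the fitted vocabulary:

print(cv.vocabulary_)  ## {'good': 2, 'boy': 0, 'girl': 1}, mapping word to column index

So column 0 is boy, column 1 is girl, and column 2 is good, which matches the vectorized array above.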

Congratulations, now you know the basics of NLP.

Like 👋 and don’t forget to share your views below 👇

Translated from: https://medium.com/@parasharabhay13/basics-of-natural-language-processing-in-10-minutes-2ed51e6d5d32
