Introduction to Natural Language Processing

Natural Language Processing

Advancement in technology has given a new direction to our way of life: we generate tonnes of data, extract deep insights from it, and in doing so make most businesses data-driven. Most of the data that is generated is in the form of text or human language, which is hard for a machine to understand. This text data is generated when we speak, send a WhatsApp message, or write an e-mail, and it also arrives as healthcare records, social media posts, tweets, and even product reviews. These data cannot be directly fed to our model as they contain lots of noise; they need to be thoroughly processed before use.

Thus, to solve this problem, “Natural Language Processing”, frequently known as NLP, comes into the picture. Natural Language Processing is a key segment of AI focused on understanding and analyzing hidden patterns in text data. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Text Preprocessing

Photo by Shahadat Rahman on Unsplash

Text data is one of the most unstructured forms of available data. To make this data noise-free, NLP offers various methods.

Noise Reduction

Noise such as punctuation marks carries no meaning for a machine, so it needs to be removed from the data.

import re

text = "What is NLP? Why, do we need to reduce noise?"

# remove punctuation
result = re.sub(r'[\.\?\!\,\:\;\"]', '', text)
print(result)
# What is NLP Why do we need to reduce noise
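The character class above lists only a few punctuation marks by hand. A minimal alternative sketch, assuming you want to strip the full ASCII punctuation set, uses Python's built-in string.punctuation:

import re
import string

text = "What is NLP? Why, do we need to reduce noise?"

# re.escape guards regex metacharacters such as ']' and '\'
punct_pattern = '[' + re.escape(string.punctuation) + ']'
print(re.sub(punct_pattern, '', text))
# What is NLP Why do we need to reduce noise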

Tokenization

Tokenization is the process of converting the complete text into tokens, i.e. splitting sentences into individual words.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "What is NLP Why do we need to reduce noise"
tokenized = word_tokenize(text)
print(tokenized)
# ['What', 'is', 'NLP', 'Why', 'do', 'we', 'need', 'to', 'reduce', 'noise']
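Word tokenization also has a sentence-level counterpart. As a short sketch, NLTK's sent_tokenize (which relies on the same punkt models downloaded above) splits raw text into sentences first:

from nltk.tokenize import sent_tokenize

text = "What is NLP? Why do we need to reduce noise?"
print(sent_tokenize(text))
# ['What is NLP?', 'Why do we need to reduce noise?']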

Text Normalisation

Text normalization plays a major role in text data preprocessing. Usually, text data contains various similar words, or the same word with different affixes. For example, ‘jumps’, ‘jumping’, and ‘jumped’ are all variations of the word ‘jump’. These words have the same meaning, and they need to be normalized to their base form. We have two popular methods to normalize data:

  • Stemming

  • Lemmatization

Stemming

Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

from nltk.stem import PorterStemmer

tokenized = ["So", "kangaroos", "many", "jumped", "jump", "jumping"]
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# ['so', 'kangaroo', 'mani', 'jump', 'jump', 'jump']
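Porter is not the only stemmer NLTK ships. As a hedged sketch, the Snowball stemmer (a refinement of Porter, sometimes called Porter2) can be swapped in with one line if you want to compare its output on the same tokens:

from nltk.stem import SnowballStemmer

tokenized = ["So", "kangaroos", "many", "jumped", "jump", "jumping"]
snowball = SnowballStemmer('english')
# outputs may differ slightly from PorterStemmer on some words
print([snowball.stem(token) for token in tokenized])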

As you can observe in the above output, the word ‘many’ was converted to ‘mani’, and that is the limitation of stemming. This is addressed by lemmatization.

Lemmatization

Lemmatization is a step-by-step, organized procedure for obtaining the root word. Lemmatization is preferred over stemming because it performs a morphological analysis of the words.

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

tokenized = ["So", "kangaroos", "many", "jumped", "jump", "jumping"]
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)
# ['So', 'kangaroo', 'many', 'jumped', 'jump', 'jumping']
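Notice that ‘jumped’ and ‘jumping’ were left unchanged: WordNetLemmatizer treats every token as a noun unless told otherwise. Passing a part-of-speech tag changes that; a minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# 'v' tells the lemmatizer to treat the token as a verb
print(lemmatizer.lemmatize('jumped', pos='v'))   # jump
print(lemmatizer.lemmatize('jumping', pos='v'))  # jump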

Stopwords Removal

Words like ‘a’, ‘an’, and ‘the’ are commonly used in a document, but they don't convey the essence or meaning of that doc. These words are filtered out of the doc before training. NLTK already comes with a list of predefined stopwords, making the code easy.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

word_tokens = ['I', 'am', 'a', 'very', 'good', 'programmer']

# define set of English stopwords
stop_words = set(stopwords.words('english'))

# remove stopwords from tokens in dataset
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# ['I', 'good', 'programmer']
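Note that ‘I’ survived only because NLTK's stopword list is all lowercase while the token is capitalized. A common fix, sketched here, is to lowercase each token before the membership test:

from nltk.corpus import stopwords

word_tokens = ['I', 'am', 'a', 'very', 'good', 'programmer']
stop_words = set(stopwords.words('english'))

# compare in lowercase so capitalized stopwords are caught too
statement_no_stop = [word for word in word_tokens if word.lower() not in stop_words]
print(statement_no_stop)
# ['good', 'programmer']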

Now that we have removed almost all the noise from the data, it's time to analyze it by converting the tokens into features. We will discuss this in the next part.

Till then, Bye. See ya..!

Translated from: https://medium.com/towards-artificial-intelligence/introduction-to-natural-language-processing-685a6e41fd3f
