
Roadmap to Natural Language Processing (NLP)

Introduction

Due to the development of Big Data over the last decade, organizations are now faced with analysing large amounts of data coming from a wide variety of sources on a daily basis.


Natural Language Processing (NLP) is the area of research in Artificial Intelligence focused on processing and using Text and Speech data to create smart machines and derive insights.


One of the most interesting NLP applications nowadays is creating machines able to discuss complex topics with humans. IBM Project Debater represents so far one of the most successful approaches in this area.


Video 1: IBM Project Debater

Preprocessing Techniques

Some of the most common techniques which are applied in order to prepare text data for inference are:


  • Tokenization: is used to segment the input text into its constituent words (tokens). In this way, it becomes easier to then convert our data into a numerical format.


  • Stop Words Removal: is applied in order to remove from our text all the articles, prepositions and other very common words (eg. “an”, “the”, etc…) which can just be considered as a source of noise in our data (since they do not carry additional informative content).


  • Stemming: is finally used in order to get rid of all the affixes in our data (eg. prefixes or suffixes). In this way, it becomes much easier for our algorithm not to treat words which actually have a similar meaning (eg. insight ~ insightful) as distinct.


All of these preprocessing techniques can be easily applied to different types of texts using standard Python NLP libraries such as NLTK and Spacy.

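As an illustration, a minimal preprocessing sketch using NLTK might look like the following (the sample sentence and the choice of the Porter stemmer are assumptions made purely for illustration):

```python
# Minimal preprocessing sketch with NLTK: tokenization, stop word removal, stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop word lists

text = "Reading insightful books gives us useful insights."  # made-up example sentence

# Tokenization: split the input text into its constituent words (tokens).
tokens = word_tokenize(text.lower())

# Stop words removal: drop very common words that mostly add noise.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: strip affixes so that e.g. "insightful" and "insights" share a stem.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['read', 'insight', 'book', 'give', 'use', 'insight']
```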

Additionally, in order to extrapolate the language syntax and structure of our text, we can make use of techniques such as Parts of Speech (POS) Tagging and Shallow Parsing (Figure 1). Using these techniques, we explicitly tag each word with its lexical category (which is based on the phrase's syntactic context).


Figure 1: Parts of Speech Tagging Example [1].
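
For example, NLTK's built-in tagger can produce POS tags like those shown in the figure (a minimal sketch; the sentence is made up):

```python
# Minimal POS tagging sketch with NLTK's averaged perceptron tagger.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "IBM Project Debater can discuss complex topics with humans."
tokens = nltk.word_tokenize(sentence)

# Tag each token with its lexical category (NNP = proper noun, MD = modal, ...).
print(nltk.pos_tag(tokens))
# e.g. [('IBM', 'NNP'), ('Project', 'NNP'), ('Debater', 'NNP'), ('can', 'MD'), ('discuss', 'VB'), ...]
```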

Modelling Techniques

Bag of Words

Bag of Words is a technique used in Natural Language Processing and Computer Vision in order to create new features for training classifiers (Figure 2). This technique is implemented by constructing a histogram counting all the words in our document (not taking into account the word order and syntax rules).


Figure 2: Bag of Words [2]

One of the main problems which can limit the efficacy of this technique is the presence of prepositions, pronouns, articles, etc… in our text. These are all words which are likely to appear frequently in our text without necessarily being really informative about the main characteristics and topics of our document.


In order to solve this type of problem, a technique called “Term Frequency-Inverse Document Frequency” (TF-IDF) is commonly used. TF-IDF aims to rescale the word count frequencies in our text by considering how frequently each word appears overall in a large sample of texts. Using this technique, we reward words (scaling up their frequency value) which appear quite commonly in our text but rarely in other texts, while penalising words (scaling down their frequency value) which appear frequently in both our text and other texts (such as prepositions, pronouns, etc…).

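A minimal sketch of both Bag of Words and TF-IDF using scikit-learn (the toy documents are invented, and a recent scikit-learn version is assumed):

```python
# Bag of Words vs. TF-IDF on a few toy documents with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of Words: a histogram of raw word counts per document (word order ignored).
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: the counts rescaled by how common each word is across all documents,
# so ubiquitous words such as "the" receive a lower weight.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```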

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a type of Topic Modelling technique. Topic Modelling is a field of research focused on finding ways to cluster documents in order to discover latent distinguishing markers which can characterize them based on their content (Figure 3). Therefore, Topic Modelling can also be considered in this context as a dimensionality reduction technique, since it allows us to reduce our initial data to a limited set of clusters.


Figure 3: Topic Modelling [3]

Latent Dirichlet Allocation (LDA) is an unsupervised learning technique used to find the latent topics which can characterize different documents and cluster similar ones together. This algorithm takes as input the number N of topics which are believed to exist and then groups the different documents into N clusters of documents which are closely related to each other.


What distinguishes LDA from other clustering techniques such as K-Means Clustering is that LDA is a soft-clustering technique (each document is assigned to a cluster based on a probability distribution). For example, a document can be assigned to Cluster A because the algorithm determines that it is 80% likely that this document belongs to this class, while still taking into account that some characteristics embedded in this document (the remaining 20%) are more likely to belong to a second Cluster B instead.

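A minimal LDA sketch with scikit-learn, assuming N = 2 topics and a small invented corpus:

```python
# Soft-clustering documents into N latent topics with LDA (scikit-learn).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stocks and bonds are financial assets",
    "the market rewarded long term investors",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# N = 2 topics; fit_transform returns one topic distribution per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a probability distribution over the 2 topics (soft assignment),
# e.g. a document could be ~80% topic A and ~20% topic B.
print(doc_topics.round(2))
```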

Word Embeddings

Word Embeddings are one of the most common ways to encode words as vectors of numbers which can then be fed into our Machine Learning models for inference. Word Embeddings aim to reliably transform our words into a vector space so that similar words are represented by similar vectors.


Figure 4: Word Embedding [4]

Nowadays, there are three main techniques used in order to create Word Embeddings: Word2Vec, GloVe and fastText. All three of these techniques use a shallow neural network in order to create the desired word embedding.


In case you are interested in finding out more about how Word Embeddings work, this article is a great place to start.

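As a rough sketch, Word2Vec embeddings can be trained with the Gensim library (the Gensim 4.x API is assumed; the tiny corpus and hyperparameters are illustrative only, and real embeddings require far larger corpora):

```python
# Training a small Word2Vec model with Gensim on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# vector_size: dimensionality of the embedding space; window: context size.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

print(model.wv["cat"][:5])                   # first components of the "cat" vector
print(model.wv.most_similar("cat", topn=3))  # nearest words in the embedding space
```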

Sentiment Analysis

Sentiment Analysis is an NLP technique commonly used in order to understand whether some form of text expresses a positive, negative or neutral sentiment about a topic. This can be particularly useful when, for example, trying to find out what the general public opinion is (through online reviews, tweets, etc…) about a topic, product or company.


In sentiment analysis, sentiments in texts are usually represented as a value between -1 (negative sentiment) and 1 (positive sentiment) referred to as polarity.


Sentiment Analysis can be considered an Unsupervised Learning technique since we are not usually provided with handcrafted labels for our data. In order to overcome this obstacle, we make use of prelabeled lexicons (collections of words) which have been created to quantify the sentiment of a large number of words in different contexts. Some examples of widely used lexicons in sentiment analysis are TextBlob and VADER.

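A minimal sketch using the VADER lexicon through NLTK (the review sentence is made up, and the printed scores are only indicative):

```python
# Lexicon-based sentiment scoring with VADER (shipped with NLTK).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()
review = "This book is absolutely wonderful, I could not put it down!"

# 'compound' is a normalised polarity score between -1 (negative) and 1 (positive).
print(analyzer.polarity_scores(review))
# e.g. {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.86}
```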

Transformers

Transformers represent the current state-of-the-art NLP models for analysing text data. Some examples of widely known Transformer models are BERT and GPT-2.


Before the creation of Transformers, Recurrent Neural Networks (RNNs) represented the most efficient way to analyse text data sequentially for prediction, but this approach finds it quite difficult to reliably make use of long-term dependencies (eg. our network might find it difficult to understand whether a word fed in several iterations ago might turn out to be useful for the current iteration).


Transformers successfully managed to overcome this limitation thanks to a mechanism called Attention (which is used in order to determine which parts of the text to focus on and give more weight to). Additionally, Transformers made it easier to process text data in parallel rather than sequentially (therefore improving execution speed).


Transformers can nowadays be easily implemented in Python thanks to the Hugging Face library.

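A minimal sketch using the Hugging Face pipeline API (the default pre-trained models are downloaded on first use; the exact outputs will vary):

```python
# Using pre-trained Transformer models through the Hugging Face pipeline API.
from transformers import pipeline

# Sentiment analysis with the default pre-trained model.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers made modern NLP models much easier to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Fill-mask with a BERT-style model: predict the most likely word for [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Natural Language Processing is a branch of [MASK] intelligence.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```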

Text Prediction Demonstration

Text prediction is one of the tasks which can be easily implemented using Transformers such as GPT-2. In this example, we will give as input a quote from “The Shadow of the Wind” by Carlos Ruiz Zafón and our transformer will then generate another 50 characters which should logically follow our input data.


A book is a mirror that offers us only what we already carry inside us. It is a way of knowing ourselves, and it takes a whole life of self awareness as we become aware of ourselves. This is a real lesson from the book My Life.

As can be seen from our example output shown above, our GPT-2 model performed quite well in creating a reasonable continuation for our input string.

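A sketch of how such a generation call might look with the Hugging Face text-generation pipeline (this is an assumption about the setup; the linked notebook below may differ, and here the 50-character continuation is approximated with 50 new tokens):

```python
# Generating a continuation of a prompt with GPT-2 through Hugging Face.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible

generator = pipeline("text-generation", model="gpt2")
prompt = "A book is a mirror that offers us only what we already carry inside us."

# max_new_tokens limits how much extra text is generated after the prompt.
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output[0]["generated_text"])
```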

An example notebook which you can run in order to generate your own text is available at this link.


I hope you enjoyed this article, thank you for reading!


Contacts

If you want to keep updated with my latest articles and projects, follow me on Medium and subscribe to my mailing list. These are some of my contact details:


Bibliography

[1] Extract Custom Keywords using NLTK POS tagger in python, Thinkinfi, Anindya Naskar. Accessed at: https://www.thinkinfi.com/2018/10/extract-custom-entity-using-nltk-pos.html


[2] Comparison of word bag model BoW and word set model SoW, ProgrammerSought. Accessed at: http://www.programmersought.com/article/4304366575/;jsessionid=0187F8E68A22612555B437068028C012


[3] Topic Modeling: Art of Storytelling in NLP, TechnovativeThinker. Accessed at: https://medium.com/@MageshDominator/topic-modeling-art-of-storytelling-in-nlp-4dc83e96a987


[4] Word Mover’s Embedding: Universal Text Embedding from Word2Vec, IBM Research Blog. Accessed at: https://www.ibm.com/blogs/research/2018/11/word-movers-embedding/


Translated from: https://towardsdatascience.com/roadmap-to-natural-language-processing-nlp-38a81dcff3a6
