NLP (Natural Language Processing): Episode 1, Introduction and NLP Libraries, from 0 to 1

NLP: Natural Language Processing

What is Natural Language?

Natural language is simply human language, such as English or French, whereas computer languages include C and Python, among many others. Computer languages were constructed for specific uses, while natural languages have evolved over many years to suit their speakers. Although natural language follows certain grammar rules, it is not strictly bound by them: it incorporates slang, sarcasm, modern abbreviations, and more. Natural language can take any form, including text, speech, and even sign language. It must be processed before a machine can work with it, hence NLP.

Natural Language Processing

In the simplest terms, NLP (Natural Language Processing) is the interaction between computers and humans using natural language. More broadly, it can be defined as the building of computing tools for the automatic manipulation of natural language in forms such as speech and text. The ultimate objective of NLP is to read, decipher, and make sense of human language in a way that is valuable, and to perform useful operations on it, such as translation, chatbots, question answering, text summarization, speech-to-text and text-to-speech conversion, and sentiment analysis.

Generally, we take Natural Language Processing in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it can be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances, at least to the extent of being able to give useful responses to them. It is a collective term for the automatic computational processing of human languages, covering both algorithms that take human-produced text as input and algorithms that produce natural-looking text as output.
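The simple end of that spectrum, counting word frequencies to compare writing styles, can be sketched with the Python standard library alone. The two sample sentences and the function name `word_frequencies` are invented for illustration; a real comparison would need far more text and a more careful tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercase word token in text."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Two invented samples standing in for two different writing styles.
formal = "It is the case that the results are consistent with the hypothesis."
casual = "Yeah so the results look fine and the idea kinda works."

formal_freq = word_frequencies(formal)
casual_freq = word_frequencies(casual)

# Compare how often each sample uses a few common function words.
for word in ("the", "is", "so"):
    print(word, round(formal_freq.get(word, 0.0), 3), round(casual_freq.get(word, 0.0), 3))
```

Function words such as "the" or "so" are a common choice for this kind of comparison because their usage depends more on style than on topic.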

Structured and Unstructured Data

The main difference between structured and unstructured data is that structured data usually comes in tabular form, i.e., it can be laid out in rows and columns and stored in a relational database. It can also be simple values such as numbers, dates, and IDs. Unstructured data, by contrast, does not fit naturally into rows, columns, or relational tables. Audio files, emails, word-processing documents, and collections of articles all fall into the unstructured category.

Tasks such as text classification and machine translation require labeled data, so structured data is preferred there; for tasks such as question answering and language modeling, unstructured text is preferred.
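To make the distinction concrete, here is a minimal sketch with made-up data: the same kind of information appears once as structured records and once as free text. The field names, the email wording, and the regular expression are all assumptions chosen for illustration.

```python
import re

# Structured: every record has the same named fields, like rows in a table.
orders = [
    {"order_id": 1, "customer": "Ana", "amount": 30.0},
    {"order_id": 2, "customer": "Ben", "amount": 12.5},
    {"order_id": 3, "customer": "Ana", "amount": 7.5},
]

# A question about structured data reduces to a direct field lookup.
ana_total = sum(row["amount"] for row in orders if row["customer"] == "Ana")

# Unstructured: similar information buried in free text.
email = "Hi, this is Ana. I was charged 30 dollars last week and 7.50 today, please check."

# There is no 'amount' field to read; a naive pattern match is one fragile
# way to recover the numbers, and it knows nothing about what they mean.
amounts = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", email)]

print(ana_total)
print(amounts)
```

The structured query is exact, while the text extraction would break as soon as the email mentioned a date or an order number. That gap is part of what NLP tooling tries to close.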

Challenges in NLP

Natural language is hard to learn and highly ambiguous. To get a sense of the difficulty, consider that an average adult needs around 6–7 months to learn a language, yet a machine is expected to learn it in a single go. Even once learned, a language keeps evolving, and detecting the true meaning of a sentence remains difficult: how, for example, should sentiment analysis handle a sarcastic review? Although a language has governing rules, raw data does not necessarily follow them. Present-day usage is full of slang and abbreviations, which prove very troublesome when processing natural language.
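The sarcasm problem can be illustrated with a deliberately naive sentiment scorer. This is a toy sketch, not how any particular library works: the word lists, the function name `lexicon_sentiment`, and both reviews are invented for the example.

```python
import re

# Tiny hand-made sentiment lexicons (assumptions for illustration only).
POSITIVE = {"great", "love", "fantastic", "good"}
NEGATIVE = {"bad", "terrible", "broke", "hate"}


def lexicon_sentiment(text):
    """Label text by counting positive and negative words, ignoring context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


plain = "Terrible phone, I hate it."
sarcastic = "Oh great, the phone died after one day. Fantastic, just what I wanted."

print(lexicon_sentiment(plain))      # negative
print(lexicon_sentiment(sarcastic))  # positive, although the review is a complaint
```

The scorer handles the plain review but labels the sarcastic one as positive, because the words "great" and "fantastic" carry the opposite of their literal meaning here. Word counts alone cannot see that.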

The one thing we do not want to do here is skip the basic concepts of NLP and jump directly to Text Classification and Text Summarization. In this series, we'll try to cover as many topics as possible, including:

Text Pre-processing
Neural Networks
Context-free Word Embeddings
Transformer
Context-based Word Embeddings
Text Summarization
Text Classification
QA module
GLUE Benchmark

A decade ago, only experts with knowledge of statistics, machine learning, and linguistics would take on heavy NLP tasks, but in recent years, thanks to various NLP libraries, solving NLP problems has become much easier. In this article, we'll look at the most popular NLP libraries. They are compared in subsequent articles, in the context of the task each article covers. So let's get started.

Notable NLP Libraries

There are many NLP libraries out there, but a few are worth mentioning. You do not need to study every library in detail, but you should know the advantages and disadvantages of each.

NLTK: The Natural Language Toolkit is probably the most famous NLP library. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with 9 stemmers and dozens of algorithms to choose from. Its main weaknesses are that it is slow compared to other libraries and somewhat complicated to learn and use.
spaCy: spaCy is known as a state-of-the-art library that offers only the best algorithm for each task, sparing you the effort of choosing among alternatives. It is designed explicitly for production use: it lets you develop applications that process and understand huge volumes of text. Because it is implemented in Cython, spaCy is very fast. It supports tokenization for more than 49 languages.
Stanford CoreNLP: Stanford CoreNLP is a suite of production-ready natural language analysis tools. Because CoreNLP is written in Java, it requires Java to be installed on your machine. However, it offers programming interfaces for many popular languages, including Python. The library provides extensive functionality and is fast and accurate, which is why many organizations use CoreNLP in production.
TextBlob: TextBlob is built on NLTK and another package called Pattern, and offers an easy-to-use interface to the NLTK library. It provides a very straightforward API for all common (and some less common) NLP tasks. While TextBlob does nothing particularly new or exciting, it makes working with text very enjoyable and removes a lot of barriers. The library provides built-in functions for text classification and sentiment analysis.
Gensim: Gensim is a Python library designed specifically for “topic modeling, document indexing, and similarity retrieval with large corpora.” All algorithms in Gensim are memory-independent with respect to corpus size, so it can process input larger than RAM. Even though it is a Python library, Gensim is fast and memory efficient.
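The idea behind processing a corpus larger than RAM is streaming: documents are read one at a time rather than loaded all at once. The sketch below shows that general idea using only the standard library. It is not Gensim's API, and the names `stream_documents` and `count_terms` are hypothetical.

```python
import io
from collections import Counter


def stream_documents(handle):
    """Yield one tokenized document per line without loading the whole file."""
    for line in handle:
        tokens = line.lower().split()
        if tokens:  # skip blank lines
            yield tokens


def count_terms(documents):
    """Accumulate term counts while holding only one document in memory."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


# io.StringIO stands in for a large file on disk (an assumption for the demo).
corpus = io.StringIO("the cat sat\nthe dog barked\n\nthe cat slept\n")
term_counts = count_terms(stream_documents(corpus))
print(term_counts.most_common(2))
```

Because the generator yields one document at a time, memory use grows with the size of the vocabulary rather than with the number of documents.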

This was a basic introduction to NLP and the libraries that provide NLP functionality. Later articles will dive deeper into each of the topics listed above.

So let's get started on our NLP journey!