An Overview of Natural Language Processing


Weren't we all surprised the first time a smart device understood what we were telling it? And it answered in the friendliest manner too, didn't it? Apple’s Siri and Amazon’s Alexa, for example, understand us when we ask about the weather, ask for directions, or ask to play a certain genre of music. Ever since, I have wondered how these computers understand our language. That long-standing curiosity was recently rekindled, so I decided to write this blog post as a newcomer to the subject.

In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit, or NLTK, is one of the most powerful and probably one of the most popular natural language processing libraries. It offers one of the most comprehensive toolsets available for Python, and it supports a large number of human languages.

What is Natural Language Processing?


Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.


Why is sorting unstructured data so important?

With every tick of the clock, the world generates an overwhelming amount of data. Yes, it is mind-boggling! And the majority of that data is unstructured. Text, audio, video, and images are classic examples of unstructured data. Unstructured data has no fixed dimensions or structure, unlike the traditional row-and-column layout of relational databases. It is therefore harder to analyze and not easily searchable. At the same time, business organizations need ways to address these challenges and to derive insights from such data if they are to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.

Are computers confused by our natural language?

Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all convey information. There are countless ways of assembling words into a phrase. Words can also have many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself, for example “oh, this is my open secret” or “can you please act naturally”. Though such phrases sound plainly foolish, we humans understand them and use them in everyday speech. For machines, however, the ambiguity and imprecision of natural language are hurdles to overcome.


Most used NLP Libraries


In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could take part in NLP projects. Now developers can use ready-made libraries to simplify text pre-processing so that they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis in only a few lines of code. The most popular NLP libraries are:

Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architect, scikit-learn.

The question is: where should we start, and how?

Have you ever observed how kids start to understand and learn a language? Yes, by picking up individual words first and then forming sentences, right? Making computers understand our language is more or less similar.

Pre-processing steps:

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. POS Tagging
  6. Chunking
  7. WordNet
  8. Bag-of-Words
  9. TF-IDF
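
To make a few of the steps listed above more concrete, here is a minimal sketch of sentence tokenization, word tokenization, stop-word removal, bag-of-words, and TF-IDF. It deliberately uses only the Python standard library, so it is not how NLTK implements these steps. The function names, the tiny stop-word list, and the sample text are all assumptions made up for illustration, and the TF-IDF weighting shown (term count divided by document length, times the log of the number of documents over the document frequency) is just one common variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "and", "of", "to"}


def sentence_tokenize(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens).

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t])
             for t, c in counts.items()} if total else {}
        )
    return weights


text = "The cat sat on the mat. The dog sat on the log."
sentences = sentence_tokenize(text)
docs = [remove_stop_words(word_tokenize(s)) for s in sentences]

print(sentences)
print(docs)
print(bag_of_words(docs[0]))
print(tf_idf(docs))
```

Here each sentence is treated as its own small document. A word such as "sat", which appears in both sentences, gets a TF-IDF weight of zero, while words unique to one sentence get a positive weight. Lemmatization, stemming, POS tagging, chunking, and WordNet lookups depend on linguistic resources and are better left to a library such as NLTK.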