Zhihu NLP Study: A One-Month Plan to Learn Natural Language Processing

This article presents a detailed one-month plan for learning natural language processing, covering key topics from the basics through deep learning, to help readers understand and master this core skill in artificial intelligence.

In this article, I share a one-month plan to learn the basics of Natural Language Processing. NLP is a vast, multidisciplinary subject that draws on concepts from computer science, linguistics, neuroscience, etc., and is one of the most popular research areas in machine learning. This one-month plan can be used to prepare for data science interviews or to start a project in NLP.

What should I know beforehand?

To properly understand this week's material, you should already be comfortable with the following (or with the material covered in courses on them):

  • Numpy
  • Pandas
  • Matplotlib
  • Scikit-Learn
  • Basic TensorFlow/PyTorch (not mandatory, but recommended)

Let's Go!

In last week's material, you saw how a convolutional network works and how to use image data to train a model to classify graphenes. When dealing with image data, there was no question of how to represent an image for the model: it was already given in pixel form, i.e., you didn't need to work out how to describe the pictures. But this is not the case with text data.

Text data in its raw form is just a string that makes no sense to the computer; a model cannot be trained on it until it has been converted into numerical vectors. There are various ways this is done, which are described in the article below.

The above article describes the essential practices in text processing and some basic NLP tasks.

Tokenization, stemming, lemmatization, etc. are a few techniques for processing text data before training an ML model on it.
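
To make these steps concrete, here is a toy sketch of tokenization and suffix-stripping stemming in plain Python. This is only an illustration of the idea; real pipelines would use a library such as NLTK, whose Porter stemmer applies far more careful rules.

```python
import re

def tokenize(text):
    # Lowercase and pull out alphabetic word runs (apostrophes kept for
    # contractions like "don't"). Real tokenizers handle much more.
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(token):
    # Toy suffix stripping -- NOT the Porter algorithm, just the idea:
    # map inflected forms toward a common stem.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The striped bats were hanging on their feet.")
stems = [naive_stem(t) for t in tokens]
```

Lemmatization differs from stemming in that it maps a word to a real dictionary form ("were" to "be"), which requires vocabulary and part-of-speech information rather than simple suffix rules.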

There is a Python text-processing module called NLTK (Natural Language Toolkit) for performing the text-processing tasks described in the article above. Working examples and a tutorial for the module are given in detail in the resource described below.

The documentation for NLTK can be accessed here, and the book based on the documentation can be accessed here.

To model the data for prediction, various classical methodologies were developed.

One of them is bag-of-words, described in the article below.
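
A bag-of-words representation simply counts how often each vocabulary word occurs in a document, turning text into fixed-length numeric vectors. A minimal sketch (a real pipeline would typically use scikit-learn's CountVectorizer):

```python
from collections import Counter

def build_vocab(docs):
    # Sorted, deduplicated word list so every document maps to the
    # same vector layout.
    return sorted({word for doc in docs for word in doc.lower().split()})

def bag_of_words(doc, vocab):
    # One count per vocabulary word; word order in the document is lost,
    # hence the name "bag" of words.
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

docs = ["the cat sat", "the cat sat on the mat"]
vocab = build_vocab(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

These count vectors can then be fed to any classical classifier (Naive Bayes, logistic regression, etc.).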

Here comes Deep Learning

Like many other machine learning tasks, NLP tasks such as POS tagging, NER classification, etc. also achieve state-of-the-art results with deep learning techniques.

The following course discusses most of the techniques and the types of deep learning architectures used in NLP.

It is advisable to watch all the videos provided in the course, in addition to taking the quizzes. The assignment part of the course, however, is optional. The course covers the basic techniques without going into deep technical detail on the methods.

Those who want to delve deeper into the literature are advised to go through the following course after completing the previous one.

Lectures 3 and 4 can be skipped if they seem too mathematical, as they don't discuss much NLP.

Where is all of this used?

POS tagging is one of the primary applications of NLP. The article below describes a method of POS tagging using a CRF (Conditional Random Field).
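
A CRF tagger does not look at raw words; it scores hand-crafted features for each token in context. As a sketch, a feature function of the kind typically fed to a CRF library (the exact feature set here is an illustrative assumption, not the one from the article) might look like:

```python
def token_features(sentence, i):
    # sentence: list of word strings; i: index of the current token.
    # Each token becomes a dict of features; the CRF learns weights for
    # feature/tag combinations, including transitions between tags.
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization hints at proper nouns
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],             # e.g. "ing" hints at a verb
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

sent = ["The", "cat", "sat"]
feats = [token_features(sent, i) for i in range(len(sent))]
```

Libraries such as sklearn-crfsuite accept exactly this kind of per-token feature dict, paired with one tag per token, for training.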

The courses above give a thorough treatment of LSTMs, so now let's dive into a basic implementation of an LSTM for text classification.
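
To connect the course material to code, here is a single forward step of one LSTM cell written out in NumPy, following the standard gate equations. This is a sketch with randomly initialized weights, not a trained model; in practice you would use tf.keras.layers.LSTM or torch.nn.LSTM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # x: input vector (d,); h_prev, c_prev: previous hidden/cell state (n,).
    # W: (4n, d+n) stacked gate weights; b: (4n,) stacked biases.
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:n])            # input gate: how much new info to write
    f = sigmoid(z[n:2 * n])        # forget gate: how much old state to keep
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much state to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate cell state
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 3, 4                        # input size, hidden size
W = rng.normal(size=(4 * n, d + n)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, b)
```

For text classification, such a cell is applied token by token over the embedded sequence, and the final hidden state h is fed to a dense softmax layer.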

Diving Deeper!

For a long time, LSTMs were used for many NLP tasks, but they suffer from long-range memory loss and give poor results on long sequences. To remove this effect, Transformers were introduced, which model long-range dependencies using attention mechanisms. BERT is a special type of model by Google based on the Transformer (also from Google).
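
The core operation behind the Transformer's long-range modeling is scaled dot-product attention: every query position directly weighs every key position, regardless of distance. A minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d) queries; K: (n, d) keys; V: (n, dv) values.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of each query to each key
    # Softmax over keys (max-subtraction for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights     # weighted mix of values, plus the weights

Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0], [20.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

The query here is closest to the first key, so the output leans toward the first value; in a full Transformer this runs over many heads with learned projections.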

Many tasks involve taking a sentence as input and predicting another sentence. These tasks are generally known as seq2seq. Some common examples are machine translation, text summarization, etc. These are discussed in the following articles.

A tutorial for implementing an ML model for Neural Machine Translation in TensorFlow is given below. It provides a basic understanding of how to approach a sequence-to-sequence problem and how to train the network.

Next: Start Practicing!

It is always important to put the concepts to the test in a competition; one such competition from Kaggle, aimed at beginners, is given below. The competition deals with a binary classification problem on textual data.
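
As a first baseline for a binary text-classification task like this, even a tiny hand-rolled multinomial Naive Bayes goes a long way. The snippet below is a toy sketch on an invented four-sentence dataset (not the competition's data, and a real entry would use scikit-learn or an LSTM):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    # Multinomial Naive Bayes: count words per class.
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict_nb(doc, word_counts, class_counts, vocab):
    # Pick the class maximizing log P(class) + sum log P(word | class),
    # with add-one (Laplace) smoothing for unseen words.
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y in (0, 1):
        lp = math.log(class_counts[y] / total)
        denom = sum(word_counts[y].values()) + len(vocab)
        for w in doc.lower().split():
            lp += math.log((word_counts[y][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = ["fire in the city", "lovely sunny day", "earthquake hits town", "great day out"]
labels = [1, 0, 1, 0]            # toy labels: 1 = event of interest, 0 = not
model = train_nb(docs, labels)
pred = predict_nb("fire hits town", *model)
```

Comparing such a simple baseline against the published kernels is a good way to see what the fancier models actually buy you.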

The following are a couple of tutorials that can help you get started with the Kaggle competition. They are just two of the many excellent kernels published on the Kaggle platform. It is advisable to go through them, as they give a good opportunity to learn from others.

Translated from: https://towardsdatascience.com/a-one-month-plan-to-learn-natural-language-processing-e364348146e0
