Natural Language Processing: Preprocessing Steps
Introduction
The GPT-3 model has become a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters, compared to GPT-2's roughly 1.5 billion, making it a major breakthrough in NLP. But the preprocessing steps required before training any model are of utmost importance. Therefore, in this article I will explain the major steps used to preprocess data before training an NLP model.
First I will list the preprocessing steps, and then explain each in detail:
- Removing HTML tags
- Removing stopwords
- Removing extra spaces
- Converting numbers to their textual representations
- Lowercasing the text
- Tokenization
- Stemming
- Lemmatization
- Spell-checking
Now let’s go through them one by one.
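Before diving into each step, here is a minimal sketch of how several of them fit together in one pipeline. It uses only the standard library: the regexes, the tiny stopword set, and the whitespace tokenizer are illustrative stand-ins (a real pipeline would typically use BeautifulSoup for HTML removal and NLTK for stopwords and tokenization).

```python
import re

# Illustrative stopword set; in practice you would load a full list,
# e.g. from NLTK's stopwords corpus.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to"}

def preprocess(text):
    """Apply several of the preprocessing steps listed above."""
    # Removing HTML tags with a simple regex (a library such as
    # BeautifulSoup is more robust for real HTML).
    text = re.sub(r"<[^>]+>", " ", text)
    # Lowercasing the text
    text = text.lower()
    # Removing extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization (naive whitespace split; NLTK's word_tokenize
    # is a common alternative)
    tokens = text.split()
    # Removing stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The  GPT-3 model is   a breakthrough in NLP</p>"))
# → ['gpt-3', 'model', 'breakthrough', 'nlp']
```

Stemming, lemmatization, number-to-text conversion, and spell-checking are omitted here because they rely on external resources; they are covered in their own sections below.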