Pretraining GPT-2
We all know modern-day Natural Language Processing (NLP) has progressed by leaps and bounds in the past couple of years following the development of attention mechanisms and transformers. These paved the way for a plethora of new algorithms achieving State-Of-The-Art (SOTA) results on the different tasks of NLP.
OpenAI has been one of the leaders in providing its own language model (the recently released GPT-3), which is trained on a huge corpus of internet data. Since GPT-3 is a recent phenomenon, is currently available only in English, and is accessible only through the API provided by OpenAI, we shift our focus to its earlier version, GPT-2. To learn about the internal nuts and bolts of GPT-2, I’d suggest going through this link. For more depth on Attention and Transformers, here are some excellent links:
The illustrated Transformer by Jay Alammar
The Annotated Transformer by Harvard NLP
GPT-2, too, was released only for English, which makes it difficult for anyone trying to generate text in a different language.
So why not train your own GPT-2 model on your favourite language for text generation? That is exactly what we are going to do. So, without further ado, let us jump in.
For the demo, I have considered a non-Latin alphabet script (Bengali here), because why not!! I have used Huggingface’s implementation for the model.
1. Gathering the data.
Gathering good quality data is one of the most important stages as all Data Scientists would agree. So, we are going to assume that you already have a folder contai
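As a starting point, the gathering step can be sketched as merging a folder of plain-text files into a single training file. This is only an assumed setup (the folder layout, `.txt` extension, and file names here are illustrative):

```python
from pathlib import Path

# A minimal sketch, assuming your corpus sits in a folder of .txt files.
# The directory and output names are illustrative.
def build_training_file(corpus_dir, out_path="train.txt"):
    texts = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Read each document and strip stray surrounding whitespace.
        texts.append(path.read_text(encoding="utf-8").strip())
    # Concatenate all documents, one per line, into the training file.
    Path(out_path).write_text("\n".join(texts), encoding="utf-8")
    return out_path
```

Keeping everything in UTF-8 matters especially for non-Latin scripts like Bengali, since the tokenizer will later be trained on exactly these bytes.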