Build Your Own Language Model 📚

Not long ago, building fascinating things in the AI/ML domain required serious computation power; these are the things that actually contribute to human advancement, such as medical research, identifying cancer cells, and even the recent search for a cure for Covid-19. But advancements in the field of AI/ML and the development of robust libraries and frameworks have made it possible to build and train models from anywhere in the world, even without heavy computing resources. So today I am briefing you on one such advancement that has made a huge contribution to the field of Natural Language Processing (NLP).

Hugging Face is one of the most promising teams in the NLP world; they revolutionized the domain with their contributions, most notably Transformers. Have a look at their website https://huggingface.co/ 🤗

Today I will explain how you can train your own language model using HuggingFace’s transformer library, given that you have the required dataset for it. Don’t worry if you don’t have one; you can always experiment with freely available data and train a model to learn the process. There are wonderful sites offering datasets to try out, or you can even do web scraping to get the data you need.

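For example, here is a minimal sketch (the corpus choice and file handling are my own, not from the original post) that pulls the freely available WikiText-2 corpus with the separate datasets library and writes it out one sentence per line, matching the file name used later in the loading step:

# Requires: pip install datasets
from datasets import load_dataset

# Pull a freely available corpus to practice with (WikiText-2 is just an example).
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Write the raw text, one non-empty line per example, for the later training steps.
with open("language-data-file.txt", "w", encoding="utf-8") as f:
    for example in ds:
        text = example["text"].strip()
        if text:
            f.write(text + "\n")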

Once you have the dataset, next comes the setup to train the model.

Hardware: If you have a GPU machine available, you can train on it. Otherwise, make use of Google Colab or Kaggle Notebooks, provided you have sliced the data to fit the Colab/Kaggle RAM size so that it fits in memory during the training process; otherwise you will see out-of-memory errors or the notebook will hang.

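If your corpus is too large for the notebook’s RAM, a minimal sketch of slicing it into smaller shard files might look like this (the shard size and output names are assumptions; adjust them to your environment):

from pathlib import Path

def shard_corpus(src="language-data-file.txt", out_dir="shards", lines_per_shard=500_000):
    # Split one large line-by-line corpus into smaller files that fit in RAM.
    Path(out_dir).mkdir(exist_ok=True)
    shard, buf = 0, []
    with open(src, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= lines_per_shard:
                Path(out_dir, f"part-{shard:03d}.txt").write_text("".join(buf), encoding="utf-8")
                shard, buf = shard + 1, []
    if buf:  # write the remainder
        Path(out_dir, f"part-{shard:03d}.txt").write_text("".join(buf), encoding="utf-8")

shard_corpus()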

Once you are ready with the hardware, next come the dependency packages to train the model.

# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers

Training is broken down into the steps below:

  • Tokenize

  • Config setup

  • Load Data set

  • Train

  • Test

  • Publish/Upload

  1. Tokenize:

This is the process of splitting sentences into tokens. We will use ByteLevelBPETokenizer, a byte-level BPE tokenizer; read more about it here.

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Get all .txt files in the given path; point this to your dataset files.
paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save the tokenizer output to a local folder; change the dir name as per your need.
# Here my dir name is myBERTo.
tokenizer.save_model("myBERTo")
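
As a quick sanity check (my addition, not part of the original post), you can encode a sample sentence with the freshly trained tokenizer and inspect the result:

# Encode a sample sentence with the tokenizer trained above.
output = tokenizer.encode("i am learning NLP.")
print(output.tokens)  # sub-word tokens produced by the byte-level BPE
print(output.ids)     # corresponding vocabulary ids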

2. Config setup

RoBERTa is one of the training approaches for BERT-based models, so we will use it to train our BERT model with the config below. Play with the values of these hyperparameters and train accordingly to get better results.

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
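
The training step later refers to a model object, so the model itself has to be built from this config. A minimal sketch of doing that (a RoBERTa masked-language model initialized from scratch):

from transformers import RobertaForMaskedLM

# Build a fresh, untrained RoBERTa masked-LM from the config above.
model = RobertaForMaskedLM(config=config)

print(model.num_parameters())  # rough size check before training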

3. Load Data set

Once we have the tokenized data and have finalized the config params for RoBERTa, let's see how we will load the data for training.

Assuming we have a .txt file with the data sentences arranged line by line, we will use the LineByLineTextDataset util from transformers to load the data file. Here too you can play around with the config params to best fit your case.

We also need the DataCollatorForLanguageModeling util, which batches the examples and applies the random token masking required for masked-language-model training in PyTorch. Note that the tokenizer trained in step 1 is first reloaded as a transformers RobertaTokenizerFast, since these utilities expect a transformers tokenizer.

from transformers import RobertaTokenizerFast, LineByLineTextDataset, DataCollatorForLanguageModeling

# Reload the step 1 tokenizer as a transformers tokenizer for the utilities below.
tokenizer = RobertaTokenizerFast.from_pretrained("./myBERTo")

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./language-data-file.txt",  # this path is from step 1
    block_size=128,
)

# Need this util to batch the data and apply random masking during training.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
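
As an optional check (my addition, assuming a recent transformers version), you can collate a few examples to see the masking in action:

# Collate a handful of examples and peek at the result.
batch = data_collator([dataset[i] for i in range(4)])
print(batch["input_ids"].shape)   # (4, padded sequence length)
print(batch["labels"][0][:20])    # -100 everywhere except the masked positions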

4. Train

Now we are almost ready to train the model. First, let's set up the training arguments and the few params needed for it.

# Set up training arguments
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./myBERTo",        # this dir name is from step 1
    overwrite_output_dir=True,
    num_train_epochs=10,           # change the epoch count as you need; more epochs take longer to train
    per_gpu_train_batch_size=64,   # decrease this number for out-of-memory issues
    save_steps=10_000,
    save_total_limit=2,
)

# Define the trainer object with the above training args
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

# Trigger the training
trainer.train()

# Save the model to a local directory
trainer.save_model("./myBERTo")

Training will take time, depending on the config params you set up in the code above.

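If a Colab/Kaggle session dies midway, the checkpoints written every save_steps let you pick up where you left off. In recent transformers versions (an assumption; check the version you installed) resuming looks like this:

# Resume from the latest checkpoint found under output_dir.
trainer.train(resume_from_checkpoint=True)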

5. Test the trained model.

If you wish to test the model, do as below. You can retrain the model if you are not happy with the results.

The Transformers library from HuggingFace provides a wrapper called pipeline, which binds the tokenizer and model together to test the model.

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./myBERTo",
    tokenizer="./myBERTo",
)

fill_mask("i am <mask>.")

# Output will predict the masked word with probabilities.
[{'score': 0.466119220793247223,
  'sequence': '<s> i am happy.</s>',
  'token': 316},
 {'score': 0.2403824366629124,
  'sequence': '<s> i am sorry.</s>',
  'token': 2340},
 ...

6. Publish/Upload Model

Once training is finished, it will generate the model file along with the other required config files and vocabulary files. If you wish to upload to the HuggingFace website, you can follow this step; otherwise you can skip it and start using the model straight away within your applications.

Before uploading, first create an account on the HuggingFace website and log in with those credentials using the commands below; once the login succeeds, upload the model. Don’t forget to create a model card, which is like an info card about your uploaded model on their site.

transformers-cli login
transformers-cli upload ./
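
Note that in newer versions of the Hugging Face tooling the upload flow has moved to the huggingface_hub package; a hedged sketch (the repo id your-username/myBERTo is only a placeholder):

# Requires: pip install huggingface_hub, and a prior `huggingface-cli login`.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/myBERTo", exist_ok=True)              # placeholder repo id
api.upload_folder(folder_path="./myBERTo", repo_id="your-username/myBERTo")  # upload the trained model files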

This is a short walk-through of the blog written by HuggingFace; for detailed learning, visit their website and blog, which are mentioned in the references.

I have trained a small model with 1 million data samples for Kannada, the native language of Karnataka state in India, by following the aforementioned steps. You too can do the same for the languages you care about, which will be a contribution to the advancement of that language.

Keep learning and keep sharing.

Translated from: https://medium.com/analytics-vidhya/byolm-32d728efbf21
