GPT2 for Multiple Languages
Simplified GPT-2 training scripts (based on Grover, supporting TPUs)
Ported BERT tokenizer, multilingual corpus compatible (tokenizer sketch below)
1.5B-parameter GPT-2 pretrained Chinese model (~15G corpus, 100k steps)
Batteries-included Colab demo
1.5B-parameter GPT-2 pretrained Chinese model (~30G corpus, 220k steps)
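The ported tokenizer follows BERT's WordPiece scheme, so any BERT-style vocab file can be used with a multilingual corpus. Below is a minimal sketch of that idea using Hugging Face's `BertTokenizer` as a stand-in for this repo's ported tokenization module; the `bert-base-chinese` vocab is an illustrative assumption, not the 21128-token or 8021-token vocab shipped with the checkpoints.

```python
# Illustration only: Hugging Face's BertTokenizer stands in for the repo's
# ported BERT tokenizer; "bert-base-chinese" is an assumed vocab, not the
# checkpoint's actual vocab file.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "今天天气不错"  # "The weather is nice today"
token_ids = tokenizer.encode(text, add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(token_ids))  # per-character WordPiece tokens
print(token_ids)                                   # integer ids fed to the model
```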
| Pretrained Model | Size | Language | Corpus | Vocab | Link | SHA256 |
| --- | --- | --- | --- | --- | --- | --- |
| 1.5B GPT-2 | 1.5B parameters | Chinese | ~30G | CLUE (8021 tokens) | | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B GPT-2 | 1.5B parameters | Chinese | ~15G | BERT (21128 tokens) | | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |
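After downloading a checkpoint archive, its SHA256 can be verified against the table above. A minimal sketch, where the file name `model.tar.gz` is a placeholder for whatever the downloaded archive is actually called:

```python
# Minimal SHA256 check for a downloaded checkpoint archive.
# "model.tar.gz" is a placeholder file name; the expected digest should be
# copied from the table above for the model you downloaded.
import hashlib

EXPECTED = "e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482"

def sha256sum(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

digest = sha256sum("model.tar.gz")
print("OK" if digest == EXPECTED else f"Mismatch: {digest}")
```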
Trained for 220k steps on a Cloud TPU Pod v3-256.
Google Colab
With just two clicks (not counting the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go.
Train
Disclaimer
The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.
Citation
@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
Reference
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)