NLP Tools: Flair

Flair

This article is a translation of the tutorial for Flair, a highly-starred project on GitHub:
https://github.com/flairNLP/flair






Base types

Flair has two base types of objects: Sentence and Token. A Sentence holds a textual sentence and is essentially a list of Token objects.

from flair.data import Sentence
sentence=Sentence("The grass is green .", use_tokenizer=True)
print(sentence)
# Sentence: "The grass is green ." - 5 Tokens
print(sentence[3])    # Token: 4 green
for token in sentence:
    print(token)

Add a tag to Token

token=sentence[3]
token.add_tag("ner", "color")
tag=token.get_tag("ner")
print(tag.value)
print(tag.score)

Our color tag has a score of 1.0 since we manually added it. If a tag is predicted by our sequence labeler, the score value will indicate classifier confidence.

Add a label to Sentence

sentence = Sentence('France is the current world cup winner.')
sentence.add_labels(['sports', 'world cup'])
for label in sentence.labels:
    print(label)





Tagging with Pre-Trained Sequence Tagging Models

from flair.models import SequenceTagger
# the model is downloaded to the cache dir (~/.flair) on first use; you can also place the file there yourself
tagger = SequenceTagger.load('ner')

# print the predicted tag for each individual token
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
# print(sentence.to_tagged_string())
for token in sentence:
    tag=token.get_tag("ner")
    print(tag)
    
# get span-level entity annotations for the sentence
for entity in sentence.get_spans('ner'):
    print(entity)

# print detailed information as a dictionary
print(sentence.to_dict(tag_type='ner'))

# for a longer text, first split it into sentences, then run NER on each sentence
text = "This is a sentence. This is another sentence. I love Berlin."

# use a library to split into sentences
from segtok.segmenter import split_single

sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict tags for list of sentences
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)    # Using the mini_batch_size parameter of the .predict() method, you can set the size of mini batches passed to the tagger.

for token in sentences[2]:
    tag=token.get_tag("ner")
    print(tag)
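
As noted in the comment above, .predict() also accepts a mini_batch_size parameter; a minimal sketch (32 is just an illustrative value):

# pass the sentences to the tagger in mini-batches of 32
tagger.predict(sentences, mini_batch_size=32)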





Word Embeddings

# Classic Word Embeddings

from flair.embeddings import WordEmbeddings

# init embedding

# two model files need to be downloaded; see https://github.com/zalandoresearch/flair/issues/651
glove_embedding = WordEmbeddings('glove')


# Flair Embeddings
# Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Key differences are: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters. And (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

from flair.embeddings import FlairEmbeddings
flair_embedding_forward = FlairEmbeddings('news-forward')

    
# Stacked Embeddings
# Stacked embeddings are one of the most important concepts of this library. You can use them to combine different embeddings together, for instance if you want to use both traditional embeddings together with contextual string embeddings. Stacked embeddings allow you to mix and match. We find that a combination of embeddings often gives best results.
# For instance, lets combine classic GloVe embeddings with forward and backward Flair embeddings. This is a combination that we generally recommend to most users, especially for sequence labeling.

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
# init standard GloVe embedding
glove_embedding = WordEmbeddings('glove')

# init Flair forward and backwards embeddings
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
                                        glove_embedding,
                                        flair_embedding_forward,
                                        flair_embedding_backward,
                                       ])


# combine BERT and Flair embeddings
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('news-forward')
flair_backward_embedding = FlairEmbeddings('news-backward')

# init BERT embeddings (note: 'bert-base-uncased' is the English model, not multilingual)
bert_embedding = BertEmbeddings('bert-base-uncased')    # from the pytorch_transformers package

from flair.embeddings import StackedEmbeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])


# Document Embeddings
# Our document embeddings are created from the embeddings of all words in the document. 
# Two methods

# 1 Pooling
# The first method calculates a pooling operation over all word embeddings in a document. The default operation is mean which gives us the mean of all words in the sentence. The resulting embedding is taken as document embedding.
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings, Sentence

# initialize the word embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward],pooling='mean')

# if you only use simple word embeddings that are not task-trained you should probably use a 'nonlinear' transformation instead:
# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')

# If on the other hand you use word embeddings that are task-trained (such as simple one hot encoded embeddings), you are often better off doing no transformation at all. Do this by passing 'none':
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')




# create an example sentence
sentence = Sentence('The grass is green . And the sky is blue .')

# embed the sentence with our document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence.
print(sentence.get_embedding())

# 2 RNN
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

glove_embedding = WordEmbeddings('glove')

document_embeddings = DocumentRNNEmbeddings([glove_embedding])


# USING THE EMBEDDINGS
# All embedding classes share the same interface: call .embed() on a Sentence and
# every Token receives its vector. With the stacked embeddings defined above, each
# word is embedded as a concatenation of the underlying embeddings.

sentence = Sentence('The grass is green .')
stacked_embeddings.embed(sentence)
for token in sentence:
    print(token)
    print(token.embedding)





Loading Training Data

The Corpus represents a dataset that you use to train a model. It consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation and testing split during model training.

# load one of the corpora bundled with Flair
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))

# print the first Sentence in the training split
print(corpus.train[0])

# downsample the corpus
import flair.datasets
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)

print("--- 1 Original ---")
print(corpus)

print("--- 2 Downsampled ---")
print(downsampled_corpus)

# --- 1 Original ---
# Corpus: 12543 train + 2002 dev + 2077 test sentences

# --- 2 Downsampled ---
# Corpus: 1255 train + 201 dev + 208 test sentences

# For many learning tasks you need to create a target dictionary. Thus, the Corpus enables you to create your tag or label dictionary, depending on the task you want to learn.

# create tag dictionary for an NER task
corpus = flair.datasets.CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))

stats = corpus.obtain_statistics()
print(stats)





Reading Your Own Sequence Labeling Dataset

In case you want to train over a sequence labeling dataset that is not shipped with Flair, you can load it with the ColumnCorpus object. Most sequence labeling datasets in NLP use some sort of column format in which each line is a word and each column is one level of linguistic annotation. See for instance this sentence:

# 1 The first column is the word itself, the second coarse PoS tags, and the third BIO-annotated NER tags.
# 2 An empty line separates sentences.
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O

To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus.

Note:

POS tags are not needed and in fact will be ignored by Flair if you provide them. The library directly goes from text to the tags you wish to predict and requires no extra features. So if you don’t have POS tags, you only need to change the column_format to reflect this in the ColumnCorpus and everything should be good to go!

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'pos', 2: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')
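
If, as the note above mentions, your data has no PoS column, the column format simply shrinks; a minimal sketch under that assumption:

# only two columns: the word and its NER tag
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus(data_folder, columns,
                      train_file='train.txt',
                      test_file='test.txt',
                      dev_file='dev.txt')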

This gives you a Corpus object that contains the train, dev and test splits, each of which is a list of Sentence objects.

# access a sentence and check out annotations
print(corpus.train[0].to_tagged_string('ner'))
print(corpus.train[1].to_tagged_string('pos'))

Reading a Text Classification Dataset

To use your own text classification dataset, you can either:

  1. load specified text and labels from a simple CSV file
  2. format your data to the FastText format

----------------------------------------------------------------------------------

You can load a CSV format classification dataset using CSVClassificationCorpus by passing in a column format (like in ColumnCorpus above).

Note: You will need to save your split CSV data files in the data_folder path with each file titled appropriately i.e. train.csv test.csv dev.csv. This is because the corpus initializers will automatically search for the train, dev, test splits in a folder.

from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus

# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data'

# column format indicating which columns hold the text and label(s)
column_name_map = {4: "text", 1: "label_topic", 2: "label_subtopic"}

# load corpus containing training, test and dev data and if CSV has a header, you can skip it
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         delimiter='\t',    # tab-separated files
) 

----------------------------------------------------------------------------------

Alternatively, you can format your data in the FastText format, in which each line in the file represents a text document.
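
In the FastText format, one or more labels are put at the beginning of each line with the prefix __label__, followed by the document text, for example:

__label__<label_1> <text>
__label__<label_1> __label__<label_2> <text>

A minimal sketch of loading such files, assuming they are named train.txt, dev.txt and test.txt and that this flair version provides the ClassificationCorpus loader:

from flair.data import Corpus
from flair.datasets import ClassificationCorpus

# folder assumed to contain train.txt, dev.txt and test.txt in FastText format
data_folder = '/path/to/data'
corpus: Corpus = ClassificationCorpus(data_folder,
                                      train_file='train.txt',
                                      dev_file='dev.txt',
                                      test_file='test.txt')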





Training a Model

Training a Sequence Labeling Model

Here is example code for a small NER model trained over NCBI data, using stacked GloVe and Flair embeddings.

We downsample the corpus to 10% for a quick test run.

from flair.data import Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

columns = {0: 'text', 1: 'pos', 2: 'ner'}
data_folder = '/home/fyh/.flair/datasets/ncbi'
corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt').downsample(0.1)
stats = corpus.obtain_statistics()
print(stats)
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')
stacked_embeddings = StackedEmbeddings([
                                        glove_embedding,
                                        flair_embedding_forward,
                                        flair_embedding_backward,
                                       ])
tagger = SequenceTagger(hidden_size=256,
                        embeddings=stacked_embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)
trainer = ModelTrainer(tagger, corpus)
# start training
trainer.train('resources/taggers/example_ncbi-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)
# the model files are written under this path relative to the working directory; training can be stopped early with Ctrl+C
# Alternatively, try using a stacked embedding with FlairEmbeddings and GloVe, over the full data, for 150 epochs. This will give you the state-of-the-art accuracy we report in the paper.

# load the model you trained
model = SequenceTagger.load('resources/taggers/example_ncbi-ner/final-model.pt')

# create example sentence
sentence = Sentence('A common human skin tumour is caused by activating mutations in beta-catenin.')

# predict tags and print
model.predict(sentence)

print(sentence.to_tagged_string())

# Plotting Training Curves and Weights
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/example_ncbi-ner/loss.tsv')
plotter.plot_weights('resources/taggers/example_ncbi-ner/weights.txt')

Training a Text Classification Model / Multi-Dataset Training (Link)
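
The linked tutorial covers these topics in detail; below is only a minimal, hedged sketch of training a text classifier, assuming the flair 0.4.x-era API used throughout this post and using the bundled TREC_6 corpus purely as an example:

from flair.datasets import TREC_6
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# a corpus with class labels; TREC_6 is just one of the bundled datasets
corpus = TREC_6()
label_dict = corpus.make_label_dictionary()

# document-level embeddings built on top of word-level embeddings
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward')]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)

# the classifier maps a document embedding to a label
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/taggers/example-classification',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=10)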





Resuming Training

If you want to stop the training at some point and resume it later, you should train with the parameter checkpoint set to True. This will save the model plus training parameters after every epoch. Thus, you can load the model plus trainer at any later point and continue training exactly where you left off.

The example code below shows how to train, stop, and continue training of a SequenceTagger.

# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

# 8. stop training at any point

# 9. continue trainer at later point
from pathlib import Path

checkpoint = tagger.load_checkpoint(Path('resources/taggers/example-ner/checkpoint.pt'))
trainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

Besides checkpointing, a parameter worth knowing about is embeddings_storage_mode in the train() method of the ModelTrainer, which controls where computed embeddings are kept. It can take one of three values:

'none': If you set embeddings_storage_mode='none', embeddings do not get stored in memory. Instead they are generated on-the-fly in each training mini-batch. The main advantage is that this keeps your memory requirements low.

'cpu': If you set embeddings_storage_mode='cpu', embeddings will get stored in regular memory.

During inference: this slows down your inference when used with a GPU, as embeddings need to be moved from GPU memory to regular memory. The only reason to use this option during inference is if you want to use not only the predictions but also the embeddings after prediction.

'gpu': If you set embeddings_storage_mode='gpu', embeddings will get stored in CUDA memory. This is often the fastest option since it eliminates the need to shuffle tensors from CPU to CUDA over and over again. Of course, CUDA memory is often limited, so large datasets will not fit into it. However, if the dataset does fit into CUDA memory, this option is the fastest one.
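
A minimal sketch of passing this parameter to train(), reusing the NER training setup from above:

# keep embeddings in CUDA memory if the dataset fits; otherwise use 'cpu' or 'none'
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              embeddings_storage_mode='gpu')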





Model Tuning

# you need to define the search space of parameters.
from hyperopt import hp
from flair.hyperparameter.param_selection import SearchSpace, Parameter

# define your search space
search_space = SearchSpace()
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=[
    [ WordEmbeddings('en') ], 
    [ FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward') ]
])
search_space.add(Parameter.HIDDEN_SIZE, hp.choice, options=[32, 64, 128])
search_space.add(Parameter.RNN_LAYERS, hp.choice, options=[1, 2])
search_space.add(Parameter.DROPOUT, hp.uniform, low=0.0, high=0.5)
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1, 0.15, 0.2])
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[8, 16, 32])

Attention: You should always add your embeddings to the search space (as shown above). If you don't want to test different kinds of embeddings, simply pass just one embedding option to the search space, which will then be used in every test run. Here is an example:

search_space.add(Parameter.EMBEDDINGS, hp.choice, options=[
    [ FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward') ]
])

In the last step you have to create the actual parameter selector. Depending on the task, you need to define either a TextClassifierParamSelector or a SequenceTaggerParamSelector and start the optimization. You can:

  1. define the maximum number of evaluation runs hyperopt should perform (max_evals);
  2. set the number of epochs each evaluation run performs (max_epochs);
  3. specify the number of training runs per evaluation run (training_runs).

If you specify more than one training run, each evaluation run will be executed that many times, and the final evaluation score will be the average over all those runs.

from flair.hyperparameter.param_selection import TextClassifierParamSelector, OptimizationValue

# create the parameter selector
param_selector = TextClassifierParamSelector(
    corpus,
    False,                   # multi_label
    'resources/results',     # base path where results are written
    'lstm',                  # document embedding type
    max_epochs=50,
    training_runs=3,
    optimization_value=OptimizationValue.DEV_SCORE
)

# start the optimization
param_selector.optimize(search_space, max_evals=100)

The parameter settings and the evaluation scores will be written to param_selection.txt in the result directory. While selecting the best parameter combination, we do not store any model to disk.

Finding the Best Learning Rate (Link)
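
The linked tutorial uses the trainer's learning rate finder; a hedged sketch, assuming this flair version exposes ModelTrainer.find_learning_rate() and Plotter.plot_learning_rate():

from flair.trainers import ModelTrainer
from flair.visual.training_curves import Plotter

trainer = ModelTrainer(tagger, corpus)

# sweep learning rates and record the loss for each value
learning_rate_tsv = trainer.find_learning_rate('resources/taggers/example-ner',
                                               'learning_rate.tsv')

# plot the curve and pick a rate just before the loss starts to explode
plotter = Plotter()
plotter.plot_learning_rate(learning_rate_tsv)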

Custom Optimizers

You can now use any of PyTorch's optimizers for training by passing one when initializing a ModelTrainer. To give the optimizer extra options, simply pass them to train(), as shown with the weight_decay example:

from torch.optim.adam import Adam

trainer = ModelTrainer(tagger, corpus,
                       optimizer=Adam)
                                     
trainer.train(
    "resources/taggers/example",
    weight_decay=1e-4
)





Training your own Flair Embeddings

Flair Embeddings are the secret sauce in Flair, allowing us to achieve state-of-the-art accuracies across a range of NLP tasks. This tutorial shows you how to train your own Flair embeddings, which may come in handy if you want to apply Flair to new languages or domains.

Preparing a Text Corpus

To train your own model, you first need to identify a suitably large corpus. In our experiments, we used corpora that have about 1 billion words.

You need to split your corpus into train, validation and test portions. Our trainer class assumes that there is a folder for the corpus in which there is a test.txt and a valid.txt with test and validation data. Importantly, there is also a folder called train that contains the training data in splits. For instance, the billion word corpus is split into 100 parts. The splits are necessary if all the data does not fit into memory, in which case the trainer randomly iterates through all splits.

corpus/
corpus/train/
corpus/train/train_split_1
corpus/train/train_split_2
corpus/train/...
corpus/train/train_split_X
corpus/test.txt
corpus/valid.txt

Training the Language Model

Once you have this folder structure, simply point the LanguageModelTrainer class to it to start learning a model:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')

# get your corpus, process forward and at the character level
corpus = TextCorpus('/path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)

The parameters in this script are very small. We got good results with a hidden size of 1024 or 2048, a sequence length of 250, and a mini-batch size of 100.
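
A minimal sketch of the larger configuration mentioned above (hidden size 2048, sequence length 250, mini-batch size 100), keeping everything else from the script:

# larger settings reported to give good results
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=10)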

Using the LM as Embeddings

Just load the model into the FlairEmbeddings class and use as you would any other embedding in Flair:

sentence = Sentence('I love Berlin')

# init embeddings from your trained LM
char_lm_embeddings = FlairEmbeddings('resources/taggers/language_model/best-lm.pt')

# embed sentence
char_lm_embeddings.embed(sentence)

Fine-Tuning an Existing LM

Sometimes it makes sense to fine-tune an existing language model instead of training one from scratch, for instance if you have a general LM for English and would like to fine-tune it for a specific domain.

To fine tune a LanguageModel, you only need to load an existing LanguageModel instead of instantiating a new one.

from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus


# instantiate an existing LM, such as one from the FlairEmbeddings
language_model = FlairEmbeddings('news-forward').lm

# are you fine-tuning a forward or backward LM?
is_forward_lm = language_model.is_forward_lm

# get the dictionary from the existing language model
dictionary: Dictionary = language_model.dictionary

# get your corpus, process forward and at the character level
corpus = TextCorpus('path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# use the model trainer to fine-tune this model on your corpus
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10,
              checkpoint=True)

Note that when you fine-tune, you must use the same character dictionary as before, and you must copy the direction (forward or backward) from the existing model.

Fine-Tuning the Language Model on a Specific Domain

from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load a previously saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')

# make sure to use the same dictionary and direction as the saved model
dictionary = model.dictionary
is_forward_lm = model.is_forward_lm

# load your new, domain-specific corpus at the character level
corpus = TextCorpus('path/to/your/corpus', dictionary, is_forward_lm, character_level=True)

# pass corpus and pre-trained language model to the trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters
trainer.train('resources/taggers/language_model', learning_rate=5)