1. Preparing the data and libraries
#!pip -q install trax==1.3.1
import trax
from trax import layers as tl
import os
import numpy as np
import pandas as pd
from utils import get_params, get_vocab
import random as rnd
# set random seeds to make this notebook easier to replicate
trax.supervised.trainer_lib.init_random_number_generators(33)
Data representation and tagging (the B- prefix marks a token that begins an entity; the I- prefix marks a token inside an entity)
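To make the B-/I- scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The sentence, the tag names (`per`, `geo`), the `O` tag for non-entity tokens, and the helper `extract_entities` are all illustrative assumptions and are not taken from the dataset used here.

```python
# Hypothetical sentence and tags, for illustration only.
tokens = ["Barack", "Obama", "visited", "New", "York", "City", "yesterday"]
tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "I-geo", "O"]


def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans using the B-/I- prefixes."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # Any other tag (assumed here to be "O") closes the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


entities = extract_entities(tokens, tags)
print(entities)
```

The B- prefix is what lets two adjacent entities of the same type be told apart: without it, consecutive I- tags would merge into a single span.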
The data sizes are as follows:
Data generator:
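The generator returns batches of shape (batch_size, max_len), so shorter sentences must be padded. The padding step on its own can be sketched as follows. The sketch assumes that max_len is the length of the longest sentence in the batch; the helper name `pad_batch` and the sample values are hypothetical.

```python
import numpy as np


def pad_batch(sentences, pad):
    """Right-pad integer-encoded sentences to the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad] * (max_len - len(s)) for s in sentences]
    return np.array(padded)


# Three hypothetical sentences of different lengths, padded with 0.
batch = pad_batch([[5, 8, 2], [7], [4, 9]], pad=0)
print(batch.shape)
```

The tag sequences need the same padding so that X and Y stay aligned position by position.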
def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    '''
    Input:
        batch_size - integer describing the batch size
        x - list containing sentences where words are represented as integers
        y - list containing tags associated with the sentences
        shuffle - Shuffle the data order
        pad - an integer representing a pad character
        verbose - Print information during runtime
    Output:
        a tuple containing 2 elements:
        X - np.ndarray of dim (batch_size, max_len) of padded sentences
        Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    '''
    # count the number of sentences in x
    num_lines = len(x)
    # create a list with the indexes of the sentences so that it can be shuffled
    lines_index = [*range(num_lines)]
# sh