Reading the SQuAD Dataset Processing Code

Author: endinggy0

Category: Machine Reading Comprehension. Tags: deep learning, nlp

Article link: https://blog.csdn.net/weixin_43856210/article/details/108041967

Source code address

The core function converts a raw example into a form the model can take as input.
Args:
max_seq_length: The maximum sequence length of the inputs.
doc_stride: The stride used when the context is too large and is split across several features.
max_query_length: The maximum length of the query.
is_training: Whether to create features for model training (True) or for model evaluation (False).
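To make the role of doc_stride concrete, here is a minimal sketch of a sliding window over document tokens. It follows the classic BERT-style windowing loop and is an assumption for illustration, not necessarily how the function below computes its spans. The helper name `split_into_spans` and the parameter `max_tokens_for_doc` (the room left for the context after the query and special tokens) are hypothetical.

```python
from typing import List, Tuple


def split_into_spans(
    num_doc_tokens: int, max_tokens_for_doc: int, doc_stride: int
) -> List[Tuple[int, int]]:
    """Return (start, length) windows that together cover all document tokens.

    Hypothetical helper: consecutive windows start doc_stride tokens apart,
    so neighbouring windows overlap when doc_stride < max_tokens_for_doc.
    """
    if doc_stride <= 0 or max_tokens_for_doc <= 0:
        raise ValueError("doc_stride and max_tokens_for_doc must be positive")

    spans = []
    start = 0
    while start < num_doc_tokens:
        # A window is at most max_tokens_for_doc long; the last one may be shorter.
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window reaches the end of the document
        start += min(length, doc_stride)
    return spans


# 10 document tokens, windows of 4 tokens, moved forward 2 tokens at a time.
spans = split_into_spans(num_doc_tokens=10, max_tokens_for_doc=4, doc_stride=2)
print(spans)
```

Each window would become one feature, which is why a single long context can yield several features for the same example.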

def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_query_length, is_training):
    features = []
    if is_training and not example.is_impossible:
        # Get start and end position
        start_position = example.start_position
        end_position = example.end_position

        # If the answer cannot be found in the text, then skip this example.
        actual_text = " ".join(example.doc_tokens[start_position : (end_position + 1)])
        cleaned_answer_text = " ".join(whitespace_tokenize(example.answer_text))
        if actual_text.find(cleaned_answer_text) == -1:
            logger.warning("Could not find answer: '%s' vs. '%s'", actual_text, cleaned_answer_text)
            return []

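    # Build a two-way alignment between whitespace tokens and sub-tokens.
    # `tokenizer` is not a parameter here; it is assumed to be defined at module level.
    tok_to_orig_index = []  # sub-token index -> index of the original token
    orig_to_tok_index = []  # original token index -> index of its first sub-token
    all_doc_tokens = []     # all sub-tokens of the document, in order
    for (i, token) in enumerate(example.doc_tokens):
        orig_to_tok_index.append(len(all_doc_tokens))
        sub_tokens = tokenizer.tokenize(token)
        for sub_token in sub_tokens:
            tok_to_orig_index.append(i)
            all_doc_tokens.append(sub_token)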

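    # Map the answer span from original-token positions to sub-token positions.
    if is_training and not example.is_impossible:
        tok_start_position = orig_to_tok_index[example.start_position]
        if example.end_position < len(example.doc_tokens) - 1:
            # The answer ends just before the first sub-token of the next original token.
            tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
        else:
            # The answer ends at the last original token, so use the last sub-token.
            tok_end_position = len(all_doc_tokens) - 1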

        (tok_start_position, tok_end_posit
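The alignment between whitespace tokens and sub-tokens built in the function above can be tried in isolation. The sketch below is a minimal version under stated assumptions. `toy_subword_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (it simply splits words longer than four characters), and `build_alignment` and `map_answer_span` are illustrative names, not functions from the original source.

```python
from typing import Callable, List, Tuple


def toy_subword_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for tokenizer.tokenize: split words longer than 4 chars."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


def build_alignment(
    doc_tokens: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[List[int], List[int], List[str]]:
    """Same loop as in the listing: record both index mappings while sub-tokenizing."""
    tok_to_orig_index, orig_to_tok_index, all_doc_tokens = [], [], []
    for i, token in enumerate(doc_tokens):
        orig_to_tok_index.append(len(all_doc_tokens))
        for sub_token in tokenize(token):
            tok_to_orig_index.append(i)
            all_doc_tokens.append(sub_token)
    return tok_to_orig_index, orig_to_tok_index, all_doc_tokens


def map_answer_span(
    start_position: int,
    end_position: int,
    num_orig_tokens: int,
    orig_to_tok_index: List[int],
    num_sub_tokens: int,
) -> Tuple[int, int]:
    """Convert an answer span from original-token indices to sub-token indices."""
    tok_start = orig_to_tok_index[start_position]
    if end_position < num_orig_tokens - 1:
        # Last sub-token of the answer sits right before the next word's first sub-token.
        tok_end = orig_to_tok_index[end_position + 1] - 1
    else:
        tok_end = num_sub_tokens - 1
    return tok_start, tok_end


doc_tokens = ["The", "tokenizer", "splits", "words"]
tok_to_orig, orig_to_tok, sub_tokens = build_alignment(doc_tokens, toy_subword_tokenize)
print(sub_tokens)    # sub-tokens in document order
print(orig_to_tok)   # first sub-token index of each original token
print(tok_to_orig)   # original token index of each sub-token

# Answer "tokenizer splits" covers original tokens 1..2.
print(map_answer_span(1, 2, len(doc_tokens), orig_to_tok, len(sub_tokens)))
```

Keeping both mappings lets the answer span be projected onto sub-tokens for training, and predicted sub-token spans be projected back onto the original text.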
