Xiaohei takes a first taste of torchtext

Overview of the torchtext structure

[Figure: torchtext structure overview]
Image source: https://mp.weixin.qq.com/s/1T8peCd8IQT5XmZf68DhwQ

Data format (using MRC, machine reading comprehension, as an example)

{"id": "56dfa01738dc42170015211f", 
"context": "Tesla went on to pursue his ideas of wireless lighting and electricity distribution in his high-voltage, high-frequency power experiments in New York and Colorado Springs, and made early (1893) pronouncements on the possibility of wireless communication with his devices. He tried to put these ideas to practical use in an ill-fated attempt at intercontinental wireless transmission, his unfinished Wardenclyffe Tower project. In his lab he also conducted a range of experiments with mechanical oscillators/generators, electrical discharge tubes, and early X-ray imaging. He also built a wireless controlled boat, one of the first ever exhibited.", 
"question": "What were some of Tesla's experiments?",
 "answer": "high-voltage, high-frequency power", 
 "s_idx": 15,
  "e_idx": 18
  }
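The `s_idx` and `e_idx` values look like inclusive token positions of the answer inside the tokenized context, although the post does not say so explicitly. Below is a minimal sketch to check that reading. It assumes a simple regex tokenizer as a stand-in for `nltk.word_tokenize` (other punctuation may be split differently), and the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: keeps hyphenated words
    # together and splits every other punctuation mark into its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

record = {
    # Only the first part of the context from the sample record is used here.
    "context": "Tesla went on to pursue his ideas of wireless lighting and "
               "electricity distribution in his high-voltage, high-frequency "
               "power experiments in New York and Colorado Springs",
    "answer": "high-voltage, high-frequency power",
    "s_idx": 15,
    "e_idx": 18,
}

tokens = simple_tokenize(record["context"])
# Assumption: e_idx is inclusive, so the slice end is e_idx + 1.
span = tokens[record["s_idx"]: record["e_idx"] + 1]
print(span)
```

If the printed span matches the tokenized answer, the indices can be used directly as start and end labels for the model.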

test_data.jsonl (one JSON record per line)


Code example

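import json
import nltk
# Note: Field, NestedField, RawField and TabularDataset moved to torchtext.legacy
# in torchtext 0.9 and were removed in 0.12, so this import needs an older release.
from torchtext import data
from torchtext.data.example import Example

def word_tokenize(text):
    # Tokenize with nltk and normalize its `` and '' quote tokens to a plain "
    return [token.replace("''", '"').replace("``", '"') for token in nltk.word_tokenize(text)]

# id field: kept as raw data, not a target
RAW = data.RawField()
RAW.is_target = False
# Word field. The original post uses WORD without defining it; this definition is an
# assumption that mirrors the CHAR settings (same tokenizer, lowercased, batch first).
WORD = data.Field(batch_first=True, tokenize=word_tokenize, lower=True)
# Char field: it has to be nested. The outer tokenize splits the text into words,
# then list() splits each word into characters.
CHAR_NESTING = data.Field(batch_first=True, tokenize=list, lower=True)
CHAR = data.NestedField(CHAR_NESTING, tokenize=word_tokenize)
# Label field (start and end positions of the reading-comprehension answer)
LABEL = data.Field(sequential=False, unk_token=None, use_vocab=False)
# Map each key in the JSON record to its (attribute name, field) pair(s)
dict_field = {'id': ('data_id', RAW),
              's_idx': ('data_s_idx', LABEL),
              'e_idx': ('data_e_idx', LABEL),
              'context': [('data_c_word', WORD), ('data_c_char', CHAR)],
              'question': [('data_q_word', WORD), ('data_q_char', CHAR)]}

# Read the JSONL file and convert the first record into an Example
with open('./data/test_data.jsonl') as f:
    test_data = [json.loads(line.strip()) for line in f]
print('Example built from one record:', Example.fromdict(test_data[0], dict_field).__dict__)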
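The nested char field works in two steps: the outer tokenizer splits the text into words, then each word is split into characters. A minimal pure-Python sketch of that idea follows. It does not call torchtext, the helper name `nested_tokenize` is hypothetical, and `str.split` is a simplifying assumption in place of the nltk-based `word_tokenize`.

```python
def nested_tokenize(sentence, word_tokenizer=str.split):
    # Outer step: split the sentence into words (the post uses an nltk-based
    # word_tokenize; str.split is a simplifying assumption here).
    words = word_tokenizer(sentence)
    # Inner step: lowercase each word and split it into characters with list().
    return [list(word.lower()) for word in words]

chars = nested_tokenize("Tesla built a boat")
print(chars)
```

The result is a list of character lists, one per word, which is the shape a char-level embedding layer typically expects before padding.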


Using data.TabularDataset to turn the JSON file into Examples
# Both splits point at the same small file here, purely for demonstration
train, dev = data.TabularDataset.splits(
    path='./data',
    train='test_data.jsonl',
    validation='test_data.jsonl',
    format='json',
    fields=dict_field
)
print('After data.TabularDataset.splits has processed train and dev:')
print('dev:', dev)
print('dev[0]:', dev.examples[0].__dict__)


Building the vocab
CHAR.build_vocab(train,dev)
WORD.build_vocab(train,dev)
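Conceptually, building a vocab means counting tokens across the datasets and assigning each one an integer id. The sketch below illustrates that idea in plain Python. It is not torchtext's implementation: the helper name `build_vocab_sketch`, the special tokens and the tie-breaking order are assumptions made for illustration.

```python
from collections import Counter

def build_vocab_sketch(token_lists, specials=("<unk>", "<pad>")):
    # Count how often each token occurs across all examples.
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens come first, then tokens by descending frequency
    # (ties broken alphabetically; an assumption of this sketch).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab_sketch([["tesla", "built", "a", "boat"], ["a", "boat"]])
# Tokens that were never seen fall back to the <unk> id.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["a", "boat", "tower"]]
print(itos)
print(ids)
```

Passing both `train` and `dev` to `build_vocab`, as the post does, means tokens from both splits receive ids.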

