NBME Competition Summary

唐僧爱吃唐僧肉

Last modified: 2022-05-17 20:54:58

Category: Kaggle competition notes. Tags: machine learning, deep learning, natural language processing

First published: 2022-05-04 09:36:49

Article link: https://blog.csdn.net/znevegiveup1/article/details/123403668

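I recently entered the NBME competition on Kaggle.
The competition is still running and I have already hit a few pitfalls, so here is a short mid-competition summary.
First, the task: we are given a patient's free-text note together with its identifiers pn_num and case_num.
The contents of patient_notes.csv look like this: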

pn_num	case_num	pn_history
00000	0	"17-year-old male, has come to the student health clinic complaining of heart pounding. Mr. Cleveland’s mother has given verbal consent for a history, physical examination, …
00001	0	“17 yo male…from his roommate…”
The contents of features.csv look like this:
feature_num	case_num	feature_text
–	–	–
000	0	Family-history-of-MI-OR-Family-history-of-myocardial-infarction
001	0	Family-history-of-thyroid-disorder
We can read this as follows: pn_num and case_num together identify a unique pn_history (the patient's own account), and feature_num and case_num together identify a unique feature_text. Next, look at test.csv:
id	case_num	pn_num
–	–	–
00016_000	0	00016
00016_001	0	00016
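As the table shows, case_num and pn_num point to a unique pn_history, and case_num together with feature_num (apparently the suffix of id, e.g. 000 in 00016_000) points to a unique feature_text, so every row of the table resolves to exactly one pn_history and one feature_text.
Since adding more features generally improves training, I train on the combination pn_history + feature_text.
The contents of train.csv look like this: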
id	case_num	pn_num
–	–	–
00016_000	0	00016
00016_001	0	00016
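Compared with the table above, train.csv additionally has the annotation and location columns (not shown in the excerpt). annotation is the annotated text itself, for which I saw no direct use, while location gives the character positions used to extract the key information.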
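The lookup described above can be sketched without any libraries. The rows below are shortened stand-ins for the real CSV contents, the helper name build_pair is hypothetical, and the id format `<pn_num>_<feature_num>` is an assumption read off the sample rows; in practice the same join is usually done with a dataframe merge.

```python
# Toy stand-ins for patient_notes.csv and features.csv (texts are shortened).
patient_notes = {
    # (case_num, pn_num) -> pn_history
    (0, "00016"): "example patient note text",
}
features = {
    # (case_num, feature_num) -> feature_text
    (0, "000"): "Family-history-of-MI-OR-Family-history-of-myocardial-infarction",
    (0, "001"): "Family-history-of-thyroid-disorder",
}


def build_pair(row_id, case_num):
    """Resolve one test.csv row to (pn_history, feature_text).

    Assumes the id has the form '<pn_num>_<feature_num>', as in 00016_000.
    """
    pn_num, feature_num = row_id.split("_")
    return patient_notes[(case_num, pn_num)], features[(case_num, feature_num)]


pairs = [build_pair(row_id, 0) for row_id in ["00016_000", "00016_001"]]
for note, feature in pairs:
    print(feature, "|", note)
```

Both rows share the same note but get different feature texts, which is why each training example is a (note, feature) pair.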

1. Label mismatch

This mistake happened when I first started writing the code:

for index,data in tqdm(valid_data.iterrows(),total=len(valid_data)):
    text = data['pn_history']
    feature_text = data['feature_text']
    inputs,length = prepare_input(text,feature_text)
    #valid_text.append(text+feature_text)
    valid_text.append(text)
    valid_input_ids.append(inputs['input_ids'].tolist())
    valid_token_type_ids.append(inputs['token_type_ids'].tolist())
    valid_attention_mask.append(inputs['attention_mask'].tolist())
    annotation_length = data['annotation_length']
    current_offset,current_label = create_label(text,annotation_length=data['annotation_length'],\
                                 location_list=data['location'])
    r"""
    current_label = 
[-1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0........]
    data['location'] = ['696 724']
    """
    #true_label = change_location_to_offset(text,data['location'])
    # This is where the bug was: the wrong value was stored as the true label
    valid_offset.append(current_offset)
    valid_label.append(data['location'])
    valid_length.append(length)

In this call:

current_offset,current_label = create_label(text,annotation_length=data['annotation_length'],\
                                 location_list=data['location'])

I had taken the wrong value: valid_label was being appended with current_label rather than data['location'], which caused errors during training.

2. A bug in the loss

The loss was written incorrectly during training:

def compute_multilabel_loss(model,batch_token_ids,\
                            batch_token_type_ids,\
                            batch_attention_mask,\
                            batch_label):
    logit = model(input_ids=batch_token_ids,\
                 attention_mask=batch_attention_mask,\
                 token_type_ids=batch_token_type_ids)
    logit = logit.view(-1,1)
    batch_label = batch_label.view(-1,1)
    loss_fn = nn.BCEWithLogitsLoss(reduction="none")
    loss = loss_fn(logit,batch_label)
    loss = torch.masked_select(loss,batch_label!=-1)
    loss = loss.mean()
    # mask the loss here; do not pass logit by mistake
    return loss

The problem was in the masked_select call. The line

loss = torch.masked_select(loss,batch_label!=-1)

had been written as

loss = torch.masked_select(logit,batch_label!=-1)

This is a very hard mistake to spot: writing logit instead of loss here directly corrupts the final result.
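To see why the two versions differ, here is a dependency-free sketch of the same arithmetic. It is not the training code: the logits and labels are made up, and bce_with_logits is a hand-written stand-in for what nn.BCEWithLogitsLoss(reduction="none") computes per element.

```python
import math


def bce_with_logits(logit, label):
    """Element-wise binary cross-entropy on a raw logit (numerically stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logits = [2.0, -1.0, 0.5, 3.0]
labels = [1.0, 0.0, 1.0, -1.0]   # -1 marks a position to ignore (special token / padding)

# Correct: compute the per-position loss, then keep positions whose label is not -1.
losses = [bce_with_logits(z, y) for z, y in zip(logits, labels)]
kept_losses = [loss_i for loss_i, y in zip(losses, labels) if y != -1]
masked_loss = sum(kept_losses) / len(kept_losses)

# Buggy variant: selecting from the logits instead of the losses.
kept_logits = [z for z, y in zip(logits, labels) if y != -1]
buggy_value = sum(kept_logits) / len(kept_logits)

print(masked_loss, buggy_value)
```

The buggy value is just the mean logit, so minimizing it pushes every logit down regardless of the labels, while still looking like a perfectly ordinary scalar loss.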

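3. Training on pn_history plus feature_text

This simply adds more feature information. As mentioned above, combining pn_history with feature_text gives better results, and each input corresponds to exactly one feature_text, so here text = pn_history + feature_text.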

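4. Calling the deberta-v3 tokenizer

The deberta-v3 model ships with an spm.model file, from which the vocabulary can be read.
The default vocab_file kept raising an error; after I pointed it at the local file as below, it worked: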

self.original_tokenizer.vocab_file = '/home/xiaoguzai/模型/deberta-v3-large/spm.model'
with open(self.original_tokenizer.vocab_file, "rb") as f:
    m.ParseFromString(f.read())

So the contents of spm.model can be read; what comes out is a vocabulary that plays the same role as the dict stored in the vocab.json file of earlier models.

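5. Using the BCEWithLogitsLoss loss function

This task uses BCEWithLogitsLoss as the loss for binary classification. Binary classification can also be trained with MSELoss or CrossEntropyLoss; I had not used BCEWithLogitsLoss before, so this was a chance to learn it.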

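6. Identical output probabilities late in training, possibly caused by a small batch_size

A distinctive trait of this task is that label 0 is far more common than label 1 (as in many earlier class-imbalanced tasks). The batch_size therefore must not be too small: with a very small batch almost every target is 0, and the model can end up predicting 0 everywhere.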

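7. Using roberta again

Notes on using roberta
The big catch with roberta is that its offsets are computed differently from deberta's: spaces are not included in the offsets, so some post-processing is needed. The key post-processing code is: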

def post_process_spaces(target, text):
    target = np.copy(target)

    if len(text) > len(target):
        padding = np.zeros(len(text) - len(target))
        target = np.concatenate([target, padding])
    else:
        target = target[:len(text)]

    if text[0] == " ":
        target[0] = 0
    if text[-1] == " ":
        target[-1] = 0

    for i in range(1, len(text) - 1):
        if text[i] == " ":
            if target[i] and not target[i - 1]:  # space before
                target[i] = 0

            if target[i] and not target[i + 1]:  # space after
                target[i] = 0

            if target[i - 1] and target[i + 1]:
                target[i] = 1

    return target

To explain the tokenization itself, here is a simple example:

inputs = tokenizer.encode_plus("Hello,I am your father!",\
                        add_special_tokens=True,\
                        max_length = 12,\
                        padding = "max_length",\
                        return_offsets_mapping = True)

which gives

{'input_ids': [0, 31414, 6, 100, 524, 110, 1150, 328, 2, 1, 1, 1], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 
'offset_mapping': [(0, 0), (0, 5), (5, 6), (6, 7), (8, 10), (11, 15), (16, 22), (22, 23), (0, 0), (0, 0), (0, 0), (0, 0)]}

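For example, (6, 7) corresponds to the word I and (8, 10) to the word am. The space between I and am (character 7) is clearly not covered by any offset_mapping entry, but because the final score is computed per character, that space should also be marked 1 when it lies inside a predicted span.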
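A small sketch makes the gap visible. It reuses the offset_mapping printed above; the token-level predictions are hypothetical values chosen so that "I" and "am" are both positive.

```python
text = "Hello,I am your father!"
# offset_mapping copied from the output above (special tokens and padding are (0, 0))
offset_mapping = [(0, 0), (0, 5), (5, 6), (6, 7), (8, 10), (11, 15),
                  (16, 22), (22, 23), (0, 0), (0, 0), (0, 0), (0, 0)]
# Hypothetical token-level predictions: the model marks "I" and "am" as positive.
token_probs = [0.0, 0.0, 0.0, 0.9, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Spread each token's probability over the characters it covers.
char_probs = [0.0] * len(text)
for (start, end), prob in zip(offset_mapping, token_probs):
    for i in range(start, end):      # (0, 0) entries contribute nothing
        char_probs[i] = prob

# Spaces that no token covered keep probability 0.
uncovered = [i for i, ch in enumerate(text) if ch == " " and char_probs[i] == 0.0]
print(char_probs[6:10])
print(uncovered)
```

Character 7 stays at 0 even though its neighbours are positive, which is exactly the case the last branch of post_process_spaces fills in.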

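8. Controlling the length

The idea is borrowed from the earlier Evaluate Student Writing competition: first, constrain the span length; second, constrain the averaged weight (score) of a span.
However, constraining the length does not really work here, because some of the spans are short.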

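9. Fold splitting

Because the rows are derived from patient notes, all data from a single patient note has to be in the same fold.
You can use either of the two options below: split patient_notes.csv itself with StratifiedKFold, or split train.csv with StratifiedGroupKFold grouped by pn_num: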

patient_notes = pd.read_csv(DATA_PATH + "patient_notes.csv")
skf = StratifiedKFold(n_splits=K, random_state=SEED, shuffle=True)
splits = list(skf.split(X=patient_notes, y=patient_notes['case_num']))

df_train = pd.read_csv(DATA_PATH + "train.csv")
sgkf = StratifiedGroupKFold(n_splits=K, random_state=SEED, shuffle=True)
splits = list(sgkf.split(X=df_train , y=df_train['case_num'], groups=df_train['pn_num']))
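Whichever splitter is used, it is cheap to verify the result afterwards. The check below is a pure-Python sanity check, not a replacement for the sklearn splitters; the function names and the toy fold assignments are made up for illustration.

```python
from collections import defaultdict


def folds_per_group(groups, folds):
    """Map each group id (here: pn_num) to the set of folds it appears in."""
    seen = defaultdict(set)
    for group, fold in zip(groups, folds):
        seen[group].add(fold)
    return seen


def leaking_groups(groups, folds):
    """Return the group ids that were spread over more than one fold."""
    return sorted(g for g, f in folds_per_group(groups, folds).items() if len(f) > 1)


# Toy data: three rows per patient note, fold ids are invented.
pn_num = [16, 16, 16, 41, 41, 41]
row_wise = [0, 1, 0, 1, 0, 1]      # rows split independently: notes leak across folds
group_wise = [0, 0, 0, 1, 1, 1]    # every note stays in one fold

leaks_row_wise = leaking_groups(pn_num, row_wise)
leaks_group_wise = leaking_groups(pn_num, group_wise)
print(leaks_row_wise, leaks_group_wise)
```

An empty list means no patient note is shared between folds; anything else means the validation score is likely inflated.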

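10. Calling a medical BERT

11. Labelling details matter

My training score lagged behind others, so I checked whether something was off. I found that during labelling, when end = 284,
the label should stop at (278, 284) but kept running on to the next offset (284, 289). I changed the labelling code used for training: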

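for idx in range(len(offset_mapping)):
    if (start_idx == -1) & (start < offset_mapping[idx][0]):
        start_idx = idx - 1
        # start lies before this token's start, so point to the previous token
        # (when start overshoots on the small side, select more rather than less)
    if (end_idx == -1) & (end == offset_mapping[idx][1]):
        # !!! exact match on the token end: label up to this token
        end_idx = idx
    if (end_idx == -1) & (end < offset_mapping[idx][1]):
        # !!! when end overshoots on the large side, select less rather than more
        end_idx = idx + 1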

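After this change the result was actually worse. The reason is that there are few 1 labels to begin with; to raise the score, the labelled range should be widened instead (letting end extend one offset further is a neat trick, and surprisingly it did improve the result):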

for idx in range(len(offset_mapping)):
    if (start_idx == -1) & (start < offset_mapping[idx][0]):
        # start_idx can still move one token earlier
        start_idx = idx - 1
    if (end_idx == -1) & (end <= offset_mapping[idx][1]):
        # end_idx can still move one token later
        end_idx = idx + 1
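For comparison, here is an overlap-based way of turning a location string into token labels. This is a different formulation from the index loop above, not the code I trained with; the function names and the offsets are hypothetical, chosen around the (278, 284) boundary discussed earlier.

```python
def parse_location(location):
    """Turn a location string such as '285 292;301 312' into (start, end) pairs."""
    return [tuple(int(x) for x in part.split()) for part in location.split(";")]


def token_labels(offset_mapping, spans):
    """Label a token 1 if its character range overlaps any annotated span.

    Tokens with an empty range such as (0, 0) (special tokens, padding) get -1
    so that they can be masked out of the loss.
    """
    labels = []
    for tok_start, tok_end in offset_mapping:
        if tok_start == tok_end:
            labels.append(-1)
            continue
        hit = any(tok_start < end and start < tok_end for start, end in spans)
        labels.append(1 if hit else 0)
    return labels


# Hypothetical offsets around the boundary discussed above.
offsets = [(0, 0), (270, 277), (278, 284), (284, 289), (290, 295), (0, 0)]
labels = token_labels(offsets, parse_location("278 284"))
print(labels)
```

With a strict overlap test the token (284, 289) is not labelled, so this variant gives the narrower labelling; widening it by one token, as in the loop above, is a separate choice.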

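12. Characters shifted left by one (the right-boundary issue is still open)

When the predictions are turned into spans, the characters come out shifted one position to the left: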

for char_prob in char_probs:
    result = np.where(char_prob >= th)[0] + 1
    result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
    result = [f"{min(r)} {max(r)}" for r in result]
    result = ";".join(result)
    results.append(result)

The reason is simple: roberta tokens begin with the leading space, so the result has to be shifted one position to the right.
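A worked example of the grouping step, written without numpy so it runs anywhere. The function name spans_from_probs and the probabilities are made up; the example assumes the predicted characters include the space in front of the word, which is the situation described above.

```python
import itertools


def spans_from_probs(char_prob, th=0.5, shift=1):
    """Group consecutive above-threshold character indices into 'start end' strings."""
    idx = [i + shift for i, p in enumerate(char_prob) if p >= th]
    # Consecutive integers share the same value of (n - running counter).
    groups = [list(g) for _, g in
              itertools.groupby(idx, key=lambda n, c=itertools.count(): n - next(c))]
    return ";".join(f"{min(g)} {max(g)}" for g in groups)


text = "has pain"
# Hypothetical prediction: the token " pain" (with its leading space) is positive.
char_prob = [0.0, 0.0, 0.0, 0.9, 0.9, 0.9, 0.9, 0.9]

shifted = spans_from_probs(char_prob)             # with the +1
unshifted = spans_from_probs(char_prob, shift=0)  # without it
print(shifted, repr(text[4:8]))
print(unshifted, repr(text[3:7]))
```

With the shift, the start skips the leading space and max(g) lands one past the last character, which matches the exclusive end used in the location strings.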

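13. Choosing the fold split

The task description says the patient notes in the training and test sets come from the same source. When splitting into folds, all rows of one patient note must stay together, so the split should be made by patient note.
After merging all the train data together, split it: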

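from sklearn.model_selection import StratifiedGroupKFold

# StratifiedKFold.split ignores the groups argument, so use StratifiedGroupKFold
# to keep every pn_num inside a single fold (assumes train has a default RangeIndex)
sgkf = StratifiedGroupKFold(n_splits=K, random_state=42, shuffle=True)
for n, (train_index, val_index) in enumerate(sgkf.split(X=train, y=train['case_num'],
                                                        groups=train['pn_num'])):
    train.loc[val_index, 'fold'] = int(n)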

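Next directions to try: 1. experiment with a medical BERT; 2. tune maxlen (470 for deberta, 320 or 310 for roberta); 3. think about whether the annotation column can be used (it cannot: annotation is just the extracted text).

14. Finding dirty data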

import ast
# incorrect annotation
train.loc[338, 'annotation'] = ast.literal_eval('[["father heart attack"]]')
train.loc[338, 'location'] = ast.literal_eval('[["764 783"]]')

train.loc[621, 'annotation'] = ast.literal_eval('[["for the last 2-3 months"]]')
train.loc[621, 'location'] = ast.literal_eval('[["77 100"]]')

train.loc[655, 'annotation'] = ast.literal_eval('[["no heat intolerance"], ["no cold intolerance"]]')
train.loc[655, 'location'] = ast.literal_eval('[["285 292;301 312"], ["285 287;296 312"]]')

train.loc[1262, 'annotation'] = ast.literal_eval('[["mother thyroid problem"]]')
train.loc[1262, 'location'] = ast.literal_eval('[["551 557;565 580"]]')

train.loc[1265, 'annotation'] = ast.literal_eval('[[\'felt like he was going to "pass out"\']]')
train.loc[1265, 'location'] = ast.literal_eval('[["131 135;181 212"]]')

train.loc[1396, 'annotation'] = ast.literal_eval('[["stool , with no blood"]]')
train.loc[1396, 'location'] = ast.literal_eval('[["259 280"]]')

train.loc[1591, 'annotation'] = ast.literal_eval('[["diarrhoe non blooody"]]')
train.loc[1591, 'location'] = ast.literal_eval('[["176 184;201 212"]]')

train.loc[1615, 'annotation'] = ast.literal_eval('[["diarrhea for last 2-3 days"]]')
train.loc[1615, 'location'] = ast.literal_eval('[["249 257;271 288"]]')

train.loc[1664, 'annotation'] = ast.literal_eval('[["no vaginal discharge"]]')
train.loc[1664, 'location'] = ast.literal_eval('[["822 824;907 924"]]')

train.loc[1714, 'annotation'] = ast.literal_eval('[["started about 8-10 hours ago"]]')
train.loc[1714, 'location'] = ast.literal_eval('[["101 129"]]')

train.loc[1929, 'annotation'] = ast.literal_eval('[["no blood in the stool"]]')
train.loc[1929, 'location'] = ast.literal_eval('[["531 539;549 561"]]')

train.loc[2134, 'annotation'] = ast.literal_eval('[["last sexually active 9 months ago"]]')
train.loc[2134, 'location'] = ast.literal_eval('[["540 560;581 593"]]')

train.loc[2191, 'annotation'] = ast.literal_eval('[["right lower quadrant pain"]]')
train.loc[2191, 'location'] = ast.literal_eval('[["32 57"]]')

train.loc[2553, 'annotation'] = ast.literal_eval('[["diarrhoea no blood"]]')
train.loc[2553, 'location'] = ast.literal_eval('[["308 317;376 384"]]')

train.loc[3124, 'annotation'] = ast.literal_eval('[["sweating"]]')
train.loc[3124, 'location'] = ast.literal_eval('[["549 557"]]')

train.loc[3858, 'annotation'] = ast.literal_eval('[["previously as regular"], ["previously eveyr 28-29 days"], ["previously lasting 5 days"], ["previously regular flow"]]')
train.loc[3858, 'location'] = ast.literal_eval('[["102 123"], ["102 112;125 141"], ["102 112;143 157"], ["102 112;159 171"]]')

train.loc[4373, 'annotation'] = ast.literal_eval('[["for 2 months"]]')
train.loc[4373, 'location'] = ast.literal_eval('[["33 45"]]')

train.loc[4763, 'annotation'] = ast.literal_eval('[["35 year old"]]')
train.loc[4763, 'location'] = ast.literal_eval('[["5 16"]]')

train.loc[4782, 'annotation'] = ast.literal_eval('[["darker brown stools"]]')
train.loc[4782, 'location'] = ast.literal_eval('[["175 194"]]')

train.loc[4908, 'annotation'] = ast.literal_eval('[["uncle with peptic ulcer"]]')
train.loc[4908, 'location'] = ast.literal_eval('[["700 723"]]')

train.loc[6016, 'annotation'] = ast.literal_eval('[["difficulty falling asleep"]]')
train.loc[6016, 'location'] = ast.literal_eval('[["225 250"]]')

train.loc[6192, 'annotation'] = ast.literal_eval('[["helps to take care of aging mother and in-laws"]]')
train.loc[6192, 'location'] = ast.literal_eval('[["197 218;236 260"]]')

train.loc[6380, 'annotation'] = ast.literal_eval('[["No hair changes"], ["No skin changes"], ["No GI changes"], ["No palpitations"], ["No excessive sweating"]]')
train.loc[6380, 'location'] = ast.literal_eval('[["480 482;507 519"], ["480 482;499 503;512 519"], ["480 482;521 531"], ["480 482;533 545"], ["480 482;564 582"]]')

train.loc[6562, 'annotation'] = ast.literal_eval('[["stressed due to taking care of her mother"], ["stressed due to taking care of husbands parents"]]')
train.loc[6562, 'location'] = ast.literal_eval('[["290 320;327 337"], ["290 320;342 358"]]')

train.loc[6862, 'annotation'] = ast.literal_eval('[["stressor taking care of many sick family members"]]')
train.loc[6862, 'location'] = ast.literal_eval('[["288 296;324 363"]]')

train.loc[7022, 'annotation'] = ast.literal_eval('[["heart started racing and felt numbness for the 1st time in her finger tips"]]')
train.loc[7022, 'location'] = ast.literal_eval('[["108 182"]]')

train.loc[7422, 'annotation'] = ast.literal_eval('[["first started 5 yrs"]]')
train.loc[7422, 'location'] = ast.literal_eval('[["102 121"]]')

train.loc[8876, 'annotation'] = ast.literal_eval('[["No shortness of breath"]]')
train.loc[8876, 'location'] = ast.literal_eval('[["481 483;533 552"]]')

train.loc[9027, 'annotation'] = ast.literal_eval('[["recent URI"], ["nasal stuffines, rhinorrhea, for 3-4 days"]]')
train.loc[9027, 'location'] = ast.literal_eval('[["92 102"], ["123 164"]]')

train.loc[9938, 'annotation'] = ast.literal_eval('[["irregularity with her cycles"], ["heavier bleeding"], ["changes her pad every couple hours"]]')
train.loc[9938, 'location'] = ast.literal_eval('[["89 117"], ["122 138"], ["368 402"]]')

train.loc[9973, 'annotation'] = ast.literal_eval('[["gaining 10-15 lbs"]]')
train.loc[9973, 'location'] = ast.literal_eval('[["344 361"]]')

train.loc[10513, 'annotation'] = ast.literal_eval('[["weight gain"], ["gain of 10-16lbs"]]')
train.loc[10513, 'location'] = ast.literal_eval('[["600 611"], ["607 623"]]')

train.loc[11551, 'annotation'] = ast.literal_eval('[["seeing her son knows are not real"]]')
train.loc[11551, 'location'] = ast.literal_eval('[["386 400;443 461"]]')

train.loc[11677, 'annotation'] = ast.literal_eval('[["saw him once in the kitchen after he died"]]')
train.loc[11677, 'location'] = ast.literal_eval('[["160 201"]]')

train.loc[12124, 'annotation'] = ast.literal_eval('[["tried Ambien but it didnt work"]]')
train.loc[12124, 'location'] = ast.literal_eval('[["325 337;349 366"]]')

train.loc[12279, 'annotation'] = ast.literal_eval('[["heard what she described as a party later than evening these things did not actually happen"]]')
train.loc[12279, 'location'] = ast.literal_eval('[["405 459;488 524"]]')

train.loc[12289, 'annotation'] = ast.literal_eval('[["experienced seeing her son at the kitchen table these things did not actually happen"]]')
train.loc[12289, 'location'] = ast.literal_eval('[["353 400;488 524"]]')

train.loc[13238, 'annotation'] = ast.literal_eval('[["SCRACHY THROAT"], ["RUNNY NOSE"]]')
train.loc[13238, 'location'] = ast.literal_eval('[["293 307"], ["321 331"]]')

train.loc[13297, 'annotation'] = ast.literal_eval('[["without improvement when taking tylenol"], ["without improvement when taking ibuprofen"]]')
train.loc[13297, 'location'] = ast.literal_eval('[["182 221"], ["182 213;225 234"]]')

train.loc[13299, 'annotation'] = ast.literal_eval('[["yesterday"], ["yesterday"]]')
train.loc[13299, 'location'] = ast.literal_eval('[["79 88"], ["409 418"]]')

train.loc[13845, 'annotation'] = ast.literal_eval('[["headache global"], ["headache throughout her head"]]')
train.loc[13845, 'location'] = ast.literal_eval('[["86 94;230 236"], ["86 94;237 256"]]')

train.loc[14083, 'annotation'] = ast.literal_eval('[["headache generalized in her head"]]')
train.loc[14083, 'location'] = ast.literal_eval('[["56 64;156 179"]]')

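14. Where to put the feature text

Putting the feature text (feature_text) at the start has one important advantage: truncation never cuts it off. If it is not placed first, maxlen has to be at least as long as the longest text.
In short, when the input is too long, truncate text and never feature_text.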
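The truncation rule can be sketched on plain token lists. build_input and n_special are hypothetical names, and the count of three special tokens is an assumption; the real number depends on the tokenizer.

```python
def build_input(feature_tokens, note_tokens, max_len, n_special=3):
    """Keep every feature token and truncate only the note to fit max_len.

    n_special is the assumed number of special tokens added around the pair;
    the real count depends on the tokenizer.
    """
    budget = max_len - n_special - len(feature_tokens)
    if budget < 0:
        raise ValueError("max_len is too small to hold the feature text")
    return feature_tokens, note_tokens[:budget]


feature = ["family", "history", "of", "mi"]
note = [f"w{i}" for i in range(20)]          # a 20-token stand-in for pn_history
kept_feature, kept_note = build_input(feature, note, max_len=12)
print(len(kept_feature), len(kept_note))
```

With Hugging Face tokenizers, passing the feature text as the first sequence and setting truncation="only_second" should apply the same idea.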

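15. Optimizer with decay

Tested; no gain.

16. Testing for the best maxlen

17. Chosen maxlen

Best maxlen for roberta: 310
Best maxlen for deberta-v3-large:

18. Things still worth trying

Apply capitalize preprocessing for deberta-v3-large
Calling biobert failed; capitalize alone gave about the same result, while capitalize + dropout improved it somewhat
Pretraining the model; an lgb model
For the ensemble: roberta-large + deberta-v3-large + lgb model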

19. Learning-rate bug

After careful debugging I found a bug in the learning rate that had been holding the score down.
The buggy code was:

deberta = DebertaV2Model.from_pretrained("/home/xiaoguzai/模型/deberta-v3-large")
model = ClassificationModel(deberta)
optimizer = torch.optim.AdamW(model.parameters(),lr=1e-5)

for epoch in range(15):

    model.train()
    model.to(device)
    #model = torch.load("/home/xiaoguzai/程序/NBME-Score Clinical Patient Notes/best_point=0.8127543174426041.pth")
    losses = AverageMeter()
    scaler = torch.cuda.amp.GradScaler(enabled=True)
    def lr_lambda(epoch):
        if epoch > 5:
            return 1
        else:
            return 2/(epoch+1)

    # BUG: the scheduler is re-created inside the epoch loop
    scheduler = LambdaLR(optimizer, lr_lambda)

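Because the scheduler is defined inside the epoch loop, it is re-created on every epoch, so the schedule restarts each time and the learning rate is effectively the same in every epoch. This bug was a major reason the single-model score would not improve.
The adjusted code: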

deberta = DebertaV2Model.from_pretrained("/home/xiaoguzai/模型/deberta-v3-large")
model = ClassificationModel(deberta)        
optimizer = torch.optim.AdamW(model.parameters(),lr=1e-5)
def lr_lambda(epoch):
    if epoch > 5:
        return 1
    else:
        return 2/(epoch+1)

scheduler = LambdaLR(optimizer, lr_lambda)

for epoch in range(15):

    model.train()
    model.to(device)
    #model = torch.load("/home/xiaoguzai/程序/NBME-Score Clinical Patient Notes/best_point=0.8127543174426041.pth")
    losses = AverageMeter()
    scaler = torch.cuda.amp.GradScaler(enabled=True)

The code is immediately much cleaner.
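To make the intended schedule concrete: LambdaLR multiplies the optimizer's initial learning rate by lr_lambda(epoch). The loop below only reproduces that arithmetic, assuming scheduler.step() is called once per epoch; it does not touch torch.

```python
def lr_lambda(epoch):
    # same rule as above: a multiplier applied to the base learning rate
    if epoch > 5:
        return 1
    else:
        return 2 / (epoch + 1)


base_lr = 1e-5
schedule = [base_lr * lr_lambda(epoch) for epoch in range(8)]
for epoch, lr in enumerate(schedule):
    print(epoch, f"{lr:.2e}")
```

When the scheduler is rebuilt at the top of every epoch, its epoch counter restarts, so the later multipliers in this list would never be reached.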

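20. Investigating the large gap between deberta's local and leaderboard scores

The gap between deberta's local score and its leaderboard score was too large. After investigation, it turned out to be caused by how the data was split into folds,
whereas the roberta score was the trustworthy one.
The fold split for deberta: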

from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
Fold = GroupKFold(n_splits=5)
groups = train['pn_num'].values
for n, (train_index,val_index) in enumerate(Fold.split(train,train['location'],groups)):
    train.loc[val_index, 'fold'] = int(n)
# split by groups, i.e. train['pn_num']; GroupKFold does not use the y argument (train['location'])
train['fold'] = train['fold'].astype(int)
train['annotation_length'] = train['annotation'].apply(len)

!!! The split above is the correct one: all rows with the same pn_num must go into the same fold, otherwise the validation score is inflated.

21. More detailed post-processing

With deberta, the post-processing picks the character one position after the predicted one:

def get_results(char_probs, th=0.5):
    results = []
    for char_prob in char_probs:
        result = np.where(char_prob >= th)[0] + 1
        # !!! note: the +1 here picks the character one position after the predicted one
        result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
        result = [f"{min(r)} {max(r)}" for r in result]
        result = ";".join(result)
        results.append(result)
    return results

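Most deberta offsets include the preceding space, so the start may need to move one position back (for example when the character in front of the offset is a letter rather than a space), and the end may also need to move back (when the following character is a letter). In my experiments, however, dropping the trailing character lowered the score.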

def get_results(test_text,char_probs, th=0.5):
    results = []
    #for char_prob in char_probs:
    for index in range(len(char_probs)):
        char_prob = char_probs[index]
        char_text = test_text[index]
        result = np.where(char_prob >= th)[0] + 1
        result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
        #result = [f"{min(r)} {max(r)}" for r in result]
        result = [[min(r),max(r)] for r in result]
        
        for index1 in range(len(result)):
            if result[index1][0]-1 >= 0 and char_text[result[index1][0]-1] != ' ':
                # no space in front of the span: move both boundaries one position back
                result[index1][0] = result[index1][0]-1
                result[index1][1] = result[index1][1]-1
        result = [str(r[0])+' '+str(r[1]) for r in result]
        result = ";".join(result)
        results.append(result)
    return results

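22. The weights in tf_model.h5 differ from those in pytorch_model.bin

For deberta-v2, the author said the dataset has to be switched to a TensorFlow dataset and then trained with PyTorch to get a reasonable score (I have not been able to try this yet). I tried shuffling the dataset, without success.
deberta-v2 seems structurally similar to deberta-v3, so the gain is probably limited, and roberta had already worked before, so I plan to keep pushing the roberta score.
deberta-v2 has some small details that can be handled in post-processing.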

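23. Approaches I can still think of:

1. Add more features (pn_num, feature_num)
2. Pretraining: the test-set notes are already contained in the text provided, so I suspect pretraining may bring some improvement
3. Post-processing with lgb
4. The learning rate for pseudo-label training could be tuned further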

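24. Finer-grained data processing

train.csv has an annotation column that can be used to correct the location column next to it; this part could be processed more carefully.

25. Remaining optimizations: pseudo labels and pretraining. When submitting, check whether id was included during data processing!!! (Later I could also try ensembling models trained with and without id.)

When training with pseudo labels, add only the pseudo labels that belong to the current batch's (fold's) id; otherwise pseudo labels from other batches may affect the predictions for the current one.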

26. Ignore case and only capitalize the first letter for the deberta model

27. Changing the sampling scheme

Here I sample with a BucketBatchSampler class:

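from typing import Iterable, Iterator, List

import numpy as np
from torch.utils.data import Sampler


class BucketBatchSampler(Sampler[Iterable[int]]):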
    """A batch sampler for sequence bucketing.

    This class creates buckets according to the length of examples. It first sorts the
    lengths and creates index map. Then it groups them into buckets and shuffle
    randomly. This makes each batch has examples of which lengths are almost same. It
    leads the decrement of unnecessary and wasted paddings, hence, you can reduce the
    padded sequence lengths and entire computational costs.

    Args:
        texts: A list of target texts.
        batch_size: The number of examples in each batch.
    """

    def __init__(self, texts: List[str], batch_size: int):
        indices = np.argsort([len(text.split()) for text in texts])
        if len(indices) % batch_size > 0:
            padding = batch_size - len(indices) % batch_size
            indices = np.append(indices, [-1] * padding)

        self.buckets = indices.reshape(-1, batch_size)
        self.permutation = np.random.permutation(self.buckets.shape[0])

    def __len__(self) -> int:
        return self.buckets.shape[0]

    def __iter__(self) -> Iterator[Iterable[int]]:
        for indices in self.buckets[self.permutation]:
            yield indices[indices >= 0]

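28. Remaining idea: compare pretrained deberta-v3 10-fold against pretrained deberta 5-fold + pretrained deberta-v3 5-fold

29. Final plan. Main account: pretraining + deberta-v3-large 8-fold + pseudo labels and deberta-large 8-fold + pseudo labels -> model ensemble, and one 886 model

Secondary account: deberta-v3-large 8-fold + pseudo labels and deberta-large 8-fold + pseudo labels / deberta-v3-large 5-fold + pseudo labels and deberta-large 5-fold + pseudo labels

30. More folds is not always better?

My guess is that some phrases are predicted correctly by only a few individual models; with too many folds, the contribution of those models gets diluted. The experiments point the same way: the 10-fold model scored noticeably worse than the 5-fold one.
So when averaging model outputs, more folds is not necessarily better; next I will try 4-fold or 3-fold models to see whether they do better.

31. Getting the learning rate to work with deberta-v2

deberta-v2 is sensitive to the learning rate, so the learning rate has to be adjusted according to what the model predicts:
1. If the output is almost all zeros (score 0.0), the learning rate is too high and should be lowered.
2. If the output has a great many non-zero predictions and the score is very low (around 0.0x), the learning rate is too low and should be raised.
In the end, the key point when using deberta-v2 turned out to be this line: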

torch.nn.utils.clip_grad_norm_(model.parameters(),50)

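Removing that line entirely made training succeed.

32. Final idea

Only one idea is left: redo the pretraining in the PET style, masking the text that corresponds to the annotation each time, and then run the subsequent training.

33. Summary of mistakes made when training deberta-v2

1. With a very large model the gradients produced during training can also become very large. Gradient clipping then has to be called in the right place: after loss.backward() and before optimizer.step().
2. With gradient accumulation, gradients summed over several steps inevitably change how clipping behaves. Avoid combining clipping with accumulation where possible; if you must, adjust the clipping threshold accordingly.
3. The larger and deeper the model, the larger the gradients tend to be, and the more likely clipping is needed.
4. Before clipping, it is best to average the loss with loss.mean(), so that the clipping behaviour does not depend on batch_size.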
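Point 2 is easier to see with numbers. The sketch below mimics norm-based clipping on plain lists: it scales all gradients by min(1, max_norm / (total_norm + eps)), which is my understanding of what torch.nn.utils.clip_grad_norm_ does. The function name and gradient values are made up.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


micro_batch_grads = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]  # made-up values

# Summed over 4 accumulation steps, the norm is 4x that of a single step.
summed = [sum(col) for col in zip(*micro_batch_grads)]
_, norm_single = clip_by_global_norm(micro_batch_grads[0], max_norm=10.0)
clipped_sum, norm_summed = clip_by_global_norm(summed, max_norm=10.0)

# Dividing by the number of accumulation steps keeps the norm comparable.
averaged = [g / len(micro_batch_grads) for g in summed]
clipped_avg, norm_avg = clip_by_global_norm(averaged, max_norm=10.0)

print(norm_single, norm_summed, norm_avg)
```

The same threshold that leaves a single step untouched cuts the accumulated gradient in half, so either the loss is divided by the number of accumulation steps or the threshold is rescaled.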

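34. Summary of a pseudo-label mistake

While competing today I found that with pseudo labels the local score clearly improved while the leaderboard score did not improve at all. A short summary follows.
The main cause: the local pseudo labels came from the ensemble of all five fold models. For the model of any given fold, the other four models were trained on data that includes this fold's validation set, so training on their pseudo labels indirectly uses the models and data of the other folds. Those other-fold models therefore distort how well the pseudo labels appear to work locally.

35. Gradient clipping

Note where the gradient clipping goes: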

scaler.scale(loss).backward()
# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)
# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optimizer)
# Update the scale factor for the next iteration
scaler.update()