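fasttext: Binary Text Classification in Practice (Tianchi Xiaobu Assistant Dialogue Short-Text Semantic Matching)

This is a binary classification task on sentence pairs: detect whether two sentences express the same meaning. The data comes from Track 3 of the Tianchi Global AI Technology Innovation Competition (see the competition page for details).

The model reaches up to 99.5% accuracy on the local test set but only about 75% online, slightly below the baseline.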

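Data format: both files are tab-separated with no header row. Each line of the training file holds two sentences and a label (q1, q2, label); the test file holds only q1 and q2.
For fasttext usage, see the official fasttext website.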
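
The loading code below reads both files as header-less, tab-separated text with the columns q1, q2 and (for training) label. As a rough illustration of that layout, here is a small sketch using only the standard library. The two sample rows and the helper name `read_pairs` are made up for the example; they are not taken from the competition data.

```python
import csv
import io

# Made-up rows in the assumed layout: q1 <TAB> q2 <TAB> label (no header row).
SAMPLE_TSV = "12 34 56\t12 34 78\t1\n90 11\t22 33 44\t0\n"


def read_pairs(handle):
    """Parse tab-separated rows into dicts with q1, q2 and an integer label."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{"q1": q1, "q2": q2, "label": int(label)} for q1, q2, label in reader]


rows = read_pairs(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row)
```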

import random

import fasttext
import pandas as pd

cate_dic = {'same': 1, 'different': 0}
# Load the data; no validation set is built
train_file = r'G:\chromeDownload\预测是否属于同一语义\baseline_tfidf_lr\oppo_breeno_round1_data\gaiic_track3_round1_train_20210228.tsv'
test_file = r'G:\chromeDownload\预测是否属于同一语义\baseline_tfidf_lr\oppo_breeno_round1_data\gaiic_track3_round1_testA_20210228.tsv'
df_train = pd.read_table(train_file, names=['q1', 'q2', 'label']).fillna("0")  # (100000, 3)
df_test = pd.read_table(test_file, names=['q1', 'q2']).fillna("0")  # (25000, 2)
label = df_train['label'].values
df = pd.concat([df_train, df_test], ignore_index=True)  # (125000, 3); label is NaN for the test rows
df['text'] = df['q1'] + " " + df['q2']  # (125000, 4)
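# Convert to fasttext format: "__label__<category>" followed by the text
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            # Note: the " , " separator adds a standalone "," token after the label
            sentences.append("__label__" + str(category) + " , " + line)
        except TypeError:
            # line is not a string; print it and move on
            print(line)
            continue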
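
fasttext's supervised mode expects one sample per line, with the label token (by default prefixed with `__label__`) separated from the text by whitespace. The sketch below shows one way to build such a line. The helper name `to_fasttext_line` is hypothetical, and unlike `preprocess_text` above it joins label and text with a single space instead of " , " and collapses embedded newlines, so it is a variation rather than the code used in this post.

```python
def to_fasttext_line(text, label, prefix="__label__"):
    """Return one supervised-training line: label token first, then the text."""
    # A newline inside the text would split one sample across two lines,
    # so collapse every run of whitespace into a single space.
    cleaned = " ".join(str(text).split())
    return f"{prefix}{label} {cleaned}"


print(to_fasttext_line("12 34 56 12 34 78", 1))
```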

# Build the training data
sentences = []
same_sentences = df_train[df_train.label == 1]
same_sentences = (same_sentences['q1'] + " " + same_sentences['q2']).values.tolist()
different_sentences = df_train[df_train.label == 0]
different_sentences = (different_sentences['q1'] + " " + different_sentences['q2']).values.tolist()

preprocess_text(same_sentences, sentences, cate_dic['same'])
preprocess_text(different_sentences, sentences, cate_dic['different'])
# Shuffle so the two classes are interleaved
random.shuffle(sentences)

# Write to a text file; the with-block closes the file so all lines are flushed before training
with open(r'G:\chromeDownload\预测是否属于同一语义\baseline_tfidf_lr\train_data.txt', 'w', encoding='utf-8') as out:
    for sentence in sentences:
        out.write(sentence + "\n")

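# Train (the relative path assumes the script runs from the directory that contains train_data.txt)
classifier = fasttext.train_supervised(input='train_data.txt', lr=1.0, epoch=25, wordNgrams=3, bucket=200000, dim=50, loss='hs')
# Evaluate; test() returns (number of samples, precision@1, recall@1).
# Note that this is measured on the training file itself.
classifier.test('train_data.txt')
# 100000, 0.9671, 0.9671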
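
Because no validation set is built and the score above is computed on the training file, it says little about performance on unseen pairs. One possible remedy is to hold out part of the formatted lines before training. The following is a minimal sketch of such a split; the helper name `split_lines`, the 20% ratio and the seed are assumptions chosen for illustration, and the demo lines are synthetic.

```python
import random


def split_lines(lines, valid_ratio=0.1, seed=42):
    """Shuffle a copy of the lines and split it into (train, valid) lists."""
    shuffled = list(lines)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


# Synthetic stand-ins for the fasttext-formatted training lines.
lines = [f"__label__{i % 2} sample {i}" for i in range(20)]
train_lines, valid_lines = split_lines(lines, valid_ratio=0.2)
print(len(train_lines), len(valid_lines))
```

Each list could then be written to its own file, with the training file passed to `fasttext.train_supervised` and the held-out file passed to `classifier.test`.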

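# Predict
# Collect the predicted probability of __label__0 for every test pair
lr_0_predictions = []

test_sentences_list = (df_test['q1'] + " " + df_test['q2']).values.tolist()
for i, texts in enumerate(test_sentences_list):
    # k=2 returns both labels, ordered by descending probability
    labels, probabilities = classifier.predict(texts, k=2)
    if i < 5:
        # Print only the first few predictions as a sanity check
        print(labels, '--', probabilities)
    if labels[0] == '__label__0':
        lr_0_predictions.append(probabilities[0])
    else:
        # __label__1 ranked first, so __label__0 is the second entry
        lr_0_predictions.append(probabilities[1])

# Write the predicted probabilities of label 0 to a file
pd.DataFrame(lr_0_predictions).to_csv("result.csv", index=False, header=False)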
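
The prediction loop relies on the position of each label in the tuple returned by `predict`. A slightly more general way to pick out the probability of one specific label is sketched below. The helper name `probability_of` is hypothetical, the sample values are made up, and the default target `__label__1` is an assumption: which label's probability a submission should contain needs to be checked against the competition rules.

```python
def probability_of(labels, probabilities, target="__label__1"):
    """Return the probability paired with `target`, or 0.0 if it is absent."""
    # predict() returns labels and probabilities in matching order,
    # so zipping them yields a label -> probability lookup.
    scores = dict(zip(labels, probabilities))
    return float(scores.get(target, 0.0))


# Made-up values shaped like the output of classifier.predict(text, k=2).
labels = ("__label__0", "__label__1")
probabilities = (0.875, 0.125)
print(probability_of(labels, probabilities))
print(probability_of(labels, probabilities, target="__label__0"))
```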
