Automotive-Domain Multilingual Transfer Learning Challenge: Competition Study Notes

人工偶尔不太智能

Last modified: 2022-07-24 00:08:56

Tags: artificial intelligence, python

First published: 2022-07-23 23:24:13

Article link: https://blog.csdn.net/weixin_51805094/article/details/125954115

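Task 1: Competition Registration

The Automotive-Domain Multilingual Transfer Learning Challenge is an algorithm competition hosted on the iFLYTEK Open Platform. I joined it through Coggle's competition study program, as my first attempt at an NLP competition.

Step 1: Register for the competition

Step 2: Download the competition data (click 赛题数据, "Competition Data", on the competition page)

Downloading the data requires real-name verification. The dataset is then read with pandas.

Step 3: Unzip the competition data and read it with pandas

Things got off to a rough start: the very first read failed with an error about the missing openpyxl package. I ran pip install openpyxl right away, expecting that to settle it, but another error appeared: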

ERROR: Could not find a version that satisfies the requirement onenpyxl (from versions: none)
ERROR: No matching distribution found for onenpyxl

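After some searching, I finally got it working with the command below. (Note that the error above names onenpyxl rather than openpyxl, so a typo in the install command may have played a part as well.)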

!pip --default-timeout=100 install openpyxl -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
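
Before reinstalling, it can help to check whether the package is actually importable in the current environment. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of pip or pandas.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# pandas.read_excel relies on openpyxl to read .xlsx files
print(is_installed("openpyxl"))
```

If this prints False in the same notebook kernel, the install went to a different environment or did not succeed.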

Done. Now read the data:

# Read the data
import pandas as pd

# Note: the Chinese file name is spelled "trian" here; it is kept exactly as written
train_cn = pd.read_excel('汽车领域多语种迁移学习挑战赛初赛训练集/中文_trian.xlsx')
train_ja = pd.read_excel('汽车领域多语种迁移学习挑战赛初赛训练集/日语_train.xlsx')
train_en = pd.read_excel('汽车领域多语种迁移学习挑战赛初赛训练集/英文_train.xlsx')
test_ja = pd.read_excel('testA.xlsx', sheet_name='日语_testA')
test_en = pd.read_excel('testA.xlsx', sheet_name='英文_testA')

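Step 4: Inspect the field types of the training and test sets, and post the data-loading code to the blog

# Inspect the field types of the training and test sets
train_cn.dtypes, train_ja.dtypes, train_en.dtypes, test_ja.dtypes, test_en.dtypes

As the output shows, every field here has dtype object, which is the dtype pandas uses for columns holding Python strings.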

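Task 2: Text Analysis and Tokenization

Step 1: Tokenize Chinese with jieba

# Tokenize Chinese with jieba
import jieba
import jieba.posseg as pseg

# jieba.lcut returns the tokens of a sentence as a list
train_cn['words'] = train_cn['原始文本'].apply(lambda x: jieba.lcut(x))
train_cn.head()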

Step 2: Tokenize Japanese with nagisa

# Tokenize Japanese with nagisa
import nagisa
# nagisa.tagging(x).words holds the list of tokens
train_ja['words'] = train_ja['原始文本'].apply(lambda x: nagisa.tagging(x).words)
train_ja.head()

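Task 3: TF-IDF and Text Classification

Step 1: Learn how to use TF-IDF and extract TF-IDF features from a corpus

tf (term frequency): the number of times a given term appears in a document. The count is usually normalized (typically by dividing it by the total number of terms in that document) so that it does not favor long documents.

idf (inverse document frequency): reflects how widely a term is spread across all texts (the whole corpus). If a term appears in many texts, its idf should be low; conversely, if it appears in only a few texts, its idf should be high.

(The TF-IDF introduction above draws on the WeChat public account 小伍哥学风控, article "万万没想到，TF-IDF是这么计算的！")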
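
The two definitions above can be sketched in plain Python. This is a textbook form written for illustration, assuming tf is the count divided by document length and idf is log(N / df); the function names and the toy documents are my own assumptions, and every queried term is assumed to occur in at least one document.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, docs_tokens):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log(len(docs_tokens) / df)


def tfidf(term, doc_tokens, docs_tokens):
    """TF-IDF score of `term` in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, docs_tokens)


docs = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
]

# "this" occurs in every document, so its idf (and its tf-idf) is 0
print(tfidf("this", docs[0], docs))
# "first" occurs in only one document, so it scores higher
print(tfidf("first", docs[0], docs))
```

scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from this simplified form.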

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
# Initialize the vectorizer
vector = TfidfVectorizer()
# Compute TF-IDF (returns a sparse document-term matrix)
tfidf = vector.fit_transform(corpus)
# Show the column index assigned to each term
vector.vocabulary_
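
To make the meaning of vocabulary_ more concrete, here is a rough standard-library sketch of how such a term-to-index mapping can be built. It assumes the vectorizer's default settings (lowercasing, tokens of at least two word characters, terms indexed in sorted order); build_vocabulary is a hypothetical helper, not a scikit-learn function.

```python
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Assumed to match the default token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(docs):
    """Map each distinct lowercased term to a column index, in sorted order."""
    terms = set()
    for doc in docs:
        terms.update(TOKEN_PATTERN.findall(doc.lower()))
    return {term: index for index, term in enumerate(sorted(terms))}


vocabulary = build_vocabulary(corpus)
print(vocabulary)
```

Each value is the column of the TF-IDF matrix that holds the scores for that term.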

Step 2: Train logistic regression on TF-IDF features (using the corpora of all languages) and classify the intents of the test set

import numpy as np
import pandas as pd
import nagisa
from sklearn.feature_extraction.text import TfidfVectorizer  # text feature extraction
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.pipeline import make_pipeline  # chain the steps into one pipeline

# Tokenize: join nagisa tokens with spaces for Japanese, lowercase the English text
train_ja['words'] = train_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
train_en['words'] = train_en['原始文本'].apply(lambda x: x.lower())

test_ja['words'] = test_ja['原始文本'].apply(lambda x: ' '.join(nagisa.tagging(x).words))
test_en['words'] = test_en['原始文本'].apply(lambda x: x.lower())

# Train TF-IDF + logistic regression on the Japanese and English training sets
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression()
)
pipeline.fit(
    train_ja['words'].tolist() + train_en['words'].tolist(),
    train_ja['意图'].tolist() + train_en['意图'].tolist()
)

# Predict the intent for each test set
test_ja['意图'] = pipeline.predict(test_ja['words'])
test_en['意图'] = pipeline.predict(test_en['words'])

# Slot values are not predicted here, so leave them empty
test_en['槽值1'] = np.nan
test_en['槽值2'] = np.nan
test_ja['槽值1'] = np.nan
test_ja['槽值2'] = np.nan

# Write the submission file; the context manager saves and closes the writer
# (writer.save() is no longer available in recent pandas versions)
with pd.ExcelWriter('submit.xlsx') as writer:
    test_en.drop(['words'], axis=1).to_excel(writer, sheet_name='英文_testA', index=False)
    test_ja.drop(['words'], axis=1).to_excel(writer, sheet_name='日语_testA', index=False)
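
One detail that may be worth checking in the pipeline above: by default, TfidfVectorizer only keeps tokens made of two or more word characters, so single-character tokens in the space-joined Japanese text are dropped. The sketch below reproduces that behavior with the standard re module; the token list is a made-up example rather than real nagisa output, and the second pattern is only one possible alternative that could be passed as token_pattern.

```python
import re

# Default token pattern used by TfidfVectorizer: two or more word characters
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Possible alternative that also keeps single-character tokens
KEEP_SINGLE_CHARS = re.compile(r"(?u)\b\w+\b")

# Hypothetical tokenizer output, joined with spaces as in the pipeline above
text = " ".join(["エアコン", "を", "つけ", "て"])

default_tokens = DEFAULT_PATTERN.findall(text)
all_tokens = KEEP_SINGLE_CHARS.findall(text)

print(default_tokens)  # single-character tokens are missing
print(all_tokens)      # every token is kept
```

Whether keeping single-character tokens actually improves the score is something to verify on a validation split.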