Task 1: Register for the competition
import pandas as pd
import numpy as np
# Task 1: read the data
train_cn = pd.read_excel('data/中文_train.xlsx')
train_en = pd.read_excel('data/英文_train.xlsx')
train_jp = pd.read_excel('data/日语_train.xlsx')
test_jp = pd.read_excel('data/test_A.xlsx', sheet_name='日语_testA')
test_en = pd.read_excel('data/test_A.xlsx', sheet_name='英文_testA')
# Inspect the training data
print(train_cn.head())
| 原始文本 | 意图 | 槽值1 | 槽值2 |
0 | 16.5度 | adjust_ac_temperature_to_number | offset:16.5 | NaN |
1 | 16度 | adjust_ac_temperature_to_number | offset:16 | NaN |
2 | 16空调开到16度 | adjust_ac_temperature_to_number | offset:16 | NaN |
3 | 16温度16度 | adjust_ac_temperature_to_number | offset:16 | NaN |
4 | 17度 | adjust_ac_temperature_to_number | offset:17 | NaN |
print(train_en.head())
| 原始文本 | 中文翻译 | 意图 | 槽值1 | 槽值2 |
0 | open aircon please | 请打开空调 | open_ac | NAN | NAN |
1 | I want to activate the AC | 我想打开空调 | open_ac | NAN | NAN |
2 | I want to turn on the air conditioner | 我想打开空调 | open_ac | NAN | NAN |
3 | switch on the AC please | 请打开空调 | open_ac | NAN | NAN |
4 | Help me open the AC | 帮我打开空调 | open_ac | NAN | NAN |
print(train_jp.head())
| 原始文本 | 中文翻译 | 意图 | 槽值1 | 槽值2 |
0 | エアコンのスイッチONに | 打开空调开关 | open_ac | NAN | NAN |
1 | エアコン入れる | 打开空调 | open_ac | NAN | NAN |
2 | エアコンのスイッチを | 打开空调开关 | open_ac | NAN | NAN |
3 | エアコンのスイッチ入れる | 打开空调开关 | open_ac | NAN | NAN |
4 | エアコンのスイッチON | 打开空调开关 | open_ac | NAN | NAN |
# Inspect the test set
print(test_jp.head())
print(test_en.head())
| 原始文本 |
0 | switch on the AC |
1 | air conditioner open |
2 | Turn on the AC please |
3 | I wanna switch on aircon please |
4 | Help me switch on aircon |
Task 2: Text analysis and tokenization
- Step 1: Tokenize the Chinese text with jieba;
import jieba
def cutwords(txt):
    # Segment one Chinese sentence into a list of tokens
    return jieba.lcut(txt)
train_cn['phrase'] = train_cn['原始文本'].apply(cutwords)
print(train_cn.head())
lcut returns the segmentation result as a list, whereas jieba.cut returns a generator.
- Step 2: Tokenize the Japanese text with nagisa
import nagisa
def cutjpwords(txt):
    # nagisa.tagging returns an object holding the tokens (.words) and their POS tags (.postags)
    words = nagisa.tagging(txt)
    return words.words
train_jp['phrase'] = train_jp['原始文本'].apply(cutjpwords)
print(train_jp.head())
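The two steps above cover Chinese and Japanese but not the English corpus. Below is a minimal sketch of one way to handle it, assuming that lowercasing and whitespace splitting is good enough for these short commands. The helper name cutenwords and the n_tokens column are hypothetical, and the small frame is copied from the train_en.head() output above so that the snippet runs without the data files.

```python
import pandas as pd

# Toy frame copied from the train_en.head() output above; a real run would
# use the train_en DataFrame loaded in Task 1 instead.
sample_en = pd.DataFrame({
    "原始文本": [
        "open aircon please",
        "I want to activate the AC",
        "I want to turn on the air conditioner",
        "switch on the AC please",
        "Help me open the AC",
    ],
    "意图": ["open_ac"] * 5,
})

def cutenwords(txt):
    # Hypothetical helper: lowercase the sentence, then split on whitespace.
    return txt.lower().split()

sample_en["phrase"] = sample_en["原始文本"].apply(cutenwords)

# Simple text analysis: tokens per utterance and samples per intent.
sample_en["n_tokens"] = sample_en["phrase"].apply(len)
print(sample_en[["phrase", "n_tokens"]])
print(sample_en["意图"].value_counts())
```

The same n_tokens and value_counts checks can be applied to train_cn and train_jp once their phrase columns exist.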
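Task 3: TF-IDF and text classification
- Step 1: Learn how to use TF-IDF and extract TF-IDF features from the corpus;
- Step 2: Train a logistic regression model on the TF-IDF features (using the corpora of all languages) and classify the intents of the test set;
- Step 3: Submit the prediction file produced in Step 2 to the competition and take a screenshot of the score.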
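Steps 1 and 2 can be sketched as follows. This is one possible setup rather than a reference solution: it assumes scikit-learn's TfidfVectorizer and LogisticRegression, and it assumes character n-grams so that a single vectorizer covers Chinese, English and Japanese without language-specific tokenizers. The tiny corpus is copied from the head() outputs above so the snippet runs on its own; a real run would concatenate the 原始文本 and 意图 columns of train_cn, train_en and train_jp.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus copied from the printed head() rows above (all three languages).
texts = [
    "16.5度", "16度", "17度",
    "open aircon please", "I want to activate the AC", "switch on the AC please",
    "エアコン入れる", "エアコンのスイッチON",
]
labels = ["adjust_ac_temperature_to_number"] * 3 + ["open_ac"] * 5

# Assumption: character 1- to 3-grams instead of word tokens, so no
# per-language tokenizer is needed inside the vectorizer.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Classify a few test utterances (taken from the test_en.head() output).
test_texts = ["switch on the AC", "air conditioner open"]
pred = model.predict(test_texts)
for text, intent in zip(test_texts, pred):
    print(text, "->", intent)
```

An alternative is to feed the phrase columns from Task 2 into the vectorizer through its tokenizer or analyzer parameter, which keeps the jieba and nagisa segmentation instead of relying on character n-grams.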