AI Challenger Fine-Grained User Review Sentiment Analysis (baseline 0.62)


Competition site: https://challenger.ai/competition/fsauor2018
Data download: https://download.csdn.net/download/linxid/11469830


Related articles:
1. Sentiment Analysis: Almost Everything You Need to Know (Part 1)
2. Sentiment Analysis: Almost Everything You Need to Know (Part 2)
3. AI Challenger Competition Notes: Class Imbalance

Here is a baseline that scores about 0.62 on the leaderboard, with room left for tuning. Running it several times and combining the predictions (simple fusion) can push the score higher. The code is straightforward, runs fast on a GPU, and can be reproduced quickly by just changing the file paths.
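The "simple fusion" mentioned above is just averaging the class-probability outputs of several independently trained runs and taking the argmax per sample. A minimal plain-Python sketch (the probabilities below are made-up toy values, not real model outputs):

```python
def fuse_predictions(runs):
    """Average per-class probabilities across runs, then take argmax per sample.

    runs: list of runs; each run is a list of per-sample probability vectors.
    Returns one predicted class index per sample.
    """
    n_runs = len(runs)
    n_samples = len(runs[0])
    fused = []
    for i in range(n_samples):
        n_classes = len(runs[0][i])
        # element-wise mean of the probability vectors for sample i
        avg = [sum(run[i][c] for run in runs) / n_runs for c in range(n_classes)]
        fused.append(max(range(n_classes), key=lambda c: avg[c]))
    return fused

# Two runs disagree on sample 0; averaging resolves it toward class 1.
run_a = [[0.6, 0.4], [0.2, 0.8]]
run_b = [[0.3, 0.7], [0.1, 0.9]]
print(fuse_predictions([run_a, run_b]))  # [1, 1]
```

Averaging the raw probabilities (rather than voting on argmax labels) keeps information from runs that were uncertain, which is why this tends to add a little score.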

1. Environment:

  • OS: Ubuntu 16.04
  • Model: CNN
  • Framework: Keras

2. Data preprocessing:

First tokenize the text and remove stop words, then bring every sample to the same length. Tokenization uses the jieba library; keras.preprocessing.text.Tokenizer maps words to integer indices, and pad_sequences produces equal-length arrays. Samples that are too long are truncated, and short ones are padded with zeros. The code below is commented in detail.

import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

# Tokenize with jieba
data['words'] = data['content'].apply(lambda x: list(jieba.cut(x)))
# Drop the first and last tokens (the quote characters surrounding each review)
data['words'] = data['words'].apply(lambda x: x[1:-1])
words_dict = []
texts = []
# Remove stop words
for index, row in data.iterrows():
    line = [word for word in list(row['words']) if word not in stoplist]
    words_dict.extend(line)
    texts.append(line)
# Find the length of the longest sample
maxlen = 0
for line in texts:
    if maxlen < len(line):
        maxlen = len(line)
max_words = 50000
# Use keras's Tokenizer to map words to integer indices, then pad to equal length
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
data_w = tokenizer.texts_to_sequences(texts)
data_T = sequence.pad_sequences(data_w, maxlen=maxlen)
# Split back into training, validation and test sets
dealed_train = data_T[:train.shape[0]]
dealed_val = data_T[train.shape[0]:(train.shape[0] + val.shape[0])]
dealed_test = data_T[(train.shape[0] + val.shape[0]):]
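What pad_sequences does above can be illustrated without Keras: by default it pre-pads short sequences with zeros and pre-truncates long ones, keeping only the last maxlen tokens. A plain-Python sketch of that behavior:

```python
def pad_sequences_sketch(seqs, maxlen):
    """Mimic keras pad_sequences defaults: padding='pre', truncating='pre'."""
    out = []
    for seq in seqs:
        if len(seq) >= maxlen:
            out.append(list(seq[-maxlen:]))                    # keep the last maxlen tokens
        else:
            out.append([0] * (maxlen - len(seq)) + list(seq))  # left-pad with zeros
    return out

print(pad_sequences_sketch([[5, 8, 2], [7]], maxlen=4))
# [[0, 5, 8, 2], [0, 0, 0, 7]]
```

Index 0 is therefore reserved as the padding value, which is consistent with Tokenizer starting its word indices at 1.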

3. CNN model

A CNN was chosen because it trains fast while still giving decent accuracy.

3.1 Build the model:

from keras.models import Sequential
from keras.layers import Embedding
from keras import layers

def build_model():
    model = Sequential()
    embedding_dim = 128
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(64, 3, activation='relu'))
    model.add(layers.MaxPooling1D(5))
#     model.add(Dropout(0.5))
    model.add(layers.Conv1D(64, 3, activation='relu'))
#     model.add(Dropout(0.5))
    model.add(layers.GlobalMaxPooling1D())
#     model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dense(4, activation='softmax'))
    return model
3.2 Train the model

Two functions are provided here: 1. train_CV_CNN, which holds out part of the training data for cross-validation; 2. train_CNN, which uses val directly for validation.
A Keras implementation of the multi-class f1_score metric is not included yet; it will be added in a future update.
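The macro-F1 used for scoring (via sklearn's f1_score with average='macro' below) computes F1 per class and averages the results with equal weight. A plain-Python sketch of the computation:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 for each class, then take the unweighted mean."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))  # ≈ 0.733
```

Because every class counts equally regardless of its frequency, rare classes (such as the "not mentioned" label) can dominate the score, which is why class imbalance matters in this competition.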

train_CV_CNN
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from keras.utils import to_categorical

def train_CV_CNN(train_x=dealed_train, test_x=dealed_test, val_x=dealed_val, y_cols=y_cols, debug=False, folds=2):
    model = build_model()
    # The custom Keras f1 metric is not included yet, so only accuracy is tracked
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    F1_scores = 0
    F1_score = 0
    if debug:
        y_cols = ['location_traffic_convenience']
    for index, col in enumerate(y_cols):
        # Shift the labels {-2, -1, 0, 1} to {0, 1, 2, 3} for the 4-way softmax
        train_y = train[col] + 2
        val_y = val[col] + 2
        y_val_pred = 0
        y_test_pred = 0
        for i in range(folds):
            X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.2, random_state=100 * i)
            y_train_onehot = to_categorical(y_train)
            y_test_onehot = to_categorical(y_test)
            history = model.fit(X_train, y_train_onehot, epochs=20, batch_size=64, validation_data=(X_test, y_test_onehot))
            # Predict on the validation and test sets
            y_val_pred = model.predict(val_x)
            y_test_pred += model.predict(test_x)
        y_val_pred = np.argmax(y_val_pred, axis=1)
        F1_score = f1_score(val_y, y_val_pred, average='macro')
        F1_scores += F1_score
        print(col, 'f1_score:', F1_score, 'ACC_score:', accuracy_score(val_y, y_val_pred))
        y_test_pred = np.argmax(y_test_pred, axis=1)
        # Shift the predicted classes back to the original label range
        result[col] = y_test_pred - 2
    print('all F1_score:', F1_scores / len(y_cols))
    return result
train_CNN
def train_CNN(train_x=dealed_train, test_x=dealed_test, val_x=dealed_val, y_cols=y_cols, debug=False, folds=1):
    model = build_model()
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    F1_scores = 0
    F1_score = 0
    if debug:
        y_cols = ['location_traffic_convenience']
    for index, col in enumerate(y_cols):
        # Shift the labels {-2, -1, 0, 1} to {0, 1, 2, 3} for the 4-way softmax
        train_y = train[col] + 2
        val_y = val[col] + 2
        y_val_pred = 0
        y_test_pred = 0
        for i in range(folds):
            y_train_onehot = to_categorical(train_y)
            history = model.fit(train_x, y_train_onehot, epochs=3, batch_size=64, validation_split=0.2)
            # Predict on the validation and test sets
            y_val_pred = model.predict(val_x)
            y_test_pred += model.predict(test_x)
        y_val_pred = np.argmax(y_val_pred, axis=1)
        F1_score = f1_score(val_y, y_val_pred, average='macro')
        F1_scores += F1_score
        print('Aspect', index, col, 'f1_score:', F1_score, 'ACC_score:', accuracy_score(val_y, y_val_pred))
        y_test_pred = np.argmax(y_test_pred, axis=1)
        # Shift the predicted classes back to the original label range
        result[col] = y_test_pred - 2
    print('all F1_score:', F1_scores / len(y_cols))
    return result

One more thing

Follow our WeChat official account for follow-ups: more models for this competition, plus open-source code for other competitions. Our community keeps growing, and like-minded friends are welcome to join us.

Follow the WeChat official account 「译智社」 for quality AI articles and digests of good blogs and papers from abroad!
