Text sentiment classification
- Machine learning approach: TF-IDF features + a classical classification algorithm
- Deep learning approach: TextCNN, TextRNN, or a pretrained model
Which pretrained models are available?
- BERT
![BERT input representation](https://i-blog.csdnimg.cn/blog_migrate/bbe517de542e54f126952a12ac0cc13e.png)
BERT takes three input sequences:
- Token: the sequence of tokens; the text is converted into token IDs before being fed to the model.
- Segment: the segment sequence, used to distinguish sentence A from sentence B (0 for A, 1 for B); for single-sentence text classification it can be all zeros.
- Position: the position embeddings; since the Transformer cannot capture positional information well on its own, learned position vectors are added. They are randomly initialized and trained as part of building the embedding.
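The three input sequences can be sketched in plain Python (toy vocabulary and character-level tokenization for illustration only; the real model uses BERT's WordPiece vocabulary and a trained tokenizer):

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Build the token / segment / position id sequences for one or two sentences."""
    # Toy stand-in for the real WordPiece vocabulary; only the special tokens
    # use their actual bert-base ids.
    toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102}

    def tok_id(tok):
        # Unknown tokens get a deterministic dummy id; a real tokenizer would
        # look them up in the vocabulary file.
        return toy_vocab.get(tok, 1000 + hash(tok) % 1000)

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # sentence A -> segment 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)   # sentence B -> segment 1
    input_ids = [tok_id(t) for t in tokens]
    position_ids = list(range(len(tokens)))        # 0, 1, 2, ... drives the position embedding
    return input_ids, segment_ids, position_ids

ids, segs, pos = build_bert_inputs(["深", "度"], ["之", "眼"])
```

For single-sentence classification, only `tokens_a` is passed and `segment_ids` comes out all zeros, matching the note above.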
- ALBERT
- XLNet
- RoBERTa
Pretrained models require a large amount of GPU memory.
BERT source code
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L428
The transformers package
- https://huggingface.co/transformers/v2.5.0/model_doc/bert.html
- For details on the `tokenizer.encode_plus` parameters, see line 924 of: https://github.com/huggingface/transformers/blob/72768b6b9c2083d9f2d075d80ef199a3eae881d8/src/transformers/tokenization_utils.py#L924
```python
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical  # public API path, not tensorflow.python.keras
from tqdm import tqdm
from transformers import BertTokenizer  # explicit import instead of `from transformers import *`

print(tf.__version__)

TRAIN_PATH = './data/train_dataset/'
TEST_PATH = './data/test_dataset/'
BERT_PATH = './bert_base_chinese/'
MAX_SEQUENCE_LENGTH = 140
input_categories = '微博中文内容'
output_categories = '情感倾向'

# Keep only the rows whose sentiment label is one of -1 / 0 / 1.
df_train = pd.read_csv(TRAIN_PATH + 'nCoV_100k_train.labled.csv', engine='python')
df_train = df_train[df_train[output_categories].isin(['-1', '0', '1'])]
df_test = pd.read_csv(TEST_PATH + 'nCov_10k_test.csv', engine='python')
df_sub = pd.read_csv(TEST_PATH + 'submit_example.csv')
print('train shape =', df_train.shape)
print('test shape =', df_test.shape)

# Build the tokenizer from the local bert-base-chinese vocabulary file.
tokenizer = BertTokenizer.from_pretrained(BERT_PATH + 'bert-base-chinese-vocab.txt')
tokenizer.encode_plus("深度之眼",
                      add_special_tokens=True,
                      # The original snippet was truncated here; MAX_SEQUENCE_LENGTH is assumed.
                      max_length=MAX_SEQUENCE_LENGTH)
```
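Before being fed to the model, each encoded example must be padded (or truncated) to `MAX_SEQUENCE_LENGTH`, with an attention mask marking the real tokens. A minimal numpy sketch (the helper `pad_ids` is my own illustration, not part of the original pipeline; `encode_plus` can also pad for you via its padding-related arguments):

```python
import numpy as np

MAX_SEQUENCE_LENGTH = 140

def pad_ids(ids, max_len=MAX_SEQUENCE_LENGTH, pad_id=0):
    """Truncate/pad a list of token ids to a fixed length and build the attention mask."""
    ids = ids[:max_len]                                      # truncate if too long
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))  # 1 = real token, 0 = padding
    ids = ids + [pad_id] * (max_len - len(ids))              # pad with [PAD] id 0
    return np.array(ids), np.array(attention_mask)

ids, mask = pad_ids([101, 2769, 102])
```

Stacking the per-example arrays then yields the fixed-shape `(batch, 140)` inputs the model expects.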