NLP beginner Task 1: Text Classification Based on Machine Learning

夏天爱喝可乐

Last modified 2023-07-24 21:36:13

Tags: artificial intelligence, nlp

First published 2022-04-13 22:38:41

Article link: https://blog.csdn.net/qq_47391835/article/details/124143618

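This series records my work through the nlp-beginner tasks set by Prof. qxp (Prof. Qiu) of Fudan University. Prof. Qiu is a leading figure in NLP, and the project works very well as an introduction to the field. This post covers the first task in the series; the GitHub project link is at the end of the post.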

Task goal

Implement text classification based on logistic/softmax regression

Dataset: Classify the sentiment of sentences from the Rotten Tomatoes dataset

Dataset description:

Submissions are evaluated on classification accuracy (the percent of labels that are predicted correctly) for every parsed phrase. The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
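
Since submissions are scored by classification accuracy, the metric itself can be sketched in a few lines. The function name `accuracy` and the sample labels below are illustrative assumptions, not part of the competition code.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels that are predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels in the 0-4 sentiment range, for illustration only
y_true = [2, 3, 0, 4, 2]
y_pred = [2, 3, 1, 4, 1]
print(accuracy(y_true, y_pred))  # 3 of 5 correct -> 0.6
```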

Implementation

Reading the data

Read in the data. The dataset has 4 columns; the last column is the sentiment label, which takes 5 classes.

import pandas as pd

# pandas infers zip compression from the file extension
df_train = pd.read_csv('../input/sentiment-analysis-on-movie-reviews/train.tsv.zip', sep='\t')
df_test = pd.read_csv('../input/sentiment-analysis-on-movie-reviews/test.tsv.zip', sep='\t')

Data preprocessing

The first step is to clean the data; this step also performs tokenization.

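import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from tqdm import tqdm

# Build these once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def clean_sentences(df):
    reviews = []
    for sent in tqdm(df['Phrase']):
        # Replace every non-letter character with a space
        text = re.sub('[^a-zA-Z]', ' ', sent)

        # Tokenize
        words = word_tokenize(text.lower())

        # Remove stop words
        new_words = [w for w in words if w not in stop_words]

        # Reduce each word to its base form (lemmatization)
        lem_words = [lem.lemmatize(w) for w in new_words]

        reviews.append(lem_words)
    return reviews

train_sentences = clean_sentences(df_train)
test_sentences = clean_sentences(df_test)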
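
To see what the cleaning steps do to a single phrase without downloading any NLTK data, here is a dependency-free sketch. It is not the NLTK pipeline above: `TOY_STOP_WORDS` is a tiny hand-picked list (an assumption, far shorter than NLTK's), `str.split` stands in for `word_tokenize`, and lemmatization is skipped.

```python
import re

# Tiny hand-picked stop-word list, for illustration only
TOY_STOP_WORDS = {"a", "an", "the", "of", "is", "that", "this", "and"}

def clean_phrase(phrase):
    # Replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]", " ", phrase)
    # Lowercase and split on whitespace (a stand-in for word_tokenize)
    words = text.lower().split()
    # Drop stop words
    return [w for w in words if w not in TOY_STOP_WORDS]

print(clean_phrase("This is a series of escapades, and it's GOOD!"))
# ['series', 'escapades', 'it', 's', 'good']
```

Two side effects are worth noticing: replacing the apostrophe with a space splits "it's" into "it" and "s", and a phrase made up entirely of stop words comes back as an empty list.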

The targets are also one-hot encoded.

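import torch

# One row per sample, one column per sentiment class
train_target = torch.zeros(df_train["Sentiment"].shape[0], 5)
# Write a 1 into the column given by each sample's label
train_target.scatter_(dim=1, index=torch.tensor(df_train["Sentiment"].values).unsqueeze(1), value=1)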
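
As a small check of what the one-hot step produces, the sketch below applies the same `scatter_` call to a few made-up labels and compares the result with PyTorch's built-in `torch.nn.functional.one_hot`. The label values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Made-up sentiment labels in the range 0-4
labels = torch.tensor([0, 4, 2, 2])

# Same approach as above: start from zeros and scatter 1s into the label columns
one_hot_scatter = torch.zeros(labels.shape[0], 5)
one_hot_scatter.scatter_(dim=1, index=labels.unsqueeze(1), value=1)

# Built-in alternative; it returns an integer tensor, so cast to float
one_hot_builtin = F.one_hot(labels, num_classes=5).float()

print(one_hot_scatter)
print(torch.equal(one_hot_scatter, one_hot_builtin))
```

Both routes give the same matrix; `F.one_hot` is shorter, while `scatter_` avoids the extra cast.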

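Word vectorization

First collect all words to build a vocabulary, then use the vocabulary to turn each sentence into a vector of word indices. The length of the longest sentence is used as the vector dimension.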

vocab = set()
len_max = 0
for sent in tqdm(train_sentences):
    vocab.update(sent)
    if len(sent) > len_max:
        len_max = len(sent)
print(len(vocab), len_max)

# Index 0 is reserved for padding, so real words start at 1;
# sorting keeps the mapping identical across runs
word2index_list = {word: i for (i, word) in enumerate(sorted(vocab), start=1)}

def coding_sentence(sentences):
    # Rows start as all zeros, so positions past the end of a sentence stay 0 (padding)
    sentences_coding = torch.zeros(len(sentences), len_max)
    for i, line in enumerate(sentences):
        for j, word in enumerate(line):
            sentences_coding[i][j] = word2index_list[word]
    return sentences_coding

train_sentence = coding_sentence(train_sentences)
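
`coding_sentence` looks every word up directly in the vocabulary, so it would raise a `KeyError` on a test-set word that never appeared in the training data, and on a sentence longer than `len_max` it would index out of range. One common way around both, sketched here in plain Python, is to reserve an index for unknown words and to truncate. The names `PAD`, `UNK`, `build_vocab`, and `encode` are hypothetical and not from the original post.

```python
PAD, UNK = 0, 1  # hypothetical reserved indices: padding and unknown word

def build_vocab(tokenized_sentences):
    # Sort so that the word -> index mapping is the same on every run
    words = sorted({w for sent in tokenized_sentences for w in sent})
    return {w: i + 2 for i, w in enumerate(words)}  # 0 and 1 are reserved

def encode(sentence, word2index, max_len):
    # Unknown words map to UNK; sentences longer than max_len are truncated
    ids = [word2index.get(w, UNK) for w in sentence[:max_len]]
    # Shorter sentences are right-padded with PAD
    return ids + [PAD] * (max_len - len(ids))

train = [["good", "movie"], ["bad", "plot", "twist"]]
word2index = build_vocab(train)
print(word2index)
# {'bad': 2, 'good': 3, 'movie': 4, 'plot': 5, 'twist': 6}
print(encode(["good", "ending"], word2index, max_len=3))
# [3, 1, 0]
```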

n-gram

dimension = 2  # maximum n-gram order (assumed value; set as needed)

phrases = df_train["Phrase"]
data = phrases[:1000]  # only the first 1000 phrases are used
dict_words = dict()
phrase_list = []
for line in data:
    # Keep letters and digits only, then lowercase and split on whitespace
    line = re.sub(r'[^a-zA-Z0-9]+', " ", line)
    line = line.lower().split()
    phrase_list.append(line)
data = phrase_list

# First pass: give every distinct 1..dimension-gram a column index
for d in range(1, dimension + 1):
    for words in data:
        for i in range(len(words) - d + 1):
            temp = "_".join(words[i:i + d])
            if temp not in dict_words:
                dict_words[temp] = len(dict_words)
length = len(dict_words)

# Second pass: binary bag-of-n-grams matrix, one row per phrase
train_matrix = torch.zeros(len(data), length)
for d in range(1, dimension + 1):
    for i, words in enumerate(data):
        for j in range(len(words) - d + 1):
            temp = "_".join(words[j:j + d])
            train_matrix[i][dict_words[temp]] = 1
train_matrix
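
The n-gram code above can be hard to follow on the full dataset, so here is the same idea on two toy sentences in plain Python. The helper names `extract_ngrams` and `bag_of_ngrams` are hypothetical. Column indices are assigned sentence by sentence here rather than order by order as above, so the column order differs while the features are the same.

```python
def extract_ngrams(words, max_n):
    """Return all 1..max_n-grams of a token list, joined with '_'."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append("_".join(words[i:i + n]))
    return grams

def bag_of_ngrams(sentences, max_n):
    # Assign each distinct n-gram a column index in order of first appearance
    index = {}
    for words in sentences:
        for g in extract_ngrams(words, max_n):
            index.setdefault(g, len(index))
    # One binary row per sentence: 1 if the n-gram occurs in it
    rows = []
    for words in sentences:
        row = [0] * len(index)
        for g in extract_ngrams(words, max_n):
            row[index[g]] = 1
        rows.append(row)
    return index, rows

sentences = [["a", "good", "movie"], ["a", "bad", "movie"]]
index, rows = bag_of_ngrams(sentences, max_n=2)
print(index)
print(rows)
# [[1, 1, 1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 1, 1, 1]]
```

The two sentences share the unigrams "a" and "movie" but no bigram, which is the extra word-order information that n-grams add over a plain bag of words.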

Dataset split

Split the dataset, holding out 20% as the validation set.

n_samples = train_sentence.shape[0]
n_val  = int(0.2 * n_samples)
shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices # the split is random

X_train = train_sentence[train_indices]
Y_train = train_target[train_indices]

X_val = train_sentence[val_indices]
Y_val = train_target[val_indices]
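
Because the split is random, results change from run to run. If a repeatable split is wanted, one option is to pass a seeded generator to `torch.randperm`. This is a minimal sketch; the function name `split_indices` and the seed value are assumptions.

```python
import torch

def split_indices(n_samples, val_fraction=0.2, seed=0):
    # A dedicated generator makes the permutation repeatable
    # without changing the global random state
    g = torch.Generator().manual_seed(seed)
    shuffled = torch.randperm(n_samples, generator=g)
    n_val = int(val_fraction * n_samples)
    # Slicing from the front keeps the training set intact even when n_val is 0
    return shuffled[n_val:], shuffled[:n_val]

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```

Calling `split_indices` again with the same seed returns the same indices, which makes validation losses comparable across experiments.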

Training the model

def training_loop(n_epochs, optimizer, model, loss_fn, X_train, Y_train, X_val, Y_val):
    for epoch in range(1, n_epochs + 1):
        train_pre = model(X_train)
        train_loss = loss_fn(train_pre, Y_train)

        # The validation loss is only monitored, so no gradients are needed
        with torch.no_grad():
            val_pre = model(X_val)
            val_loss = loss_fn(val_pre, Y_val)

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if epoch == 1 or epoch % 500 == 0:
            print("Epoch %d loss is %.4f  val_loss is %.4f" % (epoch, train_loss.item(), val_loss.item()))

The model, loss function, and optimizer are set up as follows:

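from torch import nn

class LogistRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Input size is the padded sentence length (len_max); output is the 5 sentiment classes
        self.fc = nn.Linear(len_max, 5)

    def forward(self, x):
        out = self.fc(x)
        # Squash each class score into (0, 1) to compare against the one-hot target
        out = torch.sigmoid(out)
        return out

loss_fn = nn.MSELoss()
model = LogistRegression()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)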
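
The setup above pairs a sigmoid output with a mean-squared-error loss on one-hot targets. The more conventional formulation of softmax regression is a linear layer that outputs raw logits, trained with `nn.CrossEntropyLoss`, which applies log-softmax internally and takes integer class labels rather than one-hot vectors. The sketch below shows that variant on synthetic random data; the class name `SoftmaxRegression`, the data, and the hyperparameters are assumptions for illustration, not the author's configuration.

```python
import torch
from torch import nn

torch.manual_seed(0)

class SoftmaxRegression(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.fc = nn.Linear(n_features, n_classes)

    def forward(self, x):
        # Return raw logits; CrossEntropyLoss applies log-softmax itself
        return self.fc(x)

# Synthetic stand-in data: 200 samples, 30 features, 5 classes
X = torch.randn(200, 30)
y = torch.randint(0, 5, (200,))  # integer class labels, not one-hot

model = SoftmaxRegression(30, 5)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

first_loss = None
for epoch in range(1, 201):
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()

with torch.no_grad():
    pred = model(X).argmax(dim=1)  # predicted class per sample
    train_acc = (pred == y).float().mean().item()
print(first_loss, loss.item(), train_acc)
```

With this loss, the one-hot `train_target` is not needed; the `Sentiment` column can be used directly as a tensor of integer labels, and taking `argmax` over the logits gives the predicted class for the accuracy metric.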

Project link: FudanNLP/nlp-beginner (NLP getting-started tutorial), https://github.com/FudanNLP/nlp-beginner