NLP Beginner Task 1: Text Classification with Machine Learning

This series records my work through the nlp-beginner tasks from Professor Qiu Xipeng of Fudan University. Professor Qiu is a leading figure in NLP, and this project makes an excellent entry point to the field. This post covers the first task; the project link is given at the end of the article.

Task objective

Implement text classification based on logistic/softmax regression.

Dataset: Classify the sentiment of sentences from the Rotten Tomatoes dataset (the Kaggle competition "Sentiment Analysis on Movie Reviews").

Dataset description:

Submissions are evaluated on classification accuracy (the percent of labels that are predicted correctly) for every parsed phrase. The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
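
Since the metric is plain accuracy, it can be computed in a couple of lines (a minimal sketch; `predicted` and `actual` are hypothetical label lists):

def accuracy(predicted, actual):
    """Percent of labels predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)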

Implementation

Reading the data

Read in the data. The training set has four columns; the last column is the sentiment label, with five classes.

import pandas as pd

df_train = pd.read_csv('../input/sentiment-analysis-on-movie-reviews/train.tsv.zip', sep='\t')
df_test = pd.read_csv('../input/sentiment-analysis-on-movie-reviews/test.tsv.zip', sep='\t')
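
A quick peek at the data (the columns are PhraseId, SentenceId, Phrase, and, in the training set only, Sentiment):

df_train.head()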


Data preprocessing

The first step is to clean the text: strip non-letter characters, tokenize, remove stopwords, and lemmatize.

import re
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_sentences(df):
    reviews = []
    stop_words = set(stopwords.words('english'))  # build once, not per phrase
    lem = WordNetLemmatizer()
    for sent in tqdm(df['Phrase']):
        # Keep letters only
        text = re.sub('[^a-zA-Z]', ' ', sent)

        # Tokenize (lower-cased)
        words = word_tokenize(text.lower())

        # Remove stopwords
        new_words = [word for word in words if word not in stop_words]

        # Lemmatize the words back to their base form
        lem_words = [lem.lemmatize(w) for w in new_words]

        reviews.append(lem_words)
    return reviews

train_sentences = clean_sentences(df_train)
test_sentences = clean_sentences(df_test)
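
For example, a phrase such as "A series of escapades demonstrating the adage" should come out roughly as follows (output shown for illustration):

demo = pd.DataFrame({'Phrase': ["A series of escapades demonstrating the adage"]})
print(clean_sentences(demo))
# [['series', 'escapade', 'demonstrating', 'adage']]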

The targets are also one-hot encoded:

import torch

train_target = torch.zeros(df_train["Sentiment"].shape[0], 5)
train_target.scatter_(dim=1, index=torch.tensor(df_train["Sentiment"].values).unsqueeze(1), value=1)
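
To see what scatter_ is doing, here is the same pattern on a tiny tensor: for each row, it writes a 1 into the column given by that row's label.

labels = torch.tensor([2, 0, 4])
one_hot = torch.zeros(3, 5)
one_hot.scatter_(dim=1, index=labels.unsqueeze(1), value=1)
# tensor([[0., 0., 1., 0., 0.],
#         [1., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 1.]])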

Vectorizing the sentences

First build a vocabulary of all the words, then encode each sentence as a vector of word indices, using the length of the longest sentence as the vector dimension.

vocab = set()
len_max = 0
for sent in tqdm(train_sentences):
    vocab.update(sent)
    if len(sent) > len_max:
        len_max = len(sent)
print(len(vocab), len_max)

word2index_list = {word: i for i, word in enumerate(vocab)}

def coding_sentence(sentences):
    # Rows are zero-initialized, so shorter sentences are implicitly zero-padded.
    # Caveat: index 0 also belongs to a real word, so padding and that word collide.
    sentences_coding = torch.zeros(len(sentences), len_max)
    for i, line in enumerate(sentences):
        for j, word in enumerate(line):
            sentences_coding[i][j] = word2index_list[word]
    return sentences_coding

train_sentence = coding_sentence(train_sentences)
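
Note that coding_sentence only knows the training vocabulary, so running it on test_sentences would raise a KeyError on unseen words, and index 0 doubles as both padding and a real word. A possible fix (my sketch, not part of the original) reserves 0 for padding/unknown:

word2index_safe = {word: i + 1 for i, word in enumerate(vocab)}  # 0 reserved for pad/unknown

def coding_sentence_safe(sentences):
    coded = torch.zeros(len(sentences), len_max)
    for i, line in enumerate(sentences):
        for j, word in enumerate(line[:len_max]):       # truncate overlong sentences
            coded[i][j] = word2index_safe.get(word, 0)  # unseen words map to 0
    return coded

# Both splits must use the same mapping:
train_sentence = coding_sentence_safe(train_sentences)
test_sentence = coding_sentence_safe(test_sentences)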

n-gram

As an alternative to index encoding, we can build a binary bag-of-n-grams representation (restricted here to the first 1000 phrases to keep the feature matrix small):

dimension = 2  # n-gram order; left undefined in the original, 2 (unigrams + bigrams) is assumed here

phrases = df_train["Phrase"]
data = phrases[:1000]

# Normalize and tokenize the phrases
dict_words = dict()
phrase_list = []
for line in data:
    line = re.sub(r'[^a-zA-Z0-9]+', " ", line)
    line = line.lower().split()
    phrase_list.append(line)
data = phrase_list

# Assign an index to every n-gram (n = 1..dimension)
for d in range(1, dimension + 1):
    for words in data:
        for i in range(len(words) - d + 1):
            temp = "_".join(words[i:i + d])
            if temp not in dict_words:
                dict_words[temp] = len(dict_words)

# Binary bag-of-n-grams matrix: 1 if the n-gram occurs in the phrase
length = len(dict_words)
train_matrix = torch.zeros(len(data), length)
for d in range(1, dimension + 1):
    for i, words in enumerate(data):
        for j in range(len(words) - d + 1):
            temp = "_".join(words[j:j + d])
            train_matrix[i][dict_words[temp]] = 1
train_matrix
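
The same kind of binary bag-of-n-grams can be produced with scikit-learn, which is handy as a cross-check (an equivalent alternative, not the original code):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, lowercase=True)
train_matrix_sk = vectorizer.fit_transform(df_train["Phrase"][:1000])
print(train_matrix_sk.shape)  # (1000, vocabulary size)

The shapes need not match exactly: CountVectorizer uses its own tokenizer and drops single-character tokens by default.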

Splitting the dataset

Split the data, holding out 20% as a validation set.

n_samples = train_sentence.shape[0]
n_val = int(0.2 * n_samples)
shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices  # the split is random

X_train = train_sentence[train_indices]
Y_train = train_target[train_indices]

X_val = train_sentence[val_indices]
Y_val = train_target[val_indices]
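
A quick sanity check on the split (the example counts below assume the full Kaggle training set of 156,060 phrases):

assert len(train_indices) + len(val_indices) == n_samples
print(X_train.shape[0], X_val.shape[0])  # e.g. 124848 31212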

Training the model

def training_loop(n_epochs, optimizer, model, loss_fn, X_train, Y_train, X_val, Y_val):
    for epoch in range(1, n_epochs + 1):
        # Full-batch gradient descent: one forward/backward pass per epoch
        train_pre = model(X_train)
        train_loss = loss_fn(train_pre, Y_train)

        with torch.no_grad():  # validation loss doesn't need gradients
            val_pre = model(X_val)
            val_loss = loss_fn(val_pre, Y_val)

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if epoch == 1 or epoch % 500 == 0:
            print("Epoch %d loss is %.4f  val_loss is %.4f" % (epoch, train_loss.item(), val_loss.item()))

Model and hyperparameters:

import torch.nn as nn

class LogistRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Input size must equal the sentence-vector length (len_max)
        self.fc = nn.Linear(len_max, 5)

    def forward(self, x):
        out = self.fc(x)
        out = torch.sigmoid(out)
        return out

loss_fn = nn.MSELoss()
model = LogistRegression()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
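
Putting it together: run the loop, then measure validation accuracy by taking the argmax over the five outputs (the epoch count below is illustrative). The task statement also allows softmax regression; a common variant would output raw logits and use nn.CrossEntropyLoss on integer labels instead of sigmoid + MSE on one-hot targets.

training_loop(
    n_epochs=3000,
    optimizer=optimizer,
    model=model,
    loss_fn=loss_fn,
    X_train=X_train, Y_train=Y_train,
    X_val=X_val, Y_val=Y_val,
)

with torch.no_grad():
    val_pred = model(X_val).argmax(dim=1)
    val_true = Y_val.argmax(dim=1)  # recover class labels from the one-hot targets
    print("val accuracy: %.4f" % (val_pred == val_true).float().mean())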

Project link: GitHub - FudanNLP/nlp-beginner (a hands-on NLP tutorial): https://github.com/FudanNLP/nlp-beginner

 
