任务一：基于机器学习的文本分类

最新推荐文章于 2023-06-28 17:29:04 发布

Lucius_Keep_Going!

最新推荐文章于 2023-06-28 17:29:04 发布

阅读量3.3k

点赞数 2

分类专栏：机器学习

本文链接：https://blog.csdn.net/weixin_43125502/article/details/100053734

版权

机器学习专栏收录该内容

13 篇文章 2 订阅

订阅专栏

NLP小白入门，在实现复旦NLP组的新生入门任务，在此做一个记录。
废话不多说，直接上代码。注释掉的部分是没有用到批处理时的代码，可忽略。
https://github.com/Lucius888/nlp-beginner

1.数据处理

数据是来源于国外网站的电影评论，数据集kaggle上都有，导入进来之后。对每条评论进行分词，标准化处理（虽然数据集好像已经是被清洗好了的）。处理好了之后使用词袋模型（BOW）形成词袋，然后对数据进行one-hot。这是数据处理的流程。
细节一：分词常用的俩个函数：CountVectorizer，TfidfVectorizer
CountVectorizer：只考虑词汇在文本中出现的频率
TfidfVectorizer：除了考量某词汇在文本出现的频率，还关注包含这个词汇的所有文本的数量
能够削减高频没有意义的词汇出现带来的影响, 挖掘更有意义的特征

CountVectorizer(input='content', encoding='utf-8',  decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, 
token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

上面给出了所有的参数，两个函数的参数好像是一样的。其中只讲我觉得最常用的三四个，当然还有设置一些平滑什么的，有兴趣可以继续研究：

stop_words=None#选择是否使用停用词，啥是停用词？自行谷歌
ngram_range=(1, 1)#选择使用的Ngram模型，默认unigram
 max_df=int/float #取值为int时，代表该词如果出现的次数超过int值（when float:或者超过百分之多少）时，那么这个词就不会被当作关键词
 min_df=int/float #取值为int时，代表该词如果出现的次数少于int值（when float:或者低于百分之多少）时，那么这个词就不会被当作关键词

除了其分词作用，其还有其他几个属性将作用，在此拿CountVectorizer为例：

属性表	作用
vocabulary_	词汇表；字典型
get_feature_names()	所有文本的词汇；列表型
stop_words_	返回停用词表

方法表	作用
fit_transform(X)	词汇表；字典型
fit(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])	Learn the vocabulary dictionary and return term-document matrix.

最常用的就是我在代码中的部分：

#实例化，并设置参数
Vectorizer = CountVectorizer(max_df=0.95, min_df=5,stop_words='english')#去除停用词效果确实好了一点点
# (a,b),（ 单词所在得句子，单词所在词袋中得位置）出现的次数
train_CountVectorizer = Vectorizer.fit_transform(train['Phrase'])
#形成词袋，注意是字典
train_bag = Vectorizer.vocabulary_
#转换为one-hot,注意我用的是线性分类，之后RNN/CNN都是索引位置
#其实我觉得索引就是NNLM,但是我还没试过，有待实验
train_one_hot = train_CountVectorizer.toarray()

2.形成数据集

一般训练集：验证集：测试集=7：2：1
设置划分的比例进行切割就可以了。注：我在代码里面没有设置验证集。

# 划分数据集
split_idx0 = int(len(train) * 0.2)
split_idx1 = int(len(train) * 0.3)
train_x, test_x = train_one_hot[:split_idx0], train_one_hot[split_idx0:split_idx1]
train_y, test_y = labels[:split_idx0], labels[split_idx0:split_idx1]  # 此时还不是tensor

然后使用TensorDataset/DataLoader这一对函数了，具体用法在此不做解释。

# create Tensor datasets
train_data = TensorDataset(torch.FloatTensor(train_x), torch.LongTensor(train_y))  # 构建数据，（数据，标签）
test_data = TensorDataset(torch.FloatTensor(test_x), torch.LongTensor(test_y))
# dataloaders
batch_size = 64
# make sure the SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

3.搭建网络

随便使用的线性模型大了一个三层网络。在此还有几点想留在这里
在这里插入图片描述
如果要GPU训练不要忘记把模型移到GPU
设置网络参数，输出是五种类型，注意标签不是onehot

# %%
# 构建模型
class Linear_Classfy(nn.Module):  # inheriting from nn.Module!

    def __init__(self):
        super(Linear_Classfy, self).__init__()
        self.linear0 = nn.Linear(len(train_bag), 256)
        self.linear1 = nn.Linear(256, 128)
        self.linear2 = nn.Linear(128, 5)

    def forward(self, x):
        x = self.linear0(x)
        x = self.linear1(x)
        out = self.linear2(x)
        # out = F.softmax(x)#交叉熵损失函数自带sofmax
        return out


model = Linear_Classfy()  # vocab_size, num_labels
model.cuda()
print(model)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.2)

4.开始训练，测试

训练数据要移动到GPU
注意数据类型，一般程序保存请先检查数据类型是否符合要求
要注意显示训练情况，很惭愧我并没有很好的展示出来，要改进

# %%
for epoch in range(1):
    for inputs, labels in train_loader:
        x = inputs.cuda()
        target = labels.cuda()
        out = model(x)
        loss = loss_function(out, target)  # must be (1. nn output, 2. target), the target label is NOT one-hotted
        optimizer.zero_grad()  # clear gradients for next train
        loss.backward()  # backpropagation, compute gradients
        optimizer.step()

模型训练好了之后可以进行保存，调用，测试。

# 保存和加载整个模型
torch.save(model_object, 'model.pkl')
model = torch.load('model.pkl')

# 仅保存和加载模型参数(推荐使用)
torch.save(model_object.state_dict(), 'params.pkl')
model_object.load_state_dict(torch.load('params.pkl'))

最后就是进行测试了，在测试的时候还碰到了很多小细节，感觉自己的python基本功不行，要继续努力啊！

最后

 accuracy:  0.49769319492502884

Lucius_Keep_Going!

关注

2
点赞
踩
12

收藏

觉得还不错? 一键收藏
8
评论
任务一：基于机器学习的文本分类

NLP小白入门，在实现复旦NLP组的新生入门任务，在此做一个记录。废话不多说，直接上代码。注释掉的部分是没有用到批处理时的代码，可忽略。1.数据处理数据是来源于国外网站的电影评论，数据集kaggle上都有，导入进来之后。对每条评论进行分词，标准化处理（虽然数据集好像已经是被清洗好了的）。处理好了之后使用词袋模型（BOW）形成词袋，然后对数据进行one-hot。这是数据处理的流程。细节一：分...
复制链接

扫一扫