Task 3 finally gets to the modeling part.
Text Representation Methods
One-Hot
Each word is assigned an index; its vector has a 1 at that index and a 0 everywhere else.
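As a minimal sketch (the tiny vocabulary below is made up for illustration), a one-hot vector has the same length as the vocabulary and a single 1 at the word's index:
# Hypothetical five-word vocabulary, just for illustration.
vocab = {'this': 0, 'is': 1, 'the': 2, 'first': 3, 'document': 4}
def one_hot(word, vocab):
    # Vector of zeros with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec
print(one_hot('first', vocab))     # [0, 0, 0, 1, 0]
print(one_hot('document', vocab))  # [0, 0, 0, 0, 1]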
Bag of Words
The bag-of-words model, also called count vectors, represents each document by how many times each word or character appears in it.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
vectorizer = CountVectorizer()
# Each row is one document; each column is the count of one vocabulary token.
res = vectorizer.fit_transform(corpus).toarray()
print(res)
# Note: on scikit-learn >= 1.0 this method is get_feature_names_out().
print(vectorizer.get_feature_names())
Output:
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
As shown, the features are ordered alphabetically, and each value is the number of times the corresponding token appears in that document.
N-gram
In an n-gram, n is the number of tokens combined into a single feature, i.e., every n consecutive words are counted together as one unit.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# ngram_range=(2, 2) keeps only bigrams: every two consecutive words form one feature.
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())
print(X2.toarray())
Output:
['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
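To illustrate the parameter, here is a small sketch on the same toy corpus using ngram_range=(1, 2), which extracts unigrams and bigrams together:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
# ngram_range=(1, 2): single words and two-word phrases are both counted.
vectorizer12 = CountVectorizer(analyzer='word', ngram_range=(1, 2))
X12 = vectorizer12.fit_transform(corpus)
# get_feature_names_out() is the current API (get_feature_names() on older scikit-learn).
print(vectorizer12.get_feature_names_out())
print(X12.toarray())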
TF-IDF
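TF-IDF weights each term by its frequency within a document (TF) multiplied by its inverse document frequency across the corpus (IDF), so words that appear in almost every document are down-weighted while words specific to a few documents are boosted. A minimal sketch on the same toy corpus, using scikit-learn's TfidfVectorizer with default parameters:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# Each row is an L2-normalized vector of TF-IDF weights instead of raw counts.
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))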
Machine Learning-Based Text Classification
Count Vectors + RidgeClassifier: 0.74 (validation macro F1)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the first 15,000 rows of the training set (tab-separated).
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# Bag-of-words features, keeping only the 3,000 most frequent tokens.
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

# Train on the first 10,000 samples, validate on the remaining 5,000.
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
TF-IDF + RidgeClassifier: 0.87 (validation macro F1)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# TF-IDF features over unigrams, bigrams and trigrams, capped at 3,000 features.
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# Same split as above: train on 10,000 samples, validate on 5,000.
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))