Datawhale - News Text Classification - Task3

Task3 moves into the model training stage and surveys several text representation methods: one-hot encoding, Bag of Words, N-grams, and TF-IDF. Combining Count Vectors with RidgeClassifier yields a macro F1 score of 0.74; switching to TF-IDF with RidgeClassifier raises it to 0.87.

Task3 finally reaches the modeling stage.

 

Text Representation Methods

One hot

Each word is assigned an index; its vector has a 1 at that index and 0 everywhere else.
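A minimal sketch of this idea, using a toy three-word vocabulary (the words here are illustrative, not from the competition data):

```python
# Toy vocabulary: each word gets an integer index.
words = ['news', 'sports', 'finance']
word2idx = {w: i for i, w in enumerate(words)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index.
    vec = [0] * len(word2idx)
    vec[word2idx[word]] = 1
    return vec

print(one_hot('sports'))  # [0, 1, 0]
```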

Bag of Words

The Bag-of-Words model, also known as Count Vectors, represents each document by the number of times each character or word occurs in it.

from sklearn.feature_extraction.text import CountVectorizer 

corpus = ['This is the first document.', 'This document is the second document.', 
          'And this is the third one.', 'Is this the first document?'] 

vectorizer = CountVectorizer() 
res = vectorizer.fit_transform(corpus).toarray()
print(res)
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2


Output:

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

As the output shows, the feature columns are ordered alphabetically, and each value is the number of times the corresponding token occurs in that document.

N-gram

In an N-gram model, n is the number of adjacent tokens combined into one feature; that is, every n consecutive words are counted together as a single unit.

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(X2.toarray())


Output:

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]

ngram_range  (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) builds on the counts above: a term's frequency within a document (TF) is weighted by how rare the term is across the corpus (IDF), so words that appear in almost every document, such as stop words, are down-weighted. A common formulation is TF-IDF(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
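A minimal sketch of the textbook TF-IDF formulation on a toy tokenized corpus (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its values differ slightly):

```python
import math

# Toy corpus: each document is a list of tokens (whitespace-tokenized here).
docs = [
    'this is the first document'.split(),
    'this document is the second document'.split(),
]

def tf_idf(term, doc, docs):
    # Term frequency: occurrences of the term divided by document length.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: log of (corpus size / docs containing term).
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# 'document' appears in both docs, so its IDF is log(2/2) = 0 and TF-IDF is 0.
print(tf_idf('document', docs[0], docs))  # 0.0
# 'second' appears in only one doc, so it gets a positive weight.
print(round(tf_idf('second', docs[1], docs), 4))
```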

Text Classification with Machine Learning

Count Vectors + RidgeClassifier: 0.74

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the first 15,000 rows of the competition training set.
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# Keep the 3,000 most frequent tokens as count features.
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

# Train on the first 10,000 samples, validate on the remaining 5,000.
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

TF-IDF + RidgeClassifier: 0.87

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('train_set.csv', sep='\t', nrows=15000)

# TF-IDF features over unigrams, bigrams, and trigrams, capped at 3,000 features.
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

# Same split as above: first 10,000 samples for training, rest for validation.
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

 
