Zero-Basics Introduction to NLP Competition - Task 3: Text Classification Based on Machine Learning

Method 1: Count Vectors + RidgeClassifier

# Count Vectors + RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('../data/train_set.csv', sep='\t', nrows=15000)

vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.74
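
To make the bag-of-words step concrete, here is a minimal sketch on a toy corpus. The three documents are invented for illustration (they only mimic the space-separated anonymized token ids of the competition data) and the `toy_` names are hypothetical, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: space-separated token ids, similar in shape to the anonymized data
toy_docs = [
    "900 3750 648 900",
    "3750 648 2465",
    "900 900 2465",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to a column index; list the tokens in column order
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# Each row is one document, each column is the count of one token in that document
print(toy_counts.toarray())
```

Note that the default tokenizer of CountVectorizer only keeps tokens of two or more characters, so single-character tokens would be dropped unless token_pattern is changed.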

Method 2: TF-IDF + RidgeClassifier

# TF-IDF +  RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('../data/train_set.csv', sep='\t', nrows=15000)

tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.87
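
As a small illustration of what ngram_range changes, the sketch below compares the vocabulary built with unigrams only against unigrams plus bigrams. The two toy documents and all variable names are hypothetical and serve only as an example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of space-separated token ids
toy_docs = ["900 3750 648", "3750 648 2465"]

# Unigrams only (the TfidfVectorizer default)
unigram_tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_docs)
# Unigrams plus bigrams of adjacent tokens
bigram_tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs)

unigram_features = sorted(unigram_tfidf.vocabulary_)
bigram_features = sorted(bigram_tfidf.vocabulary_)

print(len(unigram_features), unigram_features)
print(len(bigram_features), bigram_features)
```

The bigram vocabulary contains every unigram plus each pair of adjacent tokens, which is why max_features matters more as the n-gram range grows.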

Exercises:
Change the TF-IDF parameters and verify the accuracy

ngram_range  max_features  max_df   stop_words    f1_score
(1,3)        3000          default  default       0.8721
(1,3)        4000          default  default       0.8753
(1,3)        2000          default  default       0.8603
(1,3)        5000          default  default       0.8850
(1,4)        5000          default  default       0.8849
(1,2)        5000          default  default       0.8864
default      5000          default  default       0.8605
(1,2)        5000          0.8      default       0.8861
(1,2)        5000          1.2      default       0.8864
(1,2)        5000          default  900,3750,648  0.8895
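
A sweep like the one in the table can be scripted with a small helper. This is only a sketch: the helper name `score_tfidf_config` and the synthetic corpus are hypothetical. With the competition data you would presumably pass train_df['text'], train_df['label'].values and split=10000 instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score


def score_tfidf_config(texts, labels, split, **tfidf_kwargs):
    """Fit TF-IDF + RidgeClassifier on the first `split` rows, return macro F1 on the rest."""
    # As in the code above, the vectorizer is fitted on all texts (train + validation)
    features = TfidfVectorizer(**tfidf_kwargs).fit_transform(texts)
    clf = RidgeClassifier()
    clf.fit(features[:split], labels[:split])
    val_pred = clf.predict(features[split:])
    return f1_score(labels[split:], val_pred, average="macro")


# Hypothetical synthetic corpus: class 0 uses tokens 10-12, class 1 uses tokens 20-22
texts = ["10 11 12", "11 12 10", "12 10 11", "20 21 22", "21 22 20", "22 20 21"] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

# Parameter settings to compare (a subset of the kinds of rows in the table)
configs = [
    {"ngram_range": (1, 1), "max_features": 5000},
    {"ngram_range": (1, 2), "max_features": 5000},
    {"ngram_range": (1, 2), "max_features": 5000, "max_df": 0.8},
]

results = {}
for config in configs:
    results[str(config)] = score_tfidf_config(texts, labels, split=18, **config)
    print(config, round(results[str(config)], 4))
```

The synthetic corpus is trivially separable, so the scores it prints say nothing about the competition data. The point is only the structure of the loop.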

Use other machine learning models to complete training and validation

Logistic Regression
(Multinomial) Naive Bayes
Linear Support Vector Machine (f1 score: 0.8932985819206637)
Random Forest

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Classifiers that were tried (the list name `models` is illustrative)
models = [
    LinearSVC(),
    MultinomialNB(),
]


Classifier   f1 score
Linear SVM   0.8932
Naive Bayes  0.6468
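
For reference, the four model families listed above can be compared on the same TF-IDF features with a loop like the following. This is a sketch under assumptions: the synthetic corpus stands in for train_df, and the hyperparameters (max_iter, n_estimators, random_state) are illustrative choices, not the settings behind the scores reported here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical synthetic corpus standing in for train_df['text'] and train_df['label']
texts = ["10 11 12", "11 12 10", "12 10 11", "20 21 22", "21 22 20", "22 20 21"] * 5
labels = [0, 0, 0, 1, 1, 1] * 5
split = 18  # first 18 rows for training, the remaining 12 for validation

features = TfidfVectorizer(ngram_range=(1, 2), max_features=5000).fit_transform(texts)

# Candidate classifiers; all of them accept the sparse TF-IDF matrix directly
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(features[:split], labels[:split])
    val_pred = model.predict(features[split:])
    scores[name] = f1_score(labels[split:], val_pred, average="macro")
    print(name, round(scores[name], 4))
```

On the real data you would replace the synthetic corpus with the 15000-row sample and split at 10000, as in the other snippets.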

# TF-IDF + LinearSVC

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

train_df = pd.read_csv('data/train_set.csv', sep='\t', nrows=15000)
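# Single tokens are sometimes not enough as features, so adding token combinations helps.
# ngram_range=(1, 2) lets the vocabulary contain 1- and 2-token combinations
# (the (1, 3) setting used earlier allows up to 3).
# max_features limits the vocabulary size; stop_words filters out the given tokens.
# max_df: a float in [0.0, 1.0] (default 1.0) drops tokens that appear in more than that
# fraction of documents; min_df likewise drops tokens that appear in fewer than its fraction.
# When max_df is a positive integer, it is an absolute document count instead.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000, stop_words=["900", "3750", "648"])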
train_test = tfidf.fit_transform(train_df['text'])

clf = LinearSVC()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
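
To see what stop_words and max_df actually do to the vocabulary, here is a minimal sketch on a toy corpus. The documents and the helper `vocabulary` are hypothetical. The token "900" is deliberately placed in every document so that it is the one max_df removes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "900" appears in all 4 documents
toy_docs = ["900 3750 648", "900 3750 2465", "900 648 4464", "900 5176 2465"]


def vocabulary(**tfidf_kwargs):
    """Return the sorted vocabulary learned from toy_docs with the given settings."""
    return sorted(TfidfVectorizer(**tfidf_kwargs).fit(toy_docs).vocabulary_)


all_terms = vocabulary()
# Explicit stop words are removed regardless of how often they occur
without_stop_words = vocabulary(stop_words=["900", "3750", "648"])
# Float max_df: drop tokens that occur in more than 50% of the documents
max_df_ratio = vocabulary(max_df=0.5)
# Integer max_df: drop tokens that occur in more than 2 documents
max_df_count = vocabulary(max_df=2)

print(all_terms)
print(without_stop_words)
print(max_df_ratio)
print(max_df_count)
```

With 4 documents, max_df=0.5 and max_df=2 give the same threshold, so both settings drop only "900".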