Deep-Learning-Based Text Classification 1: FastText

This continues the machine-learning-based text classification post (last time I used CountVectorizer + RidgeClassifier and TF-IDF + RidgeClassifier).
# tfidf+xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import xgboost as xgb
!pip install xgboost --user
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Looking in indexes: http://yum.tbsite.net/pypi/simple/
Requirement already satisfied: xgboost in /data/nas/workspace/envs/python3.6/site-packages (1.1.1)
Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.3.3)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.16.0)

Generate train_data, valid_data, and test_data

train_data = pd.read_csv('./data/train_set.csv',sep='\t')
test_data = pd.read_csv('./data/test_a.csv',sep='\t')
all_text =  pd.concat([train_data['text'],test_data['text']],axis = 0)
%%time
# Same parameters as last time; they can be tuned later with cross-validation.
tfidf = TfidfVectorizer(ngram_range=(1,3),max_features = 3000,analyzer='word')
word_vectorizer = tfidf.fit(all_text)
train_word_features = word_vectorizer.transform(train_data['text'])
test_word_features = word_vectorizer.transform(test_data['text'])
CPU times: user 20min 50s, sys: 10.3 s, total: 21min
Wall time: 21min
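
The comment above mentions tuning the vectorizer parameters with cross-validation later. A minimal sketch of how that might look with `GridSearchCV` follows. The toy corpus, the labels, the grid values, and the names `toy_texts`, `toy_labels`, `pipeline`, and `search` are all illustrative assumptions, not the competition data or the author's settings. Note one difference from the post: wrapping the vectorizer in a `Pipeline` refits TF-IDF inside each fold rather than fitting it once on train plus test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two classes with six short documents each.
toy_texts = [
    "aa bb cc", "aa bb dd", "aa cc dd", "bb aa cc", "aa aa bb", "cc aa bb",
    "xx yy zz", "xx yy ww", "xx zz ww", "yy xx zz", "xx xx yy", "zz xx yy",
]
toy_labels = [0] * 6 + [1] * 6

# Vectorizer and classifier chained so both are refit on every CV fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; real values would be far larger (e.g. max_features=3000).
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 3)],
    "tfidf__max_features": [5, 20],
}

# Score with macro F1, the same metric used for validation in the post.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the full 200,000-document training set each grid point repeats the 21-minute vectorization per fold, so a small grid or a subsample is probably the practical starting point.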
# Quick look at the training-set features: a sparse matrix with max_features columns
print(train_word_features.shape,test_word_features.shape)
(200000, 3000) (50000, 3000)
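
To make the shapes above concrete, here is a small sketch on a made-up corpus. The documents, the name `toy_docs`, and the value `max_features=5` are assumptions for illustration only. It shows that `TfidfVectorizer` returns a SciPy sparse matrix with one row per document and at most `max_features` columns.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of space-separated tokens (not the real data).
toy_docs = [
    "aa bb cc aa",
    "bb cc dd",
    "cc dd ee ff",
]

# Same settings as in the post, except max_features is shrunk to 5.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5, analyzer="word")
toy_features = toy_tfidf.fit_transform(toy_docs)

# Rows are documents; columns are the n-grams kept after the max_features cap.
print(toy_features.shape)
print(sparse.issparse(toy_features))
print(sorted(toy_tfidf.vocabulary_))
```

The corpus contains more than five distinct 1- to 3-grams, so the cap applies and only the most frequent ones are kept as columns.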
import warnings
warnings.filterwarnings('ignore')
x_trn,x_valid,y_trn,y_valid = train_test_split(train_word_features,train_data['label'],test_size= 0.1,random_state=2020)
%%time
clf = LogisticRegression()
clf.fit(x_trn,y_trn)
y_pred = clf.predict(x_valid)
print('valid_score', f1_score(y_valid, y_pred, average='macro'))
valid_score 0.9158974462371058
CPU times: user 5min 19s, sys: 2min 10s, total: 7min 29s
Wall time: 4min 44s
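
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores. Per-class F1 is unchanged when the true and predicted labels are swapped, so the argument order passed to `f1_score` does not alter a macro-averaged value. The sketch below is a plain-Python illustration of that definition; the function `macro_f1` and the toy labels are hypothetical and are not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

print(macro_f1(toy_true, toy_pred))
print(macro_f1(toy_pred, toy_true))  # same value with the arguments swapped
```

Swapping the arguments only exchanges false positives with false negatives, and F1 = 2TP / (2TP + FP + FN) treats the two symmetrically. This symmetry does not hold for precision or recall on their own.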
df = pd.DataFrame()
df['label'] = clf.predict(test_word_features)
df.to_csv