Watermelon Book: NLP Competition

yxyibb

Category: Algorithm Review. Tag: Watermelon Book

Article link: https://blog.csdn.net/u012835414/article/details/93388786

Code

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('./train_set.csv')
df_test = pd.read_csv('./test_set.csv')
df_train.drop(columns=['article', 'id'], inplace=True)  # drop the columns that are not used as features
df_test.drop(columns=['article'], inplace=True)  # inplace=True modifies the original DataFrame; inplace=False returns a new DataFrame

vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
# ngram_range=(1, 2): use single tokens (unigrams) and pairs of adjacent tokens (bigrams) as features
vectorizer.fit(df_train['word_seg'])
x_train = vectorizer.transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
y_train = df_train['class'] - 1
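# fit_transform works in two steps. First it learns the parameters of the transformation from the data (for standardization, the mean and variance; for CountVectorizer, the vocabulary). Then it transforms the data using those parameters.
# transform performs only the second step: the transformation and its parameters are already fixed.
# So fit (or fit_transform) is applied to the training data, while the test data only goes through transform.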


lg = LogisticRegression(C=4, dual=True, solver='liblinear')  # dual=True is only supported by the liblinear solver with an L2 penalty
lg.fit(x_train, y_train)

y_test = lg.predict(x_test)

df_test['class'] = y_test.tolist()
df_test['class'] = df_test['class'] + 1
df_result = df_test.loc[:,['id', 'class']]
df_result.to_csv('./result.csv', index=False)

print('Done')
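
The difference between fit and transform described in the comments above can be checked on a toy corpus. This is a minimal sketch, not the competition data: the two small document lists are made up and stand in for df_train['word_seg'] and df_test['word_seg']. The vocabulary is learned from the training documents only, so a word that appears only in the test documents gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the word_seg columns
train_docs = ["red green", "green blue"]
test_docs = ["blue yellow"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                   # step 1: learn the vocabulary from the training data only
x_train = vectorizer.transform(train_docs)   # step 2: count terms using that vocabulary
x_test = vectorizer.transform(test_docs)     # the test data reuses the same vocabulary

vocab = vectorizer.vocabulary_               # term -> column index
train_counts = x_train.toarray().tolist()
test_counts = x_test.toarray().tolist()

print(vocab)
print(train_counts)
print(test_counts)   # 'yellow' was never seen during fit, so it is not counted
```

Calling fit on the test documents instead would build a different vocabulary, and the columns of x_test would no longer line up with the columns the classifier was trained on.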

Key points

1. CountVectorizer

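CountVectorizer is one of scikit-learn's text feature extractors and belongs to the common family of count-based features. For each training document, it considers only how many times each vocabulary term occurs in that document.
The class is sklearn.feature_extraction.text.CountVectorizer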
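
A minimal sketch of this counting behaviour, using two made-up English sentences (the documents and variable names are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana cherry"]  # made-up documents

vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: one row per document, one column per term

vocab = vec.vocabulary_       # term -> column index, assigned in alphabetical order
counts = X.toarray()          # dense copy, only sensible for small examples

print(vocab)
print(counts)
```

Each row holds the counts for one document: "apple" occurs twice in the first document and not at all in the second. Word order inside a document is not recorded.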

2. CountVectorizer parameters in detail

CountVectorizer(input='content', encoding='utf-8',  decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, 
token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

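CountVectorizer has many parameters, which cover three processing steps: preprocessing, tokenizing, and n-gram generation.
The parameters most often set are ngram_range, max_df, min_df, and max_features; the right values depend on the task.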
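
The three steps can be inspected one at a time. The sketch below relies on the CountVectorizer methods build_preprocessor, build_tokenizer, and build_analyzer; the sample text is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
text = "The Quick Fox!"  # made-up sample text

preprocess = vec.build_preprocessor()  # step 1: lowercasing (and accent stripping, if enabled)
tokenize = vec.build_tokenizer()       # step 2: split the text using token_pattern
analyze = vec.build_analyzer()         # steps 1 to 3 combined, including n-gram generation

preprocessed = preprocess(text)
tokens = tokenize(preprocessed)
ngrams = analyze(text)

print(preprocessed)
print(tokens)
print(ngrams)
```

With ngram_range=(1, 2) the analyzer emits the single tokens first and then every pair of adjacent tokens joined by a space.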

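Parameter	Description
input	Usually left at the default 'content'; can also be set to 'filename' or 'file'
encoding	The default 'utf-8' is usually fine; raw documents given as bytes or files are decoded with this encoding
decode_error	Defaults to 'strict', which raises UnicodeDecodeError on bytes that cannot be decoded; 'ignore' skips them, and 'replace' substitutes the Unicode replacement character
strip_accents	Defaults to None; can be set to 'ascii' or 'unicode' to remove accents from the raw documents during preprocessing ('ascii' is faster but only handles characters with a direct ASCII mapping; 'unicode' is slower but handles any character)
analyzer	Usually left at the default; can be a string ('word', 'char', or 'char_wb') or a callable such as a function
preprocessor	None or a callable
tokenizer	None or a callable
ngram_range	A tuple (min_n, max_n) giving the range of n-gram lengths to extract; for example, (1, 2) extracts single tokens and pairs of adjacent tokens
stop_words	Stop words to remove: 'english' uses the built-in English list, a list supplies custom stop words, and None (the default) removes none. With None, setting max_df to a value in the range (0.7, 1.0) can filter corpus-specific stop words automatically, based on document frequency
lowercase	Convert all characters to lowercase before tokenizing (default True)
token_pattern	Regular expression that defines a token; only used when analyzer == 'word'. The default pattern selects tokens of two or more alphanumeric characters; punctuation is treated as a token separator and never becomes a token
max_df	A float in [0.0, 1.0] or an int; defaults to 1.0. It acts as a threshold when the vocabulary is built: a term whose document frequency is higher than max_df is left out. A float is read as a proportion of the documents in the corpus, an int as an absolute number of documents. Ignored if vocabulary is given
min_df	Like max_df, except that a term whose document frequency is lower than min_df is left out; defaults to 1
max_features	Defaults to None; if set to an int, terms are ranked by term frequency across the corpus in descending order and only the top max_features are kept
vocabulary	Defaults to None, in which case the vocabulary is built from the input documents; can also be a mapping (such as a dict from term to feature index) or an iterable of terms
binary	Defaults to False. A term may occur n times in a document; with binary=True every non-zero count is set to 1, which is useful for discrete probabilistic models that take boolean inputs
dtype	Numeric type of the document-term matrix returned by fit_transform() or transform()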
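
The effect of min_df, max_df, max_features, and binary is easiest to see on a tiny corpus. The sketch below uses four made-up documents and a small helper, vocab, that exists only for this illustration. In this corpus "cat" occurs in all four documents, "dog" in two, and "bird" and "fish" in one each.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat dog", "cat bird", "cat dog fish", "cat cat"]  # made-up documents

def vocab(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted terms."""
    return sorted(CountVectorizer(**params).fit(docs).vocabulary_)

all_terms = vocab()                  # no filtering
frequent = vocab(min_df=2)           # keep terms found in at least 2 documents
not_too_common = vocab(max_df=0.9)   # drop terms found in more than 90% of documents
top_one = vocab(max_features=1)      # keep only the most frequent term overall

plain_counts = CountVectorizer().fit_transform(docs).toarray()
binary_counts = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(all_terms, frequent, not_too_common, top_one)
print(plain_counts[3], binary_counts[3])  # last document, "cat cat"
```

min_df removes the rare terms, max_df removes the term that is in every document, and binary=True turns the count of 2 for "cat" in the last document into 1.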

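Attribute or method	Description
vocabulary_	The vocabulary, as a dict mapping each term to its feature index
get_feature_names_out()	Method that returns all vocabulary terms, ordered by feature index (named get_feature_names() before scikit-learn 1.0, where it returned a list)
stop_words_	The set of terms that were ignored while building the vocabulary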

Method	Description
fit(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])	Learn the vocabulary dictionary and return the document-term matrix.
transform(raw_documents)	Return the document-term matrix for the documents, using the vocabulary learned by fit.

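The input is a list of strings, one string per document, where each string has already been segmented into words separated by spaces. With input in this form, CountVectorizer works for Chinese text as well.
CountVectorizer converts the words in the texts into a count matrix through fit_transform: element a[i][j] is the number of times term j occurs in document i. get_feature_names_out() (get_feature_names() in scikit-learn versions before 1.0) lists all vocabulary terms, and toarray() shows the count matrix as a dense array.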
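
A minimal sketch with pre-segmented Chinese text, using two made-up sentences. One detail matters here: the default token_pattern only accepts tokens of two or more characters, so single-character words such as 我 are dropped unless the pattern is changed. The alternative pattern below is one possible choice, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, already segmented: words are separated by spaces
docs = ["我 喜欢 自然 语言 处理", "我 喜欢 机器 学习"]

# Default token_pattern: tokens need at least two characters
default_vec = CountVectorizer()
default_vec.fit(docs)
default_terms = set(default_vec.vocabulary_)

# Accept tokens of one or more characters so single-character words are kept
single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = single_char_vec.fit_transform(docs).toarray()
all_terms = set(single_char_vec.vocabulary_)

print(sorted(default_terms))
print(sorted(all_terms))
print(counts.shape)  # (number of documents, number of terms)
```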

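Setting a stop word list and handling Chinese documents
(1) Stop word configuration: the default, count_vec = CountVectorizer(stop_words=None), removes no stop words. For English text there is no need to build a list: count_vec = CountVectorizer(stop_words='english') removes the built-in English stop words.
(2) The result of count_vec.fit_transform(data) has the following format: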

print(count_vec.fit_transform(X_test))
 (index of the document in the input list, index of the term in the vocabulary)    count
  (0, 7)    2
  (0, 25)   2
  (0, 34)   2
  (0, 4)    1
  (0, 0)    1
  ......
print(count_vec.fit_transform(X_test).toarray())
[[1 1 1 1 1 0 0 2 1 0 1 1 1 0 1 0 0 0 1 0 1 1 0 3 1 2 0 0 1 0 0 1 1 1 2]
 [0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 3 0 1 4 0 0 1 1 1 1 1 0 1 1 1]]
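
Items (1) and (2) can be tried together on a small corpus. The sketch below uses two made-up English sentences with the built-in English stop word list, then lists the non-zero entries of the sparse result as (document index, term index) and count, which is the same information the printed sparse matrix shows.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = ["the cat sat on the mat", "the dog chased the cat"]  # made-up documents

count_vec = CountVectorizer(stop_words="english")
X = count_vec.fit_transform(data)  # SciPy sparse matrix, shape (n_documents, n_terms)
vocab = count_vec.vocabulary_

# List the non-zero entries as (document index, term index, count)
coo = X.tocoo()
entries = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
for doc_idx, term_idx, count in entries:
    print((doc_idx, term_idx), count)

dense = X.toarray()  # dense view of the same counts
print(dense)
```

Stop words such as "the" and "on" never enter the vocabulary, so they take up no column in either view.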

Attributes:
vocabulary_: a dict whose keys are terms and whose values are feature indices, for example:
com.furiousapps.haunt2: 57048
bale.yaowoo: 5025
asia.share.superayiconsumer: 4660
com.cooee.flakes: 38555
com.huahan.autopart: 67364
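The vocabulary is stored as an array of terms. Each key in vocabulary_ is a term and its value is that term's index in the array. The get_feature_names() method returns the array (in scikit-learn 1.0 and later the method is get_feature_names_out(), and get_feature_names() was removed in 1.2). Indexing into the array confirms the entries above: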
```
ipdb> count_vec.get_feature_names()[57048]
# returns u'com.furiousapps.haunt2'
ipdb> count_vec.get_feature_names()[5025]
# returns u'bale.yaowoo'
```
stop_words_: a set. The official documentation explains it well:
Terms that were ignored because they either:
occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
This attribute is meant for introspection, so that you can check which terms were dropped. It is safe to set stop_words_ to None before pickling.

Reference

Original article: https://blog.csdn.net/weixin_38278334/article/details/82320307
