Feature Extraction from Dictionary Data and Text

小浩子7号

Category: Machine Learning

Article link: https://blog.csdn.net/qq_41782791/article/details/115497280

I. Feature extraction from dictionary data

Workflow

1. Instantiate the vectorizer.

2. Call fit_transform on the data.

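from sklearn.feature_extraction import DictVectorizer

def dictvec():
    """
    Extract features from dictionary data.
    :return: None
    """
    # Instantiate; sparse=False returns a dense NumPy array instead of a sparse matrix.
    # The variable is not named `dict`, which would shadow the built-in type.
    vectorizer = DictVectorizer(sparse=False)
    # Fit on the records and transform them in one call
    data = vectorizer.fit_transform([
        {'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30},
    ])
    print(data)
    return None

if __name__ == "__main__":
    dictvec()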

[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]

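Dictionary extraction one-hot encodes the categorical city field: 北京, 上海, and 深圳 become 010, 100, and 001 respectively (the first three columns). The numeric temperature value is passed through unchanged as the last column.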
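To make the column layout concrete, here is a minimal pure-Python sketch of the same idea. It is not how DictVectorizer is implemented; the helper name one_hot_dicts and the "key=value" column naming are assumptions chosen for illustration. String values get one column each, numeric values keep a single column, and columns are sorted by name.

```python
def one_hot_dicts(records):
    """Encode dicts: string values become one-hot columns, numbers pass through."""
    # Collect column names: "key=value" for strings, plain "key" for numbers.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(names)
    index = {name: i for i, name in enumerate(columns)}

    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0   # mark the matching category
            else:
                row[index[key]] = float(value)       # copy the numeric value
        rows.append(row)
    return columns, rows


records = [
    {'city': '北京', 'temperature': 100},
    {'city': '上海', 'temperature': 60},
    {'city': '深圳', 'temperature': 30},
]
columns, rows = one_hot_dicts(records)
print(columns)  # ['city=上海', 'city=北京', 'city=深圳', 'temperature']
for row in rows:
    print(row)
```

Sorting the column names puts 上海 before 北京, which is why the first record (北京) has its 1 in the second column.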

II. Text feature extraction

2.1 English text

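from sklearn.feature_extraction.text import CountVectorizer   # text feature extraction API

def countvec():
    cv = CountVectorizer()
    # Count how often each word occurs in each document; single-character tokens are ignored
    data = cv.fit_transform(["hello Python", "hello Java"])
    # Vocabulary: each distinct word listed once (scikit-learn < 1.0 uses get_feature_names())
    print(cv.get_feature_names_out().tolist())
    print(data.toarray())  # convert the sparse matrix to a dense array

if __name__ == "__main__":
    countvec()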

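Result: each distinct word appears only once in the feature list, no matter how often it is repeated across the documents.

For example, in the row for the first document, the first column (hello) is 1, the second (java) is 0, and the third (python) is 1: hello and python each occur once in the first document, and java does not occur at all.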
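The counting step can be sketched without scikit-learn. The helper count_words below is hypothetical, and the regular expression is an assumption meant to mimic CountVectorizer's default tokenization (lowercase, tokens of two or more word characters).

```python
import re

# Assumed approximation of the default tokenizer: 2+ word characters per token.
TOKEN = re.compile(r"\b\w\w+\b")


def count_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    # The vocabulary lists each distinct word once, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Each row counts how often every vocabulary word occurs in one document.
    matrix = [[toks.count(term) for term in vocab] for toks in tokenized]
    return vocab, matrix


vocab, matrix = count_words(["hello Python", "hello Java"])
print(vocab)   # ['hello', 'java', 'python']
print(matrix)  # [[1, 0, 1], [1, 1, 0]]

# A repeated word still gets a single column, but its count goes up.
vocab2, matrix2 = count_words(["hello hello Java"])
print(vocab2, matrix2)  # ['hello', 'java'] [[2, 1]]
```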

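2.2 Segmenting Chinese text with jieba

Chinese text has no spaces between words, so first segment it with the third-party jieba library, convert the result to a list, and then join the words back into a space-separated string.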

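from sklearn.feature_extraction.text import CountVectorizer   # text feature extraction API
import jieba

def cutword():
    # Segment each of the three sentences into words (jieba.cut returns a generator)
    con1 = jieba.cut('R有很多自然语言处理的包')
    con2 = jieba.cut('但是大多是针对英文的')
    con3 = jieba.cut('中文来做NLP的包，经过长期探索')

    # Materialize the generators as lists
    contest1 = list(con1)
    contest2 = list(con2)
    contest3 = list(con3)
    # Join the words with spaces so that CountVectorizer can tokenize them
    c1 = " ".join(contest1)
    c2 = " ".join(contest2)
    c3 = " ".join(contest3)
    return c1, c2, c3

def hanzivec():
    """
    Count-based feature extraction for segmented Chinese text.
    :return: None
    """
    c1, c2, c3 = cutword()
    print(c1, c2, c3)
    cv = CountVectorizer()
    data = cv.fit_transform([c1, c2, c3])  # count how often each word occurs in each document
    # Vocabulary: each distinct word listed once (scikit-learn < 1.0 uses get_feature_names())
    print(cv.get_feature_names_out().tolist())
    print(data.toarray())  # convert the sparse matrix to a dense array
    return None

if __name__ == "__main__":
    hanzivec()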

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\40798\AppData\Local\Temp\jieba.cache
R 有 很多 自然语言 处理 的 包 但是 大多 是 针对 英文 的 中文 来 做 NLP 的 包 ， 经过 长期 探索
['nlp', '中文', '但是', '处理', '大多', '很多', '探索', '经过', '自然语言', '英文', '针对', '长期']
[[0 0 0 1 0 1 0 0 1 0 0 0]
 [0 0 1 0 1 0 0 0 0 1 1 0]
 [1 1 0 0 0 0 1 1 0 0 0 1]]
Loading model cost 0.964 seconds.
Prefix dict has been built successfully.

Process finished with exit code 0
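Note that single-character words such as 有, 的, and 包 are missing from the feature list above. The sketch below suggests why, assuming the regular expression shown matches CountVectorizer's default token_pattern (tokens of two or more word characters, applied after lowercasing).

```python
import re

# Assumed to match CountVectorizer's default token_pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

segmented = "R 有 很多 自然语言 处理 的 包"
kept = TOKEN.findall(segmented.lower())
dropped = [tok for tok in segmented.split() if tok.lower() not in kept]

print(kept)     # ['很多', '自然语言', '处理']
print(dropped)  # ['R', '有', '的', '包']
```

If single-character words matter for the task, passing a looser pattern such as token_pattern=r"(?u)\b\w+\b" to CountVectorizer should keep them.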

III. TF-IDF for text classification

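The main idea: if a word or phrase occurs frequently in one document but rarely in the other documents, it is considered a good discriminator between categories. TF-IDF therefore measures how important a word is to a document.

Importance = tf * idf

tf is the term frequency: how often the word occurs in the document.

idf is the inverse document frequency: log(total number of documents / number of documents that contain the word).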
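As a worked example of the formula, here is a minimal sketch in plain Python. The function name tf_idf, the toy documents, the use of the natural logarithm, and the choice to divide the raw count by the document length are all assumptions made for illustration; libraries differ in these details.

```python
import math


def tf_idf(term, doc, docs):
    """tf * idf for one term in one tokenized document."""
    tf = doc.count(term) / len(doc)            # share of the document's tokens
    df = sum(1 for d in docs if term in d)     # documents containing the term
    if df == 0:
        return 0.0                             # term never occurs anywhere
    idf = math.log(len(docs) / df)
    return tf * idf


docs = [
    ["hello", "python", "python"],
    ["hello", "java"],
    ["hello", "world"],
]
hello = tf_idf("hello", docs[0], docs)
python = tf_idf("python", docs[0], docs)
print(hello)             # 0.0, because "hello" occurs in every document
print(round(python, 3))  # 0.732, i.e. (2/3) * ln(3)
```

A word that appears in every document gets idf = log(1) = 0, so it contributes nothing, while a word concentrated in one document gets a high weight.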

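from sklearn.feature_extraction.text import TfidfVectorizer    # text feature extraction API
import jieba

def cutword():
    # Segment each of the three sentences into words (jieba.cut returns a generator)
    con1 = jieba.cut('R有很多自然语言处理的包')
    con2 = jieba.cut('但是大多是针对英文的')
    con3 = jieba.cut('中文来做NLP的包，经过长期探索')

    # Materialize the generators as lists
    contest1 = list(con1)
    contest2 = list(con2)
    contest3 = list(con3)
    # Join the words with spaces so that TfidfVectorizer can tokenize them
    c1 = " ".join(contest1)
    c2 = " ".join(contest2)
    c3 = " ".join(contest3)
    return c1, c2, c3

def tfidfvec():
    """
    TF-IDF feature extraction for segmented Chinese text.
    :return: None
    """
    c1, c2, c3 = cutword()
    print(c1, c2, c3)
    tf = TfidfVectorizer()
    data = tf.fit_transform([c1, c2, c3])  # compute the TF-IDF weight of each word in each document
    # Vocabulary: each distinct word listed once (scikit-learn < 1.0 uses get_feature_names())
    print(tf.get_feature_names_out().tolist())
    print(data.toarray())  # convert the sparse matrix to a dense array
    return None

if __name__ == "__main__":
    tfidfvec()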

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\40798\AppData\Local\Temp\jieba.cache
Loading model cost 0.876 seconds.
Prefix dict has been built successfully.
R 有 很多 自然语言 处理 的 包 但是 大多 是 针对 英文 的 中文 来 做 NLP 的 包 ， 经过 长期 探索
['nlp', '中文', '但是', '处理', '大多', '很多', '探索', '经过', '自然语言', '英文', '针对', '长期']
[[0.         0.         0.         0.57735027 0.         0.57735027
  0.         0.         0.57735027 0.         0.         0.        ]
 [0.         0.         0.5        0.         0.5        0.
  0.         0.         0.         0.5        0.5        0.        ]
 [0.4472136  0.4472136  0.         0.         0.         0.
  0.4472136  0.4472136  0.         0.         0.         0.4472136 ]]

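The values show how important each word is to each document. In this small example every vocabulary word occurs in exactly one document, so all non-zero weights within a row are equal.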
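The numbers above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings as I understand them: a smoothed idf of ln((1 + n) / (1 + df)) + 1, raw counts as tf, and L2 normalization of each row. The helper tfidf_l2 is hypothetical, and the token lists are the words that survived tokenization in the output above.

```python
import math


def tfidf_l2(docs):
    """docs: list of token lists. Returns (vocab, rows) with L2-normalized rows."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    idf = {}
    for term in vocab:
        df = sum(1 for doc in docs if term in doc)
        idf[term] = math.log((1 + n) / (1 + df)) + 1   # smoothed idf (assumed default)
    rows = []
    for doc in docs:
        raw = [doc.count(term) * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows


docs = [
    ['很多', '自然语言', '处理'],
    ['但是', '大多', '针对', '英文'],
    ['中文', 'nlp', '经过', '长期', '探索'],
]
vocab, rows = tfidf_l2(docs)
print(vocab)
for row in rows:
    print([round(v, 4) for v in row])
```

With 3, 4, and 5 equally weighted words per document, normalization gives 1/sqrt(3) ≈ 0.5774, 1/sqrt(4) = 0.5, and 1/sqrt(5) ≈ 0.4472, matching the three rows of the output.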