NLP Exploration

Latest recommended article published on 2024-08-22 09:53:48

三笔竹林

1. Statistical features such as TF and IDF –> text keyword extraction

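Statistical text features based on the bag-of-words (BOW) model are too numerous to list. In text mining they include the well-known TF and IDF features, as well as features that look trivial but often end up with high weight in a model. Before discussing TF-IDF, here are some statistical features related to word frequency, word density, and readability:
(1) Count features: word frequency counts, sentence count and sentence length statistics, punctuation counts, and counts of domain-specific terms.
(2) Readability features: syllable count, SMOG index, reading ease, and so on.
Features of this kind can be computed with the Textstat package from GitHub, which provides many functions for extracting statistical features from text, such as the following.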

import textstat

if __name__ == '__main__':
    test_data = """Playing games has always been thought to be important to the development of well-balanced and creative children; however, what part, if any, they should play in the lives of adults has never been researched that deeply. """

    # Readability scores and grade-level estimates for test_data.
    print(textstat.flesch_reading_ease(test_data))
    print(textstat.smog_index(test_data))
    print(textstat.flesch_kincaid_grade(test_data))
    print(textstat.coleman_liau_index(test_data))
    print(textstat.automated_readability_index(test_data))
    print(textstat.dale_chall_readability_score(test_data))
    # Number of words the package classifies as difficult.
    print(textstat.difficult_words(test_data))
    print(textstat.linsear_write_formula(test_data))
    print(textstat.gunning_fog(test_data))
    # Consensus grade level derived from the individual tests.
    print(textstat.text_standard(test_data))
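The count features from item (1) need no external package. The following is a minimal sketch using only the Python standard library; the function name `count_features`, the regex-based sentence splitter, and the sample sentence are illustrative assumptions, not part of Textstat.

```python
import re
import string
from collections import Counter


def count_features(text):
    """Return simple count-based features for one piece of English text."""
    # Split into sentences after ., ! or ? followed by whitespace (a rough heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens: runs of letters, optionally with an apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_count": len(punctuation),
        "word_freq": Counter(words),
    }


features = count_features("Playing games is fun. Is it useful, too? Yes!")
print(features["word_count"], features["sentence_count"], features["punctuation_count"])
print(features["word_freq"]["is"])
```

Counts of domain-specific terms could be added the same way, by looking up a hand-picked term list in `word_freq`.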

Installing textstat

I don't yet know why it won't run; I'll come back to this later.
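To show what a readability feature from item (2) measures, here is a rough standalone sketch of the Flesch reading ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The vowel-group syllable counter is a crude assumption made for illustration, so the scores will generally differ from those of Textstat, which counts syllables differently.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent 'e' (as in "game") as not forming a syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


score = flesch_reading_ease("The cat sat on the mat. It was happy.")
print(round(score, 2))
```

Higher scores indicate easier text; short sentences made of short words, as in the sample, score very high.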

TFIDF:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: vec = TfidfVectorizer()

In [3]: corpus = ['This is sample document.', 'another random document.', 'third sample document text']

In [4]: X = vec.fit_transform(corpus)

In [5]: print(X)   # (document index, feature index)   TF-IDF weight
  (0, 7)    0.58448290102
  (0, 2)    0.58448290102
  (0, 4)    0.444514311537
  (0, 1)    0.345205016865
  (1, 1)    0.385371627466
  (1, 0)    0.652490884513
  (1, 3)    0.652490884513
  (2, 4)    0.444514311537
  (2, 1)    0.345205016865
  (2, 6)    0.58448290102
  (2, 5)    0.58448290102

In [6]: vec.get_feature_names()  # feature names in column-index order; newer scikit-learn uses get_feature_names_out()
Out[6]:
[u'another',
 u'document',
 u'is',
 u'random',
 u'sample',
 u'text',
 u'third',
 u'this']
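The weights printed above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings as I understand them: lowercasing, tokens of two or more word characters, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The helper name `tfidf` is hypothetical. Ranking each document's terms by weight then gives a simple form of keyword extraction.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

# Lowercase and keep tokens of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in docs for tok in set(doc))
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}


def tfidf(doc):
    """Return {term: weight} for one tokenized document, L2-normalized."""
    tf = Counter(doc)
    raw = {tok: tf[tok] * idf[tok] for tok in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}


weights = [tfidf(doc) for doc in docs]
for i, w in enumerate(weights):
    # Rank terms by weight; the top ones serve as candidate keywords.
    top = sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(i, [(tok, round(val, 4)) for tok, val in top])
```

"document" appears in all three texts, so it gets the lowest IDF and the lowest weight in each row, while terms unique to one document rank highest.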

Better TF-IDF tutorials:

http://blog.csdn.net/liuxuejiang158blog/article/details/31360765

http://blog.csdn.net/eastmount/article/details/50323063