Core code for extracting article keywords with jieba word segmentation and TF-IDF - sxr

秉寒-CHO

Category columns: Python ML

Article link: https://blog.csdn.net/haohaixingyun/article/details/88724115

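#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba.analyse

# Python 3 strings are Unicode, so the Python 2 workaround
# (reload(sys); sys.setdefaultencoding('utf8')) for the UnicodeWarning
# "Unicode equal comparison failed to convert both arguments to Unicode"
# is no longer needed. Files are opened as UTF-8 explicitly instead.

# Read the article whose keywords we want to extract.
with open("jffile", 'r', encoding='utf-8') as f:
    text = f.read()

# Segment the text; jieba.cut returns a generator of words.
fenci_text = jieba.cut(text)

# Load the stop-word list, one word per line.
with open("stopword", 'r', encoding='utf-8') as f:
    stop_word = [line.strip() for line in f]
print(111, stop_word)  # debug output

# Keep only words that are neither stop words nor the two punctuation marks.
meaningful_words = ""
for word in fenci_text:
    if word not in stop_word:
        if word != "。" and word != "，":
            meaningful_words = meaningful_words + " " + word
print(meaningful_words)

# Top 10 keywords by TF-IDF weight, restricted to noun-like POS tags.
tfidf_word = jieba.analyse.extract_tags(meaningful_words, topK=10, withWeight=True, allowPOS=(
    'nr', 'nr1', 'nr2', 'ns', 'n', 'vn', 'nz'))

print('tfidf_word', tfidf_word)
for word in tfidf_word:
    print(word[0], word[1])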
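
To make the weights printed above less of a black box, here is a simplified sketch of TF-IDF ranking over an already segmented word list. It is not jieba's actual implementation: the function name `tfidf_rank`, the toy IDF table, and the median fallback for unknown words are all assumptions made for illustration.

```python
from collections import Counter


def tfidf_rank(tokens, idf, top_k=10):
    """Rank words by (term frequency) * (inverse document frequency)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumption: words missing from the IDF table fall back to the
    # median of the known IDF values.
    default_idf = sorted(idf.values())[len(idf) // 2]
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest score first, truncated to the requested number of keywords.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy IDF table with made-up values: frequent words get a low IDF.
toy_idf = {"data": 1.0, "warehouse": 4.0, "label": 3.0}
tokens = ["data", "data", "data", "warehouse", "label", "label"]
ranking = tfidf_rank(tokens, toy_idf, top_k=2)
print(ranking)
```

Here "data" is the most frequent word, but its low IDF pushes it below "label" and "warehouse". That is the effect we rely on when we want keywords rather than merely common words.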

What this technique lets us do:

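In data governance, much of our raw data (call it source data) and much of the data in the data warehouse carries no tags. A tag reflects the main business nature of the data it describes. In many cases this attribute can be determined by human judgment, that is, by manual labeling, but that is time-consuming and laborious, and the result often fails to reach even a passing standard.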

Labeling the data with a statistical method is therefore also a feasible approach.

For example, to label a table:

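We need to collect the table's comment and the column-level comments, then run jieba segmentation and word statistics over that text. The quality of the result, however, depends entirely on our corpus.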
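
A minimal sketch of that table-labeling step, under stated assumptions: the function `label_table`, its parameters, and the sample comments are hypothetical, and the tokenizer is passed in so the example runs without jieba. With real Chinese comments we would pass `jieba.lcut` as `tokenize`.

```python
from collections import Counter


def label_table(table_comment, column_comments, tokenize,
                stop_words=frozenset(), top_k=3):
    """Suggest tags for a table from its comment and its column comments."""
    # Merge the table-level and column-level comments into one text.
    text = " ".join([table_comment, *column_comments])
    # Count every token that is not blank and not a stop word.
    counts = Counter(
        token for token in tokenize(text)
        if token.strip() and token not in stop_words
    )
    return counts.most_common(top_k)


# Hypothetical metadata; whitespace splitting stands in for jieba here.
tags = label_table(
    "customer order table",
    ["order id", "customer id", "order amount", "order date"],
    tokenize=str.split,
    stop_words={"table", "id"},
    top_k=2,
)
print(tags)
```

Plain counts are used here to keep the sketch short. Swapping the counting step for a TF-IDF score is what ties the result to the corpus.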

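Building a suitable corpus is the sole criterion for judging the maturity of this work, and a successful corpus would be a milestone for the industry and for the company's own development. For now, though, the department has neither the capability nor the vision to take this on, especially since it takes a great deal of self-proof to convince management.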
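
To show what "building a corpus" could mean in code, here is a rough sketch that derives IDF values from a small domain corpus of already segmented documents. The helper names `build_idf` and `format_idf_lines` and the sample documents are assumptions. The output uses one "word idf" pair per line, which is the layout we understand jieba's `jieba.analyse.set_idf_path` to read, so check that against the jieba documentation before relying on it.

```python
import math
from collections import Counter


def build_idf(tokenized_docs):
    """Return {word: idf}, where idf = log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each word at most once per document.
        doc_freq.update(set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def format_idf_lines(idf):
    """Render the table as 'word idf' lines, sorted by word."""
    return [f"{word} {value:.6f}" for word, value in sorted(idf.items())]


# Hypothetical corpus: each inner list is one segmented table description.
docs = [
    ["order", "customer"],
    ["order", "amount"],
    ["order", "date", "customer"],
]
idf = build_idf(docs)
lines = format_idf_lines(idf)
print("\n".join(lines))
```

A word that appears in every document, such as "order" here, gets an IDF of zero and stops competing for the keyword slots, which is why the choice of corpus shapes the tags so strongly.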
