Text Classification in Python

Latest recommended article published on 2024-09-03 14:14:42

2401_84140687

Column: Programmers. Article tags: python, classification, c#

Article link: https://blog.csdn.net/2401_84140687/article/details/138293961

        return self.__bag_of_words

    def WordFreq(self, word):
        """Return the frequency of a word (0 if it is not in the bag)."""
        if word in self.__bag_of_words:
            return self.__bag_of_words[word]
        else:
            return 0
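The frequency lookup above, together with the `+` overload that the later classes rely on, behaves much like `collections.Counter` from the standard library. The following is only a minimal sketch under that assumption: `Counter` stands in for the article's `BagOfWords` class, and the helper `word_freq` and the sample words are made up for illustration.

```python
from collections import Counter


def word_freq(bag, word):
    """Return how often `word` occurs in `bag`, or 0 if it is absent."""
    return bag[word] if word in bag else 0


# Two small bags built from token lists (made-up sample data).
bag_a = Counter(["spam", "ham", "spam"])
bag_b = Counter(["ham", "eggs"])

# Counter's "+" adds the frequencies key by key.
merged = bag_a + bag_b

print(word_freq(merged, "spam"))  # 2
print(word_freq(merged, "ham"))   # 2
print(word_freq(merged, "tofu"))  # 0
```

Returning 0 for an unknown word, instead of raising `KeyError`, is what lets the classifier look up words it never saw during training.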

The Document class

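import re  # used by read_document; harmless if already imported at the top of the module


class Document(object):
    """Used both for learning (training) documents and for test documents.

    The optional parameter learn of read_document must be set to True if a
    classifier is to be trained. For a test document, learn must be False.
    """
    # Vocabulary shared by all documents (class attribute).
    _vocabulary = BagOfWords()

    def __init__(self, vocabulary):
        self.__name = ""
        self.__document_class = None
        self._words_and_freq = BagOfWords()
        Document._vocabulary = vocabulary

    def read_document(self, filename, learn=False):
        """Read a document.

        The file is assumed to be encoded either in UTF-8 or in ISO 8859… (Latin-1).
        The words of the document are stored in a bag of words,
        i.e. self._words_and_freq = BagOfWords().
        """
        try:
            with open(filename, "r", encoding="utf-8") as f:
                text = f.read()
        except UnicodeDecodeError:
            with open(filename, "r", encoding="latin-1") as f:
                text = f.read()
        text = text.lower()
        words = re.split(r"\W", text)

        self._number_of_words = 0
        for word in words:
            if not word:
                continue  # re.split yields empty strings between adjacent separators
            self._words_and_freq.add_word(word)
            if learn:
                Document._vocabulary.add_word(word)

    def __add__(self, other):
        """Overload the "+" operator.

        Adding two documents means adding the BagOfWords of the documents.
        """
        res = Document(Document._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def vocabulary_length(self):
        """Return the length of the vocabulary."""
        # BagOfWords exposes its size through a len() method.
        return Document._vocabulary.len()

    def WordsAndFreq(self):
        """Return the dictionary of the words (keys) and their frequencies (values)
        contained in the BagOfWords attribute of the document."""
        return self._words_and_freq.BagOfWords()

    def Words(self):
        """Return the words of the Document object."""
        d = self._words_and_freq.BagOfWords()
        return d.keys()

    def WordFreq(self, word):
        """Return the number of times the word "word" appears in the document."""
        bow = self._words_and_freq.BagOfWords()
        if word in bow:
            return bow[word]
        else:
            return 0

    def __and__(self, other):
        """Intersection of two documents.

        Returns a list of the words that occur in both documents.
        """
        intersection = []
        words1 = self.Words()
        for word in other.Words():
            if word in words1:
                intersection += [word]
        return intersection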
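To see what reading a document and intersecting two documents amount to, here is a small self-contained sketch. It is not the `Document` class itself: it assumes `collections.Counter` in place of `BagOfWords`, works on strings instead of files, and the helper names `tokenize` and `common_words` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-word characters, dropping empty strings."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def common_words(bag1, bag2):
    """Return the words that occur in both bags, in the order they appear in bag2."""
    return [w for w in bag2 if w in bag1]


# Made-up sample texts standing in for two document files.
doc1 = Counter(tokenize("The cat sat on the mat."))
doc2 = Counter(tokenize("A cat and a dog!"))

print(doc1["the"])               # 2
print(common_words(doc1, doc2))  # ['cat']
```

The intersection only looks at which words occur, not how often, which matches what `__and__` does with `Words()`.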

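Category / collection of documents

This is the class made up of the documents of one category/class. We use the term category rather than "class" so that it is not confused with Python classes: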

class Category(Document):
    def __init__(self, vocabulary):
        Document.__init__(self, vocabulary)
        self._number_of_docs = 0

    def Probability(self, word):
        """Return the probability of the word "word" given the class "self"."""
        voc_len = Document._vocabulary.len()
        # Frequency of the word in the shared vocabulary.
        SumN = Category._vocabulary.WordFreq(word)
        # Frequency of the word in this category.
        N = self._words_and_freq.WordFreq(word)
        erg = 1 + N
        erg /= voc_len + SumN
        return erg

    def __add__(self, other):
        """Overload the "+" operator.

        Adding two Category objects means adding the BagOfWords
        of the Category objects.
        """
        res = Category(self._vocabulary)
        res._words_and_freq = self._words_and_freq + other._words_and_freq
        return res

    def SetNumberOfDocs(self, number):
        self._number_of_docs = number

    def NumberOfDocuments(self):
        return self._number_of_docs
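The `1 + N` in `Probability` is add-one (Laplace) smoothing: a word that never occurred in a category still gets a small non-zero probability. The sketch below shows the common textbook form of that estimate. It is an assumption, not the article's method: the denominator here is the vocabulary size plus the total word count of the category, which differs from the denominator in `Probability` above, and the function name and sample counts are made up.

```python
def smoothed_probability(word, class_counts, vocabulary):
    """Add-one estimate of P(word | category).

    class_counts maps each word to its frequency in the category;
    vocabulary is the set of all known words.
    """
    total = sum(class_counts.values())
    return (1 + class_counts.get(word, 0)) / (len(vocabulary) + total)


# Made-up counts for one category and a five-word vocabulary.
sport = {"goal": 3, "match": 1}
vocabulary = {"goal", "match", "vote", "law", "team"}

p_goal = smoothed_probability("goal", sport, vocabulary)  # (1 + 3) / (5 + 4)
p_vote = smoothed_probability("vote", sport, vocabulary)  # (1 + 0) / (5 + 4)
print(p_goal, p_vote)
```

With this denominator the estimates for all words of the vocabulary add up to 1, which is why it is the usual choice.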

The Pool class

The pool is the class in which the document classes are trained and kept:

import os  # used by learn; harmless if already imported at the top of the module


class Pool(object):
    def __init__(self):
        self.__document_classes = {}
        self.__vocabulary = BagOfWords()

    def sum_words_in_class(self, dclass):
        """Return how many times all the different words of the vocabulary occur in class dclass."""
        total = 0
        WaF = self.__document_classes[dclass].WordsAndFreq()
        for word in self.__vocabulary.Words():
            if word in WaF:
                total += WaF[word]
        return total

    def learn(self, directory, dclass_name):
        """directory is a path in which the files of the class named dclass_name can be found."""
        x = Category(self.__vocabulary)
        filenames = os.listdir(directory)
        for filename in filenames:
            d = Document(self.__vocabulary)
            # print(os.path.join(directory, filename))
            d.read_document(os.path.join(directory, filename), learn=True)
            x = x + d
        self.__document_classes[dclass_name] = x
        x.SetNumberOfDocs(len(filenames))

    def Probability(self, doc, dclass=""):
        """Calculate the probability of class dclass given a document doc.

        If dclass is omitted, return a list of [class, probability] pairs
        for all classes, sorted by descending probability.
        """
        if dclass:
            sum_dclass = self.sum_words_in_class(dclass)
            prob = 0

            d = Document(self.__vocabulary)
            d.read_document(doc)

            for j in self.__document_classes:
                sum_j = self.sum_words_in_class(j)
                prod = 1
                for i in d.Words():
                    wf_dclass = 1 + self.__document_classes[dclass].WordFreq(i)
                    wf = 1 + self.__document_classes[j].WordFreq(i)
                    r = wf * sum_dclass / (wf_dclass * sum_j)
                    prod *= r
                prob += (prod * self.__document_classes[j].NumberOfDocuments()
                         / self.__document_classes[dclass].NumberOfDocuments())
            if prob != 0:
                return 1 / prob
            else:
                return -1
        else:
            prob_list = []
            for dclass in self.__document_classes:
                prob = self.Probability(doc, dclass)
                prob_list.append([dclass, prob])
            prob_list.sort(key=lambda x: x[1], reverse=True)
            return prob_list

    def DocumentIntersectionWithClasses(self, doc_name):
        """Return [doc_name, class1, ratio1, class2, ratio2, ...], where each ratio is
        the share of the document's words that also occur in that class."""
        res = [doc_name]
        d = Document(self.__vocabulary)
        d.read_document(doc_name, learn=False)
        for dc in self.__document_classes:
            o = self.__document_classes[dc] & d
            intersection_ratio = len(o) / len(d.Words())
            res += (dc, intersection_ratio)
        return res
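The nested loops in `Pool.Probability` can be hard to follow, so here is the same arithmetic on plain dictionaries. This is a sketch under stated assumptions, not the `Pool` class itself: the function name `posterior`, the dictionary layout and the sample counts are made up, and each class total is taken as the sum of that class's word counts.

```python
def posterior(doc_words, target, class_counts, docs_per_class):
    """Probability of class `target` given the distinct words of a document.

    class_counts maps class name -> {word: frequency};
    docs_per_class maps class name -> number of training documents.
    """
    sum_target = sum(class_counts[target].values())
    denom = 0.0
    for other, counts in class_counts.items():
        sum_other = sum(counts.values())
        prod = 1.0
        for w in doc_words:
            wf_target = 1 + class_counts[target].get(w, 0)
            wf_other = 1 + counts.get(w, 0)
            # Ratio of the smoothed word likelihoods: other class vs. target class.
            prod *= wf_other * sum_target / (wf_target * sum_other)
        # Weight by the ratio of the class priors (document counts).
        denom += prod * docs_per_class[other] / docs_per_class[target]
    return 1 / denom


# Made-up training counts for two classes.
class_counts = {
    "sport": {"goal": 3, "match": 1},
    "politics": {"vote": 2, "match": 2},
}
docs_per_class = {"sport": 2, "politics": 2}

doc_words = ["goal", "match"]
p_sport = posterior(doc_words, "sport", class_counts, docs_per_class)
p_politics = posterior(doc_words, "politics", class_counts, docs_per_class)
print(p_sport, p_politics)
```

Summing likelihood ratios relative to the target class and then taking the reciprocal is the Bayes normalisation written in one step, so the values for all classes add up to 1. For long documents the running product can underflow or overflow. Summing logarithms is the usual remedy.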
