20 Newsgroups Classification with a Naive Bayes Classifier and Result Comparison (Python 3)

I have read many CSDN articles before, and a lot of them were copied straight from Stack Overflow or English-language sites, so after a whole day of reading I gained nothing. This assignment also drew on help from foreign-language sites, but it is written based on my own understanding, so consider it a study note. The environment is Python 3 (the assignment is in English because I study abroad; please forgive my poor English). The full code is attached at the end.
The MNB (multinomial naive Bayes) function was taken from a website; I can no longer find the exact reference. I personally think it is very meaningful for this assignment.

This task produces a text classifier for the 20-newsgroups corpus. It is implemented with Python 3.6.
As usual, the first step is collecting data.
From the corpus's homepage (http://qwone.com/~jason/20Newsgroups/), download 20news-bydate-matlab.tgz, which contains three types of files for training and testing respectively: .data, .label, and .map. The data contains several kinds of information: docIdx (document index), wordIdx (word index), count (word count), labelID, and label name.
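As a minimal sketch of loading these files (assuming the archive is extracted into the working directory and pandas is available):

```python
import pandas as pd

# Each row of train.data is a (docIdx, wordIdx, count) triple;
# train.label has one labelID per line, one line per document.
train_data = pd.read_csv('train.data', sep=' ', header=None,
                         names=['docIdx', 'wordIdx', 'count'])
train_labels = pd.read_csv('train.label', header=None, names=['labelID'])
print(train_data.head())
```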
1. Describe this 20 Newsgroups data set.
As its name shows, this data set consists of news in 20 classes. By listing the proportion of each class in the dataset and the word distribution, the data set can be described clearly.
Firstly, the working directory is set and train.label is opened to check how many documents are in this dataset. The number of lines equals the number of documents, so we use total = len(lines) and print it:
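A minimal sketch of this step, assuming train.label sits in the working directory:

```python
# train.label has one line per training document, so the line count
# equals the number of documents.
with open('train.label') as f:
    lines = f.readlines()
total = len(lines)
print('total:', total)
```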
total: 11269
Hence 11269 documents are stored in this dataset.
By checking train.map we know there are 20 different classes. Here I calculate the occurrence and proportion of each class to describe the dataset. All proportions are kept to five decimal places.
Occurrence of each class:
1: 480, 2: 581, 3: 572, 4: 587, 5: 575, 6: 592, 7: 582, 8: 592, 9: 596, 10: 594, 11: 598, 12: 594, 13: 591, 14: 594, 15: 593, 16: 599, 17: 545, 18: 564, 19: 464, 20: 376
Probability of each class:
1: 0.04259, 2: 0.05156, 3: 0.05076, 4: 0.05209, 5: 0.05102, 6: 0.05253, 7: 0.05165, 8: 0.05253, 9: 0.05289, 10: 0.05271, 11: 0.05307, 12: 0.05271, 13: 0.05244, 14: 0.05271, 15: 0.05262, 16: 0.05315, 17: 0.04836, 18: 0.05005, 19: 0.04117, 20: 0.03337
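A sketch of how these figures can be produced, assuming the labels are read from train.label as above (collections.Counter tallies the occurrences):

```python
from collections import Counter

with open('train.label') as f:
    labels = [int(line) for line in f]

occurrence = Counter(labels)
total = len(labels)
for label_id in sorted(occurrence):
    # Proportion per class, rounded to five decimal places as listed above.
    print(label_id, occurrence[label_id], round(occurrence[label_id] / total, 5))
```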
As we can see, the occurrences of the 20 classes lie in the range (376, 599). The largest difference between proportions is 0.01978 (0.05315 - 0.03337), a very small range compared with the whole data set, 1.0000. Thus, we could say the documents are almost uniformly distributed across the 20 classes.
train.map is a file that contains the label name corresponding to each labelID. By opening train.map, all the class names can be listed as below:
alt.atheism 1
comp.graphics 2
comp.os.ms-windows.misc 3
comp.sys.ibm.pc.hardware 4
comp.sys.mac.hardware 5
comp.windows.x 6
misc.forsale 7
rec.autos 8
rec.motorcycles 9
rec.sport.baseball 10
rec.sport.hockey 11
sci.crypt 12
sci.electronics 13
sci.med 14
sci.space 15
soc.religion.christian 16
talk.politics.guns 17
talk.politics.mideast 18
talk.politics.misc 19
talk.religion.misc 20
These are all the label names with their corresponding IDs, which help us match the list above and get the proportion of each class.
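A sketch for parsing this mapping, assuming each line of train.map has the `name id` layout shown above:

```python
# Parse train.map into a {labelID: label_name} dictionary.
label_names = {}
with open('train.map') as f:
    for line in f:
        name, label_id = line.split()
        label_names[int(label_id)] = name
print(label_names[1])  # alt.atheism
```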
In terms of words, we could open train.data and check how many words appear in the documents, as in the sketch below:
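This sketch assumes train.data is loaded as in the earlier snippet; the distinct values of each column give the figures that follow:

```python
import pandas as pd

train_data = pd.read_csv('train.data', sep=' ', header=None,
                         names=['docIdx', 'wordIdx', 'count'])
print('there are', train_data['docIdx'].nunique(), 'documents in the data set')
print('there are', train_data['wordIdx'].nunique(), 'words in the data set')
```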
there are 11269 documents in the data set
there are 53975 words in the data set
By doing the same operation on the test data, the result is:
7505 documents in the test data set
61188 unique words in the test data set
In conclusion, the training data set contains 11269 documents, which include 53975 different words. The documents fall into 20 different classes, each assigned a label name and an ID. Since the variance in documents per class is small, the distribution of documents across classes can be seen as uniform. The test data set contains 7505 documents, consisting of 61188 unique words.
In addition, the test set contains more words than the train set, which means unseen words may occur during testing. To keep their probabilities from being zero, Laplace smoothing shall be used in the Bayes function.
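A hedged sketch of that smoothing (the names word_count, class_total, and smoothed_likelihood are hypothetical helpers, not part of the original code; V is the corpus vocabulary size):

```python
def smoothed_likelihood(word_count, class_total, word_id, class_id, V=61188):
    # Laplace (add-one) smoothing: a word never seen in this class still
    # gets a small non-zero probability instead of zeroing out the product.
    return (word_count[class_id].get(word_id, 0) + 1) / (class_total[class_id] + V)
```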
2. Describe how each document is represented in your implementation.
Every document can be seen as a bag of words, each word paired with its occurrence count.
TF-IDF is a numerical statistic which reflects the importance of a word to a document in a collection or corpus. The importance of a word increases with the number of times it appears in the document and decreases with the frequency at which it appears across the corpus.
TF, which means term frequency, represents how many times a word occurs in a document:

$$\mathrm{TF}(t, d) = n_{t,d}$$

where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$.
A TF-IDF matrix can be utilized to represent a document and can also be used for the naïve Bayes function.
Here I use term frequency to generate a matrix indexed by 'docIdx' and 'wordIdx'.
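A sketch of building that matrix in scipy's sparse format, assuming the (docIdx, wordIdx, count) triples from train.data and the corpus vocabulary size of 61188 (a dense matrix of this size would be wasteful):

```python
import pandas as pd
from scipy.sparse import csr_matrix

train_data = pd.read_csv('train.data', sep=' ', header=None,
                         names=['docIdx', 'wordIdx', 'count'])

# Rows are documents, columns are words, entries are term frequencies.
# The corpus indices are 1-based, hence the -1 shifts.
n_docs = train_data['docIdx'].max()
n_words = 61188  # vocabulary size of the corpus
tf_matrix = csr_matrix(
    (train_data['count'], (train_data['docIdx'] - 1, train_data['wordIdx'] - 1)),
    shape=(n_docs, n_words))
print(tf_matrix.shape)  # (11269, 61188)
```

This matrix can then be fed either to the hand-written MNB function mentioned above or to a library implementation such as sklearn.naive_bayes.MultinomialNB.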
