Bayes' Theorem and Bayesian Classifiers
Bayes' theorem is useful because we constantly run into this situation in practice:
P(A|B) is easy to obtain directly, while P(B|A) is hard to obtain directly, yet P(B|A) is what we actually care about. Bayes' theorem gives us a path from P(A|B) to P(B|A).
L(A|B) is the likelihood of A given that B has occurred.
Pr(A|B) is the conditional probability of A given that B has occurred; because it is derived from the observed value of B, it is also called the posterior probability of A.
A comparison of the current major Lucene Chinese tokenizers:
http://www.iteye.com/news/9637
In P(H|X), H stands for "the mail is spam" and X for a word appearing in the mail:
P(H|X): the probability that the mail is spam, given that the word X appears
P(X|H): the probability that X appears, given that the mail is spam
P(X): the probability that X appears across all mail
P(H): the probability that a mail is spam, across all mail
P(H|X)=P(X|H)P(H)/P(X)
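Plugging numbers into the formula above makes the effect concrete. The probabilities below are made up for illustration, not taken from any real corpus:

```python
# Bayes' rule for the spam example: P(H|X) = P(X|H) * P(H) / P(X).
# All three inputs below are hypothetical values.
p_h = 0.30              # P(H): prior probability that a mail is spam
p_x_given_h = 0.80      # P(X|H): word X appears in spam
p_x_given_not_h = 0.10  # P(X|not H): word X appears in legitimate mail

# Law of total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Posterior: seeing X lifts the spam probability from 0.30 to about 0.77.
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 4))  # -> 0.7742
```

This is exactly why a single strong indicator word can flip a classifier's verdict: the posterior is driven by the ratio of P(X|H) to P(X).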
# Splitting the dataset with Pig
Load the tokenized file:
processed = load '/opt/digitalout/part-r-00000' as (category:chararray, doc:chararray);
Randomly sample 20% of the records as the test set:
test = sample processed 0.2;
Take the remaining records as the training set:
jnt = join processed by (category,doc) left outer, test by (category,doc);
filt_test = filter jnt by test::category is null;
train = foreach filt_test generate processed::category as category, processed::doc as doc;
-- First left-join the original dataset (processed) against the test set (test),
-- then drop every row that has a matching test record.
Write both splits out:
store test into '/opt/digitalout/test';
store train into '/opt/digitalout/train';
Per-category counts for the test and training sets:
test_ct = foreach (group test by category) generate group,COUNT(test.category);
dump test_ct;
train_ct = foreach (group train by category) generate group,COUNT(train.category);
dump train_ct;
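The Pig split above (a random 20% sample, then a left outer join with an `is null` filter to recover the training set) can be sketched in Python, assuming each record is a unique (category, doc) pair:

```python
import random

# Hypothetical records shaped like the Pig relation: (category, doc) pairs.
processed = [("mp3", "doc%d" % i) for i in range(50)] + \
            [("camera", "doc%d" % i) for i in range(50, 100)]

random.seed(42)
# sample processed 0.2: each record independently lands in test with p = 0.2.
test = [r for r in processed if random.random() < 0.2]

# left outer join + "is null" filter = anti-join: keep records not in test.
test_keys = set(test)
train = [r for r in processed if r not in test_keys]

print(len(test), len(train))  # together they cover all 100 records
```

Note the assumption: (category, doc) must uniquely identify a record. If the same pair occurs twice, Pig's join would multiply rows, and all copies would be removed from the training set together.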
# Sequence-file conversion
#mahout seqdirectory --input /opt/digitalout/train --output /opt/digitalout/train_byes --tempDir /opt/digitalout/temp
Training the Bayes model (independent of the Pig steps above)
Put the files for each category into a single folder; the folder name serves as the category label.
/opt/digital: under my docs directory there are subdirectories such as mp3 and camera, each containing individual articles.
./mahout seqdirectory -i /opt/digital -o /opt/digitalseq
Tokenize the sequence files into vector files, then split them into a training set and a test set:
./mahout seq2sparse -i /opt/digitalseq -o /opt/digitalout2/vectors -lnorm -nv -wt tfidf
./mahout split -i /opt/digitalout2/vectors/tfidf-vectors --trainingOutput /opt/digitalout2/train --testOutput /opt/digitalout2/test --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
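seq2sparse with `-wt tfidf` weights each term by TF-IDF. A minimal sketch of the idea follows; the corpus is hypothetical, and Mahout's actual implementation adds its own IDF smoothing and the `-lnorm` length normalization, which are omitted here:

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [["mp3", "player", "battery"],
        ["camera", "lens", "battery"],
        ["camera", "flash"]]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(term, doc):
    """Plain tf * log(N/df); Mahout applies additional smoothing/scaling."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(round(tfidf("battery", docs[0]), 4))  # in 2 of 3 docs: low weight
print(round(tfidf("mp3", docs[0]), 4))      # in only 1 doc: high weight
```

The effect is what the classifier needs: words that appear everywhere (like "battery" across product reviews) are down-weighted, while category-specific words carry the signal.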
Train the Bayes model:
./mahout trainnb -i /opt/digitalout2/train -el -o /opt/digitalout2/model -li /opt/digitalout2/labelindex -ow -c
Test the Bayes model:
./mahout testnb -i /opt/digitalout2/test -m /opt/digitalout2/model -l /opt/digitalout2/labelindex -ow -o /opt/digitalout2/testresult -c
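Roughly what trainnb/testnb do under the hood: count word frequencies per class, then score documents with log priors plus log word likelihoods. The sketch below is the plain multinomial Naive Bayes with Laplace smoothing on hypothetical data; the `-c` flag above actually selects Mahout's complementary variant, which differs in how per-class statistics are aggregated:

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (hypothetical): (category, tokens) pairs.
train = [("camera", ["lens", "flash", "zoom"]),
         ("camera", ["lens", "battery"]),
         ("mp3",    ["player", "battery", "headphones"])]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()            # documents per class
vocab = set()
for label, tokens in train:
    class_counts[label] += 1
    word_counts[label].update(tokens)
    vocab.update(tokens)

def classify(tokens):
    """Pick the class maximizing log P(c) + sum log P(w|c), Laplace-smoothed."""
    best, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify(["lens", "zoom"]))  # -> camera
```

The log-space sum avoids underflow when documents have hundreds of words, which is why real implementations (Mahout included) never multiply raw probabilities.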
Summary
-------------------------------------------------------
Correctly Classified Instances : 3188 96.5768%
Incorrectly Classified Instances : 113 3.4232%
Total Classified Instances : 3301
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e <--Classified as
549 3 3 9 13 | 577 a = MP3
0 503 0 14 2 | 519 b = camera
5 1 560 13 30 | 609 c = computer
0 1 0 611 0 | 612 d = household
2 2 0 15 965 | 984 e = mobile
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.9496
Accuracy 96.5768%
Reliability 80.3207%
Reliability (standard deviation) 0.3944
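The summary numbers can be re-derived from the confusion matrix. Accuracy matches Mahout's output exactly; the textbook Cohen's kappa comes out slightly higher than the Kappa Mahout reports, so Mahout evidently uses its own estimator:

```python
# Rows = actual class, columns = predicted class, copied from the matrix above.
cm = [[549,   3,   3,   9,  13],   # a = MP3
      [  0, 503,   0,  14,   2],   # b = camera
      [  5,   1, 560,  13,  30],   # c = computer
      [  0,   1,   0, 611,   0],   # d = household
      [  2,   2,   0,  15, 965]]   # e = mobile

n = sum(sum(row) for row in cm)            # 3301 total instances
correct = sum(cm[i][i] for i in range(5))  # 3188 on the diagonal
accuracy = correct / n
print(correct, n, round(accuracy * 100, 4))  # 3188 3301 96.5768

# Textbook Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is the
# chance-agreement rate from the row and column marginals.
row_tot = [sum(row) for row in cm]
col_tot = [sum(cm[i][j] for i in range(5)) for j in range(5)]
p_e = sum(row_tot[k] * col_tot[k] for k in range(5)) / (n * n)
kappa = (accuracy - p_e) / (1 - p_e)
print(round(kappa, 4))  # ~0.9565, vs. the 0.9496 Mahout reports
```

Either way, a kappa around 0.95 means the agreement is far beyond what the class distribution alone would produce by chance.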
mahout --help lists all available commands.