Background: a while ago I was hunting for an apartment, searching listings on 58.com and Ganji. Annoyingly, agents post large numbers of listings disguised as individual owners, wasting everyone's time. So, in the spirit of thrift, I decided to write a program that does what my brain was doing: filter the listings automatically and recommend only the genuine individual-owner ones. The algorithm is simple and the writing plain, so experts may want to skip this; criticism is welcome.
If you repost, please credit zsw2zkl's blog: http://blog.csdn.net/zsw2zkl/article/details/24394557
The project has three parts:
1. Page crawling and parsing
2. Word segmentation (the ansj segmenter, based on a hidden Markov model)
3. Model training and prediction
The first two parts only exist to feed the third, so the rest of this post focuses on part 3.
Model training is based on the naive Bayes algorithm.
1. Feature selection. Each segmented word serves as a feature of a listing. Why this choice? When your brain decides whether a listing is from an agent or an individual, it does so from the feel of the text as you read it, and agents and individuals tend to use noticeably different vocabulary.
2. Preparing training samples. Using the crawler from part 1, I collected listings from the web, relying on a simple fact: almost no individual labels their own listing as an agent's. So anything crawled from the agent category can be treated as agent data; I collected about 277 such listings. The rental sites also offer manually verified listings. These are not fully trustworthy (I have evidence, but let's not dwell on it), yet they can basically be treated as individual-owner data; I collected about 269 of those. That gave me training data for both classes.
3. Segmentation. Each token becomes a feature of its document. Rental listings contain domain phrases that only make sense as a unit: for example, 无中介费 ("no agent fee") should not be split into 无 and 中介费, and 朝南 ("south-facing") should also stay whole. This is a judgment call with no fixed recipe. I prepared a user dictionary to improve accuracy, though I only added a few entries, since I didn't want to over-tune the results too early. After segmentation I renamed the files: agent listings start with broker plus a sequence number, individual listings with person plus a sequence number, and each title was appended to its document body.
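ansj supports user dictionaries for exactly this purpose; the snippet below is only a toy stand-in (not the author's code, and not ansj) that illustrates the idea with a greedy longest-match over a small phrase list, falling back to single characters where a real segmenter would do far more:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseTokenizer {
    // Domain phrases that must survive as single tokens (illustrative entries).
    static final Set<String> PHRASES = new HashSet<>(Arrays.asList("无中介费", "朝南"));

    // Greedy longest-match: consume a known phrase when one starts at position i,
    // otherwise emit a single character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String matched = null;
            for (String p : PHRASES) {
                if (text.startsWith(p, i) && (matched == null || p.length() > matched.length())) {
                    matched = p;
                }
            }
            if (matched != null) {
                tokens.add(matched);
                i += matched.length();
            } else {
                tokens.add(String.valueOf(text.charAt(i)));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // 无中介费 and 朝南 come out as single tokens instead of being split.
        System.out.println(tokenize("无中介费朝南两居室"));
    }
}
```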
4. Training with NB (naive Bayes). Now that each document has its features, we compute, for each word, the probability of that word under each class (maximum likelihood estimation). Reference Java code follows.
The data structure is {word:[in_broker_num, in_person_num]}: the word is the key; the value is a List where index 0 holds the broker count and index 1 the person count.
Map<String, List<Double>> m = new HashMap<>();
// number of distinct words seen in the broker / person class
float broker_num = 0.0f;
float person_num = 0.0f;
// total word occurrences in the broker / person class
float broker_total = 0.0f;
float person_total = 0.0f;
// count every word's occurrences per class
for (File f : new File(learn_file_root).listFiles()) {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
    String fileName = f.getName();
    // hold out part of the corpus for cross-validation,
    // e.g. every file whose name ends with "1" is reserved for prediction
    if (fileName.endsWith(end)) {
        br.close();
        continue;
    }
    // index 0 = broker, index 1 = person
    int index = fileName.contains("person") ? 1 : 0;
    String line = br.readLine();
    br.close();
    String[] split = line.split(" ");
    for (String word : split) {
        List<Double> list = m.get(word);
        if (list == null) {
            list = new ArrayList<>(Arrays.asList(0.0, 0.0));
            m.put(word, list);
        }
        list.set(index, list.get(index) + 1);
    }
}
System.out.println(m);
// tally distinct-word and total-word counts for each class
for (Entry<String, List<Double>> wordM : m.entrySet()) {
    List<Double> list = wordM.getValue();
    if (list.get(0) != 0.0) {
        broker_num++;
        broker_total += list.get(0);
    }
    if (list.get(1) != 0.0) {
        person_num++;
        person_total += list.get(1);
    }
}
// replace raw counts with add-one-smoothed log probabilities
for (Entry<String, List<Double>> wordM : m.entrySet()) {
    List<Double> list = wordM.getValue();
    List<Double> list2 = new ArrayList<>(Arrays.asList(0.0, 0.0));
    list2.set(0, Math.log((list.get(0) + 1) / (broker_num + broker_total)));
    list2.set(1, Math.log((list.get(1) + 1) / (person_num + person_total)));
    wordM.setValue(list2);
}
System.out.println(m);
Documents to be classified will contain words never seen during training, so add-one (Laplace) smoothing is applied: every word's count is incremented by 1 before normalizing, leaving some probability mass for unseen words.
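As a minimal illustration of the smoothing formula used in the training loop above (with made-up counts, not the real corpus): a word seen 3 times in a class whose denominator is distinct-word count plus total-word count, say 10 + 40, scores log((3+1)/50), and an unseen word still gets log(1/50) rather than log(0):

```java
public class SmoothingDemo {
    // Add-one-smoothed log probability, mirroring the training code:
    // log((count + 1) / (distinctWords + totalWords)).
    static double smoothedLogProb(double count, double distinctWords, double totalWords) {
        return Math.log((count + 1) / (distinctWords + totalWords));
    }

    public static void main(String[] args) {
        // Toy class statistics: 10 distinct words, 40 total occurrences.
        System.out.println(smoothedLogProb(3, 10, 40)); // a word seen 3 times
        System.out.println(smoothedLogProb(0, 10, 40)); // an unseen word still gets mass
    }
}
```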
5. OK, the model is trained, so we use it to predict: for each held-out document, compute its score under each class and assign the class with the larger score. This relies on the conditional independence assumption, which is admittedly shaky, since words in a document are rarely independent of one another, but it lets us turn the joint probability into a product of per-word probabilities. Because the training code already took logs, the product becomes a sum: just add up the per-word log probabilities for each document. Reference code follows.
List<String> l_broker = new ArrayList<String>();
List<String> l_person = new ArrayList<String>();
for (File f : new File(learn_file_root).listFiles()) {
    List<String> l_null = new ArrayList<String>();     // words unseen during training
    List<String> l_not_null = new ArrayList<String>(); // words found in the model
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
    String fileName = f.getName();
    // only score the held-out files
    if (!fileName.endsWith(end)) {
        br.close();
        continue;
    }
    String line = br.readLine();
    br.close();
    String[] split = line.split(" ");
    double sum_broker = 0.0;
    double sum_person = 0.0;
    for (String word : split) {
        List<Double> list = m.get(word);
        if (list == null) {
            // crude fallback score for words the model has never seen
            l_null.add(word);
            sum_broker += 1.0 / broker_num;
            sum_person += 1.0 / person_num;
            continue;
        }
        l_not_null.add(word);
        sum_broker += list.get(0);
        sum_person += list.get(1);
    }
    // add the (log) class prior term
    sum_broker += Math.log(broker_num);
    sum_person += Math.log(person_num);
    // System.out.println("l_null: " + l_null.size() + " " + l_null);
    // System.out.println("l_not_null: " + l_not_null.size() + " " + l_not_null);
    if (sum_broker > sum_person) {
        l_broker.add(fileName);
    } else {
        l_person.add(fileName);
    }
}
System.out.println("broker: " + l_broker.size() + " " + l_broker);
System.out.println("person: " + l_person.size() + " " + l_person);
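Working in log space is not just a convenience. A listing with a hundred-odd words means multiplying a hundred-odd probabilities, each well below 1, and the raw product underflows to 0.0 in a double, destroying the comparison; the log-sum stays perfectly representable. A quick check with toy numbers (illustrative, not the model's values):

```java
public class LogSpaceDemo {
    // Multiply n copies of p directly: underflows to 0.0 for large n.
    static double rawProduct(double p, int n) {
        double r = 1.0;
        for (int i = 0; i < n; i++) r *= p;
        return r;
    }

    // Sum n copies of log(p): stays well within double range.
    static double logSum(double p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += Math.log(p);
        return s;
    }

    public static void main(String[] args) {
        // 100 words, each with probability 1e-4.
        System.out.println(rawProduct(1e-4, 100)); // 0.0 — underflow, information lost
        System.out.println(logSum(1e-4, 100));     // about -921.03, still comparable
    }
}
```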
6. Finally, we compute accuracy, precision, recall, and related metrics. Reference code follows.
int broker_right = 0; // files in l_broker that really are broker files
for (String string : l_broker) {
    if (string.contains("broker")) {
        broker_right++;
    }
}
int person_right = 0; // files in l_person that really are person files
for (String string : l_person) {
    if (string.contains("person")) {
        person_right++;
    }
}
System.out.println("accuracy: " + (broker_right + person_right) * 1.0 / (l_broker.size() + l_person.size()));
System.out.println("broker precision: " + broker_right * 1.0 / l_broker.size());
System.out.println("person precision: " + person_right * 1.0 / l_person.size());
System.out.println("broker recall: " + broker_right * 1.0 / (broker_right + l_person.size() - person_right));
System.out.println("person recall: " + person_right * 1.0 / (person_right + l_broker.size() - broker_right));
7. Reference results:
broker: 25 [broker106, broker116, broker126, broker146, broker156, broker166, broker176, broker186, broker196, broker206, broker226, broker236, broker246, broker256, broker26, broker266, broker276, broker36, broker46, broker56, broker66, broker76, broker86, broker96, person196]
person: 30 [broker136, broker16, broker216, broker6, person106, person116, person126, person136, person146, person156, person16, person166, person176, person186, person206, person216, person226, person236, person246, person256, person26, person266, person36, person46, person56, person6, person66, person76, person86, person96]
accuracy: 0.9090909090909091
broker precision: 0.96
person precision: 0.8666666666666667
broker recall: 0.8571428571428571
person recall: 0.9629629629629629
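These numbers follow directly from the two lists above: 24 of the 25 files classified as broker really are brokers (person196 is the one mistake), and 26 of the 30 classified as person really are persons (broker136, broker16, broker216 and broker6 slipped through). A small sanity check that recomputes the metrics from that confusion matrix:

```java
public class MetricsCheck {
    static double accuracy(int brokerRight, int brokerPred, int personRight, int personPred) {
        return (brokerRight + personRight) * 1.0 / (brokerPred + personPred);
    }

    static double precision(int right, int predicted) {
        return right * 1.0 / predicted;
    }

    // Recall for one class: its correct hits over all true members of it,
    // where the misses are the other list's mistakes.
    static double recall(int right, int otherPred, int otherRight) {
        return right * 1.0 / (right + otherPred - otherRight);
    }

    public static void main(String[] args) {
        int brokerRight = 24, brokerPred = 25; // true brokers among broker-labeled
        int personRight = 26, personPred = 30; // true persons among person-labeled
        System.out.println(accuracy(brokerRight, brokerPred, personRight, personPred)); // ≈ 0.909
        System.out.println(precision(brokerRight, brokerPred));           // 0.96
        System.out.println(precision(personRight, personPred));           // ≈ 0.867
        System.out.println(recall(brokerRight, personPred, personRight)); // ≈ 0.857
        System.out.println(recall(personRight, brokerPred, brokerRight)); // ≈ 0.963
    }
}
```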