Today my main work was on quality review:
For externally crawled articles, the approach we take is based on view counts. First, an initialization function computes the average view count over the articles that have already passed review:
private static void getAverageParameter(ArrayList<Knowledge> knowledge_list) {
    // Average view count over the already-approved articles
    if (knowledge_list.isEmpty()) {
        return; // nothing approved yet; avoid division by zero
    }
    int sum = 0;
    for (int i = 0; i < knowledge_list.size(); i++) {
        sum += Integer.parseInt(knowledge_list.get(i).view_number);
    }
    average_viewNum = sum / knowledge_list.size();
}
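For context, here is a minimal standalone sketch of how this initializer can be driven; the Knowledge stub and the sample data are illustrative assumptions of mine, not the project's real model:

import java.util.ArrayList;

public class AverageDemo {
    static int average_viewNum; // mirrors the static field used above

    // Illustrative stand-in for the project's Knowledge entity
    static class Knowledge {
        String view_number;
        Knowledge(String v) { view_number = v; }
    }

    public static void main(String[] args) {
        ArrayList<Knowledge> approved = new ArrayList<Knowledge>();
        approved.add(new Knowledge("150"));
        approved.add(new Knowledge("50"));
        int sum = 0;
        for (Knowledge k : approved) {
            sum += Integer.parseInt(k.view_number);
        }
        average_viewNum = sum / approved.size(); // integer division, as in the method above
        System.out.println(average_viewNum);     // prints 100
    }
}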
Then, when an article's view count is below that average, it passes with a small probability. The goal is to give low-traffic articles with original insights a chance to get through, and to keep the data from overfitting toward popular content:
// View-count criterion: articles below the average view count pass with
// probability 0.35; articles at or above it pass with probability 0.9
if (Integer.parseInt(knowledege.view_number) < average_viewNum) {
    double pass_pro = r.nextDouble();
    if (pass_pro > 0.35) { // rejected with probability 0.65
        canPass = false;
        return canPass;
    }
} else {
    double pass_pro = r.nextDouble();
    if (pass_pro > 0.9) { // rejected with probability 0.1
        canPass = false;
        return canPass;
    }
}
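To double-check the thresholds, a small standalone simulation (all names here are my own, for illustration) should print empirical pass rates close to 0.35 and 0.9:

import java.util.Random;

public class PassRateDemo {
    public static void main(String[] args) {
        Random r = new Random();
        int trials = 1000000;
        int passLow = 0, passHigh = 0;
        for (int i = 0; i < trials; i++) {
            if (r.nextDouble() <= 0.35) passLow++;  // below-average branch
            if (r.nextDouble() <= 0.9)  passHigh++; // at-or-above branch
        }
        System.out.printf("low: %.3f, high: %.3f%n",
                passLow / (double) trials, passHigh / (double) trials);
    }
}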
Next, the article content is preprocessed to remove non-displayable characters along with all kinds of spaces and line breaks:
// Content preprocessing: strip all whitespace (spaces, tabs, line breaks),
// then keep only printable ASCII and common CJK characters
String content = knowledege.article_content;
content = content.replaceAll("\\s", ""); // replaceAll for regex; removes spaces, \n, \r, tabs
StringBuilder cleaned = new StringBuilder(content.length());
for (int j = 0; j < content.length(); j++) {
    char ch = content.charAt(j);
    // printable ASCII (0x20-0x7E) or CJK ideographs (0x4E00-0x9FA5)
    if ((ch >= ' ' && ch < 127) || (ch >= 0x4E00 && ch <= 0x9FA5)) {
        cleaned.append(ch);
    }
}
content = cleaned.toString().toLowerCase();
if (content.isEmpty()) {
    canPass = false;
    return canPass;
}
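As a quick standalone check of this filter (the sample text is made up), whitespace, control characters, and non-CJK symbols such as ★ all get stripped:

public class CleanDemo {
    public static void main(String[] args) {
        String content = "Deep Learning\r\n深度学习\t★ Review 101!";
        content = content.replaceAll("\\s", "");
        StringBuilder cleaned = new StringBuilder(content.length());
        for (int j = 0; j < content.length(); j++) {
            char ch = content.charAt(j);
            if ((ch >= ' ' && ch < 127) || (ch >= 0x4E00 && ch <= 0x9FA5)) {
                cleaned.append(ch);
            }
        }
        System.out.println(cleaned.toString().toLowerCase());
        // prints: deeplearning深度学习review101!
    }
}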
The next step extracts keywords from the content, using the algorithms in FNLP:
String app = System.getProperty("user.dir");
StopWords sw = new StopWords(app + "/models/stopwords");
CWSTagger seg = null;
try {
    seg = new CWSTagger(app + "/models/seg.m"); // FNLP word segmenter
} catch (LoadModelException e) {
    e.printStackTrace(); // segmentation model failed to load
}
AbstractExtractor key = new WordExtract(seg, sw);
CNFactory factory = null;
try {
    factory = CNFactory.getInstance("models"); // models for POS tagging
} catch (LoadModelException e) {
    e.printStackTrace(); // POS models failed to load
}
// Extract up to 30 keywords; the result is a "{word=weight, ...}" string
String[] cixingWithContent = key.extract(content, 30, true)
        .replace("{", "").replace("}", "").split(", ");
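Judging from the replace/split calls above, extract returns the keywords in a map-style string like "{word=weight, word=weight}". Here is a standalone sketch of the same parsing on a hand-written sample; the sample string is my assumption about that format:

public class SplitDemo {
    public static void main(String[] args) {
        // Made-up sample mimicking the "{word=weight, word=weight}" shape
        String raw = "{深度=12.5, 学习=9.3, 优秀=4.1}";
        String[] cixingWithContent =
                raw.replace("{", "").replace("}", "").split(", ");
        for (String pair : cixingWithContent) {
            String[] re = pair.split("=");
            System.out.println(re[0] + " -> " + re[1]);
        }
    }
}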
We also improved on this by adding part-of-speech analysis, allowing only nouns and adjectives as keywords, which greatly improves precision:
// Run POS tagging over the content and build a word -> POS map
String result = factory.tag2String(content);
String[] cixing = result.split(" ");
Map<String, String> cixingMap = new HashMap<String, String>();
for (int i = 0; i < cixing.length; i++) {
    String[] re = cixing[i].split("/");
    if (re.length < 2) {
        continue; // skip tokens not in "word/tag" form
    }
    cixingMap.put(re[0], re[1]);
}
// Keep only multi-character keywords tagged as nouns or adjectives
ArrayList<String> result_with_cixing = new ArrayList<String>();
for (int i = 0; i < cixingWithContent.length; i++) {
    String s = cixingWithContent[i];
    String[] re = s.split("=");
    if (re[0].length() <= 1) {
        continue; // drop single-character keywords
    }
    String tag = cixingMap.get(re[0]);
    if ("名词".equals(tag) || "形容词".equals(tag)) {
        result_with_cixing.add(s);
    }
}
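The same filtering can be exercised standalone against a hand-written tagged string; the "word/tag" shape and the Chinese tag names below are assumptions inferred from the split calls and string comparisons above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class PosFilterDemo {
    public static void main(String[] args) {
        // Hand-written sample in the "word/tag word/tag" shape parsed above
        String result = "深度/名词 学习/名词 优秀/形容词 的/助词";
        Map<String, String> cixingMap = new HashMap<String, String>();
        for (String token : result.split(" ")) {
            String[] re = token.split("/");
            cixingMap.put(re[0], re[1]);
        }
        String[] candidates = {"深度=12.5", "学习=9.3", "优秀=4.1", "的=2.0"};
        ArrayList<String> kept = new ArrayList<String>();
        for (String s : candidates) {
            String tag = cixingMap.get(s.split("=")[0]);
            if ("名词".equals(tag) || "形容词".equals(tag)) {
                kept.add(s);
            }
        }
        System.out.println(kept); // [深度=12.5, 学习=9.3, 优秀=4.1]
    }
}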
The remaining work is to run similarity analysis between pairs of keywords, and similarity analysis on the content itself.
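As a starting point for that remaining step, here is a minimal cosine-similarity sketch over two keyword-weight maps. This is my own illustrative stand-in, not the project's implementation; the real version would operate on the keyword weights parsed above:

import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
    // Cosine similarity of two sparse keyword -> weight vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        if (na == 0 || nb == 0) return 0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<String, Double>();
        a.put("深度", 12.5); a.put("学习", 9.3);
        Map<String, Double> b = new HashMap<String, Double>();
        b.put("学习", 8.0); b.put("优秀", 4.1);
        System.out.println(cosine(a, b)); // nonzero: overlap only on 学习
    }
}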