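Today I finished the whole review system, wiring the individual algorithms together into one complete review pipeline.
For external articles (crawled articles):
The first check is on view count: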
if (pattern == 1) {
    System.out.println("View count for this article: " + knowledege.view_number);
    // View-count check
    int viewNum = Integer.parseInt(knowledege.view_number);
    // Below-average view counts pass with probability 0.35; the rest pass with probability 0.9
    double rejectProbability = viewNum < average_viewNum ? 0.65 : 0.1;
    if (r.nextDouble() < rejectProbability) {
        canPass = false;
        error_msg = "Insufficient view count";
        return canPass;
    }
}
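To make the probabilities in this stage easier to check, here is a minimal standalone sketch of the same gate. The class name `ViewCountGate`, the method `passes`, and the fixed seed are assumptions made for illustration; only the 0.65 and 0.1 rejection thresholds come from the snippet above.

```java
import java.util.Random;

final class ViewCountGate {
    // Rejection probabilities taken from the snippet above.
    static final double REJECT_BELOW_AVERAGE = 0.65;
    static final double REJECT_AT_OR_ABOVE_AVERAGE = 0.10;

    /** Returns true when the article passes the view-count stage. */
    static boolean passes(int viewCount, double averageViews, Random rng) {
        double rejectProbability = viewCount < averageViews
                ? REJECT_BELOW_AVERAGE
                : REJECT_AT_OR_ABOVE_AVERAGE;
        // nextDouble() is uniform on [0, 1), so this rejects with the chosen probability.
        return rng.nextDouble() >= rejectProbability;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed only to make the run repeatable
        int trials = 100_000, lowPasses = 0, highPasses = 0;
        for (int i = 0; i < trials; i++) {
            if (passes(50, 100.0, rng)) lowPasses++;
            if (passes(500, 100.0, rng)) highPasses++;
        }
        System.out.printf("below average: %.3f, at or above average: %.3f%n",
                lowPasses / (double) trials, highPasses / (double) trials);
    }
}
```

Over many trials the two printed pass rates should settle near 0.35 and 0.9, which is a quick way to confirm the thresholds are not inverted.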
If the article passes, the content is preprocessed:
// Content preprocessing
String content = knowledege.article_content;
// String.replace treats "\\s" as literal text, so use replaceAll to strip whitespace (including \n and \r)
content = content.replaceAll("\\s", "");
// Keep only printable ASCII and CJK ideographs (U+4E00 to U+9FA5)
StringBuilder kept = new StringBuilder(content.length());
for (int j = 0; j < content.length(); j++) {
    char ch = content.charAt(j);
    if ((ch >= ' ' && ch < 127) || (ch >= 0x4E00 && ch <= 0x9FA5)) {
        kept.append(ch);
    }
}
content = kept.toString().toLowerCase();
if (content.isEmpty()) {
    // Nothing left to review
    canPass = false;
    return canPass;
}
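The same cleanup can be done in a single pass, which makes it easy to test on its own. This is only a sketch: `ContentNormalizer` and `normalize` are hypothetical names, and `Locale.ROOT` is my own choice to keep lowercasing independent of the machine's locale.

```java
import java.util.Locale;

final class ContentNormalizer {
    /** Drops whitespace and control characters, keeps printable ASCII and U+4E00..U+9FA5, then lowercases. */
    static String normalize(String raw) {
        StringBuilder kept = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            boolean printableAscii = ch > ' ' && ch < 127; // excludes the space itself
            boolean cjkIdeograph = ch >= 0x4E00 && ch <= 0x9FA5;
            if (printableAscii || cjkIdeograph) {
                kept.append(ch);
            }
        }
        return kept.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u4e16\u754c is a two-character Chinese word; \uff01 is a full-width exclamation mark.
        System.out.println(normalize("Hello \u4e16\u754c\uff01\r\n\tOK"));
    }
}
```

Full-width punctuation falls outside both ranges, so it is removed together with the whitespace.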
Next, sensitive words are filtered with the DFA algorithm:
// Sensitive-word filtering
Set<String> sensitive_list = sf.getSensitiveWord(content, 1);
if (!sensitive_list.isEmpty()) {
    // The content contains sensitive words
    canPass = false;
    error_msg = "Contains sensitive words";
    return canPass;
}
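The `sf` filter class itself is not shown in this post. As a rough illustration of how a DFA-style word filter is commonly built, here is a small trie-based sketch; `SimpleWordFilter`, its `find` method, and the shortest-match behaviour are assumptions for the example, not the actual implementation behind `getSensitiveWord`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SimpleWordFilter {
    /** One state of the automaton: outgoing transitions plus an accepting flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    SimpleWordFilter(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) continue;
            Node node = root;
            for (char ch : word.toCharArray()) {
                node = node.next.computeIfAbsent(ch, k -> new Node());
            }
            node.endOfWord = true;
        }
    }

    /** Returns the listed words found in text, taking the shortest match at each start position. */
    Set<String> find(String text) {
        Set<String> hits = new LinkedHashSet<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int end = start; end < text.length(); end++) {
                node = node.next.get(text.charAt(end));
                if (node == null) break;          // no transition: not a listed word
                if (node.endOfWord) {
                    hits.add(text.substring(start, end + 1));
                    break;                        // stop at the shortest match
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SimpleWordFilter filter = new SimpleWordFilter(Arrays.asList("bad", "badge", "evil"));
        System.out.println(filter.find("abadgexevil")); // prints [bad, evil]
    }
}
```

Because the scan stops at the first accepting state, "badge" is reported as "bad" here; a longest-match variant would keep walking and remember the last accepting position instead.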
Next comes keyword similarity analysis, which uses the NGD (Normalized Google Distance) measure mentioned in the previous post:
ArrayList<String> keywords = getKeywords(content);
// Keyword similarity analysis
if (!isKeywordSimilar(keywords, knowledege.keywords)) {
    canPass = false;
    error_msg = "Keywords are not sufficiently related to the content";
    return canPass;
}
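The body of `isKeywordSimilar` is not reproduced here, so as a reminder of what NGD computes, below is a sketch of the standard formula only. The class `Ngd`, the parameter names, and the choice to return infinity for terms that never co-occur are assumptions for illustration.

```java
final class Ngd {
    /**
     * Normalized Google Distance between two terms.
     * fx, fy: number of documents containing each term
     * fxy:    number of documents containing both terms
     * n:      total number of documents in the index
     */
    static double distance(double fx, double fy, double fxy, double n) {
        if (fx <= 0 || fy <= 0 || fxy <= 0) {
            return Double.POSITIVE_INFINITY; // the terms never appear together
        }
        double logFx = Math.log(fx);
        double logFy = Math.log(fy);
        return (Math.max(logFx, logFy) - Math.log(fxy))
                / (Math.log(n) - Math.min(logFx, logFy));
    }

    public static void main(String[] args) {
        // Terms that always appear together have distance 0.
        System.out.println(distance(100, 100, 100, 10_000));
        // Terms that rarely co-occur are further apart.
        System.out.println(distance(1_000, 100, 10, 1_000_000));
    }
}
```

Smaller values mean the two terms are more closely related, so a keyword check would typically compare the distance against an upper threshold.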
Finally, a duplicate check is run:
// Text content similarity analysis
// isSimilar already walks the whole list, so a single call is enough
if (isSimilar(knowledege, have_passed_knowledge)) {
    canPass = false;
    error_msg = "Too similar to an already published article";
}
private static boolean isSimilar(Knowledge knowledege, ArrayList<Knowledge> have_passed_knowledge2) {
    TextSimilarity similarity = new CosineSimilarity();
    // Strip spaces, line breaks, and quotes before comparing
    String s1 = knowledege.article_content;
    s1 = s1.replace(" ", "").replace("\n", "").replace("\r", "").replace("\'", "").replace("\"", "");
    for (Knowledge other : have_passed_knowledge2) {
        String s2 = other.article_content;
        s2 = s2.replace(" ", "").replace("\n", "").replace("\r", "").replace("\'", "").replace("\"", "");
        // Compare the cleaned strings rather than the raw content
        double score = similarity.getSimilarity(s1, s2);
        System.out.println("Article content similarity: " + score);
        if (score > AVERGE_SCORE) {
            // Above the threshold: report a duplicate with probability 0.7
            if (r.nextDouble() > 0.3) {
                return true;
            }
        }
    }
    // Nothing was reported as similar: still flag about 1% of articles at random
    return r.nextDouble() <= 0.01;
}
This uses the cosine similarity from word2vec.
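For readers who have not met cosine similarity before, here is a minimal sketch of the measure on plain term-count vectors. It is not the word2vec-based `CosineSimilarity` class used above: `CosineSketch` is a hypothetical name, and counting single characters as terms is a simplification that stands in for real word segmentation.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    /** Counts each character as one term (a stand-in for proper word segmentation). */
    static Map<Character, Integer> termCounts(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char ch : text.toCharArray()) {
            counts.merge(ch, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between two count vectors; 0 when either vector is empty. */
    static double cosine(Map<Character, Integer> a, Map<Character, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Character, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(termCounts("review pipeline"), termCounts("pipeline review")));
        System.out.println(cosine(termCounts("abc"), termCounts("xyz")));
    }
}
```

Because the counts ignore order, two texts with the same characters in a different order score as identical, which is one reason a segmentation-aware or embedding-based measure is preferable for real duplicate detection.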
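With that, the review is complete. The result looks like this:
As of today the technical work is finished. Next comes entering the dataset. Once that is done, there will be a few more blog posts summarizing the work.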