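I have recently been using ansj for word segmentation and looking at its keyword extraction, so here is a brief write-up of how it works. One open question: how should the keyword rules be adjusted for a specific kind of text? Discussion is welcome.
Ansj keyword extraction rules:
Factors that affect keywords:
part of speech, position, frequency
The KeyWordComputer class presets a score for certain parts of speech.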
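import java.util.HashMap;
import java.util.Map;

public class KeyWordComputer {
    // Preset score per part-of-speech tag
    private static final Map<String, Double> POS_SCORE = new HashMap<String, Double>();

    static {
        // Zero score: untagged tokens, punctuation, English, numerals
        POS_SCORE.put("null", 0.0);
        POS_SCORE.put("w", 0.0);
        POS_SCORE.put("en", 0.0);
        POS_SCORE.put("m", 0.0);
        POS_SCORE.put("num", 0.0);
        // High score: names, new words, organizations, other proper nouns
        POS_SCORE.put("nr", 3.0);
        POS_SCORE.put("nrf", 3.0);
        POS_SCORE.put("nw", 3.0);
        POS_SCORE.put("nt", 3.0);
        POS_SCORE.put("nz", 3.0);
        // Low score: idioms, adjectives, verbs
        POS_SCORE.put("l", 0.2);
        POS_SCORE.put("a", 0.2);
        POS_SCORE.put("v", 0.2);
        POS_SCORE.put("kw", 6.0); // keyword part of speech
    }
    // ... (rest of the class omitted)
}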
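As one possible answer to the question of tuning the rules for a particular text, here is a minimal sketch of a weight table that can be adjusted per corpus. PosWeightSketch, setWeight, weightOf and the fallback weight of 1.0 for unlisted tags are all assumptions made for illustration; none of them is ansj API, and only the preset values are taken from the table above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of ansj.
final class PosWeightSketch {
    // Assumption: tags missing from the table fall back to a neutral weight.
    static final double DEFAULT_WEIGHT = 1.0;

    private final Map<String, Double> weights = new HashMap<>();

    PosWeightSketch() {
        // A few of the presets listed above.
        weights.put("w", 0.0);
        weights.put("m", 0.0);
        weights.put("nr", 3.0);
        weights.put("v", 0.2);
        weights.put("kw", 6.0);
    }

    // Replace or add a weight, e.g. to raise verbs for an instruction-heavy corpus.
    void setWeight(String pos, double weight) {
        weights.put(pos, weight);
    }

    // Look up the weight for a tag, using the fallback for unlisted tags.
    double weightOf(String pos) {
        return weights.getOrDefault(pos, DEFAULT_WEIGHT);
    }

    public static void main(String[] args) {
        PosWeightSketch table = new PosWeightSketch();
        System.out.println(table.weightOf("nr")); // 3.0
        System.out.println(table.weightOf("n"));  // 1.0 (fallback)
        table.setWeight("v", 1.5);
        System.out.println(table.weightOf("v"));  // 1.5
    }
}
```

Keeping the table in an instance rather than a static final map is a design choice made here so that different kinds of text can each use their own weights.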
The default number of keywords is 5.
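To make the interplay of the three factors concrete, the following is a rough sketch of how part of speech, position and frequency could be combined and cut off at the top 5. The formula is an assumption made for illustration and is not ansj's actual scoring: each occurrence earns its POS weight times a position bonus that decays linearly from 2.0 at the start of the text to 1.0 at the end, and frequency enters by summing over occurrences. KeywordSketch, Token, occurrenceScore and topKeywords are hypothetical names, as is the weight of 1.0 for unlisted tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration; not ansj's implementation.
final class KeywordSketch {
    static final int DEFAULT_KEYWORD_COUNT = 5; // default stated in the text

    // A segmented token: surface form, POS tag, character offset in the text.
    static final class Token {
        final String word;
        final String pos;
        final int offset;

        Token(String word, String pos, int offset) {
            this.word = word;
            this.pos = pos;
            this.offset = offset;
        }
    }

    // Subset of the POS table; unlisted tags get an assumed weight of 1.0.
    private static final Map<String, Double> POS_SCORE = new HashMap<>();
    static {
        POS_SCORE.put("w", 0.0);
        POS_SCORE.put("nr", 3.0);
        POS_SCORE.put("nt", 3.0);
        POS_SCORE.put("v", 0.2);
    }

    // Assumed formula: POS weight times a bonus that favors early positions.
    // textLength must be positive.
    static double occurrenceScore(Token t, int textLength) {
        double posWeight = POS_SCORE.getOrDefault(t.pos, 1.0);
        double positionBonus = 1.0 + (double) (textLength - t.offset) / textLength;
        return posWeight * positionBonus;
    }

    // Sum scores per word (frequency), rank descending, keep the top n.
    static List<String> topKeywords(List<Token> tokens, int textLength, int n) {
        Map<String, Double> scores = new HashMap<>();
        for (Token t : tokens) {
            double s = occurrenceScore(t, textLength);
            if (s > 0) {
                scores.merge(t.word, s, Double::sum);
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("Alice", "nr", 0),
            new Token("visited", "v", 6),
            new Token("Tsinghua", "nt", 14),
            new Token(",", "w", 22),
            new Token("Alice", "nr", 24));
        // prints [Alice, Tsinghua, visited]
        System.out.println(topKeywords(tokens, 30, DEFAULT_KEYWORD_COUNT));
    }
}
```

In this sample the repeated person name ranks first, the organization name second, and the verb last, while the punctuation token is dropped because its weight is 0.0.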