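My company recently assigned me a Chinese word segmentation project, so I gathered some material and wrote some preliminary code. I am posting it here as a summary.
Segmentation development approach
Title: 林德璋PK Martin:“RUP是楷书,XP是草书”对吗? (roughly: Lin Dezhang vs. Martin: is it right that "RUP is regular script, XP is cursive script"?)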
Step 1: First strip out the characters (and the odd filler word) that cannot form part of a useful term.
String[] arrayChineseRemove = {
"的","吗","么","啊","说","对","在","和","是",
"被","最","所","那","这","有","将","会","与",
"於","于","他","她","它","您","为","欢迎"
};
Current string: 林德璋PK Martin:“RUP楷书,XP草书”?
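The post lists the stop entries but not the code that applies them. The following is a minimal sketch of one way to do it, assuming plain substring deletion is what was intended; the class name StopCharFilter and the helper removeStopChars are hypothetical and do not come from the original code.

```java
import java.util.Arrays;
import java.util.List;

class StopCharFilter {
    // Same entries as arrayChineseRemove in the post.
    static final List<String> STOP = Arrays.asList(
        "的", "吗", "么", "啊", "说", "对", "在", "和", "是",
        "被", "最", "所", "那", "这", "有", "将", "会", "与",
        "於", "于", "他", "她", "它", "您", "为", "欢迎");

    // Hypothetical helper: deletes every occurrence of each stop entry.
    static String removeStopChars(String text) {
        String result = text;
        for (String s : STOP) {
            result = result.replace(s, "");
        }
        return result;
    }

    public static void main(String[] args) {
        String title = "林德璋PK Martin:“RUP是楷书,XP是草书”对吗?";
        // Prints: 林德璋PK Martin:“RUP楷书,XP草书”?
        System.out.println(removeStopChars(title));
    }
}
```

Blind deletion like this also removes these characters when they sit inside real words (for example 会 in 会议), so it is only suitable as a rough first pass.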
=============================================================
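Step 2: Match the English words, collect them in an array, and replace each word's original position with a space.
// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// Alternatives: decimal numbers, tokens ending in a decimal number, hyphenated words, plain ASCII words.
// The backslashes were shown as "//" and the decimal point was unescaped; both are fixed here.
Pattern Epattern = Pattern.compile("\\d+\\.\\d+|\\w+\\d+\\.\\d+|\\w+-\\w+|\\w+");
Matcher Ema = Epattern.matcher(str);
while (Ema.find()) {
    eWord.add(Ema.group()); // eWord is a List<String> declared elsewhere
}
// Blank out all matches in one pass. Calling str.replace() inside the loop
// would also overwrite the same letters wherever they occur inside longer tokens.
str = Ema.replaceAll(" ");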
Current string: 林德璋 :“ 楷书, 草书”?
English array: [PK,Martin,RUP,XP]
=============================================================
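Step 3: Join what remains of the string, dropping punctuation and redundant spaces.
// Keep CJK ideographs (U+4E00 to U+9FA5) and whitespace; everything else is dropped.
Pattern Cpattern = Pattern.compile("[\\u4e00-\\u9fa5]|\\s");
Matcher Cma = Cpattern.matcher(str);
while (Cma.find()) {
    // Skip a space whose previous kept match was also a space, so runs collapse to one.
    // cWord and isSpace are assumed to be Strings initialised to "" before the loop.
    if (!(Cma.group().equals(" ") && isSpace.equals(" "))) {
        cWord = cWord + Cma.group();
        isSpace = Cma.group();
    }
}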
Current string: 林德璋 楷书 草书
=============================================================
Step 4: Apply Chinese word segmentation to group the string into usable words, and collect them in an array. (unfinished)
·················
Current string: 林德璋
Chinese array: [楷书,草书]
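Step 4 is marked unfinished, so what follows is only a sketch of one way to arrive at the state shown above, not the author's method. It uses forward maximum matching against a dictionary: at each position it tries the longest candidate first and shortens it until a dictionary word is found. The two-word dictionary, the MAX_WORD_LEN limit, and the names DictionaryExtractor and extract are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictionaryExtractor {
    // Hypothetical toy dictionary; a real one would be loaded from a word list.
    static final Set<String> DICT = new HashSet<>(Arrays.asList("楷书", "草书"));
    static final int MAX_WORD_LEN = 4; // assumed upper bound on word length

    // Scans left to right, trying the longest candidate first.
    // Dictionary hits are appended to words; all other characters are
    // returned as the leftover string (trimmed of surrounding whitespace).
    static String extract(String text, List<String> words) {
        StringBuilder leftover = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(MAX_WORD_LEN, text.length() - i); len >= 2; len--) {
                if (DICT.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                words.add(text.substring(i, i + matched));
                i += matched;
            } else {
                leftover.append(text.charAt(i));
                i++;
            }
        }
        return leftover.toString().trim();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        String rest = extract("林德璋 楷书 草书", words);
        System.out.println(rest);  // 林德璋
        System.out.println(words); // [楷书, 草书]
    }
}
```

The leftover 林德璋 is a personal name that the toy dictionary does not contain, which is why it stays behind as unsegmented text.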
=============================================================
Step 5: Query the string with a forward or reverse (maximum) matching algorithm. (unfinished)
·················
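Since step 5 is also unfinished, here is a hedged sketch of what forward and reverse maximum matching could look like side by side. The sample sentence 研究生命起源 and its four-word dictionary are a common textbook illustration and are not taken from this post; the class MaxMatch, its method names, and MAX_WORD_LEN are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatch {
    static final int MAX_WORD_LEN = 4; // assumed upper bound on word length

    // Forward maximum matching: start at the left edge and take the longest
    // dictionary word; fall back to a single character if nothing matches.
    static List<String> forward(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(MAX_WORD_LEN, text.length() - i);
            while (len > 1 && !dict.contains(text.substring(i, i + len))) {
                len--;
            }
            out.add(text.substring(i, i + len));
            i += len;
        }
        return out;
    }

    // Reverse maximum matching: the same idea, but scanning from the right edge.
    static List<String> reverse(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_WORD_LEN, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            out.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(out); // words were collected right to left
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(forward("研究生命起源", dict)); // [研究生, 命, 起源]
        System.out.println(reverse("研究生命起源", dict)); // [研究, 生命, 起源]
    }
}
```

The two directions disagree on this input, which is the usual reason for running both and then choosing between the results, for example by preferring the one with fewer single-character fragments.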