A Multi-Pattern String Matching Algorithm Suitable for Chinese

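A while ago, a project required me to scan BBS posts for filtered words at posting time. The list of filtered words is given in advance and may be very large, so using regular expressions or matching each word against the text one by one would be inefficient. The algorithm I ended up with is described below; it only needs to scan the post content once.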


Initialize:

Put the banned words in a HashSet called bannedWordsSet, and put the first character and the length of every banned word in a HashMap called bannedWordsFirstChar. Since a HashMap in Java does not allow duplicate keys, the different lengths of banned words that share the same first character are stored together in a HashSet, which is the value mapped to that character.
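The initialization step could be sketched in Java roughly as follows. The field names bannedWordsSet and bannedWordsFirstChar come from the description above; the class name BannedWordIndex and the choice to skip null or empty words are assumptions made only for this illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical holder class; only the two field names come from the text.
class BannedWordIndex {
    // Every banned word, for constant-time (on average) membership checks.
    final Set<String> bannedWordsSet = new HashSet<>();
    // First character of a banned word -> all distinct lengths of banned
    // words starting with that character. Lengths are counted in UTF-16
    // code units, as returned by String.length().
    final Map<Character, Set<Integer>> bannedWordsFirstChar = new HashMap<>();

    BannedWordIndex(List<String> bannedWords) {
        for (String word : bannedWords) {
            if (word == null || word.isEmpty()) {
                continue; // assumption: empty entries are ignored
            }
            bannedWordsSet.add(word);
            bannedWordsFirstChar
                .computeIfAbsent(word.charAt(0), k -> new HashSet<>())
                .add(word.length());
        }
    }
}
```

Because the lengths are kept in a set, two banned words with the same first character and the same length add only one entry, so the processing stage never tries the same length twice at one position.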

 

Process:

Iterate over the text from the beginning and check every character. If the current character is a key in bannedWordsFirstChar, a substring starting at the current position may be a banned word. Get the possible lengths from bannedWordsFirstChar and, for each possible length, extract the substring of that length starting at the current position and check whether it is in bannedWordsSet. On a hit, a banned word has been found in the text: its start position is the current position and its length is the current length. Otherwise, go on to the next possible length. Once all possible lengths have been processed, move on to the next character of the text.
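A minimal Java sketch of this scan, under a few assumptions, is shown below. The class name BannedWordScanner, the method findAll and the int[] {start, length} result format are hypothetical. The sketch also assumes that every hit should be reported and that lengths running past the end of the text are skipped; the description does not settle either point.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class name; the two field names follow the description.
class BannedWordScanner {
    private final Set<String> bannedWordsSet = new HashSet<>();
    private final Map<Character, Set<Integer>> bannedWordsFirstChar = new HashMap<>();

    BannedWordScanner(Collection<String> bannedWords) {
        for (String word : bannedWords) {
            if (word == null || word.isEmpty()) continue;
            bannedWordsSet.add(word);
            bannedWordsFirstChar
                .computeIfAbsent(word.charAt(0), k -> new HashSet<>())
                .add(word.length());
        }
    }

    // Returns one {start, length} pair per banned word found in the text.
    List<int[]> findAll(String text) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Set<Integer> lengths = bannedWordsFirstChar.get(text.charAt(i));
            if (lengths == null) {
                continue; // no banned word starts with this character
            }
            for (int len : lengths) {
                if (i + len > text.length()) {
                    continue; // candidate would run past the end of the text
                }
                if (bannedWordsSet.contains(text.substring(i, i + len))) {
                    hits.add(new int[] {i, len});
                }
            }
        }
        return hits;
    }
}
```

For example, with the banned words "spam" and "spammer", scanning "buy spam from spammer" reports "spam" at position 4 and both "spam" and "spammer" at position 14. A caller that only needs to reject a post could return at the first hit instead of collecting all of them.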

 

Algorithm analysis:

Suppose the text length is m and there are n banned words. The initialization stage takes a*n time, where a is a constant covering the time to store a word in the HashSet, extract the first character of the word, and store that character and the word's length in the HashMap. The processing stage takes b*m time, where b is a constant covering the time to look up a character in the HashMap and, on a hit, the time to extract each candidate substring and look it up in the HashSet. So b is at most the time needed to process the character that has the most possible length values.

So the total time complexity of the algorithm is O(a*n + b*m), where a and b are the constants defined above; that is, the running time is linear in n and m.
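The bound on b can be written out a little more explicitly. The symbols k_max (the largest number of distinct lengths recorded for any first character), L_max (the longest banned word) and the cost terms c_map, c_sub and c_set are introduced here only for illustration and do not appear in the original analysis.

```latex
T(n, m) = a \cdot n + b \cdot m,
\qquad
b \le c_{\mathrm{map}} + k_{\max} \left( c_{\mathrm{sub}} + c_{\mathrm{set}} \right)
```

Here c_map is the cost of one HashMap lookup, c_sub the cost of extracting one substring, and c_set the cost of one HashSet lookup. Extracting and hashing a substring takes time proportional to its length, so treating b as a constant implicitly assumes that both k_max and L_max are small and bounded, which seems reasonable for a list of short filtered words.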

 

Possible Improvements:

The analysis assumes that a lookup in the HashSet or the HashMap takes constant time, so the performance of these two structures is critical to the algorithm. A hash function suited to Chinese strings may yield better performance.

 

Performance analysis:

