[Original] Improving Keyword Search: Level-by-Level Binary Search over a Sequential-List Dictionary


Tags: #algorithm #search #dictionary #character #string #structure

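This article presents an efficient vocabulary search algorithm. It builds an index over a dictionary and narrows the candidate entries one character at a time with binary search, which markedly improves search speed. In the author's test, searching 344 KB of text took 1.39 s, roughly 75 times faster than the traditional linear scan.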


Keywords: keyword search, sequential-list dictionary, binary search, level-by-level retrieval

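Problem restatement: we are given a lexicon of roughly 400,000 common words. For a given article, use the lexicon to count how often each of these words occurs, and list the words in descending order of occurrence count.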

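Ideas behind the improved algorithm:
1. An article usually contains far fewer distinct words than the 400,000 entries in the lexicon.
2. Once the lexicon has been sorted and indexed, binary search can locate a word quickly.
3. The candidate range is narrowed one character at a time. If the range is already empty at some character, no longer string that starts with the same characters can be in the lexicon either, so the search from this starting position can stop. (For example, if nothing matches any more once "forest" has been read, there is no point in reading further.)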
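The third idea can be sketched on a small in-memory word list. This is a minimal illustration, not the author's code: the helper names (`char_at`, `narrow`, `longest_live_prefix`) are hypothetical, the list is assumed to be sorted in `strcmp` order, and the range is half-open `[lo, hi)` rather than the inclusive range used in the article's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: character at position j of word w, or 0 past its end. */
static unsigned char char_at(const char *w, size_t j)
{
    return j < strlen(w) ? (unsigned char)w[j] : 0;
}

/*
 * Narrow the half-open range [*lo, *hi) of a sorted word list to the words
 * whose character at position j equals ch. Every word in the range is assumed
 * to share the same first j characters. Returns 1 if the new range is
 * non-empty, 0 otherwise.
 */
static int narrow(const char *const *words, size_t j, unsigned char ch,
                  size_t *lo, size_t *hi)
{
    size_t a = *lo, b = *hi, first;
    assert(a <= b);

    /* Lower bound: first word whose j-th character is >= ch. */
    while (a < b) {
        size_t m = a + (b - a) / 2;
        if (char_at(words[m], j) < ch) a = m + 1; else b = m;
    }
    first = a;

    /* Upper bound: first word whose j-th character is > ch. */
    b = *hi;
    while (a < b) {
        size_t m = a + (b - a) / 2;
        if (char_at(words[m], j) <= ch) a = m + 1; else b = m;
    }

    *lo = first;
    *hi = a;
    return first < a;
}

/* Number of leading characters of text for which the range stays non-empty. */
static size_t longest_live_prefix(const char *const *words, size_t n,
                                  const char *text)
{
    size_t lo = 0, hi = n, j = 0;
    while (text[j] != '\0' &&
           narrow(words, j, (unsigned char)text[j], &lo, &hi))
        j++;
    return j;
}
```

With the list `for, fore, forest, form, free`, the text `forestry` keeps a non-empty range for six characters and fails on the seventh, at which point the search from that starting position can stop.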

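The algorithm reduces the time complexity from O(m*n) to O(log2(m) * n), where m is the number of dictionary entries and n is the length of the text to search (the maximum word length is treated as a constant). In testing, searching a 344 KB text took 1.39 s, against 105 s for the linear scan, so the speedup is substantial.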
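As a rough check of these figures (back-of-the-envelope arithmetic from the numbers quoted above, not an additional measurement), one binary search over m ≈ 400,000 entries needs about log2(m) probes, and the two timings give the observed speedup:

```latex
\log_2 m = \log_2\!\left(4 \times 10^{5}\right) \approx 18.6,
\qquad
\frac{t_{\text{scan}}}{t_{\text{binary}}} = \frac{105\ \text{s}}{1.39\ \text{s}} \approx 75.5
```

The measured ratio is much smaller than m / log2(m). That is presumably because the asymptotic bounds ignore constant factors such as input handling and early mismatches in the linear scan.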
The implementation follows.

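I. First, build the dictionary (a binary file) from a text file. This covers importing the string data, sorting, removing duplicates, and creating the index table.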

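The dictionary file format is as follows:

1. File header (16 bytes):
------------------------------------------------------------------------------------------
| "MAODICT" string (8 bytes) | index start offset (4 bytes) | index end offset (4 bytes) |
------------------------------------------------------------------------------------------

2. String storage area:

Each string is terminated by '\0', and the strings are stored back to back.

3. Index area:

Format of each index entry (5 bytes):
----------------------------------------------------
| string offset (4 bytes) | entry length (1 byte) |
----------------------------------------------------

The strings are stored immediately after the file header, and the index area follows the string storage area.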
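To make the layout concrete, here is a minimal sketch that reads such an image from a byte buffer. Several points are assumptions not stated in the article: the 4-byte fields are little-endian, the 8-byte magic is "MAODICT" followed by a NUL byte, string offsets count from the start of the file, and the index end offset points one past the last entry. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HEADER_SIZE = 16, INDEX_ITEM_SIZE = 5 };

/* Decode a 4-byte little-endian field (the byte order is an assumption). */
static uint32_t read_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Encode a 4-byte little-endian field; handy for building test images. */
static void write_u32le(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Number of index entries, or -1 if the header does not look valid. */
static long dict_count(const unsigned char *img)
{
    uint32_t so, eo;
    assert(img != NULL);
    if (memcmp(img, "MAODICT", 8) != 0)   /* 7 letters + NUL = 8 bytes */
        return -1;
    so = read_u32le(img + 8);             /* index start offset */
    eo = read_u32le(img + 12);            /* index end offset (assumed exclusive) */
    if (so < (uint32_t)HEADER_SIZE || eo < so)
        return -1;
    return (long)((eo - so) / INDEX_ITEM_SIZE);
}

/* Pointer to the i-th word; its length byte is stored in *len. */
static const char *dict_word(const unsigned char *img, long i, unsigned *len)
{
    const unsigned char *item =
        img + read_u32le(img + 8) + (size_t)i * INDEX_ITEM_SIZE;
    *len = item[4];                       /* 1-byte entry length */
    return (const char *)(img + read_u32le(item));
}
```

Decoding the fields byte by byte avoids depending on `sizeof(long)` or on struct packing, both of which differ between compilers and platforms.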

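Structures for the file header and the index entry:

    // Dictionary file header
    typedef struct _DictHeader
    {
        char maodict[8];    // string "MAODICT"
        long so;            // index start offset
        long eo;            // index end offset
    } DictHeader;

    // Index item structure (5 bytes)
    // NOTE: the 5-byte size holds only with a 4-byte long/pointer and
    // 1-byte struct packing; the packing directive is not shown in this excerpt.
    typedef struct _IndexItem
    {
        union
        {
            long   offset;  // string offset
            char * str;     // string pointer (unused)
        };
        char length;        // string length
    } IndexItem;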

The import code is omitted here; see the textToBinaryFile() function in msearch.cpp in the attachment.
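The real import code is in the attachment and is not reproduced here. As a minimal sketch of just the sort and duplicate-removal steps, assuming the words are already loaded as an array of C strings (the name `sort_unique` is hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders words byte-wise, as strcmp does. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Sort the array in place, drop duplicate strings, and return the new count.
 * Only the pointers are rearranged; the strings themselves are not copied.
 */
static size_t sort_unique(const char **words, size_t n)
{
    size_t out, i;
    assert(words != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(words, n, sizeof words[0], cmp_words);

    out = 1;                          /* words[0] is always kept */
    for (i = 1; i < n; i++)
        if (strcmp(words[i], words[out - 1]) != 0)
            words[out++] = words[i];  /* keep the first of each run */
    return out;
}
```

`strcmp` compares characters as `unsigned char`, which matches the `(unsigned char)` cast used by the lookup code later in the article. The order must agree on both sides for binary search to work.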

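II. Using the dictionary file created above, write the search program. SearchTextFile() opens the dictionary by the file name passed in and maps it into memory (a "memory-mapped file"), and reads the text from the stream passed in. Starting from some position, it extends a candidate "word" one character at a time and looks it up; once the lookup "mismatches" at some length, the starting position moves to the next character. Because the stream cannot be rewound, the characters already read must be buffered, and after each mismatch the whole buffer is shifted forward by one character (memmove()). The algorithm uses two variables: j records the offset of the current character from the starting position, and k records the number of characters currently held in the buffer.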

The code for this part:

    j = 0;  // word char index
    k = 0;  // number of buffered chars
    do
    {
        j = 0;  // reset to zero
        si = 0;
        li = rcCount-1;
        for(j=0; ; j++)
        {
            while(k<=j) cbuf[k++] = fgetc(fp);
            if(cbuf[j]==EOF) break;
            ret = getCharIndex(dbuf, idx, cbuf[j], j, &si, &li);
            //======================================
            // if this is a complete word, add it
            if(ret && j==idx[si].length-1)
            {
                /* ... add to the result list (code omitted) ... */
            }
            //======================================
            if(!ret) break;
            else
            {
                if(li-si==0)
                    if(j==idx[si].length-1) break;
            }
        }
        // move buffer one step forward (regions overlap, hence memmove)
        if(k>1) memmove(cbuf, cbuf+1, (k-1)*sizeof(cbuf[0]));
        //printf("%d\n", k);
        k --;
    } while(cbuf[0]!=EOF);
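The buffering scheme can be tried in isolation with a self-contained sketch. It keeps the same `j`/`k` bookkeeping and the `memmove()` shift, but replaces the dictionary lookup with a hypothetical linear-scan helper `probe()` over a tiny word list, so it shows the sliding buffer only, not the binary search. `MAX_WORD` and `count_hits` are also illustrative names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { MAX_WORD = 255 };  /* a 1-byte length field caps entries at 255 chars */

/*
 * Hypothetical stand-in for the dictionary lookup (linear scan):
 * returns 2 if buf[0..len) is a word, 1 if it is only a proper prefix of
 * some word, and 0 if no word starts with it.
 */
static int probe(const char *const *words, size_t n, const int *buf, size_t len)
{
    int best = 0;
    size_t w, i;
    for (w = 0; w < n; w++) {
        i = 0;
        while (i < len && words[w][i] != '\0' &&
               (unsigned char)words[w][i] == buf[i])
            i++;
        if (i == len) {
            if (words[w][i] == '\0') return 2;  /* exact match */
            best = 1;                           /* live prefix */
        }
    }
    return best;
}

/* Count dictionary words found at every starting position of the stream. */
static size_t count_hits(FILE *fp, const char *const *words, size_t n)
{
    int cbuf[MAX_WORD + 1];  /* int, so that EOF can be stored */
    size_t j, k = 0, hits = 0;
    assert(fp != NULL);
    do {
        for (j = 0; j < MAX_WORD; j++) {
            while (k <= j) cbuf[k++] = fgetc(fp);  /* read ahead on demand */
            if (cbuf[j] == EOF) break;
            int r = probe(words, n, cbuf, j + 1);
            if (r == 2) hits++;
            if (r == 0) break;                     /* mismatch: stop extending */
        }
        /* Shift the buffer by one character; the regions overlap. */
        if (k > 1) memmove(cbuf, cbuf + 1, (k - 1) * sizeof cbuf[0]);
        k--;
    } while (cbuf[0] != EOF);
    return hits;
}
```

For the input `theworks` and the words `he, the, work, works`, the sketch reports four hits (`the`, `he`, `work`, `works`). Overlapping matches are counted because every character is tried as a starting position.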

III. Character-by-character binary search is the core algorithm of the search program. The code:

    /*
     * dbuf: data area pointer
     * idx: index area pointer
     * ch: current character
     * j: current character's position in word
     * _si, _li: previous range
     * return: 1 - found; 0 - not found
     */
    static inline int getCharIndex(
        const char * dbuf, const IndexItem * idx,
        int ch, int j, int * _si, int * _li)
    {
        int si = *_si;
        int li = *_li;
        int mi;
        int ssi, lli, mmi;

    #define GETCH(x) ( (unsigned char)*(dbuf + idx[x].offset + j) )

        if(ch < GETCH(si))
        {
            // above the upper border [not found - case 1]
            *_si = *_li = si;
            return 0;
        }
        else if(ch == GETCH(si))
        {
            // start position
            *_si = si;
            if(ch == GETCH(li))
            {
                // li is just the end
                *_li = li;
                return 1;
            }
            else
            {
                /* ch < GETCH(li) */
                // using binary search, find the end
                ssi = si;
                lli = li;
                while(lli-ssi>1)
                {
                    mmi = (ssi + lli) / 2;
                    if(ch < GETCH(mmi)) lli = mmi;
                    else ssi = mmi;
                }
                *_li = ssi;
                return 1;
            }
        }
        else
        {
            /* ch > GETCH(si) */
            if(ch > GETCH(li))
            {
                // below the lower border [not found - case 2]
                *_si = *_li = li+1;
                return 0;
            }
            else if(ch == GETCH(li))
            {
                *_li = li;
                // using binary search, find the start
                ssi = si;
                lli = li;
                while(lli-ssi>1)
                {
                    mmi = (ssi + lli) / 2;
                    if(ch <= GETCH(mmi)) lli = mmi;
                    else ssi = mmi;
                }
                *_si = lli;
                return 1;
            }
            else
            {
                /* ch < GETCH(li) */
                // the most common case
                while(li-si>1)
                {
                    mi = (si + li) / 2;
                    if(ch < GETCH(mi)) li = mi;
                    else if(ch > GETCH(mi)) si = mi;
                    else
                    {
                        /* == found */
                        // search the upper border
                        ssi = si;
                        lli = mi;
                        while(lli-ssi>1)
                        {
                            mmi = (ssi + lli) / 2;
                            if(ch <= GETCH(mmi)) lli = mmi;
                            else ssi = mmi;
                        }
                        *_si = lli;
                        // search the lower border
                        ssi = mi;
                        lli = li;
                        while(lli-ssi>1)
                        {
                            mmi = (ssi + lli) / 2;
                            if(ch < GETCH(mmi)) lli = mmi;
                            else ssi = mmi;
                        }
                        *_li = ssi;
                        return 1;
                    }
                }
                // not included [not found - case 3]
                *_si = *_li = li;
                return 0;
            }
        }
    }

IV. Running the program:

1. Usage:

    J:\Projects\cpp\msearch\Release>msearch -h
    Usage:
      msearch -c <source file>
          .... Convert text file to dictionary.
      msearch <dict file>
          .... Input text to search, ended with [Ctrl+Z].
      msearch -h
          .... Print help information.
    Examples:
      msearch -c English.txt
          .... Create English.dat.
      msearch English.dat <gpl3.txt >result.txt
          .... Search keywords in gpl3.txt and write results to result.txt,
               using dictionary English.dat

2. Sample run:

    J:\Projects\cpp\msearch\Release>msearch English.dat
    The licenses for most software and other practical works are designed
    to take away your freedom to share and change the works.
    ^Z
    Processed in 0.012s.
    Totally allocated memory: 35.06KB
    re -- 4
    or -- 3
    are -- 3
    an -- 3
    he -- 3
    and -- 2
    works -- 2
    work -- 2
    the -- 2
    to -- 2
    ha -- 2
    freed -- 1
    freedom -- 1
    free -- 1
    hang -- 1
    hare -- 1
    for -- 1
    her -- 1
    do -- 1
    ice -- 1
    lice -- 1
    license -- 1
    licenses -- 1
    designed -- 1
    of -- 1
    design -- 1
    other -- 1
    our -- 1
    practical -- 1
    change -- 1
    reed -- 1
    share -- 1
    sig -- 1
    sign -- 1
    signed -- 1
    so -- 1
    soft -- 1
    software -- 1
    take -- 1
    cense -- 1
    tic -- 1
    tical -- 1
    away -- 1
    war -- 1
    ware -- 1
    way -- 1
    act -- 1
    most -- 1
    you -- 1
    your -- 1
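The listing is ordered by descending count. The article's own result-list code is omitted, so the following is only a sketch of how such an ordering could be produced with `qsort()`. The `Hit` structure and function names are hypothetical, and the alphabetical tie-break is an assumption (the sample output above does not order ties alphabetically).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record: a matched word and its occurrence count. */
typedef struct {
    const char *word;
    unsigned    count;
} Hit;

/* Higher counts first; equal counts fall back to alphabetical order. */
static int cmp_hits(const void *a, const void *b)
{
    const Hit *x = (const Hit *)a;
    const Hit *y = (const Hit *)b;
    if (x->count != y->count)
        return x->count < y->count ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sort the result list in place by descending count. */
static void sort_hits(Hit *hits, size_t n)
{
    assert(hits != NULL || n == 0);
    if (n > 1)
        qsort(hits, n, sizeof hits[0], cmp_hits);
}
```

Comparing the counts explicitly, instead of subtracting them, avoids wrap-around with unsigned values.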

==========================================================
The complete program and source code can be downloaded here: http://down.chinaz.com/soft/24828.htm
==========================================================