Introduction to Information Retrieval: permuterm index

Latest recommended article published 2022-03-11 14:22:33

ylq339198

Category: Introduction to Information Retrieval. Tags: linux, other, kernel

Article link: https://blog.csdn.net/ylq339198/article/details/121581632

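The permuterm index is an index structure for handling wildcard queries. It appends $ to the end of each term and stores every rotation, e.g. ab$, $ab, b$a. A single-wildcard query such as *b is rotated to b$* and looked up; for a query with several wildcards such as a*b*, look up $a* first and then filter the results against a*b*. The approach can make the dictionary much larger, but it is effective in specific scenarios. The code was implemented on Ubuntu.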

The permuterm index is an index structure designed specifically for wildcard queries:

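Method: $ marks the end of a term (as in regular expressions). The term ab is therefore written as ab$, and every rotation of it (ab$, b$a, $ab) is stored in the index, each one pointing back to ab.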
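The rotation step can be sketched as follows. This is a minimal illustration, not the code from permuterm.c; the function names permuterm_rotation and print_rotations are hypothetical, and the caller is assumed to supply a large enough buffer.

```c
#include <stdio.h>
#include <string.h>

#define EOW '$' /* end-of-word marker, as in the article */

/* Writes the k-th rotation of word+EOW into out.
   Assumption: out has room for strlen(word) + 2 bytes. */
void permuterm_rotation(const char *word, size_t k, char *out)
{
    size_t n = strlen(word) + 1; /* length including EOW */
    for (size_t i = 0; i < n; i++) {
        size_t j = (i + k) % n;  /* source position in word+EOW */
        out[i] = (j == n - 1) ? EOW : word[j];
    }
    out[n] = '\0';
}

/* Prints every rotation of word together with the term it points to.
   Assumption: word is shorter than 63 characters. */
void print_rotations(const char *word)
{
    char key[64];
    size_t n = strlen(word) + 1;
    for (size_t k = 0; k < n; k++) {
        permuterm_rotation(word, k, key);
        printf("%s -> %s\n", key, word);
    }
}
```

Calling print_rotations("ab") lists the keys ab$, b$a and $ab, each mapped back to ab.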

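For a single-wildcard query such as *b, first append $ to get *b$, then rotate until * is at the end, giving b$*, and look up that prefix in the search tree. The key b$a matches, so ab satisfies the query.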
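The query rewrite for a single wildcard can be sketched like this. It is an illustration under assumptions: permuterm_prefix and has_prefix are hypothetical helpers, not functions from permuterm.c, and the trie lookup itself is replaced by a plain prefix comparison.

```c
#include <string.h>

#define EOW '$' /* end-of-word marker, as in the article */

/* Turns a query X*Y into the lookup prefix Y$X (the rotation with '*' last,
   minus the '*' itself). Returns 1 on success, 0 if the query does not
   contain exactly one '*' or out is too small. */
int permuterm_prefix(const char *query, char *out, size_t out_size)
{
    const char *star = strchr(query, '*');
    if (star == NULL || strchr(star + 1, '*') != NULL)
        return 0;
    size_t head = (size_t)(star - query); /* length of X */
    size_t tail = strlen(star + 1);       /* length of Y */
    if (head + tail + 2 > out_size)
        return 0;
    memcpy(out, star + 1, tail);          /* Y */
    out[tail] = EOW;                      /* $ */
    memcpy(out + tail + 1, query, head);  /* X */
    out[tail + 1 + head] = '\0';
    return 1;
}

/* Returns 1 if key starts with prefix, 0 otherwise. */
int has_prefix(const char *key, const char *prefix)
{
    return strncmp(key, prefix, strlen(prefix)) == 0;
}
```

For the query *b the prefix is b$; of the rotations of ab, only b$a starts with it, which is how ab is found.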

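For a query with several wildcards such as a*b*, first append $ to get a*b*$, then rotate it to $a*b*. Look up $a* first, then filter the results against the original pattern a*b*.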
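The two-stage handling of several wildcards might look like the sketch below. The names permuterm_outer_prefix, wildcard_match and filter_candidates are hypothetical, the candidate list is assumed to come from the prefix lookup in the index, and the recursive matcher is chosen for brevity rather than speed.

```c
#include <stdio.h>
#include <string.h>

#define EOW '$' /* end-of-word marker, as in the article */

/* For a query X*...*Z, builds the lookup prefix Z$X from the text before the
   first '*' and after the last '*'. Returns 1 on success, 0 otherwise. */
int permuterm_outer_prefix(const char *query, char *out, size_t out_size)
{
    const char *first = strchr(query, '*');
    const char *last = strrchr(query, '*');
    if (first == NULL)
        return 0;
    size_t head = (size_t)(first - query); /* length of X */
    size_t tail = strlen(last + 1);        /* length of Z */
    if (head + tail + 2 > out_size)
        return 0;
    memcpy(out, last + 1, tail);
    out[tail] = EOW;
    memcpy(out + tail + 1, query, head);
    out[tail + 1 + head] = '\0';
    return 1;
}

/* Returns 1 if text matches pat, where '*' matches any run of characters. */
int wildcard_match(const char *pat, const char *text)
{
    if (*pat == '\0')
        return *text == '\0';
    if (*pat == '*') /* '*' matches nothing, or absorbs one more character */
        return wildcard_match(pat + 1, text) ||
               (*text != '\0' && wildcard_match(pat, text + 1));
    return *text == *pat && wildcard_match(pat + 1, text + 1);
}

/* Prints the candidates that pass the filter and returns how many did. */
int filter_candidates(const char *query, const char *cands[], int n)
{
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (wildcard_match(query, cands[i])) {
            printf("%s\n", cands[i]);
            kept++;
        }
    }
    return kept;
}
```

For a*b* the lookup prefix is $a, so every term beginning with a is a candidate; the filter then keeps terms such as abaci and drops aardvark, which contains no b.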

Drawback: the dictionary becomes very large, because a term of n characters contributes n+1 rotated keys.
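To get a feel for the growth, the number of keys can be counted directly. This is a small sketch; permuterm_key_count is a hypothetical helper, and it counts rotations only, not the trie nodes needed to store them.

```c
#include <stddef.h>
#include <string.h>

/* Returns the number of permuterm keys generated for n words:
   each word contributes one rotation per character plus one for EOW. */
size_t permuterm_key_count(const char *words[], size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += strlen(words[i]) + 1;
    return total;
}
```

A single 11-letter term such as abandonment already yields 12 keys.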

I implemented the following code on Ubuntu:

Build command:

gcc -o permuterm_trie permuterm.c -std=gnu99

Word list to search (words_ordered.txt):

(Add or remove entries in this file as needed.)

a
aardvark
aardvarks
abaci
aback
abacus
abacuses
abaft
abalone
abalones
abandon
abandoned
abandoning
abandonment
abandons
abase
abased
abasement

The permuterm.c code follows:

#include <stdio.h>
#include <stdlib.h>	// malloc
#include <string.h>	// strdup
#include <ctype.h>	// isupper, tolower

#define MAX_DEGREE	27 // 'a' ~ 'z' and EOW
#define EOW			'$' // end of word



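// used in the following functions: trieInsert, trieSearch, triePrefixList
// maps 'a'..'z' to 0..25 and EOW to MAX_DEGREE-1; any other character gives an out-of-range index
#define getIndex(x)		(((x) == EOW) ? (MAX_DEGREE - 1) : ((x) - 'a'))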

// TRIE type definition
typedef struct trieNode {
	int 			index; // -1 (non-word), 0, 1, 2, ...
	struct trieNode	*subtrees[MAX_DEGREE];
} TRIE;


// Prototype declarations

/* Allocates dynamic memory for a trie node and returns its address to caller
	return	node pointer
			NULL if overflow
*/
TRIE *trieCreateNode(void);

/* Deletes all data in trie and recycles memory
*/
void trieDestroy( TRIE *root);

/* Inserts new entry into the trie
	return	1 success
			0 failure
*/
int trieInsert( TRIE *root, char *str, int dic_index);

/* Searches the trie for the requested key
	return	index in dictionary (trie) if key found
			-1 key not found
*/
int trieSearch( TRIE *root, char *str);

/* prints all entries in trie using preorder traversal
*/
void trieList( TRIE *root, char *dic[]);

/* prints all entries starting with str (as prefix) in trie
	ex) "abb" -> "abbas",