Fast URL Deduplication (Part 2)

iteye_3952

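Continuing from the previous post: I first implemented this in Perl because it made input and output easy, but the modulo operation in Perl turned out to be too slow, so I switched to C.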

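Constants: SPACE is the amount of memory to allocate. Here I allocate 4 bytes (32 bits) for each of 50 million records, which is 200 MB in total.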

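#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>                   // MD5()
#include <openssl/sha.h>                   // SHA1()
const int SPACE = 50000000 * 4;            // size of the bit array in bytes
const int MAXNUM = SPACE * 8;              // total number of bits in the array
#define LINE_MAX 2048                      // maximum length of an input line
int bits[] = { 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 };  // mask for each bit of a byte
char *db = NULL;                           // the bit array itself
bool is_exist(const char *str, int len);   // defined further down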
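
To get a feel for what these constants buy, the usual Bloom filter estimate p ≈ (1 - e^(-kn/m))^k can be evaluated for m = MAXNUM bits, n = 50 million records, and k = 9 bits per record (the count used by the test function later in this post). This is only a rough sketch: the formula assumes independent, uniformly distributed hashes, and the function names are my own, not part of the original program.

```cpp
#include <cmath>
#include <cstdio>

// Standard Bloom filter estimate: p ~ (1 - e^(-k*n/m))^k
// m = bits in the array, n = records inserted, k = bits set per record.
double false_positive_rate(double m, double n, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Hypothetical helper: print the estimate for the constants used in this post.
void print_estimate() {
    const double m = 50000000.0 * 4 * 8;  // same value as MAXNUM
    const double n = 50000000.0;          // number of records planned for
    const double k = 9;                   // 4 words from MD5 + 5 words from SHA1
    std::printf("estimated false-positive rate: %g\n",
                false_positive_rate(m, n, k));
}
```

With these numbers the estimate comes out on the order of 10^-6, so a full filter should wrongly report only a few new URLs per million as duplicates.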

Main program: loop over every line of standard input and deduplicate it.

int main(int argc, char *argv[])
{
    db = new char[SPACE];
    memset(db, 0, SPACE);
    char line[LINE_MAX];
    while (fgets(line, LINE_MAX, stdin) != NULL) {
        int len = strlen(line);
        if (len > 0 && line[len - 1] == '\n') len--;   // drop the trailing newline
        if (len <= 0) continue;                        // skip empty lines
        if (!is_exist(line, len)) {
            // first time this line is seen: add your own handling code here
        }
    }
    return 0;
}

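Test function: instead of the 10 hashes described for the Bloom filter algorithm, I compute one MD5 digest (16 bytes) and one SHA1 digest (20 bytes) and treat the resulting 36 bytes as 9 32-bit hashes.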

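// Returns true if all 9 bits for str were already set. Otherwise sets the
// missing bits (which records str in the filter) and returns false.
bool is_exist(const char *str, int len) {
    unsigned int hashs[9];
    unsigned char buf[20];
    MD5((const unsigned char *)str, len, buf);     // 16-byte digest -> hashs[0..3]
    memcpy(hashs, buf, 16);
    SHA1((const unsigned char *)str, len, buf);    // 20-byte digest -> hashs[4..8]
    memcpy(hashs + 4, buf, 20);
    int k = 0;                                     // number of bits newly set
    for (int j = 0; j < (int)(sizeof(hashs) / sizeof(hashs[0])); j++) {
        int bitnum = hashs[j] % MAXNUM;
        int d = bitnum / 8;                        // byte index into db
        int b = bitnum % 8;                        // bit index within that byte
        char byte = db[d];
        // Parentheses are required: == binds tighter than &.
        if ((byte & bits[b]) == 0) {               // bit not set yet, so set it
            byte |= bits[b];
            db[d] = byte;
            k++;
        }
    }
    return (k == 0);
}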
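
For readers who want to try the same test-and-set idea without linking OpenSSL, here is a minimal self-contained sketch. It is not the code above: it swaps MD5 + SHA1 for two seeded 64-bit FNV-1a hashes combined by double hashing, and the class and method names (`BloomSketch`, `seen_or_insert`) are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: a small Bloom filter that tests and inserts in one
// call, like is_exist, but derives its k bit positions from two FNV-1a
// hashes (double hashing) instead of MD5 + SHA1.
class BloomSketch {
public:
    BloomSketch(std::size_t num_bits, int num_hashes)
        : bits_((num_bits + 7) / 8, 0), num_bits_(num_bits), k_(num_hashes) {}

    // Returns true if every bit for s was already set (probably seen before).
    // Otherwise sets the missing bits and returns false.
    bool seen_or_insert(const std::string& s) {
        const std::uint64_t h1 = fnv1a(s, 14695981039346656037ULL);
        const std::uint64_t h2 = fnv1a(s, 0x9e3779b97f4a7c15ULL) | 1;  // odd step
        bool all_set = true;
        for (int i = 0; i < k_; ++i) {
            const std::uint64_t bit =
                (h1 + static_cast<std::uint64_t>(i) * h2) % num_bits_;
            unsigned char& byte = bits_[bit / 8];
            const unsigned char mask =
                static_cast<unsigned char>(1u << (bit % 8));
            if ((byte & mask) == 0) {  // bit was clear: set it, remember s is new
                byte |= mask;
                all_set = false;
            }
        }
        return all_set;
    }

private:
    // 64-bit FNV-1a with the starting value as a parameter, so two different
    // seeds give two different hash values for the same string.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t seed) {
        std::uint64_t h = seed;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;  // FNV prime
        }
        return h;
    }

    std::vector<unsigned char> bits_;  // the bit array
    std::size_t num_bits_;             // number of usable bits
    int k_;                            // bits set per item
};
```

A filter sized like the one in this post would be `BloomSketch filter(1600000000, 9);`. Calling `filter.seen_or_insert(url)` returns false the first time a URL is offered and true on every later call with the same URL.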

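That is the core of the algorithm. In a real deployment you could poll a file on disk in a loop to read in the data to be deduplicated; that code is operating-system specific, so there is no need to write it out here.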
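
As a rough, portable illustration of the polling idea, the sketch below reads whatever complete lines have been appended to a file since the last call, using only the C++ standard library. The function name `read_new_lines` and the offset-tracking scheme are my own assumptions. It also assumes the file is only ever appended to and uses '\n' line endings; a production version would more likely use an OS notification API.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: return the complete lines appended to `path` since
// byte offset `offset`, and advance `offset` past them.
std::vector<std::string> read_new_lines(const std::string& path,
                                        std::streamoff& offset) {
    std::vector<std::string> lines;
    std::ifstream in(path, std::ios::binary);  // binary so offsets count bytes
    if (!in) return lines;                     // file not there yet
    in.seekg(offset, std::ios::beg);
    std::string line;
    while (std::getline(in, line)) {
        // Hitting EOF inside a line means its '\n' has not been written yet:
        // leave it unread so the next poll picks up the whole line.
        if (in.eof()) break;
        offset += static_cast<std::streamoff>(line.size()) + 1;  // +1 for '\n'
        lines.push_back(line);
    }
    return lines;
}
```

The caller would keep one offset per file, call this in a loop with a short sleep between polls, and pass each returned line to the test function.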