Signature Based Duplication Detection in Digital Librarie...

geoffreyhinton

Published 2021-05-29 16:02:24


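This article discusses a fast and effective method for detecting duplicate books. By running a similarity search over signature files, it can identify duplicate books even when the metadata contains spelling errors, disordered words, inconsistent abbreviations, or missing words. The method is efficient because all signatures are stored in binary format and the computation is carried out with low-level logical operations.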

Summary generated by CSDN using intelligent technology

Abstract:

Duplicates should be removed to improve both the efficiency and the effectiveness of an information retrieval system. In digital libraries, books come from varied sources distributed across different parts of the country, so duplicates can arise between scanning points. Duplicate books can be identified only from a book's metadata. If the metadata is incorrect, abbreviated, missing, or incomplete, duplicate detection becomes all the more difficult. This paper discusses a technique that detects duplicate books quickly and efficiently. Duplicate detection is performed by similarity search using the signature file method, which can detect duplicates even in the presence of typographical mistakes, word disorder, inconsistent abbreviations, and missing words. The similarity search is efficient because all signatures are stored in binary format and computations are done with low-level logical operations.
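
The abstract does not say how the signatures are built, so the following is only a minimal sketch of the general idea, not the paper's actual scheme. It assumes character trigrams taken per word, superimposed coding into a 256-bit signature, and a Jaccard-style score computed with bitwise AND/OR. The names `word_ngrams`, `signature`, `similarity`, and the constant `SIG_BITS` are hypothetical.

```python
import hashlib
import re

SIG_BITS = 256  # assumed signature width; the abstract does not give one


def word_ngrams(text, n=3):
    """Return the set of character n-grams of each word (assumed tokenization)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) <= n:
            grams.add(word)  # short words are kept whole
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def signature(text):
    """Superimpose one hashed bit per n-gram into a single integer bitmask."""
    sig = 0
    for gram in word_ngrams(text):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bit = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << bit
    return sig


def similarity(sig_a, sig_b):
    """Share of set bits common to both signatures, using only AND, OR and popcount."""
    union = bin(sig_a | sig_b).count("1")
    if union == 0:
        return 1.0  # two empty signatures are treated as identical
    return bin(sig_a & sig_b).count("1") / union


if __name__ == "__main__":
    original = signature("Introduction to Algorithms")
    for title in ("Introducton to Algorithms",   # typographical mistake
                  "Algorithms Introduction to",  # word disorder
                  "Organic Chemistry"):          # unrelated title
        print(title, round(similarity(original, signature(title)), 2))
```

Because the n-grams are collected into a set per word, reordering the words leaves the signature unchanged, while a single typo only disturbs the few bits contributed by the affected trigrams. A real system would pick a duplicate threshold empirically; none is assumed here.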
