Abstract:
Duplicates should be removed to improve both the efficiency and the effectiveness of an information retrieval system. In digital libraries, books come from varied sources distributed across different parts of the country, so duplicates can arise between scanning points. Duplicate books can be identified only from their metadata. When that metadata is incorrect, abbreviated, missing, or incomplete, duplicate detection becomes all the more difficult. This paper discusses a fast and efficient technique for detecting duplicate books. Duplicates are detected by similarity search using the signature file method, which can find duplicates despite typographical mistakes, changed word order, inconsistent abbreviations, and even missing words. The similarity search is efficient because all signatures are stored in binary format and comparisons are computed with low-level logical operations.
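As a rough illustration of the idea, the sketch below builds a binary signature for a metadata string and compares two signatures using only bitwise AND/OR and bit counts. The abstract does not say how the signatures are constructed, so every detail here is an assumption made for illustration: the character trigrams, the 256-bit width, the MD5-based choice of bit position, the Jaccard-style score, and the names `SIGNATURE_BITS`, `signature`, and `similarity` are all hypothetical.

```python
import hashlib

SIGNATURE_BITS = 256  # assumed signature width; not taken from the paper


def _trigrams(text):
    """Return the set of character trigrams of each word (assumed feature choice)."""
    # Lowercase and replace punctuation with spaces so only words remain.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    grams = set()
    for word in cleaned.split():
        padded = f" {word} "  # pad so word boundaries contribute trigrams too
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def signature(text):
    """Superimpose one bit per trigram into a fixed-width integer signature."""
    sig = 0
    for gram in _trigrams(text):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bit = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        sig |= 1 << bit  # logical OR sets the bit for this trigram
    return sig


def similarity(sig_a, sig_b):
    """Shared set bits divided by total set bits (Jaccard-style, assumed measure)."""
    union = sig_a | sig_b
    if union == 0:
        return 0.0
    return bin(sig_a & sig_b).count("1") / bin(union).count("1")


if __name__ == "__main__":
    original = signature("Introduction to Information Retrieval")
    with_typos = signature("Introducton to Informaton Retrieval")
    reordered = signature("Information Retrieval Introduction to")
    unrelated = signature("Organic Chemistry Volume Two")
    print("typos:    ", similarity(original, with_typos))
    print("reordered:", similarity(original, reordered))
    print("unrelated:", similarity(original, unrelated))
```

Because the trigrams are collected per word into a set, word order does not affect the signature, and a single typographical error changes only a few bits. In practice a pair would be flagged as a likely duplicate when its score exceeds a threshold, which would have to be chosen empirically for the collection at hand.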