Subject: Re: Text/HTML Similarity Algorithms
Answered By: eiffel-ga on 29 Apr 2004 05:48 PDT

Hi nathanntg,

Using computers for sophisticated similarity analysis of text documents is an area of active research, but straightforward similarity measurement is a "solved problem".

Jack Lynch, of the University of Pennsylvania, has written a web page that provides an overview of algorithms for text comparison, working from the simplest to the most sophisticated. The methods include:

- Measuring shared letters (discarded as being not useful)
- Measuring shared runs of letters
- Measuring shared words
- Measuring shared runs of words
- Measuring shared runs of word stems
- Measuring shared vocabularies
- Measuring shared meanings
- Measuring shared syntax, even when vocabulary differs

Lynch discusses the advantages and disadvantages of each algorithm, and provides open source code in C (for measuring shared runs of letters) and in Perl (for measuring shared vocabularies):

Text Analysis with Compare
http://www.english.upenn.edu/~jlynch/Computing/compare.html

Lynch dismisses as impractical the measuring of shared meanings, because it "would involve not only the most powerful processor, but an almost unimaginably large database -- the sort of thing that would require a large team working full-time for years". But he was writing in 1995, and since then processors have become more powerful, databases have become larger, and a large team has been working for years...

A paper published by Michael Lee, Brandon Pincombe and Matthew Welsh of the University of Adelaide discusses text similarity comparison using word-based, n-gram and Latent Semantic Analysis methods.
They correlate the results of these comparison methods with human judgements and find that the machine comparisons are useful, but inferior to human comparison:

A Comparison of Machine Measures of Document Similarity with Human Judgements
http://www.psychology.adelaide.edu.au/members/staff/michaellee/homepage/lee_pincombe_welsh.pdf

Further practical work on document comparison using Latent Semantic Analysis (LSA) has been done since then. Thomas Landauer describes LSA in detail in the following article. It doesn't include a complete algorithm, but it does contain a complete textual description of the method from which an algorithm can be derived:

Chapter 13. Landauer - The computational basis of learning
http://lsa.colorado.edu/papers/Ross-final-submit.pdf

Web applications have been written to demonstrate that the LSA concepts described in the above paper can be implemented. I found the "One to many comparison" application the most interesting. I entered the text of your question as the main text. For the texts to compare against, I entered the text of your clarification and an unrelated text. It rated your two pieces as semantically closely related, relative to the unrelated third piece:

Latent Semantic Analysis at Colorado University Boulder
http://lsa.colorado.edu/

If you want something more down-to-earth and less cutting-edge, you could consider Dick Grune's software and text similarity tester. This has been in use at the Vrije Universiteit Amsterdam for a number of years to check submitted assignments for copying. It compares texts in programming languages or in natural language.
It is available for free as C source code and as DOS binaries:

The software and text similarity tester SIM
http://www.cs.vu.nl/~dick/sim.html

The SIM algorithms are available as pseudocode:

Concise Report on the Algorithms in SIM
ftp://ftp.cs.vu.nl/pub/dick/similarity_tester/TechnReport

The above applications are designed to work with plain text files, whereas you also wish to work with HTML files. Two approaches are available:

1. Treat the HTML file as if it were a text file, in which case your comparisons will be based on similarity in the HTML markup as well as similarity in the text, or

2. Convert each HTML file to a text file (on the fly if necessary).

An easy way to convert HTML to text is to use a text-mode web browser that has a command-line "dump to file" option. The lynx browser can do this using the "-dump" option:

Lynx Information
http://www.lynx.browser.org/

You can try out the HTML-to-text functionality of the lynx browser online. Given a URL, it will display the text content of that web page:

Lynx Viewer
http://www.delorie.com/web/lynxview.html

I trust that this answer provides the information that you are seeking. Please request clarification if this answer does not yet meet your needs.

Google Search Strategy:

"compare two text files" "index of similarity"
http://www.google.com/search?q=%22compare+two+text+files%22+%22index+of+similarity%22

text similarity comparing OR compare OR comparison
http://www.google.com/search?q=text+similarity+comparing+OR+compare+OR+comparison

"text mode browser"
http://www.google.com/search?q=%22text+mode+browser%22

Regards,
eiffel-ga
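
As a rough illustration of approach 2 (convert the HTML to text, then compare), here is a minimal Python sketch. It is not Lynch's C program and not the SIM algorithm. It uses Python's standard html.parser module in place of lynx, and it scores similarity as the Jaccard index over runs of three letters, which is just one simple way to measure "shared runs of letters". The names html_to_text, letter_runs and similarity are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse all whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


def letter_runs(text, n=3):
    """Return the set of all runs of n characters in the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard index of shared letter runs: 0.0 (disjoint) to 1.0 (same set)."""
    runs_a, runs_b = letter_runs(a, n), letter_runs(b, n)
    if not runs_a and not runs_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(runs_a & runs_b) / len(runs_a | runs_b)
```

To compare two pages you would call similarity(html_to_text(page_a), html_to_text(page_b)). To follow approach 1 instead, pass the raw HTML strings straight to similarity, so that the markup counts towards the score. Because the score is based on sets of runs, it ignores how often each run occurs, and a real tool would probably want to weight by frequency.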