Clustering With Search Engines (Tara Calishain)

Original post, May 16, 2004, 13:40:00
Tara Calishain has authored or co-authored several books on using the Internet, including The Lawyer's Guide to Internet Research. She is the editor of ResearchBuzz, a free weekly newsletter on Internet search offerings and search engine news.  Tara is also the author of LLRX Buzz, a weekly column on new web sites and services focused on the legal community.

Published June 3, 2002


Search engines still aren't as smart as we'd like 'em to be. Sure, Google's great, and Yahoo comes in real handy sometimes, but sometimes your search terms just aren't finding what you're looking for.

Enter clustering. With clustering, search engines gather results into groups around a certain theme, or in some cases just provide you with related keywords that perhaps you wouldn't have thought of yourself, helping you zero in on your goal.

The Internet Archive (IA) is a virtual time machine. A non-profit company, the IA is working to "prevent the Internet - a new medium with major historical significance - and other 'born-digital' materials from disappearing into the past." To date, the archive's collection consists of 10 billion web pages, 16 million Usenet postings, 360 archival movies, and 5,000 pages from Arpanet (from the U.S. Department of Defense). Not only is IA a wonderful way to preserve the Internet, but it is most helpful in answering reference questions and has been my assistant (or should that be the other way around?) in many a legal research project.

In Part I of Clustering With Search Engines, we'll look at regular search engines that cluster -- and boy, are there a lot of 'em! In Part II of this article, we'll look at meta-search engines that cluster as well as specialty clustering search engines and a search engine that is still offering clustering on a limited basis.

We'll start with one you might not have heard of yet: Google Labs' clustering agent, Google Sets.

Google Sets --

Google Sets doesn't provide search results. Instead, it helps you find similar terms to the ones you've already entered, letting you create more complex queries in one area.

Enter a couple of words - Tamoxifen and Arimidex work; they're drugs used to treat breast cancer. You'll get a small set of results, but it'll include items you might not have heard of. Be sure to click on them to get Google search results to see how they're related to your original search terms.

Let's do a more general example -- say dog breeds. Enter collie, chihuahua, and german shepherd in the set boxes. You'll get back an enormous list of dog breeds. You don't want to use all of these, of course, but it'll give you an idea of how to narrow your search.
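Google doesn't say how Sets comes up with its lists, so here is only a toy sketch of the general idea, assuming that related terms tend to show up together in lists. The `LISTS` data and the `expand_set` function are made up for illustration and are not Google's method.

```python
from collections import Counter

# Hypothetical lists of terms, standing in for lists found on Web pages.
LISTS = [
    ["collie", "chihuahua", "german shepherd", "beagle", "poodle"],
    ["collie", "german shepherd", "border collie", "corgi"],
    ["chihuahua", "poodle", "pug", "beagle"],
    ["tamoxifen", "arimidex", "femara"],
]


def expand_set(seeds, lists=LISTS):
    """Suggest terms that share a list with the seeds, most shared first."""
    seeds = {s.lower() for s in seeds}
    scores = Counter()
    for items in lists:
        overlap = len(seeds & set(items))
        if overlap == 0:
            continue  # this list has nothing in common with the seeds
        for item in items:
            if item not in seeds:
                # Weight each candidate by how many seeds share its list.
                scores[item] += overlap
    # Highest score first, then alphabetical so the order is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked]


suggestions = expand_set(["collie", "chihuahua", "german shepherd"])
print(suggestions)
```

Terms that sit in lists alongside more of your seed words float to the top, which is roughly why entering three breeds instead of one gives a more focused set.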

Use Google Sets to build queries when you're looking for similar items, or to brainstorm how to put a search together. The other search engines in this article cluster in a more traditional way; we'll start with WiseNut.

WiseNut --

WiseNut is a full-text search engine that was recently bought by LookSmart. Enter a search in it -- we'll use "neurosurgery" as the primary example for the rest of the article -- and you'll see that the search results include a black area at the top of the page which has related topics (neurosurgery university, pediatric neurosurgery, etc.) and a number of results. WiseNut calls this the WiseGuide. Some results have a + beside them; click on the + for subtopics. The subtopics will show up in a gray area underneath the clustered results.

There's also a [search this] link next to each of the clustered results, which runs another search with those keywords. Those keywords take you to a different set of clustered results in addition to Web page results, and so on and so on.

Teoma --

Teoma was recently purchased by Ask Jeeves, and has gotten a lot of press as a potential "Google Killer." While I don't think I'd go that far, it does have interesting clustering technology.

Run the neurosurgery search and you'll get four sets of results. Top left are sponsored results. Bottom left are Web site (non-sponsored) results. Top right are the suggestions for refining the result (that's what we'll focus on), and bottom right are the "Link Collections from Experts and Enthusiasts," as Teoma calls them. If you're just looking for general information, use the link collections. If you're interested in narrowing your search, though, use the suggestions.

Just click on one and your search will be run again, with the suggested term you clicked on included in the query. You'll get a different set of site results, suggestions, and expert link collections, too.

Infonetware --

This site isn't a search engine per se but is rather a demonstration of Infonetware's "RealTerm Technology."

Enter a search term at the top of the page. The results page is framed. The area on the left provides you with topics related to your search term, while the frame on the right shows the Web page search results. The topics have a number in parens beside them that shows how many results are in that particular topic.

Click on a topic and the results for that topic will appear in the right frame. With some of the terms, you'll see subtopics that allow you to narrow your search results even more.
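To make the topics-with-counts idea concrete, here's a minimal sketch. It assumes the topics are just keywords matched against result titles; how Infonetware actually picks its topics isn't described here, and the `RESULTS`, `TOPICS`, and `group_by_topic` names are invented for the example.

```python
from collections import defaultdict

# Hypothetical result titles for a "neurosurgery" search.
RESULTS = [
    "Pediatric Neurosurgery at a Children's Hospital",
    "Department of Neurosurgery, University Medical Center",
    "Pediatric Neurosurgery Fellowship Programs",
    "University Neurosurgery Residency",
    "History of Neurosurgery",
]

# Hypothetical topic keywords; a real engine would derive these itself.
TOPICS = ["pediatric", "university", "residency"]


def group_by_topic(results, topics):
    """Map each topic to the result titles that mention it."""
    groups = defaultdict(list)
    for title in results:
        words = title.lower().replace(",", " ").split()
        for topic in topics:
            if topic in words:
                groups[topic].append(title)
    return dict(groups)


groups = group_by_topic(RESULTS, TOPICS)
for topic, titles in groups.items():
    # The count in parens shows how many results sit under the topic.
    print(f"{topic} ({len(titles)})")
```

Note that one result can land under more than one topic, and a result that matches no topic (the history page here) only shows up in the full listing.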

While Infonetware works with full-text searching, the Oingo engine uses the Open Directory Project and offers suggestions for searching.

Oingo --

Since Oingo uses the Open Directory Project as its search source, it's already clustered in a way. (ODP is a searchable subject index like Yahoo.) When you do a search, the search results page will first give you a drop-down list of potential meanings for your search, if any. Beneath that is a list of categories which relate to your search (listed in order of relevance). Finally come site results from the directory itself.

Unfortunately, the suggestions are limited; searching for neurosurgery provides very few suggestions. Only when you search for a more general term does Oingo's usefulness come through. Searching for Rose, for example, provides several suggestions (plant life, pink wine, several different American towns, etc.) and a manageable list of categories.

If you pick a suggested definition, Oingo will run a search again using the definition you specified. All the definitions I looked at for "rose" provided just category results, not results of individual sites. This is a good one to try if you're searching for something that's in a pretty broad category, like flowers, trees, animals, etc.

AlltheWeb --

Now that the Northern Light Web search is no longer publicly available (supposedly), my favorite search engine that nobody remembers is AlltheWeb. AlltheWeb provides two ways to narrow search results. They're both on the right side of the results screen.

The first way is FAST Topics, which apparently uses both ODP topics and dynamically generated topics. Click on a topic and you'll get a list of Web sites related to that topic.

There's also a "Narrow Your Search" option that lists search terms related to your search. Click on one of those and your search will be run again with the term you clicked. Not all search terms have both Topics and Narrow Your Search terms, but all the ones I looked at had either one or the other.
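Narrowing by clicking a suggested term boils down to running the same query with one more term attached. Here's a small sketch of that, assuming multi-word suggestions get quoted as a phrase; the `narrow` function and the example terms are hypothetical, not anything AlltheWeb documents.

```python
def narrow(query, term):
    """Return the query with the clicked term added to the end."""
    if " " in term:
        # Assumption: keep a multi-word suggestion together as a phrase.
        term = f'"{term}"'
    return f"{query} {term}"


first = narrow("neurosurgery", "pediatric")
second = narrow(first, "residency programs")
print(first)
print(second)
```

Each click produces a longer, more specific query, which is why the result set shrinks as you keep going.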

That's it. Next week we'll look at meta-clusterers, and a full-text search engine that's still testing its clustering.


In part one of this article we took a look at general search engines that offer clustering features. In this episode we're going to look at one more general search engine, AltaVista, which is testing clustering but not yet offering it to the general public. Then we'll take a look at a few meta-search engines that cluster, and a specialty search engine that clusters.

AltaVista --

You may remember that several weeks ago AltaVista was testing their clustering technology with a small percentage of their users. They're still testing it, but I was able to take a second look at it.

AltaVista's paraphrase looks a little like AlltheWeb's recommended terms results; once you run a search, AltaVista's recommendations for narrowing down search results show up at the top of the page. A search for "neurosurgery" shows about a dozen recommended terms, including brain, functional results, and Johns Hopkins. Clicking on one of the terms to narrow down the search leads to another collection of recommended narrowing terms (clicking on Johns Hopkins leads to suggestions that include pediatric neurosurgery, Johns Hopkins Hospital, and Johns Hopkins University), and so on.

As I mentioned, this is not yet publicly available, but I like the suggestions it makes. If you use AltaVista keep an eye out for it.

In addition to AltaVista and many other general search engines, there are some meta-search engines that cluster their results. Vivisimo is probably the most famous, but there are other ones available too.

Vivisimo --

Vivisimo has a very simple front page, but the search results are organized in groups. A search for neurosurgery provides 163 results. On the left side of the screen are the groups of results, which in this case include Neurosurgeons, Programs, and Nervous System. Click on the + beside the search results to get narrower and narrower search results, until you get to actual page listings. Click on the page title and get the page on the right side of the screen. This page design makes it really easy to explore several categories without "losing your place."

Don't forget to check out Vivisimo's advanced search, which allows you to specify the search engines you want to use and how many results you want (the more results you specify, the more interesting the categories get -- that's what my experimenting showed, anyway). You can also specify what language your search results should be in and how you want your pages to display (in a frame, in a window, or in a new window). There's even a filter for removing offensive content (though that does limit the number of search engines available).

While Vivisimo is fairly well known, Query Server is more just a demo site. But it's a demo site worth looking at -- it offers clustering search for several different categories of Web search.

Query Server --

Query Server offers several different types of search on the left side of the front page. You'll see links to search there for Web, News, Health, Money, and Government. Each of these searches clusters results, and they all have pretty much the same interface. But they each delve into different resources.

Search results are presented in a frame on the right side of the site. The top of the frame has a query box. Below that is a listing of the search engines queried. Below that is a listing of the groups that search results were clustered into, while below that are the results themselves. Results are divided by cluster and assigned scores based on how relevant they are. A search for "neurosurgery" provided several different clusters, including Cyber Museum of Neurosurgery, UCLA Neurosurgery, and Harvard Medical School.

The other search engines provide results in much the same way, but I encourage you to check out each engine, and especially the small customize link on the lower right of each query box. The customize link lets you specify the engines used, whether you want to search for ALL or ANY of the terms given, how many results you want in total, and how long you want Query Server to search.
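If the ALL versus ANY choice is unfamiliar, this small sketch shows the difference. It assumes simple whole-word matching against page text; the `matches` function and the `PAGES` data are made up for illustration and say nothing about how Query Server itself matches terms.

```python
def matches(text, terms, mode="ALL"):
    """Check whether a text contains ALL or ANY of the search terms."""
    words = set(text.lower().split())
    hits = [term.lower() in words for term in terms]
    if mode == "ALL":
        return all(hits)
    if mode == "ANY":
        return any(hits)
    raise ValueError("mode must be 'ALL' or 'ANY'")


# Hypothetical page texts.
PAGES = [
    "pediatric neurosurgery clinic",
    "neurosurgery residency program",
    "pediatric dentistry",
]

terms = ["pediatric", "neurosurgery"]
all_hits = [page for page in PAGES if matches(page, terms, "ALL")]
any_hits = [page for page in PAGES if matches(page, terms, "ANY")]
print(all_hits)
print(any_hits)
```

ALL gives you fewer, tighter results; ANY casts a wider net and leaves more for the clustering to sort out.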

Surfwax --

Before you start playing with Surfwax, I have to tell you something: I have never been able to get Surfwax to work except with Internet Explorer.

Surfwax is a service that offers both subscription-based and free services. The subscription-based service gives you access to more search engines and more features, but there is some searching that you can do for free.

After you've done a search, you'll see a "focus" link in the upper-left corner. Click on the little box beside the word. You'll get "focus words" that you can add to the search. Focus words are divided into narrower or broader, and the big difference between this list and others you've seen is that this list contains generic words, and not links to specific people or places like Johns Hopkins or Harvard Medical School. This makes for a different set of search results than the other ones I've mentioned in this article.

Surfwax has been around for a while, but it's not been around nearly as long as the old reliable Northern Light. And while Northern Light no longer offers Web search, it still uses its clustering technology for news search.

Northern Light News Search --

I'm not able to use neurosurgery for this example since a search has to have a certain number of results in order to be classified into folders.

"George Bush" works well for a search, though. Search results are divided into several different folders, including stock markets, macroeconomics, terrorism, and Pakistan. Pick a folder and you'll get the results that appear in that folder. Unfortunately the folder listing does not provide information about what's in a particular folder, but there are subfolders provided if the topic is broad enough. It also appears that the search results are listed in order of date, which is handy if you're looking for recent stuff.

You can't always come up with a search query that's specific enough that you'll only find a few search results. In that case, using clustering search engines can break out several hundred results into manageable packages, or provide you suggestions that reduce the ocean of information to a reasonable level. Enjoy!

