•Top 10 Challenges in Search Engines

http://www.searchforum.org.cn/seminar/lectures/2006-9-25-JirongWen-Search%20Engine%20Overview.PDF

 

#1: Spamming and Content Quality

•Click => Money, Spam => Click ==> Spam => Money
•An endless game between spammers and search engines
•How to determine the quality of web content?
  - Traditional IR assumes every document is authoritative and accurate.

•Prove either of the following propositions:
  - There are spam-immune ranking algorithms
  - There is NO spam-immune ranking algorithm

#2: Data Acquisition
•Growth rate of the Web >> growth rate of search engines' indexing capability.
•Re-crawl frequently updated pages: news, blogs, BBS
•Dynamic content: deep Web, Web 2.0
•Crawling is the first step of search, but its importance is largely ignored by academia.
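
The re-crawl point above can be illustrated with a toy scheduler. This is only a minimal sketch: the URLs and the per-page revisit intervals (in hours) are hypothetical, and a real crawler would have to estimate change rates rather than be given them.

```python
import heapq

def schedule_recrawls(pages, horizon):
    """Return (time, url) crawl events up to `horizon`.

    pages: dict mapping url -> assumed revisit interval in hours.
    Pages with short intervals (news, blogs) are revisited more often.
    """
    # Min-heap ordered by the next time each page is due.
    heap = [(interval, url) for url, interval in pages.items()]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] <= horizon:
        due, url = heapq.heappop(heap)
        events.append((due, url))
        # Re-queue the page for its next visit.
        heapq.heappush(heap, (due + pages[url], url))
    return events

# Hypothetical pages and intervals, for illustration only.
pages = {
    "news.example/front": 1,
    "blog.example/post": 6,
    "static.example/about": 24,
}
events = schedule_recrawls(pages, horizon=12)
```

Within a 12-hour horizon the news page is fetched every hour, the blog page twice, and the static page not at all, which is the kind of uneven budget allocation that re-crawling frequently updated pages calls for.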

 

#3: Infrastructure

The Cycle of Web Innovation

Ideas->Prototypes->Products->Testing->Deployment

  - Continuous innovation, tuning, and hacking
    •RTM -> RTW
    •Always Beta!
  - Quick prototyping and deployment

•Very difficult to do Web-scale innovation
  - Long innovation cycle
    •How difficult is it to test a new algorithm on 5B pages?
    •How difficult is it to calculate the query frequencies over 100T of search logs?
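
To make the query-frequency question concrete, here is a small sketch of the usual split-count-merge pattern: count each log shard independently, then merge the partial counts. The log format (`<timestamp>\t<query>`) and the sample lines are assumptions for illustration; at the scale mentioned above the same two steps would run on a distributed framework rather than in one process.

```python
from collections import Counter

def count_shard(lines):
    """Map step: count normalized queries in one log shard."""
    counts = Counter()
    for line in lines:
        # Assumed log format: "<timestamp>\t<query>"; skip malformed lines.
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            counts[parts[1].strip().lower()] += 1
    return counts

def merge_counts(shard_counts):
    """Reduce step: sum the per-shard counters."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return total

# Two hypothetical shards of a search log.
shards = [
    ["t1\tsearch engine", "t2\tWeb Spam", "t3\tsearch engine"],
    ["t4\tweb spam", "t5\tdeep web", "malformed line"],
]
freq = merge_counts(count_shard(shard) for shard in shards)
```

Because each shard is counted on its own, the map step parallelizes trivially; the hard part at Web scale is moving and storing the data, not the counting logic.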

 

#4: Ranking
•Essence of ranking
•How to combine innumerable pieces of evidence to produce a good ranking
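
One simple way to picture "combining evidence" is a weighted sum of per-document signals. The sketch below is only an assumption-laden illustration: the feature names (`text_match`, `link_authority`, `spam_likelihood`) and the hand-picked weights are hypothetical, and production engines typically learn such combinations rather than set them by hand.

```python
def score(features, weights):
    """Weighted linear combination of evidence; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def rank(docs, weights):
    """Return document ids sorted by descending combined score."""
    return sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)

# Hypothetical weights: spam evidence lowers the score.
weights = {"text_match": 0.6, "link_authority": 0.3, "spam_likelihood": -0.5}

# Hypothetical documents with feature values in [0, 1].
docs = {
    "a": {"text_match": 0.9, "link_authority": 0.2, "spam_likelihood": 0.8},
    "b": {"text_match": 0.7, "link_authority": 0.6, "spam_likelihood": 0.0},
    "c": {"text_match": 0.4, "link_authority": 0.9, "spam_likelihood": 0.1},
}
ranking = rank(docs, weights)
```

Document "a" has the best text match but ends up last because its spam evidence outweighs it, which shows why no single signal is enough.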

#5: Evaluation*
•Traditional IR evaluation
  - Limited binary judgment
  - Static collection of documents (a few million)
  - A small set of queries (around 50-100)
  - Use pooling
    •Pool top 1000 results from various techniques
    •Assume all possible relevant documents are judged
    •Biased against revolutionary new methods
  - Judge new documents if needed
•On the Web
  - Collection is dynamic
    •10-20% of URLs change every month
    •Spam methods are dynamic
    •Need to keep the collection recent
  - Queries are also time sensitive
    •Topics are hot, then not
    •Need to keep a representative sample
  - Result quality is important
    •Multi-level judgment
    •Clicks as implicit judgment?
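
The pooling bias mentioned above can be shown with a toy example. This is a sketch under stated assumptions: the run names, document ids, and the pool depth of 2 (instead of 1000) are made up, and precision@k stands in for whatever metric an evaluation would actually use.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` results from each contributing run."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool

def precision_at_k(ranked, relevant, k):
    """Precision@k; any document not in `relevant` counts as non-relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# Hypothetical runs that contributed to the pool.
runs = {
    "bm25": ["d1", "d2", "d3", "d4"],
    "links": ["d2", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)

# Ground truth that assessors would find if they judged everything.
truly_relevant = {"d1", "d5", "d9"}
# Only pooled documents are judged; unjudged ones are assumed non-relevant.
judged_relevant = truly_relevant & pool

# A new method surfaces d9, which no pooled run retrieved.
new_method = ["d9", "d1", "d7"]
measured = precision_at_k(new_method, judged_relevant, k=2)
actual = precision_at_k(new_method, truly_relevant, k=2)
```

Here the new method's measured precision@2 is 0.5 although its actual precision@2 is 1.0, because the relevant document it alone found was never judged.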

#6: Query Formulation
•Query = information need?
•How do you compose your queries?
  - Guess whether the terms occur in the wanted pages
  - Results are relevant to the terms, rather than to the query

#7: Personalization
•Personalized search: a long history, but never a success story
•Is personalized search really useful?
  - There is NO widely used personalized search engine
•How does personalized search work in a real large-scale search engine?
  - User studies: in a closed environment
•Does one size fit all?
  - It is unclear whether personalization is consistently effective across different queries, different users, and different search contexts

#8: Structure in the Web
•Are Web data really unstructured?
•More structure = better search

#9: Go Beyond Page-level Search*
•Is the page the only and best atomic information unit?
  - A tradition from IR, but not necessary for the Web
•What we did and are doing
  - Block-based search
  - Deep Web search
  - Object-level search

#10: The Next Big Thing?
