http://www.searchforum.org.cn/seminar/lectures/2006-9-25-JirongWen-Search%20Engine%20Overview.PDF
#1: Spamming and Content Quality
•Click => Money, Spam => Click ==> Spam => Money
•An endless game between spammers and search engines
•How to determine the quality of web content?
̵Traditional IR: every document is authoritative and accurate.
•Prove either of the following propositions:
̵There are spam-immune ranking algorithms
̵There is NO spam-immune ranking algorithm
#2: Data Acquisition
•Growth rate of the Web >> growth rate of search engines' indexing capacity.
•Re-crawl frequently updated pages: news, blogs, BBS
•Dynamic content: deep Web, Web 2.0
•Crawling is the first step of search, but its importance is largely ignored by academia.
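One way to picture the re-crawl bullet above is a priority queue keyed by next-due time, where pages that changed on the last visit are revisited sooner. This is only a minimal sketch: the class name `RecrawlScheduler`, the `min_interval`/`max_interval` bounds, and the halve/double rule are illustrative assumptions, not something the lecture specifies.

```python
import heapq


class RecrawlScheduler:
    """Toy re-crawl queue: shorter revisit interval for pages that keep changing."""

    def __init__(self, min_interval=1.0, max_interval=64.0):
        self.min_interval = min_interval  # hypothetical lower bound, in hours
        self.max_interval = max_interval  # hypothetical upper bound, in hours
        self.interval = {}                # url -> current revisit interval
        self.heap = []                    # entries are (next_due_time, url)

    def add(self, url, now=0.0, interval=8.0):
        """Register a URL with an initial revisit interval."""
        self.interval[url] = interval
        heapq.heappush(self.heap, (now + interval, url))

    def pop_due(self, now):
        """Return all URLs whose next visit time has passed."""
        due = []
        while self.heap and self.heap[0][0] <= now:
            due.append(heapq.heappop(self.heap)[1])
        return due

    def report(self, url, now, changed):
        """Halve the interval if the page changed, double it otherwise."""
        factor = 0.5 if changed else 2.0
        new = self.interval[url] * factor
        new = max(self.min_interval, min(self.max_interval, new))
        self.interval[url] = new
        heapq.heappush(self.heap, (now + new, url))
```

After each fetch the crawler would call `report(url, now, changed)`, so news or blog pages drift toward the minimum interval while static pages drift toward the maximum.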
#3: Infrastructure
The Cycle of Web Innovation
Ideas->Prototypes->Products->Testing->Deployment
̵Continuous Innovation, tuning, and hacking
•RTM -> RTW
•Always Beta!
̵Quick Prototyping and deployment
•Very Difficult to Do Web-scale Innovations
̵Long innovation cycle
•How difficult is it to test a new algorithm on 5B pages?
•How difficult is it to calculate query frequencies over 100T search logs?
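To make the log question concrete, here is a minimal single-machine sketch of query-frequency counting. It assumes a hypothetical tab-separated log format (timestamp, then query); the function names are illustrative. At the scale the slide mentions, the same count-then-merge pattern would have to be spread over many machines, which is where the difficulty lies.

```python
from collections import Counter


def query_frequencies(lines):
    """Count normalized queries from tab-separated 'timestamp, query' lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip malformed records
        query = " ".join(parts[1].lower().split())  # case-fold, collapse spaces
        if query:
            counts[query] += 1
    return counts


def merge(partials):
    """Sum per-shard counters, as a reducer would."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total
```

Each log shard can be counted independently with `query_frequencies` and the partial results combined with `merge`.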
#4: Ranking
•Essence of ranking
•How to combine innumerable pieces of evidence into a good ranking
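As a minimal illustration of combining evidence, the sketch below scores documents with a weighted sum of per-document features. The feature names and weights are made up for illustration; the lecture does not prescribe a combination method, and in practice such weights are usually learned from data rather than set by hand.

```python
def score(features, weights):
    """Weighted linear combination of evidence; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank(docs, weights):
    """Return document ids sorted by descending combined score."""
    return sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)
```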
#5: Evaluation*
•Traditional IR evaluation
̵Limited binary judgment
̵Static collection of documents (a few million)
̵A small set of queries (around 50-100)
̵Use pooling
•Pool top 1000 results from various techniques
•Assume all possibly relevant documents have been judged
•Biased against revolutionary new methods
̵Judge new documents if needed
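The pooling steps listed above can be sketched as follows. The function names and the toy data in the usage note are illustrative assumptions; the default depth of 1000 comes from the slide. The last helper shows where the bias against new methods comes from: documents that only a new system retrieves were never judged, so they are counted as non-relevant.

```python
def build_pool(runs, depth=1000):
    """Union of the top-`depth` documents from each system's ranked run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, judged_relevant, k):
    """Precision@k where any document outside the judged set counts as non-relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in judged_relevant) / k


def unjudged(ranking, pool, k):
    """Documents in the top-k of a new run that were never pooled (never judged)."""
    return [doc for doc in ranking[:k] if doc not in pool]
```

For example, with `runs = {"s1": ["d1", "d2", "d3"], "s2": ["d2", "d4", "d5"]}` and `depth=2`, the pool is `{"d1", "d2", "d4"}`; a new run that returns `d9` first has that document reported by `unjudged`, which is the case the "judge new documents if needed" bullet addresses.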
•On the Web
̵Collection is dynamic
•10-20% of URLs change every month
•Spam methods are dynamic
•Need to keep the collection recent
̵Queries are also time sensitive
•Topics are hot then not
•Need to keep a representative sample
̵Result quality is important
•Multi-level judgments
•Clicks as implicit judgment?
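For the multi-level judgment bullet, one widely used measure that consumes graded labels is normalized discounted cumulative gain (NDCG). The slide does not name a metric, so this choice is an assumption; the sketch uses the common `2^grade - 1` gain with a `log2(rank + 1)` discount.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for graded labels listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg(grades, k=None):
    """DCG of the top-k divided by the DCG of the ideal (sorted) top-k."""
    k = len(grades) if k is None else k
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0
```

Unlike a binary relevant/non-relevant measure, this rewards placing a highly relevant result above a marginally relevant one.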
#6: Query Formulation
•Query = information need?
•How do you compose your queries?
̵Guess if the terms occur in the wanted pages
̵Relevant to terms, instead of relevant to query
#7: Personalization
•Personalized search: a long history, but never a success story
•Is personalized search really useful?
̵There is NO widely used personalized search engine
•How does personalized search work in a real large-scale search engine?
̵User study: in a closed environment
•Does one size fit all?
̵It is unclear whether personalization is consistently effective on different queries, for different users, and under different search contexts
#8: Structure in the Web
•Are Web data really unstructured?
•More structure = better search
#9: Go Beyond Page-level Search*
•Is page the only and best atomic information unit?
̵A tradition from IR, but not necessary for the Web
•What we did and are doing
̵Block-based search
̵Deep Web search
̵Object-level search
#10: The Next Big Thing?