Lucene Revolution 2012


I attended Lucene Revolution 2012 in Boston, MA on May 9-10, 2012. This was my third one, having attended the first one in 2010 (also in Boston) and then Barcelona in 2011. Lucid Imagination, the main conference sponsor and organizer, links to all the conferences, where you can download the slides and see videos of the presentations.

For me, this year's big take-aways were:

  • Lucene 4.0
  • SolrCloud / Distributed Solr
  • Finite State Automata

Also, near real-time search remains a hot topic, but it seems that there is enough support for it in Solr/Lucene now that it's less of an “omg, what are we gonna do” moment and more of a known problem with reasonable solutions for many use-cases.

Lucene 4.0

Although Lucene 4.0 wasn’t the central focus of many talks, it was the foundation for most of the interesting features and developments being presented at the conference. Lucene 4.0 is the culmination of years of work by the Lucene/Solr developers to re-structure and re-factor the deep plumbing of Lucene. Lucene, up through the 3.x line, has had a very tight coupling between the binary on-disk layout of index data and the Java APIs and code. This inhibited a lot of innovation and research in advanced search and information retrieval using Lucene. If you poke around on the interwebs, you’ll also find Lucene 4.0 discussed in terms of “flexible indexing” and “heavy committing”; the latter phrase refers to the implementation of this low-level re-structuring.

Lucene 4.0 isn’t officially released just yet. The Apache Lucene/Solr team is planning a more measured alpha and beta period before making 4.0 official. However, many at the conference were quick to point out that many live, production services on some of the biggest websites were already running on the 4.0-trunk branch. The Lucene developers said that the main risk of using 4.0-trunk is that there might be a last-minute tweak to the index format, which would mean you’d have to re-build your indexes if/when you migrate to the 4.0 release.

SolrCloud / Distributed Solr

Winning the award for the most misleading name is “SolrCloud”. It’s not entirely wrong, but it does suggest things like “running Solr in AWS” or some other cloud provider. It’s really about distributed search: multiple Solr servers coordinating together to load-balance indexing and queries, as well as replicating each other’s data for automagic, live fail-over. Pretty awesome stuff.

Over the past 2-3 years, Solr hasn’t been keeping up with the industry’s needs when it comes to distributed search. Sure, you could implement distributed search with Solr, but you’d have to do all the nasty and difficult parts yourself. In fact, the first of the two talks on SolrCloud had a couple of slides focused on the shortcomings of Solr 3.x when it comes to distributed search, and the pain and agony people suffered implementing a distributed search system with Solr 3.x, followed by an explanation of how SolrCloud solves those problems. The next day, Mark Miller’s talk gave a highly detailed exposition of the SolrCloud architecture. Unfortunately, the slides are not posted yet.

A great introduction to the challenges in distributed search is given by Shay Banon, the primary developer of Elastic Search, in his presentation at Berlin Buzzwords: Elastic Search – A Distributed Search Engine. The Elastic Search project — based on Lucene — competes with Solr as the web-application layer on top of Lucene, and IMO it’s been kicking Solr’s ass at distributed search. If you watch Shay’s presentation, then Mark’s (when it’s posted), you’ll see that Solr is borrowing all the good ideas from Elastic Search (which is how open source should work).

For me, and the full-text search work I do at Internet Archive, the SolrCloud project is very relevant to my interests. The size of our web archives is such that distributed search is an absolute necessity. If we were able to buy one giant machine to power a single Solr server, it would be prohibitively expensive. And, we would need two of them to provide enough availability for a production system. Distributed search across many “cheap” machines is definitely the way to go for us.

Chaos Monkey

Waiting for our flights home after the conference, Eric Pugh and I were discussing how we were both impressed that Mark Miller’s talk on SolrCloud focused as much on the testing of it as on the development. Kudos to Mark and the other developers for focusing so much time and effort on testing and verification.

They stole a great idea from Netflix and implemented their own “chaos monkey” for testing distributed search. When testing distributed search, they would use a stand-alone Solr server as the oracle and send the same documents to both the oracle and the distributed system for indexing. While the corpus was being indexed, their “chaos monkey” would randomly kill servers in the distributed system, and inject other chaos by simulating typical failure scenarios. However, the distributed system should still have the correct set of documents, even in the face of the chaos. At the end, they would check the contents of the index in the distributed system and compare it to the stand-alone Solr oracle. The rigor and effort put into testing gave me a nice, warm fuzzy.
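The oracle-comparison pattern described above can be sketched roughly like this in plain Java. Note the heavy assumptions: the `Index` interface and the in-memory implementation are hypothetical stand-ins for real Solr clients, and the real chaos monkey kills actual server processes rather than invoking a callback.

```java
import java.util.*;

// Sketch of oracle-based randomized testing: index the same corpus into a
// trusted stand-alone "oracle" and into the system under test, inject chaos
// along the way, then compare the final document sets. Purely illustrative.
public class ChaosTest {
    public interface Index {              // hypothetical stand-in for a Solr client
        void add(String doc);
        Set<String> allDocs();
    }

    // Feed the same documents to both systems, randomly triggering failures
    // mid-indexing; afterwards both must hold exactly the same documents.
    static boolean run(Index oracle, Index distributed, List<String> corpus,
                       Runnable chaosMonkey, Random rnd) {
        for (String doc : corpus) {
            oracle.add(doc);
            distributed.add(doc);
            if (rnd.nextInt(10) == 0) chaosMonkey.run(); // simulated failure
        }
        return oracle.allDocs().equals(distributed.allDocs());
    }

    public static void main(String[] args) {
        // Trivial in-memory "indexes" just to exercise the harness.
        class Mem implements Index {
            final Set<String> docs = new HashSet<>();
            public void add(String doc) { docs.add(doc); }
            public Set<String> allDocs() { return docs; }
        }
        Index oracle = new Mem(), distributed = new Mem();
        boolean ok = run(oracle, distributed,
                         Arrays.asList("doc1", "doc2", "doc3"),
                         () -> {},            // no-op monkey in this toy run
                         new Random(42));
        System.out.println(ok); // true
    }
}
```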

Also, in collaboration with Dawid Weiss of Carrot2 fame, their chaos monkey code is now packaged as a stand-alone test framework for anyone to use: RandomizedTesting: Randomized testing infrastructure (and more!) for JUnit, ANT and Maven.

Finite State Automata

Before reading further, go watch Dawid Weiss (him again) give this great introduction on finite state automata and how they are being used in Lucene: Finite State Automata in Lucene, Theory and Applications.

Ok, for those who didn’t watch the video and kept reading on… Finite State Automata (FSA) — often also called Finite State Machines (FSM) — are state machines that accept or reject input sequences. If you’ve ever used a regular expression, you’ve used an FSA, since most regex engines are implemented using FSAs under the hood. FSAs are really great because they are compact enough to fit into memory and are wicked fast. Another cool feature of FSAs is that they can be weighted, so that as you traverse the state machine, you can use the weights to choose different transitions. For example, with a weighted FSA, you can add rankings to your completions. Now, go back and watch that video.
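To make the accept/reject idea concrete, here is a toy deterministic automaton in plain Java. This is not Lucene's FST API — the real implementation is a compact byte-oriented structure — just a two-state machine that accepts strings of the form a*b+ (zero or more 'a's followed by one or more 'b's):

```java
// Toy deterministic finite state automaton accepting a*b+.
// States: 0 = start (seen only 'a's), 1 = seen at least one 'b' (accepting),
// -1 = dead/reject.
public class ToyFsa {
    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 0; // stay in start on 'a'
        if (c == 'b') return 1;               // any 'b' moves to accepting state
        return -1;                            // anything else rejects
    }

    public static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            state = step(state, c);
            if (state == -1) return false;    // rejected mid-stream
        }
        return state == 1;                    // only state 1 is accepting
    }

    public static void main(String[] args) {
        System.out.println(accepts("aab")); // true
        System.out.println(accepts("ba"));  // false
    }
}
```

A weighted FSA is the same idea with a number attached to each transition; summing weights along the path gives you a score for ranking completions.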

One of the main uses of FSAs in Lucene was to replace the term dictionary. Previously, the term dictionary — the list of words in the index — was stored in a more-or-less compressed skip-list. That worked for Lucene for a long time, but it had a lot of limitations, especially when it came to fuzzy matching and search completion. FSAs can keep even large term dictionaries in memory, are great at fuzzy matching, and can do word-completion in nanoseconds. Of course, none of this was even possible in Lucene without the re-structuring efforts of Lucene 4.0.
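To illustrate the weighted-completion idea, here is a small sketch in plain Java. A TreeMap over a sorted term list stands in for the FST — this is not Lucene's API, just a demonstration of prefix lookup plus weight-based ranking of completions:

```java
import java.util.*;

// Sketch of weighted prefix completion over a sorted term dictionary.
// Lucene 4.0 does this inside an in-memory weighted FST; a TreeMap stands in
// here purely to show the idea of ranking completions by weight.
public class Completer {
    private final TreeMap<String, Integer> terms = new TreeMap<>();

    public void add(String term, int weight) { terms.put(term, weight); }

    // Return up to n completions of prefix, highest weight first.
    public List<String> complete(String prefix, int n) {
        // All keys in [prefix, prefix + U+FFFF) share the prefix.
        List<Map.Entry<String, Integer>> hits = new ArrayList<>(
            terms.subMap(prefix, prefix + Character.MAX_VALUE).entrySet());
        hits.sort((a, b) -> b.getValue() - a.getValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, hits.size()); i++) out.add(hits.get(i).getKey());
        return out;
    }

    public static void main(String[] args) {
        Completer c = new Completer();
        c.add("lucene", 10);
        c.add("lucid", 3);
        c.add("solr", 7);
        System.out.println(c.complete("lu", 2)); // [lucene, lucid]
    }
}
```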

The short story is that FSAs are going to make many features of Lucene a whole lot faster and easier.

The Largest FSA EVAR?

When I first read about the FSA work for Lucene (over a year ago), I talked with Brad about giving it a whirl for the Wayback Machine’s URL index. We didn’t have time to do anything at that point, but now that I was about to attend the Lucene conference and hear more about it, I wanted to give it a test drive.

A couple of weeks before the conference, I exchanged a few emails with Mike McCandless about building an FSA for the Wayback Machine’s URL index. He thought it was worth a try and, if successful, it would definitely be the largest FSA he’d ever seen. So, I took a sample of URLs from the global Wayback Machine CDX file and started shoving them into the FSA command-line test driver…

I blew it up.

The FSA implementation in Lucene is optimized for memory and speed, and the key data structure is a Java byte[] — which, due to limits built into the Java language specification, can hold at most 2^31-1 elements. I got to about 80-90 million URLs before the array filled up. Mike wasn’t surprised by this, and during the conference he, Robert Muir, and I discussed it a bit. In the short term, that limitation will remain. There is a patch floating around that claims to allow for more states, but it’s not maintained and probably already out-of-date. Robert and Mike both suggested that forthcoming features allowing FSAs to be composed into larger (virtual) FSAs would likely solve the problem. Even so, the FSA would likely be too large for practical use: sure, they are space-efficient, but 90 million URLs produced a 1.5GB FSA, so 170 billion URLs extrapolates to ~2.8TB of FSAs.
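The ~2.8TB figure is just linear scaling from the sample run; a quick sanity check of the arithmetic:

```java
// Back-of-the-envelope extrapolation from the sample run:
// 90 million URLs -> 1.5 GB of FSA, scaled linearly to 170 billion URLs.
public class FsaSize {
    public static double extrapolateGb(double sampleUrls, double sampleGb, double totalUrls) {
        return sampleGb * (totalUrls / sampleUrls);
    }

    public static void main(String[] args) {
        double gb = extrapolateGb(90e6, 1.5, 170e9);
        System.out.printf("%.0f GB (~%.1f TB)%n", gb, gb / 1024); // 2833 GB (~2.8 TB)
    }
}
```

Linear scaling is pessimistic in one sense (FSAs share suffixes, so more URLs compress better) and optimistic in another (it ignores any per-FSA overhead of composing many automata), but it's close enough to show the approach isn't practical for the full index.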

A more realistic scenario would be to build an FSA for just hostnames. Then we could investigate doing as-you-type-completion with hostnames. Or something like that.

Near Real-time Search

Last year, it seemed that everyone was obsessed with (near) real-time search. In fact, probably the “can’t miss” talk of Apache Lucene EuroCon in Barcelona last fall was Michael Busch’s keynote on Realtime Search at Twitter.

Since then, I get the feeling that although many people still want and need “realtime” search, there is now enough support for it in Lucene and Solr to satisfy a lot of use-cases.

Other Talks of Interest

Some other talks are worth mentioning:

  • Building Query Auto-Completion Systems with Lucene 4.0 by Sudarshan Gaikaiwari, Software Engineer, Yelp.

    Rather than build the recommendation engine in Lucene/Solr, he took the weighted FSA library and built a system incorporating names of businesses, geohashes (to find things close to you), and weights to rank the completions and recommendations based on reviews/scores. The library let him build a prototype system quickly.

  • Updateable Fields in Lucene and other Codec Applications by Andrzej Bialecki, Lucid Imagination.

    Updateable fields have long been considered the holy grail of Lucene. Andrzej sketched out an approach that takes advantage of the re-structuring introduced by the Lucene 4.0 work. It’s not done by any means, but it now looks entirely reasonable and within reach. Maybe Lucene 4.1?

  • Grouping and Joining in Lucene/Solr by Martijn van Groningen, SearchWorkings.

    Martijn gives a thorough walk-through of grouping and joining. Grouping is an absolute necessity for us, allowing us to give only the top 1-2 hits from each archived website. There wasn’t anything new or groundbreaking for me in this presentation, but I liked that the feature is getting more use and attention.

  • How to Access Your Library Book Collections Using Solr by Engy Ali, Software Project Manager, The Library of Alexandria.

    I couldn’t miss her talk because, well, it’s about libraries and digitized books, plus I had just met with colleagues from Bibliotheca Alexandrina at the IIPC GA in Washington D.C. the week before.

Wrap-up

All-in-all, it was a great conference. SolrCloud will (I hope) really make a difference for our full-text search systems. And the FSA stuff is just super-cool. Counting down to the next one.

