Solr Searching (Part 1)

In this chapter, you are going to learn about:

•     Request handlers
•     Query parameters
•     Solr's "lucene" query syntax

 

Open the search form at http://192.168.0.248:9080/solr/admin/form.jsp. Click on the Search button, and you'll get output like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">392</int>
  <lst name="params">
    <str name="explainOther"/>
    <str name="fl">*,score</str>
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">*:*</str>
    <str name="hl.fl"/>
    <str name="qt"/>
    <str name="wt"/>
    <str name="fq"/>
    <str name="version">2.2</str>
    <str name="rows">10</str>
  </lst>
</lst>
<result name="response" numFound="399182" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <arr name="a_member_id">
      <long>123122</long><long>346621</long>
    </arr>
    <arr name="a_member_name">
      <str>Backslash</str><str>Thomas Woznik</str>
    </arr>
    <str name="a_name">Plasticmen</str>
    <str name="a_type">group</str>
    <str name="id">Artist:309843</str>
    <date name="indexedAt">2011-07-15T05:20:06Z</date>
    <str name="type">Artist</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="a_name">Gnawa Njoum Experience</str>
    <str name="id">Artist:215841</str>
    <date name="indexedAt">2011-07-15T05:20:06Z</date>
    <str name="type">Artist</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="a_name">Ultravoice vs. Rizo</str>
    <str name="a_type">group</str>
    <str name="id">Artist:482488</str>
    <date name="indexedAt">2011-07-15T05:20:06Z</date>
    <str name="type">Artist</str>
  </doc>
<!-- ** 7 other docs omitted for brevity ** -->
</result>
</response>
 

 

Solr's generic XML structured data representation
Solr has its own generic XML representation of typed and named data structures. This XML is used for most of the response XML and it is also used in parts of solrconfig.xml. The XML elements involved in this partial XML schema are:

•     <lst>: A named list. Each of its child nodes should have a name attribute. This generic XML is often stored within an element that is not part of this schema, like <doc>, but is in effect equivalent to lst.
•     <arr>: An array of values. Each of its child nodes is a member of this array.

The following elements represent simple values with the text of the element storing the value. The numeric ranges match those of the Java language. They will have a name attribute if they are underneath lst (or an equivalent element like doc), but not otherwise.

•     <str>: A string of text
•     <int>: An integer in the range -2^31 to 2^31-1
•     <long>: An integer in the range -2^63 to 2^63-1
•     <float>: A floating point number in the range 1.4e-45 to about 3.4e38
•     <double>: A floating point number in the range 4.9e-324 to about 1.8e308

•     <bool>: A boolean value represented as true or false. When supplying values in a configuration file: on, off, yes, and no are also supported.

•     <date>: A date in the ISO-8601 format like so: 1965-11-30T05:00:00Z, which is always in the UTC time zone represented by Z. Even if your data isn't actually in this time zone, when working with Solr you might pretend that it is because Solr doesn't support anything else.
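The mapping from these elements to native values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name to_python is made up for this example, <doc> is treated like <lst> as described above, and dates are assumed to carry no fractional seconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def to_python(elem):
    """Convert one element of Solr's generic XML into a plain Python value."""
    tag = elem.tag
    if tag in ("lst", "doc"):
        # Named list: every child carries a name attribute.
        return {child.get("name"): to_python(child) for child in elem}
    if tag == "arr":
        # Array: children are unnamed members, order is preserved.
        return [to_python(child) for child in elem]
    text = elem.text or ""
    if tag in ("int", "long"):
        return int(text)
    if tag in ("float", "double"):
        return float(text)
    if tag == "bool":
        return text.strip().lower() == "true"
    if tag == "date":
        # Assumes whole seconds, e.g. 1965-11-30T05:00:00Z (always UTC).
        parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return parsed.replace(tzinfo=timezone.utc)
    return text  # <str>, and anything this sketch does not recognize


HEADER_XML = """
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">392</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="rows">10</str>
  </lst>
</lst>
""".strip()

header = to_python(ET.fromstring(HEADER_XML))
print(header)
```

Note that request parameters such as rows come back as <str> elements, so they stay strings even when they look numeric.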

 

Solr's XML response format

The <response/> element wraps the entire response.

The first child element is <lst name="responseHeader">, which is intuitively the response header that captures some basic metadata about the response.

•     status: Always 0. If a Solr error occurs, then the HTTP response status code will reflect it and a plain HTML page will display the error.
•     QTime: The number of milliseconds Solr takes to process the entire request on the server. Due to internal caching, you should see this number drop to a couple of milliseconds or so for subsequent requests of the same query. If subsequent identical searches are much faster, yet you see the same QTime, then your web browser (or intermediate HTTP Proxy) cached the response. Solr's HTTP caching configuration is discussed in Chapter 10.

•     Other data may be present depending on query parameters.

The main body of the response is the search result listing enclosed by this: <result name="response" numFound="399182" start="0" maxScore="1.0">, and it contains a <doc> child node for each returned document. Some of the attributes are explained below:

•     numFound: The total number of documents matched by the query. This is not impacted by the rows parameter and as such may be larger (but not smaller) than the number of child <doc> elements.
•     start: The same as the start request parameter, described shortly, which is the offset of the returned results into the query's result set.
•     maxScore: Of all documents matched by the query (numFound), this is the highest score. If you didn't explicitly ask for the score in the field list using the fl request parameter, described shortly, then this won't be here. Scoring is described in the next chapter.

The contents of the result element are a list of doc elements. Each of these elements represents a document in the index. The child elements of a doc element represent fields in the index and are named correspondingly. The types of these elements use Solr's generic data representation, which was described earlier. They are simple values if they are not multi-valued in the schema. For multi-valued fields, the field is represented by an ordered array of simple values.
There was no data following the result element in our demonstration query. However, there can be, depending on the query parameters enabling features such as faceting and highlighting. When we cover those features, the corresponding XML will be explained.
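As a rough sketch of how a client might read this structure, the following Python uses only the standard library on a trimmed copy of the response shown earlier. The function name read_result is hypothetical, and field values are deliberately left as strings to keep the example short.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample response: two documents instead of ten.
RESPONSE_XML = """<response>
<result name="response" numFound="399182" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <arr name="a_member_name">
      <str>Backslash</str><str>Thomas Woznik</str>
    </arr>
    <str name="a_name">Plasticmen</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="a_name">Gnawa Njoum Experience</str>
  </doc>
</result>
</response>"""


def read_result(xml_text):
    """Return (numFound, start, docs) from a Solr XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for field in doc:
            if field.tag == "arr":
                # Multi-valued field: an ordered list of simple values.
                fields[field.get("name")] = [value.text for value in field]
            else:
                fields[field.get("name")] = field.text
        docs.append(fields)
    return int(result.get("numFound")), int(result.get("start")), docs


num_found, start, docs = read_result(RESPONSE_XML)
print(num_found, start, len(docs))
print(docs[0]["a_member_name"])
```

The gap between num_found and len(docs) is the point made above: numFound counts every match, while the doc elements are limited by rows.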

 

Parsing the URL

The search form is as basic as they come. It submits the form using HTTP GET, essentially resulting in the browser loading a new URL with the form elements becoming part of the URL's query string. Take a good look at the URL in the browser page showing the XML response. Understanding the URL's structure is very important for grasping how searching Solr works:

http://192.168.0.248:9080/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=
 

•     The /solr/ is the web application context where Solr is installed on the Java servlet engine. If you have a dedicated server for Solr, then you might opt to install it at the root. This would make it just /. How to do this is out of scope of this book, but letting it remain at /solr/ is fine.
•     After the web application context is a reference to the Solr core named mbartists. If you are experimenting with Solr's example setup then you won't see a core name because it has a default one. We'll learn more about configuring Solr cores in Chapter 8, Deployment.
•     The /select in combination with the qt= parameter is a reference to the Solr request handler. More on this is covered next.

•     Following the ?, is a set of unordered URL parameters, also known as query parameters in the context of searching. The format of this part of the URL is an & separating set of unordered name=value pairs. As the form doesn't have an option for all query parameters, you will manually modify the URL in your browser to add query parameters as needed.

Text in the URL must be UTF-8 encoded and then URL-escaped so that the URL complies with its specification. This concept should be familiar to anyone who has done web development. Depending on the context in which the URL is actually constructed, there are API calls you should use to ensure this escaping happens properly. For example, in JavaScript, you would use encodeURIComponent(). In the URL above, Solr interpreted the %3A as a colon and %2C as a comma. The most common escaped character in URLs is a space, which is escaped as either + or %20. Fortunately, when experimenting with the URL, browsers are lenient and will permit some characters that should be escaped. For more information on URL encoding see http://en.wikipedia.org/wiki/Percent-encoding.
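In Python, the equivalent of encodeURIComponent() for a whole parameter set is urllib.parse.urlencode. The sketch below rebuilds part of the query string from the URL above; the parameter values are taken from that URL, and safe="*" is passed only so that the asterisk stays unescaped the way the browser left it.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "indent": "on",
    "q": "*:*",
    "start": 0,
    "rows": 10,
    "fl": "*,score",
}

# urlencode UTF-8 encodes and percent-escapes each name and value.
query_string = urlencode(params, safe="*")
print(query_string)

url = "http://192.168.0.248:9080/solr/select?" + query_string
print(url)

# A space is escaped as + by default.
print(urlencode({"q": "billy corgan"}))
```

Decoding with parse_qs gives back the original values, which is a convenient way to check what Solr will actually receive.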

 

Request handlers

Searching Solr and most other interactions with Solr, including indexing for that matter, is processed by what Solr calls a request handler. Request handlers are configured in the solrconfig.xml file and are clearly labeled as such. Most of them exist for special purposes like handling a CSV import, for example. Our searches in this chapter have been directed to the default request handler because we didn't specify one in the URL. Here is how the default request handler is configured:

<requestHandler name="standard" class="solr.SearchHandler" 
                default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="fl">*</str>
  <str name="version">2.1</str>
  </lst>
</requestHandler>

 The request handlers that perform searches allow configuration of two things:
•     Establishing default parameters and making some unchangeable
•     Registering Solr search components such as faceting and highlighting

Let's say that in the MusicBrainz search interface we have a search form that searches for bands. We have a Solr core just for artists named mbartists but this contains not only bands but also individual band members. When the field named a_type is "group", we have a band. To start, copy the default configuration, removing the attribute default="true", and give it a name such as bands. We can now use this request handler with qt=bands in the URL as shown below:

/solr/mbartists/select?qt=bands&q=Smashing&.....

/solr/select?qt=bands&q=Smashing&.....

An alternative to this is to give the request handler a name that starts with /, such as /bands. The handler is then invoked like this:

/solr/bands?q=Smashing&.....

 Let's now configure this request handler to filter searches to find only the bands, without the searching application having to specify this. We'll also set a few other options.

<requestHandler name="bands" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">none</str>
      <int name="rows">20</int>
    </lst>
    <lst name="appends">
      <str name="fq">a_type:group</str>
    </lst>
    <lst name="invariants">
      <str name="facet">false</str>
    </lst>
</requestHandler>

 Request handlers have several lists to configure. These use Solr's generic XML data structure, which was described earlier.

•     defaults: These simply establish default values for various request parameters. Parameters in the request will override them.
•     appends: For parameters that can be set multiple times, like fq, this section specifies values that will be set in addition to any that may be specified by the request.
•     invariants: This sets defaults that cannot be overridden. This is useful for security purposes—a topic for Chapter 8, Deployment.

•     first-components, components, last-components: These list the Solr search components to be registered for possible use with this request handler. By default, a set of search components is already registered to enable functionality such as querying and faceting. Setting first-components or last-components would prepend or append to this list respectively, whereas setting components would override the list completely. For more information about search components, read Chapter 7, Search Components.
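The interplay of defaults, appends, and invariants can be modeled with a small Python function. This is a conceptual sketch of the rules described above, not Solr's actual implementation; the function name effective_params and the sample request are invented for illustration, and each parameter maps to a list because parameters like fq can repeat.

```python
def effective_params(request, defaults=None, appends=None, invariants=None):
    """Model which parameter values are in effect for one request."""
    # Start from the handler defaults.
    merged = {name: list(values) for name, values in (defaults or {}).items()}
    # Parameters in the request override the defaults.
    for name, values in request.items():
        merged[name] = list(values)
    # Appends are added to whatever the request supplied.
    for name, values in (appends or {}).items():
        merged[name] = merged.get(name, []) + list(values)
    # Invariants win over everything and cannot be overridden.
    for name, values in (invariants or {}).items():
        merged[name] = list(values)
    return merged


# Mirrors the bands request handler configured above.
BANDS_HANDLER = {
    "defaults": {"echoParams": ["none"], "rows": ["20"]},
    "appends": {"fq": ["a_type:group"]},
    "invariants": {"facet": ["false"]},
}

request = {
    "q": ["Smashing"],
    "rows": ["5"],
    "fq": ["a_name:Smashing"],
    "facet": ["true"],
}
result = effective_params(request, **BANDS_HANDLER)
print(result)
```

In this model the request's rows=5 replaces the default of 20, the handler's fq is added next to the request's own filter, and the attempt to turn faceting on is ignored.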


Query parameters

There are a great number of request parameters for configuring Solr searches, especially when considering all of the components like faceting and highlighting. Only the core search parameters not specific to any query parser are listed here. Furthermore, in-depth explanations for some lie further in the chapter.


Search criteria related parameters

•     q: The user query or just "query" for short. This typically originates directly from user input. The query syntax is determined by the defType parameter.
•     defType: A reference to the query parser for the user query in q. The default is lucene with the syntax to be described shortly. You'll most likely use dismax or edismax discussed later in the chapter.

•     fq: A filter query that limits the scope of the user query, similar to a WHERE clause in SQL. Unlike the q parameter, this has no effect on scoring. This parameter can be repeated as desired. Filtering is described later in the chapter.
•     qt: A reference to the query type, more commonly known as the request handler, described earlier. An alternative is to name the request handler with a leading / such as /artists and then use /artists in your URL path instead of /select?....

 

Result pagination related parameters

A query could match any number of the documents in the index, perhaps even all of them, such as in our first example of *:*. Solr doesn't generally return all the documents. Instead, you indicate to Solr with the start and rows parameters to return a contiguous series of them. The start and rows parameters are  explained below:

•     start: (default: 0) This is the zero-based index of the first document to be returned from the result set. In other words, this is the number of documents to skip from the beginning of the search results. If this number exceeds the result count, then it will simply return no documents, but it's not considered an error.
•     rows: (default: 10) This is the number of documents to be returned in the response XML starting at index start. Fewer rows will be returned if there aren't enough matching documents. This number is basically the number of results displayed at a time on your search user interface.
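A search user interface usually thinks in page numbers rather than offsets. A minimal sketch of the conversion, assuming one-based page numbers and hypothetical helper names page_params and page_count:

```python
def page_params(page, page_size=10):
    """Translate a one-based page number into Solr's start and rows."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return {"start": (page - 1) * page_size, "rows": page_size}


def page_count(num_found, page_size=10):
    """Number of pages needed to show num_found results."""
    return -(-num_found // page_size)  # ceiling division


print(page_params(1))
print(page_params(3, page_size=20))
print(page_count(399182))
```

The numFound attribute from the response is what feeds page_count, since it is not affected by rows.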


Output related parameters

The output related parameters are explained below:

•     fl: This is the field list, separated by commas and/or spaces. These fields are to be returned in the response. Use * to refer to all of the fields but not the score. In order to get the score, you must specify the pseudo-field score.
•     sort: A comma-separated field listing to sort on, with a directionality specifier of asc or desc after each field. Example: r_name asc, score desc. The default is score desc. You can also sort by functions, which is a more advanced subject for the next chapter. There is more to sorting than meets the eye; read more about it later in this chapter.
•     wt: The response format, also known as writer type or query response writer, defined in solrconfig.xml. Since the subject of picking a response format has to do with how you will integrate with Solr, further recommendations and details are left to Chapter 9, Integrating Solr. For now, here is the list of options by name: xml (the default and aliased to standard), json, python, php, phps, ruby, javabin, csv, xslt, velocity.

•     version: The requested version of Solr's response structure, if different than the default. This is not particularly useful at the time of writing. However, if Solr's response structure changes, then it will do so under a new version. By using this in the request, a best-practice for your automated searches, you reduce the chances of your client code breaking if Solr is updated.

 

Diagnostic related parameters

These diagnostic parameters are helpful during development with Solr. Obviously, you'll want to be sure NOT to use these, particularly debugQuery, in a production setting because of performance concerns.

•     indent: A boolean option that will indent the output to make it easier to read. It works for most of the response formats.

•     debugQuery: If true, then following the search results is <lst name="debug"> with diagnostic information. It contains voluminous information about the parsed query string, how the scores were computed, and millisecond timings for all of the Solr components to perform their part of the processing such as faceting. You may need to use the View Source function of your browser to preserve the formatting used in the score computation section. Debugging queries and enhancing relevancy is documented further in the next chapter.
     ° explainOther: If you want to determine why a particular document wasn't matched by the query or why it wasn't scored high enough, then you can put a query for this value, such as  id:"Release:12345", and debugQuery's output will be sure to include the first document matching this query in its output.
•     echoHandler: If true, then this emits the Java class name identifying the Solr request handler.

•     echoParams: Controls if any query parameters are returned in the response header, as seen verbatim earlier. This is for debugging URL encoding issues or for verifying the complete set of parameters in effect, taking into consideration those defined in the request handler. Specifying none disables this, which is appropriate for production real-world use. The standard request handler is configured for this to be explicit by default, which means to list those parameters explicitly mentioned in the URL. Finally, you can use all to include those parameters configured in the request handler in addition to those in the URL.

Finally, there is another parameter not easily categorized above called timeAllowed in which you specify a time limit in milliseconds for a query to take before it is interrupted and intermediate results are returned. Long-running queries should be very rare and this allows you to cap them so that they don't over-burden your production server.

 

Query parsers and local-params

A query parser parses a string into an internal Lucene query object, potentially considering request parameters and so-called local-params—parameters local to the query string. Only a few parsers actually do real parsing and some parsers like those for geospatial don't even use the query string. The default query parser throughout Solr is named lucene and it has a special leading syntax to switch the parser to another and/or to specify some parameters. Here's an example choosing the dismax parser along with two local-params and a query string of "billy corgan":

{!dismax qf="a_name^2 a_alias" tie=0.1}billy corgan

There are a few things to know about the local-params syntax:
•     The leading query parser name (for example, dismax) is optional. Without it, the parser remains as lucene. Furthermore, this syntax is a shortcut for putting the query parser name in the type local-param.

•     Usually, a query parser treats local-params as an override to request parameters in the URL.
•     A parameter value can refer to a request parameter via a leading $, for example v=$qq.
•     The special parameter v can be used to hold the query string as an alternative to it following the }. Some advanced scenarios require this approach.
•     A parameter value doesn't have to be quoted if there are no spaces. There wasn't any for the tie parameter in the example above.
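When queries are assembled in code, the local-params prefix is easy to get wrong by hand. The following Python sketch builds it from a dictionary, quoting only values that contain whitespace. The helper name local_params is hypothetical, and the sketch simply rejects values containing a double quote rather than attempting to escape them.

```python
def local_params(query, parser=None, params=None):
    """Build a query string with a leading {!...} local-params section."""
    parts = [parser] if parser else []
    for name, value in (params or {}).items():
        text = str(value)
        if '"' in text:
            raise ValueError("values with double quotes are not handled here")
        if any(ch.isspace() for ch in text):
            text = '"' + text + '"'  # quoting is needed only with spaces
        parts.append(name + "=" + text)
    return "{!" + " ".join(parts) + "}" + query


example = local_params(
    "billy corgan",
    parser="dismax",
    params={"qf": "a_name^2 a_alias", "tie": 0.1},
)
print(example)

# Parser name omitted, query supplied through v and a request parameter.
print(local_params("", params={"qf": "a_name", "v": "$qq"}))
```

The first call reproduces the example shown earlier; the second shows the v=$qq form, where the actual query text travels in a separate request parameter named qq.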

 

For an interesting example, see the sub-query syntax later.
Solr includes quite a few different query parsers. In the next section you'll learn all about lucene. For processing user queries, you should typically use dismax or edismax (short for extended-dismax), which are described afterwards. The other query parsers are for special things like geospatial search, also described at the end of this chapter. This book only explores the most useful parsers; for further information, see: http://wiki.apache.org/solr/SolrQuerySyntax.

 

Query syntax (the lucene query parser)

Solr's native / full query syntax is implemented by the query parser named lucene. It is based on Lucene's old syntax with a few additions that will be pointed out explicitly. In fact, you've already seen the first addition which is local-params.

The lucene query parser does have a couple of query parameters that can be set. Usually these aren't specified, as the lucene parser is rarely used for the user query and its syntax is easily made explicit enough not to need these options.

•     q.op: The default query operator, either AND or OR to signify if all of the search terms or just one of the search terms need to match. If this isn't present, then the default is specified in schema.xml near the bottom in the defaultOperator attribute. If that isn't specified, then the default is OR.
•     df: The default field that will be searched by the user query. If this isn't specified, then the default is specified in schema.xml near the bottom in the <defaultSearchField> element. If that isn't specified, then a query that does not explicitly specify a field to search will be an error.

 

In the following examples:
•     If you are using the example data with the book, you could use the search form here: http://localhost:8983/solr/mbartists/admin/. No changes are needed.
•     q.op is set to OR (which is the default choice, if it isn't specified anywhere).
•     The default search field was set to a_name in the schema. Had it been something else, we'd use the df parameter.
•     You may find it easier to scan the resulting XML if you set fl (the field list) to a_name, score.

 

Use debugQuery=on
To see a normalized string representation of the parsed query tree,
enable query debugging. Then look for parsedquery in the debug
output. See how it changes depending on the query.

 

A final point to be made is that there are query capabilities within Lucene that are not exposed in query parsers that come with Solr. Notably, there is a family of so-called "span queries" which allow for some advanced phrase queries that can be composed together by their relative position. To learn more about that, I advise reading the Lucene In Action book. There is also the ability to search for a term matching a regular expression.

 

Matching all the documents

Lucene doesn't natively have a query syntax to match all documents. Solr enhanced Lucene's query syntax to support it with the following syntax:
*:*
When using dismax, it's common to set q.alt to this match-everything query so that a blank query returns all results.

 

Mandatory, prohibited, and optional clauses

Lucene has a unique way of combining multiple clauses in a query string. It is tempting to think of this as a mundane detail common to boolean operations in programming languages, but Lucene doesn't quite work that way.
A query expression is decomposed into a set of unordered clauses of three types:

•     A clause can be mandatory: +Smashing
This matches only artists containing the word Smashing.
•     A clause can be prohibited: -Smashing
This matches all artists except those with Smashing. You can also use an exclamation mark as in !Smashing but that's rarely used.
•     A clause can be optional: Smashing

 

The term "optional" deserves further explanation. If the query expression contains at least one mandatory clause, then any optional clause is just that—optional. This notion may seem pointless, but it serves a useful function in scoring documents that match more of them higher. If the query expression does not contain any mandatory clauses, then at least one of the optional clauses must match. The next two examples illustrate optional clauses.
Here, Pumpkins is optional, and my favorite band will surely be at the top of the list, ahead of bands with names like Smashing Atoms:

+Smashing Pumpkins

 

In this example there are no mandatory clauses and so documents with Smashing or Pumpkins are matched, but not Atoms. My favorite band is at the top because it matched both, followed by other bands containing only one of those words:

Smashing Pumpkins -Atoms
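The matching rules for the three clause types can be modeled in a few lines of Python. This is a toy model of the rules described above, not Lucene's implementation: names are split on whitespace and lowercased, the artist list is made up from names mentioned in this chapter, and the count of matched optional clauses stands in for real scoring.

```python
def search(names, mandatory=(), optional=(), prohibited=()):
    """Return matching names, those matching more optional clauses first."""
    mandatory = [term.lower() for term in mandatory]
    optional = [term.lower() for term in optional]
    prohibited = [term.lower() for term in prohibited]
    hits = []
    for name in names:
        terms = set(name.lower().split())
        if any(term in terms for term in prohibited):
            continue  # a prohibited clause matched
        if not all(term in terms for term in mandatory):
            continue  # a mandatory clause is missing
        matched_optional = sum(term in terms for term in optional)
        if not mandatory and optional and matched_optional == 0:
            continue  # without mandatory clauses, one optional must match
        hits.append((matched_optional, name))
    hits.sort(key=lambda hit: -hit[0])  # stable: ties keep input order
    return [name for _, name in hits]


ARTISTS = [
    "Smashing Atoms",
    "Smashing Pumpkins",
    "Wild Pumpkins at Midnight",
    "Green Day",
]

# +Smashing Pumpkins
print(search(ARTISTS, mandatory=["Smashing"], optional=["Pumpkins"]))
# Smashing Pumpkins -Atoms
print(search(ARTISTS, optional=["Smashing", "Pumpkins"], prohibited=["Atoms"]))
```

In the first query Smashing Atoms still matches but ranks below Smashing Pumpkins; in the second it is excluded outright, and Green Day drops out because it matches neither optional clause.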

 

If you would like to specify that a certain number or percentage of optional clauses should match or should not match, then you can instead use the dismax query parser with the min-should-match feature, described later in the chapter.

 

Boolean operators

The boolean operators AND, OR, and NOT can be used as an alternative syntax to arrive at the same set of mandatory, optional, and prohibited clauses that were mentioned previously. Use the debugQuery feature, and observe that the parsedquery string normalizes this syntax away into the previous one (clauses being optional by default, as with OR).

 

When the AND or && operator is used between clauses, then both the left and right sides of the operator become mandatory if not already marked as prohibited. So:
Smashing AND Pumpkins
is equivalent to:
+Smashing +Pumpkins


Similarly, if the OR or || operator is used between clauses, then both the left and right sides of the operator become optional, unless they are marked mandatory or prohibited. If the default operator is already OR then this syntax is redundant. If the default operator is AND, then this is the only way to mark a clause as optional.

 

To match artist names that contain Smashing or Pumpkins try:
Smashing || Pumpkins


The NOT operator is equivalent to the - syntax. So to find artists with Smashing but not Atoms in the name, you can do this:
Smashing NOT Atoms

 

We didn't need to specify a + on Smashing. This is because it is the only optional clause and there are no explicit mandatory clauses. Likewise, using an AND or OR would have no effect in this example.

 

It may be tempting to try to combine AND with OR such as:
Smashing AND Pumpkins OR Green AND Day


However, this doesn't work as you might expect. Remember that AND is equivalent to both sides of the operator being mandatory, and thus each of the four clauses becomes mandatory. Our data set returned no results for this query. In order to combine query clauses in some ways, you will need to use sub-queries.


Sub-queries

You can use parentheses to compose a query of smaller queries, referred to as sub-queries or nested queries. The following example satisfies the intent of the previous example:
(Smashing AND Pumpkins) OR (Green AND Day)


Using what we know previously, this could also be written as:
(+Smashing +Pumpkins) (+Green +Day)


But this is not the same as:
+(Smashing Pumpkins) +(Green Day)

 

The preceding sub-query is interpreted as documents that must have a name with either Smashing or Pumpkins and either Green or Day in its name. So if there was a band named Green Pumpkins, then it would match.


Solr added another syntax for sub-queries to Lucene's old syntax that allows the sub-query to use a different query parser including local-params. This is an advanced technique so don't worry if you don't understand it at first. The syntax is a bit of a hack using a magic field named _query_ with its value being the sub-query, which practically speaking, needs to be quoted. As an example, suppose you have a search interface with multiple query boxes, where each box is for searching a different field. You could compose the query string yourself but you would have some query escaping issues to deal with. And if you wanted to take advantage of the dismax parser then with what you know so far, that isn't possible. Here's an approach using this new syntax:

+_query_:"{!dismax qf=a_name v=$q.a_name}" +_query_:"{!dismax qf=a_alias v=$q.a_alias}"

This example assumes that request parameters of q.a_name and q.a_alias are supplied for the user input for these fields in the schema. Recall from the local-params definition that the parameter v can hold the query and that the $ refers to another named request parameter.
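A client would send the fixed q string together with the two user inputs as ordinary request parameters. The Python sketch below shows one way to assemble that; the function name multi_field_request and the sample inputs are invented for illustration.

```python
from urllib.parse import urlencode, parse_qs

# The fixed query: each sub-query reads its text from a request parameter.
SUB_QUERY = (
    '+_query_:"{!dismax qf=a_name v=$q.a_name}" '
    '+_query_:"{!dismax qf=a_alias v=$q.a_alias}"'
)


def multi_field_request(name_input, alias_input):
    """Build the query string for a form with one box per field."""
    return urlencode({
        "q": SUB_QUERY,
        "q.a_name": name_input,    # raw text from the first box
        "q.a_alias": alias_input,  # raw text from the second box
    })


query_string = multi_field_request("Smashing Pumpkins", "SP")
print(query_string)
```

Because the user text never becomes part of the q string itself, only URL escaping is needed, which urlencode handles.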

 

Limitations of prohibited clauses in sub-queries

Lucene doesn't actually support a pure negative query, for example:
-Smashing -Pumpkins


Solr enhances Lucene to support this, but only at the top level query such as in the preceding example. Consider the following admittedly strange query:
Smashing (-Pumpkins)

 

This query attempts to ask the question: Which artist names contain either Smashing or do not contain Pumpkins? However, it doesn't work and only matches the first clause (4 documents). The second clause should essentially match most documents resulting in a total for the query that is nearly every document. The artist named Wild Pumpkins at Midnight is the only one in my index that does not contain Smashing but does contain Pumpkins, and so this query should match every document except that one. To make this work, you have to take the sub-expression containing only negative clauses, and add the all-documents query clause: *:*, as shown below:
Smashing (-Pumpkins *:*)


Interestingly, this limitation is fixed in the edismax query parser. Hopefully a future version of Solr will fix it universally, thereby making this work-around unnecessary.

 

Field qualifier

To have a clause explicitly search a particular field, you need to precede the relevant clause with the field's name, and then add a colon. Spaces may be used in-between, but that is generally not done.
a_member_name:Corgan


This matches bands containing a member with the name Corgan. To match both Billy and Corgan:
+a_member_name:Billy +a_member_name:Corgan

 

Or use this shortcut to match multiple words:
a_member_name:(+Billy +Corgan)


The content of the parentheses is a sub-query, but with the default field being overridden to be a_member_name, instead of what the default field would be otherwise. By the way, we could have used AND instead of + of course. Moreover, in these examples, all of the searches were targeting the same field, but you can certainly match any combination of fields needed.


Phrase queries and term proximity

A clause may be a phrase query: a contiguous series of words to be matched in that order. In the previous examples, we've searched for text containing multiple words like Billy and Corgan, but let's say we wanted to match Billy Corgan (that is, the two words adjacent to each other in that order). This further constrains the query. Double quotes are used to indicate a phrase query, as shown in the following code:
"Billy Corgan"


Related to phrase queries is the notion of term proximity, also known as the slop factor or a near query. In our previous example, if we wanted to permit these words to be separated by no more than, say, three words in between, then we could do this:
"Billy Corgan"~3

 

For the MusicBrainz data set, this is probably of little use. For larger text fields, this can be useful in improving search relevance. The dismax query parser, which is described in the next chapter, can automatically turn a user's query into a phrase query with a configured slop.

 

Wildcard queries

A plain keyword search will look in the index for an exact match, subsequent to text analysis processing on both the query and input document text (for example, tokenization, lowercasing). But sometimes you need to express a query for a partial match expressed using wildcards.

 

There are a few points to understand about wildcard queries:

•     No text analysis is performed on the search word containing the wildcard, not even lowercasing. So if you want to find a word starting with Sma, then sma* is required instead of Sma*, assuming the index side of the field's type includes lowercasing. This shortcoming is tracked on SOLR-219. Moreover, if the field that you want to use the wildcard query on is stemmed in the analysis, then smashing* would not find the original text Smashing because the stemming process transforms this to smash. Consequently, don't stem.
•     Wildcard queries are one of the slowest types you can run. Use of ReversedWildcardFilterFactory helps with this a lot. But if you have an asterisk wildcard on both ends of the word, then this is the worst-case scenario.
•     Leading wildcards will result in an error in Solr unless ReversedWildcardFilterFactory is used.

 

To find artists containing words starting with Smash, you can do:
smash*


Or perhaps those starting with sma and ending with ing:
sma*ing


The asterisk matches any number of characters (perhaps none). You can also use ? to force a match of any character at that position:
sma??*

 

That would match words that start with sma and that have at least two more characters but potentially more.
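To experiment with these patterns outside Solr, the two wildcard characters can be translated into a regular expression. This Python sketch only models how * and ? match against a single indexed term; it says nothing about how Lucene evaluates wildcard queries internally, and the helper names are made up.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (any run of characters) and ? (exactly one) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def term_matches(pattern, term):
    """True if the whole term matches the wildcard pattern, case-sensitively."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(term_matches("smash*", "smashing"))
print(term_matches("sma*ing", "smashing"))
print(term_matches("sma??*", "smal"))
print(term_matches("Sma*", "smashing"))
```

The last call returns False for the reason given in the first bullet above: the indexed term is lowercase, and nothing lowercases the wildcard pattern.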


As far as scoring is concerned, each matching term gets the same score regardless of how close it is to the query pattern. Lucene can support a variable score at the expense of performance but you would need to do some hacking to get Solr to do that.


Fuzzy queries

Fuzzy queries are useful when your search term needn't be an exact match, but the closer the better. The fewer the number of character insertions, deletions, or exchanges relative to the search term length, the better the score. The algorithm used is known as the Levenshtein Distance algorithm, also known as the edit distance. Fuzzy queries have the same need to lowercase and to avoid stemming just as wildcard queries do. For example:
Smashing~


Notice the tilde character at the end. Without this notation, simply Smashing would match only four documents because only that many artist names contain that word. Smashing~ matched 578 words and it took my computer 359 milliseconds. You can modify the proximity threshold, which is a number between 0 and 1, defaulting to 0.5. For instance, changing the proximity to a more stringent 0.7:
Smashing~0.7

 

This returned 25 matching documents and took 174 milliseconds. If you want to use fuzzy queries, then you should consider experimenting with different thresholds. To illustrate how text analysis can still pose a problem, consider the search for:
SMASH~


There is an artist named S.M.A.S.H., and our analysis configuration emits smash as a term. So SMASH would be a perfect match, but adding the tilde results in a search term in which every character is different due to the case difference and so this search returns nothing. As with wildcard searches, if you intend on using fuzzy searches then you should lowercase the query string.
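
For reference, the edit distance itself is easy to compute. The following is a minimal Python sketch of the classic Levenshtein algorithm (insertions, deletions, and substitutions each cost 1). It does not reproduce how Lucene turns the distance into a similarity score; it only shows why an all-uppercase query is as far as possible from the lowercased indexed term.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + substitution_cost))
        previous = current
    return previous[-1]


print(levenshtein("smashing", "smashin"))  # one character apart
print(levenshtein("SMASH", "smash"))       # every character differs because of case
print(levenshtein("smash", "smash"))       # identical once lowercased
```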

 

Range queries

Lucene lets you query for numeric, date, and even text ranges. The following query matches all of the bands formed in the 1990s:
a_type:2 AND a_begin_date:[1990-01-01T00:00:00.000Z TO 1999-12-31T23:59:59.999Z]


Observe that the date format is the full ISO-8601 date-time in UTC, which Solr mandates (the same format used by Solr to index dates and that which is emitted in search results). The .999 milliseconds part is optional. The [ and ] brackets signify an inclusive range, and therefore it includes the dates on both ends. To specify an exclusive range, use { and }. In Solr 3, both sides must be inclusive or both exclusive; Solr 4 allows mixing them. The workaround in Solr 3 is to introduce an extra clause to include or exclude a side of the range. There is an example of this below.
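
When queries are generated by client code, the date format is a common source of mistakes. The following Python sketch shows one way to format timezone-aware datetimes into the UTC form described above and assemble a range clause. The helper names solr_date and date_range are hypothetical, and the sketch leaves out the optional milliseconds.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as an ISO-8601 UTC string ending in Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range(field, start, end, inclusive=True):
    """Build a range clause; [ ] is inclusive on both ends, { } is exclusive on both."""
    open_bracket, close_bracket = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_bracket}{solr_date(start)} TO {solr_date(end)}{close_bracket}"


nineties = date_range(
    "a_begin_date",
    datetime(1990, 1, 1, tzinfo=timezone.utc),
    datetime(1999, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(nineties)
```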

 

For most numbers in the MusicBrainz schema, we only have identifiers, and so it made sense to use the plain long field type, but there are some other fields. For the track duration in the tracks data, we could do a query such as this to find all of the tracks that are longer than 5 minutes (300 seconds, 300,000 milliseconds):
t_duration:[300000 TO *]

 

In this example, we can see Solr's support for open-ended range queries by using *. This feature is not available in Lucene.


Although uncommon, you can also use range queries with text fields. For this to have any use, the field should have only one term indexed. You can control this either by using the string field type, or by using the KeywordTokenizer. You may want to do some experimentation. The following example finds all documents where somefield has a term starting with B. We effectively make the right side of the range exclusive by excluding it with another query clause.
somefield:([B TO C] -C)

 

Neither side of the range, B or C, is processed with any text analysis that might exist in the field type definition. If the field's analysis includes a step such as lowercasing, you will need to apply the same transformation to the query yourself or you will get no results.
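
The effect of a text range with the upper bound excluded can be pictured with plain string comparison. This Python sketch assumes single-term values and ASCII text, where Python's ordering agrees with the index's term ordering. The sample terms are made up.

```python
def in_text_range(term, low, high, exclude_high=False):
    """Return True if low <= term <= high, optionally dropping an exact match on high."""
    if not (low <= term <= high):
        return False
    return not (exclude_high and term == high)


# Hypothetical single-term field values; "beck" shows what happens without matching case.
terms = ["Abba", "B", "Beck", "Blur", "C", "Cake", "beck"]
starting_with_b = [t for t in terms if in_text_range(t, "B", "C", exclude_high=True)]
print(starting_with_b)
```

Note that the lowercase value is not matched, because lowercase letters sort after all uppercase ones.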

 

Date math

Solr extended Lucene's old query parser to add date literals as well as some simple math that is especially useful in specifying date ranges. In addition, there is a way to specify the current date-time using NOW. The syntax offers addition, subtraction, and rounding at various levels of date granularity, like years, seconds, and so on down to milliseconds. The operations can be chained together as needed, in which case they are executed from left to right. Spaces aren't allowed. For example:
r_event_date:[* TO NOW-2YEAR]

 

In the preceding example, we searched for documents where an album was released over two years ago. NOW has millisecond precision. Let's say what we really wanted was precision to the day. By using / we can round down (it never rounds up):
r_event_date:[* TO NOW/DAY-2YEAR]


The units to choose from are: YEAR, MONTH, DAY, DATE (synonymous with DAY), HOUR, MINUTE, SECOND, MILLISECOND, and MILLI (synonymous with MILLISECOND). Furthermore, they can be pluralized by adding an S as in YEARS.
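
To see how chained operations are applied from left to right, here is a small Python sketch that evaluates a subset of the date math syntax against a fixed "now". It is an illustration, not Solr's own parser: it handles only YEAR, MONTH, DAY (and DATE), HOUR, MINUTE, and SECOND with optional plural S, and the function names are hypothetical.

```python
import calendar
import re
from datetime import datetime, timedelta, timezone

_UNITS = ["YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND"]
_FLOORS = [("month", 1), ("day", 1), ("hour", 0), ("minute", 0), ("second", 0), ("microsecond", 0)]
_TOKEN = re.compile(r"([+-]\d+|/)([A-Z]+)")


def _normalize(unit):
    """Map plural forms and the DATE synonym onto the base unit names."""
    if unit.endswith("S"):
        unit = unit[:-1]
    if unit == "DATE":
        unit = "DAY"
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit: {unit}")
    return unit


def _add_months(dt, months):
    """Shift by whole months, clamping the day to the length of the target month."""
    year, month_index = divmod(dt.year * 12 + dt.month - 1 + months, 12)
    day = min(dt.day, calendar.monthrange(year, month_index + 1)[1])
    return dt.replace(year=year, month=month_index + 1, day=day)


def _shift(dt, amount, unit):
    if unit == "YEAR":
        return _add_months(dt, 12 * amount)
    if unit == "MONTH":
        return _add_months(dt, amount)
    return dt + timedelta(**{unit.lower() + "s": amount})


def _round_down(dt, unit):
    """Reset every field finer than `unit` to its minimum value."""
    return dt.replace(**dict(_FLOORS[_UNITS.index(unit):]))


def evaluate_date_math(expression, now):
    """Evaluate an expression such as NOW/DAY-2YEAR, applying operations left to right."""
    if not expression.startswith("NOW"):
        raise ValueError("this sketch only supports expressions starting with NOW")
    rest, position, result = expression[3:], 0, now
    while position < len(rest):
        match = _TOKEN.match(rest, position)
        if match is None:
            raise ValueError(f"cannot parse: {rest[position:]}")
        operator, unit = match.group(1), _normalize(match.group(2))
        result = _round_down(result, unit) if operator == "/" else _shift(result, int(operator), unit)
        position = match.end()
    return result


fixed_now = datetime(2011, 7, 15, 5, 20, 6, tzinfo=timezone.utc)
print(evaluate_date_math("NOW-2YEAR", fixed_now))      # keeps the time of day
print(evaluate_date_math("NOW/DAY-2YEAR", fixed_now))  # rounded down to midnight first
```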

 

Score boosting

You can easily modify the degree to which a clause in the query string contributes to the ultimate relevancy score by adding a multiplier. This is called boosting. A value between 0 and 1 reduces the score, and numbers greater than 1 increase it. You'll learn more about scoring in the next chapter. In the following example, we search for artists that either have a member named Billy, or have a name containing the word Smashing:
a_member_name:Billy^2 OR Smashing


Here we search for an artist name containing Billy, and optionally Bob or Corgan, but we're less interested in those that are also named Corgan:
+Billy Bob Corgan^0.7

 

Existence (and non-existence) queries

This is actually not a new syntax case, but an application of range queries. Suppose you wanted to match all of the documents that have an indexed value in a field. Here we find all of the documents that have something in a_name:
a_name:[* TO *]


As a_name is the default field, just [* TO *] will do. This can be negated to find documents that do not have a value for a_name, as shown in the following code:
-a_name:[* TO *]

 

Like wildcard and fuzzy queries, these are expensive, slowing down as the number of distinct terms in the field increases.

 

Escaping special characters

The following characters are used by the query syntax, as described in this chapter:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \


In order to use any of these without their syntactical meaning, you need to escape them by a preceding \ such as seen here:
id:Artist\:11650

 

This also applies to the field name part. In some cases such as this one where the character is part of the text that is indexed, the double-quotes phrase query will also work, even though there is only one term:
id:"Artist:11650"
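
If queries are assembled in client code, escaping is best done by a small helper. The following Python sketch escapes the characters listed above by prefixing each with a backslash; it treats & and | individually, which is harmless. The function name escape_query_chars is hypothetical, and the character set is taken from this chapter's list rather than from any particular Solr version.

```python
import re

# Characters from the list above; & and | are escaped one at a time.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\])')


def escape_query_chars(text):
    """Prefix every query-syntax character in text with a backslash."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)


print("id:" + escape_query_chars("Artist:11650"))  # id:Artist\:11650
```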

 

A common situation in which a query needs to be generated, and thus escaped properly, is when generating a simple filter query in response to choosing a field-value facet when faceting. This syntax and the suggested situation are getting ahead of ourselves, but I'll show it anyway since it relates to escaping. The query uses the term query parser as follows: {!term f=a_type}group. What follows } is not escaped at all, even a \ is interpreted literally, and so with this trick you needn't worry about escaping rules at all.
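
As a rough sketch of how client code might generate such a filter, the following Python snippet builds the term-parser query and URL-encodes it as an fq parameter. The helper name term_filter is hypothetical, and it assumes the field name itself contains no spaces or special characters.

```python
from urllib.parse import parse_qs, urlencode


def term_filter(field, value):
    """Build a term query parser filter; the value after } needs no query escaping."""
    return f"{{!term f={field}}}{value}"


# URL encoding is still required when the filter travels as an HTTP parameter.
query_string = urlencode({"q": "*:*", "fq": term_filter("a_type", "group")})
print(query_string)
print(parse_qs(query_string)["fq"][0])  # decodes back to the original filter
```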

 

 

 

 

 
