1, Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results.
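Both operations are plain HTTP calls. As a minimal sketch, the snippet below only builds the JSON update body and the query URL without sending them; the base URL and the "techproducts" core name are assumptions, not part of any standard setup:

```python
import json
from urllib.parse import urlencode

# Hypothetical local Solr instance and core name -- adjust to your setup.
SOLR = "http://localhost:8983/solr/techproducts"

# Indexing: POST a JSON array of documents to the /update handler.
docs = [{"id": "SP2514N", "name": "Samsung SpinPoint P120", "price": 92.0}]
update_url = f"{SOLR}/update?commit=true"
update_body = json.dumps(docs)  # send with Content-Type: application/json

# Querying: a GET against the /select handler; wt selects the response format.
params = {"q": "name:spinpoint", "wt": "json", "rows": 10}
query_url = f"{SOLR}/select?{urlencode(params)}"
print(query_url)
```

The same documents could equally be sent as XML or CSV; only the Content-Type and body format change.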
2, Solr Administration User Interface
- Logging
- Cloud Screens
- Core Admin
- Java Properties
- Thread Dump
- Core-Specific Tools
- Analysis Screen
- Dataimport Screen
- Documents Screen
- Files Screen
- Ping
- Plugin & Stats Screen
- Query Screen
- Replication Screen
- Schema Browser Screen
- Segments Info
3, Documents, Fields, Schema Design
- Field Properties
- indexed
- stored
- docValues
- sortMissingFirst / sortMissingLast
- multiValued
- omitNorms
- omitTermFreqAndPositions
- omitPositions
- termVectors / termPositions / termOffsets / termPayloads
- required
4, Analyzers, Tokenizers and Filters
- Analyzers
- An analyzer examines the text of fields and generates a token stream
- Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file
- Example classes
- WhitespaceAnalyzer
- SimpleAnalyzer
- StopAnalyzer
- StandardAnalyzer
- Tokenizers
- WhitespaceTokenizer
- KeywordTokenizer
- LetterTokenizer
- StandardTokenizer
- Filters
- LowerCaseFilter
- StopFilter
- PorterStemFilter
- ASCIIFoldingFilter
- StandardFilter
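Conceptually, an analyzer chains a tokenizer with zero or more filters, each stage transforming the token stream. A minimal Python sketch of a whitespace tokenizer followed by a lowercase filter and a stop filter (the stop-word list here is a made-up example, not Solr's default):

```python
STOP_WORDS = {"a", "an", "the", "of"}  # illustrative list only

def whitespace_tokenizer(text):
    # Split on whitespace, like WhitespaceTokenizer.
    return text.split()

def lowercase_filter(tokens):
    # Normalize case, like LowerCaseFilter.
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # Drop high-frequency function words, like StopFilter.
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    # The token stream flows tokenizer -> filter -> filter, as in Solr.
    return stop_filter(lowercase_filter(whitespace_tokenizer(text)))

print(analyze("The Art of Computer Programming"))
# -> ['art', 'computer', 'programming']
```

Solr wires the same kind of chain declaratively inside a `<fieldType>` element rather than in code.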
5, Indexing
- The three most common ways of loading data into a Solr index
- Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats
- Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated
- Writing a custom Java application to ingest data through Solr's Java Client API
- Tika's language detection feature
- LangDetect language detection
- UIMA (the Apache Unstructured Information Management Architecture) lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations
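For the XML upload path, the update message wraps each document in `<add><doc>` with `<field>` children. A sketch that builds such a message with the standard library (the field names are illustrative):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> update message from a list of dicts."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            # Each field becomes <field name="...">value</field>.
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml([{"id": "1", "title": "Lucene in Action"}])
print(xml_body)
```

The resulting body would be POSTed to the core's /update handler with Content-Type text/xml.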
6, Searching
- The search query is processed by a request handler
- Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication
- To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query
- Input to a query parser can include
- search strings, that is, terms to search for in the index
- parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
- parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema
- Search parameters may also specify a query filter
- Solr's default Query Parser is also known as the "lucene" parser
- The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries
- The largest disadvantage is that it is very intolerant of syntax errors, as compared with something like the DisMax query parser, which is designed to throw as few errors as possible
- The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field
- Additional options enable users to influence the score based on rules specific to each use case, independent of user input
- The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, it supports the full query syntax of the Lucene query parser
- Block Join Query Parsers
- Boost Query Parser
- Collapsing Query Parser
- Complex Phrase Query Parser
- Field Query Parser
- Function Query Parser
- Function Range Query Parser
- Join Query Parser
- Lucene Query Parser
- Max Score Query Parser
- More Like This Query Parser
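A DisMax/eDisMax search is expressed almost entirely through request parameters rather than query syntax. The sketch below builds such a request with per-field boosts via `qf` (the field names and boost values are illustrative):

```python
from urllib.parse import urlencode

params = {
    "defType": "edismax",     # select the eDisMax query parser
    "q": "solr cloud",        # simple user-entered phrase, no special syntax
    "qf": "title^10 body^1",  # search both fields, weighting title higher
    "mm": "2",                # minimum-should-match: both terms must appear
}
query_string = urlencode(params)
print(query_string)
```

Appending this query string to a core's /select URL runs the search; the boosts in `qf` implement the per-field weighting described above.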
- The Standard Highlighter is the Swiss Army knife of the highlighters. It has the most sophisticated and fine-grained query representation of the three highlighters
- FastVector Highlighter
- The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind
- It tends to work better for more languages than the Standard Highlighter, because it supports Unicode breakiterators. On the other hand, its query representation is less advanced than the Standard Highlighter's; for example, it will not work well with the surround parser
- This highlighter is a good choice for large documents and highlighting text in a variety of languages
- Postings Highlighter
- The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field
- This is a much more compact and efficient structure than term vectors, but it is not appropriate for huge numbers of query terms (e.g. wildcard queries)
- Like the FastVector Highlighter, it supports Unicode algorithms for dividing up the document
7, The Well-Configured Solr Instance
- Configuring solrconfig.xml
- request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
- listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm up caches
- the Request Dispatcher for managing HTTP communications
- the Admin Web interface
- parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
- Solr.xml has evolved from configuring a single Solr core, to supporting multiple Solr cores, and finally to defining parameters for SolrCloud
8, SolrCloud
- Concepts
- Collection: a complete logical index in a SolrCloud cluster. It is often divided into one or more Shards, which share the same Config Set. If there is more than one Shard, it is a distributed index; SolrCloud lets you refer to it by the collection name, without needing to deal with the shard-related parameters normally required for distributed search
- Config Set: the set of configuration files required for a Solr Core to provide service. Each config set has a name. At a minimum it must include solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); depending on the contents of those two files, other files may also be needed. It is stored in ZooKeeper. Config sets can be re-uploaded or updated with the upconfig command, and can be initialized or updated via Solr's bootstrap_confdir startup parameter
- Core: a Solr Core. One Solr instance contains one or more Solr Cores, each of which can independently provide indexing and query capabilities; each Solr Core corresponds to one index, or to one Shard of a Collection. Solr Cores were introduced to increase management flexibility and to share resources. The difference in SolrCloud is that the configuration lives in ZooKeeper, whereas a traditional Solr core's configuration files live in a configuration directory on disk
- Leader: the shard replica that wins the election. Each Shard has several Replicas, which hold an election to determine a Leader. An election can happen at any time, but normally it is only triggered when a Solr instance fails. When documents are indexed, SolrCloud routes them to the leader of the corresponding Shard, and the leader distributes them to all of that Shard's replicas
- Replica: one copy of a Shard. Each Replica lives in a Solr Core. For example, a collection named "test" created with numShards=1 and replicationFactor=2 produces two replicas, i.e. two Cores, each on a different machine or Solr instance. One will be named test_shard1_replica1 and the other test_shard1_replica2; one of them will be elected Leader
- Shard: a logical slice of a Collection. Each Shard is divided into one or more replicas, and an election determines which one is the Leader
- ZooKeeper: ZooKeeper provides distributed coordination and locking, and is required for SolrCloud; it handles leader election. Solr can run with an embedded ZooKeeper, but a standalone ensemble is recommended, ideally with three or more hosts
- In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data
- A Solr core is basically an index of the text and fields found in documents
- A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria
- When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it
- A cluster is set of Solr nodes managed by ZooKeeper as a single unit
- When you have a cluster, you can always make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted
- The concept of a leader is similar to that of a master in traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader
- However, with SolrCloud, you don't simply have one master and one or more "slaves", instead you likely have distributed your search and index traffic to multiple machines
- You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer grained control and you can tune it to target your own specific requirements. The default configuration favors throughput over latency
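Tying the concepts together, a collection is created through the Collections API. The sketch below builds the CREATE request and the replica core names Solr would derive from it; the host, port, and collection name are assumptions:

```python
from urllib.parse import urlencode

num_shards, replication_factor = 1, 2
params = {
    "action": "CREATE",
    "name": "test",
    "numShards": num_shards,
    "replicationFactor": replication_factor,
}
create_url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)

# Core names follow the <collection>_shard<N>_replica<M> pattern described above.
core_names = [
    f"test_shard{s}_replica{r}"
    for s in range(1, num_shards + 1)
    for r in range(1, replication_factor + 1)
]
print(core_names)  # -> ['test_shard1_replica1', 'test_shard1_replica2']
```

One replica per shard is then elected leader by ZooKeeper-coordinated election.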
9, Chinese Word Segmenters
- mmseg4j
- mmseg4j is a Chinese word segmenter implementing Chih-Hao Tsai's MMSeg algorithm
- The MMSeg algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules
- ictclas4j is an open-source Java Chinese word segmentation project by sinboy, built on FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun at the Chinese Academy of Sciences
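Forward maximum matching, the basis of both the Simple and Complex MMSeg methods, can be sketched in a few lines; the tiny dictionary here is a made-up example, whereas a real segmenter ships a large lexicon:

```python
# Toy dictionary of known words -- illustrative only.
DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}
MAX_WORD_LEN = 3

def forward_max_match(text):
    """Greedily take the longest dictionary word at each position."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in DICTIONARY:
                result.append(word)
                i += length
                break
    return result

print(forward_max_match("研究生命的起源"))
# -> ['研究生', '命', '的', '起源']
```

The example also shows the classic weakness of the plain greedy method: it picks 研究生 over 研究 + 生命, which is what Complex's extra disambiguation rules are meant to address.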
10, Solr Performance Factors
- Schema Design Considerations
- indexed fields (the number of indexed fields affects index size, indexing memory use, and segment merge time)
- Configuration Considerations
- mergeFactor (roughly determines how many index segments accumulate before a merge; higher values speed up indexing but slow down searches)