1, Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results.
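Both operations are plain HTTP calls. As a minimal sketch, the snippet below only builds the JSON update body and the query URL without sending them; the base URL and the "techproducts" core name are assumptions, not part of any standard setup:

```python
import json
from urllib.parse import urlencode

# Hypothetical local Solr instance and core name -- adjust to your setup.
SOLR = "http://localhost:8983/solr/techproducts"

# Indexing: POST a JSON array of documents to the /update handler.
docs = [{"id": "SP2514N", "name": "Samsung SpinPoint P120", "price": 92.0}]
update_url = f"{SOLR}/update?commit=true"
update_body = json.dumps(docs)  # send with Content-Type: application/json

# Querying: a GET against the /select handler; wt selects the response format.
params = {"q": "name:spinpoint", "wt": "json", "rows": 10}
query_url = f"{SOLR}/select?{urlencode(params)}"
print(query_url)
```

The same documents could equally be sent as XML or CSV; only the Content-Type and body format change.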
2, Solr Administration User Interface
- Logging
- Cloud Screens
- Core Admin
- Java Properties
- Thread Dump
- Core-Specific Tools
- Analysis Screen
- Dataimport Screen
- Documents Screen
- Files Screen
- Ping
- Plugin & Stats Screen
- Query Screen
- Replication Screen
- Schema Browser Screen
- Segments Info
3, Documents, Fields, Schema Design
- Field Properties
- indexed
- stored
- docValues
- sortMissingFirst / sortMissingLast
- multiValued
- omitNorms
- omitTermFreqAndPositions
- omitPositions
- termVectors / termPositions / termOffsets / termPayloads
- required
4, Analyzers, Tokenizers and Filters
- Analyzers
- An analyzer examines the text of fields and generates a token stream
- Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file
- Example classes
- WhitespaceAnalyzer
- SimpleAnalyzer
- StopAnalyzer
- StandardAnalyzer
- Tokenizers
- WhitespaceTokenizer
- KeywordTokenizer
- LetterTokenizer
- StandardTokenizer
- Filters
- LowerCaseFilter
- StopFilter
- PorterStemFilter
- ASCIIFoldingFilter
- StandardFilter
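Conceptually, an analyzer chains a tokenizer with zero or more filters, each stage transforming the token stream. A minimal Python sketch of a whitespace tokenizer followed by a lowercase filter and a stop filter (the stop-word list here is a made-up example, not Solr's default):

```python
STOP_WORDS = {"a", "an", "the", "of"}  # illustrative list only

def whitespace_tokenizer(text):
    # Split on whitespace, like WhitespaceTokenizer.
    return text.split()

def lowercase_filter(tokens):
    # Normalize case, like LowerCaseFilter.
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # Drop high-frequency function words, like StopFilter.
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    # The token stream flows tokenizer -> filter -> filter, as in Solr.
    return stop_filter(lowercase_filter(whitespace_tokenizer(text)))

print(analyze("The Art of Computer Programming"))
# -> ['art', 'computer', 'programming']
```

Solr wires the same kind of chain declaratively inside a `<fieldType>` element rather than in code.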
5, Indexing
- The three most common ways of loading data into a Solr index
- Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats
- Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated
- Writing a custom Java application to ingest data through Solr's Java Client API
- Tika's language detection feature
- LangDetect language detection
- UIMA (the Apache Unstructured Information Management Architecture) lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations
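For the XML upload path, the update message wraps each document in `<add><doc>` with `<field>` children. A sketch that builds such a message with the standard library (the field names are illustrative):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> update message from a list of dicts."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            # Each field becomes <field name="...">value</field>.
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add_xml([{"id": "1", "title": "Lucene in Action"}])
print(xml_body)
```

The resulting body would be POSTed to the core's /update handler with Content-Type text/xml.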
6, Searching
- The search query is processed by a request handler
- Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication
- To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query
- Input to a query parser can include
- search strings, that is, terms to search for in the index
- parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
- parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema
- Search parameters may also specify a query filter
- Solr's default Query Parser is also known as the "lucene" parser
- The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries
- The largest disadvantage is that it is very intolerant of syntax errors, as compared with something like the DisMax query parser, which is designed to throw as few errors as possible
- The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field
- Additional options enable users to influence the score based on rules specific to each use case, independent of user input
- The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, it supports the full query syntax of the Lucene query parser
- Block Join Query Parsers
- Boost Query Parser
- Collapsing Query Parser
- Complex Phrase Query Parser
- Field Query Parser
- Function Query Parser
- Function Range Query Parser
- Join Query Parser
- Lucene Query Parser
- Max Score Query Parser
- More Like This Query Parser
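A DisMax/eDisMax search is expressed almost entirely through request parameters rather than query syntax. The sketch below builds such a request with per-field boosts via `qf` (the field names and boost values are illustrative):

```python
from urllib.parse import urlencode

params = {
    "defType": "edismax",     # select the eDisMax query parser
    "q": "solr cloud",        # simple user-entered phrase, no special syntax
    "qf": "title^10 body^1",  # search both fields, weighting title higher
    "mm": "2",                # minimum-should-match: both terms must appear
}
query_string = urlencode(params)
print(query_string)
```

Appending this query string to a core's /select URL runs the search; the boosts in `qf` implement the per-field weighting described above.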
- The Standard Highlighter is the Swiss Army knife of the highlighters. It has the most sophisticated and fine-grained query representation of the three highlighters
- FastVector Highlighter
- The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind
- It tends to work better for more languages than the Standard Highlighter, because it supports Unicode breakiterators. On the other hand, its query representation is less advanced than the Standard Highlighter's; for example, it will not work well with the surround parser
- This highlighter is a good choice for large documents and highlighting text in a variety of languages
- Postings Highlighter
- The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field
- This is a much more compact and efficient structure than term vectors, but it is not appropriate for huge numbers of query terms (e.g. wildcard queries)
- Like the FastVector Highlighter, it supports Unicode algorithms for dividing up the document
7, The Well-Configured Solr Instance
- Configuring solrconfig.xml
- request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
- listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm up caches
- the Request Dispatcher for managing HTTP communications
- the Admin Web interface
- parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
- Solr.xml has evolved from configuring a single Solr core, to supporting multiple Solr cores, and finally to defining parameters for SolrCloud
8, SolrCloud
- Concepts
- Collection: a complete logical index in a SolrCloud cluster. It is often divided into one or more Shards, which share the same Config Set. If there is more than one Shard, it is a distributed index; SolrCloud lets you refer to it by the collection name, without needing to deal with the shard-related parameters normally required for distributed search
- Config Set: the set of configuration files required for a Solr Core to provide service. Each config set has a name. At a minimum it must include solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); depending on the contents of those two files, other files may also be needed. It is stored in ZooKeeper. Config sets can be re-uploaded or updated with the upconfig command, and can be initialized or updated via Solr's bootstrap_confdir startup parameter
- Core: a Solr Core. One Solr instance contains one or more Solr Cores, each of which can independently provide indexing and query capabilities; each Solr Core corresponds to one index, or to one Shard of a Collection. Solr Cores were introduced to increase management flexibility and to share resources. The difference in SolrCloud is that the configuration lives in ZooKeeper, whereas a traditional Solr core's configuration files live in a configuration directory on disk
- Leader: the shard replica that wins the election. Each Shard has several Replicas, which hold an election to determine a Leader. An election can happen at any time, but normally it is only triggered when a Solr instance fails. When documents are indexed, SolrCloud routes them to the leader of the corresponding Shard, and the leader distributes them to all of that Shard's replicas
- Replica: one copy of a Shard. Each Replica lives in a Solr Core. For example, a collection named "test" created with numShards=1 and replicationFactor=2 produces two replicas, i.e. two Cores, each on a different machine or Solr instance. One will be named test_shard1_replica1 and the other test_shard1_replica2; one of them will be elected Leader
- Shard: a logical slice of a Collection. Each Shard is divided into one or more replicas, and an election determines which one is the Leader
- ZooKeeper: ZooKeeper provides distributed coordination and locking, and is required for SolrCloud; it handles leader election. Solr can run with an embedded ZooKeeper, but a standalone ensemble is recommended, ideally with three or more hosts
- In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data
- A Solr core is basically an index of the text and fields found in documents
- A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria
- When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it
- A cluster is set of Solr nodes managed by ZooKeeper as a single unit
- When you have a cluster, you can always make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted
- The concept of a leader is similar to that of a master in traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader
- However, with SolrCloud, you don't simply have one master and one or more "slaves", instead you likely have distributed your search and index traffic to multiple machines
- You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer grained control and you can tune it to target your own specific requirements. The default configuration favors throughput over latency
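Tying the concepts together, a collection is created through the Collections API. The sketch below builds the CREATE request and the replica core names Solr would derive from it; the host, port, and collection name are assumptions:

```python
from urllib.parse import urlencode

num_shards, replication_factor = 1, 2
params = {
    "action": "CREATE",
    "name": "test",
    "numShards": num_shards,
    "replicationFactor": replication_factor,
}
create_url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)

# Core names follow the <collection>_shard<N>_replica<M> pattern described above.
core_names = [
    f"test_shard{s}_replica{r}"
    for s in range(1, num_shards + 1)
    for r in range(1, replication_factor + 1)
]
print(core_names)  # -> ['test_shard1_replica1', 'test_shard1_replica2']
```

One replica per shard is then elected leader by ZooKeeper-coordinated election.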
9, Chinese Word Segmenters
- mmseg4j
- mmseg4j is a Chinese word segmenter implementing Chih-Hao Tsai's MMSeg algorithm
- The MMSeg algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules
- ictclas4j is an open-source Java Chinese word segmentation project by sinboy, built on FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun at the Chinese Academy of Sciences
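Forward maximum matching, the basis of both the Simple and Complex MMSeg methods, can be sketched in a few lines; the tiny dictionary here is a made-up example, whereas a real segmenter ships a large lexicon:

```python
# Toy dictionary of known words -- illustrative only.
DICTIONARY = {"研究", "研究生", "生命", "的", "起源"}
MAX_WORD_LEN = 3

def forward_max_match(text):
    """Greedily take the longest dictionary word at each position."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # a single character is always accepted as a fallback.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in DICTIONARY:
                result.append(word)
                i += length
                break
    return result

print(forward_max_match("研究生命的起源"))
# -> ['研究生', '命', '的', '起源']
```

The example also shows the classic weakness of the plain greedy method: it picks 研究生 over 研究 + 生命, which is what Complex's extra disambiguation rules are meant to address.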
10, Solr Performance Factors
- Schema Design Considerations
- indexed fields (the number of indexed fields affects index size, indexing memory use, and segment merge time)
- Configuration Considerations
- mergeFactor (roughly determines how many index segments accumulate before a merge; higher values speed up indexing but slow down searches)