Most of what I know about Solr's configuration files can be found online, so this post mainly lists points that are rarely discussed.
1.solrconfig.xml
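- Configuration file (omitted)
- I have not studied this file in depth. One point is worth noting: Solr 1.3 and Solr 1.4 differ.
- In Solr 1.3, you need to specify the location of the Solr index data: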
- <!-- Used to specify an alternate directory to hold all index data
- other than the default ./data under the Solr home.
- If replication is in use, this should match the replication configuration. -->
- <dataDir>${solr.data.dir:./solr/db/data}</dataDir>
- In Solr 1.4, the index directory is generated automatically: data (at the same level as conf), which contains index (the index files).
- <!-- The directory where your SpellChecker Index should live. -->
- <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
- <!-- If this option is not specified, a RAM directory will be used -->
- <str name="spellcheckerIndexDir">spell</str>
2.db-data-config.xml
- Configuration file (using my own company's setup as an example)
- <!--dataConfig is designed for member search, by using the table member,organization-->
- <dataConfig>
- <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
- url="jdbc:mysql://../test" user=".."
- password=".." />
- <document name="private-beta">
- <!-- field for news search, from member table-->
- <entity name="member" pk="memid"
- query="select distinct * from member"
- deltaQuery="select memid from member
- where mem_date_updated > '${dataimporter.last_index_time}'">
- <field column="memid" name="memid" />
- <field column="name" name="name" />
- <field column="mem_first_name" name="mem_first_name" />
- <field column="mem_last_name" name="mem_last_name" />
- <field column="mem_org_title" name="mem_org_title" />
- <field column="mem_org_dept" name="mem_org_dept" />
- <field column="mem_job_func" name="mem_job_func" />
- <field column="mem_job_level" name="mem_job_level" />
- <field column="mem_add_city" name="mem_add_city" />
- <field column="mem_add_state_code" name="mem_add_state_code" />
- <field column="mem_add_country_code" name="mem_add_country_code" />
- <field column="mem_date_created" name="mem_date_created" />
- <field column="mem_email" name="mem_email" />
- <field column="mem_orgid" name="mem_orgid" />
- <field column="mem_add_address" name="mem_add_address" />
- <!-- field for member search, from organization table-->
- <entity name="organization" pk="orgid"
- query="select * from organization where orgid='${member.mem_orgid}'" >
- <field column="orgid" name="orgid" />
- <field column="org_name" name="org_name" />
- <field column="org_website" name="org_website" />
- <field column="org_website_protocol" name="org_website_protocol" />
- <field column="org_subcat_id" name="org_subcat_id" />
- <field column="org_subcat_id2" name="org_subcat_id2" />
- <field column="org_subcat_id3" name="org_subcat_id3" />
- <field column="org_subcat_id4" name="org_subcat_id4" />
- <field column="org_subcat_id5" name="org_subcat_id5" />
- </entity>
- </entity>
- </document>
- </dataConfig>
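- delta-import: deltaQuery finds the ids (id1, id2, ...) of the database records not yet reflected in the index, and deltaImportQuery is then executed for each id. If no deltaImportQuery is given, Solr assembles one from query. For the configuration above, the assembled query should be: select distinct * from member where memid = id1. That record is imported into the index, and then id2 is imported the same way. I had assumed Solr would assemble a query like select .. from .. where .. and memid in (id1, id2), but it does not appear to do so.
- Only when the field selected in deltaQuery is the same as pk (both memid here) is the query assembled as select distinct * from member where memid = id1. Otherwise it is assembled as select distinct * from member and memid = id1, and the delta import fails with an error. That made this passage suddenly clear to me:
The Solr wiki says: pk: The primary key for the entity. It is optional and only needed when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they both can be the same.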
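To avoid depending on how the query gets assembled, the delta query can also be written out explicitly. The fragment below is a minimal sketch of the member entity from the configuration above with a deltaImportQuery added. The ${dataimporter.delta.memid} variable is how I understand DataImportHandler exposes the key returned by deltaQuery, so please check it against the documentation for your Solr version.

```xml
<!-- Sketch only: the member entity with an explicit deltaImportQuery.
     The variable dataimporter.delta.memid is an assumption to verify
     against your DataImportHandler version. -->
<entity name="member" pk="memid"
        query="select distinct * from member"
        deltaQuery="select memid from member
                    where mem_date_updated &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select distinct * from member
                          where memid = '${dataimporter.delta.memid}'">
  <field column="memid" name="memid" />
  <!-- the remaining field mappings stay as in the configuration above -->
</entity>
```

With an explicit deltaImportQuery, the where clause no longer relies on Solr's query rewriting, but deltaQuery should still return the pk column.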
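- left join: when several tables are linked by foreign keys, the example shipped with the Solr source shows a nested <entity></entity> referring to the foreign key of the table in the enclosing <entity></entity>: orgid='${member.mem_orgid}'. So why not change the query in the configuration file directly to: select distinct * from member left join organization o on orgid = mem_orgid? Then no nesting is needed and all the <field></field> definitions can live in one place. This does work, although I have not measured how it performs. Even so, I recommend the nested <entity></entity> form because its logic is clearer.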
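The flattened alternative could look roughly like the sketch below. It reuses the table and column names from the configuration above and lists only a few of the fields, so treat it as an illustration and not as a tested configuration.

```xml
<!-- Sketch only: one entity that joins member and organization,
     replacing the nested organization entity. -->
<entity name="member" pk="memid"
        query="select distinct * from member
               left join organization o on orgid = mem_orgid">
  <!-- columns from the member table -->
  <field column="memid" name="memid" />
  <field column="mem_orgid" name="mem_orgid" />
  <!-- columns from the organization table, now in the same entity -->
  <field column="orgid" name="orgid" />
  <field column="org_name" name="org_name" />
</entity>
```

The join runs once for the whole import, whereas the nested form runs the inner query once per member row. Which is faster will depend on the database and the data volume.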
3.schema.xml
- Configuration file
- <!-- This is an example of using the KeywordTokenizer along
- with various TokenFilterFactories to produce a sortable field
- that does not include some properties of the source text
- -->
- <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
- <analyzer>
- <!-- KeywordTokenizer does no actual tokenizing, so the entire
- input string is preserved as a single token
- -->
- <tokenizer class="solr.KeywordTokenizerFactory"/>
- <!-- The LowerCase TokenFilter does what you expect, which can be
- useful when you want your sorting to be case insensitive
- -->
- <filter class="solr.LowerCaseFilterFactory" />
- <!-- The TrimFilter removes any leading or trailing whitespace -->
- <filter class="solr.TrimFilterFactory" />
- <!-- The PatternReplaceFilter gives you the flexibility to use
- Java Regular expression to replace any sequence of characters
- matching a pattern with an arbitrary replacement string,
- which may include back references to portions of the original
- string matched by the pattern.
- See the Java Regular Expression documentation for more
- information on pattern and replacement string syntax.
- http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
- -->
- <filter class="solr.PatternReplaceFilterFactory"
- pattern="([^a-z])" replacement="" replace="all"
- />
- </analyzer>
- </fieldType>
- <fields>
- <!-- fields for member search; each field corresponds one-to-one to a field in db-data-config.xml -->
- <!-- field from member table-->
- <field name="memid" type="long" indexed="true" stored="true"
- multiValued="false" required="true"/>
- ...
- <field name="mem_orgid" type="long" indexed="true" stored="true"
- multiValued="false"/>
- <!-- field from organization table-->
- <field name="orgid" type="long" indexed="true" stored="true"
- multiValued="false"/>
- <field name="org_name" type="alphaOnlySort" indexed="true" stored="true"
- multiValued="false" />
- ...
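- alphaOnlySort: a field such as org_name would normally be defined with type text. But what if we also need to sort by org_name? Clearly text will not do. alphaOnlySort solves exactly this problem, and the configuration above should make it clear how.
- Note in particular that <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> strips every non-letter character. For example, with the configuration above, a search for a company name such as "123" matches every company whose name consists only of digits, since the digits are stripped from both the query and the indexed values. So I commented this filter out.
- If you choose alphaOnlySort, a search for ibm will not match ibm.com, because an alphaOnlySort value of ibm.com is never split into tokens (the analyzer has no <filter class="solr.WordDelimiterFilterFactory...").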
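One common way to get both behaviours is to keep org_name as a text field for searching and copy it into a second field used only for sorting. The fragment below is a sketch under that assumption. The field name org_name_sort is hypothetical, and the text type is assumed to be the tokenized type from the stock example schema.

```xml
<!-- Inside <fields>: org_name stays tokenized for searching. -->
<field name="org_name" type="text" indexed="true" stored="true"
       multiValued="false" />
<!-- Hypothetical companion field used only for sorting. -->
<field name="org_name_sort" type="alphaOnlySort" indexed="true" stored="false"
       multiValued="false" />

<!-- At the schema level, outside <fields>: fill the sort field at index time. -->
<copyField source="org_name" dest="org_name_sort" />
```

Queries would then search on org_name and sort with sort=org_name_sort asc, so a search for ibm can still match ibm.com through the text analysis chain.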
4.stopwords.txt
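- Configuration file (omitted)
- Stop words apply only to fields whose fieldType is text.
- Some of the stop words defined there need care, for example "OR". Can you safely drop it? Not necessarily: the US state of Oregon is abbreviated OR. If org_state_code has fieldType text and you search for OR, the URL sent to the Solr server, http://localhost:8888/solr/../select?q=*:*?&fq=org_state_code:OR, becomes http://localhost:8888/solr/../select?q=*:*?&fq=org_state_code: and an error is raised.
- So either define org_state_code with a different fieldType, or remove these special words from the stop word list.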
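The first option could be as small as the sketch below. It assumes the string type (solr.StrField in the stock example schema), which performs no analysis, so no stop word filtering happens on this field.

```xml
<!-- Inside <fields> in schema.xml. Sketch only: "string" is assumed to be
     the unanalyzed solr.StrField type from the stock example schema. -->
<field name="org_state_code" type="string" indexed="true" stored="true"
       multiValued="false" />
```

The trade-off is that a string field matches only the exact value, including case, so fq=org_state_code:OR matches OR but not or.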
Reposted from: http://zy19982004.iteye.com/blog/805717