Solr (5): Synonyms

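Solr ships with a built-in synonyms feature, but it is quite limited: Chinese search has to be built on top of word segmentation, so the stock configuration is of little practical use.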

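Concept: roughly speaking, when a user enters a word, Solr also retrieves from the index the content whose terms are synonyms or near-synonyms of that word and shows it to the user, which makes search friendlier. These synonyms must be defined in a configuration file beforehand. For example, if the user enters 日本 (Japan), the configuration file might associate it with related terms such as 鬼子, 屠杀, and 战犯.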

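一) The official configuration: this setup is described in the cookbook, but it cannot be combined with Chinese word segmentation, so it is basically useless for Chinese text.

1: In schema.xml, add a <fieldType> inside the <types> tag, as follows: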

      

<fieldType name="text_syn" class="solr.TextField">
  <!-- Query side: split on whitespace and lowercase; no synonym filter -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Index side: rules from synonyms.txt are applied before lowercasing -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

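The synonyms.txt file referenced here already ships with the configuration files; it is the synonym configuration file. Its format is roughly as follows: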

 

# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
中国,英国,日本

# Synonym mappings can be used for spelling correction too
pixima => pixma

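I have already added a Chinese entry above (because of character-set issues, open the file in EditNote after editing and choose Format --> UTF-8 encoding; if you see garbled characters, fix the encoding). The intent is that entering any of these Chinese words returns the same search results. The file also contains entries written with => and entries separated by commas; here is the official explanation for reference: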

 

Let's get back to our example for a second. What if the person from the marketing
department says that he/she wants not only to be able to find books that have the word
"machine" to be found when entering the word "electronics", but also all the books that
have the word "electronics", to be found when entering the word "machine". The answer
is simple. First, we would set the expand attribute (of the filter) to true. Then we would
change our synonyms.txt file to something like this:
machine, electronics
As I said earlier Solr would expand synonyms to equivalent forms.

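In other words, => is an explicit one-way mapping (the left side is replaced by the right side), while comma-separated terms form a group of equivalents, i.e. many-to-many. Note that a group is only expanded to all of its members when expand="true"; with expand="false", as in the fieldType above, every term in the group is reduced to the first term.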
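To make the difference between the two rule styles concrete, here is a minimal Python sketch that mimics how synonyms.txt rules are interpreted. It is not Solr code: the function name parse_synonym_rules is made up for illustration, ignoreCase is not modelled, and a multi-word right-hand side is simply kept as one string.

```python
def parse_synonym_rules(lines, expand):
    """Build {input term: [replacement terms]} from synonyms.txt-style lines.

    Illustrative only: this imitates the documented rule semantics,
    it is not how Solr implements them.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: each term on the left is replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalence group: all members if expand, else only the first one.
            sources = [t.strip() for t in line.split(",") if t.strip()]
            targets = sources if expand else sources[:1]
        for src in sources:
            # Several rules for the same source term are merged.
            bucket = mapping.setdefault(src, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


rules = ["pixima => pixma", "machine, electronics", "中国,英国,日本"]
expanded = parse_synonym_rules(rules, expand=True)
reduced = parse_synonym_rules(rules, expand=False)

print(expanded["electronics"])  # ['machine', 'electronics']
print(reduced["electronics"])   # ['machine']
print(reduced["日本"])           # ['中国']
print(reduced["pixima"])        # ['pixma']
```

The explicit rule behaves the same under both settings, while the comma-separated groups only become many-to-many when expand is true.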

 

Of course, the relevant field also has to be declared with this type, as follows.

<field name="content_copy" type="text_syn" indexed="true" stored="true"/>

Now test it on the analysis page of the admin interface: enter pixima, and the matching term pixma appears.
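What the analysis page shows for text_syn can be approximated in a few lines of Python. This is only a rough simulation under stated assumptions, not Solr code: INDEX_SYNONYMS is a hand-written subset of the rules in synonyms.txt, the helper names are invented, and a query is assumed to match only when all of its terms are in the index. It illustrates one consequence of applying synonyms on the index side only with expand="false": the index stores the rewritten term, while the query text is not rewritten.

```python
# Hypothetical, hand-written subset of the synonyms.txt rules.
# Keys are lowercase to imitate ignoreCase="true".
INDEX_SYNONYMS = {
    "pixima": ["pixma"],            # explicit rule: pixima => pixma
    "television": ["television"],   # group "Television, Televisions, TV, TVs"
    "televisions": ["television"],  # reduced to its first term because
    "tv": ["television"],           # expand="false"
    "tvs": ["television"],
}


def index_analyze(text):
    """Index analyzer: whitespace tokenizer -> synonym filter -> lowercase."""
    tokens = []
    for token in text.split():
        tokens.extend(INDEX_SYNONYMS.get(token.lower(), [token]))
    return [t.lower() for t in tokens]


def query_analyze(text):
    """Query analyzer: whitespace tokenizer -> lowercase (no synonym filter)."""
    return [t.lower() for t in text.split()]


def matches(query, document):
    """Simplified matching: every query term must be among the indexed terms."""
    indexed = set(index_analyze(document))
    return all(term in indexed for term in query_analyze(query))


print(index_analyze("Canon pixima printer"))       # ['canon', 'pixma', 'printer']
print(matches("pixma", "Canon pixima printer"))    # True
print(matches("pixima", "Canon pixima printer"))   # False
print(matches("TV", "two TVs"))                    # False
print(matches("television", "two TVs"))            # True
```

If queries for pixima or TV should also match, one option is to add the same synonym filter to the query analyzer as well, so that both sides end up with the same rewritten term.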

 

 

 

            
