Building an E-commerce Search Engine with Solr 03 - Configuring solrconfig and schema

sul818

Article link: https://blog.csdn.net/weixin_41970885/article/details/86614875

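To build a search engine with Solr, three steps are needed before basic queries work: create a core, edit the configuration files, and index the data. On top of that, query parsing, Chinese word segmentation, sorting, faceting, highlighting, and similar features still have to be worked out before the search is mature. The configuration files, of course, keep changing as these features are developed.

1 Configuring the solrconfig file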

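As described earlier, solrconfig controls index creation, querying, Solr's caches, and Solr's component handlers. The only change we make to solrconfig here is to switch the schema mode:

<schemaFactory class="ClassicIndexSchemaFactory"/>

Note that after switching the schema mode, the managed-schema file must be renamed to schema.xml, which is the file name ClassicIndexSchemaFactory loads by default.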
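To make the switch easier to picture, the fragment below contrasts the two schema modes as they might appear in solrconfig.xml. It is a minimal sketch: the commented-out managed block follows the form commonly found in Solr's sample configurations, and your own solrconfig.xml may declare it differently or not at all.

```xml
<!-- Managed mode: the schema lives in managed-schema and is changed through the Schema API -->
<!--
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
-->

<!-- Classic mode: the schema is read from a hand-edited schema.xml -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```

Only one schemaFactory element should be active at a time; the core has to be reloaded for the change to take effect.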

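2 Configuring the schema file

2.1 Structure of the schema file

When we send data to Solr, Solr processes it according to rules set in advance, such as each field's type, its purpose, and how it is tokenized. These rules are what the schema defines. In the schema, we first define our own field types (fieldType) on top of Solr's built-in field type classes (the class attribute, which is ultimately a Java class), and then use those field types to define fields (field). The simple schema that follows illustrates this; the line numbers refer to its lines: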

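Line 1, Line 2: declare the XML version and encoding, and the schema's name and version;
Line 3: defines a field type named "long" based on Solr's built-in class "solr.TrieLongField", and sets the two parameters "precisionStep" and "positionIncrementGap" for it;
Line 4: declares the "id" field as the unique key of each document; Solr uses it to decide whether an incoming document is new or replaces one that is already indexed;
Line 5, Line 6: define two fields named "id" and "series_code", both of the "long" type just created, and set the "indexed" and "stored" parameters for them;
Line 7: closes the schema.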

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<uniqueKey>id</uniqueKey>
<field name="id" type="long" indexed="true" stored="true"/>
<field name="series_code" type="long" indexed="true" stored="true"/>
</schema>
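To show how this schema treats incoming data, here is a sketch of a Solr XML update message that fits it. The values are made up for illustration. Because "id" is the uniqueKey, the second document replaces the first instead of being stored alongside it.

```xml
<add>
  <doc>
    <field name="id">1001</field>
    <field name="series_code">20190101</field>
  </doc>
  <!-- Same id as the document above: this one overwrites it -->
  <doc>
    <field name="id">1001</field>
    <field name="series_code">20190102</field>
  </doc>
</add>
```

After a commit, a query for id:1001 would be expected to return a single document carrying the second series_code.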

2.2 Defining field types (fieldType)

<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

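As in the schema example above, a field type definition generally needs three kinds of information: name, class, and properties.

name: the name of the field type; required, and must be unique within the schema file;
class: the built-in Solr class the field type is based on; required. Built-in classes carry the "solr." prefix. Some of them are: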

Field type class	Description
solr.BoolField	Boolean
solr.TrieDateField	Date
solr.TrieIntField	Integer
solr.TrieLongField	Long integer
solr.TrieFloatField	Floating point
solr.TrieDoubleField	Double-precision floating point
solr.StrField	String, not tokenized
solr.TextField	Long text, requires an analyzer

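properties: most field type properties depend on the class, but some are general (partial list):

Field type property	Description
positionIncrementGap	Position gap inserted between the values of a multi-valued field, so that phrase queries do not match across values
autoGeneratePhraseQueries	For text fields: whether phrase queries are generated automatically for adjacent terms; defaults to false for schema version 1.4 and later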

2.3 Defining fields (field)

<field name="series_code" type="long" indexed="true" stored="true"/>

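As in the schema example above, a field definition generally needs four kinds of information: name, type, default, and properties.

name: the name of the field; required, and must be unique within the schema file;
type: the field's type; required, and must match the name attribute of a defined fieldType;
default: the field's default value; optional. It is used when a document supplies no value at index time;
properties: the field's properties. Fields and field types share many properties (see the table below). These can be set on the field type, on the field, or on both, and a value set on the field takes precedence.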

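Field property	Description
indexed	Whether the field is indexed; default true. Must be true for the field to be searchable
stored	Whether the field value is stored; default true. Must be true for the value to be returned in results
compressed	Whether the stored value is compressed; default false. Applies to string and text fields with stored set to true (a legacy attribute from early Solr releases; check whether your version still accepts it)
multiValued	Whether the field holds multiple values; default false
required	Whether the field is mandatory; default false. When true, a document must contain this field to be indexed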
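The properties in the table can be combined freely on a field. The fragment below is a small sketch of how they might look together; all field names are hypothetical, and the "long", "int", and "string" types are assumed to be defined elsewhere in the schema.

```xml
<!-- Must be present in every document, otherwise the document is rejected -->
<field name="product_id" type="long" indexed="true" stored="true" required="true"/>
<!-- Falls back to 0 when the incoming document carries no value -->
<field name="product_stock" type="int" indexed="true" stored="true" default="0"/>
<!-- Accepts several values per document -->
<field name="product_tags" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Searchable, but never returned in query results -->
<field name="product_keywords" type="string" indexed="true" stored="false"/>
```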

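Beyond the plain field definitions above, there are a few special cases: copy fields, dynamic fields, and text fields.

2.3.1 Copy fields

When building search features, we may need to process the same field in different ways, for example by running two different tokenizers over the same text. We may also need to gather the values of several fields into a new field, for example combining a product's name, model, and brand into one multi-valued field used for search. Both cases call for Solr's copy fields.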

<field name="goods" type="string" indexed="true" stored="false" multiValued="true" />
<copyField source="series_name" dest="goods" />
<copyField source="series_code" dest="goods" />
<copyField source="brand_name" dest="goods" />

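As shown above, we first define a field "goods" and then copy the three fields "series_name", "series_code", and "brand_name" into it. Two points deserve attention. First, a copy destination exists mainly for searching, so it is normally declared with stored="false" and its values are not returned; the source fields are unaffected. Second, if a source field is multi-valued, or there is more than one source field, the destination must be declared multi-valued.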
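The other use of copy fields mentioned earlier, processing one field in two different ways, might look like the sketch below. The field name "series_name_exact" is hypothetical, and the "text_cn_index" and "string" types are assumed to be defined in the schema.

```xml
<!-- Analyzed copy, used for ordinary keyword search -->
<field name="series_name" type="text_cn_index" indexed="true" stored="true"/>
<!-- Unanalyzed copy of the same value, used for exact matching, sorting, or faceting -->
<field name="series_name_exact" type="string" indexed="true" stored="false"/>
<copyField source="series_name" dest="series_name_exact"/>
```

The copy is made from the original input value before analysis, so each destination applies its own field type independently.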

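2.3.2 Dynamic fields

Under Solr's indexing rules, only fields defined in the schema can be indexed, which means that whenever the business changes we would have to add fields to the schema and reload it frequently. Dynamic fields make field definitions less rigid. Put simply, a dynamic field is a field whose name is a pattern with a wildcard prefix or suffix. When a document is indexed and Solr finds no explicitly defined field for a name, it goes on to look for a dynamic field whose pattern matches. Dynamic fields let us handle the following scenarios more flexibly:

Fields with the same properties: if product data has several integer fields with identical properties, such as length, width, and weight, we can define a dynamic field "*_num" and then index the data as "length_num", "width_num", and "weight_num";
Fields added ad hoc: if new attributes such as "area" or "radius" keep appearing in product data, we can define a dynamic field "*_att" and then index the data as "area_att" and "radius_att".

Below, we define a dynamic field with the suffix "_s", so that names such as "f1_s" and "f2_s" can be used directly when indexing.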

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
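The two scenarios listed above can be sketched the same way. The patterns "*_num" and "*_att" come from the text; the choice of the "int" and "string" types and the sample values are assumptions made for illustration.

```xml
<!-- Any field name ending in _num is treated as an integer field -->
<dynamicField name="*_num" type="int" indexed="true" stored="true"/>
<!-- Any field name ending in _att is treated as a plain string field -->
<dynamicField name="*_att" type="string" indexed="true" stored="true"/>
```

A document can then use matching names without any further schema change:

```xml
<add>
  <doc>
    <field name="id">1001</field>
    <field name="length_num">120</field>
    <field name="width_num">80</field>
    <field name="area_att">9600 square cm</field>
  </doc>
</add>
```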

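2.3.3 Text fields

We have already covered how text fields are defined. They are revisited here because, unlike integer and string fields, a text field also needs an analyzer (we assume a basic understanding of inverted indexes and Chinese word segmentation). A Solr analyzer has two core components, the Tokenizer and the TokenFilter: the former produces the token stream, and the latter filters or transforms it (filters are optional).

Below is a complete text field type definition. We first define a field type named "series_name" whose class is "solr.TextField". We then configure its "analyzer", which consists of one Tokenizer and two TokenFilters. Solr ships with many tokenizers and filters that can be referenced directly by class name as shown below. Many third-party Chinese segmentation packages need more involved setup, but they are declared in the schema in the same way. The configuration and use of analyzers will be covered in detail later.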

<fieldType name="series_name" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
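A field type can also declare separate analyzers for indexing and for querying, and a field then refers to the type by name. The fragment below is a sketch under assumptions: the type name "text_split" is hypothetical, and the built-in solr.StandardTokenizerFactory stands in for whichever tokenizer is eventually chosen for Chinese text.

```xml
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's query string; synonyms are expanded only here -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The field picks up both analyzers through its type attribute -->
<field name="series_name" type="text_split" indexed="true" stored="true"/>
```

Expanding synonyms only at query time keeps the index smaller and lets synonyms.txt change without reindexing, at the cost of slightly more work per query.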

2.4 A complete schema example

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">

<!-- common fieldType -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<!-- text fieldtype -->
<fieldType name="text_cn_index" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- text fieldtype -->
<fieldType name="text_cn_fulltext" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- field -->
<field name="_id_" type="string" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<uniqueKey>product_id</uniqueKey>
<field name="product_id" type="long" indexed="true" stored="true"/>
<field name="product_name" type="text_cn_index" indexed="true" stored="true"/>
<field name="product_code" type="string" indexed="true" stored="true"/>
<field name="order_code" type="string" indexed="true" stored="true"/>
<field name="product_price" type="float" indexed="true" stored="true"/>
<field name="product_brand" type="text_cn_index" indexed="true" stored="true" multiValued="true"/>
<field name="product_unit" type="string" indexed="true" stored="true"/>
<field name="product_days" type="int" indexed="true" stored="true"/>
<field name="product_rank" type="int" indexed="true" stored="true"/>
</schema>
