Building Search and Recommendation Features for a "Small" Data Cluster with Solr

I say "small" data in contrast to big data: I don't have massive amounts of data yet, so I don't want to start out with a stack like Hadoop, Spark, and Storm plus Mahout.

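For systems with very large data volumes and very complex structure, search and recommendation are technically demanding work. An ordinary application, though, can get there with some simple configuration of existing Apache open-source projects. In practice, this combination has met my current needs in functionality, performance, and scalability.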

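The Solr User's Guide is thick, but once you actually start using Solr you find that building a demo is easy while real-world use raises plenty of problems whose answers are hard to find. I'll start with two small problems I ran into: (1) connecting to a database, and (2) handling the data structure of location data for location search.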

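Installing Solr on Mac OS is straightforward. After installing solr with homebrew, the directory /usr/local/Cellar/solr/5.3.1/ contains several subdirectories: bin, example, and server. From bin, run ./solr start to start the service.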

Open http://localhost:8983 and you will see an admin console page. Find "Core Admin" and choose "Add core".

[Screenshot: the Add Core dialog]

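The dialog shows the key parameters for the configuration. Once the core has been created, it appears in the Core Selector. Of these parameters, solrconfig.xml and schema.xml are the two configuration files we will work on next.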
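The same core can also be created without the web UI, through Solr's CoreAdmin HTTP API. The following is a minimal sketch that only builds the request URL. It assumes the core is named YiDu, that its instance directory under the Solr home already contains a conf/ folder with both configuration files, and that the helper name `core_create_url` is purely illustrative.

```python
from urllib.parse import urlencode


def core_create_url(base_url, name, instance_dir=None):
    """Build a CoreAdmin CREATE request URL for a new core."""
    params = {
        "action": "CREATE",
        "name": name,
        # Directory under the Solr home; assumed to already contain conf/.
        "instanceDir": instance_dir or name,
        "config": "solrconfig.xml",
        "schema": "schema.xml",
        "dataDir": "data",
        "wt": "json",
    }
    return f"{base_url}/admin/cores?{urlencode(params)}"


url = core_create_url("http://localhost:8983/solr", "YiDu")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should have the same effect as filling in the Add Core dialog, provided the directory layout matches these assumptions.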

(1) Configuring the connection to a MySQL database

Find the core under /usr/local/Cellar/solr/5.3.1/server/solr. Mine is named YiDu, so its configuration directory is /usr/local/Cellar/solr/5.3.1/server/solr/YiDu/conf

1) In solrconfig.xml, configure dataimport to tell Solr that I want to import data:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>

2) In data-config.xml, write the following to tell Solr to connect to MySQL and run this SQL statement to fetch the data:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://XXXXX"
                  user="admin"
                  password="password"/>
      <document>
        <entity name="t_userinfo" transformer="script:puttwodouble" pk="c_id" query="sql statement">
          <field column="c_id" name="c_id"/>
          <field column="c_name" name="c_name"/>
          <field column="c_longitude" name="location_1_coordinate"/>
          <field column="c_latitude" name="location_0_coordinate"/>
        </entity>
      </document>
    </dataConfig>

3) Then go back to schema.xml and reflect these results in Solr's data structure:

    <field name="c_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="c_name" type="text_ws" indexed="true" stored="true"/>
    <field name="location" type="location" indexed="true" stored="true"/>

With these 3 steps done, the data in MySQL can be reached through Solr.
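Configuring the handler does not by itself load any rows; an import still has to be triggered against the /dataimport endpoint registered in solrconfig.xml. Below is a minimal sketch that builds such request URLs. The core name YiDu and the helper name `dataimport_url` are assumptions for illustration; `full-import` and `status` are standard DataImportHandler commands.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, core, command="full-import", clean=True, commit=True):
    """Build a DataImportHandler request URL for the given core."""
    params = {"command": command, "wt": "json"}
    if command == "full-import":
        # clean=true wipes the existing index before importing;
        # commit=true makes the imported documents searchable afterwards.
        params["clean"] = str(clean).lower()
        params["commit"] = str(commit).lower()
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


import_url = dataimport_url("http://localhost:8983/solr", "YiDu")
status_url = dataimport_url("http://localhost:8983/solr", "YiDu", command="status")
print(import_url)
print(status_url)
```

Requesting the first URL starts the import, and polling the second one reports progress. Passing `clean=False` is the safer choice when the index already holds documents that should survive the import.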

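(2) Handling the data structure for location search
In MySQL I keep latitude and longitude in two columns. Solr's location type does not work that way, so the two values have to be merged when the data is fetched, to match Solr's data structure definition.
1) In data-config, write: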

    <script><![CDATA[
      // Merge the latitude and longitude columns into a single "lat,lon"
      // string and store it in the row under the "location" key.
      function puttwodouble(row) {
        var attrVal1 = row.get("c_latitude");
        var attrVal2 = row.get("c_longitude");
        var attrVal = attrVal1 + "," + attrVal2;
        row.put("location", attrVal);
        return row;
      }
    ]]></script>

Here a script merges the two values into one.
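The merge itself is simple enough to try outside Solr. The following Python sketch mirrors what the script transformer does to one row and adds a range check, which is not part of the original script and is included here only as an assumption about what bad data might look like. The function name `to_lat_lon` is hypothetical; the column names come from the data-config above.

```python
def to_lat_lon(row):
    """Return a copy of row with a "lat,lon" string under the "location" key."""
    lat = float(row["c_latitude"])
    lon = float(row["c_longitude"])
    # Extra validation (not in the original script): reject impossible coordinates.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    merged = dict(row)
    # Latitude comes first, then longitude, separated by a comma.
    merged["location"] = f"{lat},{lon}"
    return merged


sample = to_lat_lon({"c_id": "1", "c_latitude": "39.9", "c_longitude": "116.4"})
print(sample["location"])
```

The order matters: latitude first, then longitude. Swapping the two columns is an easy mistake to make and would place every document in the wrong spot.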

2) In schema.xml you can find this:

    <!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

Accordingly, the field definition tells Solr: I have prepared location for you, store it:

    <field name="location" type="location" indexed="true" stored="true"/>
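Once location is indexed this way, it can be used for radius searches. As a rough sketch, the Python below builds a select URL that uses Solr's `geofilt` filter and sorts by `geodist()`. The core name YiDu, the sample coordinates, and the helper name `nearby_query_url` are assumptions for illustration only.

```python
from urllib.parse import urlencode


def nearby_query_url(base_url, core, lat, lon, radius_km, rows=10):
    """Build a select URL for documents within radius_km of (lat, lon)."""
    params = {
        "q": "*:*",
        "fq": "{!geofilt}",        # spatial filter: keep points inside the circle
        "sfield": "location",      # the LatLonType field defined in schema.xml
        "pt": f"{lat},{lon}",      # center point as "lat,lon"
        "d": radius_km,            # radius in kilometers
        "sort": "geodist() asc",   # nearest documents first
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


query_url = nearby_query_url("http://localhost:8983/solr", "YiDu", 39.9, 116.4, 5)
print(query_url)
```

Because `sfield`, `pt`, and `d` are passed as top-level parameters, `geodist()` in the sort clause picks them up without repeating the arguments.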

[Screenshot: the Admin Console]

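The above is the foundation for studying Solr's features and performance characteristics and then gradually customizing it to your needs.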

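I divide learning Solr into three parts:

Beginner: installation, configuration, and querying
Intermediate: data structure definition, index, Analyzer, tokenizer, and the use of Filter and Query
Advanced: Cloud, custom tokenizers and filters, and debugging and tuning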

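Along the way, you can customize Solr to fit your own needs. For example, when matching text Solr may filter out high-frequency keywords as noise, and in some scenarios this seemingly reasonable behavior causes trouble. As another example, different data may have different real-time requirements, so the way source data is indexed cannot be one-size-fits-all either.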
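To make the noise-word problem concrete, here is a toy illustration in Python rather than Solr's actual analysis chain. In Solr this kind of filtering is typically done by a stop filter reading a stopwords.txt file; the word list and the `analyze` helper below are hypothetical.

```python
# Hypothetical stop word list, standing in for a stopwords.txt file.
STOP_WORDS = {"the", "who", "to", "be", "or", "not"}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


print(analyze("The Who live concert"))
print(analyze("To be or not to be"))
```

The second query loses every token, so nothing is left to match against the index. For a catalog of band names or book titles, trimming or emptying the stop word list for that field would be the kind of customization the paragraph above has in mind.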
From the perspective of the search and recommendation model, these are the problems that need to be solved:

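1) Data preprocessing
i) Standardizing the source data
ii) Preprocessing the source data
2) The recommendation/application system
3) The recommendation algorithm (model)
4) Evaluating recommendation quality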

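When using Solr in an application, the items in the two lists above matter a great deal as well, but for reasons of space they are not discussed here. Interested readers are welcome to get in touch through the subscription account so we can learn from each other.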

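There are plenty of resources to consult when choosing a recommendation system. For example, searching for Solr on Quora turns up quite a few good articles. I also studied Lucene, Sphinx, and Mahout. Of these, Mahout provides quite a few recommendation algorithms and is a tool worth learning.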
http://www.aboutyun.com/thread-17861-1-1.html

Source: "An Opening Move for Quickly Building Search and Recommendation on 'Small' Data with Solr"