Quick Deployment of Site-Wide Search with Lucene / Solr

weixin_33719619

Original article: https://my.oschina.net/u/589241/blog/1819231

Step 1: Install a Java 8 JVM. The Solr website says Java 8 is required. Download it from this page:

http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Download the rpm and install it directly on the server:

# rpm -ivh jdk-8u171-linux-x64.rpm
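
Since Solr needs Java 8, it can be worth confirming which runtime is actually on the PATH after the install. The following is only a sketch: java_major is a helper name made up for this example, and it assumes the version string has the usual 1.8.0_171 or 11.0.2 shape.

```shell
#!/bin/sh
# Sketch: java_major is a hypothetical helper, not part of Java or Solr.
# It reduces a version string such as 1.8.0_171 or 11.0.2 to its major number.
java_major() {
    case $1 in
        1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;   # legacy scheme: 1.8.0_171 -> 8
        *)   printf '%s\n' "${1%%[._-]*}" ;;          # newer scheme: 11.0.2 -> 11
    esac
}

# Typical use (java -version writes to stderr, hence 2>&1):
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
#   [ "$(java_major "$ver")" -ge 8 ] || echo "Java 8 or later is required" >&2
```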

Step 2: Download Solr

http://www.apache.org/dyn/closer.lua/lucene/solr/7.3.1

Step 3: Unpack Solr

# unzip solr-7.3.1.zip
# mv solr-7.3.1 ./solr   # rename it; otherwise every later path has to carry the long versioned name
# chown -R nobody:nobody ./solr   # hand the directory over to nobody
# cd solr
# /sbin/iptables -I INPUT -p tcp --dport 8983 -j ACCEPT   # open port 8983
# /etc/rc.d/init.d/iptables save

On release 7 systems that use firewalld (for example CentOS 7), do this instead:
# firewall-cmd --permanent --add-port=8983/tcp --zone=trusted
# systemctl restart firewalld

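Step 4: Build the index

# su nobody   # nobody owns the website directory
$ cd bin
$ ./solr start   # Solr must be running before a core can be created
$ ./solr create -c mysite
$ ./post -c mysite 'http://www.mysite.com/product.php?id=1'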

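Posting pages one at a time like this is too slow, so I wrote a script to do it. The post tool can also index pdf, doc, txt, rtf and many other file formats, but I only need HTML.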

#!/bin/bash
# myindexer.sh: post product pages 1 through 1999 to the mysite core

for ((i = 1; i < 2000; i++))
do
   ./post -c mysite "http://www.mysite.com/product.php?id=$i"
done
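
If some of the 1999 requests fail, a plain loop gives no record of which ones. One possible variation, shown here only as a sketch, separates URL generation from posting and logs the URLs that did not go through. The names product_urls, index_products, POST_CMD and failed.log are made up for this example, and the sketch assumes the post tool exits with a non-zero status on failure, which I have not verified.

```shell
#!/bin/sh
# Sketch: product_urls, index_products, POST_CMD and failed.log are hypothetical names.

# Print one product URL per line for ids first..last (inclusive).
product_urls() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        printf 'http://www.mysite.com/product.php?id=%s\n' "$i"
        i=$((i + 1))
    done
}

# Post every URL to the given core and append the URLs whose post command
# returned a non-zero status to failed.log (assumes post reports failure that way).
# POST_CMD defaults to ./post, so run this from Solr's bin directory.
index_products() {
    core=$1 first=$2 last=$3
    product_urls "$first" "$last" | while IFS= read -r url; do
        "${POST_CMD:-./post}" -c "$core" "$url" || printf '%s\n' "$url" >> failed.log
    done
}
```

Calling index_products mysite 1 1999 would cover the same range as the loop above, and failed.log can then be fed back in for a retry.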

The remaining steps are simple, but I will write them down as well for reference.

$ chmod +x ./myindexer.sh   # make it executable
$ ./myindexer.sh   # run it

While it runs, the output looks like this:

COMMITting Solr index changes to http://localhost:8983/solr/mysite/update/extract...
Time spent: 0:00:10.473
java -classpath /home/www/www.mysite.com/www/solr/dist/solr-core-7.3.1.jar -Dauto=yes -Dc=mysite -Ddata=web org.apache.solr.util.SimplePostTool http://www.mysite.com/product.php?id=1
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/mysite/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://www.mysite.com/product.php?id=1 (depth: 0)
1 web pages indexed.
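
With a couple of thousand runs, it helps to total up the summary lines rather than read the log by eye. A minimal sketch, assuming every run ends with a line of the form "N web pages indexed." as in the output above; count_indexed is a name invented for this example.

```shell
#!/bin/sh
# Sketch: count_indexed is a hypothetical helper.
# Reads post tool output on stdin and sums the N in each "N web pages indexed." line.
count_indexed() {
    awk '/web pages indexed\.$/ { total += $1 } END { print total + 0 }'
}

# Typical use:
#   ./myindexer.sh 2>&1 | count_indexed
```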

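Step 5: Set up a search box

First, note that once Solr is running it serves queries over a RESTful HTTP interface. For example, searching my site for "小说" ("novel") looks like this: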

http://www.ypgogo.com:8983/solr/mysite/select?q=小说

It returns the query result as JSON, like this:

{
  "responseHeader":{
    "status":0,
    "QTime":4,
    "params":{
      "q":"小说"}},
  "response":{"numFound":157,"start":0,"docs":[
      {
        "url":["http://www.ypgogo.com/Read/info/id/87"],
        "id":"http://www.ypgogo.com/Read/info/id/87",
        "stream_size":["null"],
        "x_ua_compatible":["IE=9; IE=8; IE=7; IE=edge; chrome=1"],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "cache_control":["no-transform"],
        "keywords":["阅读,读书,文学,书籍,书单,读后感,好书,好书推荐,reading,books,小说"],
        "viewport":["width=device-width, initial-scale=1.0"],
        "dc_title":["你在高原 -  \n        \t雅朋网 - 分享阅读的快乐！"],
        "content_encoding":["UTF-8"],
        "description":["雅朋网是一个分享阅读书单的社区！有好书，请来这里推荐给大家。"],
        "title":["你在高原 -  \n        \t雅朋网 - 分享阅读的快乐！"],
        "content_type":["text/html; charset=UTF-8"],
        "stream_size_str":["null"],
        "url_str":["http://www.ypgogo.com/Read/info/id/87"],
        "cache_control_str":["no-transform"],
        "x_ua_compatible_str":["IE=9; IE=8; IE=7; IE=edge; chrome=1"],
        "dc_title_str":["你在高原 -  \n        \t雅朋网 - 分享阅读的快乐！"],
        "x_parsed_by_str":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "description_str":["雅朋网是一个分享阅读书单的社区！有好书，请来这里推荐给大家。"],
        "content_type_str":["text/html; charset=UTF-8"],
        "stream_content_type_str":["text/html"],
        "viewport_str":["width=device-width, initial-scale=1.0"],
        "title_str":["你在高原 -  \n        \t雅朋网 - 分享阅读的快乐！"],
        "keywords_str":["阅读,读书,文学,书籍,书单,读后感,好书,好书推荐,reading,books,小说"],
        "_version_":1601524572017917952,
        "content_encoding_str":["UTF-8"]},
      ...
  ]}
}
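
Each document in this response carries far more fields than a results page needs. Solr's standard fl (field list) and rows parameters can trim the response on the server side. The sketch below only builds such a URL; select_url is a name made up for this example, and it expects the query text to be URL-encoded already.

```shell
#!/bin/sh
# Sketch: select_url is a hypothetical helper.
# $1 core name, $2 query text (already URL-encoded),
# $3 comma-separated field list, $4 maximum number of rows
select_url() {
    printf 'http://localhost:8983/solr/%s/select?q=%s&fl=%s&rows=%s&wt=json\n' "$1" "$2" "$3" "$4"
}

# Typical use:
#   curl "$(select_url mysite %E5%B0%8F%E8%AF%B4 id,title,description 10)"
```

With fl=id,title,description each returned document would contain only those three fields, which are the ones a simple result list is likely to display.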

All that is left is to process this with PHP and display the results on a web page.

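Later I realized that anyone could reach port 8983, which seemed unwise, so I restricted it to programs on the local machine.

If you opened the port in step 3, delete that rule first, then add the two rules below. Their order matters: iptables checks rules from top to bottom and stops at the first match, so the ACCEPT rule for 127.0.0.1 has to come before the DROP rule.

# vi /etc/sysconfig/iptables

Find the rule added in step 3 and delete it. Then run the following.

# iptables -A INPUT -p tcp -s 127.0.0.1 --dport 8983 -j ACCEPT
# iptables -A INPUT -p tcp --dport 8983 -j DROP
# /etc/rc.d/init.d/iptables save
# service iptables restart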
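
Because iptables applies the first rule that matches, a DROP for port 8983 that sits ahead of the localhost ACCEPT blocks local access too. A quick way to check the order is to inspect the output of iptables -S INPUT. The sketch below does that check on text read from stdin; accept_before_drop is a name invented for this example, and it assumes rules are printed in the usual "-A INPUT ... --dport 8983 -j ACCEPT" form.

```shell
#!/bin/sh
# Sketch: accept_before_drop is a hypothetical helper.
# Reads `iptables -S INPUT` output on stdin and succeeds only if an ACCEPT rule
# for port 8983 appears before the first DROP rule for that port.
accept_before_drop() {
    awk '
        /--dport 8983/ && /-j ACCEPT/ && !accept { accept = NR }
        /--dport 8983/ && /-j DROP/   && !drop   { drop = NR }
        END { exit !(accept && (!drop || accept < drop)) }
    '
}

# Typical use (as root):
#   iptables -S INPUT | accept_before_drop || echo "DROP rule shadows the ACCEPT rule" >&2
```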

Even so, I then ran into trouble and still could not connect, so I set up a new virtual host named localhost:

<VirtualHost *:80>
    DocumentRoot   /some/path
    ServerName     localhost

    <Directory /some/path >
        Require ip 127.0.0.1  ::1
    </Directory>

</VirtualHost>

After restarting Apache, I tested it:

# curl http://localhost:8983/solr/mysite/select?q=小说

Later, the same request failed inside my program with this error:

Server Error
Caused by:
org.apache.solr.common.SolrException: URLDecoder: The query string contains a not-%-escaped byte > 127 at position 2

It turned out that the query text has to be URL-encoded first. Note that the language I develop in is PHP.

// $q = '小说';  // wrong: sends raw non-ASCII bytes in the query string
$q = urlencode('小说');
$url = 'http://localhost:8983/solr/mysite/select?q=' . $q;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
$results = curl_exec($ch);
curl_close($ch);
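
The same encoding rule applies when testing from the command line. As a rough shell counterpart to urlencode(), the sketch below percent-encodes a string byte by byte using only od, printf and tr. The function name is chosen for this example, and unlike PHP's urlencode() it writes a space as %20 rather than +.

```shell
#!/bin/sh
# Sketch: urlencode here is a hypothetical shell helper, not a standard command.
# Letters, digits and - . _ ~ pass through; every other byte becomes %XX.
urlencode() {
    out=
    # od prints each input byte as two hex digits; word splitting yields one item per byte
    for byte in $(printf '%s' "$1" | od -An -tx1 -v); do
        case $byte in
            3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a]|2d|2e|5f|7e)
                # unreserved ASCII character: convert the hex value back to the character
                out="$out$(printf "\\$(printf '%03o' "0x$byte")")" ;;
            *)
                out="$out%$(printf '%s' "$byte" | tr 'a-f' 'A-F')" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Typical use:
#   curl "http://localhost:8983/solr/mysite/select?q=$(urlencode '小说')"
# curl can also do the encoding itself:
#   curl -G --data-urlencode 'q=小说' http://localhost:8983/solr/mysite/select
```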
