Building Real-Time Classical Chinese Poetry Search with Spring Boot, Elasticsearch 7.6.1, Jsoup, Vue, and Docker

Installing the services

Downloading and installing Elasticsearch

Install elasticsearch:7.6.1 with Docker

docker pull elasticsearch:7.6.1


mkdir -p /Users/szcl/mydata/elasticsearch/config

mkdir -p /Users/szcl/mydata/elasticsearch/data

echo "http.host: 0.0.0.0" >> /Users/szcl/mydata/elasticsearch/config/elasticsearch.yml

chmod -R 777 /Users/szcl/mydata/elasticsearch/

docker run --name elasticsearch -p 9200:9200 -p 9300:9300  -e "discovery.type=single-node" -e ES_JAVA_OPTS="-Xms64m -Xmx128m" -v /Users/szcl/mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /Users/szcl/mydata/elasticsearch/data:/usr/share/elasticsearch/data -v /Users/szcl/mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins -d elasticsearch:7.6.1

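Check whether the container started:

docker ps -a

If it did not start, inspect the logs (substitute your own container ID):

docker logs -f b016c22606e1

Then open port 9200 of the server in a browser (for example http://localhost:9200); Elasticsearch should answer with a JSON document describing the node.

Installing the elasticsearch-head plugin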

docker pull mobz/elasticsearch-head:5

docker run -d -p 9100:9100 docker.io/mobz/elasticsearch-head:5

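Once the container is running, open port 9100 of the server in a browser.

On a fresh installation, the head plugin's requests may be rejected because of cross-origin (CORS) restrictions. Elasticsearch has to be configured to allow them, which can be done in either of two ways:

  • Edit the Elasticsearch configuration file mounted from the host

    cd /Users/szcl/mydata/elasticsearch/config
    
    vim elasticsearch.yml
    

    Add the following to the configuration

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

    Restart the container (use your own container ID, or the container name elasticsearch)

    docker restart b016c22606e1
    
  • Edit the configuration inside the container

    docker exec -it b016c22606e1 /bin/bash
    
    cd ./config
    
    vim elasticsearch.yml
    

    Add the following to the configuration

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    

    Restart the container

    docker restart b016c22606e1
    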

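Creating an index

Clicking OK in the head plugin's new-index dialog does nothing, so check the browser console.

The request returns a 406 error; open it to see the details.

The response shows that the application/x-www-form-urlencoded content type is not supported.

Solution:

  • Enter the head container

    docker exec -it 62c5c56241ae /bin/bash
    
  • Go to the _site folder

  • Edit vendor.js

    vim vendor.js
    

    If vim is not available in the container, either:

    • Copy the file from the container to the host and edit it there

      Reference: https://blog.csdn.net/zhaoyajie1011/article/details/98610002

    • Install vim

      apt-get update
      apt-get install vim
      
  • Change the following

    contentType: "application/x-www-form-urlencoded"
    change to:
    contentType: "application/json;charset=UTF-8"
    
    var inspectData = s.contentType === "application/x-www-form-urlencoded"
    change to:
    var inspectData = s.contentType === "application/json;charset=UTF-8"
    
  • Restart the container

The index can now be created. However, the head plugin is mainly a tool for displaying data and is not well suited to complex queries; for querying it is better to install Kibana, which is more powerful.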

Installing Kibana

  • Pull the image with Docker

    docker pull kibana:7.6.1
    
  • Start the container (replace IP with the address of the Elasticsearch host)

    docker run --name kibana -e ELASTICSEARCH_HOSTS=http://IP:9200 -p 5601:5601 -d kibana:7.6.1
    
  • Edit the configuration

    Here the file is copied from the container to the host and edited there

    docker cp 970f63f0babb:/usr/share/kibana/config/kibana.yml /mydata/kibana/config/
    

    Edit it directly on the host

    vim kibana.yml
    

    Set the following:

    server.name: kibana
    server.port: 5601
    server.host: "0.0.0.0"
    elasticsearch.hosts: [ "http://IP:9200" ]
    i18n.locale: "zh-CN"
    xpack.monitoring.ui.container.elasticsearch.enabled: true
    

    Copy the edited configuration back into the container

    docker cp /mydata/kibana/config/kibana.yml 970f63f0babb:/usr/share/kibana/config/
    
  • Restart the container

    docker restart 970f63f0babb
    
  • Open port 5601 in a browser

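Installing the IK analyzer

  • Enter the elasticsearch container

    docker exec -it 98d725e6291e /bin/bash
    
  • Install the plugin

    elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.1/elasticsearch-analysis-ik-7.6.1.zip
    

  • Restart all containers

  • Test the tokenization

    • Open the Kibana console at http://localhost:5601/

    • Find Dev Tools in the sidebar

    • Test ik_max_word (finest-grained segmentation)

      POST _analyze
      {
        "analyzer": "ik_max_word",
        "text": "中国共产党"
      }
      

    • Test ik_smart (coarsest segmentation, fewest tokens)

      POST _analyze
      {
        "analyzer": "ik_smart",
        "text": "中国共产党"
      }
      

  • Custom dictionary

    • Suppose the text "我爱赵亚杰" has to be tokenized: both ik_smart and ik_max_word split the name into single characters

    • This is where a custom dictionary comes in. Enter the container and locate the IK analyzer configuration

      docker exec -it 98d725e6291e /bin/bash
      
      cd config/analysis-ik/
      
      vi IKAnalyzer.cfg.xml 
      

      Set your own dictionary file in <entry key="ext_dict"></entry>

      <entry key="ext_dict">my.dic</entry>
      

      Save, then create the dictionary my.dic

      vi my.dic
      

      Enter 赵亚杰 on a line of its own in my.dic and save

    • Restart the elasticsearch container

    • Test the custom dictionary: the name should now come back as a single token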

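Basic Elasticsearch operations

Operation reference

Operation                            Method   URL
Create a document (explicit ID)      PUT      localhost:9200/<index>/<type>/<document ID>
Create a document (generated ID)     POST     localhost:9200/<index>/<type>
Update a document                    POST     localhost:9200/<index>/<type>/<document ID>/_update
Delete a document                    DELETE   localhost:9200/<index>/<type>/<document ID>
Get a document by ID                 GET      localhost:9200/<index>/<type>/<document ID>
Search all documents                 POST     localhost:9200/<index>/<type>/_search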
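
As a rough illustration of the "create a document with an explicit ID" row, the sketch below builds (but does not send) such a request with the JDK 11+ java.net.http API. The class name DocRequests, the helper createDoc, and the base URL http://localhost:9200 are assumptions made for this example, and _doc is used as the type, which is the default in 7.x. The explicit Content-Type: application/json header matters: a body sent as application/x-www-form-urlencoded is what produced the 406 error in the head plugin earlier.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class DocRequests {
    // Assumed base URL of a local single-node cluster.
    static final String BASE = "http://localhost:9200";

    /** Builds PUT {BASE}/{index}/_doc/{id} with a JSON body; nothing is sent. */
    static HttpRequest createDoc(String index, String id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = createDoc("poem", "1", "{\"title\":\"相思\",\"author\":\"王维\"}");
        // Prints the HTTP method and target URI of the prepared request.
        System.out.println(req.method() + " " + req.uri());
    }
}
```

To actually send the request, pass it to java.net.http.HttpClient.send with a body handler; the other rows of the table differ only in the HTTP method and the path.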

Common operations

  • Check cluster health

    GET _cat/health
    
    1599732945 10:15:45 elasticsearch yellow 1 1 5 5 0 0 2 0 - 71.4%
    
  • List the indices through _cat

    GET _cat/indices
    
    yellow open poem                     tWco8rUWQCS1YuMtkrCl4A 1 1  1 0  5.1kb  5.1kb
    green  open .kibana_task_manager_1   1SxsVdvgSZOOQ3X9wKXJzQ 1 0  2 1 16.2kb 16.2kb
    yellow open poem2                    xWMF79GYTaKco1Ljo2SmrA 1 1  0 0   283b   283b
    green  open .apm-agent-configuration UGRU7tD0Tj-bOnmo-nfZrw 1 0  0 0   283b   283b
    green  open .kibana_1                r65DwNYWSha1v7AW5v62QQ 1 0 20 6   48kb   48kb
    

    The _cat APIs expose a lot of other information as well; GET _cat lists the available endpoints.
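
The _cat/health output shown earlier is a single whitespace-separated row. The sketch below maps it onto the column names that GET _cat/health?v prints as a header in 7.x; the class CatHealth and its parse method are hypothetical helpers written for this illustration. A yellow status with unassigned shards is expected on a single-node cluster, because replica shards cannot be allocated on the same node as their primaries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CatHealth {
    // Default column order of GET _cat/health in Elasticsearch 7.x.
    static final String[] COLUMNS = {
        "epoch", "timestamp", "cluster", "status", "node.total", "node.data",
        "shards", "pri", "relo", "init", "unassign", "pending_tasks",
        "max_task_wait_time", "active_shards_percent"
    };

    /** Splits one output row on whitespace and pairs each value with its column name. */
    static Map<String, String> parse(String line) {
        String[] values = line.trim().split("\\s+");
        if (values.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                    "expected " + COLUMNS.length + " columns, got " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            row.put(COLUMNS[i], values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row =
                parse("1599732945 10:15:45 elasticsearch yellow 1 1 5 5 0 0 2 0 - 71.4%");
        System.out.println("status=" + row.get("status") + ", unassigned=" + row.get("unassign"));
    }
}
```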

Creating an index

Default field types
PUT /poem/poem/1
{
  "title": "相思",
  "author": "王维",
  "content": "红豆生南国,春来发几枝。愿君多采撷,此物最相思。"
}

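Run the request.

Then check the new index in the elasticsearch-head plugin.

The document content can be inspected on the plugin's data browser tab.

Explicit field types (defining the index mapping)
PUT /poem2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "date": {
        "type": "date"
      },
      "content": {
        "type": "text"
      }
    }
  }
}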

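Check the result in the head plugin.

Querying

Basic lookup
GET /poem/_doc/1
or
GET /poem/poem/1

This reads from the index poem. _doc is the default type (types are deprecated in Elasticsearch 7.x and removed in 8.x), and 1 is the ID of the document to return.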

{
  "_index" : "poem",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title" : "相思",
    "author" : "王维",
    "content" : "红豆生南国,春来发几枝。愿君多采撷,此物最相思。"
  }
}

Querying by condition

Documents whose content contains "一":

GET /poem/_search?q=content:一
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.74386525,
    "hits" : [
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "2",
        "_score" : 0.74386525,
        "_source" : {
          "title" : "登鹳雀楼",
          "author" : "王之涣",
          "content" : "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"
        }
      },
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "3",
        "_score" : 0.6489038,
        "_source" : {
          "title" : "九月九日忆山东兄弟",
          "author" : "王维",
          "content" : "独在异乡为异客,每逢佳节倍思亲。遥知兄弟登高处,遍插茱萸少一人。"
        }
      }
    ]
  }
}

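Whether this behaves as a partial match depends on the field type chosen when the index was defined: a text field is analyzed (split into tokens), whereas a keyword field is not analyzed and only matches the complete value.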
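
The difference can be pictured with a toy model. With the default standard analyzer, Chinese text in a text field is indexed as one token per Han character and punctuation is dropped, while a keyword field is indexed as a single unmodified term. The class FieldTypeDemo below is a hypothetical stand-in for that behavior, not Elasticsearch code, and it ignores everything else the analyzer does (lowercasing, non-Chinese words, and so on).

```java
import java.util.ArrayList;
import java.util.List;

final class FieldTypeDemo {
    /** Toy stand-in for the standard analyzer on Chinese text: one token per Han character. */
    static List<String> analyzeAsText(String value) {
        List<String> tokens = new ArrayList<>();
        value.codePoints()
             .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
             .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    /** A keyword field is indexed as a single, unmodified term. */
    static List<String> analyzeAsKeyword(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        String content = "欲穷千里目,更上一层楼。";
        // text field: the single character is one of the indexed terms
        System.out.println(analyzeAsText(content).contains("一"));
        // keyword field: only the complete string is an indexed term
        System.out.println(analyzeAsKeyword(content).contains("一"));
    }
}
```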

Returning only specific fields
GET /poem/poem/_search
{
  "query": {
    "match": {
      "content": "一"
    }
  },
  "_source": ["title", "content"]
}

A match query runs the query text through the analyzer before searching.

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.74386525,
    "hits" : [
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "2",
        "_score" : 0.74386525,
        "_source" : {
          "title" : "登鹳雀楼",
          "content" : "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"
        }
      },
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "3",
        "_score" : 0.6489038,
        "_source" : {
          "title" : "九月九日忆山东兄弟",
          "content" : "独在异乡为异客,每逢佳节倍思亲。遥知兄弟登高处,遍插茱萸少一人。"
        }
      }
    ]
  }
}

Sorting
GET /poem/poem/_search
{
  "query": {
    "match": {
      "content": "一"
    }
  },
  "_source": ["title", "content","date"],
  "sort": [
    {
      "date": {
        "order": "asc"
      }
    }
  ]
}
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "date" : "2020-09-10",
          "title" : "登鹳雀楼",
          "content" : "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"
        },
        "sort" : [
          1599696000000
        ]
      },
      {
        "_index" : "poem",
        "_type" : "poem",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "date" : "2020-09-11",
          "title" : "九月九日忆山东兄弟",
          "content" : "独在异乡为异客,每逢佳节倍思亲。遥知兄弟登高处,遍插茱萸少一人。"
        },
        "sort" : [
          1599782400000
        ]
      }
    ]
  }
}
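
The sort array in each hit holds the value the hit was ordered by. For a date field this is the date expressed as milliseconds since the Unix epoch in UTC, which is why 2020-09-10 shows up as 1599696000000. The small check below reproduces that conversion with java.time; the class and method names are made up for the example.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

final class SortValueDemo {
    /** Converts a yyyy-MM-dd date to epoch milliseconds at midnight UTC. */
    static long toSortValue(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toSortValue("2020-09-10")); // 1599696000000
        System.out.println(toSortValue("2020-09-11")); // 1599782400000
    }
}
```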

Pagination
GET /poem/poem/_search
{
  "query": {
    "match": {
      "content": "一"
    }
  },
  "_source": ["title", "content","date"],
  "sort": [
    {
      "date": {
        "order": "asc"
      }
    }
  ],
  "from": 0,
  "size": 1
}

from: zero-based offset of the first hit to return;

size: number of hits to return
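
Because from is an offset and not a page number, a one-based page number has to be converted before it is passed to Elasticsearch. A minimal sketch of that conversion, with a hypothetical helper class PageOffsets:

```java
final class PageOffsets {
    /** Returns the zero-based "from" offset for a one-based page number. */
    static int from(int pageNo, int pageSize) {
        if (pageNo < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNo and pageSize must be at least 1");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(pageNo - 1, pageSize);
    }

    public static void main(String[] args) {
        System.out.println(from(1, 10)); // first page starts at offset 0
        System.out.println(from(3, 10)); // third page starts at offset 20
    }
}
```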

Combining multiple conditions
GET /poem/poem/_search
{
  "query": {
    "bool": {
      "must": [
        {
         "match": {
           "author": "王维"
         }
        },
        {
          "match": {
            "date": "2020-09-11"
          }
        }
      ]
    }
  }
}
or
GET /poem/poem/_search
{
  "query": {
    "bool": {
      "should": [
        {
         "match": {
           "author": "王维"
         }
        },
        {
          "match": {
            "date": "2020-09-12"
          }
        }
      ]
    }
  }
}

must corresponds to AND in MySQL

must_not corresponds to NOT in MySQL

should corresponds to OR in MySQL
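
The analogy can be sketched with plain Java predicates over an in-memory list. This is only a model of the filtering logic, under two stated simplifications: it ignores scoring, and it compares with exact equality, whereas a match query on an analyzed text field is looser. The class BoolQueryDemo, its nested Poem class, and the predicate names are hypothetical; the titles, authors, and dates are the ones used in the examples above.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class BoolQueryDemo {
    static final class Poem {
        final String title;
        final String author;
        final String date;

        Poem(String title, String author, String date) {
            this.title = title;
            this.author = author;
            this.date = date;
        }
    }

    static final List<Poem> POEMS = List.of(
            new Poem("相思", "王维", "2020-09-10"),
            new Poem("登鹳雀楼", "王之涣", "2020-09-10"),
            new Poem("九月九日忆山东兄弟", "王维", "2020-09-11"));

    static final Predicate<Poem> BY_AUTHOR = p -> p.author.equals("王维");
    static final Predicate<Poem> ON_DATE = p -> p.date.equals("2020-09-11");

    /** Returns the titles of all poems accepted by the given condition. */
    static List<String> titles(Predicate<Poem> condition) {
        return POEMS.stream().filter(condition).map(p -> p.title).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(titles(BY_AUTHOR.and(ON_DATE))); // must + must: both conditions hold
        System.out.println(titles(BY_AUTHOR.or(ON_DATE)));  // should + should: at least one holds
        System.out.println(titles(BY_AUTHOR.negate()));     // must_not: the condition does not hold
    }
}
```

Note that a should clause only behaves like a strict OR when the bool query has no must or filter clause; next to a must clause it becomes optional and merely raises the score of documents that match it.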

A single match query can also carry several terms, separated by spaces:

GET /poem/poem/_search
{
  "query": {
    "match": {
      "content": "三 一"
    }
  }
}

Range queries
GET /poem/poem/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "王维"
          }
        }
      ],
      "filter": {
        "range": {
          "index": {
            "gte": 1,
            "lt": 3
          }
        }
      }
    }
  }
}

gt: greater than; gte: greater than or equal to; lt: less than; lte: less than or equal to

Highlighting
GET /poem/poem/_search
{
  "query": {
    "match": {
      "content": "一"
    }
  },
  "highlight": {
    "pre_tags": "<span style='color: red'>",
    "post_tags": "</span>",
    "fields": {
      "content": {}
    }
  }
}

Highlighting is requested with the highlight key.
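
For each hit, Elasticsearch returns the highlighted text under a separate highlight key, with every matched term wrapped in pre_tags and post_tags. The toy function below only reproduces the shape of such a fragment with a plain string replacement. It is not how Elasticsearch computes highlights (that works on analyzed tokens), and the class and method names are invented for this illustration.

```java
final class HighlightShape {
    /** Wraps every occurrence of term in text with the given tags. */
    static String wrap(String text, String term, String preTag, String postTag) {
        return text.replace(term, preTag + term + postTag);
    }

    public static void main(String[] args) {
        // Same tags as in the request above.
        System.out.println(wrap("欲穷千里目,更上一层楼。", "一",
                "<span style='color: red'>", "</span>"));
    }
}
```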

Updating

POST /poem/_doc/1/_update
{
	"doc": {
		"date": "2020-09-10"
	}
}
{
  "_index" : "poem",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title" : "相思",
    "author" : "王维",
    "content" : "红豆生南国,春来发几枝。愿君多采撷,此物最相思。",
    "date" : "2020-09-10"
  }
}

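Every modification increments _version. In 7.x the typeless form POST /poem/_update/1 is the preferred endpoint for the same operation.

Deleting

DELETE /poem2/_doc/1   (deletes the specified document)
or
DELETE /poem2   (deletes the whole index)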
{
  "_index" : "poem2",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 3,
  "result" : "not_found",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}
{
  "acknowledged" : true
}

List all indices with GET _cat/indices:

yellow open poem                     tWco8rUWQCS1YuMtkrCl4A 1 1  1 0 12.2kb 12.2kb
green  open .kibana_task_manager_1   1SxsVdvgSZOOQ3X9wKXJzQ 1 0  2 1 16.2kb 16.2kb
green  open .apm-agent-configuration UGRU7tD0Tj-bOnmo-nfZrw 1 0  0 0   283b   283b
green  open .kibana_1                r65DwNYWSha1v7AW5v62QQ 1 0 23 3 73.7kb 73.7kb

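poem2 is no longer listed, so the index has been deleted.

Integrating Elasticsearch with Spring Boot

Official documentation: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.6/java-rest-high-document-index.html

Add the Maven dependency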

<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>

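If the version is not pinned, the client version that Spring Boot pulls in may differ from the Elasticsearch version actually in use, so set it explicitly: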

<properties>
	<java.version>1.8</java.version>
	<elasticsearch.version>7.6.1</elasticsearch.version>
</properties>

Create the Elasticsearch configuration class

package com.youngj.es.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Elasticsearch client configuration
 * @author YoungJ
 */
@Configuration
public class ElasticSearchClientConfig {

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("127.0.0.1", 9200, "http")));
        return client;
    }
}

Trying out the API

Create the test class
package com.youngj.es.api;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.IOException;

@SpringBootTest
class EsApiApplicationTests {
	private static final String INDEX = "youngj_poem";
	@Autowired
	private RestHighLevelClient restHighLevelClient;

	@Test
	void contextLoads() {
	}
}

Creating an index
@Test
void testCreateIndex() throws IOException {
	CreateIndexRequest request = new CreateIndexRequest(INDEX);
	CreateIndexResponse indexResponse = restHighLevelClient.indices().create(request, RequestOptions.DEFAULT);
	System.out.println(indexResponse);
}
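
Checking whether an index exists
/**
 * Checks whether the index exists.
 * Requires: import org.elasticsearch.client.indices.GetIndexRequest;
 * @throws IOException
 */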
@Test
void getIndex() throws IOException {
	GetIndexRequest request = new GetIndexRequest(INDEX);
	boolean exists = restHighLevelClient.indices().exists(request, RequestOptions.DEFAULT);
	System.out.println(exists);
}
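
Deleting an index
/**
 * Deletes the index.
 * Requires: import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
 *           import org.elasticsearch.action.support.master.AcknowledgedResponse;
 * @throws IOException
 */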
@Test
void delIndex() throws IOException {
	DeleteIndexRequest request = new DeleteIndexRequest(INDEX);
	AcknowledgedResponse response = restHighLevelClient.indices().delete(request, RequestOptions.DEFAULT);
	System.out.println(response.isAcknowledged());
}
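
Creating a document
/**
 * Creates a document.
 * Requires org.elasticsearch.action.index.IndexRequest and IndexResponse,
 * org.elasticsearch.common.unit.TimeValue and org.elasticsearch.common.xcontent.XContentType.
 * JSON is presumably com.alibaba.fastjson.JSON (add the fastjson dependency), and Poem is a
 * plain class with title, author and content fields; neither is shown in this article.
 * @throws IOException
 */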
@Test
void addDoc() throws IOException {
	IndexRequest request = new IndexRequest(INDEX);
	request.id("1");
	request.timeout(TimeValue.timeValueSeconds(1));
	request.source(JSON.toJSONString(new Poem("行宫", "元稹", "寥落古行宫,宫花寂寞红。白头宫女在,闲坐说玄宗。")), XContentType.JSON);
	IndexResponse indexResponse = restHighLevelClient.index(request, RequestOptions.DEFAULT);
	System.out.println(indexResponse);
	System.out.println(indexResponse.status());
}
IndexResponse[index=youngj_poem,type=_doc,id=1,version=1,result=created,seqNo=0,primaryTerm=1,shards={"total":2,"successful":1,"failed":0}]
CREATED


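Creating documents in bulk
/**
 * Creates several documents with one bulk request.
 * Requires org.elasticsearch.action.bulk.BulkRequest and BulkResponse,
 * java.util.ArrayList and java.util.List.
 * @throws IOException
 */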
@Test
void addBatchDoc() throws IOException {
	BulkRequest request = new BulkRequest(INDEX);
	request.timeout(TimeValue.timeValueSeconds(10));

	List<Poem> list = new ArrayList<>();
	list.add(new Poem("行宫", "元稹", "寥落古行宫,宫花寂寞红。白头宫女在,闲坐说玄宗。"));
	list.add(new Poem("新嫁娘词", "王建", "三日入厨下,洗手作羹汤。未谙姑食性,先遣小姑尝。"));
	list.add(new Poem("相思", "王维", "红豆生南国,春来发几枝。愿君多采撷,此物最相思。"));
	list.add(new Poem("杂诗三首·其二", "王维", "君自故乡来,应知故乡事。来日绮窗前,寒梅著花未?"));
	list.add(new Poem("鹿柴", "王维", "空山不见人,但闻人语响。返景入深林,复照青苔上。"));
	list.add(new Poem("芙蓉楼送辛渐", "王昌龄", "寒雨连江夜入吴,平明送客楚山孤。洛阳亲友如相问,一片冰心在玉壶。"));
	list.add(new Poem("江雪", "柳宗元", "千山鸟飞绝,万径人踪灭。孤舟蓑笠翁,独钓寒江雪。"));

	for (int i = 0; i < list.size(); i++) {
		request.add(new IndexRequest(INDEX)
				.id((i+2)+"")
				.source(JSON.toJSONString(list.get(i)), XContentType.JSON)
		);
	}
	BulkResponse bulk = restHighLevelClient.bulk(request, RequestOptions.DEFAULT);
	System.out.println(bulk.status());
	System.out.println(bulk.hasFailures());
}
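
Checking whether a document exists
/**
 * Checks whether the document exists.
 * Requires: import org.elasticsearch.action.get.GetRequest;
 * @throws IOException
 */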
@Test
void chkDocExist() throws IOException {
	GetRequest request = new GetRequest(INDEX);
	request.id("1");
	boolean exists = restHighLevelClient.exists(request, RequestOptions.DEFAULT);
	System.out.println(exists);
}

Getting a document
/**
 * Fetches the document.
 * Requires: import org.elasticsearch.action.get.GetResponse;
 * @throws IOException
 */
@Test
void getDoc() throws IOException {
	GetRequest request = new GetRequest(INDEX);
	request.id("1");
	GetResponse documentFields = restHighLevelClient.get(request, RequestOptions.DEFAULT);
	System.out.println(JSON.toJSONString(documentFields.getSource()));
}

Result:

{"author":"元稹","title":"行宫","content":"寥落古行宫,宫花寂寞红。白头宫女在,闲坐说玄宗。"}
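
Updating a document
/**
 * Updates the document.
 * Requires org.elasticsearch.action.update.UpdateRequest and UpdateResponse.
 * @throws IOException
 */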
@Test
void updateDoc() throws IOException {
	UpdateRequest request = new UpdateRequest(INDEX, "1");
	request.timeout(TimeValue.timeValueSeconds(1));
	Poem poem = new Poem("登鹳雀楼", "王之涣", "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。");
	request.doc(JSON.toJSONString(poem), XContentType.JSON);
	UpdateResponse updateResponse = restHighLevelClient.update(request, RequestOptions.DEFAULT);
	System.out.println(JSON.toJSONString(updateResponse.status()));
	System.out.println(updateResponse.getGetResult());
}
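
Deleting a document
/**
 * Deletes the document.
 * Requires org.elasticsearch.action.delete.DeleteRequest and DeleteResponse.
 * @throws IOException
 */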
@Test
void delDoc() throws IOException {
	DeleteRequest request = new DeleteRequest(INDEX, "2");
	DeleteResponse deleteResponse = restHighLevelClient.delete(request, RequestOptions.DEFAULT);
	System.out.println(deleteResponse.status());
}
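
Searching
/**
 * Runs a match query against the content field.
 * Requires org.elasticsearch.action.search.SearchRequest and SearchResponse,
 * org.elasticsearch.index.query.MatchQueryBuilder and QueryBuilders,
 * and org.elasticsearch.search.builder.SearchSourceBuilder.
 * @throws IOException
 */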
@Test
void search() throws IOException {
	SearchRequest request = new SearchRequest(INDEX);
	SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
	MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("content", "三");
	SearchSourceBuilder query = sourceBuilder.query(matchQueryBuilder);
	request.source(query);
	SearchResponse search = restHighLevelClient.search(request, RequestOptions.DEFAULT);
	System.out.println(search.status());
	System.out.println(JSON.toJSONString(search));
}

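QueryBuilders provides the factory methods for building query conditions.

Crawling web pages with Jsoup and writing the data to Elasticsearch

Add the Maven dependency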

<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.10.2</version>
</dependency>

Create an HTML parsing utility class

public class HtmlParseUtil {
}

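Analyzing the page

Inspect the page and locate the main body of the Tang poetry listing, then find the corresponding HTML tags.

The element with class="sons" contains seven divs, one per category: five-character quatrains, seven-character quatrains, and so on.

Expanding the first div (five-character quatrains) shows that each span holds a poem title wrapped in an a tag, which links to the page with the poem text.

Approach:

Loop over every div with class="typecont", take the link from the a tag under each span, and request that link to get the poem text.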

public void parseHtml() throws Exception {
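    // Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
    // org.jsoup.select.Elements and java.net.URL
    // Outermost URL: the Tang poetry listing page
    String wrapUrl = "https://www.gushiwen.org/gushi/tangshi.aspx";
    // Jsoup.parse fetches the page and parses the HTML into a Document, which can be queried much like the DOM in JavaScript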
    Document document = Jsoup.parse(new URL(wrapUrl), 50000);
    Elements elements = document.getElementsByClass("typecont");
    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        Elements spans = element.getElementsByTag("span");
        for (int j = 0; j < spans.size(); j++) {
            Element span = spans.get(j);
            String src = span.getElementsByTag("a").eq(0).attr("href");
            String title = span.getElementsByTag("a").eq(0).text();

            System.out.println("title: " + title + ", src: " + src);
        }
    }
}

public static void main(String[] args) throws Exception {
    new HtmlParseUtil().parseHtml();
}

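With the links in hand, open one and analyze the HTML of the poem page.

Inspecting the elements shows that, inside the element with class="cont":

the h1 contains the title,

the second a tag inside the p tag contains the author,

and the element whose id is "contson" followed by the ID taken from the URL contains the poem text.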

public void parseHtml() throws Exception {
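    // Outermost URL: the Tang poetry listing page
    String wrapUrl = "https://www.gushiwen.org/gushi/tangshi.aspx";
    // Jsoup.parse fetches the page and parses the HTML into a Document, which can be queried much like the DOM in JavaScript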
    Document document = Jsoup.parse(new URL(wrapUrl), 50000);
    Elements elements = document.getElementsByClass("typecont");
    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        Elements spans = element.getElementsByTag("span");
        for (int j = 0; j < spans.size(); j++) {
            Element span = spans.get(j);
            String src = span.getElementsByTag("a").eq(0).attr("href");

            // Request each poem URL to get the poem page
            Document sonDoc = Jsoup.parse(new URL(src), 50000);
            // Extract the ID from the URL; it is needed below to locate the poem text
            String id = src.substring(src.indexOf("_")+1, src.indexOf(".aspx"));
            Element body = sonDoc.getElementById("sonsyuanwen");
            Element cont = body.getElementsByClass("cont").get(0);
            String title = cont.getElementsByTag("h1").eq(0).text();
            String author = cont.getElementsByTag("p").get(0).getElementsByTag("a").eq(1).text();
            String content = cont.getElementById("contson" + id).text();

            System.out.println("title: " + title + ", author: " + author + ", content: " + content);
        }
    }
}
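
The ID extraction relies on the link having the shape "..._<id>.aspx". The sketch below isolates that step and adds a guard for links of any other shape, where the bare substring call would throw a StringIndexOutOfBoundsException. The class PoemUrlId is a hypothetical helper, and the URLs in main are placeholders rather than real addresses from the site.

```java
final class PoemUrlId {
    /**
     * Returns the text between the first '_' and ".aspx",
     * or null when the URL does not have that shape.
     */
    static String extractId(String src) {
        int start = src.indexOf('_');
        int end = src.indexOf(".aspx");
        if (start < 0 || end < 0 || end <= start) {
            return null;
        }
        return src.substring(start + 1, end);
    }

    public static void main(String[] args) {
        // Placeholder URLs, used only to show both outcomes.
        System.out.println(extractId("https://example.org/shiwenv_abc123.aspx"));
        System.out.println(extractId("https://example.org/about"));
    }
}
```

In the crawler, links for which the helper returns null could simply be skipped, so one unexpected link does not abort the whole run.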

Writing the crawled data to Elasticsearch

public List<Poem> parseHtml() throws Exception {
    String wrapUrl = "https://www.gushiwen.org/gushi/tangshi.aspx";
    Document document = Jsoup.parse(new URL(wrapUrl), 50000);
    Elements elements = document.getElementsByClass("typecont");
    List<Poem> poems = new ArrayList<>();
    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        Elements spans = element.getElementsByTag("span");
        for (int j = 0; j < spans.size(); j++) {
            Element span = spans.get(j);
            String src = span.getElementsByTag("a").eq(0).attr("href");
            Document sonDoc = Jsoup.parse(new URL(src), 50000);
            String id = src.substring(src.indexOf("_")+1, src.indexOf(".aspx"));
            Element body = sonDoc.getElementById("sonsyuanwen");
            Element cont = body.getElementsByClass("cont").get(0);
            String title = cont.getElementsByTag("h1").eq(0).text();
            String author = cont.getElementsByTag("p").get(0).getElementsByTag("a").eq(1).text();
            String content = cont.getElementById("contson" + id).text();

            poems.add(new Poem(title, author, content));
        }
    }
    return poems;
}

Use the bulk request shown earlier to write the data to Elasticsearch

@Test
void insertHtmlParser() throws Exception {
	BulkRequest request = new BulkRequest(INDEX);
	request.timeout(TimeValue.timeValueSeconds(100));
	List<Poem> poems = new HtmlParseUtil().parseHtml();
	for (Poem poem : poems) {
		request.add(new IndexRequest(INDEX)
				.source(JSON.toJSONString(poem), XContentType.JSON)
		);
	}
	BulkResponse bulk = restHighLevelClient.bulk(request, RequestOptions.DEFAULT);
	System.out.println(bulk.hasFailures());
}


Building the search front end with Vue.js

Include the JavaScript files

  • axios.min.js (HTTP requests)

  • vue.min.js

The page

<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">

<head>
    <meta charset="utf-8"/>
    <title>Classical Poetry Search</title>
    <link rel="stylesheet" th:href="@{/static/css/style.css}"/>

</head>

<body class="pg">
<div class="page" id="app">
    <div id="mallPage" class=" mallist tmall- page-not-market ">

        <div id="header" class=" header-list-app">
            <div class="headerLayout">
                <div class="headerCon ">
                    <div class="header-extra">

                        <!-- Search -->
                        <div id="mallSearch" class="mall-search">
                            <form name="searchTop" class="mallSearch-form clearfix">
                                <fieldset>
                                    <legend>Search</legend>
                                    <div class="mallSearch-input clearfix">
                                        <div class="s-combobox" id="s-combobox-685">
                                            <div class="s-combobox-input-wrap">
                                                <input v-model="keyword" type="text" autocomplete="off" value="dd"
                                                       id="mq"
                                                       class="s-combobox-input" aria-haspopup="true">
                                            </div>
                                        </div>
                                        <button @click.prevent="searchKey" type="submit" id="searchbtn">Search</button>
                                    </div>
                                </fieldset>
                            </form>
                        </div>
                    </div>
                </div>
            </div>
        </div>

        <div id="content">
            <div class="main">

                <div class="view">
                    <div class="product" v-for="result in results">
                        <div class="product-iWrap">
                            <div style="text-align: center">
                                {{result.title}}
                            </div>
                            <div style="text-align: center">
                                {{result.author}}
                            </div>
                            <div style="text-align: left" v-html="result.content">
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>

<script th:src="@{/static/js/vue.min.js}"></script>
<script th:src="@{/static/js/axios.min.js}"></script>
<script>
    new Vue({
        el: '#app',
        data: {
            keyword: '',
            results: []
        },
        methods: {
            searchKey() {
                var keyword = this.keyword;
                console.log(keyword)
                axios.get("/search/" + encodeURIComponent(keyword) + "/1/100").then(res => {
                    console.log(res.data)
                    this.results = res.data;
                });
            }
        }
    });
</script>

</body>
</html>

The back end

IndexController

package com.youngj.es.controller;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;

/**
 * description:
 *
 * @author YoungJ
 * @date 2020-09-12 15:15
 */
@Controller
public class IndexController {

    @GetMapping({"/", "/index"})
    public String index() {
        return "index";
    }
}

SearchController

package com.youngj.es.controller;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * description:
 *
 * @author YoungJ
 * @date 2020-09-12 15:46
 */
@RestController
public class SearchController {

    // Must be the index the poems were written to (the test code above uses "youngj_poem")
    private static final String INDEX = "poem";

    @Autowired
    private RestHighLevelClient restHighLevelClient;

    @GetMapping("/search/{keyword}/{pageNo}/{pageSize}")
    public List<Map<String, Object>> search(@PathVariable("keyword") String keyword,
                         @PathVariable("pageNo") int pageNo,
                         @PathVariable("pageSize") int pageSize) throws Exception {
        SearchRequest searchRequest = new SearchRequest(INDEX);

        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();

        HighlightBuilder highlightBuilder = new HighlightBuilder()
                .requireFieldMatch(false)
                .field("content")
                .preTags("<span style='color: red'>")
                .postTags("</span>");

        sourceBuilder.highlighter(highlightBuilder);

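        // Pagination: from is a zero-based offset, so convert the one-based page number
        sourceBuilder.from((pageNo - 1) * pageSize);
        sourceBuilder.size(pageSize);

        // A term query does not analyze the keyword, so it only matches a single indexed token
        // (one character with the default analyzer); use QueryBuilders.matchQuery for analyzed search
        TermQueryBuilder termQueryBuilder = new TermQueryBuilder("content", keyword);
        sourceBuilder.query(termQueryBuilder);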
        sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

        searchRequest.source(sourceBuilder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        List<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            Map<String, Object> sourceAsMap = hit.getSourceAsMap();
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField content = highlightFields.get("content");
            if (content != null) {
                Text[] fragments = content.getFragments();
                String newCon = "";
                for (Text text : fragments) {
                    newCon += text;
                }
                sourceAsMap.put("content", newCon);
            }
            list.add(sourceAsMap);
        }
        return list;
    }
}

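Final result

Searching for poems that contain "白" lists the matching poems with the keyword highlighted in red.

Searching for poems that contain "夜" works the same way.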

