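1. Goal of this article
When we search for a product on an e-commerce site, we get back many related items. For example, searching for "Java" on Taobao and on JD gives the following results:
- Taobao
- JD
Both sites return matching products, and JD even highlights the first matched keyword in each product name.
Next, we will scrape product data from JD, post it to our local ES instance, and then search it.
By the end of this article we will achieve the following result: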
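2. Preparation
2.1 Creating the project
In the empty project created in the previous article, add a submodule, create the packages it needs, and delete the files that are not used here.
To keep things short I skip the dao and service-implementation layers here; in real development, follow the proper layering strictly.
2.2 The pom file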
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.goodwin</groupId>
<artifactId>es-jingdong</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<name>es-jingdong</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<spring-boot.version>2.3.7.RELEASE</spring-boot.version>
<elasticsearch.version>7.6.2</elasticsearch.version>
</properties>
<dependencies>
<!-- jsoup: parses HTML pages -->
<!-- for scraping video you will need to look into other tools yourself -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.3</version>
</dependency>
<!--fastjson-->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.75</version>
</dependency>
<!--elasticsearch-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<!--web-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- thymeleaf -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
<!-- devtools: hot reload -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-devtools</artifactId>
<scope>runtime</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>
<!--lombok-->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<!--test-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
<exclusions>
<exclusion>
<groupId>org.junit.vintage</groupId>
<artifactId>junit-vintage-engine</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-dependencies</artifactId>
<version>${spring-boot.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
</project>
3. Implementation
Edit the application.properties configuration file
# Application name
spring.application.name=es-jingdong
# HTTP port of the web application
server.port=8888
spring.thymeleaf.cache=false
Access test
package com.goodwin.controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
/**
* @author goodwin
*/
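@RestController
public class IndexController {
// With @RestController the literal string "index" is written to the response body.
// To render a Thymeleaf template named index.html instead, use @Controller.
@GetMapping({"/", "/index", "/main"})
public String index() {
return "index";
}
}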
Injecting RestHighLevelClient
As the previous article showed, calling the ES API from Java requires a RestHighLevelClient object.
package com.goodwin.config;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
* @author goodwin
*/
@Configuration
public class EsConfig {
@Bean
public RestHighLevelClient restHighLevelClient(){
return new RestHighLevelClient(
RestClient.builder(
new HttpHost("127.0.0.1", 9200, "http")
)
);
}
}
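Creating the product entity class
Based on JD's search results, we define a product entity class. It records only the product name, price, image, shop, and the source of the search result.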
package com.goodwin.pojo;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
* @author goodwin
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Product {
private String name;
private String price;
private String img;
private String shop;
private String source;
}
Fetching the search results
- JD's search URL
The page contains the information that our entity class needs.
Scraping the data in code
package com.goodwin.util;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
/**
* @author goodwin
*/
public class HtmlParseUtil {
public static void main(String[] args) throws IOException {
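// URL to request
String url = "https://search.jd.com/Search?keyword=Java";
Document document = Jsoup.parse(new URL(url), 30000);
Element element = document.getElementById("J_goodsList");
// getElementById returns null when the page has no such element
if (element == null) {
System.out.println("J_goodsList not found; the page layout may have changed");
return;
}
Elements lis = element.getElementsByTag("li");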
for (Element li : lis) {
String img = li.getElementsByTag("img").eq(0).attr("src");
String name = li.getElementsByClass("p-name").eq(0).text();
String price = li.getElementsByClass("p-price").eq(0).text();
String shop = li.getElementsByClass("hd-shopname").eq(0).text();
System.out.println(img);
System.out.println(name);
System.out.println(price);
System.out.println(shop);
System.out.println("----------------------------------------");
}
}
}
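Output
The data comes back, but part of it (img) is missing.
Let's print the whole li:
System.out.println(li);
The image URL we want is stored in the data-lazy-img attribute, and the img tag has no src attribute set. Sites with a large number of images usually lazy-load them, so we adjust the code: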
String img = li.getElementsByTag("img").eq(0).attr("data-lazy-img");
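Since some img tags may carry src and others only data-lazy-img, it can help to fall back from one to the other. Below is a minimal sketch of that idea using only the JDK. The class name ImageUrlResolver is hypothetical, and the handling of protocol-relative values ("//host/path") is an assumption about what the attribute may contain, not something the page guarantees.

```java
/** Hypothetical helper: choose the best available image URL for a product. */
final class ImageUrlResolver {
    private ImageUrlResolver() {}

    /**
     * Prefers the lazy-load attribute value and falls back to src.
     * Protocol-relative values ("//host/path") get an explicit https: scheme
     * (assumption: the image host serves https).
     */
    static String resolve(String lazyImg, String src) {
        String candidate = (lazyImg != null && !lazyImg.trim().isEmpty()) ? lazyImg : src;
        if (candidate == null) {
            return "";
        }
        candidate = candidate.trim();
        if (candidate.startsWith("//")) {
            return "https:" + candidate;
        }
        return candidate;
    }
}
```

With jsoup, where attr returns an empty string for a missing attribute, this could be called as ImageUrlResolver.resolve(imgs.attr("data-lazy-img"), imgs.attr("src")).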
We wrap the code above in a utility method, parseJdHtml:
package com.goodwin.util;
import com.goodwin.pojo.Product;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
/**
* @author goodwin
*/
public class HtmlParseUtil {
public static void main(String[] args) throws IOException {
// "洗衣粉" means "laundry detergent"
List<Product> products = parseJdHtml("洗衣粉");
for (Product product : products) {
System.out.println(product);
}
}
public static List<Product> parseJdHtml(String keyword) throws IOException {
// URL-encode the keyword so that non-ASCII input (such as Chinese) is valid in the query string
String url = "https://search.jd.com/Search?keyword=" + URLEncoder.encode(keyword, "UTF-8");
Document document = Jsoup.parse(new URL(url), 30000);
Element goodsList = document.getElementById("J_goodsList");
// assert is disabled by default at runtime, so check for null explicitly
if (goodsList == null) {
return Collections.emptyList();
}
Elements lis = goodsList.getElementsByTag("li");
List<Product> products = new ArrayList<>();
for (Element li : lis) {
String img = li.getElementsByTag("img").eq(0).attr("data-lazy-img");
String name = li.getElementsByClass("p-name").eq(0).text();
String price = li.getElementsByClass("p-price").eq(0).text();
String shop = li.getElementsByClass("hd-shopname").eq(0).text();
Product product = new Product(name,price,img,shop,"京东");
products.add(product);
}
return products;
}
}
Output
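A keyword such as 洗衣粉 ends up inside a URL query string, so it is worth seeing what the percent-encoded form looks like. The sketch below uses only the JDK; the class name SearchUrlBuilder is hypothetical, and the base URL is the one used in this article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: build the search URL with a percent-encoded keyword. */
final class SearchUrlBuilder {
    private static final String BASE = "https://search.jd.com/Search?keyword=";

    private SearchUrlBuilder() {}

    static String build(String keyword) {
        try {
            // URLEncoder produces form encoding: UTF-8 bytes as %XX, spaces as '+'
            return BASE + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }
}
```

For example, SearchUrlBuilder.build("洗衣粉") yields a URL ending in keyword=%E6%B4%97%E8%A1%A3%E7%B2%89, while a plain ASCII keyword like Java is left unchanged.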
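Storing the search results in ES
First we create the ES index product.
I create it in Kibana here; you can also do it from Java with the API covered in the previous article.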
PUT /product
{
"mappings":{
"properties":{
"name":{
"type":"text",
"store":true,
"index":true,
"analyzer":"ik_smart"
},
"price":{
"type":"text",
"store":true
},
"img":{
"type":"text",
"store":true
},
"shop":{
"type":"text",
"store":true
},
"source":{
"type":"text",
"store":true,
"index":true,
"analyzer":"ik_smart"
}
}
}
}
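As earlier articles showed, if no document id is given when posting, ES generates an opaque id of its own. Here I assign ids from a static counter field instead.
public final class EsConstant {
// Next document id to assign (mutable counter, incremented on each insert)
public static int NEXT_ID = 1;
// Page size used for pagination
public static int PAGE_SIZE = 10;
}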
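A plain static int incremented with ++ is not safe when two requests index at the same time. If that matters to you, one possible alternative is a small generator built on AtomicInteger. This is only a sketch: the class name DocIdGenerator is hypothetical and is not part of the project above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical thread-safe replacement for the static NEXT_ID counter. */
final class DocIdGenerator {
    private final AtomicInteger next;

    DocIdGenerator(int firstId) {
        this.next = new AtomicInteger(firstId);
    }

    /** Returns the next id as a string; safe to call from concurrent requests. */
    String nextId() {
        return String.valueOf(next.getAndIncrement());
    }
}
```

Like the static field, this counter starts over when the application restarts, so documents indexed after a restart would overwrite the ones that already use ids 1, 2, 3 and so on.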
- controller
@RestController
public class EsController {
@Autowired
private EsService esService;
@GetMapping("/es/parse/{keyword}")
public boolean parse(@PathVariable String keyword) throws IOException {
return esService.parseContent(keyword);
}
}
- service
@Service
public class EsService {
@Autowired
@Qualifier("restHighLevelClient")
private RestHighLevelClient client;
public boolean parseContent(String keyword) throws IOException {
List<Product> products = HtmlParseUtil.parseJdHtml(keyword);
BulkRequest bulkRequest = new BulkRequest();
for (Product product : products) {
bulkRequest.add(
new IndexRequest("product")
.id("" + (EsConstant.NEXT_ID++))
.source(JSON.toJSONString(product), XContentType.JSON)
);
}
BulkResponse response = client.bulk(bulkRequest, RequestOptions.DEFAULT);
return !response.hasFailures();
}
}
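Request http://localhost:8888/es/parse/java from the browser to add data.
elasticsearch-head shows that the ids auto-increment and that 30 documents were inserted correctly. You can add more data for the searches that follow.
Search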
@GetMapping("/es/search/{keyword}/{page}")
public List<Product> search(@PathVariable String keyword, @PathVariable Integer page) throws IOException {
return esService.search(keyword,page);
}
The previous article already covered how to use the ES search API from Java, so I won't repeat it here.
public List<Product> search(String keyword, Integer page) throws IOException {
if(page <= 0 ){
page = 1;
}
SearchRequest request = new SearchRequest("product");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Full-text match query on the name field
MatchQueryBuilder matchQueryBuilder = new MatchQueryBuilder("name",keyword);
searchSourceBuilder.query(matchQueryBuilder);
// Pagination; EsConstant.PAGE_SIZE = 10
searchSourceBuilder.from((page - 1) * EsConstant.PAGE_SIZE);
searchSourceBuilder.size(EsConstant.PAGE_SIZE);
// Timeout
searchSourceBuilder.timeout(new TimeValue(10, TimeUnit.SECONDS));
// Highlighting
HighlightBuilder highlightBuilder = new HighlightBuilder();
highlightBuilder.field("name");
highlightBuilder.preTags("<span style='color:red'>");
highlightBuilder.postTags("</span>");
searchSourceBuilder.highlighter(highlightBuilder);
request.source(searchSourceBuilder);
SearchResponse response = client.search(request,RequestOptions.DEFAULT);
// Parse the results
SearchHits hits = response.getHits();
List<Product> products = new ArrayList<>();
for (SearchHit hit : hits) {
Map<String, Object> map = hit.getSourceAsMap();
Product product = new Product();
product.setName((String) map.get("name"));
product.setPrice((String) map.get("price"));
product.setImg((String) map.get("img"));
product.setShop((String) map.get("shop"));
product.setSource((String) map.get("source"));
products.add(product);
}
return products;
}
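The pagination above maps a 1-based page number to the from offset that ES expects. Here is that arithmetic in isolation, as a small JDK-only sketch; the class name Paging is hypothetical.

```java
/** Hypothetical helper: convert a 1-based page number into an ES "from" offset. */
final class Paging {
    private Paging() {}

    static int fromOffset(int page, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Treat page 0 and negative pages as the first page, as the search method does
        int safePage = Math.max(page, 1);
        return (safePage - 1) * pageSize;
    }
}
```

With a page size of 10, page 1 starts at offset 0 and page 3 at offset 20. Keep in mind that by default ES rejects requests where from + size exceeds 10,000 (the index.max_result_window setting), so very deep pages need a different approach.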
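At this point you can build a simple front-end page yourself, or just request http://localhost:8888/es/search/{keyword}/{page} to check that the data comes back correctly.
The screenshot above shows that the search results are returned correctly and that pagination works. The code configures a HighlightBuilder, yet nothing is highlighted. As earlier articles showed, the data we return comes from each hit's _source, while the highlighted text is not in _source but in the hit's separate highlight section. All we need to do is replace the field taken from _source with the corresponding field from highlight before returning it.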
// Get the highlighted fields
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
HighlightField highlightName = highlightFields.get("name");
// Replace the name with the highlighted version when there is one
if (highlightName != null){
Text[] fragments = highlightName.fragments();
StringBuilder newName = new StringBuilder();
for (Text text : fragments) {
newName.append(text);
}
product.setName(newName.toString());
}else {
product.setName((String) map.get("name"));
}
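The replacement logic boils down to: join the highlight fragments if there are any, otherwise keep the name from _source. The sketch below shows that rule with plain strings so it can be tried without an ES cluster. The class name HighlightMerger is hypothetical, and the fragments are assumed to have been converted to strings already.

```java
import java.util.List;

/** Hypothetical helper: prefer highlighted fragments over the stored field value. */
final class HighlightMerger {
    private HighlightMerger() {}

    static String displayName(String sourceName, List<String> fragments) {
        // No highlight for this hit: fall back to the value from _source
        if (fragments == null || fragments.isEmpty()) {
            return sourceName;
        }
        // Concatenate the fragments in the order ES returned them
        StringBuilder joined = new StringBuilder();
        for (String fragment : fragments) {
            joined.append(fragment);
        }
        return joined.toString();
    }
}
```

The joined value contains the span tags configured on the HighlightBuilder, so the page that displays it has to output it as HTML rather than as escaped text.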
In the end we get the following result, with simple paginated search as well.
Complete search code
public List<Product> search(String keyword, Integer page) throws IOException {
if(page <= 0 ){
page = 1;
}
SearchRequest request = new SearchRequest("product");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Full-text match query on the name field
MatchQueryBuilder matchQueryBuilder = new MatchQueryBuilder("name",keyword);
searchSourceBuilder.query(matchQueryBuilder);
// Pagination; EsConstant.PAGE_SIZE = 10
searchSourceBuilder.from((page - 1) * EsConstant.PAGE_SIZE);
searchSourceBuilder.size(EsConstant.PAGE_SIZE);
// Timeout
searchSourceBuilder.timeout(new TimeValue(10, TimeUnit.SECONDS));
// Highlighting
HighlightBuilder highlightBuilder = new HighlightBuilder();
highlightBuilder.field("name");
highlightBuilder.preTags("<span style='color:red'>");
highlightBuilder.postTags("</span>");
searchSourceBuilder.highlighter(highlightBuilder);
request.source(searchSourceBuilder);
SearchResponse response = client.search(request,RequestOptions.DEFAULT);
// Parse the results
SearchHits hits = response.getHits();
List<Product> products = new ArrayList<>();
for (SearchHit hit : hits) {
Map<String, Object> map = hit.getSourceAsMap();
Product product = new Product();
// Get the highlighted fields
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
HighlightField highlightName = highlightFields.get("name");
// Replace the name with the highlighted version when there is one
if (highlightName != null){
Text[] fragments = highlightName.fragments();
StringBuilder newName = new StringBuilder();
for (Text text : fragments) {
newName.append(text);
}
product.setName(newName.toString());
}else {
product.setName((String) map.get("name"));
}
product.setPrice((String) map.get("price"));
product.setImg((String) map.get("img"));
product.setShop((String) map.get("shop"));
product.setSource((String) map.get("source"));
products.add(product);
}
return products;
}
You can set the query conditions to suit your own needs. That's it for this article, (¦3[▓▓] good night.