Scraping JD page elements with Jsoup and running a simple search with Elasticsearch

Learning source: 狂神说Java on Bilibili (click through to the channel)

Environment to install (download link provided below): the IK analyzer, Elasticsearch, Kibana, and ElasticSearch Head (which can be installed as a Chrome extension)

Link: https://pan.baidu.com/s/1WO676lT1pAihEYofESgPHw
Extraction code: bv7n

狂神 uses Vue for the front end; I use Thymeleaf.

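Features:

  • Fetch elements from a JD search page and parse them into my own site
  • Store the product information found (image, price, title) in the jd_goods index in Elasticsearch (first 30 items)
  • Search and display the results (Chinese keywords are not supported yet, and the first search for a keyword may return nothing, in which case search again; this still needs work)
  • Highlight the keyword in the results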

For the full implementation, see the repository on Gitee (码云).

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.2.5.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.hjm</groupId>
    <artifactId>springboot-es-jd</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>springboot-es-jd</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
        <elasticsearch.version>7.6.1</elasticsearch.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.68</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

Parsing the JD search page with Jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

@Component
public class HtmlParseUtil {
    public List<Content> parseJD(String keywords) throws Exception {
        // Request URL, e.g. https://search.jd.com/Search?keyword=java
        String url = "https://search.jd.com/Search?keyword=" + keywords + "&enc=utf-8";
        // Fetch and parse the page (30 s timeout); the Document corresponds to the
        // document object you would see in browser JavaScript
        Document document = Jsoup.parse(new URL(url), 30000);
        // Container element of the product list
        Element element = document.getElementById("J_goodsList");

        List<Content> contents = new ArrayList<>();
        if (element == null) {
            // The list container is missing (for example, a different page was returned),
            // so there is nothing to parse
            return contents;
        }
        Elements elements = element.getElementsByTag("li");
        for (Element el : elements) {
            // Image URL, price and title of one product
            String img = el.getElementsByTag("img").eq(0).attr("src");
            String price = el.getElementsByClass("p-price").eq(0).text();
            String title = el.getElementsByClass("p-name").eq(0).text();

            Content content = new Content();
            content.setImg(img);
            content.setTitle(title);
            content.setPrice(price);
            contents.add(content);
        }
        return contents;
    }
}
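
The search URL above is built by concatenating the keyword as-is, which is fragile when the keyword contains spaces or Chinese characters. Below is a minimal sketch of percent-encoding the keyword first, using only the JDK. The class name `JdSearchUrl` and the method `build` are hypothetical and not part of the original project, and whether this alone makes Chinese keywords work end to end has not been verified, since the search side matters too.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original project.
class JdSearchUrl {
    // Builds the JD search URL with a UTF-8 percent-encoded keyword.
    static String build(String keyword) {
        try {
            // URLEncoder applies form encoding: spaces become '+',
            // and non-ASCII characters become %XX byte sequences.
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return "https://search.jd.com/Search?keyword=" + encoded + "&enc=utf-8";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

In `parseJD`, the URL line could then read `String url = JdSearchUrl.build(keywords);`.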

Storing the parsed data in ES

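    // Note: GetIndexRequest and CreateIndexRequest come from org.elasticsearch.client.indices,
    // and WriteRequest from org.elasticsearch.action.support
    @Autowired
    private RestHighLevelClient restHighLevelClient;

    /**
     * Parses the JD search results and stores them in an Elasticsearch index.
     *
     * @param keywords search keyword sent to JD
     * @return true if every document was indexed without failure
     * @throws Exception if fetching the page or talking to Elasticsearch fails
     */
    public boolean parseContent(String keywords) throws Exception {
        List<Content> contents = new HtmlParseUtil().parseJD(keywords);
        if (contents.isEmpty()) {
            // An empty bulk request is rejected by the client, so stop here
            return false;
        }
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("2m");
        // Wait until the documents are visible to search before returning; otherwise a search
        // issued right after indexing may find nothing until the next index refresh
        bulkRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);

        // Check whether the jd_goods index exists
        GetIndexRequest request = new GetIndexRequest("jd_goods");
        boolean exists = restHighLevelClient.indices().exists(request, RequestOptions.DEFAULT);
        // If it does not exist, create it
        if (!exists) {
            CreateIndexRequest createIndexRequest = new CreateIndexRequest("jd_goods");
            restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
        }
        // Add every parsed item to the bulk request for jd_goods
        for (Content content : contents) {
            bulkRequest.add(new IndexRequest("jd_goods")
                    .source(JSON.toJSONString(content), XContentType.JSON));
        }
        BulkResponse bulk = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        return !bulk.hasFailures();
    }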

Implementing keyword highlighting

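    /**
     * Searches jd_goods and highlights the keyword in the title.
     *
     * @param keyWord keyword to look for in the title field
     * @return one map per hit, with the title replaced by its highlighted version when available
     * @throws IOException if the request to Elasticsearch fails
     */
    public List<Map<String, Object>> searchPageHighLight(String keyWord) throws IOException {
        // Build the search request
        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();

        // Paging: from is a zero-based offset, so 0 starts at the first hit
        sourceBuilder.from(0);
        sourceBuilder.size(30);

        // Exact match: a term query is not analyzed, so the keyword must equal one indexed token.
        // With the default standard analyzer, Chinese text is indexed one character at a time,
        // so a multi-character Chinese keyword matches nothing here.
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyWord);
        sourceBuilder.query(termQueryBuilder);
        sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

        // Highlighting
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        highlightBuilder.field("title");
        // false means a field is highlighted even if the query did not target that field;
        // title is both the queried and the highlighted field here, so the effect is small
        highlightBuilder.requireFieldMatch(false);
        highlightBuilder.preTags("<span style='color:red'>");
        highlightBuilder.postTags("</span>");
        sourceBuilder.highlighter(highlightBuilder);

        searchRequest.source(sourceBuilder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        // Parse the result
        List<Map<String, Object>> list = new ArrayList<>();
        for (SearchHit hit : searchResponse.getHits().getHits()) {
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            HighlightField title = highlightFields.get("title");
            Map<String, Object> sourceAsMap = hit.getSourceAsMap();
            if (title != null) {
                // Join the highlighted fragments and use them in place of the plain title
                StringBuilder highlightedTitle = new StringBuilder();
                for (Text text : title.fragments()) {
                    highlightedTitle.append(text.string());
                }
                sourceAsMap.put("title", highlightedTitle.toString());
            }
            list.add(sourceAsMap);
        }
        return list;
    }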
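
The `from` setting of the search request is a zero-based document offset, not a page number, so a value of 1 skips the first hit. A minimal sketch of converting a 1-based page number into that offset follows. The class `PageOffset` and its method are hypothetical helpers for illustration and do not appear in the original project.

```java
// Hypothetical helper, not part of the original project.
class PageOffset {
    // Converts a 1-based page number into the zero-based offset expected by `from`.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must both be at least 1");
        }
        return (page - 1) * size;
    }
}
```

With this helper, `sourceBuilder.from(PageOffset.fromOffset(1, 30))` together with `sourceBuilder.size(30)` requests the first 30 hits, and page 2 starts at offset 30.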

Controller layer (uses fastjson to convert the product data into Content objects)

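import com.alibaba.fastjson.JSON;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Content and ContentService are project classes; import them from their own packages if needed
@Controller
public class ContentController {
    @Autowired
    private ContentService contentService;

    @PostMapping("/parse")
    public String parse(@RequestParam(value = "keyword", required = false) String keyword) throws Exception {
        // Crawl JD for the keyword and index the results, then redirect to the search page
        contentService.parseContent(keyword);
        return "redirect:search/" + keyword;
    }

    @GetMapping("/search/{keywords}")
    public String search(@PathVariable(value = "keywords") String keywords,
                         Model model) throws IOException {
        List<Map<String, Object>> maps = contentService.searchPageHighLight(keywords);
        List<Content> list = new ArrayList<>();
        for (Map<String, Object> map : maps) {
            // Round-trip through JSON with fastjson to turn each hit into a Content object
            list.add(JSON.parseObject(JSON.toJSONString(map), Content.class));
        }
        // Exposed to the Thymeleaf template "search" under the name "value"
        model.addAttribute("value", list);
        return "search";
    }
}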
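
The parser, the service and the controller all rely on a Content class that this post does not show. Below is a minimal sketch of what it might look like, assuming three String fields that match the setters used above (`setImg`, `setTitle`, `setPrice`). The original project may well generate these accessors with Lombok, since the dependency is in the pom; they are written out here so the snippet stands alone.

```java
// Assumed shape of the Content class; the real one lives in the project repository.
// Declared without `public` only to keep the sketch in one snippet;
// in the project it would be a public class in its own file.
class Content {
    private String title;
    private String img;
    private String price;

    // A no-argument constructor and getter/setter pairs are what JSON mappers typically expect.
    Content() {
    }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    @Override
    public String toString() {
        return "Content{title='" + title + "', img='" + img + "', price='" + price + "'}";
    }
}
```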

Result

[Image: screenshot of the search results page]
