Search Engine Project

一头创死算了

Last modified: 2023-02-27 09:03:49

Category: Notes. Tags: search engine, mybatis, mysql

First published: 2022-08-13 10:51:18

Article link: https://blog.csdn.net/weixin_45715131/article/details/126316583

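I. Understanding Search Engines
With search engines such as Baidu or Sogou, entering a query returns a number of results, each made up of a title, a description, a display URL, images and other related content.
In essence, a search engine takes a query and returns a list of results, each with a title, a description and a clickable URL.

The core idea of searching
We call each web page a document. The naive way to search is to scan every document and check whether it contains the query term. This brute-force approach is expensive: as the number of documents grows, every search takes longer, while users expect a search engine to answer quickly. Would you keep using a search engine that needed more than a minute to look up a single word?
That is why we introduce the inverted index, a data structure designed for search engines.

Inverted index
Document: we only index HTML pages.
Forward index: indexed by document. It records which words each document contains; the document text is segmented into words and processed.
Inverted index: indexed by word. It records which documents contain each word and how important the word is in each of those documents.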
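
To make the two structures concrete, here is a minimal in-memory sketch. It is not the project's implementation (which stores both indexes in MySQL); the class name IndexSketch is hypothetical, and splitting on whitespace stands in for a real word segmenter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; the real project keeps both indexes in MySQL.
class IndexSketch {
    // Forward index: docId -> words contained in that document.
    final Map<Integer, List<String>> forward = new HashMap<>();
    // Inverted index: word -> ids of the documents that contain it.
    final Map<String, TreeSet<Integer>> inverted = new HashMap<>();

    void add(int docId, String text) {
        List<String> words = new ArrayList<>();
        // Assumption: whitespace splitting replaces a real segmenter such as ansj_seg.
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            words.add(word);
            inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
        forward.put(docId, words);
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.add(1, "array list");
        index.add(2, "linked list");
        System.out.println(index.forward.get(1));       // [array, list]
        System.out.println(index.inverted.get("list")); // [1, 2]
    }
}
```

A query for "list" then needs one map lookup instead of a scan over every document, which is the point of the inverted index.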

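II. Project Overview
We implement a search engine over the Java API documentation.
The Java API documentation is first saved to the local disk.
The search engine consists of the following modules:

Index building module: scans the downloaded documents, analyzes their content to build the forward and inverted indexes, and saves the indexes (in this implementation, to MySQL).
Search module: loads the indexes and, given a query, looks up results through the forward and inverted indexes.
Web module: a simple web page that displays the search results.

III. Implementing Index Construction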
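1. Word segmentation
Both the forward and the inverted index require the content to be segmented into words. We use the ansj_seg library for this.
Add the following dependency to pom.xml:

<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
</dependency>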

2. Scanning files
The location of the files to index is set in the configuration file:

searcher:
  indexer:
    doc-root-path: D:\搜索引擎\docs\api
    url-prefix: https://docs.oracle.com/javase/8/docs/api/

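Starting from rootPath as the root directory, the scanner walks the file tree and returns every File object that satisfies the filter as a List. The class is registered as a Spring bean with @Service, and the traversal is a recursive depth-first search.
The code is as follows:

@Service
public class FileScanner { // class name inferred from the fileScanner field used later
    public List<File> scanFile(String rootPath, FileFilter filter) {
        List<File> resultList = new ArrayList<>();
        File rootFile = new File(rootPath);
        traversal(rootFile, filter, resultList);
        return resultList;
    }

    private void traversal(File directoryFile, FileFilter filter, List<File> resultList) {
        // 1. List the children of this directory
        File[] files = directoryFile.listFiles();
        if (files == null) {
            // Not a directory, or an I/O error occurred
            return;
        }
        // 2. Check every child against the filter
        for (File file : files) {
            if (filter.accept(file)) {
                resultList.add(file);
            }
        }
        // 3. For every child that is a directory, continue the depth-first traversal (recursion)
        for (File file : files) {
            if (file.isDirectory()) {
                traversal(file, filter, resultList);
            }
        }
    }
}

This completes the file scan.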

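3. Creating the database tables
We use MySQL to store the indexed data. The design needs two tables: one for the forward index and one for the inverted index. Note that the mapper statements in later sections refer to these tables as forward_indexes and inverted_indexes, and that the inverted index table also stores the word itself (its columns are id, word, docid and weight). The table definitions are as follows:

CREATE TABLE searcher.weights (
    wid int(11) NOT NULL AUTO_INCREMENT,
    docId int(11) NOT NULL,
    weight int(11) NOT NULL,
    PRIMARY KEY (wid)
) COMMENT='Weight information of the inverted index, containing docId + weight';

CREATE TABLE searcher.documents (
    docid int(11) NOT NULL,
    title varchar(100) NOT NULL,
    url varchar(200) NOT NULL,
    content longtext NOT NULL,
    PRIMARY KEY (docid)
) COMMENT='Document table, i.e. the forward index table';

The forward index table stores the forward index, and the inverted index table stores the inverted index. The MySQL connection is also configured in the yml file.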

4. Processing the scanned files
Step 1: scan for all HTML files. The code is as follows:

List<File> htmlFileList = fileScanner.scanFile(properties.getDocRootPath(), file -> {
    return file.isFile() && file.getName().endsWith(".html");
});

Step 2: for each HTML file, extract its title, URL and body text, and wrap them in an object (a Document). The code is as follows:

File rootFile = new File(properties.getDocRootPath());
List<Document> documentList = htmlFileList.stream()
        .parallel() // Note: because this is a Stream pipeline, adding .parallel() runs it in parallel and uses multiple cores to speed it up
        .map(file -> new Document(file, properties.getUrlPrefix(), rootFile))
        .collect(Collectors.toList());

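1. Word segmentation
Every file name ends with the .html suffix, which is not part of the title, so the suffix is removed before the title is segmented. The code is as follows:

private String parseTitle(File file) {
    // Strip the .html suffix from the file name; what remains is treated as the title.
    String name = file.getName();
    String suffix = ".html";
    return name.substring(0, name.length() - suffix.length());
}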

Each document is segmented and a weight is computed for every word (here a word counts 10 for each occurrence in the title and 1 for each occurrence in the body).
Segmenting the title
The code is as follows:

public Map<String, Integer> segWordAndCalcWeight() {
    // Segmentation: which words appear in the title
    List<String> wordInTitle = ToAnalysis.parse(title)
            .getTerms()
            .stream()
            .parallel()
            .map(Term::getName)
            .collect(Collectors.toList());
    // Count how often each word appears in the title
    Map<String, Integer> titleWordCount = new HashMap<>();
    for (String word : wordInTitle) {
        int count = titleWordCount.getOrDefault(word, 0);
        titleWordCount.put(word, count + 1);
    }

Segmenting the content
The code is as follows:

    // Collect the words in the content and count how often each one appears
    List<String> wordInContent = ToAnalysis.parse(content)
            .getTerms()
            .stream()
            .parallel()
            .map(Term::getName)
            .collect(Collectors.toList());
    Map<String, Integer> contentWordCount = new HashMap<>();
    for (String word : wordInContent) {
        int count = contentWordCount.getOrDefault(word, 0);
        contentWordCount.put(word, count + 1);
    }

2. Computing the weights
A weight is computed once per distinct word, so the words are first deduplicated with a Set. We then iterate over the set, compute the weight of each word and put it into a Map.

    // Compute the weights
    Map<String, Integer> wordToWeight = new HashMap<>();
    // First collect the distinct words
    Set<String> wordSet = new HashSet<>(wordInTitle);
    wordSet.addAll(wordInContent);

    for (String word : wordSet) {
        int titleCount = titleWordCount.getOrDefault(word, 0);
        int contentCount = contentWordCount.getOrDefault(word, 0);
        int weight = titleCount * 10 + contentCount;

        wordToWeight.put(word, weight);
    }

    return wordToWeight;
}
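
As a worked example of the weighting rule (title occurrences count 10, body occurrences count 1), the following sketch computes the same weights in a single pass. It is an illustration rather than the project's code: WeightSketch and addWeighted are hypothetical names, and whitespace splitting again stands in for ansj_seg.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of weight = titleCount * 10 + contentCount.
class WeightSketch {
    static Map<String, Integer> weights(String title, String content) {
        Map<String, Integer> wordToWeight = new HashMap<>();
        addWeighted(wordToWeight, title, 10);  // each title occurrence adds 10
        addWeighted(wordToWeight, content, 1); // each body occurrence adds 1
        return wordToWeight;
    }

    private static void addWeighted(Map<String, Integer> target, String text, int perOccurrence) {
        // Assumption: whitespace splitting replaces the real segmenter.
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                target.merge(word, perOccurrence, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> w = weights("HashMap", "a hashmap maps keys to values in a hashmap");
        System.out.println(w.get("hashmap")); // 12 = 1 * 10 + 2
        System.out.println(w.get("keys"));    // 1
    }
}
```

Using Map.merge removes the need for the two intermediate count maps and the deduplicating Set, at the cost of no longer having the title and body counts available separately.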

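3. Handling the URL and JavaScript in the files
The yml file configures the URL prefix as https://docs.oracle.com/javase/8/docs/api/
To obtain the full URL we take the file's path relative to the document root. File paths on Windows use "\" while URLs use "/", so the separators are replaced, and the URL prefix and the relative path are then concatenated into the full URL.
The code is as follows:

// We need the path of file relative to rootFile.
// For example, if rootFile is D:\docs\api
// and file is D:\docs\api\javax\sql\DataSource.html,
// the relative path is javax\sql\DataSource.html.
// Replacing every backslash (\) with a forward slash (/)
// finally gives javax/sql/DataSource.html
@SneakyThrows // getCanonicalPath() throws the checked IOException
private String parseUrl(File file, String urlPrefix, File rootFile) {
    String rootPath = rootFile.getCanonicalPath();
    rootPath = rootPath.replace("/", "\\");
    if (!rootPath.endsWith("\\")) {
        rootPath = rootPath + "\\";
    }
    String filePath = file.getCanonicalPath();
    String relativePath = filePath.substring(rootPath.length());
    relativePath = relativePath.replace("\\", "/");
    return urlPrefix + relativePath;
}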
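
The string manipulation above assumes Windows-style separators. As an alternative sketch (not the author's method), java.nio.file.Path can compute the relative path and iterate over its name elements, which avoids handling separators by hand. UrlSketch and toUrl are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical alternative: build the URL from the name elements of the relative path.
class UrlSketch {
    static String toUrl(Path root, Path file, String urlPrefix) {
        // Path of file relative to root, e.g. javax/sql/DataSource.html
        Path relative = root.relativize(file);
        StringBuilder url = new StringBuilder(urlPrefix);
        for (int i = 0; i < relative.getNameCount(); i++) {
            if (i > 0) {
                url.append('/'); // URLs always use forward slashes
            }
            url.append(relative.getName(i));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Path root = Paths.get("docs", "api");
        Path file = Paths.get("docs", "api", "javax", "sql", "DataSource.html");
        System.out.println(toUrl(root, file, "https://docs.oracle.com/javase/8/docs/api/"));
    }
}
```

This sketch assumes the prefix already ends with "/", as it does in the configuration shown earlier.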

Handling JavaScript
HTML files contain JavaScript code, which would otherwise end up among the indexed keywords, so it has to be removed. This is done with regular expressions.
The code is as follows:

@SneakyThrows
private String parseContent(File file) {
    StringBuilder contentBuilder = new StringBuilder();

    try (InputStream is = new FileInputStream(file)) {
        try (Scanner scanner = new Scanner(is, "ISO-8859-1")) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                contentBuilder.append(line).append(" ");
            }

            return contentBuilder.toString()
                    // First remove <script ...>...</script>
                    .replaceAll("<script[^>]*>[^<]*</script>", " ")
                    // Remove the tags
                    .replaceAll("<[^>]*>", " ")
                    // Collapse runs of whitespace; line breaks are treated as spaces too
                    .replaceAll("\\s+", " ")
                    // Finally trim the spaces at both ends
                    .trim();
        }
    }
}
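
One limitation of the pattern above is that [^<]* stops at the first "<", so a script whose body contains a comparison such as a < b is not removed as a block. The sketch below shows one possible variation using a case-insensitive, dot-matches-all, reluctant pattern. It is an assumption about how the regex could be hardened, not the author's code, and regular expressions remain a rough tool for HTML; HtmlTextSketch and toText are hypothetical names.

```java
// Hypothetical variation of the tag-stripping step.
class HtmlTextSketch {
    static String toText(String html) {
        return html
                // (?i) ignore case, (?s) let . match line breaks, .*? stop at the first </script>
                .replaceAll("(?is)<script\\b[^>]*>.*?</script>", " ")
                // Remove the remaining tags
                .replaceAll("<[^>]*>", " ")
                // Collapse whitespace, including line breaks
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><script>if (a < b) { run(); }</script>"
                + "<h1>Class  ArrayList</h1>\n<p>Resizable array</p></html>";
        System.out.println(toText(html)); // Class ArrayList Resizable array
    }
}
```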

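5. Building the indexes
Building the indexes requires database access, for which I use MyBatis. We define a mapper interface, register it as a Spring bean and associate it with an XML file. MyBatis binds the properties of the Java objects passed in to the dynamic parameters in the SQL to produce the final statement, executes it, and maps the result back to Java objects.
I also made one optimization: a thread pool is used to reduce the time needed to insert the indexes. The code is as follows:

@Configuration
public class AppConfig {
    @Bean
    public ExecutorService executorService() {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                8, 20, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(5000),
                (Runnable task) -> {
                    Thread thread = new Thread(task);
                    thread.setName("batch-insert-thread");
                    return thread;
                },
                new ThreadPoolExecutor.AbortPolicy()
        );

        return executor;
    }
}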

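1. Forward index
The statement configured in the mapper XML is as follows (the values tuple is repeated once per document by a foreach element):

insert into forward_indexes (title, url, content) values (#{doc.title}, #{doc.url}, #{doc.content})

Inserts are done in batches to reduce the total time. Because the forward index stores whole documents and MySQL limits the size of a single statement (the max_allowed_packet setting), each batch contains only 10 rows.
The code is as follows:

@SneakyThrows
public void saveForwardIndexesConcurrent(List<Document> documentList) {
    // 1. Rows per batch insert (each row is fairly large, so 10 is enough)
    int batchSize = 10;
    // 2. How many SQL statements are needed? ceil(documentList.size() / batchSize)
    int listSize = documentList.size();
    int times = (int) Math.ceil(1.0 * listSize / batchSize); // ceil: round up
    log.debug("{} batches in total.", times);
    CountDownLatch latch = new CountDownLatch(times); // tracks completion; starts at times (the number of batches)
    // 3. Insert batch by batch
    for (int i = 0; i < listSize; i += batchSize) {
        // Cut this batch out of documentList with List.subList(int from, int to)
        int from = i;
        int to = Integer.min(from + batchSize, listSize);
        Runnable task = () -> { // variables captured by an inner class or lambda must be final or effectively final
            try {
                List<Document> subList = documentList.subList(from, to);
                // Batch insert this subList
                mapper.batchInsertForwardIndexes(subList);
            } finally {
                latch.countDown(); // count down even if the insert fails, so await() cannot hang
            }
        };
        executorService.submit(task); // the main thread only submits tasks; the pool threads do the inserts
    }
    // 4. When the loop ends, all tasks have been submitted, but they may not have finished yet.
    // The main thread waits on the latch until its count reaches 0, i.e. until every task has run.
    latch.await();
}

This completes the method that inserts the forward index.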
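
The batching pattern in this method can be tried without a database. The sketch below is a self-contained approximation: the insertBatch callback is a hypothetical stand-in for mapper.batchInsertForwardIndexes, and BatchSketch and runInBatches are illustrative names. It places countDown() in a finally block so that a failing batch cannot leave the waiting thread blocked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch of "split into batches, run on a pool, wait on a latch".
class BatchSketch {
    static <T> int runInBatches(List<T> items, int batchSize, ExecutorService pool,
                                Consumer<List<T>> insertBatch) {
        int times = (items.size() + batchSize - 1) / batchSize; // integer ceiling
        CountDownLatch latch = new CountDownLatch(times);
        for (int from = 0; from < items.size(); from += batchSize) {
            List<T> batch = items.subList(from, Math.min(from + batchSize, items.size()));
            pool.submit(() -> {
                try {
                    insertBatch.accept(batch); // stand-in for the real batch insert
                } finally {
                    latch.countDown();         // always runs, even if the insert throws
                }
            });
        }
        try {
            latch.await(); // block until every batch has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batches", e);
        }
        return times;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 103; i++) {
            items.add(i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger inserted = new AtomicInteger();
        int batches = runInBatches(items, 10, pool, batch -> inserted.addAndGet(batch.size()));
        pool.shutdown();
        System.out.println(batches + " batches, " + inserted.get() + " rows"); // 11 batches, 103 rows
    }
}
```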

2. Inverted index
The statement configured in the mapper XML is as follows (again, the values tuple is repeated once per record by a foreach element):

insert into inverted_indexes (word, docid, weight) values (#{record.word}, #{record.docId}, #{record.weight})

Inserts are again done in batches to reduce the total time. Each inverted index record is small, so up to 10,000 records are inserted per statement. A separate task class handles the inserts. The code is as follows:

static class InvertedInsertTask implements Runnable {
    private final CountDownLatch latch;
    private final int batchSize;
    private final List<Document> documentList;
    private final IndexDatabaseMapper mapper;

    InvertedInsertTask(CountDownLatch latch, int batchSize, List<Document> documentList, IndexDatabaseMapper mapper) {
        this.latch = latch;
        this.batchSize = batchSize;
        this.documentList = documentList;
        this.mapper = mapper;
    }

    @Override
    public void run() {
        try {
            List<InvertedRecord> recordList = new ArrayList<>();    // the records of the current batch

            for (Document document : documentList) {
                Map<String, Integer> wordToWeight = document.segWordAndCalcWeight();
                for (Map.Entry<String, Integer> entry : wordToWeight.entrySet()) {
                    String word = entry.getKey();
                    int docId = document.getDocId();
                    int weight = entry.getValue();

                    InvertedRecord record = new InvertedRecord(word, docId, weight);

                    recordList.add(record);

                    // Once recordList.size() == batchSize there is enough for one insert
                    if (recordList.size() == batchSize) {
                        mapper.batchInsertInvertedIndexes(recordList);  // batch insert
                        recordList.clear();                             // empty the list for the next batch
                    }
                }
            }
            // Some records may be left over that never filled a whole batch; insert them last.
            // An empty list is skipped, because it would produce an insert with no values.
            if (!recordList.isEmpty()) {
                mapper.batchInsertInvertedIndexes(recordList);
            }
        } finally {
            latch.countDown(); // count down even if an insert fails, so await() cannot hang
        }
    }
}

@Timing("Build + save inverted index (multithreaded version)")
@SneakyThrows
public void saveInvertedIndexesConcurrent(List<Document> documentList) {
    int batchSize = 10000; // at most 10000 records per batch insert
    int groupSize = 50;    // documents handled by one task
    int listSize = documentList.size();
    int times = (int) Math.ceil(listSize * 1.0 / groupSize);
    CountDownLatch latch = new CountDownLatch(times);

    for (int i = 0; i < listSize; i += groupSize) {
        int from = i;
        int to = Integer.min(from + groupSize, listSize);
        List<Document> subList = documentList.subList(from, to);
        Runnable task = new InvertedInsertTask(latch, batchSize, subList, mapper);
        executorService.submit(task);
    }

    latch.await();
}

This completes the method that inserts the inverted index.

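6. Saving the indexes
Call the forward index and inverted index methods created above and pass in the document list. The code is as follows:

    // 3. Save the forward index
    indexManager.saveForwardIndexesConcurrent(documentList);
    log.debug("Forward index saved.");

    // 4. Build and save the inverted index
    indexManager.saveInvertedIndexesConcurrent(documentList);
    log.debug("Inverted index saved.");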

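IV. Building the Web Interface
The web interface is what makes the search engine usable. The HTML page sends the user's query to the back end, which queries the database through MyBatis and passes the matching data back to the front end for display.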

1. Interaction between the front end and the database
The SearchMapper interface is registered with @Repository and @Mapper and serves as the DAO layer that reads from the database. The result rows are mapped onto the following class:

@Data // Lombok (assumed here): generates getDocId(), which the search code calls
public class DocumentWightWeight {
    private int docId;
    private String title;
    private String url;
    private String content;
    public int weight;

    public DocumentWightWeight() {}

    // Copy constructor
    public DocumentWightWeight(DocumentWightWeight documentWightWeight) {
        this.docId = documentWightWeight.docId;
        this.title = documentWightWeight.title;
        this.url = documentWightWeight.url;
        this.content = documentWightWeight.content;
        this.weight = documentWightWeight.weight;
    }

    // Convert to a plain Document, dropping the weight
    public Document toDocument() {
        Document document = new Document();
        document.setDocId(this.docId);
        document.setTitle(this.title);
        document.setUrl(this.url);
        document.setContent(this.content);

        return document;
    }
}

These are the fields we want to read from the database.

@Repository
@Mapper
public interface SearchMapper {
    List<DocumentWightWeight> queryWithWeight(
            @Param("word") String word,
            @Param("limit") int limit,
            @Param("offset") int offset
    );
}

This interface performs the database query and is associated with an XML file. The statement in the XML is:

select ii.docid, title, url, content, weight
from inverted_indexes ii
join forward_indexes fi on ii.docid = fi.docid
where word = #{word}
order by weight desc
limit ${limit} offset ${offset}

The SQL joins the forward index table with the inverted index table.

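2. Processing the data
The incoming query is used to search the database, and the results are paginated. For a multi-word query, the query is first segmented, each word is looked up separately, and the results are then aggregated.
The code is as follows:

public String search(String query, @RequestParam(value = "page", required = false) String pageString, Model model) {
    // Segment the query into words
    List<String> queryList = ToAnalysis.parse(query)
            .getTerms()
            .stream()
            .map(Term::getName)
            .collect(Collectors.toList());
    // (the method continues in the next listing)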

The weights of each word across the different documents are then aggregated again, and the results are sorted by weight.
The code is as follows:

    // page, limit and offset are derived from pageString; that code is not shown here
    List<DocumentWightWeight> totalList = new ArrayList<>();
    for (String s : queryList) {
        List<DocumentWightWeight> documentList = mapper.queryWithWeight(s, limit, offset);
        totalList.addAll(documentList);
    }
    // Aggregate the weights over all returned documents,
    // maintaining a map of docId -> document
    Map<Integer, DocumentWightWeight> documentMap = new HashMap<>();
    for (DocumentWightWeight documentWightWeight : totalList) {
        int docId = documentWightWeight.getDocId();
        if (documentMap.containsKey(docId)) {
            DocumentWightWeight item = documentMap.get(docId);
            item.weight += documentWightWeight.weight;
            continue;
        }
        DocumentWightWeight item = new DocumentWightWeight(documentWightWeight);
        documentMap.put(docId, item);
    }
    Collection<DocumentWightWeight> values = documentMap.values();
    // A Collection has no notion of order (only linear structures can be sorted), so we need a List
    List<DocumentWightWeight> list = new ArrayList<>(values);
    // Sort by weight, largest first
    list.sort((item1, item2) -> Integer.compare(item2.weight, item1.weight));
    // Clamp the page range so that subList never goes past either end of the list
    int from = Math.min(Math.max(page - 1, 0) * 20, list.size());
    int to = Math.min(from + 20, list.size());
    // Take the page out of the list
    List<DocumentWightWeight> subList = list.subList(from, to);
    List<Document> documentList = subList.stream()
            .map(DocumentWightWeight::toDocument)
            .collect(Collectors.toList());
    // Variables used inside a lambda must be final or effectively final
    List<String> wordList = queryList;
    documentList = documentList.stream()
            .map(doc -> descBuilder.build(wordList, doc))
            .collect(Collectors.toList());
    // Add the data to the model so that it is available when the template is rendered
    model.addAttribute("query", query);
    model.addAttribute("docList", documentList);
    model.addAttribute("page", page);
    }

After this, the weights of the individual words in a multi-word query have been aggregated per document and the results paginated.
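
The aggregation and paging steps can be sketched in isolation as follows. This is a simplified illustration, not the controller's code: RankingSketch and rank are hypothetical names, and each hit is reduced to an int[] {docId, weight} pair instead of a DocumentWightWeight.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sum weights per document, sort descending, return one page of docIds.
class RankingSketch {
    static List<Integer> rank(List<int[]> hits, int page, int pageSize) {
        // docId -> total weight over all query words
        Map<Integer, Integer> total = new LinkedHashMap<>();
        for (int[] hit : hits) {
            total.merge(hit[0], hit[1], Integer::sum);
        }
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(total.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        // Clamp both bounds so that an out-of-range page yields an empty list, not an exception
        int from = Math.min(Math.max(page - 1, 0) * pageSize, entries.size());
        int to = Math.min(from + pageSize, entries.size());
        List<Integer> docIds = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : entries.subList(from, to)) {
            docIds.add(entry.getKey());
        }
        return docIds;
    }

    public static void main(String[] args) {
        // Document 2 matches two query words (3 + 12 = 15), so it ranks first.
        List<int[]> hits = Arrays.asList(
                new int[]{1, 10}, new int[]{2, 3}, new int[]{2, 12}, new int[]{3, 1});
        System.out.println(rank(hits, 1, 2)); // [2, 1]
        System.out.println(rank(hits, 2, 2)); // [3]
        System.out.println(rank(hits, 5, 2)); // []
    }
}
```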

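3. Results page
The results page is built with Thymeleaf templates.

Thymeleaf lets the page display the information fetched from the database, such as the URL, title and summary of each result, which makes the results page richer.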

Original article: https://blog.csdn.net/m0_51529857/article/details/126297067

Notes

Data volume: about 10,000 rows in the forward index and about 6,000,000 rows in the inverted index.

Batch insert
Inserting one row per statement:

for (...) {    // loops 6,000,000 times
    insert into table (...) values (...);
}

Inserting many rows per statement, for example 1,000 rows at a time, performs better:

for (...) {    // loops 6,000 times
    insert into table (...) values (...), (...), (...), ...;
}

Changes made
1. Change the table structure. Two tables solve the problem:
Forward index table (docid-pk, title, url, content): small overall, only about 10,000 rows, but each row is large because content is large. Use small batches (10 rows per insert).
Inverted index table (id-pk, word, docid, weight): large overall, about 6,000,000 rows, but each row is small. Use large batches (10,000 rows per insert).
2. Change how docid is generated: instead of using the ArrayList size() at append time as the docid (an auto-increment id controlled by hand), use MySQL's auto-increment column as the docid.
3. Save the index data with batch inserts.
4. The overall flow stays the same, but some classes that are redundant or badly placed may be adjusted.

Timing of document construction
The Indexer log (logger c.p.searcher.indexer.command.Indexer, thread main) reports "Directory scan finished, 10460 files found." followed by "Documents built, 10460 documents in total." The timestamp pairs recorded on 2022-08-05 are:

09:30:29.243 to 09:31:36.233 (process 23272, about 67 s)
09:34:38.149 to 09:34:49.518 (process 8100, about 11 s)
09:36:49.656 to 09:37:18.293 (process 26408, about 29 s)

The notes also record a run without parallel() for comparison, and mark the two regex replacements as the slow operations:

// First remove <script ...>...</script>    (these two operations are relatively slow)
line = line.replaceAll("<script[^>]*>[^<]*</script>", " ");
// Remove the tags
line = line.replaceAll("<[^>]*>", " ");
// The extra space means that line breaks are treated as spaces too

MySQL batch insert syntax:

insert into forward_indexes (title, url, content) values
    ('1', '2', '3'), ('4', '5', '6'), ('7', '8', '9');

Using the dynamic SQL feature of MyBatis
https://mybatis.org/mybatis-3/zh/dynamic-sql.html
Another common use of dynamic SQL is iterating over a collection, especially when building an IN condition. For example:

SELECT * FROM POST P
<where>
    <foreach item="item" index="index" collection="list"
             open="ID in (" separator="," close=")" nullable="true">
        #{item}
    </foreach>
</where>

Here collection="list" names the collection to iterate over, the current position is stored in index (index="index"), and each element is stored in item (item="item").

application.yml

mybatis:
  mapper-locations: classpath:mapper/index-mapper.xml

This setting in the Spring configuration file tells MyBatis where to look for mapper XML files. classpath: can be read as "search under src/main/resources" (this is not strictly accurate, but it is close enough for now). Full path: src/main/resources/mapper/index-mapper.xml

IndexDatabaseMapper.java

package com.peixinchen.searcher.indexer.mapper;

import com.peixinchen.searcher.indexer.model.Document;
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Param;
import org.springframework.stereotype.Repository;

import java.util.List;

@Repository // register as a Spring bean
@Mapper     // a Mapper managed by MyBatis
public interface IndexDatabaseMapper {
    void batchInsertForwardIndexes(@Param("list") List<Document> documentList);
}

index-mapper.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
        PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.peixinchen.searcher.indexer.mapper.IndexDatabaseMapper">
    <!-- This insert tag serves the batchInsertForwardIndexes method of the interface -->
    <insert id="batchInsertForwardIndexes" useGeneratedKeys="true" keyProperty="docId" keyColumn="docid">
        insert into forward_indexes (title, url, content) values
        <!-- The number of records depends on the list passed in by the caller, so dynamic SQL is used -->
        <foreach collection="list" item="doc" separator=",">
            (#{doc.title}, #{doc.url}, #{doc.content})
        </foreach>
    </insert>
</mapper>

For a batch of 10 documents, the statement and parameters that MyBatis logs look like this:

insert into forward_indexes (title, url, content) values (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?), (?, ?, ?)
Parameters: 0(String), 0(String), 0(String), 1(String), 1(String), 1(String), 2(String), 2(String), 2(String), 3(String), 3(String), 3(String), 4(String), 4(String), 4(String), 5(String), 5(String), 5(String), 6(String), 6(String), 6(String), 7(String), 7(String), 7(String), 8(String), 8(String), 8(String), 9(String), 9(String), 9(String)

The auto-increment id generated by the insert has to be written back into our Document objects, which is what useGeneratedKeys, keyProperty="docId" and keyColumn="docid" are for:

@Slf4j
@Data
public class Document {
    private Integer docId;
    private String title;
    private String url;
    private String content;
    // other members not shown
}

The SQL that is finally executed is assembled by code similar to the following:

String sql = "insert into forward_indexes (title, url, content) values ";
for (Document doc : list) {
    sql += "(doc.title, doc.url, doc.content)";
    sql += ", ";
}

List.subList

List<E> subList(int fromIndex, int toIndex)
Returns a view of the portion of this list between the specified fromIndex, inclusive, and toIndex, exclusive. (If fromIndex and toIndex are equal, the returned list is empty.) The returned list is backed by this list, so non-structural changes in the returned list are reflected in this list, and vice-versa. The returned list supports all of the optional list operations supported by this list.

In other words, it returns the sublist [from, to) of the original list. For example, with 103 elements in the list and 10 elements per batch:

from = 0, to = 10
from = 10, to = 20
...
from = 90, to = 100
from = 100, to = 103
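
The notes above describe the multi-row statement that the foreach element ends up producing. As a rough illustration of that expansion, written with JDBC-style "?" placeholders rather than MyBatis itself, the statement text for a given number of rows can be built like this. MultiRowInsertSketch and buildInsert are hypothetical names.

```java
// Hypothetical sketch: build the text of a multi-row insert with one "(?, ?, ?)" group per row.
class MultiRowInsertSketch {
    static String buildInsert(int rowCount) {
        if (rowCount <= 0) {
            // An insert with no value groups is not valid SQL
            throw new IllegalArgumentException("rowCount must be positive");
        }
        StringBuilder sql = new StringBuilder(
                "insert into forward_indexes (title, url, content) values ");
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) {
                sql.append(", "); // separator goes between groups only, never after the last one
            }
            sql.append("(?, ?, ?)");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildInsert(3));
    }
}
```

Placing the separator only between groups is what the separator attribute of foreach does, and it is also why the simplified string concatenation in the notes would leave a trailing comma if taken literally.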
