[Personal Notes]

m0_63187203

Last modified: 2024-06-23 16:19:03


First published: 2024-06-20 19:24:15

Article link: https://blog.csdn.net/m0_63187203/article/details/139840957


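// The imports below are assumptions: JUnit 5, Spring Boot 3, Hutool, LangChain4j
// and the Elasticsearch Java API Client. ModelFactory, FileService, OCRServiceHttp
// and NameEnums are project classes, so their imports are omitted here.
import cn.hutool.core.io.resource.ResourceUtil;
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.Refresh;
import co.elastic.clients.elasticsearch.core.IndexResponse;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;
import jakarta.annotation.Resource; // javax.annotation.Resource on Spring Boot 2
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@SpringBootTest
public class ocrtestAndSave {

    @Autowired
    private ModelFactory modelFactory;

    @Resource
    private ElasticsearchClient client;

    @Autowired
    private FileService fileService;

    @Resource
    private OCRServiceHttp ocrServiceHttp;

    @Test
    public void ocrTest() throws IOException {
        // Locate the resource file
        URL resource = ResourceUtil.getResource("legal_data/PDF/test01.pdf");
        File file = new File(resource.getFile());

        // Get the file's absolute path
        String filePath = file.getAbsolutePath();

        // Run the file through the OCR service
        String ocrResult = ocrServiceHttp.processFile(filePath);
        System.out.println(ocrResult);

        // Create a splitter and split the text
        DocumentSplitter splitter = modelFactory.createDocumentSplitter(NameEnums.DEFAULT_DOCUMENT_SPLITTER.getText());
        Document document = new Document(ocrResult);
        List<TextSegment> segments = splitter.split(document);

        // Convert each TextSegment to a String
        List<String> chunks = segments.stream()
                .map(TextSegment::text)
                .collect(Collectors.toList());

        // Save the chunks to Elasticsearch
        String indexName = "my_index";
        for (String chunk : chunks) {
            // Build a document that holds the chunk text
            Map<String, String> jsonDocument = new HashMap<>();
            jsonDocument.put("content", chunk);

            // Index the document into Elasticsearch
            IndexResponse response = client.index(i -> i
                    .index(indexName)
                    .document(jsonDocument)
                    .refresh(Refresh.True));
            System.out.println("Document indexed: ID=" + response.id());
        }
    }
}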

1. Get the resource file path

URL resource = ResourceUtil.getResource("legal_data/PDF/test01.pdf");
File file = new File(resource.getFile());

`ResourceUtil.getResource` looks up the resource file and returns its URL, which is then used to create a `File` object.
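One caveat with this step: `URL.getFile()` returns the path exactly as it appears in the URL, so a space in a directory name stays as `%20` and the resulting `File` may not exist. A minimal sketch of a safer conversion, using only the JDK, is shown here. The class name `ResourcePaths` and the sample path are illustrative assumptions, not part of the original test.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper (not from the original post): converts a file: URL
// to a File by going through a URI, which decodes percent-escapes.
final class ResourcePaths {

    static File toFile(URL url) {
        // Resources packed inside a JAR use the jar: protocol and cannot be
        // opened as a plain File, so reject anything that is not file:.
        if (!"file".equals(url.getProtocol())) {
            throw new IllegalArgumentException("Not a file: URL: " + url);
        }
        try {
            return Paths.get(url.toURI()).toFile();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed resource URL: " + url, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A path with a space, to make the difference visible.
        URL url = new File("legal data/test01.pdf").toURI().toURL();
        System.out.println("URL.getFile(): " + url.getFile());          // still percent-encoded
        System.out.println("via toURI():   " + toFile(url).getPath());  // decoded
    }
}
```

If the PDF path never contains spaces or non-ASCII characters, the original `new File(resource.getFile())` behaves the same way.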

2. Get the file path

String filePath = file.getAbsolutePath();

This gets the absolute path of the file.

3. Process the file with the OCR service

String ocrResult = ocrServiceHttp.processFile(filePath);
System.out.println(ocrResult);

The OCR service processes the file, and the recognized text is printed.
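`OCRServiceHttp` is a project class and its implementation is not shown in these notes. As a rough idea of what an HTTP-backed `processFile` could look like, here is a sketch built on the JDK's `java.net.http` client (Java 11+). Everything specific in it is an assumption: the class name `HttpOcrClientSketch`, the endpoint, the `application/pdf` content type, and a service that answers with the recognized text as a plain response body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical stand-in for the project's OCRServiceHttp; the real class may
// use a different endpoint, request format, and response format.
final class HttpOcrClientSketch {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    HttpOcrClientSketch(URI endpoint) {
        this.endpoint = endpoint;
    }

    // Builds a POST request whose body is the raw bytes of the file.
    static HttpRequest buildRequest(URI endpoint, Path file) throws IOException {
        return HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(60)) // OCR on a long PDF can be slow
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    // Sends the file and returns the response body, assumed to be the OCR text.
    String processFile(String filePath) throws IOException, InterruptedException {
        HttpRequest request = buildRequest(endpoint, Path.of(filePath));
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("OCR service returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Failing on a non-200 status keeps an error page from being split and indexed as if it were document text.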

4. Create a splitter and split the text

DocumentSplitter splitter = modelFactory.createDocumentSplitter(NameEnums.DEFAULT_DOCUMENT_SPLITTER.getText());
Document document = new Document(ocrResult);
List<TextSegment> segments = splitter.split(document);

The model factory creates a document splitter, which splits the OCR result into multiple text segments.
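Which splitter `DEFAULT_DOCUMENT_SPLITTER` resolves to is not shown here. To illustrate the general idea of splitting text into segments, the sketch below implements a naive fixed-size splitter with overlap in plain Java. It is not the algorithm the factory's splitter uses; the class name `OverlapSplitterSketch` and the parameters `maxChars` and `overlap` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of chunking with overlap; real splitters usually
// also respect paragraph or sentence boundaries and may count tokens.
final class OverlapSplitterSketch {

    static List<String> split(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("Require 0 <= overlap < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each chunk repeats the last character of the previous one.
        System.out.println(split("abcdefghij", 4, 1)); // [abcd, defg, ghij]
    }
}
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which can help later retrieval over the indexed text.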

5. Convert each TextSegment to a String

List<String> chunks = segments.stream()
        .map(TextSegment::text)
        .collect(Collectors.toList());

Each `TextSegment` is converted to a string, and the strings are collected into a list.

6. Save the chunks to Elasticsearch

String indexName = "my_index";
for (String chunk : chunks) {
    Map<String, String> jsonDocument = new HashMap<>();
    jsonDocument.put("content", chunk);

    IndexResponse response = client.index(i -> i
            .index(indexName)
            .document(jsonDocument)
            .refresh(Refresh.True));
    System.out.println("Document indexed: ID=" + response.id());
}

Each chunk is saved to the Elasticsearch index as its own document, and the ID of each indexed document is printed.
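The loop above sends one index request per chunk and forces a refresh on every request (`Refresh.True`), which may become slow for long documents. One common mitigation is to group the chunks into batches and send each batch in a single bulk request; the Elasticsearch Java client provides `client.bulk(...)` for that. The sketch below covers only the batching half, in plain Java. The class name `ChunkBatcher` and the batch size are illustrative assumptions, not part of the original test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups items into consecutive batches so that each
// batch can be sent to Elasticsearch in one bulk request.
final class ChunkBatcher {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("a", "b", "c", "d", "e");
        System.out.println(partition(chunks, 2)); // [[a, b], [c, d], [e]]
    }
}
```

For a one-off test with a handful of chunks, the original per-document loop is simpler and perfectly adequate.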
