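Ways to implement attachment search in Elasticsearch (ES):
1. Using the ingest-attachment pipeline plugin
Plugin download URL (change the version number to match your ES version):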
https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-5.6.9.zip
① Create the attachment-parsing pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Attachment parsing",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [
          "content",
          "title",
          "author",
          "keywords",
          "date",
          "content_length",
          "content_type"
        ]
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}
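② Build the document
The pipeline reads the attachment from the data field, so that field must hold the Base64-encoded file content. The Java code for converting an InputStream to a Base64 string is as follows: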
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.commons.codec.binary.Base64; // commons-codec
import org.apache.commons.io.IOUtils;          // commons-io

// Reads the whole stream into memory and returns it as a Base64 string.
// Returns "" if reading fails; the caller is still responsible for closing inputStream.
private String encodeStream(InputStream inputStream) {
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        IOUtils.copy(inputStream, out);
        return Base64.encodeBase64String(out.toByteArray());
    } catch (Exception e) {
        logger.error("Failed to parse file", e);
        return "";
    }
}
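If adding commons-io and commons-codec is not desirable, the same conversion can be sketched with the JDK alone (Java 8+). The class name StreamBase64 is hypothetical, and unlike the method above this variant rethrows read failures instead of returning an empty string, which is a design assumption rather than part of the original approach:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: JDK-only InputStream -> Base64 conversion.
final class StreamBase64 {

    /** Reads the stream to its end and returns the bytes as a Base64 string. */
    static String encode(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Surface the failure instead of silently producing an empty attachment
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(encode(in)); // prints aGVsbG8=
    }
}
```

Both versions buffer the entire file in memory, so very large attachments may need a size check before encoding.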
③ Write the parsed text data to ES
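For the pipeline to run, the index request has to name it through the pipeline query parameter. A minimal sketch of the request path and body is shown here. The class name AttachmentIndexRequest and the index, type, and id values (docs, doc, 1) are hypothetical; only the pipeline id attachment and the data field come from step ①. The path uses the /index/type/id form of ES 5.x:

```java
// Hypothetical helper that only builds the request pieces; sending them
// (REST client, HttpURLConnection, curl, ...) is left to the application.
final class AttachmentIndexRequest {

    // Pipeline id created in step ①
    static final String PIPELINE = "attachment";

    /** Request path for indexing one document through the pipeline. */
    static String path(String index, String type, String id) {
        return "/" + index + "/" + type + "/" + id + "?pipeline=" + PIPELINE;
    }

    /** JSON body; Base64 text contains no characters that need JSON escaping. */
    static String body(String base64Data) {
        return "{\"data\":\"" + base64Data + "\"}";
    }

    public static void main(String[] args) {
        System.out.println("PUT " + path("docs", "doc", "1"));
        System.out.println(body("aGVsbG8="));
    }
}
```

After the pipeline runs, the extracted properties are stored under the attachment field of the document (the processor's default target field), and the remove processor drops the original data field.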
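2. Parsing file content in application code before writing to ES
Uploaded documents come in many types (Office, PDF, TXT, and so on), so the third-party library Apache Tika is used as the parser; the attachment plugin itself is also built on Apache Tika. The dependency is as follows: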
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-app</artifactId>
    <version>1.24</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
A simple parsing example follows, using a ZIP archive:
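import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Extracts the text of every file in the archive and concatenates it.
// Closes zipFile before returning; returns "" if any entry fails to parse.
public static String getFileContent(ZipFile zipFile) {
    try {
        StringBuilder sb = new StringBuilder();
        Parser parser = new AutoDetectParser(); // detects each entry's type; reused across entries
        Enumeration<? extends ZipEntry> entries = zipFile.entries();
        while (entries.hasMoreElements()) {
            ZipEntry zipEntry = entries.nextElement();
            if (zipEntry.isDirectory()) {
                continue;
            }
            // try-with-resources closes the entry stream even if parsing throws
            try (InputStream input = zipFile.getInputStream(zipEntry)) {
                // Write limit: at most 100 * 1024 * 1024 characters of text per entry
                BodyContentHandler textHandler = new BodyContentHandler(100 * 1024 * 1024);
                Metadata metadata = new Metadata(); // holds metadata such as author and title
                ParseContext context = new ParseContext();
                parser.parse(input, textHandler, metadata, context); // run the parse
                sb.append(textHandler.toString());
            }
        }
        return sb.toString();
    } catch (Exception e) {
        LOGGER.error("Failed to read file text content", e);
        return "";
    } finally {
        if (zipFile != null) {
            try {
                zipFile.close();
            } catch (Exception e) {
                LOGGER.error("Failed to close the ZIP file after reading its content", e);
            }
        }
    }
}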
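Once parsed, the file content is written to ES as a field of the document, and an index template can be used to specify how that field is analyzed (tokenized).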