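A PDF-based example. Steps 1 and 2 are only needed when adding documents; queries do not need them.
1. Configure a requestHandler that supports this feature. Edit solrconfig.xml and add: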
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.content">content</str>
</lst>
</requestHandler>
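<str name="uprefix">ignored_</str> prefixes every extracted field that is not defined in the schema with ignored_, so that fields you do not map are ignored.
<str name="fmap.content">content</str> maps the extracted content field to the Solr field named content, which must be defined in managed-schema.
solr.extraction.ExtractingRequestHandler is the Solr handler for rich documents. It ships in the solr-cell jar (solr-cell-8.5.2.jar under dist for Solr 8.5.2). Copy that jar and the Tika dependency jars from contrib/extraction/lib into the lib directory, or make sure the lib directives in solrconfig.xml cover them.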
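To make the effect of lowernames, fmap.* and uprefix easier to see, here is a small simulation of the processing order described in the Solr Cell documentation: lowercase the name, apply fmap, then prefix whatever the schema does not know. It is only a sketch and not Solr's implementation; the class FieldMappingSketch, its methods, and the simplified "schema" (a plain set of field names, with no dynamic fields) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: simulates how an extracted field name ends up as a Solr field name.
class FieldMappingSketch {

    // lowernames=true: lowercase letters and digits, replace everything else with '_'.
    static String lowerName(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : '_');
        }
        return sb.toString();
    }

    // Order: lowernames, then fmap.<source>=<target>, then uprefix for unknown fields.
    static String mapFieldName(String extractedName, boolean lowernames,
                               Map<String, String> fmap, String uprefix,
                               Set<String> schemaFields) {
        String name = lowernames ? lowerName(extractedName) : extractedName;
        name = fmap.getOrDefault(name, name);
        if (uprefix != null && !schemaFields.contains(name)) {
            name = uprefix + name; // later caught by the ignored_* dynamic field
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = new HashMap<>();
        fmap.put("content", "content"); // mirrors <str name="fmap.content">content</str>
        Set<String> schema = new HashSet<>(Arrays.asList("id", "content"));
        for (String n : new String[] {"content", "Content-Type", "Last-Modified"}) {
            System.out.println(n + " -> " + mapFieldName(n, true, fmap, "ignored_", schema));
        }
    }
}
```

With these settings only content survives under its own name; the other metadata fields receive the ignored_ prefix and are dropped by the ignored field type.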
2. Modify the managed-schema file and add:
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
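This defines a dynamic field of type ignored that absorbs the ignored fields. The default Solr 8 schema already contains both definitions, so they do not need to be added again.
3. The code
Add: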
public static void indexFilesSolrCell1(String solrId, String fileName)
throws IOException, SolrServerException {
List<String> url=new ArrayList<String>();
url.add("10.217.37.30:2181");
url.add("10.217.37.28:2182");
Optional<String> zkChroot=Optional.of("/");
CloudSolrClient.Builder builder=new CloudSolrClient.Builder(url,zkChroot);
CloudSolrClient solrClient=builder.build();
solrClient.setZkClientTimeout(100000);
solrClient.setZkConnectTimeout(1000000);
solrClient.setDefaultCollection("pdfcore");
ContentStreamUpdateRequest up
= new ContentStreamUpdateRequest("/update/extract");
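up.addFile(new File(fileName), getFileContentType(fileName)); // pass a MIME type such as application/pdf rather than the bare string "pdf"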
up.setParam("literal.id", solrId);
up.setParam("uprefix", "ignored_");
//up.setParam("fmap.content", "content");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrClient.request(up);
solrClient.close(); // release the ZooKeeper connection, as the query and delete methods do
System.out.println("insert ok! \r\n");
}
// Get the content type from the file extension
public static String getFileContentType(String filename) {
String contentType = "";
String prefix = filename.substring(filename.lastIndexOf(".") + 1);
if (prefix.equals("xlsx")) {
contentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
} else if (prefix.equals("pdf")) {
contentType = "application/pdf";
} else if (prefix.equals("doc")) {
contentType = "application/msword";
} else if (prefix.equals("txt")) {
contentType = "text/plain";
} else if (prefix.equals("xls")) {
contentType = "application/vnd.ms-excel";
} else if (prefix.equals("docx")) {
contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
} else if (prefix.equals("ppt")) {
contentType = "application/vnd.ms-powerpoint";
} else if (prefix.equals("pptx")) {
contentType = "application/vnd.openxmlformats-officedocument.presentationml.presentation";
}
else {
contentType = "othertype";
}
return contentType;
}
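The same lookup can also be written as a table, which makes it easier to extend and to handle upper-case extensions such as .PDF. The sketch below is one possible variant, not a replacement the original relies on; the class name ContentTypeLookup and the fallback value application/octet-stream (instead of "othertype") are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: table-driven, case-insensitive extension -> MIME type lookup.
class ContentTypeLookup {

    private static final String FALLBACK = "application/octet-stream"; // assumed default
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("txt", "text/plain");
        TYPES.put("doc", "application/msword");
        TYPES.put("xls", "application/vnd.ms-excel");
        TYPES.put("ppt", "application/vnd.ms-powerpoint");
        TYPES.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        TYPES.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        TYPES.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
    }

    static String contentTypeFor(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return FALLBACK; // no extension at all
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(contentTypeFor("manual.PDF"));
        System.out.println(contentTypeFor("README"));
    }
}
```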
Extending to multiple files:
for (File file : files) {
// one ContentStreamUpdateRequest per file
ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/extract");
request.addFile(file, "application/pdf");
request.setParam("literal.id", file.getName());
//request.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // leave this line commented out
client.request(request);
}
client.commit();
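- client.commit() should sit at the outermost level, i.e. commit once at the end.
- Do not set an action on the individual requests.
- Use one ContentStreamUpdateRequest object per file; otherwise the content streams pile up on the request and hurt efficiency.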
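Author: 熊颀
Link: https://www.jianshu.com/p/d96d07c28a14
Source: 简书 (Jianshu)
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.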
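The multi-file loop above iterates over a files collection without showing where it comes from. One way to build it, sketched here with only the JDK, is to list the PDF files in a directory and use each file name as the literal.id value. The class PdfFileCollector and both method names are hypothetical, and using the bare file name as the id assumes names are unique across the files you index.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: gather the PDF files that the indexing loop would iterate over.
class PdfFileCollector {

    // Lists regular files directly under dir whose name ends with .pdf (case-insensitive).
    static List<Path> collectPdfFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Value to send as literal.id; assumes file names are unique.
    static String solrIdFor(Path file) {
        return file.getFileName().toString();
    }

    public static void main(String[] args) throws IOException {
        Path dir = args.length > 0 ? Path.of(args[0]) : Path.of(".");
        for (Path pdf : collectPdfFiles(dir)) {
            System.out.println(solrIdFor(pdf) + " <- " + pdf.toAbsolutePath());
        }
    }
}
```

Each returned Path can be converted with toFile() and passed to addFile in the loop.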
Query (including highlighting):
// Query test
public static void findIndex1() throws IOException, SolrServerException {
List<String> url=new ArrayList<String>();
url.add("10.217.37.30:2181");
url.add("10.217.37.28:2182");
Optional<String> zkChroot=Optional.of("/");
CloudSolrClient.Builder builder=new CloudSolrClient.Builder(url,zkChroot);
CloudSolrClient solrClient=builder.build();
solrClient.setZkClientTimeout(10000);
solrClient.setZkConnectTimeout(1000000);
solrClient.setDefaultCollection("pdfcore");
SolrQuery query = new SolrQuery(); // create the query object
//query.set("q","*:*"); // set the search condition
query.setRows(10); // number of rows per page
// item_title:"苹果" means "contains"
query.setQuery("content:关怀礼品平台积分使用手册"); // set the query condition: a match on the content field, which goes through the field's analyzer
// query.setSort("roleId", SolrQuery.ORDER.asc); // sort ascending by roleId
query.setStart(0); // offset of the first row to return
// fields to return, comma-separated
// query.set("fl","id,roleId,roleName"); // return only these fields
// enable highlighting
query.setHighlight(true);
query.addHighlightField("content"); // field to highlight
query.setHighlightSimplePre("<em style='color:red;'>"); // highlight prefix
query.setHighlightSimplePost("</em>"); // highlight suffix
QueryResponse response = solrClient.query(query); // send the query
SolrDocumentList docs = response.getResults(); // query results
long cnt = docs.getNumFound(); // total number of matches
System.out.println("Total matches: "+cnt);
// Highlighting container: the outer map is keyed by document id, the inner map by highlighted field name, and the list holds the snippets
Map<String, Map<String, List<String>>> highlighting= response.getHighlighting(); // fetch the highlighting results
for (SolrDocument doc : docs) {
// System.out.println(doc);
System.out.println("-------------\r\n");
System.out.println("id:"+ doc.get("id") + ",author:"+ doc.get("author") + ",content:"+ doc.get("content"));
System.out.println("Number of matching documents: "+docs.getNumFound());
String id= (String)doc.get("id");
System.out.println("id:"+id);
Object content = doc.getFieldValue("content"); // may be a String or a List when the field is multiValued
System.out.println("content:"+content);
Map<String, List<String>> map=highlighting.get(id); // highlights for this document id
List<String> highlist = (map == null) ? null : map.get("content");
if (highlist != null && !highlist.isEmpty()) { // a document may have no snippet for this field
System.out.println("Highlighted pdf content: "+highlist.get(0));
}
}
solrClient.close();
}
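The nested map returned by getHighlighting() is easy to misread, and a lookup can come back empty when a document has no snippet for the requested field. The helper below is a JDK-only sketch of a null-safe lookup over a map of the same shape (document id, then field name, then list of snippets); the class HighlightLookup and the method firstSnippet are hypothetical names, and the sample data in main is made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: null-safe access to a highlighting map shaped like
// docId -> (fieldName -> snippets).
class HighlightLookup {

    static Optional<String> firstSnippet(Map<String, Map<String, List<String>>> highlighting,
                                         String docId, String field) {
        if (highlighting == null) {
            return Optional.empty(); // highlighting was not requested or not returned
        }
        Map<String, List<String>> perDoc = highlighting.get(docId);
        if (perDoc == null) {
            return Optional.empty(); // no entry for this document
        }
        List<String> snippets = perDoc.get(field);
        if (snippets == null || snippets.isEmpty()) {
            return Optional.empty(); // document matched, but not in this field
        }
        return Optional.ofNullable(snippets.get(0));
    }

    public static void main(String[] args) {
        // Made-up sample data in the same shape as the response map.
        Map<String, Map<String, List<String>>> hl = new HashMap<>();
        hl.put("solr-word.pdf", Collections.singletonMap("content",
                Collections.singletonList("<em style='color:red;'>manual</em> text")));
        System.out.println(firstSnippet(hl, "solr-word.pdf", "content").orElse("(no highlight)"));
        System.out.println(firstSnippet(hl, "other.pdf", "content").orElse("(no highlight)"));
    }
}
```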
Delete:
// Delete test
public static void deleteIndexById() throws IOException, SolrServerException {
List<String> url=new ArrayList<String>();
url.add("10.217.37.30:2181");
url.add("10.217.37.28:2182");
Optional<String> zkChroot=Optional.of("/");
CloudSolrClient.Builder builder=new CloudSolrClient.Builder(url,zkChroot);
CloudSolrClient solrClient=builder.build();
solrClient.setZkClientTimeout(10000);
solrClient.setZkConnectTimeout(1000000);
solrClient.setDefaultCollection("pdfcore");
// delete everything
//solrClient.deleteByQuery("*:*");
// delete whatever matches a query (the query goes through the field's analyzer)
solrClient.deleteByQuery("id:solr-word.pdf");
// delete by a specific id: //solrClient.deleteById("1");
solrClient.commit();
solrClient.close();
System.out.println("delete ok! \r\n");
}
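Because deleteByQuery takes a query string, characters such as - or : inside an id are seen by the query parser. SolrJ ships ClientUtils.escapeQueryChars for this purpose, and deleteById avoids query parsing altogether. The standalone version below only illustrates the idea with the JDK; the class QueryEscapeSketch and the method idQuery are hypothetical, and the character list is a best-effort copy of the usual query-syntax characters rather than an authoritative one.

```java
// Hypothetical sketch: backslash-escape query-syntax characters before building "id:<value>".
class QueryEscapeSketch {

    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds a query that targets one id literally.
    static String idQuery(String id) {
        return "id:" + escapeQueryChars(id);
    }

    public static void main(String[] args) {
        System.out.println(idQuery("solr-word.pdf")); // prints id:solr\-word.pdf
    }
}
```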
Problems: https://my.oschina.net/3iVgTIG4E/blog/468439
On the Java side, only this Maven dependency is needed:
<dependency> <groupId>org.apache.solr</groupId> <artifactId>solr-solrj</artifactId> <version>8.5.2</version> </dependency>
On the Solr 8 side, lib needs solr-cell-8.5.2.jar from the dist directory.