Analyzers

TWOFOUR_

Published 2020-02-25 14:49:55


Article link: https://blog.csdn.net/qq_43868329/article/details/104490631


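1. Analyzers

1.1 The default analyzer: StandardAnalyzer
When we create an index we pass an IndexWriterConfig object to the IndexWriter. During indexing every document goes through an analysis step, that is, tokenization. If no analyzer is specified, the StandardAnalyzer is used.
1.2 Chinese analyzers:

2. Index maintenance

2.1 Field attribute categories:
2.2 Adding to the index:
2.3 Updating the index:
2.4 Deleting from the index:

3. Querying

Query subclasses:
TermQuery: …
1.2 Viewing what an analyzer produces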

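import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // 1. Create an Analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // 2. Call tokenStream() to obtain a TokenStream that yields every token of the text
        TokenStream tokenStream = analyzer.tokenStream("", "The spring Framework provides a comprehensive programming and configuration model.");
        // 3. Register a CharTermAttribute; it always exposes the text of the token the stream is currently on
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. reset() must be called before the first incrementToken(), otherwise an exception is thrown
        tokenStream.reset();
        // 5. Walk through the tokens: incrementToken() returns true while a token is available, false at the end
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Signal the end of the stream, then release resources
        tokenStream.end();
        tokenStream.close();
        analyzer.close();
    }
}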

The default StandardAnalyzer handles English text well, but how does it analyze Chinese?
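As a rough answer: StandardAnalyzer follows Unicode word-break rules, and for Chinese this means every Han character comes out as a separate token, so multi-character words are never kept together. The stdlib-only sketch below imitates that behaviour so it can be tried without Lucene on the classpath. It is an approximation for illustration, not Lucene code, and the names `UnigramSketch` and `tokenize` are made up here.

```java
import java.util.ArrayList;
import java.util.List;

class UnigramSketch {

    // Runs of non-Han letters/digits become one lowercased token, every Han
    // character becomes a token of its own, anything else acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // A Han character or a separator ends the current Latin/digit run.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, 全, 文, 检, 索]
        System.out.println(tokenize("Lucene全文检索"));
    }
}
```

With one token per character, a search for a word such as 检索 can only be matched character by character, which is why a dictionary-based Chinese analyzer is usually preferred.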

1.2 Chinese analyzers
Third-party Chinese analyzer: IKAnalyzer
Steps for using IKAnalyzer:
1. Add the dependency

<dependency>
<groupId>com.jianggujin</groupId>
<artifactId>IKAnalyzer-lucene</artifactId>
<version>8.0.0</version>
</dependency>

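2. Configure IKAnalyzer by importing its configuration files
hotword.dic: the extension dictionary. Put new or trendy internet terms here so that the analyzer recognizes them as whole words when tokenizing.
stopword.dic: the stop-word dictionary. Put meaningless words and sensitive words here so that they are dropped during analysis.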
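The effect of the two dictionaries can be illustrated with a toy dictionary-based segmenter. The sketch below uses plain forward maximum matching, which is a deliberate simplification and not IKAnalyzer's actual algorithm. The class `DictSegmenterSketch`, its `segment` method and the sample words are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictSegmenterSketch {

    // At each position take the longest dictionary word that starts there,
    // falling back to a single character, then drop anything in stopWords.
    // Works on char indices, so it assumes text without surrogate pairs.
    static List<String> segment(String text, Set<String> dictionary, Set<String> stopWords) {
        int maxLen = 1;
        for (String w : dictionary) {
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = text.substring(i, i + 1);
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                String candidate = text.substring(i, i + len);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            i += match.length();
            if (!stopWords.contains(match)) {
                tokens.add(match);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "五道口的课工场";
        Set<String> base = new HashSet<>(Arrays.asList("五道口"));
        // Without extension words and stop words: [五道口, 的, 课, 工, 场]
        System.out.println(segment(text, base, new HashSet<>()));

        Set<String> extended = new HashSet<>(base);
        extended.add("课工场");                                  // plays the role of hotword.dic
        Set<String> stop = new HashSet<>(Arrays.asList("的"));   // plays the role of stopword.dic
        // With both dictionaries: [五道口, 课工场]
        System.out.println(segment(text, extended, stop));
    }
}
```

Adding a word to the extension dictionary keeps it together as one token, while adding a word to the stop-word dictionary removes it from the output entirely.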

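When editing the extension and stop-word dictionaries, never use Windows Notepad: it saves files as UTF-8 with a BOM (at least in older Windows versions), whereas the dictionaries should be plain UTF-8 without a BOM.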
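The reason a BOM is harmful is that the file then begins with the invisible character U+FEFF, so the first dictionary entry no longer equals the word that was typed. The following sketch shows the problem and one defensive way to read such a file. It assumes a one-word-per-line dictionary format; `BomSketch` and `readWords` are hypothetical helpers, not part of IKAnalyzer, and this is not how IKAnalyzer itself loads dictionaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BomSketch {

    static final char BOM = '\uFEFF';

    // Decodes the bytes as UTF-8, drops a leading BOM if present,
    // and returns the non-empty lines as dictionary words.
    static List<String> readWords(byte[] fileBytes) {
        String content = new String(fileBytes, StandardCharsets.UTF_8);
        if (!content.isEmpty() && content.charAt(0) == BOM) {
            content = content.substring(1);
        }
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Simulates a file saved as UTF-8 with BOM: U+FEFF encodes to the bytes EF BB BF.
        byte[] saved = "\uFEFFlucene\nikanalyzer\n".getBytes(StandardCharsets.UTF_8);

        // Naive reading keeps the BOM glued to the first word.
        String firstLine = new String(saved, StandardCharsets.UTF_8).split("\\R")[0];
        System.out.println(firstLine.equals("lucene")); // prints false

        System.out.println(readWords(saved)); // prints [lucene, ikanalyzer]
    }
}
```

In practice the simpler fix is to save the dictionaries with an editor that can write UTF-8 without a BOM.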

3. Tokenize with IKAnalyzer

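import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
// Assumption: the dependency above keeps the package name of the original IK Analyzer project
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IkAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // 1. Create an Analyzer
        Analyzer analyzer = new IKAnalyzer();
        // 2. Call tokenStream() to obtain a TokenStream that yields every token of the text
        TokenStream tokenStream = analyzer.tokenStream("", "五道口课工场安装mysql-5.7.22-winx64后数据库服务启动报错：本地计算机上的mysql服务启动停止后，某些服务未由其他服务或程序使用时将自动停止而且mysql官网下载的压缩包解压出来没有网线上安装教... 博文 来自： 测试菜鸟在路上，呵呵");
        // 3. Register a CharTermAttribute; it always exposes the text of the token the stream is currently on
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. reset() must be called before the first incrementToken(), otherwise an exception is thrown
        tokenStream.reset();
        // 5. Walk through the tokens: incrementToken() returns true while a token is available, false at the end
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Signal the end of the stream, then release resources
        tokenStream.end();
        tokenStream.close();
        analyzer.close();
    }
}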
4. Use IKAnalyzer in the program
IndexWriter indexWriter=new IndexWriter(directory,new IndexWriterConfig(new IKAnalyzer()));

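2. Index maintenance: add, delete, update
2.1 Field attribute categories
A document we add contains several fields, and we choose the type of each field ourselves. The previous example used TextField, which is tokenized automatically and then stored.
We should pick the field type that fits the data type and the way the data will be used.

Field classes:
StringField(fieldName, fieldValue, Store.YES/NO): holds a string; the whole value is indexed as a single term without going through the analyzer; whether the value is stored depends on Store
StoredField(fieldName, fieldValue): supports several data types; not analyzed and not indexed; the value is always stored in the index
LongPoint(name, value): indexed so that the number can be matched exactly or by range; not tokenized by the analyzer; the value is not stored, so add a StoredField with the same name if it has to be read back
TextField(fieldName, fieldValue, Store.YES/NO): analyzed and indexed; whether the value is stored depends on Store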

2.2 Adding to the index
@Test
public void createDocument() throws IOException {
    // Create the IndexWriter. Argument 1: location of the index. Argument 2: configuration
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    // Create a document
    Document document = new Document();
    document.add(new TextField("fieldName", "hehe.txt", Field.Store.YES));
    document.add(new StoredField("fieldPath", "c://hehe.txt"));
    // LongPoint makes fieldSize searchable by range; the StoredField of the same name keeps the value retrievable
    document.add(new LongPoint("fieldSize", 123));
    document.add(new StoredField("fieldSize", 123));
    document.add(new TextField("fieldContent", "ojdbc14和ikanalyzer的maven找不到的解决办法,手动发布oJdbc14到maven仓库,手动发布ikanalyzer到maven,同时本教程适用于所有jar包发布 下载 IKAnalyzer结合Lucene使用和单独使用例子 简单性能测试 11-26 阅读数 1890 IKAnalyzer是一个开源基于JAVA语言的 .", Field.Store.YES));
    // Index the document by adding it to the index
    indexWriter.addDocument(document);
    // Close the writer, which also commits the change
    indexWriter.close();
}

2.3 Updating the index: how it works is delete first, then add
/**
 * Updates the index: replaces the documents whose fieldName field contains the term 全文检索.
 * @throws IOException
 */
@Test
public void updateDocument() throws IOException {
    // Create the IndexWriter. Argument 1: location of the index. Argument 2: configuration
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));

    // Create the replacement document
    Document document = new Document();
    document.add(new TextField("fieldName", "new.txt", Field.Store.YES));
    document.add(new StoredField("fieldPath", "c://new.txt"));
    document.add(new LongPoint("fieldSize", 456));
    document.add(new StoredField("fieldSize", 456));
    document.add(new TextField("fieldContent", "修改fieldName为全文检索的文档，进行文档替换，先删除掉fieldName为全文检索的两个文档，再添加一个fileName为new的新文档", Field.Store.YES));

    // Update. Argument 1: the term that selects the documents to delete. Argument 2: the document added in their place
    indexWriter.updateDocument(new Term("fieldName", "全文检索"), document);

    // Close the writer, which also commits the change
    indexWriter.close();
}
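The delete-then-add principle has a consequence that is easy to miss: every document containing the term is deleted, but only one new document is added, so the index can shrink. The toy model below makes this visible without Lucene. `ToyIndex` and its methods are invented for this illustration; a document is reduced to the set of terms indexed for a single field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToyIndex {

    // Each entry is one document, represented only by its indexed terms.
    private final List<Set<String>> docs = new ArrayList<>();

    void addDocument(String... terms) {
        docs.add(new HashSet<>(Arrays.asList(terms)));
    }

    // Removes every document that contains the term; returns how many were removed.
    int deleteDocuments(String term) {
        int before = docs.size();
        docs.removeIf(doc -> doc.contains(term));
        return before - docs.size();
    }

    // Update = delete all documents matching the term, then add the new one.
    void updateDocument(String term, String... newTerms) {
        deleteDocuments(term);
        addDocument(newTerms);
    }

    int size() {
        return docs.size();
    }

    // Counts the documents that contain the term.
    int count(String term) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("全文检索", "txt");
        index.addDocument("全文检索", "pdf");
        index.addDocument("spring", "txt");

        index.updateDocument("全文检索", "new", "txt");

        System.out.println(index.size());            // prints 2
        System.out.println(index.count("全文检索")); // prints 0
        System.out.println(index.count("new"));      // prints 1
    }
}
```

If no document contains the term, nothing is deleted and the update behaves like a plain add.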

2.4 Deleting from the index
2.4.1 Delete everything: use with caution

@Test
public void deleteAllDocument() throws IOException {
    // Create the IndexWriter. Argument 1: location of the index. Argument 2: configuration
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));

    // Delete every document in the index
    indexWriter.deleteAll();
    // Close the writer, which also commits the change
    indexWriter.close();
}

2.4.2 Delete by field and term

/**
 * Deletes documents by field and term.
 * @throws IOException
 */
@Test
public void deleteByFieldAndTermDocument() throws IOException {
    // Create the IndexWriter. Argument 1: location of the index. Argument 2: configuration
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));

    // Express the delete condition as a query
    Query query = new TermQuery(new Term("fieldName", "全文检索"));
    // Delete every document that matches the query
    indexWriter.deleteDocuments(query);

    // Close the writer, which also commits the change
    indexWriter.close();
}

3. Querying the index

Query subclasses:
TermQuery: search by field and term

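// Not shown in the original post: the query tests use indexReader and indexSearcher,
// assumed here to be opened on the same index directory before each test
private IndexReader indexReader;
private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
    indexReader = DirectoryReader.open(FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath()));
    indexSearcher = new IndexSearcher(indexReader);
}

/**
 * TermQuery: search by field and term.
 * @throws IOException
 */
@Test
public void termQuery() throws IOException {
    // Build the query condition
    Query query = new TermQuery(new Term("fieldName", "new"));
    // Run the query, returning at most 10 hits
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Number of matching documents: " + topDocs.totalHits);

    // Get the matching documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc doc : scoreDocs) {
        // Load the document by its internal id
        Document document = indexSearcher.doc(doc.doc);
        // Print the stored field values
        System.out.println("fieldName:" + document.get("fieldName"));
        System.out.println("fieldPath:" + document.get("fieldPath"));
        System.out.println("fieldSize:" + document.get("fieldSize"));
        System.out.println("fieldContent:" + document.get("fieldContent"));
        System.out.println("==============================================================");
    }
    // Close the reader
    indexReader.close();
}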

RangeQuery: range search
Prerequisite: the value to be searched by range is indexed as a point when the document is created
document.add(new LongPoint("fieldSize",456));
document.add(new StoredField("fieldSize",456));

/**
 * Range search.
 * @throws IOException
 */
@Test
public void rangeQuery() throws IOException {
    // Build the range condition. Argument 1: the field holding the value. Arguments 2 and 3: lower and upper bound, both inclusive
    Query query = LongPoint.newRangeQuery("fieldSize", 0, 50);
    // Run the query, returning at most 10 hits
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Number of matching documents: " + topDocs.totalHits);

    // Get the matching documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc doc : scoreDocs) {
        // Load the document by its internal id
        Document document = indexSearcher.doc(doc.doc);
        // Print the stored field values
        System.out.println("fieldName:" + document.get("fieldName"));
        System.out.println("fieldPath:" + document.get("fieldPath"));
        System.out.println("fieldSize:" + document.get("fieldSize"));
        System.out.println("fieldContent:" + document.get("fieldContent"));
        System.out.println("==============================================================");
    }

    // Close the reader
    indexReader.close();
}

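QueryParser: matches against a whole line of text, which is tokenized automatically
Query: Lucene是一个开源的基于Java的搜索库 is tokenized into: Lucene 一个开源基于…
Usage:
1. Add the dependency (only needed if the QueryParser dependency is not already present)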

<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>7.4.0</version>
</dependency>

2. Test code:

/**
 * QueryParser search: the search text itself is tokenized.
 * @throws IOException
 */
@Test
public void queryparser() throws IOException, ParseException {
    // Create a QueryParser. Argument 1: the field to search. Argument 2: the analyzer to use
    QueryParser parser = new QueryParser("fieldContent", new IKAnalyzer());
    // Parse the text to match into a query
    Query query = parser.parse("Lucene是一个开源的基于Java的搜索库");
    // Run the query, returning at most 10 hits
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Number of matching documents: " + topDocs.totalHits);

    // Get the matching documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc doc : scoreDocs) {
        // Load the document by its internal id
        Document document = indexSearcher.doc(doc.doc);
        // Print the stored field values
        System.out.println("fieldName:" + document.get("fieldName"));
        System.out.println("fieldPath:" + document.get("fieldPath"));
        System.out.println("fieldSize:" + document.get("fieldSize"));
        System.out.println("fieldContent:" + document.get("fieldContent"));
        System.out.println("==============================================================");
    }

    // Close the reader
    indexReader.close();
}
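What QueryParser does with the tokenized search text can be pictured with a tiny inverted index. As far as I understand the classic QueryParser defaults, the terms produced by the analyzer are combined with OR, so a document matches as soon as it contains any one of them. The sketch below models only that matching rule: it does no scoring, it expects text that is already tokenized, and `OrQuerySketch` with its methods is made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class OrQuerySketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Records the already tokenized terms of one document.
    void index(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // OR semantics: a document is a hit if it contains at least one query term.
    Set<Integer> search(String... queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        OrQuerySketch index = new OrQuerySketch();
        index.index(1, "lucene", "开源", "搜索库");
        index.index(2, "ikanalyzer", "开源", "分词");
        index.index(3, "mysql", "数据库");

        // Query terms as an analyzer might produce them from the search text
        System.out.println(index.search("lucene", "一个", "开源")); // prints [1, 2]
    }
}
```

This is also the practical difference from TermQuery, which looks up the given term as is and does not run it through an analyzer.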
