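Today I am starting to publish my notes on the Lucene source code, first to check how deeply I have actually understood it, and second to keep a record for later reference. I have been studying Lucene for a while now, and as I work through more modules the amount of material keeps growing, which is what prompted me to start this blog. Lucene's main modules are: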
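- The analysis module handles lexical analysis and language processing, producing Terms.
- The index module handles index creation; IndexWriter lives here.
- The store module handles reading and writing the index.
- The QueryParser handles query syntax parsing.
- The search module handles searching the index.
- The similarity module implements relevance scoring.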
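This first installment looks at Analyzer in the analysis source code. Text analysis is essentially the process of turning input text into a text feature vector. An Analyzer has two core components, Tokenizer and TokenFilter. The difference between them is that the former processes the stream at the character level, while the latter processes it at the token level. Tokenizer is the first step of an Analyzer, and its constructor takes a Reader as its argument; TokenFilter is a filter applied to the resulting tokens.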
[Figure: Analyzer class diagram]
The sections below break the code down in detail, working from the top.
1. The TokenStreamComponents class:
TokenStreamComponents plays a role similar to a struct in C++: it simply groups a Tokenizer, a TokenStream, and a ReusableStringReader together.
public static class TokenStreamComponents {
  /**
   * Original source of the tokens.
   */
  protected final Tokenizer source;
  /**
   * Sink tokenstream, such as the outer tokenfilter decorating
   * the chain. This can be the source if there are no filters.
   */
  protected final TokenStream sink;
  /** Internal cache only used by {@link Analyzer#tokenStream(String, String)}. */
  transient ReusableStringReader reusableStringReader;
  ...
  ...
  ...
}
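To make the source/sink pairing concrete, here is a minimal sketch in plain Java. It does not use Lucene: `SimpleTokenStream`, `WhitespaceSource`, `LowerCaseStage`, and `Components` are hypothetical stand-ins invented for this illustration, assuming a stream that simply returns one token per call. The source works on characters, the stage that decorates it works on tokens, and the holder keeps both ends of the chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: every type here is a simplified stand-in, not a Lucene class.
final class ChainDemo {
    interface SimpleTokenStream {
        String next(); // returns the next token, or null when exhausted
    }

    // Character-level stage: scans the raw text and splits it on whitespace.
    static final class WhitespaceSource implements SimpleTokenStream {
        private final String text;
        private int pos = 0;

        WhitespaceSource(String text) {
            this.text = text;
        }

        @Override
        public String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return null;
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            return text.substring(start, pos);
        }
    }

    // Token-level stage: decorates another stream and lowercases each token.
    static final class LowerCaseStage implements SimpleTokenStream {
        private final SimpleTokenStream input;

        LowerCaseStage(SimpleTokenStream input) {
            this.input = input;
        }

        @Override
        public String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    // Struct-like holder: the head of the chain (source) and its tail (sink).
    static final class Components {
        final SimpleTokenStream source;
        final SimpleTokenStream sink;

        Components(SimpleTokenStream source, SimpleTokenStream sink) {
            this.source = source;
            this.sink = sink;
        }
    }

    // Builds the chain and drains tokens from the sink end.
    static List<String> analyze(String text) {
        SimpleTokenStream source = new WhitespaceSource(text);
        Components components = new Components(source, new LowerCaseStage(source));
        List<String> tokens = new ArrayList<>();
        for (String t = components.sink.next(); t != null; t = components.sink.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello Lucene World")); // prints [hello, lucene, world]
    }
}
```

Consumers only ever read from the sink; the source is kept so that new input can be fed into the head of the chain later.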
2. The Analyzer class:
public abstract class Analyzer implements Closeable {
  private final ReuseStrategy reuseStrategy;
  // An important field: the reuse strategies keep their cached components here.
  CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<Object>();

  public Analyzer() {
    // By default the analyzer is initialized with GLOBAL_REUSE_STRATEGY.
    this(GLOBAL_REUSE_STRATEGY);
  }

  // An important abstract method: subclasses override it to build the actual
  // Tokenizer and TokenStream chain.
  protected abstract TokenStreamComponents createComponents(String fieldName,
      Reader reader);

  // Callers invoke this method to obtain the assembled chain. The reuse decision
  // is delegated to reuseStrategy (strategy pattern).
  public final TokenStream tokenStream(final String fieldName,
      final Reader reader) throws IOException {
    // First fetch the cached components. On the first call this returns null.
    // (The reuse strategy classes are explained in detail later.)
    TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
    final Reader r = initReader(fieldName, reader);
    if (components == null) {
      // On the first call, invoke createComponents to build the actual Tokenizer
      // and TokenStream chain; that method is implemented in the subclass.
      components = createComponents(fieldName, r);
      // Cache the components for reuse. (Covered in the reuse strategy section.)
      reuseStrategy.setReusableComponents(this, fieldName, components);
    } else {
      components.setReader(r);
    }
    return components.getTokenStream();
  }

  // Same as the previous method, except that it accepts a String directly.
  public final TokenStream tokenStream(final String fieldName, final String text) throws IOException {
    TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
    @SuppressWarnings("resource") final ReusableStringReader strReader =
        (components == null || components.reusableStringReader == null) ?
        new ReusableStringReader() : components.reusableStringReader;
    strReader.setValue(text);
    final Reader r = initReader(fieldName, strReader);
    if (components == null) {
      components = createComponents(fieldName, r);
      reuseStrategy.setReusableComponents(this, fieldName, components);
    } else {
      components.setReader(r);
    }
    components.reusableStringReader = strReader;
    return components.getTokenStream();
  }

  protected Reader initReader(String fieldName, Reader reader) {
    return reader;
  }
}
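The create-once, then reset-the-input flow of tokenStream can be traced with a small model. This is only a sketch under simplifying assumptions: `MiniAnalyzer`, `Components`, and `CountingAnalyzer` are hypothetical names, a plain `ThreadLocal` stands in for the reuse strategy plus `storedValue`, and the "reader" is reduced to a String.

```java
// Illustration only: a stripped-down model of the reuse flow, not Lucene code.
final class ReuseFlowDemo {
    // Stand-in for TokenStreamComponents: it only remembers its current input.
    static final class Components {
        String input;

        Components(String input) {
            this.input = input;
        }

        void setReader(String input) {
            this.input = input;
        }
    }

    // Stand-in for Analyzer: the final method fixes the flow, subclasses supply the chain.
    static abstract class MiniAnalyzer {
        private final ThreadLocal<Components> stored = new ThreadLocal<>();

        protected abstract Components createComponents(String fieldName, String input);

        final Components tokenStream(String fieldName, String input) {
            Components components = stored.get();
            if (components == null) {
                // First call on this thread: build the chain and cache it.
                components = createComponents(fieldName, input);
                stored.set(components);
            } else {
                // Later calls: keep the chain, only swap the input.
                components.setReader(input);
            }
            return components;
        }
    }

    // Counts how often the (potentially expensive) chain construction runs.
    static final class CountingAnalyzer extends MiniAnalyzer {
        int created = 0;

        @Override
        protected Components createComponents(String fieldName, String input) {
            created++;
            return new Components(input);
        }
    }

    public static void main(String[] args) {
        CountingAnalyzer analyzer = new CountingAnalyzer();
        Components first = analyzer.tokenStream("title", "first text");
        Components second = analyzer.tokenStream("body", "second text");
        System.out.println(first == second);   // prints true
        System.out.println(analyzer.created);  // prints 1
        System.out.println(second.input);      // prints second text
    }
}
```

The point of the design is that building the chain happens once per thread, while each later call pays only for resetting the input.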
3. The ReuseStrategy class and its subclass GlobalReuseStrategy (the reuse strategy classes; PerFieldReuseStrategy is not analyzed here):
First, ReuseStrategy:
public static abstract class ReuseStrategy {
  // These two methods are the part of the interface that really matters.
  // How do subclasses implement them?
  public abstract TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName);
  public abstract void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components);

  protected final Object getStoredValue(Analyzer analyzer) {
    if (analyzer.storedValue == null) {
      throw new AlreadyClosedException("this Analyzer is closed");
    }
    return analyzer.storedValue.get();
  }

  protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
    if (analyzer.storedValue == null) {
      throw new AlreadyClosedException("this Analyzer is closed");
    }
    analyzer.storedValue.set(storedValue);
  }
}
Now its subclass:
public final static class GlobalReuseStrategy extends ReuseStrategy {
  // As the code shows, it simply delegates to getStoredValue and
  // setStoredValue in the parent class.
  @Override
  public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
    return (TokenStreamComponents) getStoredValue(analyzer);
  }

  @Override
  public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
    setStoredValue(analyzer, components);
  }
}
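Notice that GlobalReuseStrategy never looks at fieldName: whatever was cached is handed back for every field. The sketch below models that with hypothetical classes (`Slot`, `Strategy`, `Global`, `PerField`); `Slot` stands in for the analyzer's storedValue with thread-locality left out. For contrast it also includes a per-field variant that keeps a map keyed by field name. That half is an assumption about how such a strategy could look, since PerFieldReuseStrategy is not analyzed in this post.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: simplified stand-ins, not the Lucene classes.
final class ReuseStrategyDemo {
    // Stand-in for Analyzer.storedValue (thread-locality omitted for brevity).
    static final class Slot {
        Object storedValue;
    }

    static abstract class Strategy {
        abstract Object get(Slot slot, String fieldName);

        abstract void set(Slot slot, String fieldName, Object components);
    }

    // Global idea: a single cached object, fieldName is ignored.
    static final class Global extends Strategy {
        @Override
        Object get(Slot slot, String fieldName) {
            return slot.storedValue;
        }

        @Override
        void set(Slot slot, String fieldName, Object components) {
            slot.storedValue = components;
        }
    }

    // Assumed per-field variant: the slot holds a map from field name to components.
    static final class PerField extends Strategy {
        @SuppressWarnings("unchecked")
        @Override
        Object get(Slot slot, String fieldName) {
            Map<String, Object> byField = (Map<String, Object>) slot.storedValue;
            return byField == null ? null : byField.get(fieldName);
        }

        @SuppressWarnings("unchecked")
        @Override
        void set(Slot slot, String fieldName, Object components) {
            Map<String, Object> byField = (Map<String, Object>) slot.storedValue;
            if (byField == null) {
                byField = new HashMap<>();
                slot.storedValue = byField;
            }
            byField.put(fieldName, components);
        }
    }

    public static void main(String[] args) {
        Object components = new Object();

        Slot globalSlot = new Slot();
        Strategy global = new Global();
        global.set(globalSlot, "title", components);
        System.out.println(global.get(globalSlot, "body") == components);      // prints true

        Slot perFieldSlot = new Slot();
        Strategy perField = new PerField();
        perField.set(perFieldSlot, "title", components);
        System.out.println(perField.get(perFieldSlot, "body") == null);        // prints true
        System.out.println(perField.get(perFieldSlot, "title") == components); // prints true
    }
}
```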
In the end, every call lands on storedValue in Analyzer, whose full declaration is:
CloseableThreadLocal<Object> storedValue = new CloseableThreadLocal<Object>();
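Once we lift the veil on CloseableThreadLocal, the analysis of Analyzer's overall framework is complete. This is the last step, but it packs in a lot of key points.
Enough talk, on to the code!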
public class CloseableThreadLocal<T> implements Closeable {
  // Key points 1 and 2
  private ThreadLocal<WeakReference<T>> t = new ThreadLocal<WeakReference<T>>();
  private Map<Thread,T> hardRefs = new WeakHashMap<Thread,T>();
  private static int PURGE_MULTIPLIER = 20;
  // Key point 3
  private final AtomicInteger countUntilPurge = new AtomicInteger(PURGE_MULTIPLIER);

  // Meant to be overridden by subclasses.
  protected T initialValue() {
    return null;
  }

  public T get() {
    WeakReference<T> weakRef = t.get();
    if (weakRef == null) {
      T iv = initialValue();
      if (iv != null) {
        set(iv);
        return iv;
      } else {
        return null;
      }
    } else {
      maybePurge();
      return weakRef.get();
    }
  }

  public void set(T object) {
    t.set(new WeakReference<T>(object));
    synchronized(hardRefs) {
      hardRefs.put(Thread.currentThread(), object);
      maybePurge();
    }
  }

  private void maybePurge() {
    if (countUntilPurge.getAndDecrement() == 0) {
      purge();
    }
  }

  // Purge dead threads
  private void purge() {
    synchronized(hardRefs) {
      int stillAliveCount = 0;
      for (Iterator<Thread> it = hardRefs.keySet().iterator(); it.hasNext();) {
        final Thread t = it.next();
        if (!t.isAlive()) {
          it.remove();
        } else {
          stillAliveCount++;
        }
      }
      int nextCount = (1+stillAliveCount) * PURGE_MULTIPLIER;
      if (nextCount <= 0) {
        // defensive: int overflow!
        nextCount = 1000000;
      }
      countUntilPurge.set(nextCount);
    }
  }

  @Override
  public void close() {
    hardRefs = null;
    if (t != null) {
      t.remove();
    }
    t = null;
  }
}
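The purge schedule is easy to misread, so here is a small model of just the counter arithmetic from maybePurge() and purge(). It is a sketch, not the Lucene class: `PurgeCountdownDemo` and its helper methods are hypothetical, and the number of live threads is passed in as a parameter instead of being counted from a map.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: models the countdown logic, with the live-thread count supplied by the caller.
final class PurgeCountdownDemo {
    static final int PURGE_MULTIPLIER = 20;

    private final AtomicInteger countUntilPurge = new AtomicInteger(PURGE_MULTIPLIER);

    // Returns true when this call is the one that triggers a purge.
    boolean maybePurge(int stillAliveCount) {
        if (countUntilPurge.getAndDecrement() == 0) {
            // Same rescheduling rule as in purge(): more live threads, longer interval.
            int nextCount = (1 + stillAliveCount) * PURGE_MULTIPLIER;
            if (nextCount <= 0) {
                nextCount = 1000000; // defensive guard against int overflow
            }
            countUntilPurge.set(nextCount);
            return true;
        }
        return false;
    }

    // Number of calls needed until the next purge fires.
    int callsUntilPurge(int stillAliveCount) {
        int calls = 0;
        do {
            calls++;
        } while (!maybePurge(stillAliveCount));
        return calls;
    }

    public static void main(String[] args) {
        PurgeCountdownDemo demo = new PurgeCountdownDemo();
        System.out.println(demo.callsUntilPurge(2)); // prints 21
        System.out.println(demo.callsUntilPurge(2)); // prints 61
    }
}
```

Because getAndDecrement() returns the value before decrementing, a counter of 20 fires on the 21st call; after a purge that saw two live threads, the counter is reset to (1 + 2) * 20 = 60 and fires again 61 calls later.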
For background on key points 1, 2, and 3, see the blog posts listed below.
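Key points 1 and 2 sit on the two fields that hold the value: a ThreadLocal of WeakReference, plus a WeakHashMap from Thread to the value. A minimal sketch of how those two pieces cooperate follows. `PerThreadCache` is a hypothetical, cut-down model of the same fields (no purge, no close), and the demo only checks per-thread isolation, since garbage-collection timing cannot be asserted reliably.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Illustration only: a cut-down model of the two fields, not the Lucene class.
final class PerThreadCacheDemo {
    static final class PerThreadCache<T> {
        // Each thread sees its own weak reference to its value.
        private final ThreadLocal<WeakReference<T>> t = new ThreadLocal<>();
        // The strong reference lives here, keyed weakly by the owning thread.
        private final Map<Thread, T> hardRefs = new WeakHashMap<>();

        T get() {
            WeakReference<T> ref = t.get();
            return ref == null ? null : ref.get();
        }

        void set(T value) {
            t.set(new WeakReference<>(value));
            synchronized (hardRefs) {
                hardRefs.put(Thread.currentThread(), value);
            }
        }
    }

    // Returns what the worker saw before and after its own set, and what main sees afterwards.
    static String[] run() throws InterruptedException {
        PerThreadCache<String> cache = new PerThreadCache<>();
        String[] seen = new String[3];
        cache.set("main-value");
        Thread worker = new Thread(() -> {
            seen[0] = cache.get();        // the worker has not set anything yet
            cache.set("worker-value");
            seen[1] = cache.get();
        });
        worker.start();
        worker.join();                    // join makes the worker's writes visible here
        seen[2] = cache.get();            // main's value is untouched by the worker
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] seen = run();
        System.out.println(seen[0]); // prints null
        System.out.println(seen[1]); // prints worker-value
        System.out.println(seen[2]); // prints main-value
    }
}
```

The ThreadLocal entry alone holds the value only weakly, so the map is what keeps the value alive; dropping the map lets every thread's value become collectable at once, without waiting for each thread to exit.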
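Note:
The blog posts referenced for the key points above are:
1. https://blog.csdn.net/u013803262/article/details/72452932
2. https://blog.csdn.net/qq_33663983/article/details/78349641
3. https://blog.csdn.net/mccand1234/article/details/54173084
If anything here infringes your rights, please contact me and I will remove it.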