Lucene Series 8: Analyzers in Detail

An analyzer tokenizes text and applies linguistic processing to produce terms; both indexing and searching rely on it. A Lucene analyzer typically consists of one tokenizer (Tokenizer) and several filters (TokenFilter); the filters post-process the emitted tokens, for example removing stop or sensitive words, folding case, or normalizing singular/plural forms.

Contents

1. The analysis process

1.1 Analyzer

1.2 TokenStream

1.3 Tokenizer

1.4 TokenFilter

1.5 Stop words

1.6 Example

2. Chinese analyzers

3. Custom tokenizer and analyzer


1. The analysis process

Analysis starts from a Reader character stream: a Tokenizer built on that Reader splits the text, and the resulting stream is passed through a chain of TokenFilters to produce the smallest lexical units, the tokens.

1.1 Analyzer

The analyzer (Analyzer) combines a Tokenizer with TokenFilters; its job is to compose them sensibly so that text is first tokenized and then filtered. In that sense an analyzer works like a Linux pipe: text goes in one end and a stream of tokens comes out the other. StandardAnalyzer is one of the analyzers that ship with Lucene.

Analyzer is an abstract class. It exposes two methods for producing a TokenStream:

  • public final TokenStream tokenStream(String fieldName, Reader reader)
  • public final TokenStream tokenStream(String fieldName, String text)

Here fieldName is the name of the field you index against, e.g. the "title" in Field f = new Field("title","hello",Field.Store.YES, Field.Index.TOKENIZED); (an older, Lucene 3.x-style Field constructor).
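Because the field name selects the analysis chain, it is common to analyze different fields differently. A minimal sketch (the field names "sku" and "title" are made up for illustration) using Lucene's PerFieldAnalyzerWrapper:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch: route each field to its own analyzer; fields not in the map fall back to the default.
Map<String, Analyzer> perField = new HashMap<>();
perField.put("sku", new WhitespaceAnalyzer());   // e.g. keep product codes intact, split on spaces only
Analyzer wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
// wrapper.tokenStream("title", ...) uses StandardAnalyzer
// wrapper.tokenStream("sku", ...)   uses WhitespaceAnalyzer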

For performance, a thread can reuse its existing TokenStream instead of creating a new one on every call. Older Lucene versions exposed this as reusableTokenStream, storing the previously created TokenStream in a CloseableThreadLocal via setPreviousTokenStream and getPreviousTokenStream. Current versions implement the same idea through a ReuseStrategy together with the CloseableThreadLocal<Object> storedValue field, as the source below shows.
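As a minimal sketch (Lucene 7.x-style imports assumed), this is what choosing a reuse strategy looks like when defining an analyzer; by default an Analyzer uses GLOBAL_REUSE_STRATEGY:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: cache one TokenStreamComponents instance per field name (and per thread)
// instead of a single shared instance per thread.
Analyzer perFieldReuse = new Analyzer(Analyzer.PER_FIELD_REUSE_STRATEGY) {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source, new LowerCaseFilter(source));
    }
};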

Let's get a feel for the core source code (decompiled):

// Decompiled from org.apache.lucene.analysis.Analyzer
public abstract class Analyzer implements Closeable {
    private final Analyzer.ReuseStrategy reuseStrategy;
    private Version version;
    CloseableThreadLocal<Object> storedValue;
    // Global reuse: one cached TokenStreamComponents instance per thread, shared by all fields
    public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY = new Analyzer.ReuseStrategy() {
        public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
            return (Analyzer.TokenStreamComponents)this.getStoredValue(analyzer);
        }

        public void setReusableComponents(Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
            this.setStoredValue(analyzer, components);
        }
    };
    // Field-level reuse: one cached TokenStreamComponents instance per field name (per thread)
    public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY = new Analyzer.ReuseStrategy() {
        public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
            Map componentsPerField = (Map)this.getStoredValue(analyzer);
            return componentsPerField != null?(Analyzer.TokenStreamComponents)componentsPerField.get(fieldName):null;
        }

        public void setReusableComponents(Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
            Object componentsPerField = (Map)this.getStoredValue(analyzer);
            if(componentsPerField == null) {
                componentsPerField = new HashMap();
                this.setStoredValue(analyzer, componentsPerField);
            }

            ((Map)componentsPerField).put(fieldName, components);
        }
    };

    public Analyzer() {
        this(GLOBAL_REUSE_STRATEGY);
    }

    public Analyzer(Analyzer.ReuseStrategy reuseStrategy) {
        this.version = Version.LATEST;
        this.storedValue = new CloseableThreadLocal();
        this.reuseStrategy = reuseStrategy;
    }
    // The crucial extension point: subclasses build their Tokenizer/TokenFilter chain here
    protected abstract Analyzer.TokenStreamComponents createComponents(String var1);

    protected TokenStream normalize(String fieldName, TokenStream in) {
        return in;
    }

    public final TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer.TokenStreamComponents components = this.reuseStrategy.getReusableComponents(this, fieldName);
        Reader r = this.initReader(fieldName, reader);
        if(components == null) {
            components = this.createComponents(fieldName);
            this.reuseStrategy.setReusableComponents(this, fieldName, components);
        }

        components.setReader(r);
        return components.getTokenStream();
    }

    public final TokenStream tokenStream(String fieldName, String text) {
        Analyzer.TokenStreamComponents components = this.reuseStrategy.getReusableComponents(this, fieldName);
        // Wrap the String input in a per-thread ReusableStringReader
        ReusableStringReader strReader = components != null && components.reusableStringReader != null?components.reusableStringReader:new ReusableStringReader();
        strReader.setValue(text);
        Reader r = this.initReader(fieldName, strReader);
        if(components == null) {
            components = this.createComponents(fieldName);
            this.reuseStrategy.setReusableComponents(this, fieldName, components);
        }

        components.setReader(r);
        components.reusableStringReader = strReader;
        return components.getTokenStream();
    }

    public final BytesRef normalize(String fieldName, String text) {
        try {
            String e;
            try {
                StringReader attributeFactory = new StringReader(text);
                Throwable ts = null;

                try {
                    Reader filterReader = this.initReaderForNormalization(fieldName, attributeFactory);
                    char[] termAtt = new char[64];
                    StringBuilder term = new StringBuilder();

                    while(true) {
                        int read = filterReader.read(termAtt, 0, termAtt.length);
                        if(read == -1) {
                            e = term.toString();
                            break;
                        }

                        term.append(termAtt, 0, read);
                    }
                } catch (Throwable var37) {
                    ts = var37;
                    throw var37;
                } finally {
                    if(attributeFactory != null) {
                        if(ts != null) {
                            try {
                                attributeFactory.close();
                            } catch (Throwable var34) {
                                ts.addSuppressed(var34);
                            }
                        } else {
                            attributeFactory.close();
                        }
                    }

                }
            } catch (IOException var39) {
                throw new IllegalStateException("Normalization threw an unexpected exception", var39);
            }

            AttributeFactory attributeFactory1 = this.attributeFactory(fieldName);
            TokenStream ts1 = this.normalize(fieldName, (TokenStream)(new Analyzer.StringTokenStream(attributeFactory1, e, text.length())));
            Throwable filterReader1 = null;

            BytesRef read1;
            try {
                TermToBytesRefAttribute termAtt1 = (TermToBytesRefAttribute)ts1.addAttribute(TermToBytesRefAttribute.class);
                ts1.reset();
                if(!ts1.incrementToken()) {
                    throw new IllegalStateException("The normalization token stream is expected to produce exactly 1 token, but got 0 for analyzer " + this + " and input \"" + text + "\"");
                }

                BytesRef term1 = BytesRef.deepCopyOf(termAtt1.getBytesRef());
                if(ts1.incrementToken()) {
                    throw new IllegalStateException("The normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer " + this + " and input \"" + text + "\"");
                }

                ts1.end();
                read1 = term1;
            } catch (Throwable var35) {
                filterReader1 = var35;
                throw var35;
            } finally {
                if(ts1 != null) {
                    if(filterReader1 != null) {
                        try {
                            ts1.close();
                        } catch (Throwable var33) {
                            filterReader1.addSuppressed(var33);
                        }
                    } else {
                        ts1.close();
                    }
                }

            }

            return read1;
        } catch (IOException var40) {
            throw new IllegalStateException("Normalization threw an unexpected exception", var40);
        }
    }

    protected Reader initReader(String fieldName, Reader reader) {
        return reader;
    }

    protected Reader initReaderForNormalization(String fieldName, Reader reader) {
        return reader;
    }

    protected AttributeFactory attributeFactory(String fieldName) {
        return TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY;
    }

    public int getPositionIncrementGap(String fieldName) {
        return 0;
    }

    public int getOffsetGap(String fieldName) {
        return 1;
    }

    public final Analyzer.ReuseStrategy getReuseStrategy() {
        return this.reuseStrategy;
    }

    public void setVersion(Version v) {
        this.version = v;
    }

    public Version getVersion() {
        return this.version;
    }

    public void close() {
        if(this.storedValue != null) {
            this.storedValue.close();
            this.storedValue = null;
        }

    }

    private static final class StringTokenStream extends TokenStream {
        private final String value;
        private final int length;
        private boolean used = true;
        private final CharTermAttribute termAttribute = (CharTermAttribute)this.addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAttribute = (OffsetAttribute)this.addAttribute(OffsetAttribute.class);

        StringTokenStream(AttributeFactory attributeFactory, String value, int length) {
            super(attributeFactory);
            this.value = value;
            this.length = length;
        }

        public void reset() {
            this.used = false;
        }

        public boolean incrementToken() {
            if(this.used) {
                return false;
            } else {
                this.clearAttributes();
                this.termAttribute.append(this.value);
                this.offsetAttribute.setOffset(0, this.length);
                this.used = true;
                return true;
            }
        }

        public void end() throws IOException {
            super.end();
            this.offsetAttribute.setOffset(this.length, this.length);
        }
    }

    public abstract static class ReuseStrategy {
        public ReuseStrategy() {
        }

        public abstract Analyzer.TokenStreamComponents getReusableComponents(Analyzer var1, String var2);

        public abstract void setReusableComponents(Analyzer var1, String var2, Analyzer.TokenStreamComponents var3);

        protected final Object getStoredValue(Analyzer analyzer) {
            if(analyzer.storedValue == null) {
                throw new AlreadyClosedException("this Analyzer is closed");
            } else {
                return analyzer.storedValue.get();
            }
        }

        protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
            if(analyzer.storedValue == null) {
                throw new AlreadyClosedException("this Analyzer is closed");
            } else {
                analyzer.storedValue.set(storedValue);
            }
        }
    }

    public static class TokenStreamComponents {
        protected final Tokenizer source;
        protected final TokenStream sink;
        transient ReusableStringReader reusableStringReader;

        public TokenStreamComponents(Tokenizer source, TokenStream result) {
            this.source = source;
            this.sink = result;
        }

        public TokenStreamComponents(Tokenizer source) {
            this.source = source;
            this.sink = source;
        }

        protected void setReader(Reader reader) {
            this.source.setReader(reader);
        }

        public TokenStream getTokenStream() {
            return this.sink;
        }

        public Tokenizer getTokenizer() {
            return this.source;
        }
    }
}

public final class StandardAnalyzer extends StopwordAnalyzerBase {
    public static final CharArraySet ENGLISH_STOP_WORDS_SET;
    public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
    private int maxTokenLength;
    public static final CharArraySet STOP_WORDS_SET;

    public StandardAnalyzer(CharArraySet stopWords) {
        super(stopWords);
        this.maxTokenLength = 255;
    }

    public StandardAnalyzer() {
        this(STOP_WORDS_SET);
    }

    public StandardAnalyzer(Reader stopwords) throws IOException {
        this(loadStopwordSet(stopwords));
    }

    public void setMaxTokenLength(int length) {
        this.maxTokenLength = length;
    }

    public int getMaxTokenLength() {
        return this.maxTokenLength;
    }

    protected TokenStreamComponents createComponents(String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(this.maxTokenLength);
        StandardFilter tok = new StandardFilter(src);
        LowerCaseFilter tok1 = new LowerCaseFilter(tok);
        final StopFilter tok2 = new StopFilter(tok1, this.stopwords);
        return new TokenStreamComponents(src, tok2) {
            protected void setReader(Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    protected TokenStream normalize(String fieldName, TokenStream in) {
        StandardFilter result = new StandardFilter(in);
        LowerCaseFilter result1 = new LowerCaseFilter(result);
        return result1;
    }

    static {
        List stopWords = Arrays.asList(new String[]{"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"});
        CharArraySet stopSet = new CharArraySet(stopWords, false);
        ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
        STOP_WORDS_SET = ENGLISH_STOP_WORDS_SET;
    }
}

1.2 TokenStream

A TokenStream is a stream of the tokens produced by analysis; callers repeatedly pull the next token from it. Since Lucene 3.0 the old next() method has been replaced by incrementToken(), and an end() method was added. TokenStream extends AttributeSource, which keeps a map from attribute class to attribute instance, so values of different attribute types can be stored alongside the stream.
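Because the attributes live in that per-stream map, requesting the same attribute class twice returns the same instance. A quick sketch (the field name is arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: AttributeSource keeps exactly one instance per attribute interface.
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("f", "hello");
CharTermAttribute a = ts.addAttribute(CharTermAttribute.class);
CharTermAttribute b = ts.addAttribute(CharTermAttribute.class);
System.out.println(a == b);   // true: both references point to the same attribute instance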

TokenStream is an abstract class and the common base class of the analysis chain: both Tokenizer and TokenFilter extend it.

Its main methods are:

  • public abstract boolean incrementToken()      // advance to the next token
  • public void reset()                           // reset the stream to its initial state; calling incrementToken() without reset() throws an exception
  • public void end()                             // mark the end of the stream and record end-of-stream state (e.g. the final offset), in preparation for close()
  • public void close()                           // release the underlying resources

The attribute implementations it is most commonly used with are:

  • CharTermAttributeImpl (TermAttributeImpl in older versions)   // holds the token text
  • PositionIncrementAttributeImpl                                // holds the position increment
  • OffsetAttributeImpl                                           // holds the start/end offsets

To use them, register the attribute on the TokenStream, e.g. tokenStream.addAttribute(CharTermAttribute.class), and then read the values from the returned attribute instance after each call to incrementToken().
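For instance, a short sketch (StandardAnalyzer assumed) that reads both the term text and the position increment of every token; because the StopFilter removes "the", the token after it carries a position increment greater than 1:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Sketch: register attributes first, then the reset / incrementToken / end / close cycle.
try (Analyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("title", "The Quick Fox")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();                               // mandatory before the first incrementToken()
    while (ts.incrementToken()) {
        System.out.println(term + " (posIncr=" + posIncr.getPositionIncrement() + ")");
    }
    ts.end();                                 // records end-of-stream state such as the final offset
}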

1.3 Tokenizer

The tokenizer receives a Reader character stream and splits it into tokens. StandardTokenizer is one implementation.
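A Tokenizer is itself a TokenStream whose input is a Reader, so it can also be driven directly; a minimal sketch:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: feed a Reader to the tokenizer and pull tokens out of it.
Tokenizer tokenizer = new StandardTokenizer();
tokenizer.setReader(new StringReader("Hello Lucene World"));
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    System.out.println(term);   // Hello / Lucene / World: the tokenizer only splits, it does not lowercase
}
tokenizer.end();
tokenizer.close();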

1.4 TokenFilter

A TokenFilter preprocesses the minimal units produced by the tokenizer before they enter the index: converting uppercase to lowercase, reducing plurals to singular form, or even correcting misspelled words based on semantics.
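A sketch of such a chain (Lucene 7.x-style classes assumed): LowerCaseFilter handles the case folding, and PorterStemFilter illustrates the plural/inflection normalization mentioned above ("books" becomes "book", "running" becomes "run"):

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: each TokenFilter wraps the previous TokenStream; the last one is the "sink"
// an Analyzer would hand back from createComponents(...).
Tokenizer source = new StandardTokenizer();
TokenStream chain = new LowerCaseFilter(source);
chain = new PorterStemFilter(chain);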

1.5 Stop words

What are stop words? To save storage space and improve search efficiency, a search engine automatically ignores certain characters or words when indexing pages or processing queries; these are the stop words. Typical examples are modal particles, adverbs, prepositions and conjunctions, which usually carry no clear meaning on their own and only play a role inside a complete sentence, such as the common Chinese words 的, 在, 是, 啊.
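Lucene lets you supply your own stop-word list. A small sketch (the word list itself is illustrative, and the package of CharArraySet differs slightly between Lucene versions): build a CharArraySet and hand it to StandardAnalyzer, which applies it through its internal StopFilter:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch: a case-insensitive stop-word set mixing Chinese and English entries.
CharArraySet stopWords = new CharArraySet(
        Arrays.asList("的", "是", "啊", "the", "a", "an"), true /* ignoreCase */);
Analyzer analyzer = new StandardAnalyzer(stopWords);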

1.6 Example

@Test
public void go_splitToken() throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    String str = "hello lucene,I am alex,我是 中国人";
    TokenStream tokenStream = analyzer.tokenStream("", new StringReader(str));
    //attribute holding each token's offsets
    OffsetAttribute offsetAttr = tokenStream.addAttribute(OffsetAttribute.class);
    //attribute holding each token's text
    CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); //reset the stream to its initial state, otherwise incrementToken() throws
    while (tokenStream.incrementToken()) {
        System.out.print("-->" + offsetAttr.startOffset());
        System.out.print("-->" + offsetAttr.endOffset());
        System.out.println("-->" + new String(charTerm.toString()));
    }
    tokenStream.end();
    tokenStream.close();
    System.out.println();
}
/***output:
-->0-->5-->hello
-->6-->12-->lucene
-->13-->14-->i
-->15-->17-->am
-->18-->22-->alex
-->23-->24-->我
-->24-->25-->是
-->26-->27-->中
-->27-->28-->国
-->28-->29-->人
**/

2. Chinese Analyzers

There are quite a few open-source Chinese tokenization libraries, for example StandardAnalyzer, ChineseAnalyzer, CJKAnalyzer, IK_CAnalyzer, MIK_CAnalyzer, MMAnalyzer (the JE tokenizer) and PaodingAnalyzer. Pure Chinese tokenization is generally implemented either by character or by word. Indexing by character means exactly that: each individual character becomes an index unit. Indexing by word splits the text into words against a dictionary. Che Dong's (车东) overlapping two-character splitting, also called bigram tokenization, can be seen as an improvement of the per-character approach and arguably still belongs to that family.


By character

StandardAnalyzer

The standard analyzer that ships with Lucene.

 

ChineseAnalyzer

An analyzer shipped in Lucene contrib, similar to StandardAnalyzer. Similar, mind you; there are still differences.

 

CJKAnalyzer

The bigram tokenizer shipped in Lucene contrib.
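To see what bigram segmentation produces, a short sketch with CJKAnalyzer (the annotated output reflects how CJK bigramming works in general, not a verified run):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: bigramming emits every pair of adjacent CJK characters.
try (Analyzer analyzer = new CJKAnalyzer();
     TokenStream ts = analyzer.tokenStream("content", "我是中国人")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term);   // expected: 我是 / 是中 / 中国 / 国人
    }
    ts.end();
}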

By word

IK_CAnalyzer、MIK_CAnalyzer

http://lucene-group.group.iteye.com/group/blog/165287. Version used: 2.0.2.

 

MMAnalyzer

The latest version that can still be found is 1.5.3, but it is no longer available for download on the original site, and it has reportedly been declared unmaintained and unsupported. It is listed here because it is discussed a lot, but it did not feel very stable in use.

 

PaodingAnalyzer

Paoding ("庖丁解牛"). http://code.google.com/p/paoding/downloads/list. Version used: 2.0.4beta.

For typical applications, bigram tokenization is usually enough. If real word segmentation is required, then weighing segmentation quality, performance, extensibility and maintainability together, IK is the recommended choice, for the following reasons:

  • IK offers two modes: fine-grained ik_max_word and coarse-grained ik_smart.
  • Updating the IK dictionary only requires appending keywords to the word list; both local and remote dictionaries are supported.
  • The IK plugin is updated quickly and stays closely in step with the latest releases.

3. Custom Tokenizer and Analyzer

If you followed how a TokenStream is created in the source code above, you will have noticed that Analyzer.createComponents is the extension point.

1. Define your own Attribute interface and its implementation

public interface MyCharAttribute extends Attribute {
    void setChars(char[] buffer, int length);

    char[] getChars();

    int getLength();

    String getString();
}
public class MyCharAttributeImpl extends AttributeImpl implements MyCharAttribute {
    private char[] charTerm = new char[255];
    private int length = 0;

    @Override
    public void setChars(char[] buffer, int length) {
        this.length = length;
        if (length > 0) {
            System.arraycopy(buffer, 0, this.charTerm, 0, length);
        }
    }

    @Override
    public char[] getChars() {
        return this.charTerm;
    }

    @Override
    public int getLength() {
        return this.length;
    }

    @Override
    public String getString() {
        if (this.length > 0) {
            return new String(this.charTerm, 0, length);
        }
        return null;
    }

    @Override
    public void clear() {
        // called via clearAttributes() before each new token
        this.length = 0;
    }

    @Override
    public void reflectWith(AttributeReflector attributeReflector) {
        // expose the term text to debugging / inspection tools
        attributeReflector.reflect(MyCharAttribute.class, "term", getString());
    }

    @Override
    public void copyTo(AttributeImpl attribute) {
        // used by captureState()/restoreState(): copy our term into the target attribute
        ((MyCharAttribute) attribute).setChars(this.charTerm, this.length);
    }
}

2. Create the tokenizer MyWhitespaceTokenizer, which splits English text on whitespace characters

public final class MyWhitespaceTokenizer extends Tokenizer {
    // the attribute we fill in for each token
    private MyCharAttribute charAttr = this.addAttribute(MyCharAttribute.class);
    // buffer holding the characters of the current token
    char[] buffer = new char[255];
    int length = 0;
    int c;

    @Override
    public boolean incrementToken() throws IOException {
        // clear all attributes left over from the previous token
        clearAttributes();
        length = 0;
        while (true) {
            c = this.input.read();
            if (c == -1) {                          // end of input
                if (length > 0) {
                    // emit the final token
                    this.charAttr.setChars(buffer, length);
                    return true;
                } else {
                    return false;
                }
            }
            if (Character.isWhitespace(c)) {
                if (length > 0) {
                    // a token is complete, copy it into the attribute
                    this.charAttr.setChars(buffer, length);
                    return true;
                }
                continue;                           // skip leading/consecutive whitespace
            }
            buffer[length++] = (char) c;
        }
    }
}

3. Create a filter that converts uppercase letters to lowercase

public final class MyLowerCaseTokenFilter extends TokenFilter {
    private MyCharAttribute charAttr = this.addAttribute(MyCharAttribute.class);
    public MyLowerCaseTokenFilter(TokenStream input) {
        super(input);
    }
    @Override
    public boolean incrementToken() throws IOException {
        boolean res = this.input.incrementToken();
        if (res) {
            char[] chars = charAttr.getChars();
            int length = charAttr.getLength();
            if (length > 0) {
                for (int i = 0; i < length; i++) {
                    chars[i] = Character.toLowerCase(chars[i]);
                }
            }
        }
        return res;
    }
}

4. Create the analyzer

public final class MyWhitespaceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String s) {
        Tokenizer source = new MyWhitespaceTokenizer();
        TokenStream filter = new MyLowerCaseTokenFilter(source);
        return new TokenStreamComponents(source, filter);
    }
}

5. Run the analyzer

@Test
public void go_customAnalyzer() throws IOException {
    Analyzer analyzer = new MyWhitespaceAnalyzer();
    String str = "Hello Lucene AND Solr";   // sample input for the custom analyzer
    TokenStream tokenStream = analyzer.tokenStream("aa", str);
    MyCharAttribute myCharAttr = tokenStream.getAttribute(MyCharAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println(myCharAttr.getString());
    }
    tokenStream.end();
    tokenStream.close();
}

To sum up: a Lucene analyzer is built from a tokenizer and token filters. Master the core APIs of these objects; an analyzer is essentially a tool, and its internal implementation is not particularly complex.
