A Follow-up on Hadoop and Email Archiving
Most of the content on this site is original. Please credit the source, www.alexclouds.net, when reposting.
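I have previously written several blog posts recording and analyzing how to use Hadoop to process large volumes of email. The approach there is fairly clear, but to analyze an email archive really well, the thinking has to come back to search engines. Why? Because the core functions of a search engine are exactly what email archiving needs: searching by key and value, indexing, fast retrieval, and so on. So my advice is to turn our attention back to search.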
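I have been studying Lucene and reading books about it for a while, so I know Lucene and Solr reasonably well. Tika and POI, which we used earlier to extract the content of MSG emails, are related libraries from the same Apache family. Solr is a text search server built on Lucene. Let's take a brief look at how Lucene works: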
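As a rough illustration of the idea behind a Lucene-style index, here is a toy per-field inverted index in plain Java. It is only a sketch and not Lucene's actual implementation: the class name `TinyInvertedIndex`, its methods, and the sample paths are all made up for this example, and the `analyze` method is a very crude stand-in for a real analyzer such as `StandardAnalyzer`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy per-field inverted index (hypothetical, for illustration only).
class TinyInvertedIndex {
    // "field:term" -> sorted set of document ids (the posting list)
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index one field of one document: every term points back to the document.
    void add(String docId, String field, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Single-term lookup in one field: a map lookup, not a scan over all documents.
    List<String> search(String field, String term) {
        Set<String> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("/mail/a.msg", "subject", "Quarterly report");
        index.add("/mail/b.msg", "subject", "Report for Hadoop cluster");
        index.add("/mail/b.msg", "content", "The cluster is healthy.");
        System.out.println(index.search("subject", "report"));  // [/mail/a.msg, /mail/b.msg]
        System.out.println(index.search("content", "cluster")); // [/mail/b.msg]
    }
}
```

The point of the sketch is that a query never scans the raw documents: the cost of a lookup depends on the term, not on the size of the archive, which is why a search engine suits the retrieval-speed requirement of email archiving.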
HDFS + Lucene + Solr = feel good. Once the raw documents are in HDFS, indexing, search, and analysis are handed over to Lucene and Solr.
The core of the mapper then looks roughly like this:
// initialize the IndexWriter
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_33, analyzer);
if (toHDFS) {
    // if we are writing to HDFS, build the index in memory with a RAMDirectory
    iwc.setOpenMode(OpenMode.CREATE);
    idx = new RAMDirectory();
} else {
    // use CREATE_OR_APPEND so that an existing index is simply appended to
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
    idx = FSDirectory.open(new File(outputDir));
}
writer = new IndexWriter(idx, iwc);
public void map(LongWritable key, Text value, OutputCollector output,
        Reporter reporter) throws IOException {
    try {
        Document doc = new Document();

        // add the email file path
        String path = key.toString();
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));

        // convert the content into a MAPIMessage; Text.getBytes() returns the
        // backing array, which is only valid up to getLength()
        InputStream input = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
        MAPIMessage msg = new MAPIMessage(input);

        // add the recipient as a stored and analyzed field so we can search
        // on it and display the recipient name in the results
        String recipient = msg.getRecipientEmailAddress();
        doc.add(new Field("recipient", recipient, Field.Store.YES, Field.Index.ANALYZED));

        String subject = msg.getSubject();
        doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED));

        String content = msg.getTextBody();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

        // add more fine-grained fields based on the search criteria needed
        ...
        writer.addDocument(doc);
    } catch (Exception e) {
        // a message with a missing chunk or an unparsable body ends up here
        // and is skipped rather than failing the whole task
        e.printStackTrace();
    }
}