Follow-up: Hadoop and Email Archiving

lionzl

Category: Big Data and Data Mining

June 21, 2011

Most content on this site is original; reposts must credit the source: www.alexclouds.net

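I have already written several posts recording and analyzing how to process large volumes of email with Hadoop. The approach there is fairly clear, but if we want to analyze an email archive properly, we probably need to turn to search engines. Why? Because the core capabilities of a search engine are exactly what email archiving needs: keyword search, indexing, fast retrieval, and so on. So my advice to everyone is to bring the focus back to search.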

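I have been studying Lucene and reading books about it for a while now, so I am fairly familiar with Lucene and Solr. Tika and POI, which we used earlier to extract the contents of MSG-format emails, belong to the same Apache ecosystem. Solr is a text search server built on Lucene. Let's take a quick look at how Lucene works: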
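As a rough illustration of the core idea, Lucene is built around an inverted index: a map from each term to the documents that contain it. The sketch below is plain Java, not Lucene code; the class `TinyInvertedIndex` and its methods are hypothetical names made up for this example, and the "analysis" step is a simple lowercase-and-split stand-in for a real analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy class (not part of Lucene): term -> ids of documents containing it.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Crude "analysis": lowercase the text, split on non-alphanumerics, record each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Single-term lookup: one map access, no scanning of document bodies.
    SortedSet<Integer> search(String term) {
        return new TreeSet<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>()));
    }

    // AND query: intersect the posting lists of all terms.
    SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> hits = search(term);
            if (result == null) {
                result = hits;
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Quarterly report attached");
        index.add(2, "Lunch on Friday?");
        index.add(3, "Re: quarterly budget report");
        System.out.println(index.search("report"));
        System.out.println(index.searchAll("quarterly", "budget"));
    }
}
```

The point of the structure is that query time depends on the size of the posting lists, not on the total volume of mail, which is why a search engine fits the archive use case.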

HDFS + Lucene + Solr = FEEL GOOD. Once the raw documents are stored in HDFS, the indexing and search/analysis work is handed over to Lucene and Solr.

The core of the mapper implementation then looks roughly like this:

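// Imports for the fragments below (Lucene 3.3, Apache POI HSMF, Hadoop "mapred" API)
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.poi.hsmf.MAPIMessage;

// Mapper fields; toHDFS and outputDir are assumed to be set elsewhere in the mapper
private Directory idx;
private IndexWriter writer;

// initialize indexWriter..
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_33, analyzer);
// if we are writing to hdfs, then use RAMDirectory
if (toHDFS) {
    iwc.setOpenMode(OpenMode.CREATE);
    idx = new RAMDirectory();
    writer = new IndexWriter(idx, iwc);
} else {
    // use CREATE_OR_APPEND so if index already exists it will simply be appended to
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
    idx = FSDirectory.open(new File(outputDir));
    writer = new IndexWriter(idx, iwc);
}

public void map(LongWritable key, Text value, OutputCollector output,
        Reporter reporter) throws IOException {
    try {
        Document doc = new Document();
        // add email file path
        String path = key.toString();
        Fieldable field = new Field("path", path,
                Field.Store.YES, Field.Index.ANALYZED);
        doc.add(field);
        // convert content into MAPIMessage; Text.getBytes() returns the backing
        // array, which is only valid up to getLength(), so bound the stream
        InputStream input = new ByteArrayInputStream(
                value.getBytes(), 0, value.getLength());
        MAPIMessage msg = new MAPIMessage(input);
        // add recipient as stored and analyzed field so we can
        // search based on recipient and display recipient name in the results
        String recipient = msg.getRecipientEmailAddress();
        field = new Field("recipient", recipient,
                Field.Store.YES, Field.Index.ANALYZED);
        doc.add(field);
        String subject = msg.getSubject();
        field = new Field("subject", subject,
                Field.Store.YES, Field.Index.ANALYZED);
        doc.add(field);
        String content = msg.getTextBody();
        field = new Field("content", content,
                Field.Store.YES, Field.Index.ANALYZED);
        doc.add(field);
        // add more fine grained fields based on search criteria needed
        ...
        writer.addDocument(doc);
    } catch (Exception e) {
        // a message with a missing chunk (e.g. no subject) ends up here and is skipped
        e.printStackTrace();
    }
}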
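To make the effect of `Field.Store.YES` together with `Field.Index.ANALYZED` in the mapper more concrete, here is a small sketch in plain Java, not Lucene. It keeps two structures per document: tokenized terms per field (the "indexed" side, used for matching) and the original field values (the "stored" side, used for display). The class `TinyFieldIndex`, its methods, and the sample paths and addresses in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy model (not Lucene): a document is a map of field name -> value.
class TinyFieldIndex {
    // "field:term" -> ids of documents whose field contains the term (indexed side)
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // doc id -> original field values, kept verbatim for display (stored side)
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Store the raw values and index every field's tokens; returns the new doc id.
    int add(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(new HashMap<>(fields));
        for (Map.Entry<String, String> f : fields.entrySet()) {
            for (String term : f.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(f.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    // Match on one field, then return the stored value of another field for each hit.
    List<String> search(String field, String term, String showField) {
        List<String> out = new ArrayList<>();
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        for (int docId : postings.getOrDefault(key, new TreeSet<>())) {
            out.add(stored.get(docId).get(showField));
        }
        return out;
    }
}
```

With two sample messages added, a call such as `search("recipient", "alice", "path")` matches on the tokenized recipient field and returns the stored paths of the hits, which mirrors searching by recipient and showing stored fields in the result list.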