Hadoop for Archiving Email

Reposted October 14, 2011

This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

Most of us in IT/datacenters know the challenges of storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration process, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored on production servers; others just create a backup dump and store it on tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search email because of the critical information it holds, as well as for legal compliance, investigations, and so on. That said, let's look at how Hadoop could help make this process simple, cost-effective, manageable, and scalable.

Big files are ideal for Hadoop: it can store them, provide fault tolerance, and process them in parallel when needed. As such, the first step in the journey to an email archival solution is to convert your email messages into large files. In this case we will convert them into flat files called sequence files, composed of binary key-value pairs. One way to accomplish this is to:

  • Put all the individual email message files into a folder in HDFS.
  • Use something like WholeFileInputFormat/WholeFileRecordReader (as described in Tom White's book for small-file conversion) to read the contents of each file as the value and the file name as the key (see Figure 1: Message Files to Sequence File).
  • Use IdentityReducer (which passes input values through to the output unchanged) to write them into sequence files.

(Note: Depending on where your data is and on bandwidth bottlenecks, simply spawning standalone JVMs to create sequence files directly in HDFS could also be an option.)

If you are dealing with millions of files, one way of sharding (partitioning) them would be to create sequence files by day/week/month, depending on how many email messages there are in your organization. This limits the number of message files you need to put into HDFS at a time to something more manageable, say 1-2 million, given the NameNode memory footprint of each file. Once the files are in HDFS and converted into sequence files, you can delete the originals and proceed to the next batch. Here is what a very basic map method in a Mapper class could look like; all it does is emit the file name as the key and the binary bytes as the value.

public void map(NullWritable key, BytesWritable value,
        OutputCollector<Text, BytesWritable> output, Reporter reporter)
        throws IOException {
    // conf is the JobConf saved in configure(); "map.input.file"
    // holds the path of the file this map task is reading
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}

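As a side note, the day/week/month batching described earlier can be driven by a small helper that maps each message's date to a bucket name; the sequence file for that bucket then collects all of that period's messages. This is a hypothetical sketch (the class, method, and naming scheme are illustrative, not from the original post):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for monthly batching: each message's date picks
// the sequence file (bucket) it will be packed into, keeping any single
// HDFS load down to one period's worth of files.
public class EmailBucket {

    // e.g. a message dated 2011-10-14 goes into "emails-2011-10"
    static String monthBucket(LocalDate messageDate) {
        return "emails-" + messageDate.format(DateTimeFormatter.ofPattern("yyyy-MM"));
    }
}
```

A weekly or daily scheme is the same idea with a finer-grained date pattern.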
Here is what the main driver looks like:
public int run(String[] args) throws IOException {
    JobConf conf = new JobConf(SequenceFileCreator.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setInputFormat(WholeFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    // You could also compress the key-value pairs using the following:
    // SequenceFileOutputFormat.setCompressOutput(conf, true);
    // SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
    // SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(BytesWritable.class);
    conf.setMapperClass(SequenceFileMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    JobClient.runJob(conf);
    return 0;
}

Code Walkthrough:

  1. The mapper emits the file name as the key and the file content (a BytesWritable) as the value.
  2. Sets the input path (where the message files are stored) and the output path (where the output sequence file will be saved).
  3. WholeFileInputFormat is used as the input format, which in turn uses WholeFileRecordReader to read each file's content in its entirety. The content of the file is sent to the mapper as the value.
  4. This block enables compression. Using gzip you could get about a 10:1 ratio, but each file would have to be processed as a whole. With LZO the compression is about 4:1, but the files can be split and processed in parallel.
  5. This is where we set the mapper's output key type to Text, its output value type to BytesWritable, the mapper class to the one we created, and the reducer class to IdentityReducer. Since we do not need to sort, IdentityReducer works well in this case.

Figure 1: Message Files to Sequence File

With compression turned on, you can get I/O as well as storage benefits.
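
To put rough numbers on that, here is an illustrative back-of-the-envelope calculation. The archive size, average message size, and class/method names are hypothetical; the ~10:1 gzip and ~4:1 LZO ratios are the rough figures mentioned above:

```java
// Back-of-the-envelope storage estimate for a hypothetical archive of
// 10 million messages averaging 100 KB each, under the rough compression
// ratios discussed above (gzip ~10:1, LZO ~4:1). Illustrative only.
public class CompressionEstimate {

    // Estimated size after compressing rawBytes at the given ratio
    static long compressedBytes(long rawBytes, double ratio) {
        return (long) (rawBytes / ratio);
    }

    public static void main(String[] args) {
        long raw = 10_000_000L * 100_000L;  // ~1 TB of raw messages
        System.out.println("raw  : " + raw / 1_000_000_000L + " GB");
        System.out.println("gzip : " + compressedBytes(raw, 10.0) / 1_000_000_000L + " GB");
        System.out.println("lzo  : " + compressedBytes(raw, 4.0) / 1_000_000_000L + " GB");
    }
}
```

With these made-up numbers, the 1 TB raw archive shrinks to roughly 100 GB under gzip or 250 GB under LZO, at the cost of losing splittability in the gzip case.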

So far, we have taken our message files and converted them into sequence files, taking care of the conversion as well as the storage portion. In terms of searching those files, if the email messages are Outlook messages, we can use Apache POI for Microsoft Documents to parse them into Java objects and search the contents of the messages to output results as needed.

A quick example of code to perform the search and output the results looks like this, with the mapper first, followed by driver class:

public void map(Text key, BytesWritable value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String fileName = key.toString();
    InputStream input = new ByteArrayInputStream(
            value.getBytes(), 0, value.getLength());
    try {
        MAPIMessage msg = new MAPIMessage(input);
        if (msg.getRecipientEmailAddress().contains("sales")) {
            // Write the matching message directly to the local
            // file system (an NFS mount if you want)
            FileOutputStream fos = new FileOutputStream("/tmp/" + fileName);
            fos.write(value.getBytes(), 0, value.getLength());
            fos.close();
        }
    } catch (ChunkNotFoundException e) {
        // Message has no recipient chunk; skip it
    }
}
public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(SeqSearch.class);
    conf.setJobName("SearchEmails");
    conf.setMapperClass(Map.class);
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
    return 0;
}

Code Walkthrough:

  1. The BytesWritable comes into the mapper as the value from the sequence file, and is converted to a MAPIMessage using Apache POI for MS Documents.
  2. The actual search is performed here. In this example, it simply looks for emails that contain "sales" in the recipient email address; since all of the sample emails have "sales" in the recipient address, the results should include them all. Of course, this is where the bulk of the logic would go should one require extensive search features, including parsing through various attachments. In this example we write the results (the complete email message) directly to the local filesystem so that they can be viewed in messaging applications such as Outlook.
  3. This is the job configuration, where the input format is set to SequenceFileInputFormat so the job reads the sequence files created earlier. Since the job produces no mapper output, we set the output format to NullOutputFormat.

In this example, we write the complete email out to the local filesystem one message at a time. As an alternative, we could pass the results to reducers and write them all as Text into HDFS. Which approach to use depends on the need.

In this post I have described how to convert email files into sequence files and store them in HDFS, and how to search through them to output results. Given the "simply add a node" scalability of Hadoop, it is very straightforward to add more storage as well as search capacity. Furthermore, because Hadoop clusters are built from commodity hardware, the software itself is open source, and the framework makes it simple to implement specific use cases, the overall solution is very cost effective compared to a number of existing software products that provide similar capabilities. The search portion of the solution, however, is very rudimentary. In part 2, I will look at using Lucene/Solr for indexing and searching in a more standard and robust way.
