In our current architecture, Scribe writes data directly into HDFS, and most operations on that data go through Hive, so the data needs to become queryable in Hive as soon as it lands in HDFS. That means partition metadata has to be added in a data-driven way. Our previous approach was to run a parallel batch Java program at fixed intervals to add the metadata, but that could miss partitions whose data had not yet arrived, so we had to run it again the next day to make sure everything was mapped into Hive. This time-driven approach is simple to implement: scan the raw data directories in bulk and add partitions based on the directory names. But it offers no real-time guarantee: if data for a new partition starts arriving just after a run finishes, it will not be mapped into Hive until the next batch run.
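The time-driven batch step can be sketched as a directory scan that turns each partition directory name into a Hive DDL statement. This is only an illustration, assuming directories are named `metric_snid_clientid_gameid_date` (the same layout the Hive LOCATION uses later in this post); the real job was a parallel Java program.

```python
import os

def partition_ddl(dirname):
    """Turn a directory name like 'login_1_2_3_2013-04-22' into a Hive DDL.

    Assumes the name has exactly five underscore-separated fields; a metric
    name containing underscores would need a smarter parser.
    """
    metric, snid, clientid, gameid, clientdate = dirname.split('_')
    return ("ALTER TABLE ht_%s ADD IF NOT EXISTS PARTITION"
            "(snid=%s, clientid=%s, gameid=%s, ds='%s')"
            % (metric, snid, clientid, gameid, clientdate))

def scan_rawdata(root):
    # one DDL statement per partition directory already present on disk
    return [partition_ddl(d) for d in sorted(os.listdir(root))]
```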
What we need, then, is a data-driven way to add the mappings: as soon as data for a new partition arrives, we add the corresponding partition, which naturally suggests a queue. Here we monitor Scribe's output logs and implement this with RabbitMQ and Python's pika module. The same approach would still work if Scribe were replaced with Flume, or with a different message queue and language API.
Below is the wrapped API for the sending side:
'''
Created on 2013-4-22
@author: panfei
'''
import logging
import json

import pika

LOG_FORMAT = ('%(levelname) -10s %(asctime)s %(name) -30s %(funcName) '
              '-35s %(lineno) -5d: %(message)s')
LOGGER = logging.getLogger(__name__)


class Sender(object):
    EXCHANGE = 'category_exchange'
    QUEUE = 'category_q'
    ROUTING_KEY = 'category_rk'

    def __init__(self, host):
        self._connection = None
        self._channel = None
        self._host = host
        self._message_number = 0
        self.connect()

    def connect(self):
        LOGGER.info('Connect to %s', self._host)
        self._connection = pika.BlockingConnection(
            pika.ConnectionParameters(host=self._host))
        self._channel = self._connection.channel()
        self._channel.queue_declare(queue=self.QUEUE)
        self._channel.exchange_declare(exchange=self.EXCHANGE)
        self._channel.queue_bind(self.QUEUE, self.EXCHANGE, self.ROUTING_KEY)

    def close_connection(self):
        LOGGER.info('Closing the connection')
        self._connection.close()

    def publish_message(self, message):  # message is a dict
        try:
            self._channel.basic_publish(exchange=self.EXCHANGE,
                                        routing_key=self.ROUTING_KEY,
                                        body=json.dumps(message))
            self._message_number += 1
            LOGGER.info('Published message # %i', self._message_number)
        except Exception as e:
            LOGGER.critical('Exception <%s> happened when publishing message', e)
            self.connect()  # simple reconnect; the failed message is lost

    def close_channel(self):
        LOGGER.info('Closing the channel')
        if self._channel:
            self._channel.close()

    def stop(self):
        LOGGER.info('Stopping')
        self.close_channel()
        self.close_connection()
        LOGGER.info('Stopped')


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT,
                        filename='/data/log/rabbit/sender.log')
    sender = Sender('aggr01')  # __init__ already connects
    try:
        sender.publish_message({'test': 'dafdafd'})
        sender.close_channel()
        sender.close_connection()
    except KeyboardInterrupt:
        sender.stop()
The sender uses a simple direct-routing scheme, and a simple reconnect mechanism: on a publish failure it just reconnects. Since we have only recently started using pika, there may well be a better way to manage the connection; this needs further study.
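For a concrete sense of the payload, here is the kind of dict the sender publishes. The field names are the keys the consumer below substitutes into its Hive command template; the values are made up for illustration. It also shows the MD5 digest the consumer computes over the serialized body for deduplication.

```python
import json
import hashlib

# Illustrative message; field names match the consumer's CMD_TEMPLATE keys,
# the values here are made up.
message = {
    'metric': 'login',
    'snid': 1,
    'clientid': 2,
    'gameid': 3,
    'clientdate': '2013-04-22',
}

body = json.dumps(message)       # what basic_publish sends over RabbitMQ
roundtrip = json.loads(body)     # what the consumer's callback decodes
# the consumer dedups on the digest of the body, not the body itself
digest = hashlib.md5(body.encode('utf-8')).hexdigest()
```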
Below is the receiving side, i.e. the consumer:
'''
Created on 2013-4-22
@author: panfei
'''
import hashlib
import datetime
import logging
import json
from subprocess import Popen, PIPE

import pika

LOG_FORMAT = ('%(levelname) -10s %(asctime)s %(name) -30s %(funcName) '
              '-35s %(lineno) -5d: %(message)s')
LOGGER = logging.getLogger(__name__)


class Receiver(object):
    QUEUE = 'category_q'
    ROUTING_KEY = 'category_rk'
    CMD_TEMPLATE = '''hive -S -e "ALTER TABLE ht_%(metric)s ADD IF NOT EXISTS PARTITION(snid=%(snid)s, clientid=%(clientid)s, gameid=%(gameid)s, ds='%(clientdate)s') LOCATION 'hdfs://pnn:9000/user/hive/warehouse/rawdata/%(metric)s/%(metric)s_%(snid)s_%(clientid)s_%(gameid)s_%(clientdate)s'"'''

    def __init__(self, host):
        self._connection = None
        self._channel = None
        self._host = host
        self._message_number = 0
        self.current_date = datetime.date.today()
        self.buf = set()  # dedup cache of md5 digests of processed messages
        self.connect()

    def connect(self):
        LOGGER.info('Connect to %s', self._host)
        self._connection = pika.BlockingConnection(
            pika.ConnectionParameters(self._host))
        self._channel = self._connection.channel()
        self._channel.queue_declare(self.QUEUE)

    def callback(self, ch, method, properties, body):
        md5body = hashlib.md5(body).hexdigest()
        try:
            # clear the dedup cache once a day so it cannot grow without bound
            now = datetime.date.today()
            delta = now - self.current_date
            if delta.days >= 1:
                LOGGER.warning('Clearing the buffer...')
                self.buf.clear()
                self.current_date = now
            if md5body not in self.buf:
                LOGGER.info('%s received %r, now processing it', self.QUEUE, body)
                kwargs = json.loads(body)
                cmd = self.CMD_TEMPLATE % kwargs
                LOGGER.info('Now exec <%s> ...', cmd)
                p = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE,
                          close_fds=True)
                ret_code = p.wait()
                if ret_code == 0:
                    LOGGER.info('Return code <%s>, success! Adding the message '
                                'md5 to the buffer', ret_code)
                    self.buf.add(md5body)
                else:
                    LOGGER.error('Return code <%s>, failed! Output: <%s> when '
                                 'executing <%s>', ret_code, p.stderr.read(), cmd)
        except Exception as e:
            LOGGER.exception('Exception <%s> when processing <%s>', e, body)

    def consume(self):
        try:
            self._channel.basic_consume(self.callback, queue=self.QUEUE,
                                        no_ack=True)
            LOGGER.info('start consuming messages')
            self._channel.start_consuming()
        except Exception as e:
            LOGGER.exception('Exception <%s> in the process of consuming, now '
                             'reconnect and start a new consume process', e)
            self.connect()
            self.consume()

    def run(self):
        self.consume()


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT,
                        filename='/data/log/rabbit/receiver.log')
    receiver = Receiver('aggr01')
    receiver.run()
To avoid adding the same partition twice, a simple cache (self.buf) is used; it is cleared once a day so it cannot grow without bound. Because the message body is fairly large, we store its MD5 digest instead of the body itself. A simple reconnect mechanism is implemented here too; it still feels a bit odd, but in testing it does recover from dropped connections, and the program has been running stably in production for the past few days.
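The glue between the Scribe log watcher and the sender is not shown above. A minimal sketch of that piece might parse each newly created partition path into a message dict for `Sender.publish_message`; the path layout here is an assumption taken from the receiver's LOCATION template, and `path_to_message` is a hypothetical helper, not part of the production code.

```python
import os

def path_to_message(path):
    """Turn a new partition directory path into a message dict, or None.

    Assumes the layout from the LOCATION template above:
    .../rawdata/<metric>/<metric>_<snid>_<clientid>_<gameid>_<date>
    """
    dirname = os.path.basename(path.rstrip('/'))
    parts = dirname.split('_')
    if len(parts) != 5:
        return None  # not a partition directory we recognize
    metric, snid, clientid, gameid, clientdate = parts
    return {'metric': metric, 'snid': snid, 'clientid': clientid,
            'gameid': gameid, 'clientdate': clientdate}

# In a real deployment, a loop would tail Scribe's output log (or poll the
# raw data directory) and call sender.publish_message(path_to_message(p))
# for each newly observed path p.
```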