In most large companies the data platform is a cluster environment: data lives on HDFS and is managed through a data warehouse, which is typically accessed with Hive. Building a machine learning pipeline in this setting breaks down into the following steps:
1. Fetch the model input from Hive.
2. Run the model prediction.
3. Write the prediction results to the local file system.
4. Load the results from the local file system into a Hive table for downstream consumers.

The template below walks through all four steps; your_path, shell_command, and the file names are placeholders.
import os
import gc  # used in the cleanup step at the end
import subprocess

import pandas as pd

os.chdir('your_path')  # placeholder: directory holding the local files below
# Step 1: read the model input from Hive
cmd = shell_command  # placeholder for your Hive query command
ds = os.popen(cmd).readlines()
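# A hypothetical example of such a command (database and table names are
# placeholders; the hive CLI prints query results tab-separated to stdout):
# cmd = 'hive -e "select uid, attr from your_db.your_table"'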
# Parse the tab-separated Hive output into a DataFrame
# (strip() already removes the trailing newline on each row)
dtype = {'uid': str, 'attr': str}
df = pd.DataFrame([s.strip().split('\t') for s in ds],
                  columns=['uid', 'attr']).astype(dtype)
# Step 2: model prediction
#......
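# Illustration only, not part of the original pipeline: assuming a
# pre-trained scikit-learn-style model saved with joblib, this step
# might look like
#   import joblib
#   model = joblib.load('your_model.pkl')       # hypothetical model file
#   df['score'] = model.predict(df[['attr']])   # hypothetical feature column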
# Step 3: write the model result to the local file system
# (no header or index, so the file layout matches the Hive table's columns;
# pass sep=... if your table's field delimiter is not a comma)
df.to_csv('file_name', header=False, index=False)
# Step 4: load the local file back into HDFS / Hive
cmd = 'your_shell_command'  # placeholder for your load command
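# A hypothetical example of such a load command (file and table names are
# placeholders):
# cmd = 'hive -e "load data local inpath \'file_name\' into table your_db.your_table"'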
# Process creation and management in the subprocess module is handled by
# the Popen class.
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE, stdin=subprocess.PIPE,
                     stderr=subprocess.PIPE)
# Careful: p.wait() can deadlock if the child writes enough output to fill
# the OS pipe buffer; the child then blocks waiting for the parent to read,
# while the parent blocks in wait(). If the child produces little or no
# output, p.wait() is safe.
status = p.wait()
logs = p.stderr.read()  # a pipe can only be read once
print('logs: %s' % logs)
print('status: %d' % status)
# Release memory: remove the local file once the load succeeds, drop the
# DataFrame reference, and force a garbage collection
if status == 0:
    subprocess.call('rm your_local_file', shell=True)
del df
gc.collect()
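As the comments above note, p.wait() can deadlock once a redirected pipe fills up. A minimal sketch of the same step using communicate(), which drains stdout and stderr while waiting and therefore cannot hit that deadlock (the command string is the same placeholder as above):

import subprocess

cmd = 'your_shell_command'  # placeholder, as above
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() reads both pipes to EOF while waiting for the process to
# exit, so neither pipe can fill up and block the child
out, err = p.communicate()
status = p.returncode
print('logs: %s' % err)
print('status: %d' % status)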