In most large companies the data platform is a cluster environment: data lives on HDFS and is managed through a data warehouse, which is typically accessed with Hive. Building a machine learning pipeline in this setting breaks down into the following steps:
1. Fetch the model input from Hive.
2. Run the model prediction.
3. Write the prediction results to the local file system.
4. Load the results from the local file system into a Hive table for downstream consumers.

The template below walks through all four steps; your_path, shell_command, and the file names are placeholders.
import os
import gc  # used in the cleanup step at the end
import subprocess

import pandas as pd

os.chdir('your_path')  # placeholder: directory holding the local files below
# Step 1: read the model input from Hive
cmd = shell_command  # placeholder for your Hive query command
ds = os.popen(cmd).readlines()
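# A hypothetical example of such a command (database and table names are
# placeholders; the hive CLI prints query results tab-separated to stdout):
# cmd = 'hive -e "select uid, attr from your_db.your_table"'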
# Parse the tab-separated Hive output into a DataFrame
# (strip() already removes the trailing newline on each row)
dtype = {'uid': str, 'attr': str}
df = pd.DataFrame([s.strip().split('\t') for s in ds],
                  columns=['uid', 'attr']).astype(dtype)
# Step 2: model prediction
#......
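# Illustration only, not part of the original pipeline: assuming a
# pre-trained scikit-learn-style model saved with joblib, this step
# might look like
#   import joblib
#   model = joblib.load('your_model.pkl')       # hypothetical model file
#   df['score'] = model.predict(df[['attr']])   # hypothetical feature column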
# Step 3: write the model result to the local file system
# (no header or index, so the file layout matches the Hive table's columns;
# pass sep=... if your table's field delimiter is not a comma)
df.to_csv('file_name', header=False, index=False)
# Step 4: load the local file back into HDFS / Hive
cmd = 'your_shell_command'  # placeholder for your load command
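# A hypothetical example of such a load command (file and table names are
# placeholders):
# cmd = 'hive -e "load data local inpath \'file_name\' into table your_db.your_table"'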
# Process creation and management in the subprocess module is handled by
# the Popen class.
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE, stdin=subprocess.PIPE,
                     stderr=subprocess.PIPE)
# Careful: p.wait() can deadlock if the child writes enough output to fill
# the OS pipe buffer; the child then blocks waiting for the parent to read,
# while the parent blocks in wait(). If the child produces little or no
# output, p.wait() is safe.
status = p.wait()
logs = p.stderr.read()  # a pipe can only be read once
print('logs: %s' % logs)
print('status: %d' % status)
# Release memory: remove the local file once the load succeeds, drop the
# DataFrame reference, and force a garbage collection
if status == 0:
    subprocess.call('rm your_local_file', shell=True)
del df
gc.collect()
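As the comments above note, p.wait() can deadlock once a redirected pipe fills up. A minimal sketch of the same step using communicate(), which drains stdout and stderr while waiting and therefore cannot hit that deadlock (the command string is the same placeholder as above):

import subprocess

cmd = 'your_shell_command'  # placeholder, as above
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() reads both pipes to EOF while waiting for the process to
# exit, so neither pipe can fill up and block the child
out, err = p.communicate()
status = p.returncode
print('logs: %s' % err)
print('status: %d' % status)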