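Iterative computation in Python with DPark: DPark is a Python clone of Spark, a distributed computing framework implemented in Python that makes large-scale data processing and iterative computation easy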

Latest recommended article published on 2024-08-07 09:28:46

weixin_39974882


DPark

DPark is a Python clone of Spark, a MapReduce(R)-like computing framework that supports iterative computation.

Installation

## Due to the use of C extensions, some libraries need to be installed first.

$ sudo apt-get install libtool pkg-config build-essential autoconf automake

$ sudo apt-get install python-dev

$ sudo apt-get install libzmq-dev

## Then just pip install dpark (``sudo`` may be needed if you encounter a permission problem).

$ pip install dpark

Example

Word counting (wc.py):

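from dpark import DparkContext

# Create the context; the run mode comes from the command-line arguments.
ctx = DparkContext()

# Read the input file as an RDD of lines.
lines = ctx.textFile("/tmp/words.txt")

# Split each line into words and pair every word with a count of 1.
words = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1))

# Sum the counts per word and collect the result as a dict.
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()

print(wc)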

This script can run locally or on a Mesos cluster without any

modification, just using different command-line arguments:

$ python wc.py

$ python wc.py -m process

$ python wc.py -m host[:port]

See examples/ for more use cases.
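To make the data flow of wc.py concrete, here is a minimal plain-Python sketch of what the flatMap, map and reduceByKey steps compute. It does not use DPark at all: flat_map and reduce_by_key are hypothetical helper names written for this illustration, and the input lines are made-up sample data standing in for /tmp/words.txt.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Made-up sample data standing in for the lines of /tmp/words.txt.
lines = ["a b a", "b c"]

# Same three steps as wc.py: split into words, pair with 1, sum per word.
words = [(w, 1) for w in flat_map(lambda x: x.split(), lines)]
wc = reduce_by_key(lambda x, y: x + y, words)
print(wc)
```

In DPark the same steps run per split of the RDD and the per-key reduction involves a shuffle between stages; the sketch only shows the single-machine result.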

Configuration

DPark can run with Mesos 0.9 or higher.

If a $MESOS_MASTER environment variable is set, you can use a

shortcut and run DPark with Mesos just by typing

$ python wc.py -m mesos

$MESOS_MASTER can use any Mesos master URL scheme, such as

$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
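The zk:// value above packs several ZooKeeper host:port pairs and a znode path into one string. As a rough illustration of that structure, the following sketch splits such a URL into its parts. parse_zk_master is a hypothetical helper written for this example; it is not part of DPark or Mesos.

```python
def parse_zk_master(url):
    """Split a zk:// master URL into (list of host:port strings, znode path)."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("not a zk:// URL: %r" % url)
    rest = url[len(prefix):]
    # Everything before the first "/" is the comma-separated host list.
    hosts_part, sep, path = rest.partition("/")
    hosts = [h for h in hosts_part.split(",") if h]
    return hosts, sep + path


hosts, path = parse_zk_master("zk://zk1:2181,zk2:2181,zk3:2181/mesos_master")
print(hosts)
print(path)
```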

To speed up shuffling, deploy Nginx on port 5055 to serve the data in DPARK_WORK_DIR (default: /tmp/dpark), for example:

server {
    listen 5055;
    server_name localhost;
    root /tmp/dpark/;
}

UI

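The UI shows two DAGs:

Stage graph: a stage is a running unit that contains a set of tasks, each running the same operations on one split of an RDD.

API callsite graph: a graph of the call sites where the user code invokes the API.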

UI when running

Open the URL printed in the log, in a line like "start listening on Web UI http://server_01:40812".

UI after running

Before running, set LOGHUB and LOGHUB_PATH_FORMAT in dpark.conf and pre-create LOGHUB_DIR.

Get the log hub directory from a log line like "logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754"; the path includes the Mesos framework ID.

Run dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/ (dpark_web.py is in tools/).
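The lookup described above (find the log hub directory in the job log, then pass it to dpark_web.py) can be sketched as a small script. This is only an illustration under stated assumptions: the regular expression assumes the log line contains the literal text "logging/prof to " followed by the path, as in the example above, and extract_loghub_path and web_command are hypothetical helper names. Invoking the tool as "python tools/dpark_web.py" is also an assumption about how it is launched.

```python
import re

# Assumed pattern: the log line contains "logging/prof to <path>".
LOG_PATTERN = re.compile(r"logging/prof to (\S+)")


def extract_loghub_path(log_line):
    """Return the path after 'logging/prof to', or None if it is absent."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    # Drop trailing punctuation that may follow the path in running text.
    return match.group(1).rstrip(",.")


def web_command(loghub_path, port=9999):
    """Build an argument list for dpark_web.py (launch style is an assumption)."""
    return ["python", "tools/dpark_web.py", "-p", str(port), "-l", loghub_path]


line = ("logging/prof to LOGHUB_DIR/2018/09/27/16/"
        "b2e3349b-9858-4153-b491-80699c757485-8754")
path = extract_loghub_path(line)
print(path)
print(" ".join(web_command(path)))
```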

UI examples for features

Sharing of shuffle map output:

rdd = DparkContext().makeRDD([(1,1)]).map(m).groupByKey()

rdd.map(m).collect()

rdd.map(m).collect()

Nodes are combined if and only if they have the same lineage, forming a logical tree inside a stage; each node then contains a PIPELINE of RDDs.

rdd1 = get_rdd()

rdd2 = dc.union([get_rdd() for i in range(2)])

rdd3 = get_rdd().groupByKey()

dc.union([rdd1, rdd2, rdd3]).collect()

More docs (in Chinese)
