
DPark

DPark is a Python clone of Spark, a MapReduce(R)-like computing framework supporting iterative computation.

Installation

## Due to the use of C extensions, some libraries need to be installed first.
$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev

## Then just pip install dpark (``sudo`` may be needed if you encounter a permission problem).
$ pip install dpark

Example

A word-counting example (wc.py):

from dpark import DparkContext

ctx = DparkContext()
file = ctx.textFile("/tmp/words.txt")
words = file.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()
print wc
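For reference, here is a plain-Python sketch of what this pipeline computes (only an illustration of the flatMap/map/reduceByKey semantics, not of how DPark distributes the work):

```python
from collections import defaultdict

def wordcount(lines):
    # flatMap: split every line into individual words
    words = [w for line in lines for w in line.split()]
    # map to (word, 1), then reduceByKey with addition == counting
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

print(wordcount(["the quick brown fox", "the lazy dog"]))
```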

This script can run locally or on a Mesos cluster without any modification, just by using different command-line arguments:

$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]

See examples/ for more use cases.

Configuration

DPark can run with Mesos 0.9 or higher.

If a $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark with Mesos just by typing

$ python wc.py -m mesos

$MESOS_MASTER can be any valid Mesos master address, such as

$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master

In order to speed up shuffling, you should deploy Nginx on port 5055 to serve the data in DPARK_WORK_DIR (default is /tmp/dpark), for example:

server {
        listen 5055;
        server_name localhost;
        root /tmp/dpark/;
}

UI

The UI shows two DAGs:

stage graph: a stage is the unit of execution; it contains a set of tasks, each running the same operations on one split of an RDD.

API call-site graph: grouped by where the API is called in user code.

UI when running

Just open the URL printed in the log, such as start listening on Web UI http://server_01:40812.

UI after running

Before running, configure LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf and pre-create the LOGHUB directory.

Get the log hub directory from a log line such as logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754, which includes the Mesos framework id.

Run dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/ (dpark_web.py lives in tools/).

UI examples for features

Showing shared shuffle map output:

rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect()

Nodes are combined iff they have the same lineage, forming a logical tree inside the stage; each node then contains a PIPELINE of RDDs.
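The pipelining idea can be sketched in plain Python (a simplified model, not DPark's actual implementation): narrow per-element operations on the same lineage are fused so that each element flows through the whole chain in a single pass.

```python
def fuse(*ops):
    # Compose per-element operations into one generator, mimicking how
    # a stage runs a PIPELINE of narrow RDD transformations in one pass.
    def run(split):
        for x in split:
            for op in ops:
                x = op(x)
            yield x
    return run

pipeline = fuse(lambda x: x + 1, lambda x: x * 2)
print(list(pipeline([1, 2, 3])))  # [4, 6, 8]
```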

rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()

More docs (in Chinese)
