November 2016_d4shman


Repost: Spark mapPartitions() operation

Original article: http://apachesparkbook.blogspot.com/2015/11/mappartition-example.html --- mapPartitions() can be used as an alternative to map() & foreach(). mapPartitions() is called once for each Partition

2016-11-21 12:11:47 2684
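
To illustrate the mapPartitions() excerpt above, here is a minimal sketch, not taken from the linked article. It assumes Spark is on the classpath and runs in local mode; the object name `MapPartitionsSketch` and the helper `partitionSums` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the original post.
object MapPartitionsSketch {
  // Computes one sum per partition. The function passed to mapPartitions
  // receives an iterator over a whole partition and must return an iterator,
  // so it runs once per partition rather than once per element.
  def partitionSums(sc: SparkContext): Array[Int] = {
    val rdd = sc.parallelize(1 to 10, numSlices = 2)
    rdd.mapPartitions(iter => Iterator(iter.sum)).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads is enough for the demo.
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitions-sketch")
    val sc = new SparkContext(conf)
    try println(partitionSums(sc).mkString(", "))
    finally sc.stop()
  }
}
```

Because the function is invoked once per partition, any expensive setup (for example opening a connection) can be done once per partition instead of once per record, which is the usual reason to prefer it over map().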

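Original: Spark data partitioning

A Spark program can reduce network communication overhead by partitioning its data. Partitioning does not help in every scenario: for example, if a given RDD is scanned only once, there is no need to partition it at all. Partitioning helps only when the data is used multiple times in key-based operations such as joins. Suppose we have a large, unchanging file userData and a small dataset events that is produced every 5 minutes, and we need to join userData with events each time a new 5-minute batch of events is produced. The code for this process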

2016-11-20 00:55:02 2451 1
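
The excerpt above is cut off before its code, so the following is only a sketch of the scenario it describes, not the author's code. It assumes Spark is on the classpath; the record shapes `(userId, userName)` and `(userId, page)`, the partition count 4, and the names `PartitionedJoinSketch`, `partitionUsers`, and `joinBatch` are all hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of the original post.
object PartitionedJoinSketch {
  // Hash-partition the large, unchanging dataset once and keep it cached,
  // so that later joins can reuse the same layout.
  def partitionUsers(users: RDD[(Int, String)]): RDD[(Int, String)] =
    users.partitionBy(new HashPartitioner(4)).persist()

  // Join one small batch of events against the pre-partitioned user data.
  def joinBatch(userData: RDD[(Int, String)],
                events: RDD[(Int, String)]): RDD[(Int, (String, String))] =
    userData.join(events)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioned-join-sketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for userData and one events batch.
      val userData = partitionUsers(
        sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol"))))
      val events = sc.parallelize(Seq((1, "/home"), (3, "/cart")))
      joinBatch(userData, events).collect().foreach(println)
    } finally sc.stop()
  }
}
```

The design point is that `partitionBy` followed by `persist()` is done once, outside the per-batch work. Since userData then has a known partitioner, Spark can generally shuffle only the small events batch for each join instead of re-hashing the large dataset every 5 minutes.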

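Original: Spark Pair RDD operations

Spark Pair RDD operations. 1. Creating a Pair RDD: val pairs = lines.map(x => (x.split(" ")(0), x)) 2. Pair RDD transformations. Table 1: Pair RDD transformations (using the key-value set {(1,2), (3,4), (3,6)} as an example). Columns: function name, purpose, example, result. reduceByKey(): merges the values that share the same key. rdd.reduce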

2016-11-19 12:14:04 1264
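
The two steps named in the Pair RDD excerpt above can be sketched as follows. This is an illustrative example under assumptions, not the post's own code: Spark is assumed to be on the classpath, the sample lines are made up, and `PairRddSketch`, `firstWordPairs`, and `sumByKey` are hypothetical names. The key-value set is the one quoted in the excerpt.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of the original post.
object PairRddSketch {
  // Step 1: build a Pair RDD keyed by the first word of each line.
  def firstWordPairs(lines: RDD[String]): RDD[(String, String)] =
    lines.map(x => (x.split(" ")(0), x))

  // Step 2: merge the values that share a key by adding them together.
  def sumByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Int)] =
    rdd.reduceByKey((x, y) => x + y)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pair-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up input lines for illustration.
      val lines = sc.parallelize(Seq("hello world", "hi there"))
      firstWordPairs(lines).collect().foreach(println)

      // The example set from the excerpt: {(1,2), (3,4), (3,6)}.
      val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
      // The two values under key 3 are merged: 4 + 6 = 10.
      sumByKey(rdd).collect().sorted.foreach(println)
    } finally sc.stop()
  }
}
```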

Repost: Tuning Spark memory parameters

Original article: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ --- In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect

2016-11-15 10:29:55 1985 2

Repost: Hive VS HBase

Original article: https://www.xplenty.com/blog/2014/05/hive-vs-hbase/ --- Comparing Hive with HBase is like comparing Google with Facebook - although they compete over the same turf (our private information)

2016-11-14 13:54:09 1426

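Original: Spark RDD basic operations

Spark RDD programming in Scala. An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which are processed on different nodes of the cluster. RDDs support two types of operations: transformations and actions. Spark computes RDDs lazily: an RDD produced by a transformation is not computed immediately, but only the first time it is used in an action,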

2016-11-13 20:32:19 5035
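
The lazy evaluation described in the RDD excerpt above can be observed with a small sketch. It is an illustration under assumptions rather than the post's code: it assumes Spark 2.0 or later (for `SparkContext.longAccumulator`) on the classpath and a local run without task retries, and the names `LazyEvalSketch` and `run` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the original post.
object LazyEvalSketch {
  // Returns (map calls counted before the action,
  //          map calls counted after the action,
  //          the collected result).
  def run(sc: SparkContext): (Long, Long, Array[Int]) = {
    // Counts how many times the map function actually executes.
    val calls = sc.longAccumulator("map calls")

    // Transformations: Spark only records the lineage here, nothing runs.
    val squares = sc.parallelize(1 to 5).map { x => calls.add(1L); x * x }
    val evens = squares.filter(_ % 2 == 0)
    val before = calls.sum

    // Action: collect() triggers the computation of the whole chain.
    val result = evens.collect()
    val after = calls.sum

    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-eval-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after, result) = run(sc)
      println(s"map calls before action: $before, after action: $after")
      println(result.mkString(", "))
    } finally sc.stop()
  }
}
```

The counter stays at zero until `collect()` is called, which shows that defining `squares` and `evens` does no work by itself. Note that accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this counting trick is only suitable for a demonstration.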