Spark Study Notes

Spark configuration

SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> airports = sc.textFile("in/airports.text");

Creating RDDs

rdd = sc.parallelize([1, 2, 3, 4])
rdd = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
	- or s3n://, hdfs://
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")

can also create from:

  • JDBC
  • Cassandra
  • HBase
  • Elasticsearch
  • JSON, CSV, sequence files…
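At the RDD level there is no built-in CSV reader, so a common pattern is to load the file with textFile and split each line yourself. A minimal sketch, assuming a hypothetical file in/users.csv (the path and columns are made up for illustration):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CsvToRdd {
    // Pure parsing logic, kept separate from the Spark calls so it is
    // easy to check without a cluster. The -1 keeps trailing empty fields.
    static String[] parseCsvLine(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("csvToRdd").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of the resulting RDD is one array of column values.
        JavaRDD<String[]> rows = sc.textFile("in/users.csv").map(CsvToRdd::parseCsvLine);
        System.out.println(rows.count());
        sc.close();
    }
}
```

Note this naive split breaks on quoted fields that themselves contain commas; for real CSV data a proper parser is safer.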

Transformation and Action

## Transformations

Transformations return a new RDD.

map and filter

  • filter
JavaRDD<String> cleanedLines = lines.filter(line -> !line.isEmpty());
  • map
JavaRDD<String> URLs = sc.textFile("in/urls.text");
URLs.map(url -> makeHttpRequest(url));

// the return type of the map function is not necessarily the same as its input type
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<Integer> lengths = lines.map(line -> line.length());

airport example

public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> airports = sc.textFile("in/airports.text");
    JavaRDD<String> airportsInUSA = airports.filter(line -> line.split(",")[3].equals("USA"));
    JavaRDD<String> airportsNameAndCityNames = airportsInUSA.map(line -> {
        String[] splits = line.split(",");
        return StringUtils.join(new String[]{splits[1], splits[2]}, ",");
    });
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text");
}
  • flatMap: first map, then flatten
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
    System.out.println(entry.getKey() + " : " + entry.getValue());
}

function type

Function: one input and one output => map and filter
Function2: two inputs and one output => aggregate and reduce
FlatMapFunction: one input, 0 or more outputs => flatMap
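Function2 shows up whenever Spark combines two elements of the same type, as reduce does. A minimal sketch of summing an RDD of integers, shown once with an explicit Function2 and once with the equivalent lambda (the combine logic is factored into a plain helper so it can be checked without a cluster):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ReduceExample {
    // Shared combine logic: two ints in, one int out.
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Explicit Function2: two Integer inputs, one Integer output.
        Integer sum = numbers.reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return add(a, b);
            }
        });

        // The same reduce written as a lambda.
        Integer sumLambda = numbers.reduce((a, b) -> a + b);

        System.out.println(sum + " " + sumLambda); // both sums are 10
        sc.close();
    }
}
```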

Example

lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) throws Exception {
        return line.startsWith("Friday");
    }
});

lines.filter(line -> line.startsWith("Friday"));

lines.filter(new StartsWithFriday());

static class StartsWithFriday implements Function<String, Boolean> {
    public Boolean call(String line) {
        return line.startsWith("Friday");
    }
}

set operations

  • sample: the sample operation creates a random sample from an RDD.
sample(boolean withReplacement, double fraction)
  • distinct: returns the distinct rows from the input RDD. Expensive, since it requires shuffling all the data across partitions to ensure that we receive only one copy of each element.

  • union: returns an RDD consisting of the data from both input RDDs. Will keep duplicates.

  • intersection: returns only the elements present in both RDDs; also removes all duplicates.

  • subtract: returns the elements of the source RDD that do not appear in the other RDD.

  • cartesian: returns all possible pairs (a, b) where a is in the source RDD and b is in the other RDD.

JavaRDD<String> aggregatedLogLines = julyFirstLogs.union(augustFirstLogs);
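Put together, the set operations above can be sketched on two small in-memory RDDs (the log-line contents are made up for illustration):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SetOps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("setOps").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> july = sc.parallelize(Arrays.asList("GET /a", "GET /b", "GET /a"));
        JavaRDD<String> august = sc.parallelize(Arrays.asList("GET /b", "GET /c"));

        JavaRDD<String> union = july.union(august);         // keeps duplicates: 5 elements
        JavaRDD<String> common = july.intersection(august); // just "GET /b", deduplicated
        JavaRDD<String> julyOnly = july.subtract(august);   // lines only in july
        JavaRDD<String> unique = union.distinct();          // 3 distinct lines, needs a shuffle
        JavaRDD<String> sampled = union.sample(false, 0.5); // random ~50% sample, no replacement

        System.out.println(unique.collect());
        sc.close();
    }
}
```

Note that sample is non-deterministic: the fraction is only an expected proportion, so the sampled RDD's size varies from run to run.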

## Actions

  • collect: retrieves the entire RDD and returns it to the driver program as a regular collection or value, e.g. a String RDD becomes a list of Strings.
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.collect();
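Since collect pulls the whole RDD into driver memory, on a large dataset an action like take(n) is the safer way to peek at a few elements. A minimal sketch contrasting the two:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CollectVsTake {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("collectVsTake").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> wordRDD =
            sc.parallelize(Arrays.asList("spark", "hadoop", "hive", "pig"));

        List<String> all = wordRDD.collect();    // entire RDD materialized on the driver
        List<String> firstTwo = wordRDD.take(2); // only the first two elements

        System.out.println(all.size() + " " + firstTwo.size()); // 4 2
        sc.close();
    }
}
```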