Spark configuration
// run locally with 2 worker threads; name the application "airports"
SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
// load a text file as an RDD of lines
JavaRDD<String> airports = sc.textFile("in/airports.text");
Creating RDDs
rdd = sc.parallelize([1, 2, 3, 4])
rdd = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
- or from s3n:// and hdfs:// URIs
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
can also create from:
- JDBC
- Cassandra
- HBase
- Elasticsearch
- JSON, CSV, sequence files…
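A minimal Java sketch of the same creation patterns, assuming the `sc` context from the configuration section; the file paths are hypothetical, and the sequence-file example assumes Hadoop Writable key/value classes matching how the file was written:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import java.util.Arrays;

// from an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// from a text file; local, s3n:// and hdfs:// URIs all work
JavaRDD<String> lines = sc.textFile("file:///c:/users/frank/gobs-o-text.txt");

// from a Hadoop sequence file of (Text, IntWritable) records
JavaPairRDD<Text, IntWritable> records =
        sc.sequenceFile("in/some-sequence-file", Text.class, IntWritable.class);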
Transformations and Actions
## Transformations
Transformations will return a new RDD
map and filter
- filter
JavaRDD<String> cleanedLines = lines.filter(line -> !line.isEmpty());
- map
JavaRDD<String> URLs = sc.textFile("in/urls.text");
URLs.map(url -> makeHttpRequest(url));
// the return type of the map function is not necessarily the same as its input type
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<Integer> lengths = lines.map(line -> line.length());
airport example
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> airports = sc.textFile("in/airports.text");

    // keep only airports whose country (4th comma-separated field) is USA
    JavaRDD<String> airportsInUSA = airports.filter(line -> line.split(",")[3].equals("USA"));

    // project each line down to the airport name and its city name
    JavaRDD<String> airportsNameAndCityNames = airportsInUSA.map(line -> {
        String[] splits = line.split(",");
        return StringUtils.join(new String[]{splits[1], splits[2]}, ",");
    });

    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text");
}
- flatMap: first map, then flatten
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
function types
Function: one input and one output => map and filter
Function2: two inputs and one output => aggregate and reduce (a reduce sketch follows the example below)
FlatMapFunction: one input, 0 or more outputs => flatMap
Example
lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) throws Exception {
        return line.startsWith("Friday");
    }
});
lines.filter(line -> line.startsWith("Friday"));
lines.filter(new StartsWithFriday());
static class StartsWithFriday implements Function<String, Boolean> {
public Boolean call(String line) {
return line.startsWith("Friday");
}
}
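A minimal sketch of Function2 in action via reduce, assuming the `sc` context from the configuration section; the lambda form and the explicit Function2 form are equivalent:
import org.apache.spark.api.java.function.Function2;

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// lambda form: reduce combines two elements into one until a single value remains
Integer sum = numbers.reduce((a, b) -> a + b); // 10

// explicit Function2 form, pre-Java-8 style
Integer sumExplicit = numbers.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
});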
set operations
- sample: creates a random sample from an RDD.
sample(boolean withReplacement, double fraction)
- distinct: returns the distinct rows from the input RDD. Expensive, since it requires shuffling all the data across partitions to ensure that we receive only one copy of each element.
- union: returns an RDD consisting of the data from both input RDDs. Will keep the duplicates.
- intersection: returns only the elements present in both input RDDs, and removes all duplicates.
- subtract: returns the elements that appear in the first RDD but not in the other RDD.
- cartesian: returns all possible pairs of (a, b) where a is in the source RDD and b is in the other RDD.
JavaRDD<String> aggregatedLogLines = julyFirstLogs.union(augustFirstLogs);
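A short sketch exercising these set operations on two small in-memory RDDs; the sample values are made up for illustration, and `sc` and the JavaPairRDD import are assumed from the earlier examples:
JavaRDD<String> julyFirstLogs = sc.parallelize(Arrays.asList("error", "warning", "info"));
JavaRDD<String> augustFirstLogs = sc.parallelize(Arrays.asList("error", "debug"));

// union keeps duplicates: error, warning, info, error, debug
JavaRDD<String> unioned = julyFirstLogs.union(augustFirstLogs);

// intersection returns the common elements, de-duplicated: error
JavaRDD<String> common = julyFirstLogs.intersection(augustFirstLogs);

// subtract returns the elements only in the first RDD: warning, info
JavaRDD<String> julyOnly = julyFirstLogs.subtract(augustFirstLogs);

// cartesian pairs every element of one RDD with every element of the other (6 pairs here)
JavaPairRDD<String, String> pairs = julyFirstLogs.cartesian(augustFirstLogs);

// sample without replacement, keeping roughly half of the elements
JavaRDD<String> sampled = julyFirstLogs.sample(false, 0.5);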
## Actions
- collect: retrieves the entire RDD and returns it to the driver program in the form of a regular collection or value; a String RDD, for example, becomes a list of Strings. Use it only when the result is expected to fit in the driver's memory.
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.collect();