Spark configuration
// run locally with 2 worker threads; name the application "airports"
SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
// load a text file as an RDD of lines
JavaRDD<String> airports = sc.textFile("in/airports.text");
Creating RDDs
rdd = sc.parallelize([1, 2, 3, 4])
rdd = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
- or from s3n:// and hdfs:// URIs
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
can also create from:
- JDBC
- Cassandra
- HBase
- Elasticsearch
- JSON, CSV, sequence files…
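A minimal Java sketch of the same creation patterns, assuming the `sc` context from the configuration section; the file paths are hypothetical, and the sequence-file example assumes Hadoop Writable key/value classes matching how the file was written:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import java.util.Arrays;

// from an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// from a text file; local, s3n:// and hdfs:// URIs all work
JavaRDD<String> lines = sc.textFile("file:///c:/users/frank/gobs-o-text.txt");

// from a Hadoop sequence file of (Text, IntWritable) records
JavaPairRDD<Text, IntWritable> records =
        sc.sequenceFile("in/some-sequence-file", Text.class, IntWritable.class);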
Transformations and Actions
## Transformations
Transformations will return a new RDD
map and filter
- filter
JavaRDD<String> cleanedLines = lines.filter(line -> !line.isEmpty());
- map
JavaRDD<String> URLs = sc.textFile("in/urls.text");
URLs.map(url -> makeHttpRequest(url));
// the return type of the map function is not necessarily the same as its input type
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<Integer> lengths = lines.map(line -> line.length());
airport example
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("airports").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> airports = sc.textFile("in/airports.text");

    // keep only airports whose country (4th comma-separated field) is USA
    JavaRDD<String> airportsInUSA = airports.filter(line -> line.split(",")[3].equals("USA"));

    // project each line down to the airport name and its city name
    JavaRDD<String> airportsNameAndCityNames = airportsInUSA.map(line -> {
        String[] splits = line.split(",");
        return StringUtils.join(new String[]{splits[1], splits[2]}, ",");
    });

    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text");
}
- flatMap: first map, then flatten
JavaRDD<String> lines = sc.textFile("in/word_count.text");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
Map<String, Long> wordCounts = words.countByValue();
for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
System.out.println(entry.getKey() + " : " + entry.getValue());
}
function types
Function: one input and one output => map and filter
Function2: two inputs and one output => aggregate and reduce (a reduce sketch follows the example below)
FlatMapFunction: one input, 0 or more outputs => flatMap
Example
lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) throws Exception {
        return line.startsWith("Friday");
    }
});
lines.filter(line -> line.startsWith("Friday"));
lines.filter(new StartsWithFriday());
static class StartsWithFriday implements Function<String, Boolean> {
public Boolean call(String line) {
return line.startsWith("Friday");
}
}
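A minimal sketch of Function2 in action via reduce, assuming the `sc` context from the configuration section; the lambda form and the explicit Function2 form are equivalent:
import org.apache.spark.api.java.function.Function2;

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

// lambda form: reduce combines two elements into one until a single value remains
Integer sum = numbers.reduce((a, b) -> a + b); // 10

// explicit Function2 form, pre-Java-8 style
Integer sumExplicit = numbers.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
});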
set operations
- sample: creates a random sample from an RDD.
sample(boolean withReplacement, double fraction)
- distinct: returns the distinct rows from the input RDD. Expensive, since it requires shuffling all the data across partitions to ensure that we receive only one copy of each element.
- union: returns an RDD consisting of the data from both input RDDs. Will keep the duplicates.
- intersection: returns only the elements present in both input RDDs, and removes all duplicates.
- subtract: returns the elements that appear in the first RDD but not in the other RDD.
- cartesian: returns all possible pairs of (a, b) where a is in the source RDD and b is in the other RDD.
JavaRDD<String> aggregatedLogLines = julyFirstLogs.union(augustFirstLogs);
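A short sketch exercising these set operations on two small in-memory RDDs; the sample values are made up for illustration, and `sc` and the JavaPairRDD import are assumed from the earlier examples:
JavaRDD<String> julyFirstLogs = sc.parallelize(Arrays.asList("error", "warning", "info"));
JavaRDD<String> augustFirstLogs = sc.parallelize(Arrays.asList("error", "debug"));

// union keeps duplicates: error, warning, info, error, debug
JavaRDD<String> unioned = julyFirstLogs.union(augustFirstLogs);

// intersection returns the common elements, de-duplicated: error
JavaRDD<String> common = julyFirstLogs.intersection(augustFirstLogs);

// subtract returns the elements only in the first RDD: warning, info
JavaRDD<String> julyOnly = julyFirstLogs.subtract(augustFirstLogs);

// cartesian pairs every element of one RDD with every element of the other (6 pairs here)
JavaPairRDD<String, String> pairs = julyFirstLogs.cartesian(augustFirstLogs);

// sample without replacement, keeping roughly half of the elements
JavaRDD<String> sampled = julyFirstLogs.sample(false, 0.5);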
## Actions
- collect: retrieves the entire RDD and returns it to the driver program in the form of a regular collection or value; a String RDD, for example, becomes a list of Strings. Use it only when the result is expected to fit in the driver's memory.
List<String> inputWords = Arrays.asList("spark", "hadoop", "hive", "pig");
JavaRDD<String> wordRDD = sc.parallelize(inputWords);
List<String> words = wordRDD.collect();