What is Spark
Spark is a general-purpose distributed computing framework.
Official site: http://spark.apache.org/
Spark core concepts
An RDD is the abstraction of a distributed collection of data (structurally, a collection of rows/records).
DataFrames and Datasets are structured abstractions on top of RDDs (structurally, a two-dimensional table).
DStreams are the abstraction of a sequence of RDDs over time slices (structurally, row-structured data laid out along the time axis).
Structured Streaming is the structured abstraction of DataFrames and Datasets over time slices (structurally, a two-dimensional table laid out along the time axis, i.e. an unbounded table).
Spark SQL provides the ability to query these structured abstractions with SQL; it is an abstraction over the computation itself.
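For example, here is a minimal Spark SQL sketch (assuming a local SparkSession and a hypothetical people.json input file with one JSON object per line) that shows how a structured Dataset is queried with SQL:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // local SparkSession, the same entry point used later in this post
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("sparksql-example")
                .getOrCreate();
        // people.json is a hypothetical file with records like {"name":"a","age":20}
        Dataset<Row> people = spark.read().json("people.json");
        // register the Dataset as a temporary view and query it with SQL
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();
        spark.stop();
    }
}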
Spark entry points
// RDD API entry point: JavaSparkContext
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sc-test").setMaster("local"));
// Spark SQL entry point: SparkSession
SparkSession spark = SparkSession.builder().master("local").appName("sparksql-test").getOrCreate();
SparkContext sc_sql = spark.sparkContext();
// Spark Streaming entry point: JavaStreamingContext (5-second batch interval)
JavaStreamingContext ssc = new JavaStreamingContext(new SparkConf().setAppName("sc-test").setMaster("local"), Durations.seconds(5));
JavaSparkContext sc_streaming = ssc.sparkContext();
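As a quick sanity check of the first entry point (a sketch only, using the sc created above):
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// parallelize a local list into an RDD and count it
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
System.out.println("count = " + nums.count());   // prints count = 5
sc.stop();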
A Kafka + Spark example
1. Prerequisites:
1.1. The data source is produced as described in https://blog.csdn.net/weixin_42509545/article/details/81675027
1.2. Spark acts as the Kafka consumer, processes the data, and prints it to the screen; the next example will extend this to store the results in Redis.
2. Overview:
2.1. Use com.typesafe.config.ConfigFactory to load the Kafka and Spark configuration
2.2. Extract the Kafka settings from the loaded config and wrap them in an API
2.3. Extract the Spark settings from the loaded config and wrap them in an API
2.4. Use the objects from 2.2 and 2.3 to build the Spark Streaming context, consume the topic, and print it to the screen
3. Hands-on
3.1.1. The config file
streaming {
  name = "Java Streaming Analysis"
  interval = 5 # batch interval, in seconds
  master = "local[4]"
  topic = "user_pay" # name of the topic created in https://blog.csdn.net/weixin_42509545/article/details/81675027
}
kafka {
  metadata.broker.list = "bigdata:9092" # the host:port the Kafka broker listens on, configured in $kafka_home/config/server.properties
  auto.offset.reset = "smallest"
  group.id = "test-consumer-group" # the group.id configured in $kafka_home/config/consumer.properties
}
redis {
  server = "bigdata"
  port = "6379"
}
3.1.2. Reading the config file
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigApi {
    // load conf/kafka-spark-redis.conf from the classpath resources
    public static Config getTypesafeConfig(){
        return ConfigFactory.parseResources("conf/kafka-spark-redis.conf");
    }
}
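A quick sanity check that the config loads as expected (a sketch; the printed values should match the config file above):
Config conf = ConfigApi.getTypesafeConfig();
System.out.println(conf.getString("streaming.topic"));   // user_pay
System.out.println(conf.getString("kafka.group.id"));     // test-consumer-group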
3.2. The Kafka config API
import com.typesafe.config.Config;
import java.util.HashMap;
import java.util.Map;

public class KafkaApi {
    // build the parameter map expected by KafkaUtils.createDirectStream
    public Map<String,String> getKafkaParms(){
        Config kafkaConfig = ConfigApi.getTypesafeConfig().getConfig("kafka");
        Map<String,String> kafkaParms = new HashMap<String,String>();
        kafkaParms.put("metadata.broker.list", kafkaConfig.getString("metadata.broker.list"));
        kafkaParms.put("auto.offset.reset", kafkaConfig.getString("auto.offset.reset"));
        kafkaParms.put("group.id", kafkaConfig.getString("group.id"));
        return kafkaParms;
    }
}
3.3. The JavaStreamingContext API
import com.typesafe.config.Config;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingApi {
    public JavaStreamingContext getSsc(){
        Config sparkStreamingConfig = ConfigApi.getTypesafeConfig().getConfig("streaming");
        SparkConf conf = new SparkConf();
        conf.setAppName(sparkStreamingConfig.getString("name"));
        // let in-flight batches finish before the context shuts down
        conf.set("spark.streaming.stopGracefullyOnShutdown", "true");
        conf.setMaster(sparkStreamingConfig.getString("master"));
        Duration batchInterval = Durations.seconds(sparkStreamingConfig.getLong("interval"));
        return new JavaStreamingContext(conf, batchInterval);
    }
}
3.4.1. Connect, process, and print
import com.google.common.collect.Sets;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
import java.util.Map;
/**
* Created by zhangxianchao on 2018/8/18.
*/
public class SparkStreamingAna {
    public static void main(String[] args) {
        runAna();
    }

    public static void runAna(){
        // get the Kafka parameters
        KafkaApi kafkaApi = new KafkaApi();
        Map<String,String> kafkaParms = kafkaApi.getKafkaParms();
        // get the JavaStreamingContext
        SparkStreamingApi sparkStreamingApi = new SparkStreamingApi();
        JavaStreamingContext ssc = sparkStreamingApi.getSsc();
        // set the log level on the underlying SparkContext
        ssc.sparkContext().setLogLevel("WARN");
        // get the Kafka topic name
        String topic = ConfigApi.getTypesafeConfig().getString("streaming.topic");
        // build a JavaPairInputDStream with KafkaUtils.createDirectStream
        JavaPairInputDStream<String,String> input = KafkaUtils.createDirectStream(
                ssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParms,
                Sets.newHashSet(topic));
        // run the analysis
        processByShop(input);
        // Spark is lazy: nothing runs until ssc.start() is called
        ssc.start();
        try {
            ssc.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // done
        System.out.println("over");
    }
    // process the data in the JavaPairInputDStream
    public static void processByShop(JavaPairInputDStream<String,String> input){
        // turn the (key, value) pair stream into a stream of values
        JavaDStream<String> valueDStream = input.map(
                new Function<Tuple2<String,String>,String>(){
                    @Override
                    public String call(Tuple2<String, String> v1) throws Exception {
                        return v1._2();
                    }
                }
        );
        System.out.println("start print");
        // print the data to the screen with JavaDStream.foreachRDD
        valueDStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                stringJavaRDD.foreach(new VoidFunction<String>() {
                    @Override
                    public void call(String s) throws Exception {
                        System.out.println("haha1");
                        System.out.println(s);
                        System.out.println("haha2");
                    }
                });
            }
        });
        System.out.println("stop print");
    }
}
4. Running it
4.1. mvn package to build the jar
4.2. spark-submit --master local[3] --class com.arua.spark.SparkStreamingAna --name testshop ./shops-flow-analysis-0.1.0-SNAPSHOT-jar-with-dependencies.jar
--- the actual output is too long and is omitted here ---
With the above, Spark is now consuming the topic data from the Kafka broker via createDirectStream.
Next, we will have Spark write the data into the Redis cache.
A Spark + Redis example
To be continued
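As a rough preview of the Redis step (a sketch only, assuming the Jedis client is on the classpath and reusing the redis block from the config file above; the follow-up post will cover this properly), the idea is to open one connection per partition inside foreachRDD and update a counter for each record:
import com.typesafe.config.Config;
import redis.clients.jedis.Jedis;

// inside processByShop, after building valueDStream:
valueDStream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
    Config redisConfig = ConfigApi.getTypesafeConfig().getConfig("redis");
    // one connection per partition; Jedis connections are not serializable
    Jedis jedis = new Jedis(redisConfig.getString("server"),
            Integer.parseInt(redisConfig.getString("port")));
    while (records.hasNext()) {
        records.next();                  // consume the record
        jedis.incr("user_pay:count");    // hypothetical counter key
    }
    jedis.close();
}));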