Spark
star5610
SparkSQL basic usage

package com.spark.week3

import org.apache.spark.sql.SparkSession

object One {
  System.setProperty("hadoop.home.dir", "D:/soft/hadoop/hadoop-2.7.3")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appNam…
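The preview is cut off mid-call; a minimal runnable sketch of the same pattern, where the input path, view name, and query are assumptions:

package com.spark.week3

import org.apache.spark.sql.SparkSession

object One {
  System.setProperty("hadoop.home.dir", "D:/soft/hadoop/hadoop-2.7.3")

  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point for Spark SQL
    val spark = SparkSession.builder().master("local").appName("One").getOrCreate()
    // Read a JSON file into a DataFrame (path is an assumption)
    val df = spark.read.json("data/people.json")
    df.printSchema()
    // Register a temporary view and query it with SQL
    df.createOrReplaceTempView("people")
    spark.sql("select name, age from people where age > 20").show()
    spark.stop()
  }
}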
Consuming Kafka messages in Spark Streaming with the low-level direct API (Kafka 0.10)

package com.spark.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

//todo: consume Kafka messages in Spark Streaming using the low-level direct API -- d…
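A minimal sketch following the standard spark-streaming-kafka-0-10 direct-stream pattern; the broker address, topic, and group id are assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaDirect010 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirect010")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Consumer settings; broker address, group id, and topic are assumptions
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-demo",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // Direct stream: executors read Kafka partitions directly, no receiver
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("test"), kafkaParams))
    stream.map(_.value()).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}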
Word count over Kafka data in Spark Streaming, using the receiver-based API

package com.spark.streaming

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.…
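A minimal sketch of the receiver-based approach (spark-streaming-kafka-0-8), which goes through ZooKeeper; the quorum address, topic, and group id are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaReceiverWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaReceiverWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Receiver-based stream: ZooKeeper quorum and topic map (topic -> receiver threads)
    val zkQuorum = "node1:2181"
    val topics = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, zkQuorum, "spark-demo", topics).map(_._2)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}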
Consuming Kafka messages in Spark Streaming with the low-level direct API

package com.spark.streaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org…
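A minimal sketch of the Kafka 0.8 direct API, which talks to the brokers rather than ZooKeeper; broker list and topic are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirect08 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirect08")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Direct API: give it the broker list; offsets are tracked by Spark itself
    val kafkaParams = Map("metadata.broker.list" -> "node1:9092", "group.id" -> "spark-demo")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("test"))
    stream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}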
Integrating Spark Streaming with Flume: push mode

package com.spark.streaming

import java.net.InetSocketAddress

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
imp…
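A minimal push-mode sketch: Spark binds a receiver on a host/port and Flume's avro sink pushes events to it, so the Spark app must be running first. Host and port are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePush {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePush")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: listen on this host/port for events pushed by Flume's avro sink
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.33.100", 8888)
    // The event payload is a byte buffer; decode it into a line of text
    val lines = flumeStream.map(e => new String(e.event.getBody.array()))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}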
Integrating Spark Streaming with Flume: pull mode (poll)

package com.spark.streaming

import java.net.InetSocketAddress

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf…
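A minimal poll-mode sketch: Flume buffers events in a SparkSink and Spark pulls from it, so Flume starts first. The Flume host and port are assumptions:

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePoll {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePoll")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull mode: Spark polls the SparkSink(s) running inside the Flume agent
    val addresses = Seq(new InetSocketAddress("node1", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)
    val lines = flumeStream.map(e => new String(e.event.getBody.array()))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}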
Spark Streaming window functions: counting hot words within a time window

package com.spark.streaming

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Spark Streaming window functions: counting hot words within a time window
 */
object SparkStreamingTCPWindowHotWords {
  def main(args: Array[…
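A minimal sketch of the hot-words pattern: count over a sliding window, then sort each windowed batch by count. Host, port, window, and slide lengths are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HotWordsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("HotWords")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("linux-star", 9999)
    // Count words over a 10-second window that slides every 5 seconds
    val counts = lines.flatMap(_.split(" ")).map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(5))
    // Sort each windowed result by count, descending, and print the top 3
    counts.transform(rdd => rdd.sortBy(_._2, ascending = false))
      .foreachRDD(rdd => rdd.take(3).foreach(println))
    ssc.start()
    ssc.awaitTermination()
  }
}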
Spark Streaming window functions: counting word occurrences within a time window

package com.spark.streaming

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Spark Streaming window functions: counting word occurrences within a time window…
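The simpler windowed variant without sorting, as a self-contained sketch; host, port, and window sizes are assumptions (window and slide must be multiples of the batch interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Count every word seen in the last 10 seconds, recomputed every 5 seconds
    ssc.socketTextStream("linux-star", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(5))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}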
Spark Streaming over socket data: word count with results accumulated across batches

package com.spark.streaming

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Spark Streaming over socket data: word count with results accumulated across batches…
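Cross-batch accumulation is the classic updateStateByKey pattern, which requires a checkpoint directory. A minimal sketch; host, port, and checkpoint path are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // State must be checkpointed; directory is an assumption
    ssc.checkpoint("./checkpoint")
    // Add this batch's counts to the running total kept in state
    val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, state) => Some(newValues.sum + state.getOrElse(0))
    ssc.socketTextStream("linux-star", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey(updateFunc)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}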
Installing and using nc on Linux

What is nc: nc is short for netcat, renowned as the Swiss Army knife of networking. Small, sharp, and practical, it is designed to be a simple, reliable network tool.

What nc does:
(1) Listen on any TCP/UDP port: nc can act as a server listening on a given port over TCP or UDP.
(2) Port scanning: nc can act as a client initiating TCP or UDP connections.
(3) Transfer files between machines.
(4) Measure network speed between machines.

Install command:
[star@linux-star opt]$ yum install -y nc

In one terminal, type:
[star@linux-star opt]$ …
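The last command in the preview is cut off; a typical pairing, used to feed the Spark Streaming socket examples in this archive, looks like this (the port and hostname are assumptions):

# terminal 1: listen on TCP port 9999 and keep serving new connections
nc -lk 9999

# terminal 2 (or another machine): connect and type lines to send
nc linux-star 9999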
Spark Streaming over socket data: word count

package com.spark.streaming

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Spark Streaming over socket data: word count
 */
object SparkStreamingTCP {
  def main(args: Array[String]): Unit…
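A minimal self-contained sketch of the socket word count; the host and port (fed by nc -lk 9999, see the nc entry above) are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingTCP {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreamingTCP")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Read lines from a socket fed by `nc -lk 9999`; host is an assumption
    val lines = ssc.socketTextStream("linux-star", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}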
Sorting a Spark RDD in descending order

Test data:
1 1603A 95
2 1603B 85
3 1603C 75
4 1603D 96
5 1604F 94
6 1604E 95
7 1604K 91
8 1604G 89
9 1501A 79
10 1502A 69
11 1503A 59
12 1504A 89
13 1701A 99
14 1702A 100
15 1703A 65

Test results:
(1702A,100)
(1701A,99)
(1603D,96)
(1603A,95)
(1604E,95)
(1604F,94)
(1604…
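A minimal sketch that reproduces this output shape: parse each "id class score" line into (class, score) pairs and sort by score, descending. The input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object DescSort {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("DescSort"))
    // Each line: "id class score" (input path is an assumption)
    sc.textFile("scores.txt")
      .map { line => val f = line.split(" "); (f(1), f(2).toInt) }
      .sortBy(_._2, ascending = false) // descending by score
      .collect()
      .foreach(println)
    sc.stop()
  }
}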
Computing an average with a Spark UDAF

Test data:
{"name":"zhangsan","age":20}
{"name":"lisi","age":21}
{"name":"wangwu","age":22}
{"name":"zhaoliu","age":23}
{"name":"tianqi","age":24}

Test results:
+-----+------+
|count|ageavg|
+-----+------+
|    5|  22.0|
+-----+------+

package com.spark.week3

import org.apa…
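The preview is cut off; a self-contained sketch of the Spark 2.x UserDefinedAggregateFunction pattern that would produce this table (class and object names and the input path are assumptions):

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

class AgeAvgUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("age", LongType) :: Nil)
  override def bufferSchema: StructType =
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L; buffer(1) = 0L
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0) // running sum
      buffer(1) = buffer.getLong(1) + 1L               // running count
    }
  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = b1.getLong(0) + b2.getLong(0)
    b1(1) = b1.getLong(1) + b2.getLong(1)
  }
  override def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

object UdafAvg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("UdafAvg").getOrCreate()
    spark.udf.register("ageavg", new AgeAvgUDAF)
    spark.read.json("data/people.json").createOrReplaceTempView("people") // path is an assumption
    spark.sql("select count(*) as count, ageavg(age) as ageavg from people").show()
    spark.stop()
  }
}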
Spark: SQL-style processing through DataFrame operations

package com.spark.sql

import org.apache.spark.sql.{DataFrame, Encoder, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object DataOperation {
  System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.7.3")…
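A minimal sketch of the two equivalent styles, DataFrame DSL and SQL over a temporary view; the input path and query are assumptions:

import org.apache.spark.sql.SparkSession

object DataOperation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DataOperation").getOrCreate()
    import spark.implicits._
    val df = spark.read.json("data/people.json") // path is an assumption
    // DSL style: column expressions on the DataFrame
    df.select($"name", $"age" + 1).filter($"age" > 20).show()
    // SQL style: register a temporary view and query it
    df.createOrReplaceTempView("people")
    spark.sql("select name, age from people order by age desc").show()
    spark.stop()
  }
}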
Spark: reading various file formats into DataFrames and writing them out

package com.spark.sql

import org.apache
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Encoder, Row, SaveMode, SparkSession}

object DataS…
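A minimal sketch of reading and writing several formats, using the SaveMode seen in the preview's imports; all paths are assumptions:

import org.apache.spark.sql.{SaveMode, SparkSession}

object DataSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DataSource").getOrCreate()
    // Each reader returns a DataFrame; paths are assumptions
    val jsonDF    = spark.read.json("data/people.json")
    val csvDF     = spark.read.option("header", "true").csv("data/people.csv")
    val parquetDF = spark.read.parquet("data/people.parquet")
    // Write back out, overwriting any existing output
    jsonDF.write.mode(SaveMode.Overwrite).parquet("out/people.parquet")
    csvDF.write.mode(SaveMode.Overwrite).json("out/people.json")
    println(parquetDF.count())
    spark.stop()
  }
}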
Converting between RDD, Dataset, and DataFrame in Spark

package com.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql._

object Rdd2DataFrame…
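A minimal sketch of the round-trip conversions via spark.implicits._; the case class and sample data are assumptions:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object Rdd2DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("Conversions").getOrCreate()
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(Person("zhangsan", 20), Person("lisi", 21)))
    val df  = rdd.toDF()      // RDD -> DataFrame
    val ds  = df.as[Person]   // DataFrame -> Dataset
    val ds2 = rdd.toDS()      // RDD -> Dataset
    val backToRdd = ds.rdd    // Dataset -> RDD
    val backToDF  = ds.toDF() // Dataset -> DataFrame
    backToDF.show()
    println(backToRdd.count(), ds2.count())
    spark.stop()
  }
}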
IP address lookup in Spark

package com.spark.core

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * IP address lookup
 */
object IPLocation {
  System.setProperty("hadoop.home.dir", "D:\\sof…
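The JDBC imports suggest the original also writes results to MySQL; a minimal sketch of just the lookup core, converting dotted IPs to Long and binary-searching sorted (start, end, region) ranges. The rule data and field layout are assumptions:

object IpLookupSketch {
  // "1.2.3.4" -> a single Long, so ranges can be compared numerically
  def ip2Long(ip: String): Long =
    ip.split("\\.").foldLeft(0L)((acc, part) => acc * 256 + part.toLong)

  // Binary search over ranges sorted by start address; -1 if no range matches
  def binarySearch(rules: Array[(Long, Long, String)], ip: Long): Int = {
    var lo = 0
    var hi = rules.length - 1
    while (lo <= hi) {
      val mid = (lo + hi) / 2
      if (ip >= rules(mid)._1 && ip <= rules(mid)._2) return mid
      else if (ip < rules(mid)._1) hi = mid - 1
      else lo = mid + 1
    }
    -1
  }

  def main(args: Array[String]): Unit = {
    // Sample rule; real rules would be loaded from a file and broadcast
    val rules = Array((ip2Long("1.0.1.0"), ip2Long("1.0.3.255"), "Fujian"))
    val idx = binarySearch(rules, ip2Long("1.0.2.4"))
    println(if (idx >= 0) rules(idx)._3 else "unknown")
  }
}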
PV and UV in Spark

UV:
Test data:
192.168.33.16,hunter,2017-09-16 10:30:20,/a
192.168.33.16,jack,2017-09-16 10:30:40,/a
192.168.33.16,jack,2017-09-16 10:30:40,/a
192.168.33.16,jack,2017-09-16 10:30:40,/a
192.168.33.16,jack,2017-09-16 10:30:40,/a
192.168.33.18,polo,2017-09-16 10:3…
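A minimal sketch over this log shape: PV counts every request line, UV counts distinct visitors by the first field (the IP). The input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object PvUv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("PvUv"))
    val lines = sc.textFile("access.log") // path is an assumption
    // PV: every request line is one page view
    val pv = lines.count()
    // UV: distinct visitors, identified here by the IP in the first field
    val uv = lines.map(_.split(",")(0)).distinct().count()
    println(s"pv=$pv uv=$uv")
    sc.stop()
  }
}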
Secondary sort in Spark

package com.spark.core

import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkConf}

/**
 * Spark secondary sort
 **/
object SparkSecondarySort {
  System.setProperty("hadoop.home.dir", "d://soft/hadoop/hadoop-2.7.3")
  def main(args: A…
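A common way to implement secondary sort is a composite key implementing Ordered, then sortByKey; the field layout (ascending first field, descending second) and the input path are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

// Composite key: ascending on the first field, descending on the second
case class SecondarySortKey(first: Int, second: Int) extends Ordered[SecondarySortKey] {
  override def compare(that: SecondarySortKey): Int =
    if (this.first != that.first) this.first - that.first
    else that.second - this.second
}

object SparkSecondarySortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("SecondarySort"))
    // Each line: "first second" (input path is an assumption)
    sc.textFile("pairs.txt")
      .map { line => val f = line.split(" "); (SecondarySortKey(f(0).toInt, f(1).toInt), line) }
      .sortByKey()
      .map(_._2)
      .collect()
      .foreach(println)
    sc.stop()
  }
}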
Grouped TopN in Spark

Data:
zhangsan chinese 80
zhangsan math 90
zhangsan english 85
lisi chinese 90
lisi math 80
lisi english 90
wangwu chinese 84
wangwu math 89
wangwu english 70
maliu chinese 82
maliu math 75
maliu english 100

Results:
math:
90
89
80
chinese:
90
84
82
english:
100
…
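A minimal sketch that reproduces this output: group scores by subject, sort each group descending, and keep the top 3. The input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object GroupTopN {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("GroupTopN"))
    // Each line: "name subject score" (input path is an assumption)
    sc.textFile("scores.txt")
      .map { line => val f = line.split(" "); (f(1), f(2).toInt) }
      .groupByKey()
      // For each subject, sort descending and keep the top 3 scores
      .mapValues(_.toList.sortBy(s => -s).take(3))
      .collect()
      .foreach { case (subject, top) =>
        println(subject + ":")
        top.foreach(println)
      }
    sc.stop()
  }
}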
TopN in Spark

package com.spark.core

import org.apache.spark.{SparkConf, SparkContext}

// orderid,userid,money,productid
object TopN {
  System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.7.3")
  def main(args: Array[String]): Unit = {
    val conf = n…
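A minimal sketch over the "orderid,userid,money,productid" layout named in the preview's comment: takeOrdered with a reversed ordering returns the N largest amounts without a full sort. The input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object TopNSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TopN"))
    // Each line: orderid,userid,money,productid (path is an assumption)
    val top3 = sc.textFile("orders.txt")
      .map { line => val f = line.split(","); (f(2).toDouble, line) }
      // Order by negated amount so the largest values come first
      .takeOrdered(3)(Ordering.by[(Double, String), Double](t => -t._1))
    top3.foreach(println)
    sc.stop()
  }
}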
Spark Streaming + Flume: poll/push configuration and startup

poll: start Flume first, then the project; drop data into the watched location and the console prints the output:
bin/flume-ng agent -n a1 -c conf -f conf/flume-poll.conf -Dflume.root.logger=INFO,console

push: (hostname: use the Windows machine's IP address) start the IDEA project first, then Flume; drop data into the watched location and the console prints the output:
bin/flume-ng age…
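A sketch of what conf/flume-poll.conf might contain, assuming a spooldir source; the names, paths, and port are assumptions. Poll mode needs the spark-streaming-flume-sink jar on Flume's classpath so the SparkSink class is available; push mode would instead use an avro sink pointing at the Spark application's host and port:

# agent a1: spooldir source -> memory channel -> SparkSink (poll mode)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/flume/data
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = node1
a1.sinks.k1.port = 8888
a1.sinks.k1.channel = c1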
Two ways to run Spark word count on Linux

Word count, the spark-shell way:

Read a local file:
val lineRdd = sc.textFile("file:/opt/spark-2.4.5/aa.txt")

Split each line into words:
val wordRdd = lineRdd.flatMap(line => line.split(" "))
lineRdd.map(line => line.split(" ")).collect

Turn each word into a key-value pair:
val pairRdd = wordRdd.map(word => (word, 1))

Group and accumulate:
v…
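The preview is cut off at the accumulation step; the remaining shell lines presumably reduce by key and collect, and the second way is submitting a packaged jar with spark-submit. The class name, jar name, and argument below are assumptions:

val resultRdd = pairRdd.reduceByKey((a, b) => a + b)
resultRdd.collect

bin/spark-submit --class com.spark.core.WordCount --master local[2] spark-demo.jar file:/opt/spark-2.4.5/aa.txt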
Finding each year's maximum temperature with Spark

Data:
1990-01-01 -5
1990-06-18 35
1990-03-20 8
1989-05-04 23
1989-11-11 -3
1989-07-05 38
1990-07-30 37

import org.apache.spark.{SparkConf, SparkContext}

object MaxTemp extends App {
  System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.7.3")…
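The code is cut off; a minimal sketch of the idea: key each reading by its year prefix and keep the maximum per key. The input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object MaxTempSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("MaxTemp"))
    // Each line: "yyyy-MM-dd temperature" (input path is an assumption)
    sc.textFile("temps.txt")
      .map { line => val f = line.split(" "); (f(0).substring(0, 4), f(1).toInt) }
      .reduceByKey((a, b) => math.max(a, b)) // keep the highest reading per year
      .collect()
      .foreach(println)
    sc.stop()
  }
}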
Spark: defining and calling a custom higher-order function

object Test2 {
  def main(args: Array[String]): Unit = {
    val list = List("kitty", "snoopy", "scala")
    superFun(subFun, list)
  }

  // Method 1
  def subFun(x: String): String = {
    "hello" + x
  }

  // Method 2
  //val subFun = (x: String) => "hello" + x

  val s…
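The preview cuts off before superFun itself; a complete sketch of the pattern, where superFun's definition is an assumption based on its call site (a function parameter applied to each list element):

object Test2Sketch {
  // Higher-order function: takes a String => String and applies it to every element
  def superFun(f: String => String, list: List[String]): Unit =
    list.map(f).foreach(println)

  def subFun(x: String): String = "hello " + x

  def main(args: Array[String]): Unit = {
    superFun(subFun, List("kitty", "snoopy", "scala"))
    // A function literal works the same way
    superFun(name => "hi " + name, List("kitty", "snoopy", "scala"))
  }
}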
Spark: basic usage of filter

import java.util.Calendar
import scala.collection.mutable.ListBuffer

object Test1 {
  def main(args: Array[String]): Unit = {
    var list: ListBuffer[Tuple3[String, String, Int]] = ListBuffer()
    list.+=(("张三", "男", 1998))
    list.+=(("李四", "女", 1997))…
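The Calendar import suggests the filter works on ages derived from the birth-year field; a minimal sketch of filter on such tuples, where the predicate itself is an assumption:

import java.util.Calendar
import scala.collection.mutable.ListBuffer

object Test1Sketch {
  def main(args: Array[String]): Unit = {
    val list: ListBuffer[(String, String, Int)] = ListBuffer(("张三", "男", 1998), ("李四", "女", 1997))
    val year = Calendar.getInstance().get(Calendar.YEAR)
    // filter keeps only the elements for which the predicate is true:
    // here, people at least 25 years old based on their birth year
    val adults = list.filter { case (_, _, born) => year - born >= 25 }
    adults.foreach(println)
  }
}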
Finding the maximum and minimum with Spark

Code:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object MaxAndMin extends App {
  System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.7.3")
  val conf = new SparkConf().setMaster("local[*]").setAppName("MaxAndM…
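The code is cut off; a minimal sketch using the Random import from the preview, with rdd.max()/rdd.min() doing the work. The data shape is an assumption:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object MaxAndMinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("MaxAndMin"))
    // 100 random integers in [0, 1000); the data shape is an assumption
    val rdd = sc.parallelize(Seq.fill(100)(Random.nextInt(1000)))
    println("max = " + rdd.max())
    println("min = " + rdd.min())
    sc.stop()
  }
}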
Computing average scores with Spark

Data:
zhangsan math 88
zhangsan china 78
zhangsan english 80
lisi math 99
lisi china 89
lisi english 82
wangwu math 66
wangwu china 96
wangwu english 84
zhaoliu math 77
zhaoliu china 67
zhaoliu english 86.55

Code:

import org.apache.spark.{SparkConf, SparkCont…
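The code is cut off; a minimal sketch averaging per subject with groupByKey (averaging per student works the same way, keyed on the name field). The input path and grouping key are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object AvgScore {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("AvgScore"))
    // Each line: "name subject score" (input path is an assumption)
    sc.textFile("scores.txt")
      .map { line => val f = line.split(" "); (f(1), f(2).toDouble) }
      .groupByKey()
      .mapValues(scores => scores.sum / scores.size)
      .collect()
      .foreach(println)
    sc.stop()
  }
}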
Sorting and deduplicating with Spark

import org.apache.spark.{SparkConf, SparkContext}

object SortAndDistinct {
  System.setProperty("hadoop.home.dir", "d://soft/hadoop/hadoop-2.9.2")
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("s…
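The code is cut off; a minimal sketch of the combination: distinct removes duplicates, sortBy orders ascending. The sample data is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object SortAndDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("SortAndDistinct"))
    val rdd = sc.parallelize(Seq(3, 1, 2, 3, 5, 1)) // sample data is an assumption
    // Deduplicate first, then sort ascending
    rdd.distinct().sortBy(x => x).collect().foreach(println)
    sc.stop()
  }
}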
Spark basic sort (with rank numbers)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Sort {
  System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.9.2")
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("lo…
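The code is cut off; a minimal sketch of one way to attach rank numbers: sort first, then number the results with zipWithIndex. The sample data is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object SortWithRank {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("Sort"))
    val rdd = sc.parallelize(Seq(("a", 7), ("b", 9), ("c", 5))) // sample data is an assumption
    // Sort descending by value, then attach a 1-based rank with zipWithIndex
    rdd.sortBy(_._2, ascending = false)
      .zipWithIndex()
      .map { case ((k, v), idx) => (idx + 1, k, v) }
      .collect()
      .foreach(println)
    sc.stop()
  }
}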
Spark word count

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word count
 */
object WordCount {
  // run locally
  //System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.9.2")
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkContext, the entry point for Spark Core
    v…
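The code is cut off after step 1; a minimal self-contained sketch of the full pipeline, where the input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // 1. The SparkContext is the entry point for Spark Core
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("WordCount"))
    // 2. Read, split into words, pair each with 1, and sum per word
    sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    sc.stop()
  }
}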