1. Preparation
- Tool preparation: I write the code directly in spark-shell. At the command line, enter

spark-shell

to start it. This assumes the environment variables are configured; if not, run bin/spark-shell from your Spark directory.
- File preparation: I use the

README.md

file that ships with Spark. If your Spark is a single-node setup, just start spark-shell from the Spark directory and reference the file by a relative path. I run in spark-on-yarn mode, so my default file system is HDFS, and I have already put the file to the corresponding HDFS path. Keep this difference in mind.
Once inside spark-shell, note these two lines:
Spark context available as 'sc'.
# The SparkContext can be accessed as 'sc'
Spark session available as 'spark'.
# The SparkSession can be accessed as 'spark'
2. Writing the code
First approach: read the file with the SparkContext, using the textFile method.
scala> val text = sc.textFile("README.md")
text: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> text.collect
res2: Array[String] = Array(# Apache Spark, "", Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., "", <http://spark.apache.org/>, "", "", ## Online Documentation, "", You can find the latest Spark documentation, including a programming, guide, on the [project web page](http://spark.apache.org/documentation.html)., This README file only contains basic setup instructions., "", ## Building Spark, "", Spark is built using [Apache Maven](...
// Read the file, producing an RDD[String]
scala> val flatMaped = text.flatMap(line => line.split(" "))
flatMaped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:25
scala> flatMaped.collect
res3: Array[String] = Array(#, Apache, Spark, "", Spark, is, a, fast, and, general, cluster, computing, system, for, Big, Data., It, provides, high-level, APIs, in, Scala,, Java,, Python,, and, R,, and, an, optimized, engine, that, supports, general, computation, graphs, for, data, analysis., It, also, supports, a, rich, set, of, higher-level, tools, including, Spark, SQL, for, SQL, and, DataFrames,, MLlib, for, machine, learning,, GraphX, for, graph, processing,, and, Spark, Streaming, for, stream, processing., "", <http://spark.apache.org/>, "", "", ##, Online, Documentation, "", You, can, find, the, latest, Spark, documentation,, including, a, programming, guide,, on
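The transcript stops after flatMap; the usual next steps are to map each word to a (word, 1) pair and reduce by key. A hedged sketch of the rest (the name `counts` and the local mirror below are mine, not from the transcript):

```scala
// In spark-shell the remaining steps would look like:
//   val counts = flatMaped.map(word => (word, 1)).reduceByKey(_ + _)
//   counts.collect
// Below, the same pipeline on a plain Scala collection, so it runs without
// a cluster; groupMapReduce (Scala 2.13+) plays the role of reduceByKey.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                    // like text.flatMap(line => line.split(" "))
      .filter(_.nonEmpty)                       // drop the empty "" tokens seen above
      .groupMapReduce(identity)(_ => 1)(_ + _)  // like map(w => (w, 1)).reduceByKey(_ + _)
}
```

The shape is the same as the RDD version: flatMap splits lines into words, and the per-key reduction sums the ones, yielding one count per distinct word.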