Reading notes on Learning Spark: Lightning-Fast Big Data Analysis, chapters 1-2

Chapter 1: Introduction to Data Analysis with Spark

The components of Spark

  Spark Core contains the basic functionality of Spark. It is also home to the API that defines RDDs (resilient distributed datasets).

  Spark SQL (structured data) is the package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, and it supports many data sources, including Hive tables, Parquet, and JSON. It also lets developers intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala.

  Spark Streaming (real-time) enables processing of live streams of data.

  MLlib (machine learning) is a library of common machine learning functionality.

  GraphX (graph processing) is a library for manipulating graphs.

A Brief History of Spark

  Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers.

Chapter 2: Downloading Spark and Getting Started

  This chapter walks through the process of downloading Spark and running it in local mode on a single computer.

  You don't need to master Scala, Java, or Python. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 or newer. If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). At the time the book was written, Spark did not yet work with Python 3.

 

Downloading Spark: select the package type "Pre-built for Hadoop 2.4 and later".

Tips:

Windows users may run into issues installing Spark. You can use a zip tool to untar the .tar file. Note: install Spark in a directory with no spaces in its path (e.g. c:\spark).

After you untar the file you will get a new directory with the same name but without the final .tar suffix.

 damn it:

Most of this book includes code in all of Spark’s languages, but interactive shells are
available only in Python and Scala. Because a shell is very useful for learning the API, we recommend using one of these languages for these examples even if you are a Java
developer. The API is similar in every language.

Change into the Spark directory and type bin\pyspark; you will see the Spark logo.

 

Introduction to Core Spark Concepts

Driver program

  |----your application

  |----distributed datasets that you defined

  Usually we apply many operations to these datasets.

  ***In the preceding example, the driver program was the Spark shell itself, and you could type in the operations you wanted to run.

  ***The driver program accesses Spark through a SparkContext object, which represents a connection to a computing cluster. In the shell, a SparkContext is created for you automatically as the variable sc; in pyspark you can print information about this object by typing "sc". Each of the three languages (Java, Python, and Scala) has its own counterpart of SparkContext.

Once you have a SparkContext, you can use it to build RDDs, which offer many operations, such as count() and first(). To run these operations, the driver program typically manages a number of nodes called executors. When you call an operation such as count() on a cluster, different machines might count lines in different ranges of the file. Because we ran the Spark shell locally, it executed all of its work on a single machine.
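To make the idea of machines counting different ranges of a file more concrete, here is a toy model in plain Python. It is only a sketch and not how Spark is implemented: split_into_ranges, count_on_executor, and distributed_count are hypothetical names, the sample lines are made up, and the "executors" are ordinary function calls on one machine.

```python
# Toy model only: plain Python standing in for a driver and its executors.
# split_into_ranges, count_on_executor and distributed_count are made-up
# names for illustration, not Spark APIs.

def split_into_ranges(lines, num_ranges):
    """Cut the list of lines into num_ranges contiguous, near-equal slices."""
    size, extra = divmod(len(lines), num_ranges)
    ranges, start = [], 0
    for i in range(num_ranges):
        end = start + size + (1 if i < extra else 0)
        ranges.append(lines[start:end])
        start = end
    return ranges


def count_on_executor(line_range):
    """What one pretend executor does: count only the lines it was given."""
    return len(line_range)


def distributed_count(lines, num_ranges=3):
    """The pretend driver: hand out ranges, then add up the partial counts."""
    partial_counts = [count_on_executor(r) for r in split_into_ranges(lines, num_ranges)]
    return sum(partial_counts)


lines = ["# Apache Spark", "Spark is fast", "Python API", "Scala API", "Java API"]
print(distributed_count(lines))
```

In the local shell, all of these steps happen on a single machine, which is the situation the notes describe.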

Passing Functions to Spark

  Look at the following example in Python:

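# Run inside the pyspark shell, where sc already exists.
lines = sc.textFile("README.md")  # create an RDD from a text file
pythonLines = lines.filter(lambda line: "Python" in line)  # keep lines containing "Python"
pythonLines.first()  # return the first matching line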

If you are unfamiliar with the lambda syntax, it is a shorthand way to define functions inline in Python or Scala. You can also define the function separately and then pass its name to Spark, like this:

def hasPython(line):
    # Return True if the given line contains the string "Python".
    return "Python" in line

pythonLines = lines.filter(hasPython)
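As a quick check that the lambda form and the named-function form are interchangeable, the sketch below applies both to a plain Python list with the built-in filter, so it runs without Spark. The sample lines are made up for illustration, and has_python simply mirrors hasPython above.

```python
# Runs without Spark: the built-in filter() stands in for RDD.filter().
sample_lines = [
    "Spark has a Python API",
    "Scala is supported too",
    "Python 2.6 or newer",
]


def has_python(line):
    """Named-function form of the predicate."""
    return "Python" in line


# The same predicate written inline as a lambda and passed by name.
with_lambda = list(filter(lambda line: "Python" in line, sample_lines))
with_named = list(filter(has_python, sample_lines))

print(with_lambda == with_named)
print(with_named[0])
```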

Of course you can also write this in Java, but there functions are defined as classes that implement an interface called Function:

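import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// lines is an existing JavaRDD<String>, e.g. one created with textFile.
JavaRDD<String> pythonLines = lines.filter(new Function<String, Boolean>()
    {
        public Boolean call(String line)
        {
            // Keep only the lines that contain "Python".
            return line.contains("Python");
        }
    });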

Java 8 now supports lambda expressions, which make this more concise.

Spark automatically takes your function (e.g. line.contains("Python")) and ships it to the executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes.

 Standalone Applications

Apart from running Spark interactively, you can link Spark into standalone applications in Java, Python, or Scala. The main difference from using it in the shell is that you need to initialize your own SparkContext. After that, the API is the same. Remember that in the shell the SparkContext is created for you automatically as "sc", so you can use it directly.
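A minimal sketch of that initialization in Python, assuming pyspark is importable (for example because the script is launched through spark-submit). The "local" master, the app name, the contains_word helper, and the file-path argument are illustrative choices, not something these notes prescribe.

```python
import sys


def contains_word(word):
    """Build a predicate to hand to RDD.filter (hypothetical helper)."""
    return lambda line: word in line


def main(path):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    # "local" runs Spark on one machine; the app name is just a label.
    conf = SparkConf().setMaster("local").setAppName("My App")
    sc = SparkContext(conf=conf)  # in the shell this object already exists as sc
    try:
        lines = sc.textFile(path)
        print(lines.filter(contains_word("Python")).count())
    finally:
        sc.stop()  # release the context when the application is done


# Only start Spark when a file path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved as my_script.py, it could be started with something like bin\spark-submit my_script.py README.md; after the context is created, the filter and count calls are the same ones used in the shell.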

The process of linking to Spark varies by language. In Java and Scala, you give your application a Maven dependency on the spark-core artifact. Maven is a popular package management tool for Java-based languages that lets you link to libraries in public repositories. You can use Maven itself to build your project, or use other tools that can talk to the Maven repositories, including Scala's sbt or Gradle. Popular IDEs like Eclipse also allow you to add a Maven dependency directly to a project.

In Python, you simply write applications as Python scripts, but you must run them using the bin\spark-submit script included in Spark. The spark-submit script includes the Spark dependencies for us in Python, and it also sets up the environment for Spark's Python API to function. Simply run your script like this:

bin\spark-submit my_script.py

(Note that you have to use backslashes instead of forward slashes on Windows.)

 

Reposted from: https://www.cnblogs.com/OliverZhang/p/6079255.html

