A Brief Walkthrough of the Official Spark SQL Documentation

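This article walks through how to use Spark SQL: running SQL through the different access paths, creating DataFrames, using SparkSession, running SQL queries, working with data sources, save modes, partition discovery, Parquet files, JSON datasets, Hive table integration, and talking to other databases over JDBC. The examples cover operations such as filtering, grouping, transforming, and persisting data.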
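### Spark SQL
1. SQL: Spark SQL can be accessed from Python, from the command line, or over JDBC/ODBC.
2. Datasets and DataFrames: the Dataset API is available only in Java and Scala (as of Spark 2.2.0). A DataFrame can be constructed from structured data files, tables in Hive, external databases, or existing RDDs.
3. SparkSession: introduced in Spark 2.0. It makes it easy to run HiveQL queries and use Hive UDFs without an existing Hive setup.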


    ```
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    ```
4. Creating DataFrames: with a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
5. Untyped Dataset Operations (aka DataFrame Operations)
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)


# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+


# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+


# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+


# Count people by age
df.groupBy("age").count().show()
More operations:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
6. Running SQL Queries Programmatically
df.createOrReplaceTempView("people")


sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
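# Global Temporary View: unlike a session-scoped temporary view, it is shared across all sessions and kept alive until the Spark application terminates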
# Register the DataFrame as a global temporary view
df.createGlobalTempView("people")


# Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+