Spark Programming: DataFrame and SQL Basics - 1

1. The benefits of using Spark's built-in DataFrame class are as follows (a quick preview sketch follows this list):

1) Simpler syntax

2) DataFrames can be queried and manipulated with SQL statements

3) Operations are automatically distributed over the underlying RDDs
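
As a minimal preview of points 1) and 2) (this sketch assumes a SparkSession named spark has already been created, as shown further below):

# Minimal preview sketch: build a small DataFrame and query it with SQL
preview_df = spark.createDataFrame([("Andy", 30), ("Justin", 19)], ["name", "age"])
preview_df.createOrReplaceTempView("people")                   # register as a SQL temporary view
spark.sql("SELECT name FROM people WHERE age > 20").show()     # run plain SQL against the DataFrame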

 

First, dynamically configure the following environment in Python:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/usr/local/spark"
os.environ['JAVA_HOME']='/usr/lib/jvm/jdk1.8.0_162'

# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP']="192.168.2.138"

# Path for pyspark and py4j
sys.path.append("/usr/local/spark/python")
sys.path.append("/usr/local/spark/python/lib/py4j-0.10.7-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession


    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

current_file_path = 'file:///root/PycharmProjects/Helloworld/'    # the Helloworld project created in PyCharm
# sc = SparkContext('local')
# words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
# print(words.count())

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json(current_file_path + "people.json")      # read people.json from the Helloworld project directory
df.show()      # show() prints the table itself; wrapping it in print() would only add a stray None

 

The people.json file contains the following 3 lines:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

2. Creating a DataFrame

First we need to start a SparkSession:

In [1]:

from pyspark.sql import SparkSession

Then start the SparkSession

In [2]:

# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

You will first need to get the data from a file (or connect to a distributed file system such as HDFS; we'll talk about this later once we move to larger datasets on AWS EC2).

In [3]:

# We'll discuss how to read other formats and options later.
# This dataset is from Spark's examples

# Might be a little slow locally
df = spark.read.json('people.json')

Showing the data

In [4]:

# Note how data is missing!
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

In [5]:

df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

In [6]:

df.columns

Out[6]:

['age', 'name']

In [7]:

df.describe()

Out[7]:

DataFrame[summary: string, age: string]
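
Note that describe() by itself just returns another DataFrame of summary statistics; to actually see the numbers, call .show() on the result:

# describe() is lazy and only builds a summary DataFrame; show() triggers the computation
df.describe().show()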

Some data types make it easier to infer schema (like tabular formats such as csv which we will show later).
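
As a preview, schema inference on a tabular source might look like the following sketch (the CSV file name here is just a placeholder, not a file from this post):

# Hypothetical example: 'some_file.csv' is a placeholder name
df_csv = spark.read.csv('some_file.csv', header=True, inferSchema=True)
df_csv.printSchema()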

However, you often have to set the schema yourself if the .read method you are using doesn't have schema inference built in.

Spark has all the tools you need for this; it just requires a very specific structure:

In [8]:

from pyspark.sql.types import StructField,StringType,IntegerType,StructType

Next we need to create the list of StructField objects; each field takes:

* :param name: string, name of the field.
* :param dataType: :class:`DataType` of the field.
* :param nullable: boolean, whether the field can be null (None) or not.

In [9]:

data_schema = [StructField("age", IntegerType(), True),StructField("name", StringType(), True)]

In [10]:

final_struc = StructType(fields=data_schema)

In [11]:

df = spark.read.json('people.json', schema=final_struc)

In [12]:

df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Grabbing the data

In [13]:

df['age']

Out[13]:

Column<b'age'>

In [14]:

type(df['age'])

Out[14]:

pyspark.sql.column.Column

In [15]:

df.select('age')

Out[15]:

DataFrame[age: int]

In [16]:

type(df.select('age'))

Out[16]:

pyspark.sql.dataframe.DataFrame

In [17]:

df.select('age').show()
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+

In [18]:

# Returns list of Row objects
df.head(2)

Out[18]:

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
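
Individual values can then be pulled out of a Row object by attribute or by key, for example:

# Row objects support attribute access, key access, and conversion to a plain dict
first_row = df.head(2)[0]
print(first_row.name)        # 'Michael'
print(first_row['age'])      # None for this row
print(first_row.asDict())    # {'age': None, 'name': 'Michael'}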

Multiple Columns:

In [19]:

df.select(['age','name'])

Out[19]:

DataFrame[age: int, name: string]

In [20]:

df.select(['age','name']).show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Creating new columns

In [21]:

# Adding a new column with a simple copy
df.withColumn('newage',df['age']).show()
+----+-------+------+
| age|   name|newage|
+----+-------+------+
|null|Michael|  null|
|  30|   Andy|    30|
|  19| Justin|    19|
+----+-------+------+

In [22]:

df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
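
As the unchanged output above shows, withColumn does not modify df in place; DataFrames are immutable, so assign the result to a new variable if you want to keep the extra column:

# withColumn returns a brand new DataFrame; the original df is left untouched
df_newage = df.withColumn('newage', df['age'])
df_newage.show()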

In [23]:

# Simple Rename
df.withColumnRenamed('age','supernewage').show()
+-----------+-------+
|supernewage|   name|
+-----------+-------+
|       null|Michael|
|         30|   Andy|
|         19| Justin|
+-----------+-------+

More complicated operations to create new columns

In [24]:

df.withColumn('doubleage',df['age']*2).show()
+----+-------+---------+
| age|   name|doubleage|
+----+-------+---------+
|null|Michael|     null|
|  30|   Andy|       60|
|  19| Justin|       38|
+----+-------+---------+

In [25]:

df.withColumn('add_one_age',df['age']+1).show()
+----+-------+-----------+
| age|   name|add_one_age|
+----+-------+-----------+
|null|Michael|       null|
|  30|   Andy|         31|
|  19| Justin|         20|
+----+-------+-----------+

In [26]:

df.withColumn('half_age',df['age']/2).show()
+----+-------+--------+
| age|   name|half_age|
+----+-------+--------+
|null|Michael|    null|
|  30|   Andy|    15.0|
|  19| Justin|     9.5|
+----+-------+--------+

In [27]:

df.withColumn('half_age',df['age']/2)

Out[27]:

DataFrame[age: int, name: string, half_age: double]

We'll discuss much more complicated operations later on!

3. Using SQL

To use SQL queries directly against the DataFrame, you will need to register it as a temporary view:

In [28]:

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

In [29]:

sql_results = spark.sql("SELECT * FROM people")

In [30]:

sql_results

Out[30]:

DataFrame[age: int, name: string]

In [31]:

sql_results.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

In [32]:

spark.sql("SELECT * FROM people WHERE age=30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

We won't really be focusing on the SQL syntax for this course in general, but keep in mind it is always there to get you out of a bind quickly with your SQL skills!
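
For comparison, the same filter written with the DataFrame API instead of a SQL string:

# Equivalent filter using the DataFrame API rather than spark.sql()
df.filter(df['age'] == 30).show()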

Alright that is all we need to know for now!

 

Reference: https://github.com/tirthajyoti/Spark-with-Python/blob/master/Python-and-Spark-for-Big-Data-master/Spark_DataFrames/DataFrame_Basics.ipynb

 

 

 

 
