Spark Programming: DataFrame and SQL Basics - 1

1. The benefits of using Spark's built-in DataFrame class are as follows (a quick preview sketch follows this list):

1) Simpler syntax

2) DataFrames can be queried and manipulated with SQL statements

3) Operations are automatically distributed over the underlying RDDs
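
As a minimal preview of points 1) and 2) (this sketch assumes a SparkSession named spark has already been created, as shown further below):

# Minimal preview sketch: build a small DataFrame and query it with SQL
preview_df = spark.createDataFrame([("Andy", 30), ("Justin", 19)], ["name", "age"])
preview_df.createOrReplaceTempView("people")                   # register as a SQL temporary view
spark.sql("SELECT name FROM people WHERE age > 20").show()     # run plain SQL against the DataFrame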

 

First, dynamically configure the following environment in Python:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/usr/local/spark"
os.environ['JAVA_HOME']='/usr/lib/jvm/jdk1.8.0_162'

# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP']="192.168.2.138"

# Path for pyspark and py4j
sys.path.append("/usr/local/spark/python")
sys.path.append("/usr/local/spark/python/lib/py4j-0.10.7-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession


    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

current_file_path = 'file:///root/PycharmProjects/Helloworld/'    # the Helloworld project created in PyCharm
# sc = SparkContext('local')
# words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
# print(words.count())

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json(current_file_path + "people.json")      # read people.json from the Helloworld project directory
df.show()      # show() prints the table itself; wrapping it in print() would only add a stray None

 

The people.json file contains the following 3 lines:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

2. Creating a DataFrame

First we need to start a SparkSession:

In [1]:

from pyspark.sql import SparkSession

Then start the SparkSession

In [2]:

# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

You will first need to get the data from a file (or connect to a distributed file system such as HDFS; we'll talk about this later once we move to larger datasets on AWS EC2).

In [3]:

# We'll discuss how to read other formats and options later.
# This dataset is from Spark's examples

# Might be a little slow locally
df = spark.read.json('people.json')

Showing the data

In [4]:

# Note how data is missing!
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

In [5]:

df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

In [6]:

df.columns

Out[6]:

['age', 'name']

In [7]:

df.describe()

Out[7]:

DataFrame[summary: string, age: string]
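
Note that describe() by itself just returns another DataFrame of summary statistics; to actually see the numbers, call .show() on the result:

# describe() is lazy and only builds a summary DataFrame; show() triggers the computation
df.describe().show()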

Some data types make it easier to infer schema (like tabular formats such as csv which we will show later).
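
As a preview, schema inference on a tabular source might look like the following sketch (the CSV file name here is just a placeholder, not a file from this post):

# Hypothetical example: 'some_file.csv' is a placeholder name
df_csv = spark.read.csv('some_file.csv', header=True, inferSchema=True)
df_csv.printSchema()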

However, you often have to set the schema yourself if the .read method you are using doesn't have schema inference built in.

Spark has all the tools you need for this; it just requires a very specific structure:

In [8]:

from pyspark.sql.types import StructField,StringType,IntegerType,StructType

Next we need to create the list of StructField objects; each field takes:

* :param name: string, name of the field.
* :param dataType: :class:`DataType` of the field.
* :param nullable: boolean, whether the field can be null (None) or not.

In [9]:

data_schema = [StructField("age", IntegerType(), True),StructField("name", StringType(), True)]

In [10]:

final_struc = StructType(fields=data_schema)

In [11]:

df = spark.read.json('people.json', schema=final_struc)

In [12]:

df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Grabbing the data

In [13]:

df['age']

Out[13]:

Column<b'age'>

In [14]:

type(df['age'])

Out[14]:

pyspark.sql.column.Column

In [15]:

df.select('age')

Out[15]:

DataFrame[age: int]

In [16]:

type(df.select('age'))

Out[16]:

pyspark.sql.dataframe.DataFrame

In [17]:

df.select('age').show()
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+

In [18]:

# Returns list of Row objects
df.head(2)

Out[18]:

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
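
Individual values can then be pulled out of a Row object by attribute or by key, for example:

# Row objects support attribute access, key access, and conversion to a plain dict
first_row = df.head(2)[0]
print(first_row.name)        # 'Michael'
print(first_row['age'])      # None for this row
print(first_row.asDict())    # {'age': None, 'name': 'Michael'}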

Multiple Columns:

In [19]:

df.select(['age','name'])

Out[19]:

DataFrame[age: int, name: string]

In [20]:

df.select(['age','name']).show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Creating new columns

In [21]:

# Adding a new column with a simple copy
df.withColumn('newage',df['age']).show()
+----+-------+------+
| age|   name|newage|
+----+-------+------+
|null|Michael|  null|
|  30|   Andy|    30|
|  19| Justin|    19|
+----+-------+------+

In [22]:

df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
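
As the unchanged output above shows, withColumn does not modify df in place; DataFrames are immutable, so assign the result to a new variable if you want to keep the extra column:

# withColumn returns a brand new DataFrame; the original df is left untouched
df_newage = df.withColumn('newage', df['age'])
df_newage.show()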

In [23]:

# Simple Rename
df.withColumnRenamed('age','supernewage').show()
+-----------+-------+
|supernewage|   name|
+-----------+-------+
|       null|Michael|
|         30|   Andy|
|         19| Justin|
+-----------+-------+

More complicated operations to create new columns

In [24]:

df.withColumn('doubleage',df['age']*2).show()
+----+-------+---------+
| age|   name|doubleage|
+----+-------+---------+
|null|Michael|     null|
|  30|   Andy|       60|
|  19| Justin|       38|
+----+-------+---------+

In [25]:

df.withColumn('add_one_age',df['age']+1).show()
+----+-------+-----------+
| age|   name|add_one_age|
+----+-------+-----------+
|null|Michael|       null|
|  30|   Andy|         31|
|  19| Justin|         20|
+----+-------+-----------+

In [26]:

df.withColumn('half_age',df['age']/2).show()
+----+-------+--------+
| age|   name|half_age|
+----+-------+--------+
|null|Michael|    null|
|  30|   Andy|    15.0|
|  19| Justin|     9.5|
+----+-------+--------+

In [27]:

df.withColumn('half_age',df['age']/2)

Out[27]:

DataFrame[age: int, name: string, half_age: double]

We'll discuss much more complicated operations later on!

3. Using SQL

To use SQL queries directly against the DataFrame, you will need to register it as a temporary view:

In [28]:

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

In [29]:

sql_results = spark.sql("SELECT * FROM people")

In [30]:

sql_results

Out[30]:

DataFrame[age: int, name: string]

In [31]:

sql_results.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

In [32]:

spark.sql("SELECT * FROM people WHERE age=30").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

We won't really be focusing on the SQL syntax for this course in general, but keep in mind it is always there to get you out of a bind quickly with your SQL skills!
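
For comparison, the same filter written with the DataFrame API instead of a SQL string:

# Equivalent filter using the DataFrame API rather than spark.sql()
df.filter(df['age'] == 30).show()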

Alright that is all we need to know for now!

 

Reference: https://github.com/tirthajyoti/Spark-with-Python/blob/master/Python-and-Spark-for-Big-Data-master/Spark_DataFrames/DataFrame_Basics.ipynb

 

 

 

 
