Environment:
Python 3.6.5, taken from Anaconda3-5.2.0-Windows-x86_64 (Anaconda ships with many common Python libraries)
Spark 2.3.0, installed with pip install pyspark==2.3.0
The entry point of a Spark SQL program is SparkSession; a basic one is created with SparkSession.builder:
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName('spark0401') \
        .master('local[2]') \
        .getOrCreate()

    spark.stop()
A file can be read in either of two ways:
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName('spark0401') \
        .master('local[2]') \
        .getOrCreate()

    # Read the file; both ways work
    people_df1 = spark.read.json('E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    people_df2 = spark.read.format("json").load('E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    people_df1.show()
    people_df2.show()

    spark.stop()
Some basic DataFrame operations:
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName('spark0401') \
        .master('local[2]') \
        .getOrCreate()

    # Read the file; both ways work
    # people_df1 = spark.read.json('E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    people_df2 = spark.read.format("json").load(
        'E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    # people_df1.show()
    people_df2.show()
    # Print the schema as a tree
    people_df2.printSchema()
    # Select the name column
    people_df2.select('name').show()
    # Filter rows
    people_df2.filter(people_df2['age'] < 21).show()
    # Count rows grouped by age
    people_df2.groupBy('age').count().show()

    spark.stop()
To run Spark SQL, the DataFrame must first be registered as a temporary view:
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName('spark0401') \
        .master('local[2]') \
        .getOrCreate()

    # Read the file; both ways work
    # people_df1 = spark.read.json('E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    people_df2 = spark.read.format("json").load(
        'E:/Utils/myutils/spark-2.2.1-bin-hadoop2.6/examples/src/main/resources/people.json')
    # people_df1.show()
    """
    people_df2.show()
    # Print the schema as a tree
    people_df2.printSchema()
    # Select the name column
    people_df2.select('name').show()
    # Filter rows
    people_df2.filter(people_df2['age'] < 21).show()
    # Count rows grouped by age
    people_df2.groupBy('age').count().show()
    """
    # Register the DataFrame as a temporary view
    people_df2.createOrReplaceTempView('people')
    # Run SQL; the result is another DataFrame
    adult = spark.sql('select * from people where age > 18')
    adult.show()

    spark.stop()
Note that the Dataset API is not available in Python: PySpark only exposes DataFrames, whose rows are untyped Row objects rather than instances of a user-defined class.