Spark core classes: SQLContext and DataFrame

http://blog.csdn.net/pipisorry/article/details/53320669

pyspark.sql.SQLContext

Main entry point for DataFrame and SQL functionality.

[pyspark.sql.SQLContext]
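
As a minimal sketch of that entry point (the table name people and the two-row data are illustrative, echoing the examples later in this post):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[2]', appName='sqlcontext-demo')
sql_ctx = SQLContext(sc)

# Build a DataFrame and expose it to SQL under a temporary table name
df = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
df.registerTempTable('people')

# Query it through the same SQLContext
sql_ctx.sql('SELECT name FROM people WHERE age > 3').show()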




pyspark.sql.DataFrame

A distributed collection of data grouped into named columns.

Spark DataFrame vs. pandas DataFrame

Spark DataFrame operations largely mirror pandas DataFrame operations [Pandas小记(6)]; a few side-by-side equivalents are sketched below.
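
A hedged illustration of that correspondence, assuming spark_df and pandas_df hold the same age/name data:

# Row filtering
pandas_df[pandas_df.age > 3]
spark_df.filter(spark_df.age > 3)

# Column projection
pandas_df[['name', 'age']]
spark_df.select('name', 'age')

# Grouped aggregation (note the capital B in Spark's groupBy)
pandas_df.groupby('name').count()
spark_df.groupBy('name').count()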

Converting between the two

Converting from a pandas_df:

spark_df = sql_ctx.createDataFrame(pandas_df)  # called on a SQLContext instance, not on the class

For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[8]', appName='kmeans')
sql_ctx = SQLContext(sc)
lldf_df = sql_ctx.createDataFrame(lldf)  # lldf is a pandas DataFrame; the result is a Spark DataFrame, not an RDD
In addition, createDataFrame can build a spark_df from a list whose elements are tuples, dicts, or Rows, or directly from an RDD, as sketched below.
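
A hedged sketch of those variants, reusing the sc and sql_ctx created above:

from pyspark.sql import Row

# From a list of tuples, with explicit column names
df1 = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])

# From a list of dicts (keys become column names)
df2 = sql_ctx.createDataFrame([{'age': 2, 'name': 'Alice'}, {'age': 5, 'name': 'Bob'}])

# From a list of Rows
df3 = sql_ctx.createDataFrame([Row(age=2, name='Alice'), Row(age=5, name='Bob')])

# From an RDD
rdd = sc.parallelize([(2, 'Alice'), (5, 'Bob')])
df4 = sql_ctx.createDataFrame(rdd, ['age', 'name'])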

Converting from a spark_df:

pandas_df = spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

Note that this method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory.

This is only available if Pandas is installed and available.

>>> df.toPandas()  
   age   name
0    2  Alice
1    5    Bob

[Spark与Pandas中DataFrame对比(详细)]

Spark DataFrame methods

rdd

Returns the content as a pyspark.RDD of Row.
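
For example, reusing the two-row df from the doctests in this post (illustrative output):

>>> df.rdd.map(lambda row: row.name).collect()
[u'Alice', u'Bob']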

rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.

>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+

select(*cols)

Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.
>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.

This is a variant of select() that accepts SQL expressions.

>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]

toDF(*cols)

Returns a new DataFrame with the specified new column names.

Parameters: cols – list of new column names (string)
>>> df.toDF('f1', 'f2').collect()
[Row(f1=2, f2=u'Alice'), Row(f1=5, f2=u'Bob')]

persist(storageLevel=StorageLevel(False, True, False, False, 1))

Sets the storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, it defaults to MEMORY_ONLY.
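
A hedged usage sketch (the StorageLevel choice here is illustrative):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if needed
df.count()      # the first action materializes and caches the data
df.unpersist()  # release the storage when the cache is no longer needed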

[pyspark.sql.DataFrame]

from: http://blog.csdn.net/pipisorry/article/details/53320669
