import org.apache.spark.sql.DataFrameStatFunctions
import org.apache.spark.sql.functions._
Correlation coefficient
val df = Range(0,10,step=1).toDF("id").withColumn("rand1", rand(seed=10)).withColumn("rand2", rand(seed=27))
df: org.apache.spark.sql.DataFrame = [id: int, rand1: double ... 1 more field]
df.show
+---+-------------------+-------------------+
| id|              rand1|              rand2|
+---+-------------------+-------------------+
|  0|0.41371264720975787|  0.714105256846827|
|  1| 0.7311719281896606| 0.8143487574232506|
|  2| 0.9031701155118229| 0.5282207324381174|
|  3|0.09430205113458567| 0.4420100497826609|
|  4|0.38340505276222947| 0.9387162206758006|
|  5| 0.5569246135523511| 0.6398126862647711|
|  6| 0.4977441406613893| 0.9895498513115722|
|  7| 0.2076666106201438| 0.3398720242725498|
|  8| 0.9571919406508957|0.15042237695815963|
|  9| 0.7429395461204413| 0.7302723457066639|
+---+-------------------+-------------------+
df.stat.corr("rand1", "rand2", "pearson")
res24: Double = -0.10993962467082698
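To make clear what the "pearson" method measures, here is a minimal plain-Scala sketch of the Pearson correlation formula. The helper name `pearson` is hypothetical and is not part of the Spark API; it only illustrates the calculation on two in-memory sequences.

```scala
// Hypothetical helper for illustration only; not part of the Spark API.
// Computes the Pearson correlation coefficient of two equal-length sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty,
    "inputs must be non-empty and of equal length")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of products of deviations from the means (unnormalized covariance)
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  // Sums of squared deviations (unnormalized variances)
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}
```

Applied to the collected values of two numeric columns, this should agree with `df.stat.corr` up to floating-point rounding. A value near 0, as in the result above, indicates little linear relationship, which is expected for two independently seeded random columns.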
Viewing summary statistics of the data
val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")
// Summary statistics (count, mean, stddev, min, max) for the selected columns
val descrDF = data.describe("age", "yearsmarried", "religiousness", "education", "occupation", "rating")
descrDF: org.apache.spark.sql.DataFrame = [summary: string, age: string ... 5 m
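As a rough illustration of what `describe` reports for a single numeric column, the following plain-Scala sketch computes the same five statistics for an in-memory sequence. The helper name `summarize` is hypothetical and not part of Spark. Note that `describe` returns its values as strings (as the schema above shows), whereas this sketch keeps them as doubles.

```scala
// Hypothetical helper for illustration only; not part of the Spark API.
// Returns count, mean, sample standard deviation, min, and max of a sequence.
def summarize(values: Seq[Double]): Map[String, Double] = {
  require(values.length >= 2,
    "need at least two values for a sample standard deviation")
  val n = values.length
  val mean = values.sum / n
  // Sample standard deviation: divides by n - 1 rather than n
  val stddev = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / (n - 1))
  Map(
    "count"  -> n.toDouble,
    "mean"   -> mean,
    "stddev" -> stddev,
    "min"    -> values.min,
    "max"    -> values.max
  )
}
```

For example, `summarize(Seq(1.0, 2.0, 3.0, 4.0))` gives a count of 4, a mean of 2.5, a minimum of 1.0, and a maximum of 4.0.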
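This article shows how to use Spark's DataFrameStatFunctions for exploratory statistical analysis: computing correlation coefficients, inspecting data distributions, and counting frequencies. The examples compute the correlation between columns, basic summary statistics of the data, and the number of elements in each field, and sort the array elements.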