Reading the CSV file
Creating the DataFrame schema
Grouping and counting with the .groupBy(…) method
Descriptive statistics on numeric columns with the .describe() method:
skewness & dispersion
Reference: https://blog.csdn.net/weixin_39599711/article/details/79072691
import pyspark.sql.types as typ
Next, we read the data in.
# Read the CSV file
fraud = sc.textFile('ccFraud.csv.gz')
# Grab the header row
header = fraud.first()
# Drop the header, then split each row on commas and cast every element to an integer
fraud = fraud.filter(lambda row: row != header).map(lambda row: [int(elem) for elem in row.split(',')])
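The filter-and-parse step above can be checked with plain Python, no Spark required; the sample lines below are made up for illustration:

```python
# Hypothetical sample lines mimicking ccFraud.csv (header + two records)
lines = ['"custID","gender","balance"', '1,1,3000', '2,2,0']

header = lines[0]
parsed = [
    [int(elem) for elem in row.split(',')]
    for row in lines
    if row != header  # drop the header row, as the RDD filter does
]
print(parsed)  # → [[1, 1, 3000], [2, 2, 0]]
```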
Following, we create the schema for our DataFrame.
# Create the DataFrame schema
fields = [
    typ.StructField(h[1:-1], typ.IntegerType(), True)
    for h in header.split(',')
]
schema = typ.StructType(fields)
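Note the `h[1:-1]` slice: it strips the surrounding double quotes from each header token before using it as a field name. A quick illustration with a made-up header string:

```python
# Hypothetical quoted header line, like the first row of the CSV
header = '"custID","gender","state"'

# Slice off the leading and trailing quote from each token
names = [h[1:-1] for h in header.split(',')]
print(names)  # → ['custID', 'gender', 'state']
```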
Finally, we create our DataFrame.
# Create our DataFrame
fraud_df = spark.createDataFrame(fraud, schema)
Now that the DataFrame is ready, we can calculate basic descriptive statistics for our dataset.
# Inspect the schema
fraud_df.printSchema()
root
|-- custID: integer (nullable = true)
 |-- gender: integer (nullable = true)
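What the `.groupBy(…).count()` step from the outline will compute can be sketched in plain Python with a `Counter`; the rows below are invented for illustration (column 1 standing in for the gender code):

```python
from collections import Counter

# Invented parsed rows: [custID, gender]
rows = [[1, 1], [2, 2], [3, 2], [4, 1], [5, 1]]

# Tally rows per gender code, as groupBy('gender').count() would
counts = Counter(row[1] for row in rows)
print(dict(counts))  # → {1: 3, 2: 2}
```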