Spark MLlib Part 2: Basic Statistics

Summary statistics

MLlib provides column summary statistics for RDD[Vector] through the function colStats available in Statistics.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

val observations: RDD[Vector] = ... // an RDD of Vectors

// compute column summary statistics
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean of each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

In addition, MultivariateStatisticalSummary exposes the following members:

Abstract Value Members

count: Long - Sample size.
max: Vector - Maximum value of each column.
mean: Vector - Sample mean vector.
min: Vector - Minimum value of each column.
normL1: Vector - L1 norm of each column.
normL2: Vector - Euclidean magnitude (L2 norm) of each column.
numNonzeros: Vector - Number of nonzero elements (including explicitly presented zero values) in each column.
variance: Vector - Sample variance vector.
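To make these members concrete, here is a minimal sketch run against a tiny hypothetical dataset; it assumes a SparkContext named sc is already in scope (as in spark-shell), and the three rows are made up purely for illustration.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// three hypothetical observations with three columns each
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.count)    // 3, the number of rows (sample size)
println(summary.max)      // [3.0,30.0,300.0]
println(summary.min)      // [1.0,10.0,100.0]
println(summary.normL1)   // [6.0,60.0,600.0], column-wise L1 norm
println(summary.normL2)   // column-wise Euclidean norm
println(summary.variance) // [1.0,100.0,10000.0], sample variance per column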

Correlation

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
// method is not specified, Pearson's method will be used by default. 
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

val data: RDD[Vector] = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default. 
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
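For a self-contained illustration, the hedged sketch below fills in the elided RDDs with small hypothetical series (assuming sc is available); two perfectly linearly related series give a Pearson correlation of 1.0.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val seriesX = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
val seriesY = sc.parallelize(Array(11.0, 22.0, 33.0, 44.0, 55.0)) // exactly 11 * seriesX

// Pearson correlation of two perfectly linearly related series is 1.0
println(Statistics.corr(seriesX, seriesY, "pearson"))

// correlation matrix of a row-oriented RDD[Vector]; "spearman" gives rank correlation
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(5.0, 33.0, 366.0)
))
println(Statistics.corr(rows, "spearman"))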

Stratified sampling

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions

val sc: SparkContext = ...

val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
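As a concrete sketch (assuming sc is in scope), the keys and sampling fractions below are hypothetical; sampleByKey approximates the requested fraction per key in a single pass, while sampleByKeyExact takes additional passes to hit the exact sample size for each stratum.

val data = sc.parallelize(Seq(
  (1, 'a'), (1, 'b'),
  (2, 'c'), (2, 'd'), (2, 'e'),
  (3, 'f')
))
// desired sampling fraction for each key (stratum)
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// approximate sample: one pass, fractions hit only in expectation
val approxSample = data.sampleByKey(withReplacement = false, fractions)
// exact sample: extra passes, but an exact sample size per key
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
exactSample.collect().foreach(println)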

Hypothesis testing

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val vec: Vector = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter, 
// the test runs against a uniform distribution.  
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom, 
                                 // test statistic, the method used, and the null hypothesis.

val mat: Matrix = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat) 
println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...

val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature 
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
var i = 1
featureTestResults.foreach { result =>
    println(s"Column $i:\n$result")
    i += 1
} // summary of the test
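The three chiSqTest overloads above can be exercised end to end with the small hypothetical inputs below (assuming sc is available); the contingency matrix is given in column-major order, and the LabeledPoint feature values are treated as categorical.

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// goodness of fit against the default uniform distribution
val vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
println(Statistics.chiSqTest(vec))

// independence test on a 3x2 contingency matrix (values in column-major order)
val mat = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
println(Statistics.chiSqTest(mat))

// per-feature independence test against the label
val obs = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, 0.0, -0.5))
))
Statistics.chiSqTest(obs).zipWithIndex.foreach { case (result, i) =>
  println(s"Column ${i + 1}:\n$result")
}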

Statistics also provides methods to run a 1-sample, 2-sided Kolmogorov-Smirnov (KS) test for equality of probability distributions:

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
                    // and null hypothesis
                    // if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
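A runnable sketch of both variants, with a hypothetical sample and a hand-written CDF (here the CDF of Uniform(0, 1)), assuming sc is available:

import org.apache.spark.mllib.stat.Statistics

val sample = sc.parallelize(Array(0.1, 0.15, 0.2, 0.3, 0.25, -0.4, 0.9))

// test the sample against a standard normal distribution N(0, 1)
println(Statistics.kolmogorovSmirnovTest(sample, "norm", 0, 1))

// test the sample against a user-supplied CDF, here the CDF of Uniform(0, 1)
val uniformCDF: Double => Double = x => math.min(math.max(x, 0.0), 1.0)
println(Statistics.kolmogorovSmirnovTest(sample, uniformCDF))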

Random data generation

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
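As a quick sanity check on the transform above: since v is meant to follow N(1, 4), its sample mean should be close to 1 and its sample standard deviation close to 2 (mean() and stdev() come from DoubleRDDFunctions on an RDD[Double]).

println(v.mean())  // close to 1.0
println(v.stdev()) // close to 2.0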

Kernel density estimation

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
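To run the snippet end to end, the sketch below substitutes a small hypothetical sample (assuming sc is available); query points near the data clusters receive higher density estimates.

import org.apache.spark.mllib.stat.KernelDensity

val sample = sc.parallelize(Array(1.0, 1.5, 2.0, 2.5, 6.0, 7.0, 8.0))

val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(1.0)

// density estimates at three query points
val densities = kd.estimate(Array(-1.0, 2.0, 7.0))
densities.foreach(println)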