Big Data Analytics Notes 2

1 Linear Model

$$y_i = \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_{n_p} x_{i n_p} + \varepsilon_i, \quad i = 1, 2, \dots, n_d.$$
or,
$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise vector and $\theta$ is the vector of unknown parameters.

  • The linear model is parametric with $n_p$ parameters.
  • If an intercept $\theta_0$ is added, the design matrix $X$ gains an extra column of ones.
  • The design matrix has dimension $X:\ n_d \times n_p$, or $n_d \times (n_p + 1)$ when $\theta_0$ is included, with $n_d > n_p$.
  • The traditional assumption on the noise (i.i.d.): $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ i = 1, 2, \dots, n_d$.
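As a concrete illustration of this setup, here is a minimal NumPy sketch that builds a design matrix with an intercept column and simulates data from $y = X\theta + \varepsilon$; all names and numeric values (sample size, true parameters, noise level) are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_d, n_p = 100, 3                              # n_d observations, n_p covariates
X = rng.normal(size=(n_d, n_p))                # covariates
X = np.hstack([np.ones((n_d, 1)), X])          # prepend a column of ones for theta_0
theta_true = np.array([1.0, 2.0, -0.5, 0.3])   # (n_p + 1) parameters incl. intercept
sigma = 0.5                                    # noise standard deviation

eps = rng.normal(0.0, sigma, size=n_d)         # iid N(0, sigma^2) noise
y = X @ theta_true + eps                       # y = X theta + eps
```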

1.1 Parameter estimation

For parameter estimation of the linear model we can use either the cost function or the **maximum likelihood principle**. The two approaches lead to the same optimal solution for the parameter $\theta$; the difference is that the cost function reduces the discrepancy between predictions and data, while MLE maximizes the likelihood of observing the data given the parameters.

1.2 Cost Function

The goal: reduce the discrepancy between predictions and data
$\Longrightarrow$ minimize the MSE of the predictions.

$$MSE(\hat{y}) = \frac{1}{n_d}\|y-\hat{y}\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}\|y-X\hat{\theta}\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2$$
Let $J(\theta)$ be the cost function; find the estimate $\hat{\theta}$ of $\theta$ that minimizes $J(\theta)$, where
$$J(\theta) = \|y-X\theta\|^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$
so $\hat\theta = \underset{\theta}{\arg\min}\, J(\theta)$. This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes $\|\varepsilon\|^2 = \sum^{n_d}_{i=1}\varepsilon_i^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$, so the OLS estimator of $\theta$ is
$$\hat\theta = \underset{\theta}{\arg\min}\, \|\varepsilon(\theta)\|^2 = \underset{\theta}{\arg\min} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2.$$

Cost Function:

$$J(\theta) = \sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2,$$
where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $\|\varepsilon(\theta)\|^2$ coincide.
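A direct NumPy translation of this cost (reusing the `X` and `y` from the earlier sketch; the function name is just an assumption):

```python
def cost(theta, X, y):
    """Squared-error cost J(theta) = ||y - X theta||^2."""
    residual = y - X @ theta
    return residual @ residual
```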

Get $\hat\theta$:

$$\begin{cases} \text{gradient descent approximates } \hat\theta, & J(\theta) \text{ is convex} \\ \text{other numerical optimization schemes approximate } \hat\theta, & J(\theta) \text{ is not convex} \end{cases}$$

1.3 Normal equation to get $\hat\theta$

If $X^{\mathsf{T}}X$ is invertible,
$$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$$
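A hedged NumPy sketch of the normal equation; solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is the usual numerically safer choice:

```python
# theta_hat = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
# instead of inverting X^T X explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```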
Gauss-Markov assumptions

  • $E(\varepsilon_i) = 0$
  • $Var(\varepsilon_i) = \sigma^2$
  • $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$

Under these assumptions:

  • The columns of $X$ should be linearly independent, so that $X^{\mathsf{T}}X$ is non-singular.
  • $E(\hat\theta) = \theta$
  • $Var(\hat\theta) = \sigma^2 (X^{\mathsf{T}}X)^{-1}$
  • $\hat\theta \sim \mathcal{N}\left(\theta,\ \sigma^2 (X^{\mathsf{T}}X)^{-1}\right)$
    (under the assumption $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ i = 1, 2, \dots, n_d$)
  • The likelihood is
    $$\mathcal{L}(y, X \mid \theta, \sigma^2) = (2\pi\sigma^2)^{-n_d/2} \exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2\right)$$
  • The associated MLE is
    $$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y = \underset{\theta}{\arg\min}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$
    which coincides with the OLS estimator for normal noise.

Limitations of the normal equation for large $n_p$:

  • the computational cost of $(X^{\mathsf{T}}X)^{-1}$
  • singularity: with large $n_p$, the columns of $X$ may become highly correlated

1.4 Gradient Descent Process to get $\hat\theta$

The logic: start with $\theta = (\theta_0, \theta_1)$, then keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until reaching $\underset{\theta_0, \theta_1}{\min}\, J(\theta_0, \theta_1)$.

1.4.1 Algorithm
  1. Initialize $(\theta_1, \theta_2, \dots, \theta_n)$.
  2. Set $(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n) = (\theta_1, \theta_2, \dots, \theta_n)$.
  3. Update each $\theta_i$ with $\theta_i = \tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i} J(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n)$, where $a$ is the learning rate.
  4. Repeat steps 2 and 3 until $\theta_i$ converges.

Gradient descent algorithm for linear regression

The cost function for the multiple linear regression model:
$$J(\theta) = \frac{1}{2 n_d}\sum_{i=1}^{n_d}(x_i^{\mathsf{T}}\theta - y_i)^2$$
The updating step of gradient descent for linear regression with this cost function:
$$\theta_j^{(k+1)} = \theta_j^{(k)} - a\,\frac{1}{n_d}\sum_{i=1}^{n_d}(x_i^{\mathsf{T}}\theta^{(k)} - y_i)\, x_{ij}$$
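A minimal sketch of batch gradient descent with exactly this update rule (the zero initialization and the fixed number of iterations are assumptions; a tolerance-based stopping rule is sketched in 1.4.3):

```python
def gradient_descent(X, y, a=0.1, n_iter=1000):
    """Batch gradient descent for J(theta) = 1/(2 n_d) * sum_i (x_i^T theta - y_i)^2."""
    n_d, n_p = X.shape
    theta = np.zeros(n_p)                  # step 1: initialize theta
    for _ in range(n_iter):
        residual = X @ theta - y           # x_i^T theta - y_i for all i
        grad = X.T @ residual / n_d        # j-th component: 1/n_d * sum_i (x_i^T theta - y_i) x_ij
        theta = theta - a * grad           # simultaneous update of all theta_j
    return theta
```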

1.4.2 Learning Rate

The learning rate $a$ used in the algorithm should be a positive constant.

  • The learning rate affects both convergence and the speed of convergence:
    a learning rate that is too small may lead to slow convergence, while one that is too large may lead to non-convergence or divergence.
  • It can be selected using a validation set or cross-validation.

1.4.3 Stopping criterion

Check whether the change in the cost function is close to 0:

  • Absolute error tolerance:
    $$\varepsilon_{abs} = \left|J(\theta^{(k+1)}) - J(\theta^{(k)})\right|$$
  • Relative error tolerance:
    $$\varepsilon_{rel} = \left|\frac{J(\theta^{(k+1)}) - J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$$
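A small helper implementing these two checks (the tolerance values and combining them with `or` are assumptions):

```python
def has_converged(J_new, J_old, tol_abs=1e-8, tol_rel=1e-6):
    """Stop when the absolute or relative change of the cost is small enough."""
    eps_abs = abs(J_new - J_old)
    eps_rel = abs((J_new - J_old) / J_new) if J_new != 0 else 0.0
    return eps_abs < tol_abs or eps_rel < tol_rel
```

Inside the gradient-descent loop one would evaluate the cost before and after each update and break out once `has_converged` returns True.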

2 Logistic Regression

Output: $\{0, 1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis: $h_\theta(x_i) = g(x_i^{\mathsf{T}}\theta) = \dfrac{1}{1 + \exp(-x_i^{\mathsf{T}}\theta)}$

Cost function (single sample):
$$J(\theta_0, \theta_1) = \begin{cases} -\log\left(h_{(\theta_0, \theta_1)}(x)\right), & \text{if } y = 1 \\ -\log\left(1 - h_{(\theta_0, \theta_1)}(x)\right), & \text{if } y = 0 \end{cases}$$

General form of the cost function for $n$ samples:
$$J(\theta) = -\frac{1}{n}\sum^{n}_{i=1}\Big(y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\Big)$$
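The hypothesis, the $n$-sample cost, and the 0.5-threshold classification rule described below translate into NumPy as follows (a sketch: the function names are assumptions, and no numerical safeguards such as clipping the log arguments are included):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Cross-entropy cost: -(1/n) * sum( y*log(h) + (1 - y)*log(1 - h) )."""
    h = sigmoid(X @ theta)                 # h_theta(x_i) = g(x_i^T theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def predict(theta, X):
    """y_hat = 1 when x^T theta >= 0, i.e. when h_theta(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```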

Classification:

  • $x_i^{\mathsf{T}}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y} = 1$
    if $y = 1$, then $J(\theta) \rightarrow 0$;
    if $y = 0$, then $J(\theta) \rightarrow \infty$
  • $x_i^{\mathsf{T}}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y} = 0$
    if $y = 1$, then $J(\theta) \rightarrow \infty$;
    if $y = 0$, then $J(\theta) \rightarrow 0$

3 Avoid Overfitting

More overfitting $\Longrightarrow$ higher variance of the predictions.
More underfitting $\Longrightarrow$ higher bias of the predictions.

3.1 Penalizing parameters

$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 + \lambda\, r(\theta),$$
where $\lambda\, r(\theta)$ is the penalty term, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.

Regularization parameter $\lambda$:

  • As $\lambda \rightarrow 0$, the regularized regression estimates tend to the ordinary linear regression estimates.
  • As $\lambda \rightarrow \infty$, the parameters are penalized more and the model over-fits the data less.
  • Select $\lambda$ via cross-validation.

Parameter penalty functions $r(\theta)$

Some typical parameter penalty functions (compare with distance measures):

  • $d$-variation function: $r(\theta) = \left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
  • $L_1$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|$
  • Squared $L_2$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|^2$
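These penalty functions are one-liners in NumPy (a sketch; in practice the intercept $\theta_0$ is usually excluded from the penalty):

```python
def penalty_l1(theta):
    """L1 norm: sum_i |theta_i| (Lasso penalty)."""
    return np.sum(np.abs(theta))

def penalty_l2_squared(theta):
    """Squared L2 norm: sum_i theta_i^2 (ridge penalty)."""
    return np.sum(theta ** 2)

def penalty_d_variation(theta, d):
    """d-variation: (sum_i |theta_i|^d)^(1/d)."""
    return np.sum(np.abs(theta) ** d) ** (1.0 / d)
```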

3.2 Regression model with penalty function

3.2.1 Lasso regression

Lasso regression uses the $L_1$ norm as its penalty function. Lasso regression "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.

  • Cost function
    $$J_L(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_1 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|$$

  • Gradient of the cost function
    $$\frac{\partial J_L(\theta)}{\partial \theta_1}, \dots, \frac{\partial J_L(\theta)}{\partial \theta_{n_p}}$$

Note:
$$\frac{\partial}{\partial \theta_1}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 = -2\sum^{n_d}_{i=1} x_{i1}\,(y_i - x_i^{\mathsf{T}}\theta)$$

  • Lasso estimate
    $$\hat\theta_L = \underset{\theta}{\arg\min}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|\right)$$
    This has no closed-form solution, but $J_L(\theta)$ is convex, so it can be solved with the least angle regression (LARS) algorithm; see the sketch below.
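A hedged scikit-learn sketch: `LassoLars` fits the lasso via least angle regression. Note that scikit-learn scales the data-fit term by $1/(2 n_d)$, so its `alpha` corresponds to $\lambda$ only up to that scaling, and the value below is purely illustrative.

```python
from sklearn.linear_model import LassoLars

# scikit-learn minimizes (1 / (2 * n_d)) * ||y - X theta||^2 + alpha * ||theta||_1.
lasso = LassoLars(alpha=0.1, fit_intercept=False)   # alpha chosen arbitrarily here
lasso.fit(X, y)
print(lasso.coef_)    # some coefficients are typically exactly zero ("zeroed out")
```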
3.2.2 Ridge regression

Ridge regression uses the squared $L_2$ norm as its penalty function, so it does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.

  • Cost function
    $$J_R(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_2^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}\theta_j^2$$

  • Ridge estimate
    $$\hat\theta_R = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
    This is the closed-form solution; see the sketch below.
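A sketch of the closed-form ridge estimate in NumPy (the function name and the $\lambda$ value are assumptions):

```python
def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lambda * I)^{-1} X^T y."""
    n_p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_p), X.T @ y)

theta_ridge = ridge_estimate(X, y, lam=1.0)   # lambda = 1.0 is illustrative
```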

The ridge estimate is biased because:

$$\begin{aligned} E(\hat\theta_R) &= E\left((X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y\right) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}E(y) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X\theta \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}\left((X^{\mathsf{T}}X)^{-1}\right)^{-1}\theta \\ &= \left[(X^{\mathsf{T}}X)^{-1}(X^{\mathsf{T}}X + \lambda I)\right]^{-1}\theta \\ &= \left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^{\mathsf{T}}X$ is positive definite and $\lambda > 0$ by definition, so $\left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1} \neq I$ and therefore $E(\hat\theta_R) \neq \theta$.

Tips:
Although ridge regression does not perform variable selection, it performs grouped selection: if one variable among a group of correlated variables is selected, ridge regression automatically includes the whole group. Ridge regression can also resolve near multicollinearity.

3.2.3 Elastic net

The elastic net penalty is a compromise between Lasso and ridge.

  • Cost function
    $$J_E(\theta) = \|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$$
    If $X$ is an $n_d \times n_p$ design matrix,
    $$J_E(\theta) = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda_1\sum^{n_p}_{j=1}|\theta_j| + \lambda_2\sum^{n_p}_{j=1}\theta_j^2$$
  • Elastic net estimate
    $$\hat\theta_E = \underset{\theta}{\arg\min}\left(\|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2\right)$$

There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex and therefore has a unique minimum; see the sketch below.
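A hedged scikit-learn sketch for the elastic net; scikit-learn parameterizes the penalty via `alpha` and `l1_ratio` rather than $(\lambda_1, \lambda_2)$ and scales the data-fit term by $1/(2 n_d)$, so the values below are illustrative only.

```python
from sklearn.linear_model import ElasticNet

# scikit-learn minimizes
#   (1 / (2 * n_d)) * ||y - X theta||^2
#   + alpha * l1_ratio * ||theta||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||theta||_2^2,
# which maps to (lambda_1, lambda_2) only up to this reparameterization.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)  # illustrative values
enet.fit(X, y)
print(enet.coef_)
```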

4 Learning curve

Ways to increase prediction accuracy:

  • variable selection: increase or reduce the number of covariates
  • add polynomial features, e.g. $\{x^2, x_1 x_2\}$
  • regularized regression (Lasso, ridge regression and elastic nets)
  • collect more data

Learning curve:

  • A learning curve plots a metric of prediction accuracy, such as the cost function $J(\theta)$ or another error metric, as a function of a parameter that affects the metric (e.g. the training set size); a sketch follows below.
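A hedged sketch of computing a learning curve with scikit-learn's `learning_curve` helper, using the training-set size as the varied parameter and mean squared error as the accuracy metric (the estimator, the cross-validation folds, and the grid of sizes are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="neg_mean_squared_error",
)
print(sizes)                          # absolute training-set sizes used
print(-train_scores.mean(axis=1))     # average training MSE per size
print(-val_scores.mean(axis=1))       # average validation MSE per size
```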