# 协同过滤算法+相似度度量+交替最小二乘法

1.简介

(1).通过考察具有相同爱好的用户对相同物品的评分标准进行计算。

(2).考察具有相同特质的物品从而推荐给选择了某件物品的用户。

A和B是“志同道合”的基友（相似度很高），将A喜欢的物品推荐给B是合理的

2.问题

1.基于欧几里得距离的相似度计算

(1)欧几里得距离(Euclidean distance)：表示三维空间中两个点的真实距离。是一种基于用户之间直线距离的计算方式。

2.基于余弦角度的相似度计算

3.欧几里得相似度与余弦相似度的比较

4.余弦相似度实战

/**
* Created by Administrator on 2017/4/17.
*/

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.Map
object testVector {
val conf = new SparkConf()
.setMaster("local")
.setAppName("testVector");
val sc = new SparkContext(conf);
//将内存数据读入Spark系统中
val users = sc.parallelize(Array("aaa","bbb","ccc","ddd","eee"));
val films = sc.parallelize(Array("smzdm","ylxb","znh","nhsc","fcwr"));
//使用一个source嵌套map作为姓名，电影名和分值的评价
var source = Map[String,Map[String,Int]]();
//设置一个用于存放电影分的map
val filmSource = Map[String,Int]();
//设置电影评分
def getSource(): Map[String,Map[String,Int]] = {
val user1FilmSource = Map("smzdm" -> 2,"ylxb" -> 3,"znh" -> 1,"nhsc" -> 0,"fcwr" -> 1);
val user2FilmSource = Map("smzdm" -> 1,"ylxb" -> 2,"znh" -> 2,"nhsc" -> 1,"fcwr" -> 4);
val user3FilmSource = Map("smzdm" -> 2,"ylxb" -> 1,"znh" -> 0,"nhsc" -> 1,"fcwr" -> 4);
val user4FilmSource = Map("smzdm" -> 3,"ylxb" -> 2,"znh" -> 0,"nhsc" -> 5,"fcwr" -> 3);
val user5FilmSource = Map("smzdm" -> 5,"ylxb" -> 3,"znh" -> 1,"nhsc" -> 1,"fcwr" -> 2);
source += ("aaa" -> user1FilmSource);
source += ("bbb" -> user2FilmSource);
source += ("ccc" -> user3FilmSource);
source += ("ddd" -> user4FilmSource);
source += ("eee" -> user5FilmSource);
return source;
}
//两两计算分值，采用余弦相似性
def getCollaborateSource(user1:String,user2:String):Double = {
val user1FilmSource = source.get(user1).get.values.toVector;//获得第一个用户的评分
val user2FilmSource = source.get(user2).get.values.toVector;//获得第二个用户的评分
val member = user1FilmSource.zip(user2FilmSource).map(d => d._1 * d._2).reduce(_ + _).toDouble;//对分子部分进行计算
//求出分母第一个变量值
val temp1 = math.sqrt(user1FilmSource.map(num => {
math.pow(num,2);
}).reduce(_ + _));//进行叠加
//求出分母第二个变量值
val temp2 = math.sqrt(user2FilmSource.map(num => {
math.pow(num,2);
}).reduce(_ + _));//进行叠加
val denominator = temp1 * temp2;
return member / denominator;
}
def main(args: Array[String]): Unit = {
getSource();//初始化分数
val name = "bbb";
users.foreach(user =>{
println(name + "相对于" + user + "的相似性分数是：" + getCollaborateSource(name,user));
})
}
}


bbb相对于aaa的相似性分数是：0.7089175569585667
bbb相对于bbb的相似性分数是：1.0000000000000002
bbb相对于ccc的相似性分数是：0.8780541105074453
bbb相对于ddd的相似性分数是：0.6865554812287477
bbb相对于eee的相似性分数是：0.6821910402406466

1.最小二乘法

(1)概念：最小二乘法多项式曲线拟合，根据给定的m个点,并不要求这条曲线精确地经过这些点，而是曲线y=f(x)的近似曲线y= φ(x)。

(2)常见的曲线拟合方法：

1.使偏差绝对值之和最小

2.使偏差绝对值最大的最小

3.使偏差平方和最小

按偏差平方和最小的原则选取拟合曲线，并且采取二项式方程为拟合曲线的方法,称为最小二乘法。

(3)推导过程

1. 设拟合多项式为：

2. 各点到这条曲线的距离之和，即偏差平方和如下：

3. 为了求得符合条件的a值，对等式右边求ai偏导数，因而我们得到了：

.......

4. 将等式左边进行一下化简，然后应该可以得到下面的等式：

.......

5. 把这些等式表示成矩阵的形式，就可以得到下面的矩阵：

6. 将这个范德蒙得矩阵化简后可得到:

7. 也就是说X*B=Y，那么

，便得到了系数矩阵B，同时，我们也就得到了拟合曲线。

(4)java实现

public class TheleastSquareMethod {
//定义系数
private double[] x;
private double[] y;
private double[] weight;
private int n;
private double[] coefficient;

/**
* Constructor method.
*
* @param x
*            Array of x
* @param y
*            Array of y
* @param n
*            The order of polynomial
*/
public TheleastSquareMethod(double[] x, double[] y, int n) {
if (x == null || y == null || x.length < 2 || x.length != y.length
|| n < 2) {
throw new IllegalArgumentException(
"IllegalArgumentException occurred.");
}
this.x = x;
this.y = y;
this.n = n;
weight = new double[x.length];
for (int i = 0; i < x.length; i++) {
weight[i] = 1;
}
compute();
}

/**
* Constructor method.
*
* @param x
*            Array of x
* @param y
*            Array of y
* @param weight
*            Array of weight
* @param n
*            The order of polynomial
*/
public TheleastSquareMethod(double[] x, double[] y, double[] weight, int n) {
if (x == null || y == null || weight == null || x.length < 2
|| x.length != y.length || x.length != weight.length || n < 2) {
throw new IllegalArgumentException(
"IllegalArgumentException occurred.");
}
this.x = x;
this.y = y;
this.n = n;
this.weight = weight;
compute();
}

/**
* Get coefficient of polynomial.
*
* @return coefficient of polynomial
*/
public double[] getCoefficient() {
return coefficient;
}

/**
* Used to calculate value by given x.
*
* @param x
*            x
* @return y
*/
public double fit(double x) {
if (coefficient == null) {
return 0;
}
double sum = 0;
for (int i = 0; i < coefficient.length; i++) {
sum += Math.pow(x, i) * coefficient[i];
}
return sum;
}

/**
* Use Newton's method to solve equation.
*
* @param y
*            y
* @return x
*/
public double solve(double y) {
return solve(y, 1.0d);
}

/**
* Use Newton's method to solve equation.
*
* @param y
*            y
* @param startX
*            The start point of x
* @return x
*/
public double solve(double y, double startX) {
final double EPS = 0.0000001d;
if (coefficient == null) {
return 0;
}
double x1 = 0.0d;
double x2 = startX;
do {
x1 = x2;
x2 = x1 - (fit(x1) - y) / calcReciprocal(x1);
} while (Math.abs((x1 - x2)) > EPS);
return x2;
}

/*
* Calculate the reciprocal of x.
*
* @param x x
*
* @return the reciprocal of x
*/
private double calcReciprocal(double x) {
if (coefficient == null) {
return 0;
}
double sum = 0;
for (int i = 1; i < coefficient.length; i++) {
sum += i * Math.pow(x, i - 1) * coefficient[i];
}
return sum;
}

/*
* This method is used to calculate each elements of augmented matrix.
*/
private void compute() {
if (x == null || y == null || x.length <= 1 || x.length != y.length
|| x.length < n || n < 2) {
return;
}
double[] s = new double[(n - 1) * 2 + 1];
for (int i = 0; i < s.length; i++) {
for (int j = 0; j < x.length; j++) {
s[i] += Math.pow(x[j], i) * weight[j];
}
}
double[] b = new double[n];
for (int i = 0; i < b.length; i++) {
for (int j = 0; j < x.length; j++) {
b[i] += Math.pow(x[j], i) * y[j] * weight[j];
}
}
double[][] a = new double[n][n];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
a[i][j] = s[i + j];
}
}

// Now we need to calculate each coefficients of augmented matrix
coefficient = calcLinearEquation(a, b);
}

/*
* Calculate linear equation.
*
* The matrix equation is like this: Ax=B
*
* @param a two-dimensional array
*
* @param b one-dimensional array
*
* @return x, one-dimensional array
*/
private double[] calcLinearEquation(double[][] a, double[] b) {
if (a == null || b == null || a.length == 0 || a.length != b.length) {
return null;
}
for (double[] x : a) {
if (x == null || x.length != a.length)
return null;
}

int len = a.length - 1;
double[] result = new double[a.length];

if (len == 0) {
result[0] = b[0] / a[0][0];
return result;
}

double[][] aa = new double[len][len];
double[] bb = new double[len];
int posx = -1, posy = -1;
for (int i = 0; i <= len; i++) {
for (int j = 0; j <= len; j++)
if (a[i][j] != 0.0d) {
posy = j;
break;
}
if (posy != -1) {
posx = i;
break;
}
}
if (posx == -1) {
return null;
}

int count = 0;
for (int i = 0; i <= len; i++) {
if (i == posx) {
continue;
}
bb[count] = b[i] * a[posx][posy] - b[posx] * a[i][posy];
int count2 = 0;
for (int j = 0; j <= len; j++) {
if (j == posy) {
continue;
}
aa[count][count2] = a[i][j] * a[posx][posy] - a[posx][j]
* a[i][posy];
count2++;
}
count++;
}

// Calculate sub linear equation
double[] result2 = calcLinearEquation(aa, bb);

// After sub linear calculation, calculate the current coefficient
double sum = b[posx];
count = 0;
for (int i = 0; i <= len; i++) {
if (i == posy) {
continue;
}
sum -= a[posx][i] * result2[count];
result[i] = result2[count];
count++;
}
result[posy] = sum / a[posx][posy];
return result;
}

public static void main(String[] args) {
TheleastSquareMethod eastSquareMethod = new TheleastSquareMethod(
new double[] { 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 }, new double[] {
1.75, 2.45, 3.81, 4.8, 7.0, 8.6 }, 3);
/*double[] coefficients = eastSquareMethod.getCoefficient();
for (double c : coefficients) {
System.out.println(c);
}*/

System.out.println(eastSquareMethod.fit(4));

TheleastSquareMethod eastSquareMethod2 = new TheleastSquareMethod(
new double[] { 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 }, new double[] {
1.75, 2.45, 3.81, 4.8, 7.0, 8.6 }, 2);
System.out.println(eastSquareMethod2.solve(100));

}
}

(1)数据集

1 11 2
1 12 3
1 13 1
1 14 0
2 11 1
2 12 2
2 13 2
2 14 1
2 15 4
3 11 2
3 12 3
3 13 1
3 14 0
3 15 1
4 11 1
4 12 2
4 13 2
4 14 1
4 15 4
5 11 1
5 12 2
5 13 2
5 14 1
5 15 4
(2)建立ALS数据模型

MLlib中ALS算法固定格式：case class Rating(user: Int,product: Int,rating: Double)

ALS.tran码源：

def train(
ratings: RDD[Rating];
rank: Int;
iterations: Int;
lambda: Double;
blocks: Int;
seed: Long;
): MatrixFactorizationModel = {
new ALS(blocks,rank,iterations,lambda,false,1.0,seed).run(ratings)
}
(3)基于ALS算法的协同过滤推荐

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}
object testVector {
def main(args: Array[String]): Unit = {
var conf = new SparkConf().setMaster("local").setAppName("testVector");//设置环境变量
val sc = new SparkContext(conf);//实例化环境
val data = sc.textFile("kimi.txt");//设置数据集
val ratings = data.map(_.split(' ')match{//处理数据
case Array(user,item,rate) => //将数据集转化
Rating(user.toInt,item.toInt,rate.toDouble);//将数据集转化成专用rating
})
val rank = 2;//设置隐藏因子
val numIterations = 2;//设置迭代次数
val model = ALS.train(ratings,rank,numIterations,0.01);//进行模型训练
var rs = model.recommendProducts(2,1);//为用户2推荐一个商品
rs.foreach(println);//打印结果
}
}

/Rating(2,15,3.9713808775549495)，为用户 2 推荐一个编号 15 的商品，预测评分 3.97 与实际的 4 接近．