1. Introduction
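Pearson similarity is the covariance of two n-dimensional vectors divided by the product of their standard deviations.
Pearson similarity is computed with the following formula:
similarity(A, B) = cov(A, B) / (σ_A × σ_B)
Values range between -1 and 1, where -1 means completely dissimilar (a perfect negative correlation) and 1 means completely similar (a perfect positive correlation).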
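The library contains both procedures and functions for computing the similarity between sets of data. The function is best suited to computing the similarity between a small number of sets. The procedures parallelize the computation, so they are better suited to computing similarities on larger datasets.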
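2. Use cases
We can use the Pearson Similarity algorithm to work out how similar two things are. We might then use the computed similarity as part of a recommendation query, for example to get movie recommendations based on the preferences of users who have given similar ratings to other movies that you have seen.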
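3. Pearson Similarity algorithm function examples
The Pearson Similarity function computes the similarity of two lists of numbers. Pearson similarity is only computed over non-NULL dimensions. When calling the function, we should provide lists that contain the overlapping items. We can use it to compute the similarity of two hard-coded lists.
Example: the following will return the Pearson similarity of two lists of numbers: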
RETURN algo.similarity.pearson([5,8,7,5,4,9], [7,8,6,6,4,5]) AS similarity
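To make the arithmetic behind that call concrete, here is a minimal Java sketch of the Pearson formula applied to the same two hard-coded lists. The class name `PearsonListSketch` is hypothetical, and returning 0 when one vector has zero variance is an assumption of this sketch, not something the library is known to do.

```java
final class PearsonListSketch {

    // Pearson correlation of two equal-length, non-empty arrays:
    // covariance divided by the product of the standard deviations.
    static double pearson(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Vectors must be non-empty and of the same size");
        }
        int n = a.length;
        double meanA = 0.0;
        double meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;

        double dot = 0.0; // sum of products of deviations (unnormalised covariance)
        double sqA = 0.0; // sum of squared deviations of a
        double sqB = 0.0; // sum of squared deviations of b
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            dot += da * db;
            sqA += da * da;
            sqB += db * db;
        }
        double denom = Math.sqrt(sqA * sqB);
        // Assumption of this sketch: a constant vector yields 0 instead of NaN.
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        double[] a = {5, 8, 7, 5, 4, 9};
        double[] b = {7, 8, 6, 6, 4, 5};
        System.out.println(pearson(a, b)); // roughly 0.2877
    }
}
```

The 1/n factors of the covariance and of the two standard deviations cancel, which is why the sketch works directly with sums of deviations.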
We can also use it to compute the similarity of nodes based on lists computed by a Cypher query.
The following will create a sample graph:
MERGE (home_alone:Movie {name:'Home Alone'})
MERGE (matrix:Movie {name:'The Matrix'})
MERGE (good_men:Movie {name:'A Few Good Men'})
MERGE (top_gun:Movie {name:'Top Gun'})
MERGE (jerry:Movie {name:'Jerry Maguire'})
MERGE (gruffalo:Movie {name:'The Gruffalo'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (zhen)-[:RATED {score: 2}]->(home_alone)
MERGE (zhen)-[:RATED {score: 2}]->(good_men)
MERGE (zhen)-[:RATED {score: 3}]->(matrix)
MERGE (zhen)-[:RATED {score: 6}]->(jerry)

MERGE (praveena)-[:RATED {score: 6}]->(home_alone)
MERGE (praveena)-[:RATED {score: 7}]->(good_men)
MERGE (praveena)-[:RATED {score: 8}]->(matrix)
MERGE (praveena)-[:RATED {score: 9}]->(jerry)

MERGE (michael)-[:RATED {score: 7}]->(home_alone)
MERGE (michael)-[:RATED {score: 9}]->(good_men)
MERGE (michael)-[:RATED {score: 3}]->(jerry)
MERGE (michael)-[:RATED {score: 4}]->(top_gun)

MERGE (arya)-[:RATED {score: 8}]->(top_gun)
MERGE (arya)-[:RATED {score: 1}]->(matrix)
MERGE (arya)-[:RATED {score: 10}]->(jerry)
MERGE (arya)-[:RATED {score: 10}]->(gruffalo)

MERGE (karin)-[:RATED {score: 9}]->(top_gun)
MERGE (karin)-[:RATED {score: 7}]->(matrix)
MERGE (karin)-[:RATED {score: 7}]->(home_alone)
MERGE (karin)-[:RATED {score: 9}]->(gruffalo)
Example 1: the following will return the Pearson similarity of Arya and Karin:
MATCH (p1:Person {name: 'Arya'})-[rated:RATED]->(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person {name: 'Karin'})-[rated:RATED]->(movie)
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
       p2.name AS to,
       algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity
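Result:
In this example we pass vectorType: "maps" as an extra parameter and use the algo.similarity.asVector function to build a vector of maps, each containing a movie and the corresponding rating. We do this because the Pearson Similarity algorithm needs to compute the average over all the movies a user has rated, not just the ones they have in common with the user we are comparing them to. For that reason we cannot simply pass in the collections of ratings for the movies that both users have rated.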
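The reasoning above can be sketched in Java as follows. This is one reading of the description, not the library's own `Intersections.pearsonSkip` code: each user's mean is taken over every movie that user rated, while the products and squared deviations are accumulated only over movies both users rated. The class name `PearsonSkipSketch` and the use of `Double.NaN` for "not rated" are assumptions made for illustration.

```java
final class PearsonSkipSketch {

    // Mean over the entries that are present (not NaN).
    private static double meanIgnoringNaN(double[] v) {
        double sum = 0.0;
        int count = 0;
        for (double x : v) {
            if (!Double.isNaN(x)) {
                sum += x;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    // Pearson similarity where NaN marks a missing dimension.
    static double pearsonSkip(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must be of the same size");
        }
        double meanA = meanIgnoringNaN(a); // over everything user A rated
        double meanB = meanIgnoringNaN(b); // over everything user B rated

        double dot = 0.0;
        double sqA = 0.0;
        double sqB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue; // only shared dimensions contribute
            }
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            dot += da * db;
            sqA += da * da;
            sqB += db * db;
        }
        double denom = Math.sqrt(sqA * sqB);
        // Assumption of this sketch: no usable overlap yields 0.
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        // Dimensions: Top Gun, The Matrix, Jerry Maguire, The Gruffalo, Home Alone
        double[] arya  = {8, 1, 10, 10, nan};
        double[] karin = {9, 7, nan, 9, 7};
        System.out.println(pearsonSkip(arya, karin)); // roughly 0.8195
    }
}
```

With the ratings from the sample graph, Arya's mean is 7.25 and Karin's is 8, and the three shared movies (Top Gun, The Matrix, The Gruffalo) give a similarity of about 0.8195 under these assumptions.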
Example 2: the following will return the Pearson similarity of Arya and every other person who has rated at least one movie:
MATCH (p1:Person {name: 'Arya'})-[rated:RATED]->(movie)
WITH p1, algo.similarity.asVector(movie, rated.score) AS p1Vector
MATCH (p2:Person)-[rated:RATED]->(movie) WHERE p2 <> p1
WITH p1, p2, p1Vector, algo.similarity.asVector(movie, rated.score) AS p2Vector
RETURN p1.name AS from,
       p2.name AS to,
       algo.similarity.pearson(p1Vector, p2Vector, {vectorType: "maps"}) AS similarity
ORDER BY similarity DESC
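Result:
4. Pearson Similarity algorithm procedure examples
The Pearson Similarity procedure computes the similarity between all pairs of items. It is a symmetric algorithm: the result of computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A. We can therefore compute the score for each pair of nodes once. We do not compute the similarity of an item to itself.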
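The number of computations is (# items) × (# items - 1) / 2, the number of unordered pairs of distinct items, which can be very computationally expensive if we have a lot of nodes.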
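Pearson similarity is only computed over non-NULL dimensions. The procedures expect to receive lists of the same length for all items, so we need to pad those lists with algo.NaN() where necessary.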
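The padding step can be illustrated outside Cypher with a short Java sketch. It assumes a fixed, ordered list of all movies and one sparse rating map per user; the names `NaNPaddingSketch` and `pad` are hypothetical, and `Double.NaN` stands in for `algo.NaN()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class NaNPaddingSketch {

    // Build a weight array with one slot per known item, in the order of
    // allItems, filling the items the user did not rate with NaN.
    static double[] pad(Map<String, Double> ratings, List<String> allItems) {
        double[] weights = new double[allItems.size()];
        for (int i = 0; i < allItems.size(); i++) {
            weights[i] = ratings.getOrDefault(allItems.get(i), Double.NaN);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> movies = List.of(
                "Home Alone", "The Matrix", "A Few Good Men",
                "Top Gun", "Jerry Maguire", "The Gruffalo");
        Map<String, Double> arya = Map.of(
                "Top Gun", 8.0, "The Matrix", 1.0,
                "Jerry Maguire", 10.0, "The Gruffalo", 10.0);
        // Every user ends up with a list of the same length (6 here).
        System.out.println(Arrays.toString(pad(arya, movies)));
    }
}
```

Because every user's array follows the same item order, position i always refers to the same movie, which is what lets a NaN-aware similarity compare the arrays dimension by dimension.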
Example 3: the following will return a stream of node pairs along with their Pearson similarities:
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC
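Result:
We can see that pairs of users with no similarity have been filtered out. If we are implementing a k-Nearest Neighbors type of query, we might instead want to find the k most similar users for a given user. We can do that by passing in the topK parameter.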
Example 4: the following will return a stream of users along with the user most similar to each of them (i.e. k=1):
MATCH (p:Person), (m:Movie)
OPTIONAL MATCH (p)-[rated:RATED]->(m)
WITH {item:id(p), weights: collect(coalesce(rated.score, algo.NaN()))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data, {topK:1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, similarity
RETURN algo.asNode(item1).name AS from, algo.asNode(item2).name AS to, similarity
ORDER BY similarity DESC
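The effect of topK together with similarityCutoff can be sketched in Java as a per-item selection over a precomputed similarity matrix. This is only an illustration of the idea: the class `TopKSketch`, the matrix values, and the choice to keep scores greater than or equal to the cutoff are assumptions, and the library's exact boundary behaviour may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopKSketch {

    // For one item, return the indices of up to k other items with the
    // highest similarity, dropping scores below the cutoff.
    static List<Integer> topK(double[][] sim, int item, int k, double cutoff) {
        List<Integer> candidates = new ArrayList<>();
        for (int j = 0; j < sim.length; j++) {
            double s = sim[item][j];
            // Skip the item itself, undefined scores, and scores under the cutoff.
            if (j != item && !Double.isNaN(s) && s >= cutoff) {
                candidates.add(j);
            }
        }
        // Highest similarity first.
        candidates.sort(Comparator.comparingDouble((Integer j) -> sim[item][j]).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        // Hypothetical symmetric similarity matrix for three users.
        double[][] sim = {
                {1.0, 0.8, -0.5},
                {0.8, 1.0, 0.3},
                {-0.5, 0.3, 1.0}
        };
        System.out.println(topK(sim, 0, 1, 0.0)); // nearest neighbour of user 0
        System.out.println(topK(sim, 2, 2, 0.0)); // user 2 has only one neighbour above the cutoff
    }
}
```

Note that a top-k result is no longer symmetric: user 1 can be the nearest neighbour of user 2 without user 2 being the nearest neighbour of user 1.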
5. Source code walkthrough
@UserFunction("algo.similarity.pearson")
public double pearsonSimilarity(@Name("vector1") Object rawVector1,
                                @Name("vector2") Object rawVector2,
                                @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
    ProcedureConfiguration configuration = ProcedureConfiguration.create(config);
    // "numbers" (plain lists) is the default; "maps" selects the map-vector branch.
    String listType = configuration.get("vectorType", "numbers");

    if (listType.equalsIgnoreCase("maps")) {
        List<Map<String, Object>> vector1 = (List<Map<String, Object>>) rawVector1;
        List<Map<String, Object>> vector2 = (List<Map<String, Object>>) rawVector2;

        // Collect the union of category ids seen in either vector.
        LongSet ids = new LongHashSet();

        LongDoubleMap v1Mappings = new LongDoubleHashMap();
        for (Map<String, Object> entry : vector1) {
            Long id = (Long) entry.get(CATEGORY_KEY);
            ids.add(id);
            v1Mappings.put(id, (Double) entry.get(WEIGHT_KEY));
        }

        LongDoubleMap v2Mappings = new LongDoubleHashMap();
        for (Map<String, Object> entry : vector2) {
            Long id = (Long) entry.get(CATEGORY_KEY);
            ids.add(id);
            v2Mappings.put(id, (Double) entry.get(WEIGHT_KEY));
        }

        // Align both vectors on the same ids, using NaN where a weight is missing.
        double[] weights1 = new double[ids.size()];
        double[] weights2 = new double[ids.size()];
        double skipValue = Double.NaN;
        int index = 0;
        for (long id : ids.toArray()) {
            weights1[index] = v1Mappings.getOrDefault(id, skipValue);
            weights2[index] = v2Mappings.getOrDefault(id, skipValue);
            index++;
        }

        return Intersections.pearsonSkip(weights1, weights2, ids.size(), skipValue);
    } else {
        List<Number> vector1 = (List<Number>) rawVector1;
        List<Number> vector2 = (List<Number>) rawVector2;

        if (vector1.size() != vector2.size() || vector1.size() == 0) {
            throw new RuntimeException("Vectors must be non-empty and of the same size");
        }

        // Copy both lists into primitive arrays before computing the similarity.
        int len = vector1.size();
        double[] weights1 = new double[len];
        double[] weights2 = new double[len];
        for (int i = 0; i < len; i++) {
            weights1[i] = vector1.get(i).doubleValue();
            weights2[i] = vector2.get(i).doubleValue();
        }

        return Intersections.pearson(weights1, weights2, len);
    }
}
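Conclusions:
1. The Pearson similarity function accepts its vectors in more than one form: as lists of numbers (the default) or as lists of maps (vectorType: "maps").
2. The config argument is optional; when omitted it defaults to an empty map.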