K-means clustering with Neo4j

When faced with a set of many items, we often want to organize them into groups. Simplification from many individual examples to higher-level groups of similar items can help us wrap our heads around what might otherwise be an overwhelming amount of data. If we already know the different categories that we want to group our data into, and we have examples of each category, we can use supervised machine learning to classify our data. If we don’t know the categories in advance, we can use unsupervised machine learning to discover new groupings from our data.

The community detection algorithms that come in Neo4j’s Graph Data Science library are one way to apply unsupervised machine learning. We can use them to find groupings based upon relationships among items. If we need to group items based on similar properties instead of relationships, we can use clustering algorithms such as k-means. In this article, we’ll look at how we could implement k-means and the k-means++ seeding algorithm in Neo4j’s Cypher query language. We’ll also compare the results of k-means to clusters we could generate through the Louvain community detection algorithm.

I like k-means because it is a powerful algorithm that I also find easy to understand. It has a long history in data analytics. Hugo Steinhaus was one of the first to propose the idea in the 1950s.

You can use k-means when you believe that your data could be grouped into a known number of natural clusters. The algorithm assigns each example in the data to a cluster of items with similar values for the properties that you are considering. As a very simple example, suppose your data points are the weights and heights of some volunteers, and you want to determine which group each volunteer falls into based on those values.

We decide in advance how many clusters we want to divide the data into. The number of clusters that we pick is called “k”, and that’s the “k” in k-means. Next, we select one example at random from our data set for each of the k clusters. These randomly-selected points become the initial centroids of our clusters. Each example in the data set is assigned to the cluster that corresponds to the nearest centroid. Then we move each centroid so that it sits at the average value along each dimension for the items in its cluster. This averaging step is where we get the “means” in k-means. We repeat this process of assigning examples to their closest centroid, and then shifting the centroids to respond to their new cluster members. In time, the algorithm should converge to a stable state where no examples (or at most a very few examples) are switching clusters at the assignment step.
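
The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the assign-then-average cycle, not the Cypher implementation this article is about. The function names `k_means` and `squared_distance`, the `max_iterations` and `seed` parameters, and the height and weight values are all assumptions made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Select k distinct examples at random as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iterations):
        # Assignment step: each example joins its nearest centroid's cluster.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stable state: no example switched clusters, so we stop.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members
        # along every dimension (the "means" in k-means).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )

    return centroids, assignments


# Hypothetical volunteers as (height in cm, weight in kg).
volunteers = [(150, 50), (152, 53), (155, 52), (185, 90), (188, 95), (190, 92)]
centroids, assignments = k_means(volunteers, k=2)
```

Which cluster gets which index depends on the random starting points, so code that uses the result should compare cluster memberships rather than rely on specific label numbers.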

Implementing k-means in Neo4j sounds like a fun challenge. However, there are efficient, easy-to-use implementations of k-means like this one in Python and this one in R. Do we really need to build one in Neo4j? Let’s not be like the scientists in Jurassic Park — so preoccupied with whether or not they could, they didn’t stop to think if they should.

Photo by Amy-Leigh Barnard on Unsplash