SQL Server Clustering
Microsoft Clustering is the next topic in our SQL Server data mining techniques series. So far, we have discussed several data mining techniques: Naïve Bayes, Decision Trees, Time Series, and Association Rules.
Microsoft Clustering is an unsupervised learning technique. In supervised learning, the data set contains a variable that is already labeled, and the model learns to predict it. In unsupervised learning, there is no such pre-labeled variable.
Clustering is used to find natural groupings in a data set that are not apparent on inspection. The data set can be large, and when it also has a large number of attributes, manual grouping is impossible, so a dedicated technique is needed to find the natural groupings.
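To make "natural grouping" concrete, here is a minimal sketch of plain k-means in Python. It is not the Microsoft Clustering implementation (which offers EM and K-means variants); the `kmeans` helper and the toy `customers` rows (age, yearly income in thousands) are invented for illustration only.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points (tuples of numbers) into k clusters.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy rows: (age, yearly income in thousands). No label column is supplied.
customers = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

Note that no target column is passed in: the two groups emerge from the attribute values alone, which is what makes the technique unsupervised.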
Let us see how to perform clustering on the Microsoft SQL Server platform. In this example, we will use the vTargetMail view in the AdventureWorksDW sample database, as we did in the previous examples in the series.
Let us first create a data source and a Data Source View, as we did in the other examples. Here, the data source is AdventureWorksDW, and the Data Source View contains vTargetMail.
In the wizard, the next step is to choose the data mining technique:
Since there is only one view in the Data Source View, vTargetMail will be the Case table. The next step is to choose the relevant attributes, as shown in the screenshot below:
In the above, Customer Key is chosen as the Key for the algorithm. Input variables should be chosen sensibly: attributes such as Middle Name and Title are assumed not to contribute much to the natural grouping, and including them would only add unnecessary processing time for the data mining structure. So in the above selection, Age, BikeBuyer, CommuteDistance, EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, TotalChildren and YearlyIncome were chosen as relevant attributes.
Since we are using the Microsoft Clustering algorithm, there is no need to choose a Predict variable. This is why we said earlier that Microsoft Clustering is an unsupervised learning technique.
The next step is to select the correct Content types. Defaults are assigned, but they can be modified on the following screen:
In the above screenshot, columns with a numeric or long data type get the default content type Continuous. For example, Age and Number Cars Owned are numeric columns. Though they are numeric, we know their values are discrete: Number Cars Owned contains values such as 0, 1, 2, 3, etc. The Content Type of the Age, Bike Buyer, Number Cars Owned, Number Children at Home and Total Children attributes was therefore changed from Continuous to Discrete. We will leave the Yearly Income content type as Continuous.
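The choices made in the wizard so far can be summarized in a short Python sketch that applies the default content types, then the Discrete overrides, and prints a roughly equivalent DMX `CREATE MINING MODEL` statement. Several details are assumptions rather than facts from this article: the data type given to each column, the rule that text columns default to Discrete, and the model name `TargetMailClusters`, which is hypothetical. The generated statement has not been run against a server.

```python
# Assumed Analysis Services data types for the selected vTargetMail columns.
# These types are NOT stated in the article; adjust them to match your view.
COLUMNS = {
    "CustomerKey": "LONG",
    "Age": "LONG",
    "BikeBuyer": "LONG",
    "CommuteDistance": "TEXT",
    "EnglishEducation": "TEXT",
    "EnglishOccupation": "TEXT",
    "Gender": "TEXT",
    "HouseOwnerFlag": "TEXT",
    "MaritalStatus": "TEXT",
    "NumberCarsOwned": "LONG",
    "NumberChildrenAtHome": "LONG",
    "Region": "TEXT",
    "TotalChildren": "LONG",
    "YearlyIncome": "DOUBLE",
}

KEY_COLUMN = "CustomerKey"

# Numeric columns switched from Continuous to Discrete, as described above.
DISCRETE_OVERRIDES = {
    "Age",
    "BikeBuyer",
    "NumberCarsOwned",
    "NumberChildrenAtHome",
    "TotalChildren",
}


def default_content_type(data_type):
    """Numeric types default to CONTINUOUS; text is assumed to default to DISCRETE."""
    return "CONTINUOUS" if data_type in {"LONG", "DOUBLE"} else "DISCRETE"


def content_type(column):
    """Return the final content type for a column after applying the overrides."""
    if column == KEY_COLUMN:
        return "KEY"
    if column in DISCRETE_OVERRIDES:
        return "DISCRETE"
    return default_content_type(COLUMNS[column])


def build_dmx(model_name):
    """Build a DMX statement; no column is flagged for prediction."""
    definitions = ",\n".join(
        f"    [{name}] {data_type} {content_type(name)}"
        for name, data_type in COLUMNS.items()
    )
    return (
        f"CREATE MINING MODEL [{model_name}] (\n"
        f"{definitions}\n"
        f") USING Microsoft_Clustering"
    )


# "TargetMailClusters" is a hypothetical model name.
print(build_dmx("TargetMailClusters"))
```

Keeping the overrides in a separate set makes it easy to see which columns differ from their defaults, mirroring the manual changes made on the Content types screen.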