Levels of Data Measurement

Latest recommended article published 2024-10-08 12:37:10

李_涛

Tags: python, machine learning, big data, mysql, algorithms

Original article: https://medium.com/@zachary.a.zazueta/levels-of-data-measurement-2ba7d858904d

It is widely reported that over 80% of a data scientist’s time is spent cleaning and engineering data. Great effort is put into preparing the information that will feed a data scientist’s models. Including irrelevant information or messy data in the modeling cycle can lead to models that are inaccurate or show false insights.

As such, one of the first steps of data cleaning is understanding what features or attributes will be available for the work and what type of attribute they will be. Data can be measured on four main scales: Interval, Ratio, Nominal, and Ordinal. Knowing the difference in these scales allows the data scientist to correctly structure the available data and informs them what statistical methods can be applied to said data.

Data classified on the Interval and Ratio scales are numeric, and they can be either continuous or discrete. For these variables, the units of measure in the data set are separated by defined, consistent values. The key difference between the two scales is that Interval data have no “true zero”.

Interval data are ordered, and the distance between data points is known and meaningful. Clock time is an example: 30 minutes separate 1:00 and 1:30, and 30 minutes also separate 1:30 and 2:00, so the units of measure are evenly spaced. It is also widely agreed that 1:00 comes before 1:30, which satisfies order. However, a variable like time has no absolute zero. The lack of a “true zero” limits the arithmetic that can be applied to interval data to addition and subtraction: we can measure the number of days since a customer last made a purchase, or the number of days we’ve been dealing with the coronavirus pandemic (at least an eternity, right?), but we would not divide November by March.
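
The clock-time example can be sketched with Python's standard `datetime` module. The timestamps below are hypothetical and the calendar date is arbitrary; the point is only that subtracting two points in time yields a meaningful duration, while dividing one by the other is not even defined.

```python
from datetime import datetime

# Hypothetical timestamps; the calendar date is arbitrary.
t1 = datetime(2020, 11, 1, 13, 0)   # 1:00 pm
t2 = datetime(2020, 11, 1, 13, 30)  # 1:30 pm
t3 = datetime(2020, 11, 1, 14, 0)   # 2:00 pm

# Subtraction is meaningful: equal clock gaps give equal durations.
gap_a = t2 - t1
gap_b = t3 - t2
same_gap = gap_a == gap_b

# Dividing one point in time by another has no meaning, and Python rejects it.
try:
    t2 / t1
    division_allowed = True
except TypeError:
    division_allowed = False

print(gap_a, same_gap, division_allowed)
```

Note that the differences themselves (`timedelta` values) do have a true zero, which is why durations can be divided even though the timestamps cannot.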

Another example of an interval variable is a student’s SAT score; numbers between 0 and 200 are not used when the College Board scales raw scores. Interval data can contain a zero; it is just that this “0” does not carry the meaning of a “true zero”. Temperature is a great example. On the Fahrenheit scale we can have 0 °F, but that does not mean there is no temperature; it just means you had better be wearing a winter jacket.
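
As a rough illustration of why an arbitrary zero breaks ratios, the sketch below compares two made-up Fahrenheit readings. The Kelvin conversion is not part of the original article; it is brought in here only because Kelvin is a temperature scale whose zero is absolute, so ratios on it are interpretable.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert Fahrenheit to Kelvin, a scale whose zero is absolute zero."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


cool_f, warm_f = 40.0, 80.0  # illustrative readings, not real measurements

# The difference is meaningful on the interval scale.
difference_f = warm_f - cool_f

# The naive ratio says "twice as hot", but that depends on where 0 °F sits.
naive_ratio = warm_f / cool_f

# On a scale with a true zero, the same two readings differ by about 8%.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```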

When you have ordered numeric data with a clear definition of zero, you are dealing with Ratio data. Classic examples of ratio data are height/weight and price. If a person has zero height or weight, they don’t exist; a customer can’t pay less than $0.00 for an item at a store.

Measures of central tendency (mean, median, and mode) and of spread (such as standard deviation) can be computed for these numerical data measures, but it is only with the presence of the true zero in ratio measures that a wide range of descriptive and inferential statistics can be applied. Interval and Ratio levels of measurement are considered to be much more exact than their qualitative counterparts.
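
A minimal sketch of these summaries, using Python's standard `statistics` module on a hypothetical list of prices (ratio data). The coefficient of variation and the "times more expensive" comparison are included as examples of statements that rely on a true zero; they are one possible illustration, not an exhaustive list.

```python
import statistics

prices = [4.0, 6.0, 6.0, 8.0, 16.0]  # hypothetical prices in dollars

# Summaries available for any numeric (interval or ratio) variable.
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
mode_price = statistics.mode(prices)
spread = statistics.stdev(prices)  # sample standard deviation

# Statements that need a true zero, so they only make sense for ratio data.
coefficient_of_variation = spread / mean_price
times_more = max(prices) / min(prices)  # "the priciest item costs N times the cheapest"

print(mean_price, median_price, mode_price, round(spread, 3))
print(round(coefficient_of_variation, 3), times_more)
```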

Qualitative data are either nominal or ordinal. Qualitative data that are ordered are considered Ordinal level and those that are unordered are considered Nominal level.

Ordinal data are usually numbered to represent rank or order in a list, but the numbers typically reflect opinion or observation and are not mathematical measures. Ordinal data often appear as ranked responses, such as those from a survey. Unlike interval data, the difference between adjacent responses is not necessarily uniform, because gaps between non-numeric concepts can vary; for example, the cognitive distance between neutral and agree is potentially wider than the distance between agree and strongly agree.

Ranked placement in a race is another strong example of how ordinal data lack a standard unit of difference. The difference between 1st and 2nd place might be 1/10th of a second while the difference between 2nd and 3rd place could be 4/10ths of a second; this inconsistency between placings makes the differences between placements less meaningful.
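
The race example can be sketched as follows. The runner names and finish times are made up, chosen so that the gaps match the 1/10th and 4/10ths of a second mentioned above; converting times to placings keeps the order but discards the size of each gap.

```python
# Hypothetical finish times in seconds (ratio data).
finish_times = {"Ana": 9.8, "Ben": 9.9, "Cy": 10.3}

# Converting to placings keeps only the order (ordinal data).
ordered = sorted(finish_times, key=finish_times.get)
placings = {runner: place for place, runner in enumerate(ordered, start=1)}

# Adjacent placings always differ by exactly 1 ...
rank_gaps = [placings[b] - placings[a] for a, b in zip(ordered, ordered[1:])]

# ... while the underlying time gaps are not uniform.
time_gaps = [
    round(finish_times[b] - finish_times[a], 1)
    for a, b in zip(ordered, ordered[1:])
]

print(placings, rank_gaps, time_gaps)
```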

Because the gaps between values vary, there is only so much information that can be extracted from ordinal data. Central tendency can be captured through mode and median, but a purist will insist that a mean cannot be defined for an ordinal set. Some measures can be taken during data collection to maximize the information gained from ordinal data, such as designing a customer satisfaction survey around the Likert scale, but as the data shift away from continuous numerical data, their usefulness can be debated.
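
One way to compute the median and mode of survey responses without treating them as numbers is sketched below. The five-point scale labels and the responses are hypothetical. Ranks are used only to sort, and `statistics.median_low` is chosen so that the result is always one of the actual categories, even for an even number of responses.

```python
import statistics

# Hypothetical five-point Likert scale, listed from lowest to highest.
SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(SCALE)}

responses = [
    "agree", "neutral", "agree", "strongly agree",
    "disagree", "agree", "neutral",
]

# The ranks carry order only; they are not treated as measurements.
ranks = sorted(RANK[r] for r in responses)

# median_low always returns a value present in the data, so it maps to a label.
median_label = SCALE[statistics.median_low(ranks)]
mode_label = statistics.mode(responses)

print(median_label, mode_label)
```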

Nominal measures are often considered to be the lowest level of data classification as they provide the least information. Nominal data is often seen in binaries (sale was made online or not), categories (a customer’s identified race), and sets of things (e.g. types of tomatoes — cherry, Roma, San Marzano…). Nominal data can have numbers assigned to them — e.g. 1 for Female, 2 for Male, 3 for Other, 4 for Unspecified — but unlike with ordinal data, the numbers do not reflect that one category is better than another. Mode is the measure of central tendency for nominal data; other quantitative descriptions are not appropriate for measuring these unordered categorical features.
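
To see why the mode is appropriate here while a mean is not, the sketch below uses the coding from the paragraph above on a made-up set of observations, then swaps in a second, equally arbitrary coding (an assumption for illustration). The modal category survives the relabeling; the mean of the codes does not.

```python
import statistics

# Coding taken from the example in the text.
codes = {1: "Female", 2: "Male", 3: "Other", 4: "Unspecified"}
observed = [1, 2, 2, 1, 2, 4, 2, 3]  # hypothetical observations

modal_label = codes[statistics.mode(observed)]
mean_before = statistics.mean(observed)

# A second, equally arbitrary coding of the same categories (illustrative only).
recode = {1: 40, 2: 10, 3: 30, 4: 20}
decode = {new: old for old, new in recode.items()}
recoded = [recode[c] for c in observed]

# The modal category is the same under either coding ...
modal_label_after = codes[decode[statistics.mode(recoded)]]
# ... but the "mean code" changes with the labels, so it describes nothing.
mean_after = statistics.mean(recoded)

print(modal_label, modal_label_after, mean_before, mean_after)
```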

Nominal data is interesting in that it has a few subcategories. One subcategory is “nominal with order” — this describes data that has some order but is not ranked — hot/warm/cold. This is difficult to separate from ordinal. Then there is “nominal without order” — this would be something like eye color or race where there is grouping into unique categories. Finally, there are “dichotomous” nominal data. These data are further broken down into either binary data or discrete or continuous dichotomous variables. Binary variables are variables assigned either a 0 or 1, e.g. Heads(0) or Tails(1) for a coin flip. Discrete dichotomous variables are where there is no possible outcome other than one of two options, e.g. whether the dress was blue and black or white and gold. Continuous dichotomous variables represent outcomes where there are possibilities in between, e.g. if a house is within 5 miles of the town center it is considered to be in a metropolitan area, but if it is more than 5 miles from town center, it is listed as a rural property.
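
The continuous dichotomous case can be sketched as a simple threshold rule. The 5-mile cutoff comes from the example above; the function name, constant name, and distances are hypothetical. Houses at 4.9 and 5.1 miles end up in different categories despite being nearly the same distance away, which shows how much information the dichotomy discards.

```python
METRO_RADIUS_MILES = 5.0  # cutoff from the example in the text


def classify_location(miles_from_center: float) -> str:
    """Collapse a continuous distance into one of two categories."""
    if miles_from_center <= METRO_RADIUS_MILES:
        return "metropolitan"
    return "rural"


distances = [0.8, 4.9, 5.1, 12.0]  # hypothetical distances in miles
labels = [classify_location(d) for d in distances]

print(list(zip(distances, labels)))
```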

While numeric data can have arithmetic measures applied, measurements for nominal and ordinal data types are largely reduced to measures such as proportion and frequency distribution of the attributes they describe. Even with this reduced capacity, there are tests that can be applied if the data scientist has a sound understanding of the data they are working with. I will aim to dive into that in my next post!
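
A frequency table with proportions is straightforward to build with `collections.Counter`. The sample below reuses the tomato varieties mentioned earlier; the counts themselves are made up.

```python
from collections import Counter

# Hypothetical sample of a nominal variable.
tomatoes = [
    "cherry", "Roma", "cherry", "San Marzano",
    "Roma", "cherry", "Roma", "cherry",
]

counts = Counter(tomatoes)
total = sum(counts.values())
proportions = {kind: n / total for kind, n in counts.items()}

# Print the table from most to least frequent category.
for kind, n in counts.most_common():
    print(f"{kind:<12} {n:>2}  {proportions[kind]:.1%}")
```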
