Data Types in Statistics Used for Machine Learning

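This article surveys the data types used in machine learning, which are essential for understanding and handling data analysis and big data. Knowing these types makes feature engineering and model building easier.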

Introduction to Statistics

The field of statistics is the science of learning from data. Statistical knowledge helps you use the proper methods to collect data, employ the correct analyses, and effectively present the results. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. Statistics allows you to understand a subject much more deeply.

To become a successful data scientist you must know the basics. Math and statistics are the building blocks of machine learning algorithms. It is important to know the techniques behind the various algorithms so that you know how and when to use them. Now the question arises: what exactly is statistics?

“Statistics is a Mathematical Science of data collection, analysis, interpretation and presentation”.

Why Learn Statistics?

One of the central concepts of data science is gaining insights from data. Statistics is an excellent tool for unlocking such insights in data. Statistics is a form of math, and it involves formulas, but it doesn’t have to be that scary even if you’ve never encountered it before.

Machine learning grew largely out of statistics. Many of the algorithms and models used in machine learning come from what’s called statistical learning. Knowing some basic statistics is extremely helpful whether you are deep into machine learning algorithms or just staying up to date on the latest machine learning research.

Introduction to Data Types

Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for doing Exploratory Data Analysis (EDA) since you can use certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables. We will discuss the main types of data and look at an example of each.

Figure: Types of Data

Qualitative versus Quantitative Data

The distinction between qualitative and quantitative data is the most fundamental way to divide types of data. Is the characteristic something you can objectively measure with numbers or not?

1) Qualitative

The information represents characteristics that you do not measure with numbers. Instead, the observations fall within a countable number of groups. This type of variable can capture information that isn’t easily measured and can be subjective. Taste, the colour of a car, architectural style, and marital status are all examples of qualitative data. Analysts also refer to this as categorical data.

i) Nominal Data

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data has no order, so changing the order of its values does not change their meaning. You can see two examples of nominal features below:

Figure: Nominal data example

Visualization Methods: To visualize nominal data you can use a pie chart or a bar chart.

Figure: Visualizing nominal data

In data science, you can use one-hot encoding to transform nominal data into numeric features.
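
As a rough illustration of one-hot encoding, the sketch below turns each label into a row of 0/1 indicators, one column per category. The helper name `one_hot_encode` and the colour values are made up for this example; in practice a library routine would usually do this job.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    # Sorting only gives a reproducible column order.
    # Nominal categories have no inherent order of their own.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical nominal feature: car colours.
colours = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colours)

print(categories)
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each row contains exactly one 1, so the encoding adds no artificial ordering between the categories.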

ii) Ordinal Data

Ordinal data mixes properties of numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike nominal data, the numbers carry information: they convey rank, although the gaps between adjacent ranks need not be equal. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. You can see an example below:

Figure: Customer rating for a service, an example where the order matters

Ordinal scales are usually used to measure non-numeric features such as happiness, customer satisfaction, the rank of students in a class, or education level.

You can therefore summarize ordinal data with frequencies, proportions, and percentages, and visualize it with pie and bar charts. Because the values are ordered, you can also use percentiles, the median, the mode, and the interquartile range.
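
A minimal sketch of these ordinal summaries, using only the Python standard library. The four-level scale and the survey responses are invented for illustration.

```python
from collections import Counter
from statistics import median_low, mode

# Hypothetical ordered scale, from lowest to highest.
LEVELS = ["poor", "fair", "good", "excellent"]
RANK = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# Frequencies and proportions work for any categorical data.
counts = Counter(responses)
proportions = {level: counts[level] / len(responses) for level in LEVELS}

# The median needs the order: rank the responses, take the middle rank,
# then map it back to its label. median_low always returns an observed value.
ranks = [RANK[response] for response in responses]
median_level = LEVELS[median_low(ranks)]
most_common_level = mode(responses)

print(dict(counts))
print(median_level, most_common_level)
```

The mean is deliberately left out: averaging the ranks would assume equal gaps between levels, which an ordinal scale does not guarantee.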

In addition to ordinal and nominal values, there is a special type of categorical data called binary.

Binary data has only two possible values, such as yes or no. These can be represented in different ways, for example “True” and “False” or 1 and 0. Binary data is used heavily in classification models. Examples of binary variables include whether a person has cancelled a subscription service, or whether a person bought a car.
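
As a small sketch, a yes/no variable can be mapped to 1/0 so that it can serve as a classification target. The subscription answers below are hypothetical.

```python
# Hypothetical answers to "Did the customer cancel their subscription?"
cancelled = ["yes", "no", "no", "yes", "no"]

# Map the two categories to 1 and 0.
labels = [1 if answer == "yes" else 0 for answer in cancelled]

# With a 0/1 coding, the mean of the labels is the share of "yes" answers.
cancel_rate = sum(labels) / len(labels)

print(labels)
print(cancel_rate)
```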

Figure: Binary data types

2) Quantitative

The information is recorded as numbers and represents an objective measurement or a count. Temperature, weight, and a count of transactions are all quantitative data. Analysts also refer to this type as numerical data.

i) Discrete Data

Discrete quantitative data are a count of the presence of a characteristic, result, item, or activity. These measures cannot be meaningfully divided into smaller increments. For example, a single household can have 1 or 2 cars, but it cannot have 1.6. The possible values are countable: you can list them one by one.

With discrete variables, you can calculate and assess a rate of occurrence or a summary of the count, such as the mean, sum, and standard deviation. For example, U.S. households had an average of 2.11 vehicles in 2014.
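
A short sketch of these count summaries with the standard `statistics` module. The household counts are invented sample data, and the population standard deviation (`pstdev`) is an arbitrary choice for the example.

```python
from statistics import mean, pstdev

# Hypothetical sample: number of cars in each of eight households.
cars_per_household = [1, 2, 0, 3, 2, 1, 2, 1]

total_cars = sum(cars_per_household)
average_cars = mean(cars_per_household)
spread = pstdev(cars_per_household)  # population standard deviation

print(total_cars, average_cars, round(spread, 3))
```

Note that the average can be fractional even though no single household can own a fraction of a car.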

Bar charts are a standard way to graph discrete variables. Each bar represents a distinct value, and the height represents its proportion in the entire sample.
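
To make the idea concrete, the sketch below computes the proportion behind each bar and prints a rough text version of the chart. The data is hypothetical, and a plotting library would normally draw the bars.

```python
from collections import Counter

# Hypothetical sample: number of cars in each of eight households.
cars_per_household = [1, 2, 0, 3, 2, 1, 2, 1]

counts = Counter(cars_per_household)
sample_size = len(cars_per_household)

# One bar per distinct value; its height is that value's share of the sample.
proportions = {value: counts[value] / sample_size for value in sorted(counts)}

for value, share in proportions.items():
    bar = "#" * round(share * 40)
    print(f"{value} cars | {bar} {share:.1%}")
```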

Figure: Bar chart of the number of cars per household

ii) Continuous Data

Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, including fractional and decimal values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data.

For example, a population’s mean height might be reported as 5 feet 9 inches for men and 5 feet 4 inches for women.

Continuous data comes in two types:

a) Interval Data

Interval values represent ordered units that are separated by equal differences. We therefore speak of interval data when a variable contains ordered numeric values and we know the exact differences between them. An example would be a feature that contains the temperature of a given place, as you can see below:

Figure: Positive and negative intervals

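The problem with interval data is that it doesn’t have a “true zero”: zero is just another point on the scale, not the absence of the quantity. You can add and subtract interval values, but ratios between them are not meaningful.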
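
The sketch below illustrates the missing true zero with temperature. The choice of Celsius and Kelvin is an assumption made for this example (the text above only speaks of temperature): differences agree across the two scales, but the ratio computed in Celsius does not.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale whose zero is absolute zero."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: both scales agree.
difference_c = warm_c - cold_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are not: 20 degrees Celsius is not "twice as hot" as 10.
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(difference_c, round(difference_k, 6))
print(ratio_c, round(ratio_k, 4))
```

The Celsius ratio is an artefact of where the scale places its zero, while the Kelvin ratio reflects the physical quantity.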

b) Ratio Data

Ratio values are also ordered units separated by equal differences. They are the same as interval values, except that they do have an absolute zero, so ratios between values are meaningful. Good examples are height, weight, and length.

Figure: Length of a table

When you are dealing with continuous data, you have the widest choice of methods for describing it. You can summarize your data using percentiles, median, interquartile range, mean, mode, standard deviation, and range.
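
A minimal sketch of several of these summaries using the standard `statistics` module. The height measurements are invented, and `quantiles` is called with its default quartile convention, which may differ slightly from other tools.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical continuous measurements: heights in centimetres.
heights_cm = [158.2, 162.5, 165.0, 167.3, 170.1, 171.8, 174.4, 179.6]

# quantiles(..., n=4) returns the three quartile cut points.
q1, q2, q3 = quantiles(heights_cm, n=4)

summary = {
    "mean": mean(heights_cm),
    "median": median(heights_cm),
    "stdev": stdev(heights_cm),  # sample standard deviation
    "range": max(heights_cm) - min(heights_cm),
    "iqr": q3 - q1,
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```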

Visualization Methods:

To visualize continuous data, you can use a histogram or a box plot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. Note that a histogram does not always make outliers easy to spot, which is why we also use box plots.
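
Box plots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly; the helper name `tukey_outliers`, the sample weights, and the choice of the "inclusive" quartile method are assumptions for illustration, and plotting tools may compute quartiles slightly differently.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


# Hypothetical weights in kilograms, with one suspicious measurement.
weights_kg = [61.0, 63.5, 64.2, 66.0, 67.8, 68.1, 70.4, 72.9, 118.0]
outliers = tukey_outliers(weights_kg)

print(outliers)
```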

Figure: Plots and graphs for continuous data analysis

Summary

In this post, you discovered the different data types that are used throughout statistics. You learned the difference between discrete and continuous data, and what the nominal, ordinal, binary, interval, and ratio measurement scales are. You now know which statistical measurements you can use with which data type, and which visualization methods suit each. You also learned how categorical variables can be transformed into numeric variables. This covers a large part of an exploratory analysis on a given dataset.

Translated from: https://medium.com/swlh/data-types-in-statistics-used-for-machine-learning-5b4c24ae6036
