The Data Analyst's Toolkit: Models

You’ve cleaned up your data and done some exploratory data analysis. Now what? As data analysts we have a lot of tools in our toolkit, but just as a screwdriver can be used to hammer in a nail without being the best tool for the job, not every model suits every problem. Our tools are models, or if you prefer the mathematical term, algorithms. They allow us to make sense of the data we have collected and to make predictions.

There are three basic types of models, depending on the type of data. For continuous numerical data we have a variety of regression techniques. These are our screwdrivers and wrenches. Fairly simple to understand and use, they bring data together to fit them to some sort of line or multidimensional plane. For categorical or discrete data, we have clustering and classification models. These are our saws and knives. They separate the data into different pieces of like versus unlike. With so many choices, it may be difficult to know which tool to use under which circumstance. So, let’s look at each in turn.

Numerical regression models seek to find the best line to fit continuous numerical data. They can be linear, in which the dependent variable (usually called y) is modeled as a polynomial function of one or more independent variables. Nonlinear regression instead fits the dependent variable to a logarithmic, exponential, or sigmoid function of the independent variables.

Linear regressions include (a short code sketch follows this list):

1) Single Linear Regression: one independent variable fit to a basic line:

  • y = mx + b, where m is the slope of the line and b is the value of y at x=0

2) Multiple Linear Regression: two or more independent variables fit to a plane (a polynomial of order 1):

  • y = mx + nz + c, where m and n are the slopes in the x and z directions, and c is the value of y at x=z=0

3) Polynomial Regression: single and multiple linear regression are really just special cases of polynomial regression, in which one or more independent variables are fit to a polynomial of order greater than 1:

  • y = m0 + m1x + m2x^2 + m3x^3 + …

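As a rough sketch of the single-line and polynomial fits above, here is one way to do it in Python with NumPy's polyfit; the synthetic data and variable names are my own assumptions for illustration, not part of the original article.

```python
import numpy as np

# Hypothetical data: a noisy linear relationship, roughly y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

# Single linear regression: fit y = mx + b (a polynomial of order 1)
m, b = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Polynomial regression: fit y = m0 + m1x + m2x^2 + m3x^3 (order 3)
coeffs = np.polyfit(x, y, deg=3)   # highest-order coefficient comes first
poly = np.poly1d(coeffs)           # convenience wrapper for evaluating the fit
print("prediction at x = 5:", poly(5))
```

For multiple linear regression, the same least-squares idea extends to several columns of independent variables, for example with numpy.linalg.lstsq or scikit-learn's LinearRegression.
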
Nonlinear regressions include (a curve-fitting sketch follows this list):

1) Logarithmic Regression

  • y = a·log(x) or y = b·ln(x)

2) Exponential Regression

  • y = a·e^(bx) + c

3) Sigmoidal Regression: uses functions that create an S-curve, such as the logistic function or the hyperbolic tangent

  • y = a / (1 + e^(-bx)) or y = c·tanh(dx) + e

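One hedged way to fit nonlinear forms like these in Python is scipy.optimize.curve_fit; the model functions, parameter names, and synthetic data below are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical exponential model: y = a·e^(bx) + c (can be fit the same way as below)
def exponential(x, a, b, c):
    return a * np.exp(b * x) + c

# Hypothetical logistic (sigmoid) model: y = L / (1 + e^(-k(x - x0)))
def logistic(x, L, k, x0):
    return L / (1 + np.exp(-k * (x - x0)))

# Made-up S-shaped data with a little noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = logistic(x, L=5.0, k=1.2, x0=5.0) + rng.normal(scale=0.1, size=x.size)

# curve_fit returns the best-fit parameters and their covariance matrix
params, _ = curve_fit(logistic, x, y, p0=[y.max(), 1.0, np.median(x)])
print("fitted L, k, x0:", params)
```
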
In each of these cases, a line (or plane) is fit to continuous data. Note that it is also possible to split up your data into sections and fit different lines to each section. There are various techniques that you can use to determine the best fit line, but that is for another article.

What if you don’t have continuous data? What if you have only two or three discrete values: yes/no, for instance, or small/medium/large? Or perhaps twenty options, but each is apparently independent of the other. From a business standpoint, you may be asking about which customers are likely to default on a loan, or determining the demographics of customers purchasing a particular product. In these cases you would find it difficult to fit a linear or nonlinear regression to your data. Instead we have other types of tools that sort data rather than fit it: classification models and clustering models. While similar, the chief difference is that with classification models, you already have predefined classes into which you sort your data. For clustering models, the data is sorted into like categories, without knowing what those categories are ahead of time. (Note that these models can also be used on continuous data, but you will need to bin the continuous data into discrete units.) While regressions fit a line to the data, classification and clustering draw lines or planes between the data, separating them into categories of like vs unlike.

Classification Models include (a brief code sketch follows this list):

  • Decision trees: Here, the data is first split into two categories based on a boolean test, True or False. At each branch a new boolean test is applied, until the like data has been separated into its own categories and can be split no further. This technique can get cumbersome once you go beyond a handful of branches.

  • Random Forest: Similar to decision trees, except that many different trees are built, each on a random subset of the data and features, and their predictions are combined, typically by majority vote.

  • K-Nearest Neighbors (KNN): In this classification technique, a new data point is assigned the class held by the majority of the K labeled points nearest to it. Despite the similar name, it is not the same as K-Means Clustering (below), which is unsupervised; in KNN the analyst chooses K, the number of neighbors to consult.

  • Logistic Regression: The name sounds like this should be similar to logarithmic regression, but it is actually entirely different. In fact, it isn’t even a regression, but a classification algorithm. It is used to determine the probability of success or failure, or the probability of one outcome over another.

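To make these concrete, here is a minimal scikit-learn sketch that trains each of the four classifiers above on a tiny, made-up loan-default data set; the feature names and values are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [age, income in $1000s] -> did the customer default? (1 = yes)
X = [[25, 30], [40, 80], [35, 45], [50, 120], [23, 20], [60, 95]]
y = [1, 0, 1, 0, 1, 0]

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3),
    "random forest": RandomForestClassifier(n_estimators=100),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
}

new_customer = [[30, 40]]
for name, model in models.items():
    model.fit(X, y)  # train on the labeled examples
    print(name, "->", model.predict(new_customer))
```
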
Clustering Models include (a brief K-means sketch follows this list):

  • Hierarchical clustering: Generally used with smaller data sets, as it quickly becomes unwieldy with too much data. It starts with the entire data set in a single cluster and, with each iteration, splits into more clusters until it runs out of data or every observation has been assigned to a branch that no longer changes. Similar to a decision tree, except that you do not know the categories ahead of time. Usually shown on a dendrogram.

  • Agglomerative clustering: A special case of hierarchical clustering, but beginning from the bottom up. Each data point begins in its own cluster, then with each iteration, data are linked together into clusters that are similar. Like hierarchical clustering, this works best with smaller data sets, because of space and time limitations.

  • K-means: A method of partitioning observations into k clusters, where the data within each cluster is more closely related to one another than the data outside the clusters. It is done iteratively, so that at each round, the location of each cluster center changes until all points have been assigned to a cluster and the clusters no longer change. K-means clustering can be used with both large and small data sets. It works best with sets of data that can form into roughly spherical sets.

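A minimal sketch of K-means clustering on made-up two-dimensional data, assuming scikit-learn; the blob locations and the choice of k = 2 are arbitrary and for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two loose blobs of points in two dimensions
rng = np.random.default_rng(2)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Ask K-means for k = 2 clusters; labels_ holds the cluster assigned to each point
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first five labels:", kmeans.labels_[:5])
```

Hierarchical and agglomerative clustering are available in the same library as sklearn.cluster.AgglomerativeClustering.
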
Classification and clustering models can be used with numeric data, or with non-numeric data that has been encoded as numbers. That is, textual data with a limited number of discrete values can be converted into numbers that carry no arithmetic meaning, either by label encoding (one integer per category) or by one-hot encoding (one binary column per category). For example, suppose you have three clothing sizes: Small, Medium, and Large. You can label-encode these as 1 for small, 2 for medium, and 3 for large, but the numbers are merely labels; in this case 1 + 2 != 3.

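As a small illustration of the difference between label encoding and one-hot encoding, here is a pandas sketch; the column name and size values are made up.

```python
import pandas as pd

# Hypothetical column of clothing sizes
df = pd.DataFrame({"size": ["Small", "Medium", "Large", "Medium", "Small"]})

# Label encoding: one integer per category (the numbers are labels, not quantities)
size_labels = {"Small": 1, "Medium": 2, "Large": 3}
df["size_label"] = df["size"].map(size_labels)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["size"], prefix="size")

print(df)
print(one_hot)
```
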
Like the regression models above, these models can be used both to describe your current set of data and to make predictions about new data. Using machine learning, you can train these models on data you already know in order to predict data that you do not. The mechanics of that are beyond this article, but there are many great resources on machine learning.

Conclusion:

We have many tools for modeling data in our data analyst toolkit. Regression models are the screwdrivers and wrenches of our kit, pulling continuous data together and fitting it to some sort of line or plane in one or more dimensions. Classification and clustering models are our saws and knives, cutting the data apart and separating it into groups or clusters of like versus unlike. These are the most basic models in our toolkit, and it is important to understand when we can use one type of model or another, and which is the best model for our data.

For Further Learning:

For a great background in data science, try Confident Data Skills, by Kirill Eremenko, a data scientist out of Australia who is head of SuperDataScience. You can check out his online courses on Udemy as well. He is very enthusiastic about data science and his courses are well plotted and easy to follow.

For a really in-depth look at the mathematics behind these models and other machine learning models, look at Machine Learning: A Concise Introduction, by Steven Knox. Steve is the head of data analytics at the NSA, and a former colleague of mine. His book won the award for best prose in a textbook, and is straightforward and easy to follow, with a depth of mathematical rigor that most data analysts tend to gloss over.

For a great online course, try IBM’s data science track on Coursera, a series of nine courses using Python for data science that covers everything from the basics of data analysis up through machine learning models. It is especially well done, with lots of labs, assignments, and projects to complete, including a final capstone project to earn the data science certificate.

And, of course, there is the data science section of Medium, which offers a wide variety of data science topics from beginner to advanced, and has been a wealth of information for me as a career changer.

About me: I am a lifelong user of data, originally as an environmental engineer, then (surprisingly) in the field of ministry. Having left that world, I have relearned old data analytic techniques and the wealth of new tools, to become a freelance data analyst. You can find me on LinkedIn.

Translated from: https://medium.com/swlh/the-data-analysts-toolkit-models-81aae3611f65
