Data Science and Big Data Analytics Study Notes 8: Classification

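This article covers the decision tree algorithm in detail, including how it works, its evaluation criteria, and attribute selection. It also discusses the naive Bayes classifier and how Laplace smoothing handles rare events. It then describes diagnostics for classifiers, such as the confusion matrix, and strategies for preventing overfitting. Finally, it mentions additional classification models such as random forests and support vector machines.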

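•Classification is a fundamental learning method used in data mining applications.
•The primary task of a classifier is to assign class labels to new observations.
•Supervised classification methods
–Start from a set of labeled observations.
–Predict the outcome of new observations.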

Decision Tree

• Each node tests a particular input variable.
• Each branch represents the decision made.
• To classify a new observation, traverse the decision tree from the root to a leaf.
• The depth of a node is the minimum number of steps required to reach the node from the root.
• Leaf nodes are at the end of the last branches on the tree, representing class labels.

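If the decision variable is numeric, the "greater than" branch is usually placed on the right and the "less than" branch on the left. Depending on the nature of the variable, one of the branches may need to include the "equal to" case.
An internal node is a decision or test point. Each internal node corresponds to an input variable or attribute. The topmost internal node is also called the root. The decision tree in the figure is a binary tree, in which each internal node has no more than two branches. The branching of a node is called a split.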

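The decision tree in the figure shows that females with an income less than or equal to $45,000 and males aged 40 or younger are classified as people who will purchase the product. Traversing the tree shows that age is irrelevant to the decision for females, and income is irrelevant to the decision for males.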
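The traversal described above can be sketched in Python. This is only an illustration: the nested-dict layout, the field names (`gender`, `income`, `age`), and the "no" label on the remaining leaves are assumptions, since the notes only state which groups are classified as buyers.

```python
# Hypothetical encoding of the tree described in the notes.
# Assumption: every leaf not described as "will purchase" is labeled "no".
TREE = {
    "attribute": "gender",
    "branches": {
        "female": {"attribute": "income", "threshold": 45000,
                   "left": "yes", "right": "no"},
        "male": {"attribute": "age", "threshold": 40,
                 "left": "yes", "right": "no"},
    },
}


def classify(observation, node=TREE):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(node, dict):
        value = observation[node["attribute"]]
        if "threshold" in node:
            # Numeric test: "less than or equal" goes left, "greater than" goes right.
            node = node["left"] if value <= node["threshold"] else node["right"]
        else:
            # Categorical test: follow the branch named after the value.
            node = node["branches"][value]
    return node  # a leaf is a plain class label


print(classify({"gender": "female", "income": 30000, "age": 70}))
print(classify({"gender": "male", "income": 90000, "age": 35}))
```

Both sample observations reach a "yes" leaf, and in each case one attribute (age for the female, income for the male) is never tested along the path.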
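The most informative attribute is identified by its information gain, which is calculated from entropy.
Root node: P(subscribed=yes) = 1 − 1789/2000 = 10.55%
Entropy: measures the impurity of an attribute.
Information gain: measures the purity of an attribute, that is, how much splitting on it reduces entropy.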

Base entropy
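The original formula image is missing here. As a minimal sketch, assuming the standard Shannon definition H(X) = −Σ P(x) log2 P(x), the base entropy at the root can be computed from the probability given in these notes. The function name `entropy` is illustrative.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# P(subscribed=yes) at the root, as given in the notes.
p_yes = 1 - 1789 / 2000
base_entropy = entropy([p_yes, 1 - p_yes])
print(round(base_entropy, 4))
```

With P(subscribed=yes) = 10.55% the result is roughly 0.486 bits, well below the maximum of 1 bit for a 50/50 split, because the root is dominated by one class.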
Conditional entropy
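The formula images for this step are missing. A minimal sketch, assuming the usual definition H(Y|X) = Σ P(x) · H(Y|X=x): the entropy of the class label is computed within each value of the attribute and then weighted by how often that value occurs. The counts below are hypothetical and are not taken from the notes.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(table):
    """table maps each attribute value to its class counts, e.g. [n_yes, n_no]."""
    n = sum(sum(counts) for counts in table.values())
    return sum((sum(counts) / n) * entropy_from_counts(counts)
               for counts in table.values())


# Hypothetical counts: value "a" is evenly mixed, value "b" is pure.
toy_table = {"a": [5, 5], "b": [10, 0]}
print(conditional_entropy(toy_table))
```

In this toy table half of the records fall in a perfectly mixed group (1 bit) and half in a pure group (0 bits), so the weighted average is 0.5 bits.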
Information gain
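Assuming the usual definition, the information gain of an attribute A is the base entropy minus the conditional entropy given A: InfoGain(A) = H(S) − H(S|A). The sketch below picks the most informative attribute as the one with the largest gain. The attribute names and counts are hypothetical and serve only to illustrate the calculation.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """table maps each attribute value to its class counts, e.g. [n_yes, n_no]."""
    # Summing the counts column by column recovers the overall class distribution.
    class_totals = [sum(column) for column in zip(*table.values())]
    n = sum(class_totals)
    conditional = sum((sum(counts) / n) * entropy(counts)
                      for counts in table.values())
    return entropy(class_totals) - conditional


# Hypothetical attributes: attr_a separates the classes fairly well,
# attr_b leaves every group as mixed as the whole data set.
tables = {
    "attr_a": {"x": [8, 2], "y": [2, 8]},
    "attr_b": {"x": [5, 5], "y": [5, 5]},
}
best = max(tables, key=lambda name: information_gain(tables[name]))
print(best)
```

An attribute whose groups are no purer than the full data set has a gain of zero, so the split would be chosen on `attr_a` in this toy example.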
