Machine Learning and Data Mining (CS 5751): Midterm 1 Review Notes

Article link: https://blog.csdn.net/yinxx325/article/details/83240379

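These are review notes for the machine learning course CS 5751, covering the definition of data, attribute types (discrete and continuous), data quality, distance measures, data preprocessing, linear regression, and decision trees. They focus on distinguishing discrete from continuous attributes and on distance measures such as Euclidean and Mahalanobis distance. They also touch on Hunt's algorithm for decision trees, strategies for handling overfitting and underfitting, and model evaluation methods.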



(1) Basic definitions
(2) Exam topics (PPT-2)
(3) Exam topics: linear regression (PPT-4)
(4) Exam topics: decision trees (PPT-4-1)
(5) Exam topics (PPT-4-2)

I put these notes together for my own use, so they are just an outline.

(1) Basic definitions

What is data?

Collection of data objects and their attributes

Four types of attributes

Categorical (Qualitative)

Nominal
– Examples: ID numbers, eye color, zip codes
Ordinal
– Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}

Numeric (Quantitative)

Interval
– Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
– Examples: temperature in Kelvin, length, time, counts

Determining the attribute type:

The type depends on which operations are meaningful for the attribute.

Four operations

– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /

Which operations each type supports

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
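
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_attribute` and the operation labels are made up for this note, and which operations are "meaningful" still has to be decided by a person who knows the data.

```python
# The four operations, ordered from weakest to strongest.
OPERATIONS = ["distinctness", "order", "difference", "ratio"]
# The attribute type reached when the first 1, 2, 3 or 4 operations hold.
TYPES = ["nominal", "ordinal", "interval", "ratio"]


def classify_attribute(supported_ops):
    """Return the attribute type implied by the set of meaningful operations."""
    level = 0
    for op in OPERATIONS:
        if op not in supported_ops:
            break  # the hierarchy is cumulative, so stop at the first gap
        level += 1
    if level == 0:
        raise ValueError("an attribute must at least support distinctness")
    return TYPES[level - 1]


# Examples taken from the lists above.
eye_color = classify_attribute({"distinctness"})
grades = classify_attribute({"distinctness", "order"})
celsius = classify_attribute({"distinctness", "order", "difference"})
kelvin = classify_attribute({"distinctness", "order", "difference", "ratio"})
```

Celsius comes out as interval rather than ratio because 20 °C is not "twice as hot" as 10 °C, while ratios of Kelvin temperatures are meaningful.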

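Discrete vs. continuous

Discrete attributes

Have only a finite or countably infinite set of values.
Examples: zip codes, counts.
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes.

Continuous attributes

Take real numbers as attribute values.
Examples: temperature, height, or weight.
In practice, real values can only be measured and represented with a finite number of digits.
Continuous attributes are typically represented as floating-point variables.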

Data quality

Poor data quality negatively affects many data processing efforts.
Examples of data quality problems:

Noise and outliers
Missing values
Duplicate data
Wrong data
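
Two of the problems listed above, missing values and duplicate data, are easy to check mechanically. The sketch below is a minimal illustration on a made-up table; the `records` data and the helper names are assumptions for this note, and noise, outliers, and wrong data usually need domain knowledge rather than a simple scan.

```python
# Hypothetical records: None marks a missing value, and id 3 appears twice.
records = [
    {"id": 1, "height_cm": 172.0},
    {"id": 2, "height_cm": None},
    {"id": 3, "height_cm": 168.5},
    {"id": 3, "height_cm": 168.5},
]


def count_missing(rows):
    """Count attribute values that are missing (None)."""
    return sum(1 for row in rows for value in row.values() if value is None)


def find_duplicates(rows):
    """Return rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


missing = count_missing(records)
duplicates = find_duplicates(records)
```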

(2) Exam topics (PPT-2)

Distance

Euclidean distance
Minkowski distance

r = 1. Manhattan (taxicab, L1 norm, rectilinear) distance
r = 2. Euclidean distance
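
For reference, the Minkowski distance between points p and q is d(p, q) = (sum over k of |p_k - q_k|^r)^(1/r), and the two cases above fall out by setting r. A minimal sketch in plain Python follows; the function name and the sample points are chosen here for illustration and do not come from the course slides.

```python
def minkowski_distance(p, q, r):
    """Minkowski distance of order r between two equal-length points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


p = (0.0, 0.0)
q = (3.0, 4.0)

manhattan = minkowski_distance(p, q, 1)  # |3| + |4|
euclidean = minkowski_distance(p, q, 2)  # sqrt(3^2 + 4^2)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, the familiar 3-4-5 triangle.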
Covariance
Covariance matrix
Correlation coefficient
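
The three quantities just listed can be computed from their standard definitions. The sketch below assumes the sample versions (dividing by n - 1) and uses made-up data; the helper names are illustrative, and a library such as NumPy would normally be used instead.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def covariance_matrix(columns):
    """Entry (i, j) is the covariance between attribute i and attribute j."""
    return [[covariance(a, b) for b in columns] for a in columns]


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: v is an exact linear function of u.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.0]

cov_uv = covariance(u, v)
cov_matrix = covariance_matrix([u, v])
corr_uv = correlation(u, v)
```

Because v = 2u is a perfect linear relationship, the correlation here is 1 up to floating-point rounding. The diagonal of the covariance matrix holds the variances of u and v.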

Drawback of the correlation coefficient:
Suppose we have the following two vectors:
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
y = x^2
corr = (-3)(5)&