5001: Statistical Machine Learning I. Textbook: An Introduction to Machine Learning.

First class

What is data?

A data set is a collection of data objects and their attributes.
An object is also known as a record, point, case, sample, entity, or instance.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, age, height, weight.
An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.

Types of variables

Types of data

Data matrix

If all data objects have the same fixed set of numeric variables, they can be thought of as points in a multidimensional space, where each dimension represents a distinct variable.
Such a data set can be represented by an n × p matrix with n rows, one for each object, and p columns, one for each variable.
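As a minimal sketch of the n × p layout described above, the following uses NumPy with a made-up toy data set. The values and the column meanings (age, height, weight) are assumptions for illustration only and do not come from the course material.

```python
import numpy as np

# Hypothetical toy data: 4 objects (rows), 3 numeric variables (columns).
# Column meanings (age, height in cm, weight in kg) are illustrative only.
X = np.array([
    [23.0, 170.0, 65.0],
    [35.0, 182.0, 80.0],
    [41.0, 160.0, 54.0],
    [29.0, 175.0, 72.0],
])

n, p = X.shape        # n objects, p variables
first_object = X[0]   # one row: a single point in p-dimensional space
heights = X[:, 1]     # one column: a single variable across all objects
```

Rows index objects and columns index variables, so `X[i, j]` is the value of variable j for object i (with zero-based indices in NumPy).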

Text data

Transaction data

Graph data

Data quality

Noise and outliers

Missing values

Sampling bias

Sample distortion arises from a mismatch between the random
sample and the population of interest

Convenience sample

Survivorship bias
Information can only be collected from the fighters that returned.
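The effect can be illustrated with a small simulation. This is a sketch under stated assumptions: it assumes the note refers to the classic returning-aircraft example, and the hit distribution and survival rule below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_planes = 10_000

# Assumption: the number of engine hits per plane is Poisson with mean 2.
engine_hits = rng.poisson(2.0, size=n_planes)

# Assumption: each engine hit halves the chance that the plane returns.
returned = rng.random(n_planes) < 0.5 ** engine_hits

population_mean = engine_hits.mean()         # what we would like to know
sample_mean = engine_hits[returned].mean()   # what we can actually observe

print(f"mean engine hits, all planes:      {population_mean:.2f}")
print(f"mean engine hits, returned planes: {sample_mean:.2f}")
```

Because heavily hit planes rarely return, the observed sample understates the number of engine hits in the full population.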

Population drift

A data set collected in the USA cannot be used for an analysis of Hong Kong.
Data collected from Google three months ago cannot be used for an analysis three months later.

What is data exploration?

Data exploration techniques

Summary statistics

Visualization

Sea surface temperature

Iris data

Histogram

2-d histogram

Boxplot

Scatter plot

Matrix plot

Iris similarity matrix

Parallel coordinates plot

Other visualization techniques

Star plots

Chernoff faces

1 Introduction

Unfamiliar words

astrophysics
quadratic discriminant analysis model
demographic information
computationally infeasible
scalar

Content

Statistical learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.

Supervised statistical learning

Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
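A minimal sketch of the supervised setting, assuming a single input and a linear model fitted by least squares. The synthetic data, the true coefficients (2 and 3), and the helper name `predict` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: one input x and a noisy output y.
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=200)

# Design matrix with an intercept column, then an ordinary least squares fit.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

def predict(x_new):
    """Predict the output for new inputs using the fitted line."""
    return intercept + slope * np.asarray(x_new)

print(predict([1.0, 5.0]))
```

The model is built from pairs of inputs and known outputs, and is then used to estimate the output for inputs it has not seen.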

Unsupervised statistical learning

With unsupervised statistical learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data.
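To illustrate learning structure without an output variable, here is a sketch of a two-cluster version of the k-means procedure written directly in NumPy. The synthetic blobs, the function name `two_means`, and the simple farthest-point initialization are assumptions made for this example, not a method taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs only (no output): two well-separated groups in 2-D.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

def two_means(X, n_iter=10):
    """Group the rows of X into two clusters with Lloyd-style iterations."""
    # Initialization: the first point and the point farthest from it.
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        centers = np.stack([X[labels == k].mean(axis=0) for k in range(2)])
    return labels, centers

labels, centers = two_means(X)
print(centers)
```

No labels are supplied; the grouping is recovered from the inputs alone.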

Three real-world data sets

Wage Data: predicting a continuous, or quantitative, output value (a regression problem).

Stock Market Data: predicting whether a given day's stock market performance will fall into the Up bucket or the Down bucket (a classification problem).

Gene Expression Data: there is no output variable; we wish to understand which observations are similar to each other by grouping them according to their observed characteristics, much as one might group similar customers (a clustering problem).

Notation and Simple Matrix Algebra

n

We will use n to represent the number of distinct data points, or observations, in our sample.

p

We will let p denote the number of variables that are
available for use in making predictions.

Variable Names


xij (lowercase x, subscript ij)

We will let xij represent the value of the jth variable for the
ith observation, where i = 1, 2, …, n and j = 1, 2, …, p.

X

We let X denote an n × p matrix whose (i, j)th element is xij.

yi

We use yi to denote the ith observation of the variable on which we
wish to make predictions.
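Written out in full, the notation above can be laid out as follows. The boldface symbols and the stacking of the yi into a single vector are presentation choices assumed here, not something stated in the notes.

```latex
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix},
\qquad
\mathbf{y} =
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}
```

Row i of X holds the p variable values for the ith observation, and yi is the corresponding value of the variable to be predicted.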

2 Statistical Learning

Unfamiliar words

2.1 What Is Statistical Learning?
