Statistical Machine Learning I
- First class
- What is data?
- Types of variables
- Types of data
- Data matrix
- Text data
- Transaction data
- Graph data
- Data quality
- Noise and outliers
- Missing values
- Sampling bias
- What is data exploration?
- Data exploration techniques
- Summary statistics
- Visualization
- Sea surface temperature
- Iris data
- Histogram
- 2-d histogram
- Boxplot
- Scatter plot
- Matrix plot
- Iris similarity matrix
- Parallel coordinates plot
- Other visualization techniques
- 1 Introduction
- 2 Statistical Learning
First class
What is data?
Data is a collection of data objects and their attributes
An object is also known as a record, point, case, sample, entity, or instance
An attribute is a property or characteristic of an object
Examples: eye color of a person, age, height, weight
An attribute is also known as a variable, field, characteristic, or feature
A collection of attributes describes an object
Types of variables
Types of data
Data matrix
If data objects have the same fixed set of numeric variables, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct variable
Such a data set can be represented by an n × p matrix with n rows, one for each object, and p columns, one for each variable
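A minimal sketch of the n × p layout described above, using plain Python lists. The variable names and the numbers are hypothetical and serve only to illustrate the shape of the matrix.

```python
# Hypothetical data: 4 objects (rows) described by 3 numeric variables (columns).
variables = ["age", "height_cm", "weight_kg"]  # illustrative names only
X = [
    [23, 170.0, 65.0],
    [35, 182.5, 80.2],
    [41, 160.3, 54.8],
    [29, 175.1, 72.4],
]

n = len(X)     # number of objects (rows)
p = len(X[0])  # number of variables (columns)

# Every object must be described by the same fixed set of variables.
assert all(len(row) == p for row in X)


def column(matrix, j):
    """Return the values of variable j (0-based) across all objects."""
    return [row[j] for row in matrix]


second_object = X[1]    # one point in p-dimensional space
heights = column(X, 1)  # one variable observed on all n objects

print(f"n = {n}, p = {p}")
print("second object:", second_object)
print("heights:", heights)
```

Each row is one point in a 3-dimensional space, and each column collects one variable over all objects.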
Text data
Transaction data
Graph data
Data quality
Noise and outliers
Missing values
Sampling bias
Sample distortion arises from a mismatch between the random sample and the population of interest
Convenience sample (survivorship bias)
Information can be collected only from the fighters that returned
Population drift
Data sets from the USA cannot be used to analyze Hong Kong
Data collected on Google three months ago cannot be used for an analysis three months later
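The effect of a sample that does not match the population can be illustrated with a small simulation. This is only a sketch under assumed conditions: the population is synthetic, and the cutoff that decides which members are observed is a hypothetical stand-in for "only the fighters that returned".

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1000 values centred on 50.
population = [rng.gauss(50, 10) for _ in range(1000)]

# Random sample: every member has the same chance of being chosen.
random_sample = rng.sample(population, 100)

# Biased sample: only members above a cutoff are ever observed.
observed_only = [v for v in population if v > 50]

population_mean = mean(population)
random_mean = mean(random_sample)
biased_mean = mean(observed_only)

print(f"population mean:    {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The biased sample overestimates the population mean because the members below the cutoff never enter the data, no matter how many observations are collected.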
What is data exploration?
Data exploration techniques
Summary statistics
Visualization
Sea surface temperature
Iris data
Histogram
2-d histogram
Boxplot
Scatter plot
Matrix plot
Iris similarity matrix
Parallel coordinates plot
Other visualization techniques
Star plots
Chernoff faces
1 Introduction
Unfamiliar words
astrophysics
quadratic discriminant analysis model
demographic information
computationally infeasible
scalar
Content
Statistical learning
Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.
Supervised statistical learning
Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs
Unsupervised statistical learning
With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.
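The contrast between the two settings can be sketched in a few lines. The data below are made up, and the two methods are assumptions chosen for illustration: a least squares line stands in for a supervised model, and a tiny two-group k-means stands in for an unsupervised one.

```python
from statistics import mean

# --- Supervised: each input x comes with an output y to predict ---
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up outputs, roughly y = 2x

x_bar, y_bar = mean(x), mean(y)
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - slope * x_bar


def predict(new_x):
    """Estimate the output for a new input using the fitted line."""
    return intercept + slope * new_x


# --- Unsupervised: only inputs, no output to supervise the fit ---
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]


def two_means(values, iterations=10):
    """Tiny 1-D k-means with k = 2, started from the min and max values."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


centers, groups = two_means(points)

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
print("groups found:", groups)
```

In the first half the outputs y guide the fit; in the second half the structure (two groups) is recovered from the inputs alone.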
Three real-world data sets
Wage Data (predicting a continuous or quantitative output value: a regression problem)
Stock Market Data (predicting whether a given day’s stock market performance will fall into the Up bucket or the Down bucket: a classification problem)
Gene Expression Data (only input variables are observed; we wish to understand which observations are similar to each other by grouping them according to their observed characteristics, just as customers might be grouped in a marketing setting: a clustering problem)
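What distinguishes the classification case is that the output is a category rather than a number. The sketch below uses hypothetical daily percentage changes (not real market data) and a deliberately naive prediction rule, which is an assumption made only to show how a categorical output is scored.

```python
# Hypothetical daily percentage changes (not real market data).
returns = [0.4, -0.2, 0.1, 0.3, -0.5, -0.1, 0.2, 0.6]


def direction(change):
    """Map a quantitative change to a qualitative class label."""
    return "Up" if change > 0 else "Down"


labels = [direction(r) for r in returns]

# Naive rule (illustrative assumption): predict that today moves
# in the same direction as yesterday.
predictions = labels[:-1]  # yesterday's label, used as today's guess
actual = labels[1:]        # the direction actually observed today

correct = sum(p == a for p, a in zip(predictions, actual))
accuracy = correct / len(actual)

print("labels:", labels)
print(f"naive rule accuracy: {correct}/{len(actual)} = {accuracy:.2f}")
```

A regression model would instead be scored on how far its numeric predictions fall from the observed values.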
Notation and Simple Matrix Algebra
n
We will use n to represent the number of distinct data points, or observations, in our sample.
p
We will let p denote the number of variables that are available for use in making predictions.
Variable Names
xij (lowercase x, subscript ij)
We will let xij represent the value of the jth variable for the ith observation, where i = 1, 2, …, n and j = 1, 2, …, p.
(Also)
X
We let X denote an n × p matrix whose (i, j)th element is xij
yi
We use yi to denote the ith observation of the variable on which we wish to make predictions
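Written out in full, the notation above corresponds to the following layout. Collecting the yi into a single column vector y is an added convention here, used only to show the outputs alongside X.

```latex
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix},
\qquad
\mathbf{y} =
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}
```

Row i of X holds the p variable values for the ith observation, and yi is the output paired with that row.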