Hands on Data Analytics for Everyone (UIC)


  • These notes cover the quiz and final-exam topics of the UIC Hands-on Data Analytics course; highlighted items are exam points.
  • The final project requires learning the KNIME software for modeling.

Preparation

Data Analytics

  • Autonomous Driving Car
  • Industrial Production
  • Investment in Financial Markets
  • Scientific Research

Data Science Skills

  • Data Engineer: Database, Coding Skills (Computer Programming)
  • Machine Learning Researcher: Math, Statistics, Machine Learning Knowledge (Math/Stat)
  • Field Knowledge: biology, business…

Chapter 1 Data Analytics Foundations

Data Science: Extract knowledge and insights from structured and unstructured data

Data project life cycle

Data->Data Preparation->Model Training->Model Optimization->Model Testing

Structured Data & Unstructured Data

  • Structured data can be processed directly by machines, while unstructured data cannot.
  • For example, the student-grade data collected by the Academic Registry are structured, while the content of student emails is unstructured.
  • Structured data are stored in well-designed databases, such as the sales data or the customer-relationship-management data of a company. Unstructured data can be collected and stored, but not in a specifically designed database; examples are phone calls and blog posts on Weibo.
  • Structured Data: Excel
  • Unstructured Data: e-mail, WeChat (social media)
Common Structured Data Types
  • CSV (Comma-separated values)
  • XML (Extensible Markup Language)
  • JSON (JavaScript Object Notation)
  • XLS (Microsoft Excel)
Comma-Separated Values
  1. Each line of the file is a data record.
  2. Each record consists of one or more attributes. The attributes are separated by commas.
In the “.csv” format, new records are separated by new lines.

  • The first record may be a header row of field names: Year,Make,Model,Description,Price
  • Each record occupies one line, with commas as separators, e.g. 1997,Ford,E350,“ac, abs, moon”,3000.00
  • Spaces before and after the commas are ignored
  • A field containing commas, line breaks, or spaces must be enclosed in double quotes
  • A double quote inside a field is represented by two double quotes
  • If a field contains double quotes, that field must be enclosed in double quotes
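The quoting rules above can be checked with Python's standard `csv` module; a minimal sketch (the example record follows the notes, the field content is illustrative):

```python
import csv
import io

# A record whose fourth field contains commas and an escaped
# double quote (""), so the whole field is quoted.
raw = 'Year,Make,Model,Description,Price\n'
raw += '1997,Ford,E350,"ac, abs, ""moon"" roof",3000.00\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])
# The quoted field is read back as a single attribute:
# ['1997', 'Ford', 'E350', 'ac, abs, "moon" roof', '3000.00']
```

Note that the reader collapses the doubled quotes back into one, exactly as the escaping rule describes.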

Data Type
  • Categorical: noun
  • Numeric: number
  • Ordinal: High-Normal-Low

Relation between computer science, statistics and data analytics

If viewed as a pipeline, data analytics is the bridge that connects statistics and computer science.

Difference between computer science, statistics and data analytics

Data analytics focuses on using statistical methods to discover insights from data. Statistics is more traditional and theoretical. Computer science focuses on solving all problems in a computable way, covering topics such as computability, algorithms, system design, networks, artificial intelligence, and software engineering.


Chapter 2 Data Processing

Data Summary

Basic Descriptive Statistics

Statistical measures can be used to describe a dataset

  • Range: $Range = Max\ value - Min\ value$
  • Min/Max value
  • Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • Variance: $\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2$
  • Standard deviation: $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2}$
  • Median: the middle number 1
  • Mode: most frequently occurring value
Percentiles (Quartiles)
  • q%-quantile (0 < q < 100): the value for which q% of the values are smaller and (100−q)% are larger. The median is the 50%-quantile.
  • Quartiles: 25%-quantile (1st quartile), median (2nd quartile), 75%-quantile (3rd quartile)
  • Interquartile range (IQR): $IQR = 3rd\ quartile - 1st\ quartile$
How to find a quartile?
  1. Count the number of observations in the dataset (n).
  2. Sort the observations from smallest to largest.
  3. Find the first/second/third quartile:
    Calculate n*(1/4).
    If n*(1/4) is an integer, then the first quartile is the mean of the numbers at positions n*(1/4) and n*(1/4)+1.
    If n*(1/4) is not an integer, then round it up. The number at this position is the first quartile.
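The three steps above can be sketched in Python (a minimal implementation of the stated position rule; the function name is illustrative):

```python
import math

def first_quartile(data):
    """First quartile by the position rule: sort, compute n/4,
    average two values at an integer position, else round up."""
    xs = sorted(data)           # step 2: sort smallest to largest
    n = len(xs)                 # step 1: count observations
    pos = n * 0.25              # step 3: position of the 1st quartile
    if pos == int(pos):         # integer position: mean of this value and the next
        i = int(pos)
        return (xs[i - 1] + xs[i]) / 2
    return xs[math.ceil(pos) - 1]   # otherwise round up and take that value

# Example: n = 8, 8/4 = 2 is an integer, so Q1 = (2nd + 3rd)/2
print(first_quartile([1, 2, 3, 4, 5, 6, 7, 8]))  # 2.5
```

Using `n * 0.75` instead of `n * 0.25` gives the third quartile by the same rule.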

Data Visualization

Chart
Dimension 1
Bar chart
  • A bar chart is a simple way to depict the frequencies of the values of a categorical attribute.
Histogram
  • A histogram shows the frequency distribution for a numerical attribute.
    Difference:

Bar chart is discrete.
Histogram is continuous.
Bar chart is suitable for categorical data while histogram is for numeric data

Choice of Number of Bins
  • Choosing a low number of bins

The two peaks of the original distribution are no longer visible, and one gets the wrong impression that the distribution is unimodal.

  • Choosing a high number of bins

Usually leads to a very scattered histogram in which it is difficult to distinguish true peaks from random peaks.

  • Best Choice

$k = [\log_2 n + 1]$
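A quick sketch of the bin-count rule, interpreting the bracket as rounding down (an assumption; the notes do not spell out the rounding):

```python
import math

def best_bins(n):
    """Number of histogram bins by the rule k = [log2(n) + 1],
    with the bracket taken as the floor (assumed interpretation)."""
    return int(math.log2(n) + 1)

print(best_bins(100))  # log2(100) ~ 6.64, so k = 7
```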

  • Boxplot 2

  • Boxplots are a very compact way to visualize and summarize the main characteristics of a numeric attribute, through the median, the IQR, and possible outliers.

Dimension 2
Scatter Plot
  • In scatter plots two attributes are plotted against each other
  • Can be enriched with additional features (color, shape, size)
  • Suitable for small number of points; not suitable for large datasets
  • Points can hide each other
Dimension 3
3D plot
Scatter Matrixes
  • A matrix of scatter plots m×m where m is the number of attributes (data dimensionality)
  • For m attributes there are m(m − 1)/2 possible scatter plots
Parallel Coordinates Plot “cuba data”
Radar Plot “spider plots”
  • Similar idea of the Parallel Coordinates plot
  • Axes are drawn in a star-like fashion intersecting in one point
  • Suitable for small datasets
Sunburst Chart

Dimensionality Reduction Techniques

Measure based

Requires min-max-normalization of numeric columns

  • Ratio of missing values: If missing value > threshold, then remove the column.
  • Low variance: If variance < threshold, then remove column. 3
  • High Correlation: If corr(var1, var2) > threshold, then remove var1.
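A minimal pure-Python sketch of the three filters, assuming each column is given as a name mapped to a list of values with `None` marking missing entries; the threshold values and function names are illustrative:

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def filter_columns(columns, miss_thr=0.3, var_thr=0.01, corr_thr=0.9):
    """Apply the three measure-based reduction rules to numeric columns."""
    keep = {}
    for name, col in columns.items():
        # Rule 1: ratio of missing values above threshold -> remove column
        if col.count(None) / len(col) > miss_thr:
            continue
        vals = [v for v in col if v is not None]
        # Min-max normalize before comparing variances across columns
        lo, hi = min(vals), max(vals)
        norm = [(v - lo) / (hi - lo) for v in vals] if hi > lo else [0.0] * len(vals)
        # Rule 2: low variance -> remove column
        if statistics.variance(norm) < var_thr:
            continue
        keep[name] = col
    # Rule 3: high correlation -> remove the first column of the pair
    # (simplification: assumes the kept columns have no missing values)
    names = list(keep)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in dropped and b not in dropped and \
                    abs(pearson(keep[a], keep[b])) > corr_thr:
                dropped.add(a)
    return {n: c for n, c in keep.items() if n not in dropped}

cols = {'a': [1, 2, 3, 4],            # perfectly correlated with 'b' -> dropped
        'b': [2, 4, 6, 8],
        'c': [None, None, None, 1],   # 75% missing -> dropped
        'd': [5, 5, 5, 5]}            # zero variance -> dropped
print(sorted(filter_columns(cols)))   # ['b']
```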

Data Cleaning

Missing Values
Missing Value Type
  • Missing Completely At Random (MCAR): the probability that a value for X is missing depends neither on the value of X nor on other variables.
  • Missing At Random (MAR): the probability that Y is missing depends only on the value of X.
  • Not Missing At Random (NMAR): the probability that Y is missing depends on the unobserved value of Y itself. (Most serious)
Missing Values Imputation
  • Ignore or delete the record
  • Fill in (impute) the missing value as “unknown”, or with the mean/median/mode 4
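A minimal sketch of mean imputation in plain Python (median or mode imputation simply swaps in `statistics.median` or `statistics.mode`; the function name is illustrative):

```python
import statistics

def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([1.0, None, 2.0, 3.0]))  # [1.0, 2.0, 2.0, 3.0]
```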
Outliers
  • An outlier is a value or data object that is far away or very different from all or most of the other data.
  • Errors in measurements or exceptional conditions that don’t describe the common functioning of the underlying system.
Outlier Detection Techniques
Knowledge-based
  • We know that a 200 year old person must be a mistake
  • We know that “A” in a number corpus is an outlier
Statistics-based
  • Distance from the median
  • Position in the distribution tails
Statistical Methods
  • Quantile-based: Box plot
  • Distribution-based: Z-Score
Data Normalization
  • min–max normalization: $x' = (x - min)/(max - min)$, so that $x' \in [0, 1]$
  • z-score standardization
  • robust z-score standardization
  • decimal scaling
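The first two normalization schemes can be sketched in a few lines of Python (function names are illustrative; `z_score` uses the sample standard deviation):

```python
import statistics

def min_max(xs):
    """Min-max normalization: maps values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Z-score standardization: zero mean, unit sample standard deviation."""
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

print(min_max([2, 4, 6]))  # [0.0, 0.5, 1.0]
print(z_score([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```

The robust z-score variant replaces the mean with the median and the standard deviation with a robust spread measure such as the IQR, making it less sensitive to outliers.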
Feature Engineering
Scale Conversion
  • Categorical → Numerical: map categorical and ordinal values to a set of binary values
  • Numerical → Categorical: Discretization (equal-width, equal-depth, V-optimal)
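Both conversion directions can be sketched in plain Python, assuming small in-memory lists (function names are illustrative; the discretization shown is the equal-width variant):

```python
def one_hot(values):
    """Categorical -> numerical: map each category to a binary indicator vector."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def equal_width_bins(xs, k):
    """Numerical -> categorical: equal-width discretization into bins 0..k-1."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin
    return [min(int((x - lo) / width), k - 1) for x in xs]

print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
print(equal_width_bins([1, 5, 9], 2))   # [0, 1, 1]
```

Equal-depth discretization would instead put (roughly) the same number of observations into each bin, and V-optimal chooses bin boundaries that minimize within-bin variance.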
Data Integration
Vertical Data Integration

Concatenation (columns do not change):

  • Unify database structures
  • Remove duplicates
Horizontal Data Integration

Join (columns change):

  • Overrepresentation of items
  • Data explosion

Chapter 3 Machine Learning

Supervised Learning && Unsupervised Learning

Supervised Learning

The learner is provided with a set of data inputs together with the corresponding desired outputs

  • Data act as a “teacher”
  • Classification & Regression
    Example:
  • teach kids to recognize different animals
  • grade examinations with correct answer provided
Unsupervised Learning

Training examples as input patterns, with no associated output

  • no “teacher”
  • Clustering
  • similarity measure exists to detect groupings/ clusterings
    Main differences: supervised learning uses labeled input and output data, while unsupervised learning does not; unsupervised learning has no “teacher”.
Classification and Regression (Supervised Learning)
Regression Problem

The target variable that we’re trying to predict is continuous, e.g. living areas and prices.

Classification problem

The target variable can take on only a small number of discrete values, e.g. insurance.

Linear Regression

Given a training set, to learn a function (hypothesis/model) f: X ⟼ Y, so that f(x) is a “good” predictor for the corresponding value of y.
$f(x) = \theta_0 + \theta_1 x$

  • The model is linear in terms of the parameters $\theta_0$ and $\theta_1$.
  • Linear regression with one variable (univariate linear regression).
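For the univariate case, the least-squares parameters have a closed form; a minimal sketch (the function name is illustrative):

```python
def fit_linear(xs, ys):
    """Least-squares fit of f(x) = theta0 + theta1 * x (univariate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    theta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
             / sum((x - mx) ** 2 for x in xs)
    # Intercept makes the line pass through the point of means
    theta0 = my - theta1 * mx
    return theta0, theta1

# Perfectly linear data recovers the generating parameters of y = 1 + 2x
theta0, theta1 = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(theta0, theta1)  # 1.0 2.0
```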
Linear Regression Evaluation
  • Mean absolute error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|y_i - f(x_i)|$
  • Mean squared error (MSE): $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$
  • Root mean squared error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2}$
  • R-squared 5: $1 - \frac{\sum_{i=1}^{n}(y_i - f(x_i))^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
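The four evaluation measures can be computed together; a minimal sketch assuming equal-length lists of true values and predictions (the function name is illustrative):

```python
import math

def regression_metrics(ys, preds):
    """MAE, MSE, RMSE and R-squared for predictions against true values."""
    n = len(ys)
    errs = [y - p for y, p in zip(ys, preds)]
    mae = sum(abs(e) for e in errs) / n
    mse = sum(e * e for e in errs) / n
    rmse = math.sqrt(mse)
    ybar = sum(ys) / n
    # R-squared: 1 - (residual sum of squares / total sum of squares)
    r2 = 1 - sum(e * e for e in errs) / sum((y - ybar) ** 2 for y in ys)
    return mae, mse, rmse, r2

mae, mse, rmse, r2 = regression_metrics([1, 2, 3], [1, 2, 4])
print(round(mae, 3), round(rmse, 3), r2)  # 0.333 0.577 0.5
```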
Error
  • Training error/Empirical error: the error of the learner/model on the training data
  • Generalization error: the error on the new data

Classification

Classification accuracy

The percentage of test set tuples that are correctly classified by the classifier

Confusion matrix
Consider a two-class problem and the confusion matrix below
| Class (actual) | C1 (predicted) | C2 (predicted) | Total | Accuracy |
| --- | --- | --- | --- | --- |
| C1 | true positives (TP) | false negatives (FN) | positives (P) | TP/P |
| C2 | false positives (FP) | true negatives (TN) | negatives (N) | TN/N |
| Total | predicted positives (Pp) | predicted negatives (Pn) | All | (TP+TN)/All |
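The per-class and overall accuracies from the confusion matrix can be sketched directly (the counts in the example are illustrative):

```python
def class_accuracy(correct, total):
    """Per-class accuracy, e.g. TP/P for class C1 or TN/N for class C2."""
    return correct / total

def overall_accuracy(tp, fn, fp, tn):
    """Overall accuracy (TP + TN) / All from the confusion matrix."""
    return (tp + tn) / (tp + fn + fp + tn)

print(class_accuracy(40, 50))                     # TP/P = 0.8
print(overall_accuracy(tp=40, fn=10, fp=5, tn=45))  # (40+45)/100 = 0.85
```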
Decision Tree
  • Each internal node denotes a test on an attribute
  • Each branch represents an outcome of the test
  • Each leaf node holds a class label

Clustering (Unsupervised Learning)

Discover hidden structures in unlabeled data
Clustering identifies a finite set of groups (clusters) $C_1, C_2, ..., C_k$ in the dataset such that:

  • Objects within the same cluster $C_i$ shall be as similar as possible
  • Objects of different clusters $C_i, C_j$ ($i \neq j$) shall be as dissimilar as possible
    Example:
  • Customer segmentation
  • Molecule search 6
  • Anomaly detection 7
  • Structuring large sets of text documents 8
  • Generating thematic maps from satellite images 9
Types of Clustering Approach
  • Linkage Based
    e.g. Hierarchical Clustering
  • Clustering by Partitioning
    e.g. k-Means
(Dis-)similarity Functions for Numeric Attributes
  • Minkowski-Distance ($L_p$-Metric)
  • Euclidean Distance ($L_2$, p = 2)
  • Manhattan-Distance ($L_1$, p = 1)
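All three distances are instances of one formula, so a single Python function covers them (the function name is illustrative):

```python
def minkowski(x, y, p):
    """Minkowski distance (L_p metric) between two numeric vectors:
    (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 2))  # Euclidean (p = 2): 5.0
print(minkowski(a, b, 1))  # Manhattan (p = 1): 7.0
```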

  1. Found by ordering all data points and picking out the one in the middle - or if there are two middle numbers, taking the mean of those two numbers ↩︎

  2. The middle line of the box is the median of the data. The upper and lower edges of the box are the upper and lower quartiles, so the box contains the middle 50% of the data. The height of the box partly reflects how much the data fluctuate. The lines above and below the box represent the maximum and minimum values; sometimes points “pop out” beyond them, which can be understood as outliers. ↩︎

  3. Only works for numeric columns ↩︎

  4. A predicted value based on the other attributes (inference-based, such as Bayesian or decision-tree methods) ↩︎

  5. Proportion of the variance of the dependent variable that is explained by the regression model. Normally ranges from 0 to 1; the closer to 1, the better the performance. ↩︎

  6. Find molecules with similar structure to already working ones ↩︎

  7. Find unusual patterns in data from sensors monitoring mechanical engines ↩︎

  8. hierarchical clustering of the text documents ↩︎

  9. clustering sets of raster images of the same area (feature vectors) ↩︎
