Study Plan & Record 02

Recently I have been learning Data Mining and reviewing Python fundamentals, so every blog post will be divided into two sections, one for each, covering what I learn and what I think each day. I am a native Chinese speaker, so some typos may look a little ridiculous. Sorry for that (well, nobody will read it anyway, but I will try to do my best).

Data Mining: Concepts and Techniques

Today I read Chapter 3, on data preprocessing. It basically covers four parts: data cleaning, data integration, data reduction, and data transformation. The central idea of the chapter is to make our dataset accurate (free of noisy data and values that deviate from the expectation of the attribute), complete (containing all the attributes of interest), consistent (e.g., no discrepancy in the categories of attribute values), timely, believable, and interpretable (easy to understand).

Data Cleaning:

Data cleaning is normally the first step in data mining or data analysis. It deals with missing values and noisy data, and identifies and removes outliers. Missing values can be filled in with the expectation (mean) of the attribute or with values predicted by regression on other attributes, but these methods can bias the results. For noisy data, binning and then replacing the values in each bin with the bin mean smooths the data. Outlier analysis relies on clustering, which will be explained in Chapters 8 and 9.
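The two cleaning techniques above can be sketched in a few lines of Python. The data values here are toy numbers I made up for illustration, not examples from the book:

```python
import statistics

# Toy attribute with one missing value (None) -- illustrative data only.
ages = [23, 25, None, 31, 40, 22, 35]

# Fill the missing value with the mean of the observed values
# (note: as mentioned above, this can bias the result).
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
filled = [a if a is not None else mean_age for a in ages]

def smooth_by_bin_means(values, bin_size):
    """Smooth noisy data by equal-depth binning: sort the values,
    split them into bins, and replace each bin by its mean."""
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i:i + bin_size]
        m = statistics.mean(bin_vals)
        smoothed.extend([m] * len(bin_vals))
    return smoothed

# Bins of 3: [4, 8, 15] -> 9, [21, 21, 24] -> 22, [25, 28, 34] -> 29
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```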

Data Integration:

My summary: remove redundant attributes and objects coming from different data sources. For identifying redundant attributes, the chi-square test is effective for analyzing nominal data, while the correlation coefficient is useful for numerical data, as is covariance. Moreover, for dealing with redundant objects, one method is to use a denormalized table (I have no idea about this yet; further study in the future).

Data Reduction:

Dimensionality reduction and numerosity reduction are two directions for avoiding the inefficiency of analyzing too much data, much of it belonging to non-primary attributes. Dimensionality-reduction methods include the Discrete Wavelet Transform, the Discrete Fourier Transform, Principal Component Analysis, and attribute subset selection. For numerosity reduction, methods such as regression, histograms, clustering, and sampling are helpful.
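Two of the numerosity-reduction methods above, sampling and histograms, can be sketched with the standard library alone. The data set here is just the integers 1–100, an assumption for illustration:

```python
import random

# Simple random sampling without replacement: keep a small
# representative subset instead of the full data set.
random.seed(42)                      # fixed seed so the run is repeatable
data = list(range(1, 101))
sample = random.sample(data, k=10)

def equal_width_histogram(values, n_buckets):
    """Summarize a numeric attribute as counts over equal-width buckets,
    replacing the raw values with a much smaller representation."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # clamp the maximum value into the last bucket
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts

print(sample)
print(equal_width_histogram(data, 4))   # uniform data -> [25, 25, 25, 25]
```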

Data Transformation and Data Discretization:

The idea is to put each attribute into a suitable format for analysis: smoothing, attribute construction, clustering, normalization, discretization, and concept hierarchy generation for nominal data. Take normalization as an example: transforming data with min-max normalization, z-score normalization, or decimal scaling can balance the weight that each attribute has. For data discretization, I still need time to make every concept clearer.
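The three normalization methods named above can be sketched as follows. The attribute values are toy numbers for illustration:

```python
import statistics

values = [200, 300, 400, 600, 1000]   # toy attribute values

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: v' = (v - mean) / stdev,
# giving the attribute mean 0 and standard deviation 1.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)     # population standard deviation
zscore = [(v - mu) / sigma for v in values]

# Decimal scaling: v' = v / 10^j for the smallest j with max(|v'|) < 1.
# For positive integers, j is the digit count of the largest |v|.
j = len(str(max(abs(v) for v in values)))
decimal = [v / 10 ** j for v in values]

print(minmax)    # [0.0, 0.125, 0.25, 0.5, 1.0]
```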

Python: Review

Basically, I reviewed OOP topics like __repr__(), __eq__(), and some related tricks, and wrote an MVC framework again... Tomorrow, I hope I can move on to the part on functional programming. Too many things to review!
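As a quick note-to-self on those two dunder methods, here is a minimal sketch (the `Point` class is my own toy example, not from any particular source):

```python
class Point:
    """Small value class to review __repr__ and __eq__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Unambiguous, ideally eval()-able representation for debugging.
        return f"Point({self.x!r}, {self.y!r})"

    def __eq__(self, other):
        # Value equality instead of the default identity comparison.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Defining __eq__ sets __hash__ to None on the class, so we
        # restore it to keep instances usable in sets and dict keys.
        return hash((self.x, self.y))

print(Point(1, 2) == Point(1, 2))   # True
print(repr(Point(1, 2)))            # Point(1, 2)
```

One trick worth remembering: returning `NotImplemented` (rather than `False`) for unknown types lets Python try the other operand's `__eq__` as well.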


Moving on! Need to go to bed, bye!

TO BE CONTINUED...
