Recently I have been learning Data Mining and reviewing Python fundamentals, so each blog post will be divided into two sections, one for each topic, covering what I learn and what I think every day. I am a native Chinese speaker, so some typos may look a little ridiculous. Sorry for that (well, nobody will read this anyway! But I will try my best).
Data Mining: Concepts and Techniques
Today I read Chapter 3, on data preprocessing. It basically covers four parts: data cleaning, data integration, data reduction, and data transformation. The central idea of the chapter is to make our dataset accurate (excluding noisy data and values that deviate from the attribute's expected range), complete (containing all attributes of interest), consistent (e.g., no discrepancies among attribute value categories), timely, believable, and interpretable (easy to understand).
Data Cleaning:
Data cleaning is normally the first step in data mining or data analysis. It handles missing values and noisy data, and identifies and removes outliers. Missing values can be filled in with the attribute mean or with values predicted by regression on other attributes, although both approaches bias the result. For noisy data, one approach is binning: replace the values in each bin with the bin mean to smooth the data. Outlier analysis relies on clustering, which will be explained in Chapters 8 and 9.
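The two cleaning tricks above can be sketched in a few lines of pandas. The numbers here are made up for illustration; note how mean-filling pulls the missing value toward the center, which is exactly the bias the book warns about.

```python
import numpy as np
import pandas as pd

# Toy "age" attribute with one missing value and one suspicious outlier.
ages = pd.Series([22, 25, np.nan, 27, 95, 24, 26])

# Fill the missing value with the attribute mean (simple, but biased).
filled = ages.fillna(ages.mean())

# Equal-depth binning, then smooth each value by its bin mean.
bins = pd.qcut(filled, q=3)
smoothed = filled.groupby(bins).transform("mean")
```

After smoothing, every value in a bin collapses to that bin's mean, which dampens noise but also hides detail, so bin depth is a trade-off.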
Data Integration:
My summary: remove redundant attributes and redundant objects coming from different data sources. To identify redundant attributes, the chi-square test is effective for nominal data, while the correlation coefficient is useful for numerical data, as is covariance. As for redundant objects, one cause is the use of denormalized tables (I have no idea about this yet; further study in the future).
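These two redundancy checks can be done with plain NumPy. The contingency table below is a hypothetical count of two nominal attributes; the numeric pair is constructed to be perfectly correlated, i.e., one attribute is redundant given the other.

```python
import numpy as np

# Hypothetical contingency table for two nominal attributes (made-up counts).
observed = np.array([[250, 200],
                     [50, 1000]])

# Chi-square statistic by hand: sum of (observed - expected)^2 / expected.
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()

# For numeric attributes, the Pearson correlation coefficient flags redundancy.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                      # y is a linear function of x
r = np.corrcoef(x, y)[0, 1]        # r == 1: y adds no new information
```

A large chi-square statistic (relative to the critical value for the table's degrees of freedom) means the two nominal attributes are dependent; |r| close to 1 means one numeric attribute can be dropped.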
Data Reduction:
Dimensionality reduction and numerosity reduction are two directions for avoiding the inefficiency of analyzing too much data, much of which lies in non-primary attributes. Dimensionality-reduction methods include the Discrete Wavelet Transform, the Discrete Fourier Transform, Principal Component Analysis, and attribute subset selection. For numerosity reduction, methods such as regression, histograms, clustering, and sampling are helpful.
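As a concrete taste of PCA, here is a minimal sketch using NumPy's SVD on synthetic data where most of the variance lies along one direction, so a single principal component is enough. The data and the choice of k=1 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 points in 3-D whose first two coordinates are collinear,
# plus a little noise in the third coordinate.
base = rng.normal(size=(100, 1))
X = np.hstack([base, 2 * base, 0.1 * rng.normal(size=(100, 1))])

# PCA by hand: center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the top k principal components -- the reduced representation.
k = 1
X_reduced = Xc @ Vt[:k].T                 # shape (100, 1)

# Fraction of total variance explained by the kept component.
explained = s[:k] ** 2 / (s ** 2).sum()
```

Because the first component captures nearly all the variance here, projecting 3-D points down to 1-D loses almost nothing, which is the whole point of dimensionality reduction.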
Data Transformation and Data Discretization:
The idea is to put each attribute into a format suitable for analysis: smoothing, attribute construction, clustering, normalization, discretization, and concept hierarchy generation for nominal data. Take normalization as an example: data can be transformed by min-max normalization, z-score normalization, or decimal scaling, which balances the weight each attribute carries. For data discretization, I still need time to make every concept clearer.
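The three normalization methods are short enough to write out directly. The salary values below are invented for illustration.

```python
import numpy as np

# Made-up salary attribute (in thousands).
v = np.array([30.0, 45.0, 60.0, 75.0, 98.0])

# Min-max normalization to [0, 1]: (v - min) / (max - min).
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: (v - mean) / std, giving mean 0 and std 1.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|v'|) < 1.  Here j = 2, so we divide by 100.
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal = v / 10 ** j
```

Min-max is sensitive to outliers (a single extreme value squeezes everything else), z-score is not, and decimal scaling only shifts the decimal point, so the choice depends on the data.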
Python: Review
Basically, I reviewed OOP topics like __repr__(), __eq__(), and some related tricks, and wrote an MVC framework again... Tomorrow, I hope to move on to the part on functional programming. Too many things need review!
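For my own reference, here is a toy class showing the __repr__()/__eq__() pair I reviewed (the Point class is just an example, not from any framework); defining __eq__ without __hash__ would make instances unhashable, so the two usually travel together.

```python
class Point:
    """Tiny class to review __repr__ and __eq__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Unambiguous, ideally eval()-able representation for debugging.
        return f"Point(x={self.x!r}, y={self.y!r})"

    def __eq__(self, other):
        # Compare by value, not identity; returning NotImplemented lets
        # Python fall back to the reflected operation on `other`.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Keep Point usable in sets and as dict keys after defining __eq__.
        return hash((self.x, self.y))
```

With this, two points with equal coordinates compare equal and deduplicate correctly in a set, and printing one in a REPL shows exactly how to reconstruct it.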
Move on! Need to go to bed, bye!
TO BE CONTINUED...