As the title says, the exam is tomorrow morning (it's almost midnight, so it may well be this morning). I have just finished reviewing all the PPTs and writing up the two known open questions:
In this term, we have learned a course which is called Machine Learning.
Please answer the following questions:
- What chapters that we have introduced in this course and what are main points of these chapters, respectively?
- If let you design a machine learning system to classify boys and girls in the campus of the School of Software, please give the scheme on your understanding. This is an open problem, I hope you use as much knowledge that you learned in this course as possible and try to show your original opinions in your scheme. At least, you should consider about sensing, feature definition and extraction, and classifier.
Please note that you should answer these questions in English. Also, you should write your student number and name on the answer sheet.
Special notes:
a. It is somewhat like a course paper, written in English
b. Every student must prepare independently: all identical answer sheets will be scored 0
c. Regular study performance and exam results
The following are my personal answers, for reference only; note the points emphasized above.
(Since this first draft was written to check my own pre-exam recitation: a) any spelling or grammar errors will be corrected later; b) it contains no numbers or mathematical symbols)
1.What chapters that we have introduced in this course and what are main points of these chapters, respectively?
*************************************************************************************************************************************************
Chapter 1 Introduction to machine learning
This chapter introduces some basic concepts of machine learning, such as the decision function, training and test sets, supervised and unsupervised learning, learners such as K-NN, decision trees and SVM, and machine learning paradigms such as ensemble learning and deep learning.
Chapter 2 Main concepts
This chapter consists of two parts.
The first part describes an example machine learning system for classifying types of fish, which is divided into five stages: sensing, segmentation, feature extraction, classification and post-processing. The system is designed according to the design cycle of a machine learning system: data collection, feature definition, feature selection and feature extraction, model choice, training and evaluation.
The second part describes many important concepts: empirical error and overfitting, evaluation methods, performance measures, comparison tests, and bias and variance.
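The performance measures named above can be sketched in a few lines. This is a minimal illustration with made-up labels, not anything from the course slides:

```python
# Accuracy, precision and recall computed from confusion-matrix counts.
# The toy label lists below are invented purely for illustration.

def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / len(y_true)   # 4/6: fraction of correct predictions
precision = tp / (tp + fp)            # 2/3: of predicted positives, how many are real
recall    = tp / (tp + fn)            # 2/3: of real positives, how many were found
```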
Chapter 3 Bayes decision and conditional probability distribution estimation
This chapter introduces a variety of decision methods based on Bayes decision theory and discusses how to estimate the class-conditional probability distribution and the prior probability from a limited number of samples, as well as a solution called maximum likelihood estimation.
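As a hedged sketch of the two ideas above: for a univariate Gaussian class-conditional density, the MLE of the mean is the sample mean and the MLE of the variance is the biased sample variance; a Bayes decision then picks the class maximizing prior times likelihood. The classes, priors and samples below are invented:

```python
# MLE for a univariate Gaussian, followed by a minimal Bayes decision.
import math

def gaussian_mle(samples):
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n   # MLE divides by n, not n-1
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = gaussian_mle([1.0, 2.0, 3.0, 4.0])   # mu = 2.5, var = 1.25

# Bayes decision: choose the class with the largest prior * likelihood.
priors = {'A': 0.5, 'B': 0.5}                  # hypothetical priors
params = {'A': (0.0, 1.0), 'B': (4.0, 1.0)}    # hypothetical (mean, variance)
x = 1.0
decision = max(priors, key=lambda c: priors[c] * gaussian_pdf(x, *params[c]))
# x = 1.0 lies closer to class A's mean, so the decision is 'A'
```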
Chapter 4 Feature selection and Feature extraction
This chapter introduces features, covering feature selection, feature extraction and feature learning. Feature selection has three common strategies: filter, wrapper and embedding. Feature extraction includes linear methods such as PCA and LDA, and nonlinear methods such as kernel PCA. Deep learning can also be used for feature learning.
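The filter strategy scores each feature independently of any learner. One possible criterion (my choice here, not necessarily the one from the slides) is absolute Pearson correlation with the label; the tiny dataset is synthetic:

```python
# Filter-style feature selection: rank features by |Pearson correlation|
# with the label and keep the top-scoring one. Data is made up.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# rows: samples; column 0 tracks the label, column 1 is noise-like
X = [[1.0, 0.3], [2.0, 0.1], [3.0, 0.4], [4.0, 0.2]]
y = [0, 0, 1, 1]

scores = [abs(pearson([row[j] for row in X], y)) for j in range(2)]
best = max(range(2), key=lambda j: scores[j])   # feature 0 wins
```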
Chapter 5 Decision tree
This chapter introduces a classical classifier, the decision tree, including its definition and some concepts related to its generation algorithms: entropy and information gain for ID3, and gain ratio for C4.5.
This chapter also introduces three common problems with decision trees: overfitting (solved by pre-pruning or post-pruning), continuous attributes (solved by finding a threshold) and missing values (solved by weighting the attributes).
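The ID3 quantities can be sketched directly: entropy of a label set, and the information gain of splitting on one attribute. The four-row toy dataset is invented:

```python
# Entropy and information gain as used by ID3 to pick split attributes.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):            # subtract weighted child entropies
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = ['yes', 'yes', 'no', 'no']
attr   = ['a', 'a', 'b', 'b']             # perfectly predicts the label
print(information_gain(attr, labels))     # 1.0: the split removes all uncertainty
```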
Chapter 6 SVM
In this chapter, the support vector machine (SVM) is introduced, whose essence is to maximize the margin. The chapter then introduces the Lagrangian dual of the primal margin-maximization problem, including the KKT conditions and the SMO (sequential minimal optimization) algorithm.
It also introduces the kernel SVM for linearly inseparable datasets, the soft margin (with its dual problem) for reducing overfitting, and support vector regression, which seeks a regression plane minimizing the distances from all data points in a set to that plane.
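SMO itself is too involved for a short note, so here is a hedged substitute that illustrates the soft-margin objective (hinge loss plus L2 regularization) by plain subgradient descent on a tiny, well-separated synthetic set. The data, learning rate and epoch count are all made-up choices:

```python
# Subgradient descent on lam/2*||w||^2 + mean hinge loss, a simple
# (non-SMO) way to fit a linear soft-margin SVM on toy 2-D data.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (w[0] * xi[0] + w[1] * xi[1] + b)
            if margin < 1:   # inside the margin: hinge loss is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: only the regularizer acts
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

X = [(2.0, 2.0), (3.0, 2.5), (-2.0, -2.0), (-3.0, -2.5)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
preds = [1 if w[0] * x0 + w[1] * x1 + b > 0 else -1 for x0, x1 in X]
# on this well-separated toy set the sketch classifies all points correctly
```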
Chapter 7 K-NN
This chapter introduces another classical classifier, K nearest neighbors (K-NN), including its background and definition. The main points of K-NN are distance measures, such as Euclidean distance, Minkowski distance, cosine similarity and Hamming distance, and the selection of K, for example an odd number chosen by cross-validation or smaller than the square root of the number of samples.
This chapter also introduces some issues to be noted: feature normalization, feature weighting, non-numerical feature values and sample weighting. Finally, it introduces the advantages and disadvantages of K-NN, and the K-dimensional (k-d) tree for hyperspace search and nearest-neighbor search.
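A minimal K-NN sketch using the Euclidean distance mentioned above, with K odd to avoid ties; the two 2-D clusters are invented:

```python
# K-NN: vote among the k training points closest to the query.
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)
    )
    votes = Counter(yi for _, yi in dists[:k])
    return votes.most_common(1)[0][0]

X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(X_train, y_train, (0.5, 0.5)))  # 'A'
print(knn_predict(X_train, y_train, (5.5, 5.5)))  # 'B'
```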
Chapter 8 Neural networks and deep learning
This chapter introduces artificial neural networks, including the M-P neuron model, activation functions, the perceptron, multi-layer networks and the back-propagation algorithm, as well as deep learning, including a classical model, the CNN (convolutional neural network), and an example of handwritten digit recognition.
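The perceptron learning rule from this chapter can be shown on the linearly separable AND function, a classic toy example (learning rate and epoch count are my own choices):

```python
# Perceptron rule: w += lr * (target - output) * x, trained on AND.

def train_perceptron(samples, lr=0.1, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b    += lr * err
    return w, b

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(samples)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in samples]
print(preds)  # [0, 0, 0, 1]
```

A single perceptron suffices here because AND is linearly separable; XOR would need the multi-layer networks the chapter covers next.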
Chapter 9 Semi-supervised learning
This chapter introduces semi-supervised learning, including its definition involving unlabeled data, and several semi-supervised methods: the generative method, which is simple and easy but performs well only if the model assumption is correct; disagreement-based methods, including self-training, co-training and tri-training; semi-supervised SVM; and graph-based semi-supervised learning, which has clear concepts but requires a lot of storage and cannot predict new samples.
Chapter 10 Ensemble learning
This chapter introduces ensemble learning, including its definition as a machine learning paradigm that combines multiple individual learners, and two ensemble methods: bagging and boosting. It then introduces fusion strategies for combining the base learners (averaging, voting and stacking) and finally how to measure the diversity of base learners in an ensemble.
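The voting fusion strategy is the easiest to sketch: take the mode of the base learners' hard predictions per sample. The three "learners" here are just hypothetical fixed prediction lists:

```python
# Majority-vote fusion of several base learners' predictions.
from collections import Counter

def majority_vote(predictions_per_learner):
    # zip(*...) transposes to one tuple of votes per sample; take the mode
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_learner)]

learner1 = [1, 0, 1, 1]
learner2 = [1, 1, 0, 1]
learner3 = [0, 1, 1, 1]
print(majority_vote([learner1, learner2, learner3]))  # [1, 1, 1, 1]
```

Note that the ensemble is right on samples where each individual learner errs once, which is exactly the diversity effect the chapter measures.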
Chapter 11 Linear regression
This chapter introduces regression, which differs from classification in that its output is continuous; it estimates the relationship between input and output.
It mainly introduces linear regression models, including the basic linear regression model based on the least squares method, ridge regression, which addresses the least squares method's sensitivity to noise, and lasso regression, which more easily yields sparse solutions.
Finally, this chapter analyzes how to use a regression model for classification, which involves the logistic function and maximum likelihood estimation.
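Both ingredients can be sketched briefly: the closed-form least squares solution for a one-variable linear model, and the logistic function that squashes a linear score into a probability. The data points are synthetic and lie exactly on a line:

```python
# Closed-form simple linear regression plus the logistic function.
import math

def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                 # exactly y = 2x + 1
slope, intercept = least_squares(xs, ys)  # (2.0, 1.0)
print(logistic(0.0))                      # 0.5: the decision boundary
```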
*************************************************************************************************************************************************
2.If let you design a machine learning system to classify boys and girls in the campus of the School of Software, please give the scheme on your understanding. This is an open problem, I hope you use as much knowledge that you learned in this course as possible and try to show your original opinions in your scheme. At least, you should consider about sensing, feature definition and extraction, and classifier.
*************************************************************************************************************************************************
A machine learning system to classify students' sex
Our goal is to classify students by sex. We assume that the collected data contains no information directly related to sex; otherwise the task would be trivial. The following subsections follow the design cycle of a machine learning system: data collection, feature definition, feature selection and feature extraction, model choice, training and evaluation.
(In the following detailed description I use only traditional and intuitive methods, because over the past few years I have implemented ML systems based on them for other courses and for my research. Some pioneering ideas yet to be realized are put forward at the end of this article. Due to the time limit of the examination, I may not be able to describe everything in detail.)
Firstly, let's discuss the data format. Considering the cost in time and equipment, there are three common types: text, audio and image. For each type I design a corresponding machine learning system and describe its structure, its evaluation and some details, including missing values, pruning and post-processing, which I mention now in passing and shall return to later.
Text data, like a survey, includes float values such as "height" and "weight", boolean values such as "whether or not you like shopping", and enum values (equivalent to integer values in a sense) such as "which color do you like".
Audio data is particularly easy to collect, requiring only a microphone and the time to record one sentence. Audio data means sound signals, which contain much important information, including amplitude, frequency, duration and so on. These can be used as features and extracted by classical algorithms such as the Fourier transform and high-pass or low-pass filters. I used an open toolkit called GNURadio and called its APIs from Python to process the audio data.
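The frequency-extraction step can be illustrated without any toolkit: a naive DFT (O(n^2), fine for a toy signal, where real work would use an FFT library) locating the dominant frequency of a synthesized 5 Hz sine sampled at 100 Hz. The signal and rates are my own choices:

```python
# Naive DFT: find the dominant frequency of a synthetic sine wave.
import cmath
import math

def dft_magnitudes(signal):
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal)))
            for k in range(n // 2)]   # keep only non-negative frequencies

fs = 100                              # sampling rate in Hz
signal = [math.sin(2 * math.pi * 5 * i / fs) for i in range(fs)]
mags = dft_magnitudes(signal)
# with n = fs samples, bin k corresponds to k Hz
dominant_hz = max(range(len(mags)), key=lambda k: mags[k])
print(dominant_hz)  # 5
```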
Image data is even easier to get, especially with so many face recognition devices on our campus. We could even export students' pictures from the database of the student management system (only if I have official permission :D)
Time seems to be running out, so I'll describe things as briefly as possible.
Secondly, let's consider the features. After pre-processing, text and audio data have become similar: their features are artificially designed and clear. I used random forests for feature selection, which offers two approaches: mean decrease in impurity and mean decrease in accuracy. The results are good enough without the feature extraction commonly used in image processing and NLP. For image data there are two solutions: a) features can be extracted by classical image processing algorithms such as SIFT or HOG, after which the processing resembles that of text and audio data; b) the images can be fed directly into a CNN (convolutional neural network).
Thirdly, let's choose the models. Text, audio and even image data can all be reduced to a simple 0/1 classification. There are many models to choose from for this problem, such as decision trees, SVM and regression. The details are omitted.
Fourthly, let's consider training and evaluation. I use K-fold cross-validation (with K set to 10) to divide the data into training and test sets, and I tune the models' parameters by measuring their performance, including accuracy and recall.
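The splitting logic behind K-fold cross-validation fits in a few lines: each fold serves once as the test set while the remaining folds form the training set. Shown with K=5 on 10 indices to keep it small (the text above uses K=10 on real data):

```python
# Generate K train/test index splits for K-fold cross-validation.

def k_fold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(10, 5):
    pass  # fit on `train`, evaluate on `test`, then average the K scores
```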
Fifthly, let's focus on some details. The first is missing values: how they are handled depends on the specific model; for example, missing values in a decision tree are handled by computing the entropy of weighted nodes. The second is overfitting: we can reduce overfitting in decision trees by pruning, both pre-pruning and post-pruning. Then, to improve accuracy, I use bagging, represented by random forests, followed by boosting with multiple learners, including SVM, linear regression and random forests. The final result is obtained as the weighted result of each classifier.
Last but not least, on extending the text data. Given the time cost of collecting surveys and problems like missing values, I would rather collect unstructured text data, for example students' comments in the teaching evaluation system or the posts students make on social media. In this way the original problem becomes an NLP problem. The following are my thoughts, which are partially implemented. First, we segment sentences into sets of single words (jieba for Chinese, NLTK for English) and remove useless words such as prepositions. Second, we use TF-IDF to extract the key words. Third, we use word embeddings to convert each word into a fixed-length vector convenient for mathematical processing. Then we can use neural networks such as RNNs, LSTMs and other LSTM variants to compute distances between sentences. Finally, we can identify the sex of a student by comparing the average distance between that student's sentences and all the boys' or girls' sentences. Note that the training data for an RNN or LSTM does not need labels, so we can improve the network's learning ability by collecting large amounts of unlabeled text.
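The TF-IDF step in the pipeline above can be sketched on a toy corpus of already-tokenized "documents" (in practice jieba or NLTK would produce the tokens). This uses the common tf × log(N/df) weighting; the three documents are invented:

```python
# TF-IDF weights per document: term frequency times inverse document frequency.
import math
from collections import Counter

def tf_idf(docs):
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf})
    return weights

docs = [["i", "like", "shopping"],
        ["i", "like", "basketball"],
        ["i", "watch", "basketball"]]
w = tf_idf(docs)
# "i" appears in every document, so its idf (and weight) is 0;
# "shopping" appears in only one, so it gets the highest weight in doc 0.
```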
Time's up, that's all. Thanks for reading.
*************************************************************************************************************************************************