Abstract
Searching for an optimal feature subset in a high-dimensional feature space is known to be an NP-complete problem. We present a hybrid algorithm, SAGA, for this task. SAGA combines the ability of simulated annealing to avoid becoming trapped in a local minimum, the fast convergence of the genetic-algorithm crossover operator, the strong local search ability of greedy algorithms, and the computational efficiency of generalized regression neural networks. We compare the performance over time of SAGA and well-known algorithms on synthetic and real datasets. The results show that SAGA outperforms existing algorithms.
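For a concrete picture of how such a hybrid can be wired together, the following is a minimal, purely illustrative Python sketch: a population of feature subsets is perturbed with simulated-annealing acceptance and recombined with a GA-style crossover. The fitness function, cooling schedule and population mechanics are assumptions for illustration; SAGA itself evaluates subsets with a generalized regression neural network and also includes a greedy local-search stage not shown here.

```python
# Illustrative SA + GA-crossover feature-subset search (not SAGA itself).
import math
import random

def fitness(subset, X, y):
    # Placeholder wrapper score: SAGA uses a generalized regression neural
    # network here; any cheap estimator of subset quality can stand in.
    if not subset:
        return 0.0
    cols = sorted(subset)
    score = sum(abs(sum(x[j] * yi for x, yi in zip(X, y))) for j in cols)
    return score / len(cols)  # per-feature average discourages padding with weak features

def mutate(subset, n_features):
    s = set(subset)
    s.symmetric_difference_update({random.randrange(n_features)})  # flip one feature in/out
    return s

def crossover(a, b):
    union = list(a | b)
    k = max(1, (len(a) + len(b)) // 2)
    return set(random.sample(union, min(k, len(union))))

def saga_like_search(X, y, n_features, pop_size=10, iters=200, t0=1.0, cooling=0.99):
    pop = [set(random.sample(range(n_features), max(1, n_features // 4)))
           for _ in range(pop_size)]
    temp = t0
    best = max(pop, key=lambda s: fitness(s, X, y))
    for _ in range(iters):
        # Simulated-annealing step: accept worse neighbours with prob exp(dE/T).
        for i, s in enumerate(pop):
            cand = mutate(s, n_features)
            d = fitness(cand, X, y) - fitness(s, X, y)
            if d > 0 or random.random() < math.exp(d / max(temp, 1e-9)):
                pop[i] = cand
        # GA step: crossover of the two best subsets replaces the worst one.
        pop.sort(key=lambda s: fitness(s, X, y))
        pop[0] = crossover(pop[-1], pop[-2])
        best = max([best] + pop, key=lambda s: fitness(s, X, y))
        temp *= cooling
    return best
```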
Feature selection with a measure of deviations from Poisson in text categorization
Available online 28 August 2008.
Abstract
To improve the performance of automatic text classification, it is desirable to reduce the high dimensionality of the feature space. In this paper, we propose a new measure for selecting features, which estimates term importance based on how far the probability distribution of each term deviates from the standard Poisson distribution. In the information retrieval literature, the deviation from Poisson has been used as a measure for weighting keywords, and this motivates us to adopt it as a measure for feature selection in text classification tasks. The proposed measure is constructed to have the same computational complexity as other standard measures used for feature selection. To test the effectiveness of our method, we conducted evaluation experiments on the Reuters-21578 corpus with support vector machine and k-NN classifiers. In the experiments, we performed binary classifications to determine whether each test document belongs to a given target category. Each of the top 10 categories of Reuters-21578 was used as a target category because they provide sufficient numbers of training and test documents. Four measures were used for feature selection: information gain (IG), the χ2-statistic, the Gini index, and the proposed measure. Both the proposed measure and the Gini index proved to be better than IG and the χ2-statistic in terms of macro-averaged and micro-averaged F1, especially at higher vocabulary reduction levels.
Keywords: Text categorization; Feature selection; Poisson distribution; Support vector machine; k-NN classifier
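One common formulation of the deviation-from-Poisson idea, sketched below under the assumption that it is what the measure builds on, compares a term's observed document frequency with the document frequency a Poisson (random occurrence) model would predict from the same collection frequency; content-bearing terms cluster in far fewer documents than the model expects. The paper's exact scoring function may differ in detail.

```python
# Hedged sketch of a deviation-from-Poisson term score.
import math

def poisson_deviation_scores(term_doc_counts, n_docs):
    """term_doc_counts: {term: list of per-document occurrence counts, length n_docs}."""
    scores = {}
    for term, counts in term_doc_counts.items():
        cf = sum(counts)                       # collection frequency
        df = sum(1 for c in counts if c > 0)   # observed document frequency
        lam = cf / n_docs                      # Poisson rate per document
        expected_df = n_docs * (1.0 - math.exp(-lam))
        # Content-bearing terms occur in fewer documents than the Poisson
        # expectation; a larger gap yields a higher score.
        scores[term] = (expected_df - df) / max(expected_df, 1e-12)
    return scores

# Toy usage: "nasa" clusters in 2 of 6 documents, "the" spreads over all of them.
docs = {
    "nasa": [5, 4, 0, 0, 0, 0],
    "the":  [2, 1, 2, 1, 1, 2],
}
print(poisson_deviation_scores(docs, n_docs=6))
```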
Article Outline
1. Introduction
2. Poisson distribution in information retrieval
3. Application of deviation from Poisson to feature selection
4. Experimental setup
4.1. Data collection
4.2. Feature selection
4.3. Document representation
4.4. Classifiers
4.5. Performance measure
5. Results and discussion
5.1. SVM performance
5.2. k-NN performance
5.3. Scalability
6. Conclusion
Acknowledgements
References
Abstract
As an important preprocessing technique in text classification, feature selection can improve the scalability, efficiency and accuracy of a text classifier. In general, a good feature selection method should take both domain and algorithm characteristics into account. Because the Naïve Bayesian classifier is simple, efficient and highly sensitive to feature selection, research on feature selection designed specifically for it is significant. This paper presents two feature evaluation metrics for the Naïve Bayesian classifier applied to multi-class text datasets: Multi-class Odds Ratio (MOR) and Class Discriminating Measure (CDM). Text classification experiments with Naïve Bayesian classifiers were carried out on two multi-class text collections. The results indicate that CDM and MOR select features noticeably better than other feature selection approaches.
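A hedged sketch of how such metrics can be computed is shown below, assuming MOR sums an absolute log-odds-ratio term over all classes and CDM sums the absolute log-ratio of class-conditional term probabilities, with probabilities estimated from document frequencies under Laplace smoothing. The estimation details are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Sketch of MOR and CDM as per-class sums of log-odds-style terms.
import math

def mor_cdm(df_per_class, docs_per_class):
    """df_per_class[c] = documents of class c containing the term;
       docs_per_class[c] = total documents of class c."""
    classes = list(docs_per_class)
    n_total = sum(docs_per_class.values())
    df_total = sum(df_per_class.get(c, 0) for c in classes)
    mor, cdm = 0.0, 0.0
    for c in classes:
        # P(w | c) and P(w | not-c), Laplace-smoothed document-frequency estimates.
        p = (df_per_class.get(c, 0) + 1) / (docs_per_class[c] + 2)
        q = (df_total - df_per_class.get(c, 0) + 1) / (n_total - docs_per_class[c] + 2)
        cdm += abs(math.log(p / q))
        mor += abs(math.log((p * (1 - q)) / ((1 - p) * q)))
    return mor, cdm

# Toy usage: a term concentrated in the "sports" class scores highly on both.
print(mor_cdm({"sports": 40, "politics": 2}, {"sports": 100, "politics": 100}))
```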
Article Outline
1. Introduction
2. Feature evaluation metrics for Naïve Bayes classifiers
2.1. The MOR metric
2.2. The CDM metric
3. Naïve Bayesian classifiers used on text data
4. Experiments
4.1. Data collections and performance setting
4.2. Experimental results and analyses
5. Conclusion
Acknowledgements
References
Text feature selection using ant colony optimization
Abstract
Feature selection and feature extraction are the most important steps in classification systems. Feature selection is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which would otherwise be impossible to process further. Text categorization is one problem in which feature selection is essential: a major difficulty of text categorization is the high dimensionality of the feature space, so feature selection is its most important step. Many methods currently exist for text feature selection. To improve the performance of text categorization, we present a novel feature selection algorithm based on ant colony optimization. Ant colony optimization is inspired by the observation of real ants searching for the shortest paths to food sources. The proposed algorithm is easy to implement and, because it uses a simple classifier, its computational complexity is very low. Its performance is compared with that of a genetic algorithm, information gain and CHI on the task of feature selection on the Reuters-21578 dataset. Simulation results on Reuters-21578 show the superiority of the proposed algorithm.
Keywords: Feature selection; Ant colony optimization; Genetic algorithm; Text categorization
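The sketch below illustrates the general ACO feature selection recipe the outline describes (graph of features, heuristic desirability, pheromone update, solution construction). The `evaluate` callable stands in for the paper's simple classifier (e.g., returning held-out accuracy of a candidate subset), `heuristic` for a per-feature desirability such as information gain, and all parameter values are assumptions rather than the authors' settings.

```python
# Illustrative ACO-style feature selection: ants build subsets node by node,
# biased by pheromone and heuristic desirability; pheromone is reinforced on
# the best subset of each iteration.
import random

def aco_feature_selection(n_features, evaluate, heuristic,
                          n_ants=10, subset_size=20, n_iters=30,
                          alpha=1.0, beta=1.0, rho=0.1):
    """evaluate(subset) -> score in [0, 1]; heuristic[j] > 0 for every feature."""
    tau = [1.0] * n_features                      # pheromone on each feature node
    best_subset, best_score = set(), float("-inf")
    for _ in range(n_iters):
        trials = []
        for _ in range(n_ants):
            chosen = set()
            while len(chosen) < min(subset_size, n_features):
                # Transition rule: pheromone^alpha * desirability^beta;
                # already-chosen features get zero probability.
                weights = [0.0 if j in chosen
                           else (tau[j] ** alpha) * (heuristic[j] ** beta)
                           for j in range(n_features)]
                chosen.add(random.choices(range(n_features), weights=weights)[0])
            trials.append((evaluate(chosen), chosen))
        tau = [(1.0 - rho) * t for t in tau]      # evaporation
        iter_score, iter_best = max(trials, key=lambda t: t[0])
        for j in iter_best:                       # reinforce the best ant's subset
            tau[j] += iter_score
        if iter_score > best_score:
            best_score, best_subset = iter_score, iter_best
    return best_subset, best_score
```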
Article Outline
1. Introduction
2. Feature selection approaches
3. Ant colony optimization (ACO)
3.1. Ant colony optimization for feature selection
3.1.1. Graph representation
3.1.2. Heuristic desirability
3.1.3. Pheromone update rule
3.1.4. Solution construction
4. Proposed feature selection algorithm
5. Genetic algorithm (GA)
5.1. Genetic algorithm for feature selection
6. Statistical approaches
6.1. Information gain (IG)
6.2. χ2-statistic (CHI)
7. Experimental results
7.1. Dataset
7.2. Feature extraction
7.3. Performance measure
7.4. Results
8. Conclusion
Acknowledgements
References
A novel ACO–GA hybrid algorithm for feature selection in protein function prediction
Abstract
Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem with protein datasets, one that increases the complexity of classification models, is their large number of features. Feature selection (FS) techniques are used to deal with this high-dimensional feature space. In this paper, we propose a novel feature selection algorithm that combines genetic algorithms (GA) and ant colony optimization (ACO) for faster and better search capability. The hybrid algorithm exploits the advantages of both methods. The proposed algorithm is easy to implement and, because it uses a simple classifier, its computational complexity is very low. Its performance is compared with that of two prominent population-based algorithms, ACO and genetic algorithms. Experiments are carried out on two challenging biological datasets involving the hierarchical functional classification of GPCRs and enzymes. The comparison criteria are maximizing predictive accuracy and finding the smallest subset of features. The experimental results indicate the superiority of the proposed algorithm.
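One plausible, purely hypothetical way to couple the two metaheuristics is sketched below: ant-constructed subsets are refined by GA crossover and mutation before the next pheromone update. The `construct_ant_subset` and `evaluate` callables are placeholders, and the paper's actual hybrid may interleave ACO and GA quite differently.

```python
# Hypothetical ACO + GA coupling: GA variation applied to ant-built subsets.
import random

def hybrid_generation(construct_ant_subset, evaluate, n_features,
                      n_ants=10, p_mut=0.05):
    colony = [construct_ant_subset() for _ in range(n_ants)]   # sets of feature indices
    colony.sort(key=evaluate, reverse=True)
    children = []
    for a, b in zip(colony[0::2], colony[1::2]):
        # Uniform-style crossover on the union of two parents.
        union = list(a | b)
        child = {f for f in union if random.random() < 0.5}
        # Bit-flip style mutation over the whole feature set.
        for f in range(n_features):
            if random.random() < p_mut:
                child.symmetric_difference_update({f})
        children.append(child or set(random.sample(range(n_features), 1)))
    # Keep the best subsets; these would feed the next pheromone update.
    return sorted(colony + children, key=evaluate, reverse=True)[:n_ants]
```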
Article Outline
1. Introduction
2. Protein function prediction
3. Feature selection approaches
4. Ant colony optimization
4.1. Ant colony optimization for feature selection
4.1.1. Graph representation
4.1.2. Heuristic desirability
4.1.3. Pheromone update rule
5. Genetic algorithm (GA)
5.1. Genetic algorithm for feature selection
6. Proposed ACO–GA algorithm
7. Experimental results
7.1. Datasets
7.2. Experimental methodology
7.3. Results
7.4. Discussion
8. Conclusion and future research
References
Optimal feature selection for support vector machines
Abstract
Selecting relevant features for support vector machine (SVM) classifiers is important for a variety of reasons such as generalization performance, computational efficiency, and feature interpretability. Traditional SVM approaches to feature selection typically extract features and learn SVM parameters independently. Independently performing these two steps might result in a loss of information related to the classification process. This paper proposes a convex energy-based framework to jointly perform feature selection and SVM parameter learning for linear and non-linear kernels. Experiments on various databases show significant reduction of features used while maintaining classification performance.
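The outline's section on the connection to L1-SVMs and sparsity can be illustrated, in a much simpler setting than the paper's joint framework, with an off-the-shelf L1-penalised linear SVM: the coefficients driven to zero act as implicit feature selection. This is not the authors' method, only a related baseline sketch using scikit-learn.

```python
# L1-penalised linear SVM as a simple illustration of sparsity-driven
# feature selection (not the paper's joint energy-based framework).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data: only 5 of 50 features are informative.
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])   # indices of non-zero weights
print(f"kept {selected.size} of {X.shape[1]} features:", selected)
```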
Article Outline
1. Introduction
2. Previous work
2.1. Support vector machines
2.2. Feature construction in SVM
3. SVMs and parameterized kernels
4. Learning feature weights
5. Feature weighting in feature space
6. Connection to L1-SVMs and sparsity
7. Experiments
7.1. Handwritten digit recognition
7.2. Pose classification
7.3. Eye detection
7.4. Experiments on other datasets
7.5. Software packages and training time
8. Conclusion
Acknowledgements
Appendix A. Proof of Theorem 1
Appendix B. Theorem 2
References
Vitae