Attribute Selection in Weka

According to http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code, when classifying with Weka there is usually no need to call the attribute-selection classes directly in your own code, because a meta-classifier and a filter already handle attribute selection.
Weka provides AttributeSelectedClassifier, a classifier with built-in attribute selection, together with a search class called GreedyStepwise. The classifier uses GreedyStepwise to greedily search the space of attribute subsets and evaluate them, keeping the best one found.
Below is an example using CfsSubsetEval and GreedyStepwise. The meta-classifier performs attribute selection as a preprocessing step before the data is passed to the underlying classifier:
Instances data = ...  // from somewhere
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);
// 10-fold cross-validation
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());
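The call to crossValidateModel above performs 10-fold cross-validation. As a plain-Java sketch of the fold logic only (a toy illustration with a made-up "classifier", not Weka's actual implementation):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidationSketch {
    // Hypothetical stand-in for a trained classifier: it predicts
    // correctly exactly on even-valued instances.
    static boolean predictsCorrectly(int instance) {
        return instance % 2 == 0;
    }

    // Split the data into `folds` disjoint test folds; every instance is
    // tested exactly once. Returns overall accuracy, which for equal-sized
    // folds equals the mean of the per-fold accuracies.
    static double crossValidate(List<Integer> data, int folds) {
        int correct = 0;
        for (int f = 0; f < folds; f++) {
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                if (i % folds == f) test.add(data.get(i));
                else train.add(data.get(i)); // the other 9 folds would train the model
            }
            for (int x : test) if (predictsCorrectly(x)) correct++;
        }
        return (double) correct / data.size();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        Collections.shuffle(data, new Random(1)); // mirrors new Random(1) above
        // 50 even values among 0..99, each tested once:
        System.out.println("10-fold CV accuracy: " + crossValidate(data, 10)); // prints 0.5
    }
}
```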
Notes:
(1) search.setSearchBackwards(true) enables backward search: start from the full attribute set and remove attributes step by step.
(2) 10-fold cross-validation is a common way to estimate a classifier's accuracy: the dataset is split into ten parts; in turn, nine parts are used for training and one for testing, and the mean of the ten results is taken as the accuracy estimate.
(3) On the GreedyStepwise object search, you can also call search.setThreshold(double threshold) to set the threshold by which the attribute-selection module discards attributes.
(4) Alternatively, setNumToSelect(int n) specifies the number of attributes to select from the ranked list (if generating a ranking); -1 means all attributes are retained.
(5) After calling search.search(eval, data), what exactly is the selected subset? At this point the best subset has been found, but the other candidates are not visible. Calling rankedAttributes() continues the search and returns a ranked list of all attributes. Per the Weka documentation: search must have been performed before calling this function; it completes the traversal of the search space and returns a list of attributes with their merits. Attributes are ranked by the order they are added to the subset during a forward selection search, and each merit value reflects the merit of adding that attribute to the subset; because of this, merit values may initially increase but then decrease as the best subset is "passed by" on the way to the far side of the search space.
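The "merit rises, then falls" behavior of rankedAttributes() can be pictured with a toy forward greedy search. This is only a sketch: the merit function below is a made-up additive score, whereas CfsSubsetEval actually scores subsets by feature-class and feature-feature correlations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedyStepwiseSketch {
    // Toy merit of a subset: the sum of per-attribute scores
    // (a hypothetical evaluator, not CfsSubsetEval).
    static double merit(Set<Integer> subset, double[] score) {
        double m = 0;
        for (int a : subset) m += score[a];
        return m;
    }

    // Forward greedy search that keeps going past the best subset, the way
    // rankedAttributes() completes the traversal: each attribute is ranked
    // by the order it is added, paired with the subset merit at that point.
    static List<double[]> rankAttributes(double[] score) {
        int n = score.length;
        Set<Integer> subset = new HashSet<>();
        List<double[]> ranking = new ArrayList<>(); // {attrIndex, meritWhenAdded}
        while (subset.size() < n) {
            int best = -1;
            double bestMerit = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < n; a++) {
                if (subset.contains(a)) continue;
                subset.add(a);                       // try adding attribute a
                double m = merit(subset, score);
                subset.remove(a);
                if (m > bestMerit) { bestMerit = m; best = a; }
            }
            subset.add(best);
            ranking.add(new double[]{best, bestMerit});
        }
        return ranking;
    }

    public static void main(String[] args) {
        double[] score = {3.0, 5.0, -2.0, -4.0}; // hypothetical attribute scores
        // Merits along the ranking: 5.0, 8.0, 6.0, 2.0 -- they rise to the
        // best subset {1, 0} and then fall as the search passes it by.
        for (double[] r : rankAttributes(score))
            System.out.printf("attr %d merit %.1f%n", (int) r[0], r[1]);
    }
}
```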
 
I wrote this last night without actually testing it. I verified it today: it works, but it is very time-consuming to run. The code:
       // load filtered training data from an ARFF file
       // (needs imports: java.io.File, weka.core.Instances, weka.core.converters.ArffLoader,
       //  weka.attributeSelection.AttributeSelection, weka.attributeSelection.CfsSubsetEval,
       //  weka.attributeSelection.GreedyStepwise)
       ArffLoader loader = new ArffLoader();
       loader.setFile(new File("dataFiltered.arff"));
       Instances data = loader.getDataSet();
       data.setClassIndex(0);
 
       AttributeSelection attsel = new AttributeSelection();
       CfsSubsetEval eval = new CfsSubsetEval();
       GreedyStepwise search = new GreedyStepwise();
       search.setSearchBackwards(true);
       attsel.setEvaluator(eval);
       attsel.setSearch(search);
       attsel.SelectAttributes(data);
 
       int[] attarray = attsel.selectedAttributes();
       System.out.println("the selected attributes are as follows:");
       for (int i = 0; i < attarray.length; i++) {
           // selectedAttributes() returns attribute indices, so look up each name
           System.out.println(data.attribute(attarray[i]).name());
       }
 
The example provided by Weka:
import weka.attributeSelection.*;
import weka.core.*;
import weka.core.converters.ConverterUtils.*;
import weka.classifiers.*;
import weka.classifiers.meta.*;
import weka.classifiers.trees.*;
import weka.filters.*;
 
import java.util.*;
 
/**
 * performs attribute selection using CfsSubsetEval and GreedyStepwise
 * (backwards) and trains J48 with that. Needs 3.5.5 or higher to compile.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class AttributeSelectionTest {
 
  /**
   * uses the meta-classifier
   */
  protected static void useClassifier(Instances data) throws Exception {
    System.out.println("\n1. Meta-classifier");
    AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    J48 base = new J48();
    classifier.setClassifier(base);
    classifier.setEvaluator(eval);
    classifier.setSearch(search);
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));
    System.out.println(evaluation.toSummaryString());
  }
 
  /**
   * uses the filter
   */
  protected static void useFilter(Instances data) throws Exception {
    System.out.println("\n2. Filter");
    weka.filters.supervised.attribute.AttributeSelection filter = new weka.filters.supervised.attribute.AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    filter.setEvaluator(eval);
    filter.setSearch(search);
    filter.setInputFormat(data);
    Instances newData = Filter.useFilter(data, filter);
    System.out.println(newData);
  }
 
  /**
   * uses the low level approach
   */
  protected static void useLowLevel(Instances data) throws Exception {
    System.out.println("\n3. Low-level");
    AttributeSelection attsel = new AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    attsel.setEvaluator(eval);
    attsel.setSearch(search);
    attsel.SelectAttributes(data);
    int[] indices = attsel.selectedAttributes();
    System.out.println("selected attribute indices (starting with 0):\n" + Utils.arrayToString(indices));
  }
 
  /**
   * takes a dataset as first argument
   *
   * @param args        the commandline arguments
   * @throws Exception  if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // load data
    System.out.println("\n0. Loading data");
    DataSource source = new DataSource(args[0]);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1)
      data.setClassIndex(data.numAttributes() - 1);
 
    // 1. meta-classifier
    useClassifier(data);
 
    // 2. filter
    useFilter(data);
 
    // 3. low-level
    useLowLevel(data);
  }
}


Source: <http://blog.sciencenet.cn/blog-713110-568654.html>
 