

An article on data mining and decision trees, posted by a beginner


The purpose of a decision tree classifier is to classify instances based on the values of their ordinary attributes and a class label attribute. Traditionally, the data set is single-valued and single-labeled: each record has many single-valued attributes and one single-labeled attribute (i.e. the class label attribute), and the class labels, of which there may be two or more, are mutually exclusive. Prior decision tree classifiers, such as ID3 (Quinlan, 1979, 1986), Distance-based Method (Mantaras, 1991), IC (Agrawal, Ghosh, Imielinski, Iyer, & Swami, 1992), C4.5 (Quinlan, 1993), Fuzzy ID3 (Umano et al., 1994), CART (Steinberg & Colla, 1995), SLIQ (Mehta, Agrawal, & Rissanen, 1996), SPRINT (Shafer, Agrawal, & Mehta, 1996), Rainforest (Gehrke, Ramakrishnan, & Ganti, 1998) and PUBLIC (Rastogi & Shim, 1998), all focus on this single-valued and single-labeled data.

However, real-world data can be multi-valued and multi-labeled, as shown in Table 1. Multi-valued data means that a record can have multiple values for an ordinary attribute. Multi-labeled data means that a record can belong to multiple class labels, and the class labels are not mutually exclusive. Readers may find it difficult to distinguish multi-labeled data from the two-classed or multi-classed data mentioned in some related works. To clear up this confusion, we discuss the exclusiveness among classes, the number of classes and the representation of the class label attribute in the related works as follows:

1. Exclusiveness: Each record can belong to only a single class; classes are mutually exclusive. ID3, Distance-based Method, IC, C4.5, Fuzzy ID3, CART, SLIQ, SPRINT, Rainforest and PUBLIC are such examples.

2. Number of classes: Data whose class label attribute distinguishes two types of class is called two-classed data. ID3 and C4.5 are such examples. Data whose class label attribute distinguishes more than two types of class is called multi-classed data. IC, CART and Fuzzy ID3 are such examples.

3. Label representation: Data with a single value for the class label attribute is called single-labeled data. ID3, Distance-based Method, IC, C4.5, Fuzzy ID3, CART, SLIQ, SPRINT, Rainforest and PUBLIC are such examples.

According to the discussion above, multi-valued and multi-labeled data as defined here can be regarded as non-exclusive, multi-classed and multi-labeled data.
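To make the distinction concrete, the sketch below shows one possible in-memory representation of such records. It is an illustration only: the attribute names, values, labels, and the helper functions `is_multi_valued` and `is_multi_labeled` are hypothetical and are not taken from Table 1 or from the cited works. Every attribute and the class label are stored as sets, so single-valued, single-labeled data is simply the special case in which each set has one element.

```python
# Hypothetical records: attribute names, values, and labels are made up
# for illustration and do not come from Table 1.
single_record = {
    "attributes": {"color": {"red"}, "size": {"small"}},
    "labels": {"A"},
}
multi_record = {
    "attributes": {"color": {"red", "blue"}, "size": {"small"}},
    "labels": {"A", "C"},
}


def is_multi_valued(record):
    """True if at least one ordinary attribute holds more than one value."""
    return any(len(values) > 1 for values in record["attributes"].values())


def is_multi_labeled(record):
    """True if the record belongs to more than one class label."""
    return len(record["labels"]) > 1


for name, record in [("single", single_record), ("multi", multi_record)]:
    print(name, is_multi_valued(record), is_multi_labeled(record))
```

Note that a multi-classed data set may still be single-labeled: it has more than two possible classes, but each record carries exactly one of them.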

In our previous work (Chen, Hsu, & Chou, 2003), we explained why traditional classifiers are not capable of handling multi-valued and multi-labeled data. To solve this multi-valued and multi-labeled classification problem, we designed a decision tree classifier named MMC (Chen et al., 2003). MMC differs from the traditional classifiers in several major functions: growing a decision tree, assigning labels to represent a leaf, and making a prediction for a new record. When growing a tree, MMC uses a new measure named weighted similarity to select the multi-valued attribute that partitions a node into child nodes, aiming to approach perfect grouping. To assign labels, MMC picks those whose counts are large enough to represent a leaf. To make a prediction for a new record, MMC traverses the tree as usual; when the traversal reaches several leaf nodes because the record has a multi-valued attribute, MMC takes the union of the labels of all those leaf nodes as the prediction result. Experimental results show that MMC achieves an average prediction accuracy of 62.56%.
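The prediction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MMC implementation: the `Node` class, the `predict` function, the one-child-per-attribute-value branching, and the example tree are all hypothetical. It shows only the idea of following every branch that matches a multi-valued attribute and taking the union of the labels at the leaves reached.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set


@dataclass
class Node:
    # Hypothetical tree node: a leaf has no split attribute and carries labels.
    labels: FrozenSet[str] = frozenset()
    attribute: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict(node: Node, record: Dict[str, Set[str]]) -> Set[str]:
    """Follow every branch matching the record's values; union the leaf labels."""
    if node.attribute is None:  # leaf node
        return set(node.labels)
    result: Set[str] = set()
    for value in record.get(node.attribute, set()):
        child = node.children.get(value)
        if child is not None:  # values without a branch are skipped (assumption)
            result |= predict(child, record)
    return result


# Hypothetical tree: one split on "language" with two leaves.
tree = Node(
    attribute="language",
    children={
        "en": Node(labels=frozenset({"A"})),
        "fr": Node(labels=frozenset({"B", "C"})),
    },
)

print(sorted(predict(tree, {"language": {"en", "fr"}})))
```

A record with a single value for `language` reaches one leaf and receives only that leaf's labels, which matches ordinary decision tree traversal.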

Having developed a decision tree classifier for multi-valued and multi-labeled data, this research goes a step further to improve the classifier's accuracy. Given the following over-fitting problems (Han & Kamber, 2001; Russell & Norvig, 1995) of MMC, improving its prediction accuracy seems possible. First, MMC does not guard against the data set in a node becoming too small, so it may choose attributes that are irrelevant to the class labels. Second, MMC appears to prefer the attribute that splits a node into child nodes with larger similarity among multiple labels; MMC therefore exhibits inductive bias (Gordon & Desjardins, 1995).

To minimize the over-fitting problems above, this paper proposes the following solutions: (1) Set a size constraint on the data set in each node to prevent the data set from becoming too small. (2) Consider not only the average similarity of the labels of each child node but also the average appropriateness of the labels of each child node, to reduce the bias problem of MMC. Based on these propositions, we have designed a new decision tree classifier to improve on the accuracy of MMC. The classifier, named MMDT (multi-valued and multi-labeled decision tree), can construct a multi-valued and multi-labeled decision tree as Fig. 1 shows.

The rest of the paper is organized as follows. Section 2 introduces the symbols. Section 3 describes the tree construction and data prediction algorithms. Section 4 presents the experiments. Finally, Section 5 gives a summary and conclusions.
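As a rough illustration of solution (1) above, a node-size constraint can be expressed as a simple check before a node is split. The function name `may_split` and the threshold parameter `min_records` are hypothetical; the text above does not give the exact form or value of the constraint.

```python
def may_split(node_records, min_records):
    """Allow a split only if the node holds at least `min_records` records.

    `min_records` is a hypothetical threshold; how it is chosen is not
    specified in the text above.
    """
    return len(node_records) >= min_records


# Hypothetical usage: a node with 3 records and a threshold of 5 becomes a leaf.
records_in_node = ["r1", "r2", "r3"]
if may_split(records_in_node, min_records=5):
    print("split the node further")
else:
    print("stop and make this node a leaf")
```

Stopping early in this way keeps the classifier from choosing a split attribute on the basis of only a handful of records.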



