MATLAB decision tree with categorical variables: decision trees, cross-validated predictive models, and ROC analysis plots

Decision Trees and Predictive Models with cross-validation and ROC analysis plot

Decision tree learning is a common method in data mining. Most commercial packages offer sophisticated tree classification algorithms, but they are very expensive.

This MATLAB code uses the 'classregtree' function, which implements the Gini criterion to determine the best split at each node (CART).
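To make the Gini criterion concrete, here is a minimal sketch of how a single split on one numeric variable could be scored. It is not the classregtree implementation; the toy data and the names gini, thresholds, and bestThreshold are assumptions made for illustration.

```matlab
% Toy data (made up): one numeric covariate x and a binary outcome y.
x = [1 2 3 4 5 6]';
y = [0 0 0 1 1 1]';

% Gini impurity of a label vector: 1 minus the sum of squared class proportions.
gini = @(labels) 1 - sum(arrayfun(@(c) mean(labels == c), unique(labels)).^2);

% Candidate thresholds: midpoints between consecutive distinct values of x.
xs = unique(x);                               % unique returns sorted values
thresholds = (xs(1:end-1) + xs(2:end)) / 2;

n = numel(y);
bestScore = Inf;
bestThreshold = NaN;
for k = 1:numel(thresholds)
    left  = y(x <  thresholds(k));
    right = y(x >= thresholds(k));
    % Weighted impurity of the two child nodes; lower is better.
    score = (numel(left) / n) * gini(left) + (numel(right) / n) * gini(right);
    if score < bestScore
        bestScore = score;
        bestThreshold = thresholds(k);
    end
end

fprintf('Root impurity %.3f, best threshold %.2f, child impurity %.3f\n', ...
        gini(y), bestThreshold, bestScore);
```

A CART-style algorithm repeats this search over every variable at every node and keeps the split with the lowest weighted impurity.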

The main function is named Tree. It imports data directly from an Excel or CSV file, using the first row as variable names (this row is required). The first column is the outcome group, and it must be numeric.

To start the classification tree, type in the MATLAB command window: Tree('filename.xls') or Tree('filename.csv'). Make sure that the first row of your file contains the variable names and that the first column contains the outcome group.

It can also import directly from a MATLAB file (.mat extension). Create a file with these three variables: X (matrix of covariate values), y (outcome values), and textdata (a cell array containing the names of the outcome and the covariates). To see an example, type [X, y, textdata] = ExcelImport('example.xls') or [X, y, textdata] = ExcelImport('yourfile.xls') and inspect the output.
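As a rough sketch, a .mat file with the three variables could be built as follows. The toy values, the file name tree_example.mat, and the ordering of textdata (outcome name first, then covariate names) are assumptions; comparing against the output of ExcelImport on example.xls is the safest way to confirm the expected layout.

```matlab
% Hypothetical toy data: 6 cases, 2 covariates.
X = [5.1 0; 4.9 1; 6.3 0; 5.8 1; 7.1 0; 6.5 1];   % covariate values
y = [0; 0; 1; 1; 2; 2];                           % numeric outcome classes
% Assumed layout: outcome name first, then one name per column of X.
textdata = {'Outcome', 'Length', 'Sex'};

% Save the three variables under the names the text asks for.
matFile = fullfile(tempdir, 'tree_example.mat');
save(matFile, 'X', 'y', 'textdata');

% Reload to confirm the file contains what was written.
loaded = load(matFile);
disp(fieldnames(loaded));
```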

There are two important requirements:

1) Outcome classes must be numeric, with values from 0 to n.

2) Outcome classes must not contain NaN (the code exits if they do).
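Both requirements can be checked before calling Tree. The sketch below assumes that "values from 0 to n" means non-negative whole-number class codes; the helper name checkOutcome is hypothetical and is not part of the package.

```matlab
% Hypothetical helper: true only for numeric, NaN-free, non-negative integer codes.
checkOutcome = @(v) isnumeric(v) && ~any(isnan(v)) && ...
                    all(v == round(v)) && min(v) >= 0;

yGood = [0; 1; 2; 1; 0];     % valid class codes
yBad  = [0; NaN; 1];         % contains a missing outcome

fprintf('yGood acceptable: %d\n', checkOutcome(yGood));
fprintf('yBad acceptable:  %d\n', checkOutcome(yBad));

% Rows with a missing outcome can be located and dropped before the analysis.
missingRows = find(isnan(yBad));
yFixed = yBad(~isnan(yBad));
```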

At this point, a first GUI helps you select the variables to include in the analysis, so you don't need to modify your original data file. A second GUI then asks which variables are categorical: select one or more if necessary.

Then the Tree function:

1) Calculates the relative importance of the features.

2) Draws the classification tree.

3) Performs cross-validation to find the best pruning level.

4) Plots the cost at each pruning level.

5) Plots ROC curves for each target class (output class) and displays the AUC.

6) Estimates the classification rate (accuracy) with 10-fold cross-validation and with leave-one-out cross-validation.
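The steps above can be roughly sketched with classregtree and perfcurve from the Statistics Toolbox. This is not the author's Tree code: it uses the toolbox's fisheriris sample data as a stand-in, with string class labels rather than numeric ones, and every variable name is an assumption. The column order of classprob is assumed to match classname. classregtree is deprecated in later MATLAB releases, so the sketch targets the older releases the text refers to.

```matlab
% Requires the Statistics Toolbox. fisheriris ships with it and stands in
% for a real data file here.
load fisheriris                        % meas (150x4), species (150x1 cell array)
X = meas;
y = species;
names = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};

% Grow a classification tree. Categorical covariates, if any, could be
% declared with the 'categorical' option (a vector of column indices).
t = classregtree(X, y, 'method', 'classification', 'names', names);

% Step 1: relative importance of each variable.
imp = varimportance(t);

% Step 3: cross-validated cost per pruning level (10-fold by default)
% and the suggested best level.
[cost, secost, ntnodes, bestlevel] = test(t, 'crossvalidate', X, y);
tpruned = prune(t, 'level', bestlevel);

% Step 4: cost against tree size.
figure;
plot(ntnodes, cost, 'o-');
xlabel('Number of terminal nodes');
ylabel('Cross-validated cost');

% Step 2: view(tpruned) opens the tree in a graphical viewer.

% Resubstitution accuracy of the pruned tree (optimistic; step 6 of the
% text uses cross-validation instead).
[yfit, nodes] = eval(tpruned, X);
resubAccuracy = mean(strcmp(yfit, y));

% Step 5: ROC curve and AUC for one class against the rest.
% Assumption: columns of P follow the order of classname(tpruned).
P = classprob(tpruned, nodes);
classes = classname(tpruned);
[fpr, tpr, thr, auc] = perfcurve(y, P(:, 1), classes{1});
fprintf('Resubstitution accuracy %.3f, AUC for %s %.3f\n', ...
        resubAccuracy, classes{1}, auc);
```

Repeating the perfcurve call for each entry of classes gives one ROC curve per output class.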

There are some important notes:

1) Pay attention when you save your data file. MATLAB's Excel import function does not handle every Excel file type well. On Mac OS X 10.6.2 with MATLAB 2009a, for example, you must save the file with Excel 95 compatibility.

2) Sometimes the Excel import function makes mistakes. In that case, check your file for numbers stored as strings or for blank columns on the right. I advise you to select the outcome and the covariates to analyze with the mouse, copy them into a new file (with MS Excel copy and paste), and use that file.

3) Handle data files with missing values with care. The MATLAB classregtree function doesn't use surrogate splits.

4) This code runs only with MATLAB 2009a (or 2009b). Earlier versions support classification trees, but the functions are quite different.
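Import problems of the kind described in notes 2 and 3 often show up as NaN entries in the numeric matrix. The following sketch assumes that unreadable or blank cells arrive as NaN; the matrix X is made up, and blankCols and incompleteRows are illustrative names.

```matlab
% Made-up imported matrix: column 2 is entirely blank, row 3 has a missing value.
X = [1.2 NaN 3;
     4.5 NaN 6;
     NaN NaN 9;
     7.7 NaN 1];

% Columns that are NaN in every row are likely blank or text-only columns.
blankCols = all(isnan(X), 1);
Xclean = X(:, ~blankCols);

% Rows that still contain NaN have genuinely missing covariate values.
incompleteRows = any(isnan(Xclean), 2);

fprintf('Blank columns: %s\n', mat2str(find(blankCols)));
fprintf('Rows with missing values: %s\n', mat2str(find(incompleteRows)'));
```

Whether to drop or impute the incomplete rows depends on the data; the sketch only reports them.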

An example file (example.xls) is included in the zip archive. In MATLAB, type Tree('example.xls') to start.

Please send me your opinion.

