Handling Categorical Features in Data Mining

Categorical features are also called discrete or class features. They are usually stored as the object dtype, while most machine learning models can only work with numeric data, so categorical data has to be converted into numeric features.

Categorical features come in two kinds; we need to understand what each one means and apply the corresponding conversion.

  • Ordinal: these categorical values carry a natural order, so sorting them in ascending or descending order is meaningful. For example, a grade feature might take the four values A, B, C, D, and ranking them by performance gives A > B > C > D.
  • Nominal: the ordinary categorical type, whose values cannot be meaningfully ordered. For example, a blood-type feature might take the values A, B, O, AB, but you cannot conclude that A > B > O > AB.

Ordinal and Nominal data are converted into numbers in different ways.

Ordinal data can be encoded with LabelEncoder. For example, passing the grades A, B, C, D through LabelEncoder maps them to 0, 1, 2, 3 (labels are assigned in sorted order, which here matches the grade order), so the natural ordering among the values is carried over into the numbers.
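A minimal sketch of this step (the `grade` column and its sample values are invented for illustration); it shows both LabelEncoder and an explicit mapping, the latter being the safer choice when alphabetical order does not match the real ordering:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# "grade" is a hypothetical column used only for illustration
df = pd.DataFrame({"grade": ["B", "A", "D", "C", "A"]})

# LabelEncoder assigns 0..n-1 in sorted (alphabetical) order: A->0, B->1, C->2, D->3.
# Here that happens to agree with the grade order A > B > C > D.
le = LabelEncoder()
df["grade_le"] = le.fit_transform(df["grade"])

# An explicit mapping states the intended order unambiguously
# (e.g. larger number = better grade).
grade_order = {"D": 1, "C": 2, "B": 3, "A": 4}
df["grade_manual"] = df["grade"].map(grade_order)

print(df)
```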

Nominal data can be one-hot encoded, either with sklearn's OneHotEncoder or with pandas' get_dummies(), as sketched after the steps below:

  1. Use pandas’ get_dummies() function to return a new DataFrame containing a new column for each dummy variable
  2. Use the concat() method to add these dummy columns back to the original DataFrame
  3. Then drop the original columns entirely using the drop method
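A minimal sketch of these three steps, using the blood-type feature mentioned above as an assumed example (`blood_type`, `age`, and the sample rows are invented for illustration):

```python
import pandas as pd

# "blood_type" and "age" are hypothetical columns used only for illustration
df = pd.DataFrame({"blood_type": ["A", "B", "O", "AB", "O"],
                   "age": [23, 35, 41, 29, 52]})

# 1. get_dummies() returns a new DataFrame with one column per dummy variable
dummies = pd.get_dummies(df["blood_type"], prefix="blood")

# 2. concat() adds the dummy columns back to the original DataFrame
df = pd.concat([df, dummies], axis=1)

# 3. drop() removes the original categorical column
df = df.drop(columns=["blood_type"])

print(df)
```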

  • In case you are dealing with an ordinal feature, you map its values to 1, 2, 3, 4 or 3, 2, 1 or whatever, if they are not already mapped. An ordinal feature means its values can be arranged in an order that makes logical sense. For example, you have a feature “Size” with alphanumeric values, say “small, medium, big”; indeed “big” is bigger than “small”, so you can compare those values and it makes sense. You map “small, medium, big” to 1, 2, 3, for example. Example in Titanic: Pclass is an ordinal feature: Pclass=1 is better than Pclass=3. Note that in this case Pclass is already mapped to 1, 2, 3, so you don’t have to do anything with it. You would have to map it if Pclass contained alphanumeric values like “high_class, medium_class, low_class”.
  • In case you are dealing with a nominal (categorical) feature, you look at how many categories (possible values of that particular feature) it has. If there are only 2 categories, you map them to 0 and 1 (or -1 and 1) and that’s it. If there are more than 2 categories, you create dummy variables. Example in Titanic: Sex is a categorical variable with 2 categories, ‘male’ and ‘female’; you map them, for example, to 0 and 1, and that’s it. Note that it’s not ordinal, because male is neither better nor worse than female; you can’t logically compare them. Now, Embarked is a categorical feature too, but it has 3 categories instead of just 2, so you make dummy variables out of this feature, and you make just 2, not 3: the 3rd one is redundant (see the sketch after this list). Well, this feature is arguably redundant by itself, but anyway.
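A sketch of the Titanic cases just described; the small hand-built DataFrame only stands in for the real dataset, which you would normally load with something like `pd.read_csv`:

```python
import pandas as pd

# A few hand-made rows standing in for the real Titanic data
df = pd.DataFrame({
    "Pclass":   [3, 1, 2, 3],                          # already ordinal, leave as-is
    "Sex":      ["male", "female", "female", "male"],  # 2 categories
    "Embarked": ["S", "C", "Q", "S"],                  # 3 categories
})

# 2 categories -> a simple 0/1 mapping is enough
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# 3 categories -> dummy variables; drop_first=True keeps only 2 of the 3
# dummy columns, since the third one is redundant
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

print(df)
```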

Edit: following further discussion, there are cases where turning ordinal features into dummies may improve your score a bit. It’s hard to tell beforehand, so it can be useful to build 2 sets of features, one keeping the ordinal encoding and the other with the ordinal features one-hot encoded, compare the results on various models, and pick whichever works best in your specific case.
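One way to run that comparison, sketched with an invented `Size` feature and logistic regression chosen arbitrarily as the model; any other model or scoring scheme would plug in the same way:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: "Size" is an ordinal feature, "y" a binary target (both invented)
df = pd.DataFrame({
    "Size": ["small", "medium", "big", "small", "big", "medium"] * 10,
    "y":    [0, 0, 1, 0, 1, 1] * 10,
})

# Feature set 1: keep the ordinal encoding
X_ordinal = df["Size"].map({"small": 1, "medium": 2, "big": 3}).to_frame()

# Feature set 2: one-hot encode the ordinal feature instead
X_onehot = pd.get_dummies(df["Size"], drop_first=True)

# Compare both encodings with the same model via cross-validation
for name, X in [("ordinal", X_ordinal), ("one-hot", X_onehot)]:
    score = cross_val_score(LogisticRegression(), X, df["y"], cv=5).mean()
    print(name, round(score, 3))
```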

