How to Encode Labels for Machine Learning Classification Problems

As the saying goes, "it pays to read widely."
That is true: I had long assumed that one-hot was the only way to encode labels for classification problems, mostly because it is what gets used the most, though hashing should also work.
Then I came across the category_encoders library, and it felt like discovering a new continent.
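To make the two encodings mentioned above concrete, here is a minimal plain-Python sketch of one-hot encoding and the hashing trick for class labels. This is an illustration of the ideas only, not the category_encoders API; the function names and the bucket count are illustrative choices.

```python
import hashlib

# Minimal sketch of one-hot and hash encoding for class labels
# (plain Python, not the category_encoders API).

def one_hot(label, categories):
    """Return a 0/1 vector with a single 1 at the label's index."""
    return [1 if c == label else 0 for c in categories]

def hash_encode(label, n_buckets=8):
    """Map a label into a fixed-size vector via a hash bucket
    (the 'hashing trick'); collisions are possible by design."""
    bucket = int(hashlib.md5(label.encode()).hexdigest(), 16) % n_buckets
    vec = [0] * n_buckets
    vec[bucket] = 1
    return vec

labels = ['cat', 'dog', 'bird']
print(one_hot('dog', labels))  # [0, 1, 0]
print(hash_encode('dog'))      # one of 8 buckets is set to 1
```

Note that hash encoding keeps the vector size fixed no matter how many labels appear, at the cost of possible collisions, which is why it suits high-cardinality settings.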

https://github.com/scikit-learn-contrib/category_encoders
With it, you are free to experiment. How you understand the label (the target) determines the encoding. Machine learning is, at heart, encoding and decoding; finding the encoding between input and output that does the least work while delivering the best performance is where the research value lies.
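category_encoders bundles many schemes beyond one-hot, such as target, leave-one-out, weight-of-evidence, and count encoding. As a flavor of what these do, here is a hand-rolled sketch of target (mean) encoding, which replaces each category with the mean target value observed for it. This illustrates the idea only; it is not the library's implementation, and the data is made up.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Map each category to the mean of the targets seen with it
    in the training data (basic target / mean encoding)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

# Toy training data: a categorical feature and a binary target
colors = ['red', 'blue', 'red', 'blue', 'red']
y      = [1,     0,      1,     1,      0]

mapping = target_encode(colors, y)   # red -> 2/3, blue -> 1/2
encoded = [mapping[c] for c in colors]
print(encoded)
```

In practice, libraries add smoothing or leave-one-out tricks on top of this to avoid overfitting rare categories, which is exactly the kind of refinement category_encoders provides out of the box.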

Here is a simple machine learning example: a cat-vs-dog image classifier.

First, import the necessary libraries:

```
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten, Conv2D, MaxPooling2D
```

Then define a few constants:

```
# Image size and the two classes
IMG_SIZE = 64
CHANNELS = 3
CATEGORIES = ['cat', 'dog']
NUM_CATEGORIES = len(CATEGORIES)
```

Next, read and preprocess the data:

```
# Read cat and dog images from directories named after the categories
def read_data():
    X = []
    y = []
    for i, category in enumerate(CATEGORIES):
        for file_name in os.listdir(category):
            img_path = os.path.join(category, file_name)
            img = cv2.imread(img_path)
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            X.append(img)
            y.append(i)
    return X, y

# Preprocess the data
def preprocess_data(X, y):
    # Normalize pixel values to [0, 1]
    X = np.array(X).astype('float32')
    X /= 255.0
    # One-hot encode the labels
    y = to_categorical(y, NUM_CATEGORIES)
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
```

Then define the model:

```
# Define a small CNN
def create_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), input_shape=(IMG_SIZE, IMG_SIZE, CHANNELS), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(NUM_CATEGORIES))
    model.add(Activation('softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```

Finally, train and evaluate the model:

```
X, y = read_data()
X_train, X_test, y_train, y_test = preprocess_data(X, y)
model = create_model()
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_data=(X_test, y_test))

# Evaluate performance on the test set
score = model.evaluate(X_test, y_test, verbose=1)
print('Test loss: ', score[0])
print('Test accuracy: ', score[1])
```

Note: this is only a simple example; in practice you may need to adjust and tune it for your application to get better performance.
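At prediction time, the model's softmax output has to be decoded back into a class name, which is the inverse of the one-hot encoding applied to the labels above. A minimal plain-Python sketch (an argmax over the probability vector; the probabilities shown are made up for illustration):

```python
CATEGORIES = ['cat', 'dog']

def decode_prediction(probs, categories):
    """Pick the class whose softmax probability is highest (argmax),
    undoing the one-hot encoding used for training labels."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return categories[best]

print(decode_prediction([0.2, 0.8], CATEGORIES))  # dog
print(decode_prediction([0.9, 0.1], CATEGORIES))  # cat
```

With a trained Keras model, the same decode would be applied row by row to `model.predict(X_test)`.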
