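>- **🍨 This post is a learning-record blog from the [🔗365-Day Deep Learning Training Camp](https://mp.weixin.qq.com/s/rbOOmire8OocQ90QM78DRA)**
>- **🍖 Original author: [K同学啊 | tutoring and custom projects available](https://mtyjkh.blog.csdn.net/)**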
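Word vectors
Word vectors convert text into numbers that a computer can process.
Numbering words with consecutive integers such as 1, 2, 3 may lead the model to assume an ordering or numeric relationship between words, so one-hot encoding is used instead.
For example, for the sentence "This is an apple":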
This: [1,0,0,0]
is: [0,1,0,0]
....
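
The example above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name `one_hot_encode` and the choice to index words in order of first appearance are assumptions made here, not something fixed by the original notes.

```python
def one_hot_encode(tokens):
    """Map each distinct token to a one-hot vector, indexed by first appearance."""
    # Assign an index to every distinct token, in the order it is first seen.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)

    # Build one vector per token: all zeros except a 1 at the token's index.
    vectors = {}
    for tok, idx in vocab.items():
        vec = [0] * len(vocab)
        vec[idx] = 1
        vectors[tok] = vec
    return vectors


sentence = "This is an apple"
vectors = one_hot_encode(sentence.split())
for word, vec in vectors.items():
    print(word, vec)
# This [1, 0, 0, 0]
# is [0, 1, 0, 0]
# an [0, 0, 1, 0]
# apple [0, 0, 0, 1]
```

The vector length equals the vocabulary size, so every new distinct word adds one more dimension.
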
Advantages of One-Hot Encoding
- Simplicity and Ease of Use: It's straightforward to implement and understand. This makes it a popular choice for initial data processing.
- Compatibility with Machine Learning Models: Many algorithms require numerical input, and one-hot encoding provides a way to use categorical data with these models.
- No Implicit Ordering: Since each category is represented by a binary vector, it ensures that the model doesn't assume a natural ordering among categories (which is the case with label encoding).
- Clear Representation: Each category is equally represented without any implicit weight or priority, which can be important for non-ordinal categories.
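- Model Performance: In some cases one-hot encoding improves performance over integer label encoding, especially for linear models and neural networks, which would otherwise treat the integer codes as magnitudes. Tree-based models are usually less sensitive to the choice.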
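
To make the "no implicit ordering" point concrete, here is a small sketch comparing label encoding with one-hot encoding. The three colour categories and the helper names are made up for illustration. With integer labels, some categories end up "closer" than others; with one-hot vectors, every pair of distinct categories is equally far apart.

```python
import math

# Hypothetical categories, chosen only for illustration.
categories = ["red", "green", "blue"]

# Label encoding: one number per category (red=0, green=1, blue=2).
label = {c: float(i) for i, c in enumerate(categories)}

# One-hot encoding: one binary vector per category.
one_hot = {
    c: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, c in enumerate(categories)
}


def label_distance(a, b):
    """Absolute difference between two label-encoded categories."""
    return abs(label[a] - label[b])


def one_hot_distance(a, b):
    """Euclidean distance between two one-hot encoded categories."""
    return math.dist(one_hot[a], one_hot[b])


# Label encoding suggests red is closer to green than to blue.
print(label_distance("red", "green"), label_distance("red", "blue"))
# 1.0 2.0

# One-hot encoding keeps all distinct categories equally far apart.
print(math.isclose(one_hot_distance("red", "green"),
                   one_hot_distance("red", "blue")))
# True
```
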
Drawbacks of One-Hot Encoding
- Curse of Dimensionality: It significantly increases the data dimensionality, especially if the categorical variable has many unique values. This can lead to increased computational complexity and memory usage.
- Sparsity of Data: The resulting data matrix is often sparse, which can be inefficient and can negatively impact model performance, particularly in algorithms not designed for sparse data.
- Loss of Information: Information about categories that might be inherently ordered (ordinal data) is lost, as each category is treated equally and independently.
- Does Not Capture Relationships: One-hot encoding treats each category as independent and does not capture any possible relationships between categories.
- Model Overfitting: With a large number of features, models might overfit, especially if the dataset is not large enough to support the increased dimensionality.
- Ineffectiveness for High Cardinality: It is not effective for categorical variables with a large number of categories (high cardinality).
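
Two of these drawbacks are easy to see in a short sketch: one-hot vectors carry no similarity information, and the encoding gets sparser as the vocabulary grows. The toy vocabulary and the helper names below are assumptions for illustration only.

```python
def one_hot_matrix(vocab):
    """Return an n x n identity-like matrix: row i is the one-hot vector of vocab[i]."""
    n = len(vocab)
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sparsity(n):
    """Fraction of zero entries when n categories are one-hot encoded (n ones in n*n cells)."""
    return 1 - 1 / n


# Hypothetical toy vocabulary.
vocab = ["cat", "kitten", "dog", "car"]
index = {w: i for i, w in enumerate(vocab)}
matrix = one_hot_matrix(vocab)

# Related words ("cat", "kitten") look exactly as unrelated as ("cat", "car"):
print(dot(matrix[index["cat"]], matrix[index["kitten"]]))  # 0
print(dot(matrix[index["cat"]], matrix[index["car"]]))     # 0

# Vector length grows with the vocabulary, and almost every entry is zero.
for n in (10, 1_000, 100_000):
    print(f"vocab size {n}: vector length {n}, zero fraction {sparsity(n):.5f}")
```
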
Code implementation:
Use jieba word segmentation to one-hot encode Chinese text
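
A minimal sketch of this idea, not necessarily the exact code used in the training camp: it assumes the third-party `jieba` package is installed (`pip install jieba`) and uses `jieba.lcut` for segmentation. The helper names `segment`, `build_vocab`, `one_hot` and `encode_corpus` are hypothetical and chosen here for illustration.

```python
def segment(text):
    """Split Chinese text into a list of words with jieba."""
    import jieba  # third-party; imported lazily so the other helpers work without it
    return jieba.lcut(text)


def build_vocab(token_lists):
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(tokens, vocab):
    """Encode a token list as a list of one-hot vectors over the vocabulary."""
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        vec[vocab[tok]] = 1
        vectors.append(vec)
    return vectors


def encode_corpus(texts, tokenizer=segment):
    """Segment every text, build a shared vocabulary, and one-hot encode each text."""
    token_lists = [tokenizer(t) for t in texts]
    vocab = build_vocab(token_lists)
    encoded = [one_hot(tokens, vocab) for tokens in token_lists]
    return vocab, encoded
```

Calling `encode_corpus(["我爱深度学习", "我爱自然语言处理"])` (two made-up sample sentences) returns the shared vocabulary and one matrix per sentence, with one row per segmented word. The exact tokens depend on jieba's dictionary, so no output is shown here. Passing a different `tokenizer`, for example `str.split`, lets the same helpers handle whitespace-separated text.
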