NLP: One-Hot Encoding

Latest recommended article published on 2024-08-20 00:18:14

pj5624

Tags: natural language processing, artificial intelligence

Article link: https://blog.csdn.net/pj5624/article/details/134760476

>- **🍨 This post is a learning-log blog from the [🔗365-Day Deep Learning Training Camp](https://mp.weixin.qq.com/s/rbOOmire8OocQ90QM78DRA)**
>- **🍖 Original author: [K同学啊 | tutoring and custom projects available](https://mtyjkh.blog.csdn.net/)**

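Word vectors

Word vectors turn text into numbers that a computer can process.

If words are encoded as consecutive integers such as 1, 2, 3, a model may assume that the words have an order or that nearby numbers mean similar words. One-hot encoding avoids this: each word gets a vector as long as the vocabulary, with a single 1 at that word's position and 0 everywhere else.

For example, take the sentence "This is an apple":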

This: [1,0,0,0]

is: [0,1,0,0]

....
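The mapping above can be reproduced in a few lines of plain Python. This is a minimal sketch, not the original author's code: the helper names `build_vocab` and `one_hot` are made up for illustration, and the vocabulary order is assumed to follow first appearance in the sentence.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# English text can be split on whitespace; no word segmenter is needed here.
tokens = "This is an apple".split()
vocab = build_vocab(tokens)
encoded = {tok: one_hot(tok, vocab) for tok in tokens}

for tok, vec in encoded.items():
    print(f"{tok}: {vec}")
```

The vector length equals the vocabulary size, so adding a new word to the vocabulary changes the length of every vector.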

Advantages of one-hot encoding:

- Simplicity and ease of use: it is straightforward to implement and understand, which makes it a common first choice for preprocessing categorical data.
- Compatibility with machine learning models: many algorithms require numerical input, and one-hot encoding lets them consume categorical data.
- No implicit ordering: each category is its own binary vector, so the model cannot read a natural order into the categories, as it can with label encoding.
- Clear representation: every category is represented equally, with no implicit weight or priority, which matters for non-ordinal categories.
- Model performance: in some cases one-hot encoding performs better than label encoding, mainly for models that treat inputs as magnitudes (linear models, neural networks). Tree-based and ensemble methods are less sensitive to the choice and may not benefit when a variable has many categories.
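The "no implicit ordering" point can be checked numerically. The sketch below is only an illustration under assumed values: it gives the four words of "This is an apple" the integer labels 1 to 4 (an arbitrary choice), then compares pairwise distances under label encoding and under one-hot encoding.

```python
import math
from itertools import combinations

words = ["This", "is", "an", "apple"]

# Label encoding: consecutive integers (the 1..4 assignment is arbitrary).
label = {w: i + 1 for i, w in enumerate(words)}

# One-hot encoding: a single 1 at the word's position.
onehot = {w: [1 if j == i else 0 for j in range(len(words))]
          for i, w in enumerate(words)}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


label_dists = {(a, b): abs(label[a] - label[b])
               for a, b in combinations(words, 2)}
onehot_dists = {(a, b): euclidean(onehot[a], onehot[b])
                for a, b in combinations(words, 2)}

# Under label encoding "This" looks three times farther from "apple" than from "is".
print(label_dists)
# Under one-hot encoding every pair of distinct words is equally far apart.
print(onehot_dists)
```

With integer labels the distances depend entirely on the arbitrary numbering, while every pair of one-hot vectors sits at the same distance (the square root of 2).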

Disadvantages of one-hot encoding:

- Curse of dimensionality: each unique value adds a dimension, so a variable with many unique values greatly increases dimensionality, computational cost, and memory usage.
- Sparsity of data: the resulting matrix is mostly zeros, which is inefficient to store and can hurt model performance, particularly for algorithms not designed for sparse data.
- Loss of information: if the categories are inherently ordered (ordinal data), that order is lost, because each category is treated equally and independently.
- Does not capture relationships: every category is treated as independent, so similarity between categories (for example, between related words) is not represented.
- Model overfitting: with a large number of features, models may overfit, especially when the dataset is too small to support the added dimensionality.
- Ineffectiveness for high cardinality: for variables with a very large number of categories, such as a full word vocabulary, the problems above make one-hot encoding impractical.
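Two of these drawbacks, sparsity and the lack of relationships between categories, are easy to see in numbers. The following is a rough sketch; the vocabulary sizes 4 and 1000 are arbitrary values chosen for illustration, and `density` and `dot` are hypothetical helper names.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def density(vocab_size):
    """Fraction of non-zero entries when every word in the vocabulary is encoded once."""
    rows = [one_hot(i, vocab_size) for i in range(vocab_size)]
    nonzero = sum(sum(row) for row in rows)
    return nonzero / (vocab_size * vocab_size)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Sparsity: only 1 entry in vocab_size is non-zero, so density is 1 / vocab_size.
for vocab_size in (4, 1000):
    print(vocab_size, density(vocab_size))

# No relationships: the dot product of any two distinct one-hot vectors is 0,
# so no pair of words looks more similar than any other pair.
print(dot(one_hot(0, 4), one_hot(3, 4)))
print(dot(one_hot(0, 4), one_hot(0, 4)))
```

As the vocabulary grows, the share of useful (non-zero) entries shrinks in proportion, which is why dense word embeddings are usually preferred for large vocabularies.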

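Code implementation:

Use jieba word segmentation to one-hot encode Chinese text. Chinese has no spaces between words, so the text must be segmented into words before a vocabulary can be built.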
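The original code for this step is not included in the post, so the following is only a sketch of how it might look, not the author's implementation. It assumes the third-party `jieba` package is installed and uses its `jieba.lcut` function for segmentation; if `jieba` is missing, it falls back to character-level splitting so the example still runs. The sample sentences and the helper names `segment` and `one_hot_encode` are made up for illustration.

```python
try:
    import jieba  # third-party package: pip install jieba

    def segment(text):
        """Segment Chinese text into a list of words with jieba."""
        return jieba.lcut(text)
except ImportError:
    def segment(text):
        """Fallback (assumption): treat every character as a token."""
        return list(text)


def one_hot_encode(texts):
    """Segment each text, build a shared vocabulary, and one-hot encode every token."""
    tokenized = [[tok for tok in segment(text) if tok.strip()] for text in texts]

    # Vocabulary indices follow the order of first appearance.
    vocab = {}
    for tokens in tokenized:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))

    encoded = []
    for tokens in tokenized:
        rows = []
        for tok in tokens:
            vec = [0] * len(vocab)
            vec[vocab[tok]] = 1
            rows.append(vec)
        encoded.append(rows)
    return vocab, encoded


# Hypothetical sample sentences, chosen only for illustration.
sample_texts = ["我喜欢自然语言处理", "我喜欢深度学习"]
vocab, encoded = one_hot_encode(sample_texts)

print(vocab)
for text, rows in zip(sample_texts, encoded):
    print(text)
    for vec in rows:
        print(" ", vec)
```

Each sentence becomes a list of vectors, one per segmented word, and all vectors share the same length because the vocabulary is built over all texts first. The exact words in the vocabulary depend on how jieba segments the input.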
