NLP: One-Hot Encoding

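>- **🍨 This post is a study log from the [🔗365天深度学习训练营](https://mp.weixin.qq.com/s/rbOOmire8OocQ90QM78DRA) (365-Day Deep Learning Training Camp)**
>- **🍖 Original author: [K同学啊 | tutoring and custom projects available](https://mtyjkh.blog.csdn.net/)**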

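Word vectors

Text has to be converted into numbers before a computer can process it.

Numbering words with consecutive integers such as 1, 2, 3 may lead the model to assume an order or a numeric relationship between different words, so one-hot encoding is used instead: each word is represented by a vector as long as the vocabulary, with a 1 at that word's index and 0 everywhere else.

For example, take the sentence "This is an apple":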

This: [1,0,0,0]

is: [0,1,0,0]

....
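
The example above can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper names `build_vocab` and `one_hot` are illustrative rather than taken from the original post, and indices are assumed to be assigned in order of first appearance.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a list of length len(vocab) with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


tokens = "This is an apple".split()
vocab = build_vocab(tokens)
encoded = {tok: one_hot(tok, vocab) for tok in tokens}

for tok, vec in encoded.items():
    print(f"{tok}: {vec}")
```

Each vector has length 4 because the sentence contains four distinct words; a larger vocabulary would produce proportionally longer vectors.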

Advantages of One-Hot Encoding

  1. Simplicity and Ease of Use: It's straightforward to implement and understand. This makes it a popular choice for initial data processing.

  2. Compatibility with Machine Learning Models: Many algorithms require numerical input, and one-hot encoding provides a way to use categorical data with these models.

  3. No Implicit Ordering: Since each category is represented by its own binary vector, the model is not led to assume a natural ordering among categories (an assumption that integer label encoding can introduce).

  4. Clear Representation: Each category is equally represented without any implicit weight or priority, which can be important for non-ordinal categories.

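  5. Model Performance: In some cases, one-hot encoding can lead to better performance than integer label encoding, particularly for models such as linear models and neural networks that would otherwise treat the integer codes as magnitudes. Tree-based models and ensembles are less sensitive to the choice and do not always benefit.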
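
The "no implicit ordering" point can be checked numerically. The sketch below uses three made-up categories (`cat`, `dog`, `bird` are illustrative, not from the original post) and compares distances under integer label encoding with distances under one-hot encoding.

```python
import math

# Hypothetical categories, encoded two ways.
label_codes = {"cat": 1, "dog": 2, "bird": 3}
one_hot_codes = {
    "cat":  [1, 0, 0],
    "dog":  [0, 1, 0],
    "bird": [0, 0, 1],
}


def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])


# Label encoding makes "cat" look closer to "dog" than to "bird".
print(label_distance("cat", "dog"), label_distance("cat", "bird"))

# One-hot encoding keeps every pair of distinct categories equally far apart.
print(one_hot_distance("cat", "dog"), one_hot_distance("cat", "bird"))
```

Under label encoding the distances differ only because of the arbitrary numbering; under one-hot encoding every pair of distinct categories is the same distance apart (the square root of 2).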

Drawbacks of One-Hot Encoding

  1. Curse of Dimensionality: It significantly increases the data dimensionality, especially if the categorical variable has many unique values. This can lead to increased computational complexity and memory usage.

  2. Sparsity of Data: The resulting data matrix is often sparse, which can be inefficient and can negatively impact model performance, particularly in algorithms not designed for sparse data.

  3. Loss of Information: Information about categories that might be inherently ordered (ordinal data) is lost, as each category is treated equally and independently.

  4. Does Not Capture Relationships: One-hot encoding treats each category as independent and does not capture any possible relationships between categories.

  5. Model Overfitting: With a large number of features, models might overfit, especially if the dataset is not large enough to support the increased dimensionality.

  6. Ineffectiveness for High Cardinality: It is not effective for categorical variables with a large number of categories (high cardinality).
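
Two of these drawbacks, sparsity and the lack of relationships between categories, are easy to see on a toy example. The sketch below is illustrative only: the token list and the helper names `one_hot_matrix`, `density`, and `dot` are assumptions made for the demonstration.

```python
def one_hot_matrix(tokens):
    """Return (vocab, rows): a sorted vocabulary and one one-hot row per token."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[index[tok]] = 1
        rows.append(row)
    return vocab, rows


def density(rows):
    """Fraction of entries in the matrix that are non-zero."""
    total = sum(len(row) for row in rows)
    nonzero = sum(sum(row) for row in rows)
    return nonzero / total


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


tokens = ["apple", "pear", "is", "an", "apple"]
vocab, rows = one_hot_matrix(tokens)

# Each row has exactly one non-zero entry, so density is 1 / len(vocab).
print(len(vocab), density(rows))

# "apple" is no more similar to "pear" than it is to "is": both dot products are 0.
print(dot(rows[0], rows[1]), dot(rows[0], rows[2]))
```

With a vocabulary of V words the matrix density is 1/V, so for a realistic vocabulary of tens of thousands of words almost every entry is zero. The zero dot products show that the encoding carries no notion of two words being related.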

Code implementation:

One-hot encoding Chinese text using jieba word segmentation
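
A minimal sketch of this approach, assuming the third-party `jieba` package is installed (`pip install jieba`) and using its `jieba.lcut` function for segmentation. This is not the original author's code: the function names `segment` and `one_hot_encode` and the `tokenizer` parameter are illustrative choices.

```python
def segment(text):
    """Split Chinese text into a list of words using jieba.

    jieba is imported lazily so the encoder below can also be used
    with another tokenizer when jieba is not available.
    """
    import jieba  # third-party dependency (assumed installed)
    return jieba.lcut(text)


def one_hot_encode(text, tokenizer=segment):
    """Tokenize text and one-hot encode every token.

    Returns (vocab, vectors):
      vocab   maps each distinct word to an index, in order of first appearance
      vectors holds one one-hot list per token, in sentence order
    """
    # Drop whitespace-only tokens that a tokenizer may emit.
    tokens = [tok for tok in tokenizer(text) if tok.strip()]

    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        vec[vocab[tok]] = 1
        vectors.append(vec)
    return vocab, vectors
```

Calling `one_hot_encode("我喜欢吃苹果")` segments the sentence with jieba and returns the vocabulary together with one vector per word; the exact split depends on jieba's dictionary. Any other callable that maps a string to a list of tokens can be passed as `tokenizer`, for example `one_hot_encode("a b a", tokenizer=str.split)`.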
