pandas: Best Practices for Encoding Text Data as Integer Category Codes

Latest recommended article published on 2023-06-02 14:45:44

easycofe

Article link: https://blog.csdn.net/weixin_31205797/article/details/111962510

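This article explores how to use pandas and scikit-learn to convert a dataset containing categorical variables into numeric encodings for analysis. Through examples it introduces string replacement, label encoding, one-hot encoding, and custom binary encoding, along with their advantages and drawbacks. A UCI dataset of automobile data is used for the demonstration.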

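Problem Description

In much real-world data processing work, datasets contain categorical variables. These variables are usually stored as text values that represent various traits. Examples include color ("red", "yellow", "blue"), size ("small", "medium", "large"), or geographic names (states or countries). Whatever the values are, the challenge is deciding how to use this data in an analysis. Many machine learning algorithms support categorical values without further manipulation, but many others do not. The analyst therefore has to work out how to turn these text attributes into numeric values for further processing.

As with many other aspects of the data science world, there is no single answer to this problem. Every approach involves trade-offs and can affect the outcome of the analysis. Fortunately, the Python tools pandas and scikit-learn provide several methods for converting categorical data into suitable numeric values. This article surveys some common (and a few more complex) approaches, in the hope that it helps others apply these techniques to their own real-world problems.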

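The Dataset

For this article, I found a good dataset in the UCI Machine Learning Repository. This particular automobile dataset contains a mix of categorical and continuous values and serves as a useful example that is relatively easy to understand. Because domain knowledge is an important factor in deciding how to encode the various categorical values, this dataset makes a good case study.

Before we start encoding the various values, we need to load the data and do a little cleanup. Fortunately, pandas makes this straightforward: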

import pandas as pd
import numpy as np

# Define the column names, because the dataset does not include a header row
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read the dataset from the web and convert "?" to missing values (NaN)
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")

# Show the first five rows, limited to the first ten columns
df.head()[df.columns[:10]]

Output:

   symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base
0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6
1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6
2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5
3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8
4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4
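The effect of na_values can be checked without network access. The following is a minimal sketch, assuming a hypothetical three-row excerpt in the same comma-separated layout as the dataset, cut down to four columns. The names raw and sample are illustrative and do not come from the original article.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same layout as the dataset,
# reduced to four columns; "?" marks a missing value.
raw = io.StringIO(
    "3,?,alfa-romero,gas\n"
    "1,?,alfa-romero,gas\n"
    "2,164,audi,gas\n"
)

sample = pd.read_csv(
    raw,
    header=None,  # the data has no header row
    names=["symboling", "normalized_losses", "make", "fuel_type"],
    na_values="?",  # parse "?" as NaN
)

# Because "?" is parsed as NaN, the column is numeric instead of text.
print(sample["normalized_losses"].isna().sum())
print(sample.dtypes)
```

Without na_values, the "?" entries would force normalized_losses to be read as a text column, which is why the conversion is done at load time.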

Take a look at the data types of all the columns:

df.dtypes

Output:

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

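Since we only care about the text data, we select the columns whose type is "object". pandas provides the select_dtypes method for exactly this purpose:

# Keep only the text (object) columns and work on an independent copy
df2 = df.select_dtypes(include='object').copy()
df2.head()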

Output:

          make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi

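The dataset contains missing values, which would complicate the later steps. To keep things simple, we just drop the rows that have missing values:

df2.dropna(inplace=True)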
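Before dropping rows, it can help to see how many values are actually missing in each column. The sketch below uses a small made-up frame (the name toy and its rows are assumptions for illustration, not rows copied from the dataset). It also shows the non-inplace form of dropna, which returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one num_doors value is missing.
toy = pd.DataFrame({
    "make": ["dodge", "mazda", "audi", "bmw"],
    "num_doors": [np.nan, "four", "four", "two"],
})

# Count the missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace=True returns a new frame; toy keeps all rows.
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

Dropping is the simplest option, but it removes the whole row. If only a handful of rows are affected, as in this sketch, the loss is usually acceptable.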

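Approach I: Replacing Strings

The simplest approach is to find all the distinct strings in a column, assign each one a number, and then replace the strings with those numbers.

Use value_counts to get all the distinct strings:

col = 'body_style'
strs = df2[col].value_counts()
strs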

Output:

sedan          94
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64

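Map every string to a number:

# value_counts sorts by frequency, so the most common label gets code 0
value_map = {v: i for i, v in enumerate(strs.index)}
value_map

Output:

{'sedan': 0, 'hatchback': 1, 'wagon': 2, 'hardtop': 3, 'convertible': 4}

Use the replace method to substitute the strings:

df2.replace({col: value_map})[col].head()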

Output:

0    4
1    4
2    1
3    0
4    0
Name: body_style, dtype: int64

As you can see, not only were the strings replaced, but the dtype of the Series also became int64.
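A related option is Series.map, which applies a dictionary to a single column. The sketch below reuses the mapping shown above and adds one hypothetical label, "coupe", that is not in the dictionary. The variable names are illustrative. Unlike replace, which leaves unmatched strings in place, map turns any label missing from the dictionary into NaN, so gaps in the mapping are easy to spot. Depending on the pandas version, replace may also keep the object dtype or warn about implicit downcasting, which map avoids.

```python
import pandas as pd

# The mapping from the article's output, written out explicitly.
value_map = {"sedan": 0, "hatchback": 1, "wagon": 2, "hardtop": 3, "convertible": 4}

# "coupe" is a hypothetical label that the mapping does not cover.
body_style = pd.Series(
    ["convertible", "hatchback", "sedan", "coupe"], name="body_style"
)

# Labels missing from the dictionary become NaN instead of staying as text.
encoded = body_style.map(value_map)
print(encoded)

# List the labels that still need an entry in the mapping.
unmapped = body_style[encoded.isna()]
print(unmapped.tolist())
```

Because of the NaN, the encoded column is floating point here. Once every label has an entry, the result can be converted with astype(int).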

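Approach II: Label Encoding

Another way to encode categorical values is a technique called label encoding. Label encoding simply converts each value in a column to a number. For example, the body_style column contains 5 distinct values. We could choose to encode them like this:

convertible -> 0
hardtop -> 1
hatchback -> 2
sedan -> 3
wagon -> 4

First, convert the column's dtype to category:

bs = df2['body_style'].astype('category')
bs.head()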

Output:

0    convertible
1    convertible
2      hatchback
3          sedan
4          sedan
Name: body_style, dtype: category
Categories (5, object): [convertible, hardtop, hatchback, sedan, wagon]

Then you only need to use the category codes as the actual data:

bs.cat.codes.head()

Output:

0    0
1    0
2    2
3    3
4    3
dtype: int8
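By default, the codes depend on which labels are present in the data, so a subset that lacks a label would be numbered differently. One way to pin the numbering down is to declare the category list explicitly. The sketch below assumes the five body styles listed above and uses illustrative variable names. It also builds a code-to-label lookup so the encoding can be reversed.

```python
import pandas as pd

body_style = pd.Series(
    ["convertible", "convertible", "hatchback", "sedan", "sedan"],
    name="body_style",
)

# Declare the categories up front so the codes do not depend on which
# labels happen to appear in this particular subset of rows.
style_dtype = pd.CategoricalDtype(
    categories=["convertible", "hardtop", "hatchback", "sedan", "wagon"]
)
encoded = body_style.astype(style_dtype)

codes = encoded.cat.codes  # integer code for each row
lookup = dict(enumerate(encoded.cat.categories))  # code -> label
decoded = codes.map(lookup)  # back to the text labels

print(codes.tolist())
print(lookup)
```

With the explicit category list, these five rows receive the same codes as in the full dataset, even though "hardtop" and "wagon" do not appear among them.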

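Approach III: Converting to Dummy Variables (One-Hot Encoding)

Label encoding has the advantage of being simple, but its drawback is that an algorithm may "misinterpret" the numbers. For example, 0 is obviously less than 4, but does that ordering correspond to anything in the real-world data? Should a wagon (code 4) carry more weight in our calculation than a convertible (code 0)? In this example, I don't think so. Instead, we can convert the data to dummy variables (one-hot encoding), where each distinct value becomes its own 0/1 column. In pandas, this conversion takes a single line of code:

pd.get_dummies(df[['drive_wheels', 'body_style']]).head()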

Output:

   drive_wheels_4wd  drive_wheels_fwd  drive_wheels_rwd  body_style_convertible  body_style_hardtop  body_style_hatchback  body_style_sedan  body_style_wagon
0                 0                 0                 1                       1                   0                     0                 0                 0
1                 0                 0                 1                       1                   0                     0                 0                 0
2                 0                 0                 1                       0                   0                     1                 0                 0
3                 0                 1                 0                       0                   0                     0                 1                 0
4                 1                 0                 0                       0                   0                     0                 1                 0
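get_dummies can also encode selected columns of a larger frame while leaving the remaining columns as they are. The sketch below uses a small hand-built frame named cars, an illustrative name whose five rows mirror the head of the dataset shown earlier. It passes columns, dtype, and drop_first, all of which are existing get_dummies parameters. Recent pandas versions return boolean dummy columns by default, so dtype=int is passed to get 0/1 integers.

```python
import pandas as pd

# Hand-built frame mirroring the first five rows shown earlier.
cars = pd.DataFrame({
    "drive_wheels": ["rwd", "rwd", "rwd", "fwd", "4wd"],
    "body_style": ["convertible", "convertible", "hatchback", "sedan", "sedan"],
    "wheel_base": [88.6, 88.6, 94.5, 99.8, 99.4],
})

# Encode only the listed columns; wheel_base passes through unchanged.
encoded = pd.get_dummies(
    cars, columns=["drive_wheels", "body_style"], dtype=int
)
print(encoded.columns.tolist())

# drop_first=True removes one dummy column per variable, because that
# column is implied whenever all the others are 0.
reduced = pd.get_dummies(
    cars, columns=["drive_wheels", "body_style"], drop_first=True, dtype=int
)
print(reduced.columns.tolist())
```

Only labels that occur in the data get a column, which is why this small frame has no body_style_hardtop or body_style_wagon columns. The main cost of one-hot encoding is the number of columns it adds when a variable has many distinct values.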