Change the data type of columns in Pandas

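This article covers ways to convert the data types of DataFrame columns in Pandas: basic conversion with `to_numeric`, including error handling and downcasting; explicit type conversion with `astype`; and soft conversion with `infer_objects`. It provides examples and caveats to help users handle type conversion dynamically across many columns.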

Translated from: Change data type of columns in Pandas

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.


Reply #1

Reference: https://stackoom.com/question/14fz4/更改Pandas中列的数据类型


Reply #2

How about this?

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes
Out[17]: 
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes
Out[19]: 
one       object
two      float64
three    float64

Reply #3

You have three main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

Read on for more detailed explanations and usage of each of these methods.


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
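As a minimal sketch of the multi-column case, the snippet below applies pd.to_numeric to two string columns and prints the resulting dtypes. The DataFrame contents are made up for illustration; the column of whole-number strings should come out with an integer dtype and the one containing a decimal with a float dtype.

```python
import pandas as pd

# Illustrative data (not from the question): every value is stored as a string.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["1.5", "2", "3"], "c": ["x", "y", "z"]})

# Convert only the columns expected to hold numbers; "c" is left untouched.
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

print(df.dtypes)
```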

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise an error if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
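When coercing, it can be useful to see which entries were turned into NaN. One possible way, sketched here with illustrative variable names, is to select the values that are missing after conversion but were not missing before:

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
converted = pd.to_numeric(s, errors='coerce')

# Entries that became NaN only because of the coercion.
failed = s[converted.isna() & s.notna()]
print(failed.tolist())
```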

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame, but don't know which of your columns can be converted reliably to a numeric type (note, though, that errors='ignore' is deprecated as of pandas 2.2). In that case just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
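Where errors='ignore' is unavailable or emits a deprecation warning, a similar effect can be sketched by catching the exception yourself. The helper name to_numeric_where_possible is hypothetical, not part of pandas, and the data is the table from the question:

```python
import pandas as pd

def to_numeric_where_possible(column):
    """Return the column as a numeric dtype, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# apply() calls the helper once per column.
converted = df.apply(to_numeric_where_possible)
print(converted.dtypes)
```

This keeps the "convert what you can" behaviour without relying on the errors keyword.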

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?

to_numeric() gives you the option to downcast to one of 'integer', 'signed', 'unsigned' or 'float'. Here's an example for a simple series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
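To get a rough sense of the memory saved, the sketch below compares the bytes used by the values before and after downcasting. It uses Series.memory_usage(index=False) so that only the data, not the index, is counted; the exact figures depend on the starting dtype on your platform.

```python
import pandas as pd

s = pd.Series([1, 2, -7])
small = pd.to_numeric(s, downcast='integer')

# Bytes used by the values alone, before and after downcasting.
before = s.memory_usage(index=False)
after = small.memory_usage(index=False)
print(s.dtype, before)
print(small.dtype, after)
```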

2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:
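The NaN case can be reproduced with a small sketch. The second half goes beyond the original answer: it assumes a pandas version that provides the nullable 'Int64' extension dtype, which is one possible way to keep integer values alongside missing entries.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# A plain integer dtype has no way to represent NaN, so this conversion raises.
try:
    s.astype(int)
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

# Assumption: the nullable integer dtype (note the capital "I") is available.
nullable = s.astype("Int64")
print(nullable)
```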

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
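A short sketch of that safer route: with downcast='unsigned', to_numeric only switches to an unsigned type when every value is non-negative, so the Series containing -7 should keep its original dtype and values. The second Series is an illustrative non-negative counterpart.

```python
import pandas as pd

s = pd.Series([1, 2, -7])

# The negative value rules out an unsigned type, so no downcast happens.
safe = pd.to_numeric(s, downcast='unsigned')
print(safe.dtype, safe.tolist())

# With only non-negative values, the smallest unsigned type is chosen.
non_negative = pd.to_numeric(pd.Series([1, 2, 7]), downcast='unsigned')
print(non_negative.dtype)
```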


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
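The difference between the soft and the forced conversion can be seen side by side in this small sketch, which reuses the DataFrame from the example above (the variable names inferred and forced are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# Soft conversion: only column 'a', which already holds integers, changes.
inferred = df.infer_objects()
print(inferred.dtypes)

# Forced conversion: the strings in column 'b' are parsed as integers too.
forced = df.astype(int)
print(forced.dtypes)
```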


Reply #4

Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.

# df is the DataFrame, and column_list is a list of columns as strings (e.g. ["col1","col2","col3"])
# dependencies: pandas

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

So, for your example:

import pandas as pd

def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])

coerce_df_columns_to_numeric(df, ['col2','col3'])
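Note that the function above modifies df in place and returns nothing. If you would rather keep the original DataFrame unchanged, one possible variant, sketched here under the hypothetical name coerced_copy, works on a copy and returns it:

```python
import pandas as pd

def coerced_copy(df, column_list):
    """Return a copy of df with the listed columns coerced to numbers."""
    result = df.copy()
    result[column_list] = result[column_list].apply(pd.to_numeric, errors='coerce')
    return result

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

numeric_df = coerced_copy(df, ['col2', 'col3'])
print(numeric_df.dtypes)
```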

Reply #5

How about creating two dataframes, each with different data types for their columns, and then appending them together?

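# DataFrame.append() was removed in pandas 2.0; pd.concat(axis=1) joins the frames side by side
d1 = pd.DataFrame(columns=['float_column'], dtype=float)
d1 = pd.concat([d1, pd.DataFrame(columns=['string_column'], dtype=str)], axis=1)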

Results

In[8]:  d1.dtypes
Out[8]: 
float_column     float64
string_column     object
dtype: object

After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.


Reply #6

The code below changes the data type of the listed columns.

df[['col.name1', 'col.name2', ...]] = df[['col.name1', 'col.name2', ...]].astype('data_type')

Replace 'data_type' with whichever type you want, such as str, float or int.
