Adding a new column to an existing DataFrame in Python pandas

Latest recommended article published 2024-07-16 12:34:52

xfxf996

Tags: python pandas dataframe chained-assignment

Original link: https://oldbug.net/q/qgDD/Adding-new-column-to-existing-DataFrame-in-Python-pandas

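This article describes how to add a new column to a pandas DataFrame in Python, including assigning a NumPy array, creating a Series, using the index, and using the DataFrame assign method. It discusses potential errors and warnings, in particular those related to chained assignment, and offers solutions for different scenarios.

Summary generated automatically by CSDN.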

Translated from: Adding new column to existing DataFrame in Python pandas

I have the following indexed DataFrame with named columns and non-contiguous row numbers:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

I would like to add a new column, 'e', to the existing DataFrame without changing anything else in it (i.e., the new column always has the same length as the DataFrame).

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

I tried different versions of join, append, and merge, but I did not get the result I wanted; at most I got errors. How can I add column e to the above example?

#1

Reference: https://stackoom.com/question/qgDD/在Python-Pandas中向现有DataFrame添加新列

#2

Doing this directly via NumPy will be the most efficient:

df1['e'] = np.random.randn(sLength)

Note that my original (very old) suggestion was to use map, which is much slower:

df1['e'] = df1['a'].map(lambda x: np.random.random())
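
A minimal self-contained sketch of the direct NumPy assignment above. The sample DataFrame is made up for illustration, and sLength is assumed to be the number of rows in df1 (it is defined that way in a later answer).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a non-contiguous index, like the question's.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=list("abcd"), index=[2, 3, 5])

sLength = len(df1)  # number of rows in df1

# A plain NumPy array carries no index, so it is assigned by position.
df1["e"] = np.random.randn(sLength)

print(df1)
```

Because the array has no index, no label alignment takes place; the only requirement is that its length matches the number of rows.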

#3

Use the original df1 index to create the Series:

df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)

Edit 2015
Some users reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version, 0.16.1.

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> pd.__version__
'0.16.1'

The SettingWithCopyWarning aims to inform you of a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily mean you did something wrong (it can trigger false positives), but since 0.13.0 it lets you know that there are more appropriate methods for the same purpose. So if you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead

>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109
>>>

In fact, this is currently the more efficient method, as described in the pandas docs.
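
The warning typically shows up when the DataFrame being modified was itself obtained by slicing or filtering another one. The following is a minimal sketch, with made-up sample data, of one common way to remove the ambiguity: take an explicit copy before adding the column with .loc. Whether the warning is emitted at all depends on the pandas version, so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [-1.0, 0.5, 2.0], "b": [10, 20, 30]}, index=[2, 3, 5])

# Filtering returns a new object; .copy() makes its independence from df explicit.
subset = df[df["a"] > 0].copy()

# Add the column via .loc; the Series is aligned on subset's index.
subset.loc[:, "f"] = pd.Series(np.random.randn(len(subset)), index=subset.index)

print(subset)
print("f" in df.columns)  # shows whether the original frame gained the column
```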

Edit 2017

As indicated in the comments and by @Alexander, the best current method for adding the values of a Series as a new column of a DataFrame may be assign:

df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
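
A short sketch of how assign behaves, using made-up sample data. It returns a new DataFrame rather than modifying the original, and it accepts several columns in one call; the column name f and the lambda are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame(np.random.randn(2, 4), columns=list("abcd"), index=[6, 8])
values = np.random.randn(len(df1))

# assign returns a new DataFrame; df1 itself is not modified.
df2 = df1.assign(e=values)

# Several columns can be added at once; a callable receives the DataFrame.
df3 = df1.assign(e=values, f=lambda d: d["a"] + d["b"])

print(df1.columns.tolist())
print(df3.columns.tolist())
```

To keep the result under the original name, rebind it as the answer does: df1 = df1.assign(...).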

#4

Here is a simple way to add a new column: df['e'] = e
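
One caveat with the plain assignment above: if e is a Series, pandas aligns it with the DataFrame on index labels, not on position. The sketch below reuses the index from the question (2, 3, 5) and the Series values shown there; the DataFrame contents and the column name e_aligned are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same index labels as the question; the cell values are random placeholders.
df = pd.DataFrame(np.random.randn(3, 4), columns=list("abcd"), index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571])  # default index 0, 1, 2

# Assigning the Series directly aligns on labels: only label 2 exists in both.
df["e_aligned"] = e

# Assigning the underlying array ignores the index and fills by position.
df["e"] = e.to_numpy()

print(df[["e_aligned", "e"]])
```

With mismatched indexes, the label-aligned column ends up mostly NaN, which is why several answers pass index=df1.index or use .values.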

#5

One thing to note, though, is that if you do

df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)

this will effectively be a left join on df1.index. So if you want an outer join effect, my (probably imperfect) solution is to create a DataFrame whose index values cover the universe of your data, and then use the code above. For example,

data = pd.DataFrame(index=all_possible_values)
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
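
To make the left-join behaviour concrete, here is a small sketch with made-up data. For the outer-join effect it uses reindex on the union of both indexes, which is one possible way to build the "universe" index mentioned above and not necessarily what the answer's author had in mind.

```python
import pandas as pd

# Hypothetical data: the Series has label 7, which df1 lacks, and no label 2.
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])
s = pd.Series([10.0, 20.0, 30.0], index=[3, 5, 7])

# Left-join behaviour: label 7 is dropped, label 2 gets NaN.
left = df1.copy()
left["e"] = s

# Outer-join effect: reindex onto the union of both indexes first.
all_possible_values = df1.index.union(s.index)
outer = df1.reindex(all_possible_values)
outer["e"] = s

print(left)
print(outer)
```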

#6

I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength), index=df.index))

This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note: this only works once, and it raises an error if you try to overwrite an existing column.
Note: as mentioned above, from 0.16.0 assign is the best solution. See the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign It works well for data-flow style code where you don't overwrite your intermediate values.
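
The "only works once" behaviour of insert noted above can be seen in a short sketch. The sample data and the second_insert_failed flag are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(np.random.randn(3, 2), columns=["a", "b"], index=[2, 3, 5])
new_col = pd.Series(np.random.randn(len(df)), index=df.index)

# Insert at the end: position len(df.columns) is one past the last column.
df.insert(len(df.columns), "e", new_col)

# A second insert under the same name raises ValueError by default.
try:
    df.insert(len(df.columns), "e", new_col)
    second_insert_failed = False
except ValueError:
    second_insert_failed = True

print(df.columns.tolist())
print(second_insert_failed)
```

Unlike assign, insert modifies the DataFrame in place and returns None, so it should not be used in an expression such as df = df.insert(...).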