Creating an empty Pandas DataFrame, then filling it?

w36680130

Tags: python, dataframe, pandas

Original link: https://oldbug.net/q/vptg/Creating-an-empty-Pandas-DataFrame-then-filling-it

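This article discusses how to create and fill a Pandas DataFrame correctly. It stresses avoiding `append` or `pd.concat()` inside a loop and recommends collecting the data first and creating the DataFrame in one go for better performance. It also covers the performance problems of initialising an all-NaN DataFrame and refers to benchmark code.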

Summary generated automatically by CSDN.

Translated from: Creating an empty Pandas DataFrame, then filling it?

I'm starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I'd like to iteratively fill the DataFrame with values in a time-series kind of calculation. So basically, I'd like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I'm currently using the code below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I'm using Python 2.7.

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict

Reply #1

Reference: https://stackoom.com/question/vptg/创建一个空的Pandas-DataFrame-然后填充它

Reply #2

Here are a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do this type of calculation on the data, use a NumPy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
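The recurrence from the question (each row is the previous row plus one) can be computed on a plain NumPy array first and wrapped in a DataFrame only at the end. A minimal sketch, assuming a fixed start date so the result is reproducible; `n_rows`, `start` and `values` are illustrative names, not part of the answer:

```python
import numpy as np
import pandas as pd

n_rows = 10
columns = ['A', 'B', 'C']
start = '2012-11-29'  # assumed fixed start date (illustrative)

index = pd.date_range(start, periods=n_rows, freq='D')

# Pre-allocate the result, then fill each row from the row before it.
values = np.zeros((n_rows, len(columns)))
for t in range(1, n_rows):
    values[t] = values[t - 1] + 1

# Build the DataFrame once, after the calculation is finished.
df = pd.DataFrame(values, index=index, columns=columns)
```

The loop touches only the NumPy array, so the DataFrame is constructed a single time with a numeric dtype.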

Reply #3

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

In this example I am using this pandas doc to create a new data frame and then using append to write to newDF with data from oldDF.

Have a look at this:

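newDF = pd.DataFrame()  # creates a new dataframe that's empty
# DataFrame.append was removed in pandas 2.0; on newer versions use
# newDF = pd.concat([newDF, oldDF], ignore_index=True) instead.
newDF = newDF.append(oldDF, ignore_index=True)  # ignoring index is optional
# try printing some data from newDF
print(newDF.head())  # again optional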

If I have to keep appending new data to this newDF from more than one oldDF, I just use a for loop that calls pandas.DataFrame.append() on each one.
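When several incoming frames have to be combined, one option that avoids repeated copying is to keep them in a list and concatenate once. A minimal sketch; `old_dfs` is a hypothetical stand-in for the incoming oldDF frames:

```python
import pandas as pd

# Hypothetical stand-ins for the incoming "oldDF" frames.
old_dfs = [
    pd.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    pd.DataFrame({'A': [5], 'B': [6]}),
    pd.DataFrame({'A': [7, 8], 'B': [9, 10]}),
]

frames = []
for old_df in old_dfs:
    frames.append(old_df)  # cheap: only stores a reference to the frame

# One concatenation at the end instead of one per iteration.
new_df = pd.concat(frames, ignore_index=True)
```

`pd.concat` is also the replacement for `DataFrame.append` on pandas 2.0 and later, where `append` no longer exists.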

Reply #4

If you want to have your column names in place from the start, use this approach:

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

If you want to add a record to the dataframe, it would be better to use:

my_df.loc[len(my_df)] = [2, 4, 5]

You might also want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic

However, if you want to add another dataframe to my_df, do as follows:

col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns=col_names)
# DataFrame.append was removed in pandas 2.0; there, use pd.concat([my_df, my_df2]).
my_df = my_df.append(my_df2)

If you are adding rows inside a loop, consider performance issues:
For roughly the first 1000 records, "my_df.loc" performs better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do things inside a big loop (say 10M records or so):
You are better off using a mixture of these two: fill a temporary dataframe with loc until its size gets to around 1000, then append it to the original dataframe and empty the temporary dataframe. This can boost performance by around 10 times.
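The batching idea described above can be sketched as follows. This is a variation under stated assumptions, not the answer's exact code: full batches are kept in a list and joined once with `pd.concat` (so it also runs on pandas 2.x, where `append` is gone), `generate_rows` is a hypothetical data source, and the batch size is reduced to 100 so the sketch runs quickly.

```python
import pandas as pd

def generate_rows(n):
    """Hypothetical data source standing in for the real loop body."""
    for i in range(n):
        yield [i, i * 2, i * 3]

col_names = ['A', 'B', 'C']
batch_size = 100  # the answer suggests around 1000; smaller here for speed

chunks = []
temp = pd.DataFrame(columns=col_names)
for row in generate_rows(250):
    temp.loc[len(temp)] = row  # grow only the small temporary frame
    if len(temp) >= batch_size:
        chunks.append(temp)                     # flush the full batch
        temp = pd.DataFrame(columns=col_names)  # start a fresh, empty buffer

if len(temp) > 0:
    chunks.append(temp)  # flush whatever is left over

# Join the batches once, then ask pandas to infer better column dtypes.
my_df = pd.concat(chunks, ignore_index=True).infer_objects()
```

Because `loc` only ever enlarges a frame of at most `batch_size` rows, the cost of each enlargement stays bounded instead of growing with the total row count.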

Reply #5

Assume a dataframe with 19 rows:

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping column A as a constant:

test['A']=10

Keeping column b as a variable given by a loop:

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value.
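When the per-row values are already known, the whole column can be assigned in one step instead of row by row. A minimal sketch, assuming the same 19-row frame as above:

```python
import pandas as pd

test = pd.DataFrame(index=range(0, 19))
test['A'] = 10                  # constant column, broadcast to every row
test['b'] = list(range(0, 19))  # variable column, assigned in one step
```

Both columns end up with a numeric dtype, and no per-row `loc` assignment is needed.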

Reply #6

The Right Way™ to Create a DataFrame

TL;DR (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append to, and remove from (if needed).

The other advantage of this method is that dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below; both contain elements that require handling the index appropriately).
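The dtype and index claims above can be checked directly. A minimal sketch, in which `some_function_that_yields_data` is a hypothetical generator standing in for the real data source:

```python
import pandas as pd

def some_function_that_yields_data():
    """Hypothetical generator standing in for the real data source."""
    yield 1, 12.3, 'xyz'
    yield 2, 45.6, 'abc'

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

print(df.dtypes)  # A is an integer column, B is a float column
print(df.index)   # a RangeIndex, created without any extra work
```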

Things you should NOT do

`append` or `concat` inside a loop

Here is the biggest mistake I've seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

`loc` inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need, so the memory is re-grown each time you create a new row. It's just as bad as append, and even uglier.

Empty DataFrame of NaNs

And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the same issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
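A timing comparison along these lines might look like the sketch below. It is not the answer's original benchmark: `rows`, `via_list` and `via_loc` are hypothetical helper names, the row count is kept small so it finishes quickly, and the actual numbers depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

def rows(n):
    """Hypothetical data source: yields n (a, b, c) tuples."""
    for i in range(n):
        yield i, i * 0.5, str(i)

def via_list(n):
    # Collect everything in a list, then build the DataFrame once.
    data = [[a, b, c] for a, b, c in rows(n)]
    return pd.DataFrame(data, columns=['A', 'B', 'C'])

def via_loc(n):
    # Grow an initially empty DataFrame one row at a time.
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for a, b, c in rows(n):
        df.loc[len(df)] = [a, b, c]
    return df

n = 200  # kept small so the sketch finishes quickly
timings = {
    func.__name__: timeit.timeit(lambda: func(n), number=3)
    for func in (via_list, via_loc)
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```

Raising `n` makes the gap between the two approaches easier to see, since the cost of the row-by-row version grows faster than linearly.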