Need help understanding variable assignment, pointers, ...
The following is reproducible.
import pandas as pd
df = pd.DataFrame({
'listData': [
['c', 'f', 'd', 'a', 'e', 'b'],
[5, 2, 1, 4, 3]
]})
df['listDataSort'] = df['listData']
gives:
listData listDataSort
0 [c, f, d, a, e, b] [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3] [5, 2, 1, 4, 3]
If I only want to sort the lists in the listDataSort column, I might try:
df['listDataSort'].apply(lambda l: l.sort())
df
However, that sorts the lists in both columns, in-place.
listData listDataSort
0 [a, b, c, d, e, f] [a, b, c, d, e, f]
1 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
I can fix this by instead doing:
df = pd.DataFrame({
'listData': [
['c', 'f', 'd', 'a', 'e', 'b'],
[5, 2, 1, 4, 3]
]})
df['listDataSort'] = df['listData'].apply(sorted)
giving:
listData listDataSort
0 [c, f, d, a, e, b] [a, b, c, d, e, f]
1 [5, 2, 1, 4, 3] [1, 2, 3, 4, 5]
Assigning df to a different variable, say df2, has the same effect: changes made through df2 also show up in df. So how do I create a new DataFrame from an existing one, so that I can change the new DataFrame without making the same changes to the existing one?
df = pd.DataFrame({
'listData': [
['c', 'f', 'd', 'a', 'e', 'b'],
[5, 2, 1, 4, 3]
]})
df2 = df
print('\ndf\n', df)
print('\ndf2\n', df2)
df2['listDataSort'] = df2['listData']
print('\ndf\n', df)
print('\ndf2\n', df2)
df2['listDataSort'].apply(lambda l: l.sort())
print('\ndf\n', df)
print('\ndf2\n', df2)
prints:
df
listData
0 [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3]
df2
listData
0 [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3]
df
listData listDataSort
0 [c, f, d, a, e, b] [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3] [5, 2, 1, 4, 3]
df2
listData listDataSort
0 [c, f, d, a, e, b] [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3] [5, 2, 1, 4, 3]
df
listData listDataSort
0 [a, b, c, d, e, f] [a, b, c, d, e, f]
1 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
df2
listData listDataSort
0 [a, b, c, d, e, f] [a, b, c, d, e, f]
1 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
also:
df = pd.DataFrame({
'listData': [
['c', 'f', 'd', 'a', 'e', 'b'],
[5, 2, 1, 4, 3]
]})
print('\ndf\n', df)
df3 = df
df3['listDataSort'] = df3['listData'].apply(sorted)
print('\ndf\n', df)
print('\ndf3\n', df3)
prints:
df
listData
0 [c, f, d, a, e, b]
1 [5, 2, 1, 4, 3]
df
listData listDataSort
0 [c, f, d, a, e, b] [a, b, c, d, e, f]
1 [5, 2, 1, 4, 3] [1, 2, 3, 4, 5]
df3
listData listDataSort
0 [c, f, d, a, e, b] [a, b, c, d, e, f]
1 [5, 2, 1, 4, 3] [1, 2, 3, 4, 5]
Solution
When you run
df['listDataSort'] = df['listData']
all you do is copy references to the lists into the new column. No list is duplicated, so both columns point at the same list objects, and any in-place change made through one column (such as list.sort()) is visible through the other.
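To see the sharing directly, here is a minimal sketch that rebuilds the DataFrame from the question and compares the two cells with `is`. The variable name `same_object` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'listData': [
        ['c', 'f', 'd', 'a', 'e', 'b'],
        [5, 2, 1, 4, 3],
    ]
})
df['listDataSort'] = df['listData']

# Both cells hold the very same list object, not two equal lists.
same_object = df.loc[0, 'listData'] is df.loc[0, 'listDataSort']
print(same_object)

# list.sort() mutates that one object, so the change shows up in both columns.
df.loc[0, 'listDataSort'].sort()
print(df.loc[0, 'listData'])
```

Because there is only one list per row, it does not matter which column you reach it through.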
You can use a list comprehension with sorted, which returns a new sorted list instead of sorting the existing one in place. This should be the easiest option for you.
df['listDataSort'] = [sorted(x) for x in df['listDataSort']]
df
listData listDataSort
0 [c, f, d, a, e, b] [a, b, c, d, e, f]
1 [5, 2, 1, 4, 3] [1, 2, 3, 4, 5]
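The difference between sorted and list.sort() can be checked on a plain list, without pandas. This is only a small illustration; the names `original`, `before`, `new_list` and `result` are made up for the example.

```python
original = [5, 2, 1, 4, 3]

# sorted() builds and returns a new list; the input is left alone.
new_list = sorted(original)
before = list(original)  # snapshot taken before the in-place sort
print(new_list, before)

# list.sort() reorders the existing list and returns None,
# which is why apply(lambda l: l.sort()) changes the data in place.
result = original.sort()
print(result, original)
```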
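Copying the entire DataFrame is a little more complicated. Note that df2 = df copies nothing; it only gives the same DataFrame a second name. df.copy() does create a new DataFrame, but it does not copy the Python objects stored in the cells, so the lists would still be shared, and calling copy.deepcopy on a whole column behaves the same way. I would recommend applying deepcopy to each element instead:
import copy
# Deep-copy every cell so that df2 gets its own lists
df2 = df.apply(lambda col: col.map(copy.deepcopy))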
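As a rough check of the three cases (a second name, df.copy(), and an element-wise deep copy), the sketch below compares list identities. It assumes one way of deep-copying each cell, Series.map(copy.deepcopy) applied per column; the names `alias`, `copied` and `independent` are only illustrative.

```python
import copy
import pandas as pd

df = pd.DataFrame({
    'listData': [
        ['c', 'f', 'd', 'a', 'e', 'b'],
        [5, 2, 1, 4, 3],
    ]
})

alias = df           # a second name for the same DataFrame object
copied = df.copy()   # new DataFrame, but the cells still point at df's lists
# Element-wise deep copy: every cell gets its own new list.
independent = df.apply(lambda col: col.map(copy.deepcopy))

alias_is_df = alias is df
copy_shares_lists = copied.loc[0, 'listData'] is df.loc[0, 'listData']
deep_shares_lists = independent.loc[0, 'listData'] is df.loc[0, 'listData']
print(alias_is_df, copy_shares_lists, deep_shares_lists)

# Sorting in place through the independent copy leaves df untouched.
independent['listData'].apply(lambda l: l.sort())
print(df.loc[0, 'listData'])
print(independent.loc[0, 'listData'])
```

If the cells held only immutable values (numbers, strings), df.copy() alone would be enough; the per-element deep copy matters because the cells hold mutable lists.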