Python 玩转数据 9 - Pandas 增删改查 CRUD: Create, Read, Update, Delete

引言

本文主要介绍 Pandas 抽象数据类型的增 create 删 delete 改 update 查 read,更多 Python 进阶系列文章,请参考 Python 进阶学习 玩转数据系列

内容提要:

  1. Create: list-like, array-like, NumPy arrays, dict of objs, serialized objs, etc
  2. Read: Getting Values, Indexing, Iterations
  3. Update: Overwrite, Add, Set Values, Append, Concatenate
  4. Delete: Item, Rows, Columns: pop(), drop(), reindex()

Create

1D Series: Creation

Pandas 1 维数据类型 Series 可以从下面几种数据结构中创建:

  1. list, tuple, or array-like ADT
  2. NumPy array
  3. dictionary:
    ● 因为 dict 中 keys 是唯一的,所以 index也是唯一的
    ● 但是 Pandas 允许不唯一的 index,所以可以通过其它方法来解决。

举例:

import pandas as pd
import numpy as np

s_from_list = pd.Series(['Ann', 'Tom'])
s_from_tuple = pd.Series(('Ann', 'Tom'))
s_from_ndarray = pd.Series(np.random.random(3))

# dictionary: with duplicate keys
# does not allow to create non-unique indices
s_from_dic = pd.Series({'1969':7, '1970':22, '1970':1})

# possible fix to non-unique index
s_from_dic_fix = pd.Series ({'1969':7, '1970':[1, 22]})

print("s_from_list:\n{}".format(s_from_list))
print("s_from_tuple:\n{}".format(s_from_tuple))
print("s_from_ndarray:\n{}".format(s_from_ndarray))
print("s_from_dic:\n{}".format(s_from_dic))
print("s_from_dic_fix:\n{}".format(s_from_dic_fix))

输出:

s_from_list: 
0    Ann     
1    Tom     
dtype: object
s_from_tuple:
0    Ann
1    Tom
dtype: object
s_from_ndarray:
0    0.192336
1    0.954272
2    0.996505
dtype: float64
s_from_dic:
1969    7
1970    1
dtype: int64
s_from_dic_fix:
1969          7
1970    [1, 22]
dtype: object

2D DataFrame: Creation

Pandas 2 维数据类型 DataFrame 可以通过如下几种方式创建:

  1. numpy ndarray
  2. dict of Series, arrays, lists, tuples, or list-like objects, constants
  3. DataFrame
  4. CSV or JSON file
  5. from serialized object
import pandas as pd
import numpy as np

# lists of dicts
df_from_list_of_dict = pd.DataFrame ([{'Name':'Ann', 'age':15},{'Name':'Tom', 'age':13}])
# dict of lists
df_from_dict_of_list = pd.DataFrame({'Name': ['Ann', 'Tom'], 'age': [15, 13]})
# dict of tuples
df_from_dict_of_tuple = pd.DataFrame({'Name': ('Ann', 'Tom'),'age': (15, 13)})
# dict of Series
s1 = pd.Series([13, 15])
s2 = pd.Series(['Ann', 'Tom'])
df_from_dict_of_series = pd.DataFrame({'age':s1, 'Name':s2})
# dict of arrays
df_from_dict_of_array = pd.DataFrame({'V1':np.random.random(5),'V2': np.random.random(5)})
# NumPy Array
df_from_numpy_array = pd.DataFrame(np.random.randn(5,3), columns=['V1', 'V2', 'V3'])
# from a serialized dict
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
# create a serialized dict object
serialized_dict = sales_df.to_dict()
# create a df from a serialized dict
df_from_serialized_dict = pd.DataFrame.from_dict(serialized_dict)
# create a serialized JSON object
serialized_json = sales_df.to_json()
# create a df from a serialized json object
df_from_serialized_json = pd.read_json (serialized_json)

print("df_from_list_of_dict [{{'Name':'Ann', 'age':15}},{{'Name':'Tom', 'age':13}}]:\n{}".format(df_from_list_of_dict))
print("df_from_dict_of_list {{'Name': ['Ann', 'Tom'], 'age': [15, 13]}}:\n{}".format(df_from_dict_of_list))
print("df_from_dict_of_tuple {{'Name': ('Ann', 'Tom'),'age': (15, 13)}}:\n{}".format(df_from_dict_of_tuple))
print("df_from_dict_of_series {{'age':s1, 'Name':s2}}:\n{}".format(df_from_dict_of_series))
print("df_from_dict_of_array {{'V1':np.random.random(5),'V2': np.random.random(5)}}:\n{}".format(df_from_dict_of_array))
print("df_from_numpy_array np.random.randn(5,3), columns=['V1', 'V2', 'V3']:\n{}".format(df_from_numpy_array))
print("serialized_dict:\n{}".format(serialized_dict))
print("df_from_serialized_dict from_dict(serialized_dict):\n{}".format(df_from_serialized_dict))
print("serialized_json:\n{}".format(serialized_json))
print("df_from_serialized_json read_json (serialized_json):\n{}".format(df_from_serialized_json))

输出:

df_from_list_of_dict [{'Name':'Ann', 'age':15},{'Name':'Tom', 'age':13}]:
  Name  age
0  Ann   15
1  Tom   13
df_from_dict_of_list {'Name': ['Ann', 'Tom'], 'age': [15, 13]}:
  Name  age
0  Ann   15
1  Tom   13
df_from_dict_of_tuple {'Name': ('Ann', 'Tom'),'age': (15, 13)}:
  Name  age
0  Ann   15
1  Tom   13
df_from_dict_of_series {'age':s1, 'Name':s2}:
   age Name
0   13  Ann
1   15  Tom
df_from_dict_of_array {'V1':np.random.random(5),'V2': np.random.random(5)}:
         V1        V2
0  0.786723  0.550221
1  0.506147  0.408700
2  0.370957  0.934121
3  0.738368  0.029502
4  0.864802  0.460838
df_from_numpy_array np.random.randn(5,3), columns=['V1', 'V2', 'V3']:
         V1        V2        V3
0  0.173267  0.528672  1.188062
1  2.168115  0.177140  0.283160
2  0.380348 -0.063541  0.099606
3  0.880627  2.348653 -0.038103
4  0.052342  0.088939 -0.662433
serialized_dict:
{'UPS': {0: 1234, 1: 1234, 2: 1234, 3: 789, 4: 789, 5: 789, 6: 789}, 'Units': {0: 5.0, 1: 2.0, 2: 3.0, 3: 1.0, 4: 2.0, 5: nan, 6: 1.0}, 'Sales': {0: 20.2, 1: 8.0, 2: 13.0, 3: 2.0, 4: 3.8, 5: nan, 6: 1.8}, 'Date': {0: '1/1/2014', 1: '1/2/2014', 2: '1/3/2014', 3: '1/1/2014', 4: '1/2/2014', 5: '1/3/2014', 6: '1/5/2014'}}     
df_from_serialized_dict from_dict(serialized_dict):
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
serialized_json:
{"UPS":{"0":1234,"1":1234,"2":1234,"3":789,"4":789,"5":789,"6":789},"Units":{"0":5.0,"1":2.0,"2":3.0,"3":1.0,"4":2.0,"5":null,"6":1.0},"Sales":{"0":20.2,"1":8.0,"2":13.0,"3":2.0,"4":3.8,"5":null,"6":1.8},"Date":{"0":"1\/1\/2014","1":"1\/2\/2014","2":"1\/3\/2014","3":"1\/1\/2014","4":"1\/2\/2014","5":"1\/3\/2014","6":"1\/5\/2014"}}
df_from_serialized_json read_json (serialized_json):
    UPS  Units  Sales       Date
0  1234    5.0   20.2 2014-01-01
1  1234    2.0    8.0 2014-01-02
2  1234    3.0   13.0 2014-01-03
3   789    1.0    2.0 2014-01-01
4   789    2.0    3.8 2014-01-02
5   789    NaN    NaN 2014-01-03
6   789    1.0    1.8 2014-01-05

Read

Reading: Indexing: Behavior of iloc and loc

更多细节请参考 Python 玩转数据 - Pandas Indexing and Slicing
在这里插入图片描述

Reading: Getting and Setting Values

MethodResult
get(label, [default])Returns a scalar (or Series if duplicate indexes) for label or default on failed lookup
get_value(label)Returns a scalar (or Series if duplicate indexes) for label; KeyError if label does not exist
set_value(label, value)Returns a new Series with label and value inserted (or updated)

注意:有些版本不支持 get_value 和 set_value 方法。

import pandas as pd

s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])
print("s:\n{}".format(s))
print("s.get('Hunt'):{}".format(s.get('Hunt')))
print("s.get('Doyle', 'Jon'):{}".format(s.get('Doyle', 'Jon')))
print("s.get('Doyle'):{}".format(s.get('Doyle')))

输出:

s:
Hunt     Ann
Geist    Tim
dtype: object
s.get('Hunt'):Ann
s.get('Doyle', 'Jon'):Jon
s.get('Doyle'):None

Reading: Iterations over Series

  1. values:
    ● for val in s:
    ● for val in s.values():
  2. index:
    ● for idx in s.keys()
  3. index, value (unpacked tuples)
    ● for idx, val in s.iteritems():
  4. (index, value) tuples:
    ● for item in s.iteritems():

注意: Operations performed during iteration are not vectorized in Pandas; they have overhead

举例:

import pandas as pd

s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])

print("s:\n{}".format(s))

# to access values of the Series
print("for value in s:")
for value in s:
    print(value)

# to access values of the Series
print("value in s.values:")
for value in s.values:
    print (value)

# to access index
print("for idx in s.keys():")
for idx in s.keys():
    print (idx)    

# access index-value pairs 
# with tuple unpacking
print("for idx, val in s.iteritems():")
for idx, val in s.iteritems():
    print (idx, val)

# access index-value pair tuples
print("for item in s.iteritems():")
for item in s.iteritems():
    print (item)

输出:

s:
Hunt     Ann
Geist    Tim
dtype: object
for value in s:
Ann
Tim
value in s.values:
Ann
Tim
for idx in s.keys():
Hunt
Geist
for idx, val in s.iteritems():
Hunt Ann
Geist Tim
for item in s.iteritems():
('Hunt', 'Ann')
('Geist', 'Tim')

Reading: Iterations over DataFrames

  1. column names:
    ● for col in df:
    ● for col in df.key():
  2. column names and columns as a Series:
    ● for col, ser in df.iteritems():
  3. rows:
    ● for row in df.iterrows():
  4. named tuples containing index and row values:
    ● for row in df.itertuples():

举例:

import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Tom'],'age': [15, 13]})

print("df:\n{}".format(df))
print("for column in df:")
# iterate over column names
for column in df:
    print (column)

print("for column in df.keys():")
# iterate over column names
for column in df.keys():
    print (column)

print("for column in df.columns:")
# iterate over column names
for column in df.columns:
    print (column)

print("for col, ser in df.iteritems():")
# iterate over column names
# and columns as a Series
for col, ser in df.iteritems():
    print (col, ser)

print("for row in df.iterrows():")
# iterate over rows
for row in df.iterrows():
    print (row)

print("for row in df.itertuples():")
# iterate over named tuples
# containing the index and row values
for row in df.itertuples():
    print (row)

输出:

df:
  Name  age
0  Ann   15
1  Tom   13
for column in df:
Name
age
for column in df.keys():
Name
age
for column in df.columns:
Name
age
for col, ser in df.iteritems():
Name 0    Ann
1    Tom
Name: Name, dtype: object
age 0    15
1    13
Name: age, dtype: int64
for row in df.iterrows():
(0, Name    Ann
age      15
Name: 0, dtype: object)
(1, Name    Tom
age      13
Name: 1, dtype: object)
for row in df.itertuples():
Pandas(Index=0, Name='Ann', age=15)
Pandas(Index=1, Name='Tom', age=13)

Update

Update: Series

  1. 覆盖: 基于存在的index 标签更新:
    ● 原来的值会被新值覆盖
  2. 覆盖: 基于位置的更新 .iloc
    ● 原来的值会被新值覆盖
    IndexError: if index integer value is out-of-bounds
  3. 新加: 基于不存在的 index 标签:
    ● 新的 item 将新加到原来的 series
  4. .append 追加 (生成一个新的 series):
    ● 原始的series是不会更新的
    ● 一个新的 series 返回
    ● 加有相同的index 标签的 item 也是可以的
  5. .set_value (label, value) 有些版本不支持:
    ● adds a new item for non-existing label
    ● overwrites the value for the existing label
    ● returns a modified series
    ● replaces all occurrences of a non-unique index label with a new value

举例:

import pandas as pd

s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])

print("s:\n{}".format(s))
# Overwriting the existing value
s['Geist'] = 'John'
print("s['Geist'] = 'John'-> the updated s:\n{}".format(s))
# Adding new value
s['Smith'] = 'Jimmy'
print("s['Smith'] = 'Jimmy'-> the updated s:\n{}".format(s))

# Position-based overwriting
s.iloc[0] = 'Mary'
print("s.iloc[0] = 'Mary' -> the updated s:\n{}".format(s))

# IndexError: integer index value is out-of-bounds
# s.iloc[5] = 'Liz'

# .append()
new_ser = s.append(pd.Series( 
    {'Doyle': 'Jon', 'Hunt':'Ann'} ))

print("new_ser:\n{}".format(new_ser)) 
print("s:\n{}".format(s))

输出:

s:
Hunt     Ann 
Geist    Tim 
dtype: object
s['Geist'] = 'John'-> the updated s: 
Hunt      Ann
Geist    John
dtype: object
s['Smith'] = 'Jimmy'-> the updated s:
Hunt       Ann
Geist     John
Smith    Jimmy
dtype: object
s.iloc[0] = 'Mary' -> the updated s:
Hunt      Mary
Geist     John
Smith    Jimmy
dtype: object
new_ser:
Hunt      Mary
Geist     John
Smith    Jimmy
Doyle      Jon
Hunt       Ann
dtype: object
s:
Hunt      Mary
Geist     John
Smith    Jimmy
dtype: object

Update: DataFrame

  1. 合并行 Combine rows:
    ● pd.concat ( [list_of_dfs] )
    ● 主要功能就是合并多个一个list 中的多个 df
    ● 原始的 df 是不会改变的
    ● 生成新的 df 合并来自第二个 df 位置 index 是从 0 开始的
  2. 增加一列:
    ● 基于新标签的赋值一个series,新的标签就是新的 column 名

pd.concat ( [list_of_dfs] ) 例子:

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

more_df = pd.DataFrame([(1234, 4.0, 10.0, '1/4/2014')], columns = ['UPS', 'Units', 'Sales', 'Date'] )
print("salemore_dfs_df:\n{}".format(more_df))

# the positional index for the second df 
# starts at zero for the new df
new_df = pd.concat ([sales_df, more_df])
print("new_df:\n{}".format(new_df))
print("sales_df:\n{}".format(sales_df))
print("salemore_dfs_df:\n{}".format(more_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
salemore_dfs_df:
    UPS  Units  Sales      Date
0  1234    4.0   10.0  1/4/2014
new_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
0  1234    4.0   10.0  1/4/2014
sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
salemore_dfs_df:
    UPS  Units  Sales      Date
0  1234    4.0   10.0  1/4/2014

新增列例子:

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

sales_df['Priority'] = pd.Series(range(7))
sales_df['Category'] = 'Food'

print("sales_df after new column added:\n{}".format(sales_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
sales_df after new column added:
    UPS  Units  Sales      Date  Priority Category
0  1234    5.0   20.2  1/1/2014         0     Food
1  1234    2.0    8.0  1/2/2014         1     Food
2  1234    3.0   13.0  1/3/2014         2     Food
3   789    1.0    2.0  1/1/2014         3     Food
4   789    2.0    3.8  1/2/2014         4     Food
5   789    NaN    NaN  1/3/2014         5     Food
6   789    1.0    1.8  1/5/2014         6     Food

Delete

Deleting Series Items

  1. 不寻常的操作:
    ● 直接在原 series上操作数据,取代创建一个新的 copy series
  2. 基于标签 Label-based 删除 item 用 del
  3. 注意: 删除不唯一的 index 标签的 item,会导致难以预料的结果

删除唯一的index 标签

import pandas as pd

s = pd.Series(['Ann', 'Tim'], index = ['Hunt', 'Geist'])
print("origin s:\n{}".format(s))
del s['Geist']
print("s after deleting del s['Geist']:\n{}".format(s))

输出:

origin s:
Hunt     Ann
Geist    Tim
dtype: object
s after deleting del s['Geist']:
Hunt    Ann
dtype: object

删除不唯一的index 标签

import pandas as pd

s = pd.Series(['Ann', 'Tim', 'John'], index = ['Hunt', 'Geist', 'Hunt'])
print("origin s:\n{}".format(s))
del s['Hunt']
print("s after deleting del s['Hunt']:\n{}".format(s))

输出:

origin s:
Hunt      Ann
Geist     Tim
Hunt     John
dtype: object
s after deleting del s['Hunt']:
Geist    Tim
dtype: object

Delete DataFrame

Deleting DataFrame Rows with .drop() method

.drop (list_of_row_index_values, axis=0):
● 返回所有剩下的 rows(s)
● 原来的 df 是不会改变的
● 可以删除一行或多行
● ValueError: 删除不存在的 row index

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

remaining_rows = sales_df.drop([2, 4, 6], axis = 0)
print("remaining_rows:\n{}".format(remaining_rows))
print("sales_df:\n{}".format(sales_df))

# ValueError: if the list contains non-existent row index values
# sales_df.drop(7, axis = 0)

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
remaining_rows:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
3   789    1.0    2.0  1/1/2014
5   789    NaN    NaN  1/3/2014
sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014

Deleting DataFrame Columns

.pop (column_label):

● 返回从原来的 df 中被移除的列
原来的 df 是会被改变的
● 只能移除一列,不能移除多个列
● KeyError: 如果列名不存在

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

sales_only = sales_df.pop('Sales')

# cannot pop more than one column
# sales_units = sales_df.pop(['Units','Sales'])

print("sales_only:\n{}".format(sales_only))
print("sales_df after pop:\n{}".format(sales_df))

# add the removed column back
sales_df['Sales'] = sales_only
print("sales_df after add back:\n{}".format(sales_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
sales_only:
0    20.2  
1     8.0  
2    13.0
3     2.0
4     3.8
5     NaN
6     1.8
Name: Sales, dtype: float64
sales_df after pop:
    UPS  Units      Date
0  1234    5.0  1/1/2014
1  1234    2.0  1/2/2014
2  1234    3.0  1/3/2014
3   789    1.0  1/1/2014
4   789    2.0  1/2/2014
5   789    NaN  1/3/2014
6   789    1.0  1/5/2014
sales_df after add back:
    UPS  Units      Date  Sales
0  1234    5.0  1/1/2014   20.2
1  1234    2.0  1/2/2014    8.0
2  1234    3.0  1/3/2014   13.0
3   789    1.0  1/1/2014    2.0
4   789    2.0  1/2/2014    3.8
5   789    NaN  1/3/2014    NaN
6   789    1.0  1/5/2014    1.8

.drop (list_of_col_labels, axis=1):

● 返回不包括在参数 list_of_col_labels 的所有其它列
原来的 df 不会改变
● 可以 drop 一列或多列
● ValueError: 如果列名不存在

举例:

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

remaining_df = sales_df.drop (['Units', 'Sales'], axis = 1)
# ValueError: if dropping non-existent column label
# sales_df.drop('Category')

print("remaining_df:\n{}".format(remaining_df))
print("sales_df after drop:\n{}".format(sales_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
remaining_df:
    UPS      Date
0  1234  1/1/2014
1  1234  1/2/2014
2  1234  1/3/2014
3   789  1/1/2014
4   789  1/2/2014
5   789  1/3/2014
6   789  1/5/2014
sales_df after drop:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014

.reindex (list_of_columns):

● 创建一个新的 df,包含想要的列数据
● 原 df 不会改变

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

cols = ['Sales', 'Date']
sales_date_only = sales_df.reindex (columns = cols)

print("sales_date_only:\n{}".format(sales_date_only))
print("sales_df after reindex:\n{}".format(sales_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
sales_date_only:
   Sales      Date
0   20.2  1/1/2014
1    8.0  1/2/2014
2   13.0  1/3/2014
3    2.0  1/1/2014
4    3.8  1/2/2014
5    NaN  1/3/2014
6    1.8  1/5/2014
sales_df after reindex:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014

Indexing with a list of new columns: df [ list_of_columns]

● 返回选中的列
原 df 不会改变
● KeyError: 如果列名标签不存在

import pandas as pd

sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))

sales_units = sales_df[['Units','Sales']]

print("sales_units:\n{}".format(sales_units))
print("sales_df after indexing:\n{}".format(sales_df))

输出:

sales_df:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
sales_units:
   Units  Sales
0    5.0   20.2
1    2.0    8.0
2    3.0   13.0
3    1.0    2.0
4    2.0    3.8
5    NaN    NaN
6    1.0    1.8
sales_df after indexing:
    UPS  Units  Sales      Date
0  1234    5.0   20.2  1/1/2014
1  1234    2.0    8.0  1/2/2014
2  1234    3.0   13.0  1/3/2014
3   789    1.0    2.0  1/1/2014
4   789    2.0    3.8  1/2/2014
5   789    NaN    NaN  1/3/2014
6   789    1.0    1.8  1/5/2014
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值