引言
本文主要介绍 Pandas 抽象数据类型的增 create 删 delete 改 update 查 read,更多 Python 进阶系列文章,请参考 Python 进阶学习 玩转数据系列
内容提要:
- Create: list-like, array-like, NumPy arrays, dict of objs, serialized objs, etc
- Read: Getting Values, Indexing, Iterations
- Update: Overwrite, Add, Set Values, Append, Concatenate
- Delete: Item, Rows, Columns: pop(), drop(), reindex()
Create
1D Series: Creation
Pandas 1 维数据类型 Series 可以从下面几种数据结构中创建:
- list, tuple, or array-like ADT
- NumPy array
- dictionary:
● 因为 dict 中 keys 是唯一的,所以 index也是唯一的
● 但是 Pandas 允许不唯一的 index,所以可以通过其它方法来解决。
举例:
import pandas as pd
import numpy as np
s_from_list = pd.Series(['Ann', 'Tom'])
s_from_tuple = pd.Series(('Ann', 'Tom'))
s_from_ndarray = pd.Series(np.random.random(3))
# dictionary: with duplicate keys
# does not allow to create non-unique indices
s_from_dic = pd.Series({'1969':7, '1970':22, '1970':1})
# possible fix to non-unique index
s_from_dic_fix = pd.Series ({'1969':7, '1970':[1, 22]})
print("s_from_list:\n{}".format(s_from_list))
print("s_from_tuple:\n{}".format(s_from_tuple))
print("s_from_ndarray:\n{}".format(s_from_ndarray))
print("s_from_dic:\n{}".format(s_from_dic))
print("s_from_dic_fix:\n{}".format(s_from_dic_fix))
输出:
s_from_list:
0 Ann
1 Tom
dtype: object
s_from_tuple:
0 Ann
1 Tom
dtype: object
s_from_ndarray:
0 0.192336
1 0.954272
2 0.996505
dtype: float64
s_from_dic:
1969 7
1970 1
dtype: int64
s_from_dic_fix:
1969 7
1970 [1, 22]
dtype: object
2D DataFrame: Creation
Pandas 2 维数据类型 DataFrame 可以通过如下几种方式创建:
- numpy ndarray
- dict of Series, arrays, lists, tuples, or list-like objects, constants
- DataFrame
- CSV or JSON file
- from serialized object
import pandas as pd
import numpy as np
# lists of dicts
df_from_list_of_dict = pd.DataFrame ([{'Name':'Ann', 'age':15},{'Name':'Tom', 'age':13}])
# dict of lists
df_from_dict_of_list = pd.DataFrame({'Name': ['Ann', 'Tom'], 'age': [15, 13]})
# dict of tuples
df_from_dict_of_tuple = pd.DataFrame({'Name': ('Ann', 'Tom'),'age': (15, 13)})
# dict of Series
s1 = pd.Series([13, 15])
s2 = pd.Series(['Ann', 'Tom'])
df_from_dict_of_series = pd.DataFrame({'age':s1, 'Name':s2})
# dict of arrays
df_from_dict_of_array = pd.DataFrame({'V1':np.random.random(5),'V2': np.random.random(5)})
# NumPy Array
df_from_numpy_array = pd.DataFrame(np.random.randn(5,3), columns=['V1', 'V2', 'V3'])
# from a serialized dict
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
# create a serialized dict object
serialized_dict = sales_df.to_dict()
# create a df from a serialized dict
df_from_serialized_dict = pd.DataFrame.from_dict(serialized_dict)
# create a serialized JSON object
serialized_json = sales_df.to_json()
# create a df from a serialized json object
df_from_serialized_json = pd.read_json (serialized_json)
print("df_from_list_of_dict [{{'Name':'Ann', 'age':15}},{{'Name':'Tom', 'age':13}}]:\n{}".format(df_from_list_of_dict))
print("df_from_dict_of_list {{'Name': ['Ann', 'Tom'], 'age': [15, 13]}}:\n{}".format(df_from_dict_of_list))
print("df_from_dict_of_tuple {{'Name': ('Ann', 'Tom'),'age': (15, 13)}}:\n{}".format(df_from_dict_of_tuple))
print("df_from_dict_of_series {{'age':s1, 'Name':s2}}:\n{}".format(df_from_dict_of_series))
print("df_from_dict_of_array {{'V1':np.random.random(5),'V2': np.random.random(5)}}:\n{}".format(df_from_dict_of_array))
print("df_from_numpy_array np.random.randn(5,3), columns=['V1', 'V2', 'V3']:\n{}".format(df_from_numpy_array))
print("serialized_dict:\n{}".format(serialized_dict))
print("df_from_serialized_dict from_dict(serialized_dict):\n{}".format(df_from_serialized_dict))
print("serialized_json:\n{}".format(serialized_json))
print("df_from_serialized_json read_json (serialized_json):\n{}".format(df_from_serialized_json))
输出:
df_from_list_of_dict [{'Name':'Ann', 'age':15},{'Name':'Tom', 'age':13}]:
Name age
0 Ann 15
1 Tom 13
df_from_dict_of_list {'Name': ['Ann', 'Tom'], 'age': [15, 13]}:
Name age
0 Ann 15
1 Tom 13
df_from_dict_of_tuple {'Name': ('Ann', 'Tom'),'age': (15, 13)}:
Name age
0 Ann 15
1 Tom 13
df_from_dict_of_series {'age':s1, 'Name':s2}:
age Name
0 13 Ann
1 15 Tom
df_from_dict_of_array {'V1':np.random.random(5),'V2': np.random.random(5)}:
V1 V2
0 0.786723 0.550221
1 0.506147 0.408700
2 0.370957 0.934121
3 0.738368 0.029502
4 0.864802 0.460838
df_from_numpy_array np.random.randn(5,3), columns=['V1', 'V2', 'V3']:
V1 V2 V3
0 0.173267 0.528672 1.188062
1 2.168115 0.177140 0.283160
2 0.380348 -0.063541 0.099606
3 0.880627 2.348653 -0.038103
4 0.052342 0.088939 -0.662433
serialized_dict:
{'UPS': {0: 1234, 1: 1234, 2: 1234, 3: 789, 4: 789, 5: 789, 6: 789}, 'Units': {0: 5.0, 1: 2.0, 2: 3.0, 3: 1.0, 4: 2.0, 5: nan, 6: 1.0}, 'Sales': {0: 20.2, 1: 8.0, 2: 13.0, 3: 2.0, 4: 3.8, 5: nan, 6: 1.8}, 'Date': {0: '1/1/2014', 1: '1/2/2014', 2: '1/3/2014', 3: '1/1/2014', 4: '1/2/2014', 5: '1/3/2014', 6: '1/5/2014'}}
df_from_serialized_dict from_dict(serialized_dict):
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
serialized_json:
{"UPS":{"0":1234,"1":1234,"2":1234,"3":789,"4":789,"5":789,"6":789},"Units":{"0":5.0,"1":2.0,"2":3.0,"3":1.0,"4":2.0,"5":null,"6":1.0},"Sales":{"0":20.2,"1":8.0,"2":13.0,"3":2.0,"4":3.8,"5":null,"6":1.8},"Date":{"0":"1\/1\/2014","1":"1\/2\/2014","2":"1\/3\/2014","3":"1\/1\/2014","4":"1\/2\/2014","5":"1\/3\/2014","6":"1\/5\/2014"}}
df_from_serialized_json read_json (serialized_json):
UPS Units Sales Date
0 1234 5.0 20.2 2014-01-01
1 1234 2.0 8.0 2014-01-02
2 1234 3.0 13.0 2014-01-03
3 789 1.0 2.0 2014-01-01
4 789 2.0 3.8 2014-01-02
5 789 NaN NaN 2014-01-03
6 789 1.0 1.8 2014-01-05
Read
Reading: Indexing: Behavior of iloc and loc
更多细节请参考 Python 玩转数据 - Pandas Indexing and Slicing
Reading: Getting and Setting Values
Method | Result |
---|---|
get(label, [default]) | Returns a scalar (or Series if duplicate indexes) for label or default on failed lookup |
get_value(label) | Returns a scalar (or Series if duplicate indexes) for label; KeyError if label does not exist |
set_value(label, value) | Returns a new Series with label and value inserted (or updated) |
注意:有些版本不支持 get_value 和 set_value 方法。
import pandas as pd
s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])
print("s:\n{}".format(s))
print("s.get('Hunt'):{}".format(s.get('Hunt')))
print("s.get('Doyle', 'Jon'):{}".format(s.get('Doyle', 'Jon')))
print("s.get('Doyle'):{}".format(s.get('Doyle')))
输出:
s:
Hunt Ann
Geist Tim
dtype: object
s.get('Hunt'):Ann
s.get('Doyle', 'Jon'):Jon
s.get('Doyle'):None
Reading: Iterations over Series
- values:
● for val in s:
● for val in s.values(): - index:
● for idx in s.keys() - index, value (unpacked tuples)
● for idx, val in s.iteritems(): - (index, value) tuples:
● for item in s.iteritems():
注意:
Operations performed during iteration are not vectorized in Pandas; they have overhead
举例:
import pandas as pd
s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])
print("s:\n{}".format(s))
# to access values of the Series
print("for value in s:")
for value in s:
print(value)
# to access values of the Series
print("value in s.values:")
for value in s.values:
print (value)
# to access index
print("for idx in s.keys():")
for idx in s.keys():
print (idx)
# access index-value pairs
# with tuple unpacking
print("for idx, val in s.iteritems():")
for idx, val in s.iteritems():
print (idx, val)
# access index-value pair tuples
print("for item in s.iteritems():")
for item in s.iteritems():
print (item)
输出:
s:
Hunt Ann
Geist Tim
dtype: object
for value in s:
Ann
Tim
value in s.values:
Ann
Tim
for idx in s.keys():
Hunt
Geist
for idx, val in s.iteritems():
Hunt Ann
Geist Tim
for item in s.iteritems():
('Hunt', 'Ann')
('Geist', 'Tim')
Reading: Iterations over DataFrames
- column names:
● for col in df:
● for col in df.key(): - column names and columns as a Series:
● for col, ser in df.iteritems(): - rows:
● for row in df.iterrows(): - named tuples containing index and row values:
● for row in df.itertuples():
举例:
import pandas as pd
df = pd.DataFrame({'Name': ['Ann', 'Tom'],'age': [15, 13]})
print("df:\n{}".format(df))
print("for column in df:")
# iterate over column names
for column in df:
print (column)
print("for column in df.keys():")
# iterate over column names
for column in df.keys():
print (column)
print("for column in df.columns:")
# iterate over column names
for column in df.columns:
print (column)
print("for col, ser in df.iteritems():")
# iterate over column names
# and columns as a Series
for col, ser in df.iteritems():
print (col, ser)
print("for row in df.iterrows():")
# iterate over rows
for row in df.iterrows():
print (row)
print("for row in df.itertuples():")
# iterate over named tuples
# containing the index and row values
for row in df.itertuples():
print (row)
输出:
df:
Name age
0 Ann 15
1 Tom 13
for column in df:
Name
age
for column in df.keys():
Name
age
for column in df.columns:
Name
age
for col, ser in df.iteritems():
Name 0 Ann
1 Tom
Name: Name, dtype: object
age 0 15
1 13
Name: age, dtype: int64
for row in df.iterrows():
(0, Name Ann
age 15
Name: 0, dtype: object)
(1, Name Tom
age 13
Name: 1, dtype: object)
for row in df.itertuples():
Pandas(Index=0, Name='Ann', age=15)
Pandas(Index=1, Name='Tom', age=13)
Update
Update: Series
- 覆盖: 基于
存在的index 标签
更新:
● 原来的值会被新值覆盖 - 覆盖: 基于位置的更新 .iloc
● 原来的值会被新值覆盖
●IndexError: if index integer value is out-of-bounds
- 新加: 基于
不存在的 index 标签
:
● 新的 item 将新加到原来的 series - .append 追加 (生成一个新的 series):
● 原始的series是不会更新的
● 一个新的 series 返回
● 加有相同的index 标签的 item 也是可以的 - .set_value (label, value)
有些版本不支持
:
● adds a new item for non-existing label
● overwrites the value for the existing label
● returns a modified series
● replaces all occurrences of a non-unique index label with a new value
举例:
import pandas as pd
s = pd.Series(['Ann', 'Tim'], index=['Hunt', 'Geist'])
print("s:\n{}".format(s))
# Overwriting the existing value
s['Geist'] = 'John'
print("s['Geist'] = 'John'-> the updated s:\n{}".format(s))
# Adding new value
s['Smith'] = 'Jimmy'
print("s['Smith'] = 'Jimmy'-> the updated s:\n{}".format(s))
# Position-based overwriting
s.iloc[0] = 'Mary'
print("s.iloc[0] = 'Mary' -> the updated s:\n{}".format(s))
# IndexError: integer index value is out-of-bounds
# s.iloc[5] = 'Liz'
# .append()
new_ser = s.append(pd.Series(
{'Doyle': 'Jon', 'Hunt':'Ann'} ))
print("new_ser:\n{}".format(new_ser))
print("s:\n{}".format(s))
输出:
s:
Hunt Ann
Geist Tim
dtype: object
s['Geist'] = 'John'-> the updated s:
Hunt Ann
Geist John
dtype: object
s['Smith'] = 'Jimmy'-> the updated s:
Hunt Ann
Geist John
Smith Jimmy
dtype: object
s.iloc[0] = 'Mary' -> the updated s:
Hunt Mary
Geist John
Smith Jimmy
dtype: object
new_ser:
Hunt Mary
Geist John
Smith Jimmy
Doyle Jon
Hunt Ann
dtype: object
s:
Hunt Mary
Geist John
Smith Jimmy
dtype: object
Update: DataFrame
- 合并行 Combine rows:
● pd.concat ( [list_of_dfs] )
● 主要功能就是合并多个一个list 中的多个 df
● 原始的 df 是不会改变的
● 生成新的 df 合并来自第二个 df 位置 index 是从 0 开始的 - 增加一列:
● 基于新标签的赋值一个series,新的标签就是新的 column 名
pd.concat ( [list_of_dfs] ) 例子:
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
more_df = pd.DataFrame([(1234, 4.0, 10.0, '1/4/2014')], columns = ['UPS', 'Units', 'Sales', 'Date'] )
print("salemore_dfs_df:\n{}".format(more_df))
# the positional index for the second df
# starts at zero for the new df
new_df = pd.concat ([sales_df, more_df])
print("new_df:\n{}".format(new_df))
print("sales_df:\n{}".format(sales_df))
print("salemore_dfs_df:\n{}".format(more_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
salemore_dfs_df:
UPS Units Sales Date
0 1234 4.0 10.0 1/4/2014
new_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
0 1234 4.0 10.0 1/4/2014
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
salemore_dfs_df:
UPS Units Sales Date
0 1234 4.0 10.0 1/4/2014
新增列例子:
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
sales_df['Priority'] = pd.Series(range(7))
sales_df['Category'] = 'Food'
print("sales_df after new column added:\n{}".format(sales_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_df after new column added:
UPS Units Sales Date Priority Category
0 1234 5.0 20.2 1/1/2014 0 Food
1 1234 2.0 8.0 1/2/2014 1 Food
2 1234 3.0 13.0 1/3/2014 2 Food
3 789 1.0 2.0 1/1/2014 3 Food
4 789 2.0 3.8 1/2/2014 4 Food
5 789 NaN NaN 1/3/2014 5 Food
6 789 1.0 1.8 1/5/2014 6 Food
Delete
Deleting Series Items
- 不寻常的操作:
● 直接在原 series上操作数据,取代创建一个新的 copy series - 基于标签 Label-based 删除 item 用 del
注意
: 删除不唯一的 index 标签的 item,会导致难以预料的结果
删除唯一的index 标签
import pandas as pd
s = pd.Series(['Ann', 'Tim'], index = ['Hunt', 'Geist'])
print("origin s:\n{}".format(s))
del s['Geist']
print("s after deleting del s['Geist']:\n{}".format(s))
输出:
origin s:
Hunt Ann
Geist Tim
dtype: object
s after deleting del s['Geist']:
Hunt Ann
dtype: object
删除不唯一的index 标签
import pandas as pd
s = pd.Series(['Ann', 'Tim', 'John'], index = ['Hunt', 'Geist', 'Hunt'])
print("origin s:\n{}".format(s))
del s['Hunt']
print("s after deleting del s['Hunt']:\n{}".format(s))
输出:
origin s:
Hunt Ann
Geist Tim
Hunt John
dtype: object
s after deleting del s['Hunt']:
Geist Tim
dtype: object
Delete DataFrame
Deleting DataFrame Rows with .drop() method
.drop (list_of_row_index_values, axis=0):
● 返回所有剩下的 rows(s)
● 原来的 df 是不会改变的
● 可以删除一行或多行
● ValueError: 删除不存在的 row index
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
remaining_rows = sales_df.drop([2, 4, 6], axis = 0)
print("remaining_rows:\n{}".format(remaining_rows))
print("sales_df:\n{}".format(sales_df))
# ValueError: if the list contains non-existent row index values
# sales_df.drop(7, axis = 0)
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
remaining_rows:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
3 789 1.0 2.0 1/1/2014
5 789 NaN NaN 1/3/2014
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
Deleting DataFrame Columns
.pop (column_label):
● 返回从原来的 df 中被移除的列
● 原来的 df 是会被改变的
● 只能移除一列,不能移除多个列
● KeyError: 如果列名不存在
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
sales_only = sales_df.pop('Sales')
# cannot pop more than one column
# sales_units = sales_df.pop(['Units','Sales'])
print("sales_only:\n{}".format(sales_only))
print("sales_df after pop:\n{}".format(sales_df))
# add the removed column back
sales_df['Sales'] = sales_only
print("sales_df after add back:\n{}".format(sales_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_only:
0 20.2
1 8.0
2 13.0
3 2.0
4 3.8
5 NaN
6 1.8
Name: Sales, dtype: float64
sales_df after pop:
UPS Units Date
0 1234 5.0 1/1/2014
1 1234 2.0 1/2/2014
2 1234 3.0 1/3/2014
3 789 1.0 1/1/2014
4 789 2.0 1/2/2014
5 789 NaN 1/3/2014
6 789 1.0 1/5/2014
sales_df after add back:
UPS Units Date Sales
0 1234 5.0 1/1/2014 20.2
1 1234 2.0 1/2/2014 8.0
2 1234 3.0 1/3/2014 13.0
3 789 1.0 1/1/2014 2.0
4 789 2.0 1/2/2014 3.8
5 789 NaN 1/3/2014 NaN
6 789 1.0 1/5/2014 1.8
.drop (list_of_col_labels, axis=1):
● 返回不包括在参数 list_of_col_labels 的所有其它列
● 原来的 df 不会改变
● 可以 drop 一列或多列
● ValueError: 如果列名不存在
举例:
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
remaining_df = sales_df.drop (['Units', 'Sales'], axis = 1)
# ValueError: if dropping non-existent column label
# sales_df.drop('Category')
print("remaining_df:\n{}".format(remaining_df))
print("sales_df after drop:\n{}".format(sales_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
remaining_df:
UPS Date
0 1234 1/1/2014
1 1234 1/2/2014
2 1234 1/3/2014
3 789 1/1/2014
4 789 1/2/2014
5 789 1/3/2014
6 789 1/5/2014
sales_df after drop:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
.reindex (list_of_columns):
● 创建一个新的 df,包含想要的列数据
● 原 df 不会改变
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
cols = ['Sales', 'Date']
sales_date_only = sales_df.reindex (columns = cols)
print("sales_date_only:\n{}".format(sales_date_only))
print("sales_df after reindex:\n{}".format(sales_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_date_only:
Sales Date
0 20.2 1/1/2014
1 8.0 1/2/2014
2 13.0 1/3/2014
3 2.0 1/1/2014
4 3.8 1/2/2014
5 NaN 1/3/2014
6 1.8 1/5/2014
sales_df after reindex:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
Indexing with a list of new columns: df [ list_of_columns]
● 返回选中的列
● 原 df 不会改变
● KeyError: 如果列名标签不存在
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
sales_units = sales_df[['Units','Sales']]
print("sales_units:\n{}".format(sales_units))
print("sales_df after indexing:\n{}".format(sales_df))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_units:
Units Sales
0 5.0 20.2
1 2.0 8.0
2 3.0 13.0
3 1.0 2.0
4 2.0 3.8
5 NaN NaN
6 1.0 1.8
sales_df after indexing:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014