Pandas库
1 常用数据对象
(1)Series对象
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
"""
data:可以是数组、列表等数组对象,也可以是字典,还可以是数字、字符串。传入字典时,索引用字典的键代替。
index:索引,传入应为列表
name:为对象起名字
"""
Series对象的属性
gdp.index
gdp.values
gdp.name
gdp.index.name
g=np.array([27466.15,24899.3,19610.9,19492.4,17885.39,17558.76])
gdp=pd.Series(g, index=['上海','北京','重庆','长春','成都','深圳'],name='City GDP')
gdp.index.name='City name'
print(gdp)
结果:
City name
上海 27466.15
北京 24899.30
重庆 19610.90
长春 19492.40
成都 17885.39
深圳 17558.76
Name: City GDP, dtype: float64
创建Series对象的方法:列表,字典(字典的键为索引)
(2)DataFrame对象
储存二维数据,类似电子表格数据库表
pd.DataFrame(data=None,index=None,columns=None,dtype=None,copy=False)
"""
data:嵌套列表,二维数组、字典或者DataFrame对象、
index:索引对象或数组对象
columns:索引对象或数组对象,列索引,用数据库或电子表格中术语叫'字段'
"""
例子1:
gdp2=pd.DataFrame([[27466.15,2419.70],[24899.30,2172.90],
[19610.90,1350.11],[19492.60,1137.87]])
gdp2.index=['上海','北京','重庆','长春']
gdp2.columns=['GDP','Population']
gdp2.index.name='City name'
gdp2.columns.name='Items'
print(gdp2)
结果:
Items GDP Population
City name
上海 27466.15 2419.70
北京 24899.30 2172.90
重庆 19610.90 1350.11
长春 19492.60 1137.87
例子2:
u=np.array([('Beijing',100.00),('beijing',96.91),('hubei',82.57)],dtype=[('city','30S'),('marks',np.float)])
pd.DataFrame(u,index=["PKU","Tsinghua","WHU"])
结果:
city marks
PKU b'Beijing' 100.00
Tsinghua b'beijing' 96.91
WHU b'hubei' 82.57
DataFrame实例对象创建方法:
嵌套列表,字典(键为列索引)
数据格式转换
(1)将其他格式转换为DataFrame
pd.DataFrame.from_dict(data, orient="columns", dtype=None, columns=None)
"""
将字典数据转换为DataFrame
data : dict Of the form {field : array-like} or {field : dict}.
orient : {'columns', 'index'}, default 'columns'
The "orientation" of the data. If the keys of the passed dict
should be the columns of the resulting DataFrame, pass 'columns'
(default). Otherwise if the keys should be rows, pass 'index'.
dtype : dtype, default None
Data type to force, otherwise infer.
columns : list, default None
Column labels to use when ``orient='index'``. Raises a ValueError
if used with ``orient='columns'``.
"""
(2)将DataFrame转换为其他格式
pd.DataFrame.to_csv
# 设DataFrame对象df:
df.to_csv(excel_writer,sheet_name='Sheet1',na_rep='',float_format=None,columns=None,header=True,index=True,index_label=None,startrow=0,startcol=0,engine=None,encoding=None)
"""
Parameters
----------
excel_writer : str or ExcelWriter object
File path or existing ExcelWriter.
sheet_name : str, default 'Sheet1'
Name of sheet which will contain DataFrame.
na_rep : str, default ''
Missing data representation. 缺失的数据表示
float_format : str, optional
Format string for floating point numbers. For example
``float_format="%.2f"`` will format 0.1234 to 0.12.
columns : sequence or list of str, optional
Columns to write.
header : bool or list of str, default True
Write out the column names. If a list of string is given it is
assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
index_label : str or sequence, optional
Column label for index column(s) if desired. If not specified, and `header` and `index` are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
startrow : int, default 0
Upper left cell row to dump data frame.
startcol : int, default 0
Upper left cell column to dump data frame.
engine : str, optional
Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this via the options ``io.excel.xlsx.writer``, ``io.excel.xls.writer``, and ``io.excel.xlsm.writer``.
merge_cells : bool, default True
Write MultiIndex and Hierarchical Rows as merged cells.
encoding : str, optional
Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
inf_rep : str, default 'inf'
Representation for infinity (there is no native representation for infinity in Excel).
verbose : bool, default True
Display more information in the error logs.
freeze_panes : tuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.
"""
写入xls文件:
with pd.ExcelWriter('demo1.xlsx') as writer:
a.to_excel(writer)
pd.DataFrame.to_dict
pd.DataFrame.to_excel
pd.DataFrame.to_json
…
df.to_excel
df.to_csv
df.to_dict
…
(3)Panel对象
Panel was removed in 0.25.0.
pd.Panel(data=None, items=None, major_axis=None, minor_axis=None,copy=False, dtype=None)
"""
data:可以是dict,字典的value对应类型为DataFrame
items:axis=0;
major_axis:axis=1;
minor_axis:axis=2;
"""
2 索引对象
(1)Index对象
pd.Index(data=None,dtype=None,copy=False,name=None,fastpath=False,tupleize_cols=True, **kwargs)
"""
Immutable ndarray implementing an ordered, sliceable set. The basic object
storing axis labels for all pandas objects.
Parameters
----------
data : array-like (1-dimensional)
dtype : NumPy dtype (default: object)
If dtype is None, we find the dtype that best fits the data.
If an actual dtype is provided, we coerce to that dtype if it's safe.
Otherwise, an error will be raised.
copy : bool
Make a copy of input ndarray.
name : object
Name to be stored in the index.
tupleize_cols : bool (default: True)
When True, attempt to create a MultiIndex if possible.
"""
ind=pd.Index(['a','b','c','d'])
a=pd.Series([12,15,19,20],ind)
print(a)
结果:
a 12
b 15
c 19
d 20
dtype: int64
ind2=pd.Index(['name','age'])
a2=pd.DataFrame([['zx',18],['hack',18]],index=[i for i in range(1,3)],columns=ind2)
print(a2)
结果:
name age
1 zx 18
2 hack 18
利用Index中传入data是元素为元组的列表,可实现实例化MultiIndex对象
f3=pd.Index([('a',1),('a',2),('b',1)])
print(f3)
结果:
MultiIndex([('a', 1),
('a', 2),
('b', 1)],
)
(2)MultiIndex对象
多级索引
0 1
一级索引 二级索引
a 1 0.147329 0.377239
2 0.785814 0.012596
b 1 0.794787 0.987981
2 0.851144 0.796109
a,b为一级索引,1,2为二级索引
一级+二级是行索引
0,1为列索引
通过函数创建
pd.MultiIndex.from_tuples(tuples, sortorder=None, names=None)
"""
Convert list of tuples to MultiIndex.
Parameters
----------
tuples : list / sequence of tuple-likes
Each tuple is the index of one row/column.
sortorder : int or None
Level of sortedness (must be lexicographically sorted by that level).
names : list / sequence of str, optional
Names for the levels in the index.
pd.MultiIndex.from_arrays(cls, arrays, sortorder=None, names=lib.no_default)
pd.MultiIndex.from_product(cls, iterables, sortorder=None, names=lib.no_default)
"""
Make a MultiIndex from the cartesian product of multiple iterables.
从多个迭代对象的笛卡尔积创建一个多索引.
"""
笛卡尔积:
[‘a’ , ’b’]与[100, 200] 相乘结果为 [‘a’, 100], [‘a’, 200], [‘b’, 100],[‘b’, 200]
例子:
tup=[('China','Beijing'),('China','Hongkong'),('USA','chicago'),('USA','NewYork')]
b=pd.MultiIndex.from_tuples(tup)
print(b)
arr=np.array([['china','china','USA','USA'],
['beijing','hongkong','newyork','chicago']])
c=pd.MultiIndex.from_arrays(arr)
print(c)
结果:
MultiIndex([('China', 'Beijing'),
('China', 'Hongkong'),
( 'USA', 'chicago'),
( 'USA', 'NewYork')],
)
MultiIndex([('china', 'beijing'),
('china', 'hongkong'),
( 'USA', 'newyork'),
( 'USA', 'chicago')],
)
通过实例化类创建
pd.MultiIndex(levels=None,codes=None,sortorder=None,names=None,copy=False,verify_integrity=True,**kwargs)
levels:以序列类数据表示标签索引
codes:以整数方式表示每个标签索引的位置
levels : sequence of arrays
The unique labels for each level.
codes : sequence of arrays
Integers for each level designating which label at each location.
f=pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
print(f)
f2=pd.MultiIndex(levels=[['a','b','c','d'],[1,8,9,5]],codes=[[0,1,2,3],[1,2,0,3]])
print(f2)
结果:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
MultiIndex([('a', 8),
('b', 9),
('c', 1),
('d', 5)],
)
gdp_index=[('shanghai',2015),('shanghai',2016),('beijing',2016),('beijing',2015)]
gdp_mind=pd.Index(gdp_index)
# gdp_mind2=pd.MultiIndex.from_tuples(gdp_index,names=('city','year')) #与上一句话等价
gdp=pd.Series([25300,27466,23000,24899],index=gdp_mind)
print(gdp)
gdp2=gdp.unstack()
print(gdp2)
结果:
shanghai 2015 25300
2016 27466
beijing 2016 23000
2015 24899
dtype: int64
2015 2016
beijing 24899 23000
shanghai 25300 27466
ser.unstack(level=-1, fill_value=None)
用于将具有多级索引的Series转换为DataFrame
Unstack, also known as pivot, Series with MultiIndex to produce DataFrame. #也被称为透视表
注:
Index,MultiIndex是针对于索引的对象,无论是Series还是DataFrame对象都可以具有多级索引
3 数据索引和切片
(1)Series对象
gdp_index=['shanghai','shanghai','beijing','beijing']
gdp=pd.Series([25300,27466,23000,24899],index=gdp_index)
print(gdp.keys())
for i in gdp.items():
print(i)
Index(['shanghai', 'shanghai', 'beijing', 'beijing'], dtype='object')
('shanghai', 25300)
('shanghai', 27466)
('beijing', 23000)
('beijing', 24899)
Series对象a的方法:
a.keys():返回Index对象
a.items():返回可迭代对象,与字典相似
a.pop(label index):移除某个元素,用标签索引
a.iloc[]:只接受位置索引,不分索引级别,返回的为数值
a.loc[]:只接受标签索引 前包括,后也包括,会返回索引
对于多级索引,用a[ , ]方法,‘ ,’左面是0级索引,右面是1级索引
索引用法基本与列表类似
gdp_index=[('shanghai',2015),('shanghai',2016),('beijing',2015),('beijing',2016),
('changchun',2015),('changchun',2016)]
index=pd.MultiIndex.from_tuples(gdp_index)
ser=pd.Series([25300,27466,23000,24899,18100,19611],index=index)
print(ser.iloc[0]) #25300
print(ser.iloc[1:5])
"""
(shanghai, 2016) 27466
(beijing, 2015) 23000
(beijing, 2016) 24899
(changchun, 2015) 18100
dtype: int64
"""
print(ser.loc['shanghai'])
"""
2015 25300
2016 27466
"""
(2)DataFrame对象
仅取某列数据
根据列名称或字段名称直接获得
获取某行数据
可使用loc,iloc
通过行列约束取数据
loc,iloc传入行列索引
对于横纵索引都是MultiIndex类型的DataFrame数据a:
a[(h1 ,h2 ),(l1 , l2)]
h1:横索引的一级索引
h2:横索引的二级索引
如果“经验主义”行不通:
引用idx
idx=pd.IndexSlice
a.loc[idx[:,1],idx[:,'phy']]
使用索引查找数据要注意顺序,由外层向内层,层层递进