数据分析Pandas库学习笔记

最新推荐文章于 2021-12-09 14:32:26 发布

Hack.

最新推荐文章于 2021-12-09 14:32:26 发布

阅读量615

点赞数

分类专栏： python 笔记文章标签： python 数据分析

本文链接：https://blog.csdn.net/endlessfc/article/details/112465333

版权

python 同时被 2 个专栏收录

16 篇文章 0 订阅

订阅专栏

笔记

12 篇文章 0 订阅

订阅专栏

Pandas库

1 常用数据对象

(1)Series对象

pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
"""
data:可以是数组、列表等数组对象，也可以是字典，还可以是数字、字符串。传入字典时，索引用字典的键代替。
index:索引，传入应为列表
name:为对象起名字
"""

Series对象的属性

gdp.index

gdp.values

gdp.name

gdp.index.name

g=np.array([27466.15,24899.3,19610.9,19492.4,17885.39,17558.76])
gdp=pd.Series(g, index=['上海','北京','重庆','长春','成都','深圳'],name='City GDP')
gdp.index.name='City name'
print(gdp)

结果：

City name
上海    27466.15
北京    24899.30
重庆    19610.90
长春    19492.40
成都    17885.39
深圳    17558.76
Name: City GDP, dtype: float64

创建Series对象的方法：列表，字典（字典的键为索引）

(2)DataFrame对象

储存二维数据，类似电子表格数据库表

pd.DataFrame(data=None,index=None,columns=None,dtype=None,copy=False)
"""
data:嵌套列表，二维数组、字典或者DataFrame对象、
index:索引对象或数组对象
columns:索引对象或数组对象，列索引，用数据库或电子表格中术语叫'字段'
"""

例子1：

gdp2=pd.DataFrame([[27466.15,2419.70],[24899.30,2172.90],
                  [19610.90,1350.11],[19492.60,1137.87]])
gdp2.index=['上海','北京','重庆','长春']
gdp2.columns=['GDP','Population']
gdp2.index.name='City name'
gdp2.columns.name='Items'
print(gdp2)

结果：

Items           GDP  Population
City name                      
上海         27466.15     2419.70
北京         24899.30     2172.90
重庆         19610.90     1350.11
长春         19492.60     1137.87

例子2：

u=np.array([('Beijing',100.00),('beijing',96.91),('hubei',82.57)],dtype=[('city','30S'),('marks',np.float)])
pd.DataFrame(u,index=["PKU","Tsinghua","WHU"])

结果：

                city   marks
PKU       b'Beijing'  100.00
Tsinghua  b'beijing'   96.91
WHU         b'hubei'   82.57

DataFrame实例对象创建方法：

嵌套列表，字典（键为列索引）

数据格式转换

(1)将其他格式转换为DataFrame

pd.DataFrame.from_dict(data, orient="columns", dtype=None, columns=None)
"""
将字典数据转换为DataFrame
data : dict Of the form {field : array-like} or {field : dict}.
orient : {'columns', 'index'}, default 'columns'
The "orientation" of the data. If the keys of the passed dict
should be the columns of the resulting DataFrame, pass 'columns'
(default). Otherwise if the keys should be rows, pass 'index'.
dtype : dtype, default None
Data type to force, otherwise infer.
columns : list, default None
Column labels to use when ``orient='index'``. Raises a ValueError
if used with ``orient='columns'``.
"""

(2)将DataFrame转换为其他格式

pd.DataFrame.to_csv

# 设DataFrame对象df：
df.to_csv(excel_writer,sheet_name='Sheet1',na_rep='',float_format=None,columns=None,header=True,index=True,index_label=None,startrow=0,startcol=0,engine=None,encoding=None)
"""
Parameters
----------
excel_writer : str or ExcelWriter object
File path or existing ExcelWriter.
sheet_name : str, default 'Sheet1'
Name of sheet which will contain DataFrame.
na_rep : str, default ''
Missing data representation. 缺失的数据表示
float_format : str, optional
Format string for floating point numbers. For example
``float_format="%.2f"`` will format 0.1234 to 0.12.
columns : sequence or list of str, optional
Columns to write.
header : bool or list of str, default True
Write out the column names. If a list of string is given it is
assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
index_label : str or sequence, optional
Column label for index column(s) if desired. If not specified, and `header` and `index` are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
startrow : int, default 0
Upper left cell row to dump data frame.
startcol : int, default 0
Upper left cell column to dump data frame.
engine : str, optional
Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this via the options ``io.excel.xlsx.writer``, ``io.excel.xls.writer``, and ``io.excel.xlsm.writer``.
merge_cells : bool, default True
Write MultiIndex and Hierarchical Rows as merged cells.
encoding : str, optional
Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
inf_rep : str, default 'inf'
Representation for infinity (there is no native representation for infinity in Excel).
verbose : bool, default True
Display more information in the error logs.
freeze_panes : tuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.
"""

写入xls文件：

with pd.ExcelWriter('demo1.xlsx') as writer:
    a.to_excel(writer)

pd.DataFrame.to_dict

pd.DataFrame.to_excel

pd.DataFrame.to_json

…

df.to_excel

df.to_csv

df.to_dict

…

(3)Panel对象

Panel was removed in 0.25.0.

pd.Panel(data=None, items=None, major_axis=None, minor_axis=None,copy=False, dtype=None)
"""
data:可以是dict,字典的value对应类型为DataFrame
items:axis=0;
major_axis:axis=1;
minor_axis:axis=2;
"""

2 索引对象

(1)Index对象

pd.Index(data=None,dtype=None,copy=False,name=None,fastpath=False,tupleize_cols=True, **kwargs)
"""
Immutable ndarray implementing an ordered, sliceable set. The basic object
storing axis labels for all pandas objects.

Parameters
----------
data : array-like (1-dimensional)
dtype : NumPy dtype (default: object)
If dtype is None, we find the dtype that best fits the data.
If an actual dtype is provided, we coerce to that dtype if it's safe.
Otherwise, an error will be raised.
copy : bool
Make a copy of input ndarray.
name : object
Name to be stored in the index.
tupleize_cols : bool (default: True)
When True, attempt to create a MultiIndex if possible.
"""

ind=pd.Index(['a','b','c','d'])
a=pd.Series([12,15,19,20],ind)
print(a)

结果：

a    12
b    15
c    19
d    20
dtype: int64

ind2=pd.Index(['name','age'])
a2=pd.DataFrame([['zx',18],['hack',18]],index=[i for i in range(1,3)],columns=ind2)
print(a2)

结果：

   name  age
1    zx   18
2  hack   18

利用Index中传入data是元素为元组的列表，可实现实例化MultiIndex对象

f3=pd.Index([('a',1),('a',2),('b',1)])
print(f3)

结果：

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1)],
           )

(2)MultiIndex对象

多级索引

                  0         1
一级索引 二级索引                    
a    1     0.147329  0.377239
     2     0.785814  0.012596
b    1     0.794787  0.987981
     2     0.851144  0.796109

a,b为一级索引，1,2为二级索引

一级+二级是行索引

0，1为列索引

通过函数创建

pd.MultiIndex.from_tuples(tuples, sortorder=None, names=None)
"""
Convert list of tuples to MultiIndex.
Parameters
----------
tuples : list / sequence of tuple-likes
    Each tuple is the index of one row/column.
sortorder : int or None
    Level of sortedness (must be lexicographically sorted by that level).
names : list / sequence of str, optional
    Names for the levels in the index.

pd.MultiIndex.from_arrays(cls, arrays, sortorder=None, names=lib.no_default)
pd.MultiIndex.from_product(cls, iterables, sortorder=None, names=lib.no_default)
"""
Make a MultiIndex from the cartesian product of multiple iterables.
从多个迭代对象的笛卡尔积创建一个多索引.
"""

笛卡尔积：

[‘a’ , ’b’]与[100, 200] 相乘结果为 [‘a’, 100], [‘a’, 200], [‘b’, 100],[‘b’, 200]

例子:

tup=[('China','Beijing'),('China','Hongkong'),('USA','chicago'),('USA','NewYork')]
b=pd.MultiIndex.from_tuples(tup)
print(b)
arr=np.array([['china','china','USA','USA'],
             ['beijing','hongkong','newyork','chicago']])
c=pd.MultiIndex.from_arrays(arr)
print(c)

结果:

MultiIndex([('China',  'Beijing'),
            ('China', 'Hongkong'),
            (  'USA',  'chicago'),
            (  'USA',  'NewYork')],
           )
MultiIndex([('china',  'beijing'),
            ('china', 'hongkong'),
            (  'USA',  'newyork'),
            (  'USA',  'chicago')],
           )

通过实例化类创建

pd.MultiIndex(levels=None,codes=None,sortorder=None,names=None,copy=False,verify_integrity=True,**kwargs)
levels:以序列类数据表示标签索引
codes:以整数方式表示每个标签索引的位置
levels : sequence of arrays
The unique labels for each level.
codes : sequence of arrays
Integers for each level designating which label at each location.

f=pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
print(f)
f2=pd.MultiIndex(levels=[['a','b','c','d'],[1,8,9,5]],codes=[[0,1,2,3],[1,2,0,3]])
print(f2)

结果:

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
MultiIndex([('a', 8),
            ('b', 9),
            ('c', 1),
            ('d', 5)],
           )

gdp_index=[('shanghai',2015),('shanghai',2016),('beijing',2016),('beijing',2015)]
gdp_mind=pd.Index(gdp_index)
# gdp_mind2=pd.MultiIndex.from_tuples(gdp_index,names=('city','year')) #与上一句话等价
gdp=pd.Series([25300,27466,23000,24899],index=gdp_mind)
print(gdp)
gdp2=gdp.unstack()
print(gdp2)

结果:

shanghai  2015    25300
          2016    27466
beijing   2016    23000
          2015    24899
dtype: int64
           2015   2016
beijing   24899  23000
shanghai  25300  27466

ser.unstack(level=-1, fill_value=None)

用于将具有多级索引的Series转换为DataFrame

Unstack, also known as pivot, Series with MultiIndex to produce DataFrame. #也被称为透视表

注：

Index,MultiIndex是针对于索引的对象，无论是Series还是DataFrame对象都可以具有多级索引

3 数据索引和切片

(1)Series对象

gdp_index=['shanghai','shanghai','beijing','beijing']
gdp=pd.Series([25300,27466,23000,24899],index=gdp_index)
print(gdp.keys())
for i in gdp.items():
	print(i)

Index(['shanghai', 'shanghai', 'beijing', 'beijing'], dtype='object')
('shanghai', 25300)
('shanghai', 27466)
('beijing', 23000)
('beijing', 24899)

Series对象a的方法:

a.keys():返回Index对象

a.items():返回可迭代对象，与字典相似

a.pop(label index):移除某个元素，用标签索引

a.iloc[]:只接受位置索引，不分索引级别,返回的为数值

a.loc[]:只接受标签索引 前包括，后也包括，会返回索引

对于多级索引，用a[ , ]方法，‘ ,’左面是0级索引，右面是1级索引

索引用法基本与列表类似

gdp_index=[('shanghai',2015),('shanghai',2016),('beijing',2015),('beijing',2016),
           ('changchun',2015),('changchun',2016)]
index=pd.MultiIndex.from_tuples(gdp_index)
ser=pd.Series([25300,27466,23000,24899,18100,19611],index=index)
print(ser.iloc[0])  #25300
print(ser.iloc[1:5])
"""
(shanghai, 2016)     27466
(beijing, 2015)      23000
(beijing, 2016)      24899
(changchun, 2015)    18100
dtype: int64
"""
print(ser.loc['shanghai'])
"""
2015    25300
2016    27466
"""

(2)DataFrame对象

仅取某列数据

根据列名称或字段名称直接获得

获取某行数据

可使用loc,iloc

通过行列约束取数据

loc,iloc传入行列索引

对于横纵索引都是MultiIndex类型的DataFrame数据a：

a[(h1 ,h2 ),(l1 , l2)]

h1:横索引的一级索引

h2:横索引的二级索引

如果“经验主义”行不通：

引用idx

idx=pd.IndexSlice
a.loc[idx[:,1],idx[:,'phy']]

使用索引查找数据要注意顺序，由外层向内层，层层递进

Hack.

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据分析Pandas库学习笔记

Pandas库1 常用数据对象(1)Series对象pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)"""data:可以是数组、列表等数组对象，也可以是字典，还可以是数字、字符串。传入字典时，索引用字典的键代替。index:索引，传入应为列表name:为对象起名字"""Series对象的属性gdp.indexgdp.valuesgdp.namegdp.index.
复制链接

扫一扫