Python Pandas Official Tutorial: Knowledge Points 1~50 Give You a Nudge


程志伟

Article link: https://blog.csdn.net/c1z2w3456789/article/details/107749523

Follow the WeChat official account: 小程在线

Follow the CSDN blog: 程志伟的博客

Purely personal study notes

Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 7.12.0 -- An enhanced Interactive Python.

# I. Creating Objects

#1. Passing a list creates a Series; pandas builds a default integer index:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1,3,5,np.nan,9,8])
print(s)
0 1.0
1 3.0
2 5.0
3 NaN
4 9.0
5 8.0
dtype: float64
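The float64 dtype in the output above comes from the np.nan entry, which is a float and forces the whole Series to be upcast. A minimal sketch of that behaviour, using hypothetical names (s_int, s_nan, s_labeled) and index labels that are not part of the original notes:

```python
import numpy as np
import pandas as pd

# Without a missing value, a list of integers keeps an integer dtype.
s_int = pd.Series([1, 3, 5])

# np.nan is a float, so the whole Series is upcast to float64.
s_nan = pd.Series([1, 3, 5, np.nan])

# An explicit index (hypothetical labels) replaces the default 0..n-1 index.
s_labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(s_int.dtype, s_nan.dtype)
print(s_labeled["b"])
```
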

#2. Passing a NumPy array, a datetime index and column labels creates a DataFrame

dates = pd.date_range('20130101', periods=6)
print(dates)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698
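Because np.random.randn draws new numbers on every run, the printed values change each time the session is re-executed. One way to get repeatable output is to seed a generator. This is only a sketch: the helper make_frame and the seed value 0 are assumptions for illustration, not part of the tutorial.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

def make_frame(seed):
    """Build a 6x4 frame of standard-normal values from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

df_a = make_frame(0)
df_b = make_frame(0)

# The same seed gives identical frames, so printed output stays stable across runs.
print(df_a.equals(df_b))
```
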

#3. Passing a dict of objects that can be converted to series-like structures creates a DataFrame

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

#4. Check the data type of each column:
print(df2.dtypes)
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
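Once the column dtypes are known, they can be used to filter or convert columns. The following sketch assumes a smaller hypothetical frame (frame) built the same way as df2, with scalars broadcast against a length-4 array:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: scalars are broadcast to the length of the array column.
frame = pd.DataFrame({
    "A": 1.0,
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",
})

# Keep only the numeric columns (A and D); the string column F is dropped.
numeric = frame.select_dtypes(include="number")

# Convert a single column to another dtype; astype returns a new Series.
as_float = frame["D"].astype("float64")

print(numeric.dtypes)
print(as_float.dtype)
```
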

'''
II. Viewing Data
'''

#1. View the first and last rows of a DataFrame:

print(df.head())
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764

print(df.tail(3))
A B C D
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698

#2. Display the index, the columns and the underlying NumPy data:

print(df.index)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

print(df.columns)
Index(['A', 'B', 'C', 'D'], dtype='object')

print(df.values)
[[-1.86393821 0.40282769 1.17014473 1.21964729]
[-0.25573943 0.90825332 -1.19425769 0.92671242]
[-0.09869885 -0.79461508 -1.08363359 -0.71968247]
[ 1.17222981 -0.15551869 0.38053457 -0.69428843]
[ 0.02021829 0.86144052 -0.71066191 -1.18176438]
[-0.31592033 -2.37142305 -0.57265269 0.88969805]]

#3. describe() gives a quick statistical summary of the data:

print(df.describe())
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.223641 -0.191506 -0.335088 0.073387
std 0.971973 1.248076 0.924535 1.049012
min -1.863938 -2.371423 -1.194258 -1.181764
25% -0.300875 -0.634841 -0.990391 -0.713334
50% -0.177219 0.123655 -0.641657 0.097705
75% -0.009511 0.746787 0.142238 0.917459
max 1.172230 0.908253 1.170145 1.219647
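Each row of the describe() table corresponds to an ordinary reduction on the column. As a rough check, assuming a small hypothetical frame df_small with known values:

```python
import pandas as pd

# Hypothetical data chosen so the statistics are easy to verify by hand.
df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

summary = df_small.describe()

# The summary rows match the individual reductions.
mean_matches = summary.loc["mean", "A"] == df_small["A"].mean()
median_matches = summary.loc["50%", "A"] == df_small["A"].median()
# describe() reports the sample standard deviation (ddof=1).
std_matches = abs(summary.loc["std", "A"] - df_small["A"].std(ddof=1)) < 1e-12

print(mean_matches, median_matches, std_matches)
```
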

#4. Transpose the data:

print(df.T)
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -1.863938 -0.255739 -0.098699 1.172230 0.020218 -0.315920
B 0.402828 0.908253 -0.794615 -0.155519 0.861441 -2.371423
C 1.170145 -1.194258 -1.083634 0.380535 -0.710662 -0.572653
D 1.219647 0.926712 -0.719682 -0.694288 -1.181764 0.889698

#5. Sort by an axis (here the column labels, in descending order)

print(df.sort_index(axis=1, ascending=False))
D C B A
2013-01-01 1.219647 1.170145 0.402828 -1.863938
2013-01-02 0.926712 -1.194258 0.908253 -0.255739
2013-01-03 -0.719682 -1.083634 -0.794615 -0.098699
2013-01-04 -0.694288 0.380535 -0.155519 1.172230
2013-01-05 -1.181764 -0.710662 0.861441 0.020218
2013-01-06 0.889698 -0.572653 -2.371423 -0.315920

#6. Sort by values

print(df.sort_values(by='B'))
A B C D
2013-01-06 -0.315920 -2.371423 -0.572653 0.889698
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-05 0.020218 0.861441 -0.710662 -1.181764
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
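sort_values also accepts a list of columns and an ascending flag. A short sketch with a hypothetical frame df_s, whose labels and values are invented for illustration:

```python
import pandas as pd

# Hypothetical frame with a tie in column B.
df_s = pd.DataFrame({"B": [2, 1, 2], "C": [0.5, 0.1, 0.3]}, index=["x", "y", "z"])

# Sort by B from largest to smallest.
desc = df_s.sort_values(by="B", ascending=False)

# Sort by B first, then break ties using C.
multi = df_s.sort_values(by=["B", "C"])

print(desc)
print(multi)
```

Neither call modifies df_s; both return a new, sorted DataFrame.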

'''
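III. Selection
Standard Python/NumPy expressions for selecting and setting work directly and are
handy for interactive use, but for production code the optimized pandas data access
methods are recommended: .at, .iat, .loc and .iloc. (.ix, listed in older versions
of the tutorial, was deprecated in pandas 0.20 and removed in pandas 1.0.) For
details, see "Indexing and Selecting Data" and "MultiIndex / Advanced Indexing"
in the pandas documentation.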
'''

#1. Select a single column, which returns a Series; equivalent to df.A:

print(df['A'])
2013-01-01 -1.863938
2013-01-02 -0.255739
2013-01-03 -0.098699
2013-01-04 1.172230
2013-01-05 0.020218
2013-01-06 -0.315920
Freq: D, Name: A, dtype: float64

#2. Select with [] and a slice, which slices the rows

print(df[0:3])
A B C D
2013-01-01 -1.863938 0.402828 1.170145 1.219647
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682

print(df['20130102':'20130104'])
A B C D
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
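The two slices above follow different endpoint rules: the positional slice 0:3 excludes the stop position, while the label slice includes both endpoint labels. A small sketch that makes this visible, assuming a hypothetical frame df_demo whose column A simply holds the row position:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Hypothetical frame: column A holds the row position 0..5.
df_demo = pd.DataFrame({"A": range(6)}, index=dates)

# Positional slice: rows 0, 1, 2 (the stop position 3 is excluded).
by_position = df_demo[0:3]

# Label slice: both 2013-01-02 and 2013-01-04 are included.
by_label = df_demo["20130102":"20130104"]

print(by_position["A"].tolist())
print(by_label["A"].tolist())
```
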

'''
IV.
Selection by Label
'''

#1. Get a cross section using a label

print(df.loc[dates[0]])
A -1.863938
B 0.402828
C 1.170145
D 1.219647
Name: 2013-01-01 00:00:00, dtype: float64

#2. Select on multiple axes by label

print(df.loc[:,['A','B']])
A B
2013-01-01 -1.863938 0.402828
2013-01-02 -0.255739 0.908253
2013-01-03 -0.098699 -0.794615
2013-01-04 1.172230 -0.155519
2013-01-05 0.020218 0.861441
2013-01-06 -0.315920 -2.371423

#3. Label slicing (both endpoints are included)

print(df.loc['20130102':'20130104',['A','B']])
A B
2013-01-02 -0.255739 0.908253
2013-01-03 -0.098699 -0.794615
2013-01-04 1.172230 -0.155519

#4. Reduce the dimensions of the returned object

print(df.loc['20130102',['A','B']])
A -0.255739
B 0.908253
Name: 2013-01-02 00:00:00, dtype: float64

#5. Get a scalar value

print(df.loc[dates[0],'A'])
-1.8639382082056544
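For a single scalar, .at is the dedicated label-based accessor and returns the same value as .loc. A minimal sketch, assuming a hypothetical frame df_demo with invented values:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# Hypothetical frame with known values.
df_demo = pd.DataFrame({"A": [0.5, 1.5, 2.5]}, index=dates)

# .loc accepts scalars, lists and slices; .at accepts exactly one row label and one column label.
scalar_loc = df_demo.loc[dates[0], "A"]
scalar_at = df_demo.at[dates[0], "A"]

print(scalar_loc == scalar_at)
```
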

'''
V.
Selection by Position
'''

#1. Select by the position of the passed integer (this selects a row)

print(df.iloc[3])
A 1.172230
B -0.155519
C 0.380535
D -0.694288
Name: 2013-01-04 00:00:00, dtype: float64

#2. Slice by integer position, similar to NumPy/Python

print(df.iloc[3:5,0:2])
A B
2013-01-04 1.172230 -0.155519
2013-01-05 0.020218 0.861441

#3. Select by lists of integer positions, similar to NumPy/Python

df.iloc[[1,2,4],[0,2]]
Out[26]:
A C
2013-01-02 -0.255739 -1.194258
2013-01-03 -0.098699 -1.083634
2013-01-05 0.020218 -0.710662

#4. Slice rows explicitly

df.iloc[1:3,:]
Out[27]:
A B C D
2013-01-02 -0.255739 0.908253 -1.194258 0.926712
2013-01-03 -0.098699 -0.794615 -1.083634 -0.719682

#5. Slice columns explicitly

df.iloc[:,1:3]
Out[28]:
B C
2013-01-01 0.402828 1.170145
2013-01-02 0.908253 -1.194258
2013-01-03 -0.794615 -1.083634
2013-01-04 -0.155519 0.380535
2013-01-05 0.861441 -0.710662
2013-01-06 -2.371423 -0.572653

#6. Get a specific value

df.iloc[1,1]
Out[29]: 0.9082533192518502
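The positional counterpart of .at is .iat, and positional indexers also accept negative integers that count from the end, as in plain Python. A sketch with a hypothetical 3x2 frame:

```python
import pandas as pd

# Hypothetical frame with easily recognisable values.
frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], columns=["A", "B"])

# .iloc and .iat return the same scalar for a (row, column) position pair.
via_iloc = frame.iloc[1, 1]
via_iat = frame.iat[1, 1]

# Negative positions count from the end: -1 is the last row.
last_row_first_col = frame.iloc[-1, 0]

print(via_iloc, via_iat, last_row_first_col)
```
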

'''
VI.
Boolean Indexing
'''

#1. Use the values of a single column to select data

df[df.A > 0]
Out[30]:
A B C D
2013-01-04 1.172230 -0.155519 0.380535 -0.694288
2013-01-05 0.020218 0.861441 -0.710662 -1.181764

#2. Select values where a boolean condition is met; all other entries become NaN

df[df > 0]
Out[31]:
A B C D
2013-01-01 NaN 0.402828 1.170145 1.219647
2013-01-02 NaN 0.908253 NaN 0.926712
2013-01-03 NaN NaN NaN NaN
2013-01-04 1.172230 NaN 0.380535 NaN
2013-01-05 0.020218 0.861441 NaN NaN
2013-01-06 NaN NaN NaN 0.889698

#3. Filter with the isin() method

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
# Note: the values from here on differ from the earlier output; df appears to
# have been regenerated with new random numbers in a later run.
df2
Out[33]:
A B C D E
2013-01-01 -0.309948 -0.216393 -0.056348 -0.820341 one
2013-01-02 -0.025978 -0.487552 -0.128323 -1.403817 one
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483 two
2013-01-04 0.159498 -1.440484 -1.916870 -0.202248 three
2013-01-05 1.628307 -0.472663 -0.196632 0.561092 four
2013-01-06 -1.359379 0.442662 0.898418 -0.943424 three

df2[df2['E'].isin(['two','four'])]
Out[34]:
A B C D E
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483 two
2013-01-05 1.628307 -0.472663 -0.196632 0.561092 four
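Boolean masks can be combined with & (and), | (or) and ~ (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch, assuming a hypothetical frame df_b with invented values:

```python
import pandas as pd

# Hypothetical frame for illustration.
df_b = pd.DataFrame({
    "A": [-1.0, 0.5, 2.0, -0.3],
    "E": ["one", "two", "three", "two"],
})

# Rows where A is positive AND E equals 'two'.
both = df_b[(df_b["A"] > 0) & (df_b["E"] == "two")]

# ~ negates a mask: rows whose E is NOT in the given list.
not_in = df_b[~df_b["E"].isin(["two"])]

print(both)
print(not_in)
```
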

'''
VII. Setting
'''

#1. Create a Series to use as a new column; its index starts one day later than df's

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
Out[35]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64

#2. Set a value by label

df.at[dates[0],'A'] = 0

df
Out[37]:
A B C D
2013-01-01 0.000000 -0.216393 -0.056348 -0.820341
2013-01-02 -0.025978 -0.487552 -0.128323 -1.403817
2013-01-03 -0.675241 1.801300 -0.564131 -1.747483
2013-01-04 0.159498 -1.440484 -1.916870 -0.202248
2013-01-05 1.628307 -0.472663 -0.196632 0.561092
2013-01-06 -1.359379 0.442662 0.898418 -0.943424

#3. Set a value by position

df.iat[0,1] = 0

#4. Set a group of values by assigning a NumPy array

df.loc[:,'D'] = np.array([5] * len(df))

df
Out[40]:
A B C D
2013-01-01 0.000000 0.000000 -0.056348 5
2013-01-02 -0.025978 -0.487552 -0.128323 5
2013-01-03 -0.675241 1.801300 -0.564131 5
2013-01-04 0.159498 -1.440484 -1.916870 5
2013-01-05 1.628307 -0.472663 -0.196632 5
2013-01-06 -1.359379 0.442662 0.898418 5

# Assign s1 as a new column; values are aligned on the index, so 2013-01-01 gets NaN
df['F'] = s1

df
Out[42]:
A B C D F
2013-01-01 0.000000 0.000000 -0.056348 5 NaN
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0
2013-01-05 1.628307 -0.472663 -0.196632 5 4.0
2013-01-06 -1.359379 0.442662 0.898418 5 5.0
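The NaN in column F and the missing value 6 both come from index alignment: when a Series is assigned as a column, each value is matched to the row with the same label, and labels absent from the frame are dropped. A reduced sketch with hypothetical names (frame, s_shifted):

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row frame indexed by 2013-01-01 .. 2013-01-03.
frame = pd.DataFrame({"A": [0.0, 1.0, 2.0]},
                     index=pd.date_range("20130101", periods=3))

# The Series starts one day later: 2013-01-02 .. 2013-01-04.
s_shifted = pd.Series([10, 20, 30], index=pd.date_range("20130102", periods=3))

# Assignment aligns on the index: the first row has no match and becomes NaN,
# and the value for 2013-01-04 is discarded because the frame has no such row.
frame["F"] = s_shifted

print(frame)
```
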

#5. Set new values with a where operation

df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[43]:
A B C D F
2013-01-01 0.000000 0.000000 -0.056348 -5 NaN
2013-01-02 -0.025978 -0.487552 -0.128323 -5 -1.0
2013-01-03 -0.675241 -1.801300 -0.564131 -5 -2.0
2013-01-04 -0.159498 -1.440484 -1.916870 -5 -3.0
2013-01-05 -1.628307 -0.472663 -0.196632 -5 -4.0
2013-01-06 -1.359379 -0.442662 -0.898418 -5 -5.0
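The same result can be produced without in-place assignment by using DataFrame.mask or DataFrame.where, which return a new frame. The sketch below assumes a hypothetical 2x2 frame and is an alternative formulation, not the approach used in the notes:

```python
import pandas as pd

# Hypothetical frame with mixed signs.
frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# mask replaces entries where the condition is True with the other value.
negated = frame.mask(frame > 0, -frame)

# where keeps entries where the condition is True and replaces the rest.
negated_where = frame.where(frame <= 0, -frame)

print(negated)
```

Both calls leave frame itself unchanged.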

'''
VIII. Missing Data
'''

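#1. reindex() changes, adds or deletes the index on a specified axis and returns a copy of the data

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1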
df1
Out[46]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.056348 5 NaN 1.0
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0 NaN
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0 NaN
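To see what reindex does in isolation: labels present in the original are carried over, new labels are filled with NaN, omitted labels are dropped, and the original is left untouched. A sketch with hypothetical string labels:

```python
import pandas as pd

# Hypothetical frame with string row labels.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Keep rows a and b, drop c, add a new row d and a new column E.
reindexed = frame.reindex(index=["a", "b", "d"], columns=["A", "E"])

print(reindexed)
print(list(frame.columns))  # the original frame is unchanged
```
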

#2. Drop any rows that contain missing values

df1.dropna(how='any')
Out[47]:
A B C D F E
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
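The how argument controls how strict the drop is, and subset limits the check to certain columns. A sketch on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 lacks F, row 2 lacks E, row 1 is complete.
frame = pd.DataFrame({"F": [np.nan, 1.0, 2.0], "E": [1.0, 1.0, np.nan]})

# how='any' drops a row if at least one value is missing.
any_dropped = frame.dropna(how="any")

# how='all' drops a row only if every value is missing (none here).
all_dropped = frame.dropna(how="all")

# subset restricts the check to the listed columns.
subset_dropped = frame.dropna(subset=["E"])

print(any_dropped.index.tolist(), all_dropped.index.tolist(), subset_dropped.index.tolist())
```
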

#3. Fill in missing values

df1.fillna(value=5)
Out[48]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.056348 5 5.0 1.0
2013-01-02 -0.025978 -0.487552 -0.128323 5 1.0 1.0
2013-01-03 -0.675241 1.801300 -0.564131 5 2.0 5.0
2013-01-04 0.159498 -1.440484 -1.916870 5 3.0 5.0

#4. Get a boolean mask marking where values are missing

pd.isnull(df1)
Out[49]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
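Since True counts as 1 in a sum, the boolean mask can be reduced to missing-value counts. A sketch on a hypothetical frame (pd.isna is the same function as pd.isnull):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
frame = pd.DataFrame({"F": [np.nan, 1.0, 2.0], "E": [1.0, 1.0, np.nan]})

# Number of missing values in each column.
missing_per_column = frame.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(frame.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```
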
