pandas Study Notes 01

龍言玄间

Category: python和数据科学 (Python and Data Science); tags: python

Article link: https://blog.csdn.net/matthewchen123/article/details/107667527

The two key pandas data types: DataFrame and Series

import numpy as np
import pandas as pd

import os
os.getcwd()  # get the current working directory

'C:\\Users\\dell'

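# A DataFrame is like a table with row labels (the index) and column labels. In data analysis it is rarely built by hand; it is usually loaded from an external file.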
df1 = pd.DataFrame(np.arange(10).reshape(2,5))
df1

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9
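The heading names Series as well as DataFrame, so here is a minimal sketch of the two types side by side. The names `s` and `df`, the labels, and the values are made up for illustration and do not come from the original data.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array (hypothetical values).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["x", "y"],
)

print(s["b"])          # access one element by its label
print(type(df["x"]))   # selecting a single column gives a Series
print(df.shape)        # (number of rows, number of columns)
```

Both objects carry their labels with them, which is what the alignment rules later in these notes rely on.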

df2=pd.read_csv(r'C:\Users\dell\data.csv')
df2.head()
# Note: when pandas reads an external file, the result is automatically a DataFrame object

	account	name	street	city	state	postal-code	Jan	Feb	Mar
0	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	10000	62000	35000
1	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000
2	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000
3	109996	D'Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000
4	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000

# the index attribute holds the row labels
df2.index
# number of rows
df2.index.size
# column labels
df2.columns
# number of columns
df2.columns.size

df2.shape  # the shape, i.e. (number of rows, number of columns)

(15, 9)

# Another way to get the row and column counts
print("Number of rows:",df2.shape[0])
print("Number of columns:",df2.shape[1])

Number of rows: 15
Number of columns: 9

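# Accessing elements by column name

df2["name"].head()  # first way: the column name goes in the subscript
df2["name"][2]  # with chained subscripts, the column label comes first, then the row label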

0     Kerluke, Koepp and Hilpert
1                 Walter-Trantow
2     Bashirian, Kunde and Price
3    D'Amore, Gleichner and Bode
4                  Bauch-Goldner
Name: name, dtype: object

df2.name.head()  # second way: use the column name as an attribute of the DataFrame
df2.name[2]

0     Kerluke, Koepp and Hilpert
1                 Walter-Trantow
2     Bashirian, Kunde and Price
3    D'Amore, Gleichner and Bode
4                  Bauch-Goldner
Name: name, dtype: object

df2["city"][[2,5]]#fancy indexing

2    New Lilianland
5      Jeremieburgh
Name: city, dtype: object

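# Accessing elements through an indexer; note that both use square brackets
# loc uses explicit (label-based) indexing; iloc uses implicit (position-based) indexing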

df2.loc[1,"street"]

'1311 Alvis Tunnel'

df2.iloc[1,2]  # row position first, then column position

'1311 Alvis Tunnel'
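With the default 0, 1, 2, ... index, `loc` and `iloc` look interchangeable for rows. The following sketch uses a hypothetical frame with a non-default index (the names and values are invented) to make the difference visible.

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "city": ["Oslo", "Rome", "Lima"]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "city"]    # row label 20, column label "city"
by_position = df.iloc[1, 1]      # second row, second column
print(by_label, by_position)

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(df.loc[10:20]))   # rows labelled 10 and 20
print(len(df.iloc[0:1]))    # only the first row
```

Both lookups return the same cell here, but only because label 20 happens to sit at position 1.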

# Deleting a column with the del statement
del df2["street"]
df2.head()

	account	name	city	state	postal-code	Jan	Feb	Mar
0	211829	Kerluke, Koepp and Hilpert	New Jaycob	Texas	28752	10000	62000	35000
1	320563	Walter-Trantow	Port Khadijah	NorthCarolina	38365	95000	45000	35000
2	648336	Bashirian, Kunde and Price	New Lilianland	Iowa	76517	91000	120000	35000
3	109996	D'Amore, Gleichner and Bode	Hyattburgh	Maine	46021	45000	120000	10000
4	121213	Bauch-Goldner	Shanahanchester	California	49681	162000	120000	35000

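# drop() removes rows or columns. By default it returns a new object and leaves the DataFrame itself unchanged; with inplace=True it modifies the object in place.
# df3 is a column selection taken from df2, so the in-place drop below triggers SettingWithCopyWarning.
df3=df2[["account","name","city"]]
df3.drop(["name","city"],axis=1,inplace=True)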

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:3997: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
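One way to sidestep the warning above is sketched below, assuming a small hypothetical frame in place of `df2`. It shows the default, non-mutating form of `drop()` and an explicit `.copy()` before an in-place drop.

```python
import pandas as pd

# Hypothetical stand-in for df2.
df = pd.DataFrame({"account": [1, 2], "name": ["a", "b"], "city": ["x", "y"]})

# drop() returns a new DataFrame by default; df itself is unchanged.
only_account = df.drop(columns=["name", "city"])
print(list(only_account.columns))
print(list(df.columns))

# An explicit copy makes the subset independent of df,
# so dropping columns in place no longer touches a derived slice.
subset = df[["account", "name", "city"]].copy()
subset.drop(["name", "city"], axis=1, inplace=True)
print(list(subset.columns))
```

Assigning the result of `drop()` to a new name is usually the simpler choice, since it never modifies anything behind your back.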

# Filtering rows with a boolean condition
df2[df2.Feb>46000].head()
df2[df2.Jan>46000][["city","name"]].head()

df2[df2.city=='New Jaycob'].count()  # count the non-null values in each column of the matching rows

account        1
name           1
city           1
state          1
postal-code    1
Jan            1
Feb            1
Mar            1
dtype: int64
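The filters above work because a comparison on a column yields a boolean Series that is then used as a row mask. A small sketch with invented figures (the frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sales figures.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "Jan": [10000, 95000, 91000, 45000],
    "Feb": [62000, 45000, 120000, 120000],
})

mask = df["Jan"] > 46000          # boolean Series, one value per row
print(mask.tolist())

# Combine conditions with & and |, wrapping each one in parentheses.
both = df[(df["Jan"] > 46000) & (df["Feb"] > 46000)]
print(both)

# The number of matching rows is the sum of the mask (True counts as 1).
matches = int((df["city"] == "A").sum())
print(matches)
```

Summing the mask gives a single row count, whereas `count()` on the filtered frame reports non-null counts per column.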

dff=df2[["name","Feb"]]
dff.sort_values(by="Feb",axis=0,ascending=True).head()  # sort the rows by the values in the Feb column

	name	Feb
11	Hahn-Moore	10000
1	Walter-Trantow	45000
0	Kerluke, Koepp and Hilpert	62000
7	Kovacek-Johnston	95000
8	Champlin-Morar	95000
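As a complement to the ascending sort above, this sketch sorts in descending order and adds a second key as a tie-breaker. The frame is hypothetical; only `sort_values` and its `by` and `ascending` parameters are taken from pandas.

```python
import pandas as pd

# Hypothetical data with a tie in the Feb column.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "Feb": [62000, 45000, 120000, 45000],
})

ascending = df.sort_values(by="Feb")
print(ascending)

# Largest Feb first; rows with equal Feb are ordered by name.
ranked = df.sort_values(by=["Feb", "name"], ascending=[False, True])
print(ranked)
```

`sort_values` returns a new DataFrame and keeps the original row labels, which is why the index in the output above is no longer in order.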

Arithmetic operations

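# Rule: arithmetic between DataFrames first aligns the row and column labels (taking their union, with NaN for labels present on only one side), then computes element by element on the aligned structure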
df4=pd.DataFrame(np.arange(6).reshape(2,3))
df4

	0	1	2
0	0	1	2
1	3	4	5

df5=pd.DataFrame(np.arange(10).reshape(2,5))
df5

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9

df4+df5

	0	1	2	3	4
0	0	2	4	NaN	NaN
1	8	10	12	NaN	NaN
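In the example above the row labels coincide, so only the columns needed padding. The sketch below, with hypothetical frames `left` and `right`, shows that alignment happens by label on both axes rather than by position.

```python
import pandas as pd

# Hypothetical frames whose row and column labels only partly overlap.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["r2", "r3"])

total = left + right
print(total)

# Only the cell present in both frames (row "r2", column "b") has a value.
print(total.loc["r2", "b"])
```

The result covers the union of the labels (three rows, three columns), and every cell that is missing from either operand is NaN.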

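# In data analysis, however, the method forms (add, sub, ...) are usually preferred over the bare operators, because they are more flexible and accept extra parameters such as the fill value and the axis to compute along
df6=df4.add(df5,fill_value=10)
# DataFrame and Series arithmetic works row by row: the Series index is matched against the column labels (axis 1) and the Series is applied to every row. Missing labels give NaN; the Series values are not recycled to fill the row
s1=pd.Series(np.arange(3))
df6-s1
# equivalent to df6.sub(s1,axis=1)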

	0	1	2	3	4
0	0.0	1.0	2.0	NaN	NaN
1	8.0	9.0	10.0	NaN	NaN
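The two ideas in this step, `fill_value` and the `axis` argument, can be sketched on small hypothetical frames (all names and numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical frames sharing rows but only one column.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
other = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["r1", "r2"])

# fill_value stands in for a cell that is missing on one side only.
filled = df.add(other, fill_value=0)
print(filled)

# axis=1 (the default) matches the Series index against column labels.
per_column = pd.Series({"x": 1, "y": 3})
by_columns = df.sub(per_column, axis=1)
print(by_columns)

# axis=0 matches the Series index against row labels instead.
per_row = pd.Series({"r1": 1, "r2": 2})
by_rows = df.sub(per_row, axis=0)
print(by_rows)
```

The plain `-` operator always behaves like `axis=1`, so subtracting a per-row Series is one case where the method form is needed.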

df4.rolling(2).sum()  # down each column, add each element to the one before it (window of 2); the first row has no predecessor, so it is NaN

	0	1	2
0	NaN	NaN	NaN
1	3.0	5.0	7.0
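A short sketch of the rolling sum on a frame with the same contents as `df4`. The `min_periods` argument is an addition not used in the original notes; it is a real `rolling` parameter, shown here only as an optional variation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))   # same contents as df4

# sum() must be called; without the parentheses you only get the method object.
rolled = df.rolling(2).sum()
print(rolled)

# min_periods=1 lets an incomplete window produce a value instead of NaN.
partial = df.rolling(2, min_periods=1).sum()
print(partial)
```

With the default settings a window yields NaN until it holds two values, which is why the first row is empty.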

df4.cov()  # covariance matrix

	0	1	2
0	4.5	4.5	4.5
1	4.5	4.5	4.5
2	4.5	4.5	4.5

df4.corr()  # correlation coefficient matrix

	0	1	2
0	1.0	1.0	1.0
1	1.0	1.0	1.0
2	1.0	1.0	1.0

df7=df4.T
df7  # the transpose

	0	1
0	0	3
1	1	4
2	2	5

Handling missing values

# Check whether a DataFrame is empty with the empty attribute
df3.empty

False

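# Note that None and NaN behave differently in plain Python:
# None cannot take part in arithmetic (it raises TypeError), whereas NaN can (the result is NaN)
# pandas treats both as missing values, so data containing either can still be used in computations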

np.nan+1
np.nan-np.nan

nan
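The difference between None and NaN described above can be checked directly. This is a minimal sketch; the Series `s` is a made-up example.

```python
import numpy as np
import pandas as pd

# Plain Python: None raises, NaN propagates.
try:
    None + 1
except TypeError as exc:
    print("TypeError:", exc)
print(np.nan + 1)

# In a numeric Series, None is stored as NaN.
s = pd.Series([1, None, np.nan])
print(s.dtype)
print(s.isnull().tolist())

# Reductions skip missing values by default.
print(s.sum())
```

Because the missing entries are skipped, the sum reflects only the one real value in the Series.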

A=pd.DataFrame(np.array([10,10,20,20]).reshape(2,2),columns=list("ab"),index=list("sw"))
A
A.stack()

s  a    10
   b    10
w  a    20
   b    20
dtype: int32

A.mean()

a    15.0
b    15.0
dtype: float64

# Four important functions for handling missing values in a DataFrame: isnull, notnull, dropna, fillna
A.notnull()

	a	b
s	True	True
w	True	True
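The frame `A` has no missing values, so the four functions are easier to see on data that does. A sketch with a hypothetical frame containing two NaN cells:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells.
df = pd.DataFrame({"a": [10.0, np.nan, 20.0], "b": [1.0, 2.0, np.nan]})

print(df.isnull())            # True where a value is missing
print(df.notnull())           # the inverse mask

dropped = df.dropna()         # keep only rows with no missing values
filled = df.fillna(0)         # replace every missing value with 0
print(dropped)
print(filled)
```

Both `dropna()` and `fillna()` return new objects here; `df` still contains its NaN cells afterwards.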
