Getting Started with Pandas

April123abc

Article link: https://blog.csdn.net/April123abc/article/details/134719904

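pandas: Python Data Analysis Library

Install: pip install pandas

Import: import pandas as pd

pandas has two core data structures: Series and DataFrame.

A Series represents a single column, and multiple Series together make up a two-dimensional DataFrame.

This post uses Jupyter Notebook for the code demos.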

Series

index and values

index: returns the index of a Series; when no index is specified it is a RangeIndex

values: returns the values of a Series as an ndarray

import pandas as pd 
import numpy as np 
np.random.rand(5)

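Output: an array of five random floats in [0, 1); the values differ on every run.

Creating a Series

1. From an array

# A Series is a pandas data structure; think of it as a single column in Excel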
s = pd.Series(np.random.rand(5))
s

Output: a Series of five random floats, indexed 0 to 4, with dtype float64.

type(s)

Output

pandas.core.series.Series

s.index   # view the index

Output

RangeIndex(start=0, stop=5, step=1)

s.index.tolist()  # convert the index to a list

Output

[0, 1, 2, 3, 4]

s.values  # view the values

Output: a NumPy ndarray holding the five random values.

2. From a dictionary

dic = {"red":100, "black":500,"green":300,"pink":900}
print(dic)
print("*"*20)
# Convert the dictionary to a Series
s2 = pd.Series(dic)
print(s2)

Output

{'red': 100, 'black': 500, 'green': 300, 'pink': 900}
********************
red 100
black 500
green 300
pink 900
dtype: int64

The dictionary keys become the row index.

arr = np.random.randn(5)
s = pd.Series(arr,index=['a','b','c','d','e'],dtype=object)
s

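Output: a Series of five random values labeled a to e, with dtype object.

a, b, c, d, e form the label index; the automatically generated 0, 1, 2, ... form the positional index.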
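
To make the difference between the two kinds of index concrete, here is a minimal sketch. It uses fixed values instead of random ones so the result is reproducible; the variable names by_label and by_position are just illustrative.

```python
import pandas as pd

# Fixed values (an assumption for reproducibility) with the same labels as above
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']      # look up by label
by_position = s.iloc[2]    # look up by position (the third element)

print(by_label, by_position)
```

Both lookups reach the same element: .loc goes through the label, .iloc through the position.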

3. From a scalar value with index labels

s = pd.Series(88, index=range(3))
s

Output

0 88
1 88
2 88
dtype: int64

Indexing and slicing

se1 = pd.Series(data=[1,2,3,4],index=list('abcd'))
print(se1)
print("*"*20)
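# Get the value 2
# se1['b']       # by label
# se1.iloc[1]    # by position (se1[1] on a label index is deprecated in recent pandas versions)
# Get the values 1 to 3
se1['a':'c']     # a label slice includes the end label
# se1.iloc[0:3]  # a positional slice excludes the end position

Output

a 1
b 2
c 3
d 4
dtype: int64
********************
a 1
b 2
c 3
dtype: int64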

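reindex(): re-indexing
The pairing between each label and its value does not change, but the returned Series follows the order of the new index. Labels that do not exist in the original data get NaN as their value.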

print(se1.reindex(['b','a','c','f','d']))

Output

b 2.0
a 1.0
c 3.0
f NaN
d 4.0
dtype: float64
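
If NaN is not wanted for the missing label, reindex() accepts a fill_value argument. A small sketch, reusing se1 from above (the name filled is just illustrative):

```python
import pandas as pd

se1 = pd.Series(data=[1, 2, 3, 4], index=list('abcd'))

# fill_value replaces the NaN that reindex would otherwise insert for 'f'
filled = se1.reindex(['b', 'a', 'c', 'f', 'd'], fill_value=0)
print(filled)
```

Since no NaN is introduced, the values do not need to be converted to float here.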

rename(): change index labels directly

print(se1.rename({'a':'A','b':'B'}))   # the dict maps each old label to its new label

Output

A 1
B 2
c 3
d 4
dtype: int64
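
rename() also accepts a function that is applied to every label. The sketch below uses str.upper as one possible choice and also shows that the original Series is left untouched:

```python
import pandas as pd

se1 = pd.Series(data=[1, 2, 3, 4], index=list('abcd'))

# The function is applied to each index label; a new Series is returned
upper = se1.rename(str.upper)

print(upper.index.tolist())
print(se1.index.tolist())   # the original labels are unchanged
```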

Series attributes

data = [1,2,3,4,5,6]
index = ['A','B','C','D','E','F']
s1 = pd.Series(data, index=index,name='MySeries')
print(s1.dtype)  # element type
print(s1.size)   # number of elements
print(s1.shape)  # shape as a tuple; a Series is one-dimensional, so it is (n,)

Output

int64
6
(6,)

Common operations

Data alignment

# Data alignment
s1 = pd.Series(np.random.rand(3),index=['Jack','Mary','Willa'])
print(s1)
print("*"*20)
s2 = pd.Series(np.random.rand(3),index=['Wang','Willa','Mary'])
print(s2)
print("*"*20)
print(s1+s2)     # values are matched by label; a result is computed only for labels present in both Series, the rest become NaN

Output

Jack 0.281982
Mary 0.810402
Willa 0.258101
dtype: float64
********************
Wang 0.211839
Willa 0.357276
Mary 0.614404
dtype: float64
********************

Jack NaN
Mary 1.424806
Wang NaN
Willa 0.615377
dtype: float64
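
When the NaN results are not wanted, the add() method with fill_value treats a missing label as a given value instead. A minimal sketch with fixed numbers (chosen here only so the sums are easy to check):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['Jack', 'Mary', 'Willa'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Wang', 'Willa', 'Mary'])

# A label missing from one side is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Jack and Wang keep their own values, while Mary and Willa are summed across both Series.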

Deleting

s = pd.Series(np.random.rand(5),index=list('abcde'))
print(s)
print("*"*20)
# Drop b
print(s.drop('b'))
print("*"*20)
# Drop c and d
print(s.drop(['c','d']))

Output

a 0.259408
b 0.277649
c 0.792446
d 0.232483
e 0.561788
dtype: float64
********************

a 0.259408
c 0.792446
d 0.232483
e 0.561788
dtype: float64
********************

a 0.259408
b 0.277649
e 0.561788
dtype: float64
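
The output above already hints at it: the second drop() still contains b, because drop() returns a new Series and leaves the original alone. A short sketch with fixed values (the name dropped is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

dropped = s.drop('b')   # a new Series without 'b'

print(len(s), len(dropped))
print('b' in s.index, 'b' in dropped.index)
```

To keep the change, assign the result back, for example s = s.drop('b').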

Adding and modifying

s2 = pd.Series(np.random.randn(5))
print(s2)
print("*"*20)
# Add an entry with index label 5
s2[5] = 99       # the label does not exist yet, so a new entry is added
print(s2)
print("*"*20)
# Modify the entry with index label 2
s2[2] = 88
print(s2)

Output

0 0.848025
1 -0.295804
2 0.513866
3 -0.158378
4 -0.411714
dtype: float64
********************

0 0.848025
1 -0.295804
2 0.513866
3 -0.158378
4 -0.411714
5 99.000000
dtype: float64
********************

0 0.848025
1 -0.295804
2 88.000000
3 -0.158378
4 -0.411714
5 99.000000
dtype: float64
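
The same add-or-overwrite behavior can be written explicitly with .loc, and pd.concat is one way to combine Series without modifying either of them. A sketch with fixed values; extra and combined are illustrative names:

```python
import pandas as pd

s2 = pd.Series([1.0, 2.0, 3.0])

s2.loc[3] = 99      # label 3 does not exist yet, so a new entry is added
s2.loc[1] = 88      # label 1 exists, so its value is overwritten

# pd.concat builds a new Series from several pieces
extra = pd.Series([7.0], index=[4])
combined = pd.concat([s2, extra])
print(combined)
```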

DataFrame

Creating a DataFrame

1. From a two-dimensional array

# 1. Create from a two-dimensional array
df1 = pd.DataFrame(np.random.randint(0,20,(4,5)),index=[1,2,3,4],columns=['a','b','c','d','e'])
print(df1)

Output

a b c d e
1 16 14 9 2 13
2 8 7 9 5 15
3 8 7 16 8 13
4 13 15 19 11 19

2. From a dictionary

# 2. Create from a dictionary
data1 = {'a':[1,2,3],
         'b':[3,4,5],
         'c':[5,6,7]
}
df2 = pd.DataFrame(data1,index=[1,2, 3])
print(df2,type(df2))

Output

a b c
1 1 3 5
2 2 4 6
3 3 5 7 <class 'pandas.core.frame.DataFrame'>

3. From a list of dictionaries

data4 = [{'one':1,'two':2},{'one':3,'two':4,'three':5}]
print(data4)
print("*"*20)
df3 = pd.DataFrame(data4)
print(df3)

Output

[{'one': 1, 'two': 2}, {'one': 3, 'two': 4, 'three': 5}]
********************
one two three
0 1 2 NaN
1 3 4 5.0

4. From a dictionary of dictionaries

data5 = {'Jack':{'math':90,'english':89,'art':78},
         'Mary':{'math':82,'english':95,'art':92},
         'Tom':{'math':78,'english':67}}
df4 = pd.DataFrame(data5)
print(df4)

Output

Jack Mary Tom
math 90 82 78.0
english 89 95 67.0
art 78 92 NaN

Note: the inner keys become the row index, and the outer keys become the column labels.
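
If the outer keys should be the rows instead, one option is DataFrame.from_dict with orient='index'. A minimal sketch reusing data5 (the name by_student is illustrative):

```python
import pandas as pd

data5 = {'Jack': {'math': 90, 'english': 89, 'art': 78},
         'Mary': {'math': 82, 'english': 95, 'art': 92},
         'Tom': {'math': 78, 'english': 67}}

# orient='index' turns the outer keys into row labels instead of column labels
by_student = pd.DataFrame.from_dict(data5, orient='index')
print(by_student)
```

Tom has no art score, so that cell is NaN, just as in the column-oriented version.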

A small practical example

Writing several tables into the same workbook

data1 = {'name':['rose','jack','tom'],
         'age':[20,30,35],
         'city':['CS','BJ','SH']}
df1 = pd.DataFrame(data1)
print(df1)
print("*"*20)

data2 = {'product':['A','B','C'],
         'price':[20,30,50],
         'quantity':[100,200,300]}
df2 = pd.DataFrame(data2)
print(df2)
print("*"*20)

# Create an ExcelWriter object; haha.xlsx is the workbook, and each to_excel call writes one sheet
with pd.ExcelWriter('haha.xlsx',engine='openpyxl') as ex:
    df1.to_excel(ex,sheet_name='表1',index=False)
    df2.to_excel(ex,sheet_name='表2',index=False)

Output

name age city
0 rose 20 CS
1 jack 30 BJ
2 tom 35 SH
********************
product price quantity
0 A 20 100
1 B 30 200
2 C 50 300
********************

The workbook haha.xlsx is created successfully.

Common DataFrame attributes and methods

data = {'name':['Jack','Tom','Mary'],
        'age':[18,19,20],
        'gender':['F','M','F']}
df = pd.DataFrame(data, index=['a','b','c'])
print(df)
print("*"*20)
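# Number of rows and columns
print(df.shape)
# The values attribute
print(df.values)
# Number of dimensions
print(df.ndim)
# First 2 rows
print(df.head(2))  # head() shows the first 5 rows by default
# Last 2 rows
print(df.tail(2))  # tail() shows the last 5 rows by default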

Output

name age gender
a Jack 18 F
b Tom 19 M
c Mary 20 F
********************
(3, 3)
[['Jack' 18 'F']
['Tom' 19 'M']
['Mary' 20 'F']]
2
name age gender
a Jack 18 F
b Tom 19 M
name age gender
b Tom 19 M
c Mary 20 F
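
A few more attributes are often useful next to shape and values: index, columns, and dtypes. A short sketch on the same df:

```python
import pandas as pd

data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['F', 'M', 'F']}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

print(df.index.tolist())     # row labels
print(df.columns.tolist())   # column labels
print(df.dtypes)             # data type of each column
```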

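loc and iloc

# Get the first row. Plain [] with a label selects a column by default
print(df[0:1])   # rows can be selected with a slice
print("*"*20)
print(df['name'] )  # columns can be selected by column label
print("*"*20)
# Get the name column of the first two rows
print(df[0:2]['name'])
print("*"*20)
# To be safe, use loc or iloc
# loc[] selects by label
# iloc[] selects by position
# Get the value 18
print(df.iloc[0,1])  # inside the brackets, rows come before the comma and columns after it
print(df.loc['a','age']) 
print("*"*20)
# Get multiple rows and columns
print(df.iloc[0:2,1:2])

Output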

name age gender
a Jack 18 F
********************

a Jack
b Tom
c Mary
Name: name, dtype: object
********************
a Jack
b Tom
Name: name, dtype: object
********************
18
18
********************
age
a 18
b 19
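
As with a Series, label slices in loc include the end label, while positional slices in iloc exclude the end position. The sketch below shows that, plus a boolean condition inside loc; rows_by_label, rows_by_pos and older are illustrative names.

```python
import pandas as pd

data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['F', 'M', 'F']}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

rows_by_label = df.loc['a':'b', ['name', 'age']]   # the label slice includes 'b'
rows_by_pos = df.iloc[0:2, 0:2]                    # the positional slice stops before row 2 and column 2

# A boolean condition selects rows; the second argument picks the column
older = df.loc[df['age'] > 18, 'name']

print(rows_by_label)
print(rows_by_pos)
print(older)
```

The first two selections return the same two rows and two columns.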

Sorting

1. Sort by value: .sort_values()

df.sort_values(by='age')  # sort by the age column, ascending by default

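Output

name age gender
a Jack 18 F
b Tom 19 M
c Mary 20 F

To sort in descending order:

df.sort_values(by='age',ascending=False)

Output

name age gender
c Mary 20 F
b Tom 19 M
a Jack 18 F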
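
sort_values() can also sort by several columns at once, with a separate direction for each. A sketch on the same df; the choice of gender first and age second is just for illustration.

```python
import pandas as pd

data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['F', 'M', 'F']}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Sort by gender ascending, then by age descending within each gender
by_gender_then_age = df.sort_values(by=['gender', 'age'], ascending=[True, False])
print(by_gender_then_age)
```

The two F rows come first with the older person on top, followed by the M row.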

2. Sort by index: .sort_index()

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index=[5,4,3,2],
                   columns=['a','b','c','d'])
print(df1)
print("*"*20)
print(df1.sort_index())  # ascending by default

Output

a b c d
5 83.407554 64.455458 13.757196 99.695256
4 80.902137 11.005131 67.157480 56.424335
3 97.388129 21.861354 88.210198 53.802029
2 70.590632 35.586542 28.422749 5.635254
********************
a b c d
2 70.590632 35.586542 28.422749 5.635254
3 97.388129 21.861354 88.210198 53.802029
4 80.902137 11.005131 67.157480 56.424335
5 83.407554 64.455458 13.757196 99.695256

3. Ranking: .rank()

s1 = pd.Series([7,-5,7,4,2,0,4])
print(s1.rank())  # ties get the average rank by default: the two 7s occupy ranks 6 and 7, so each gets 6.5, which is why decimals appear

Output

0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64

print(s1.rank(method='first'))   # ties are ranked by order of appearance: the first 7 gets rank 6 and the second gets rank 7, which is arguably a bit unfair

Output

0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
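
Besides the default average and 'first', rank() has other tie-breaking methods. The sketch below tries 'min' and 'dense' on the same data; min_rank and dense_rank are illustrative names.

```python
import pandas as pd

s1 = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': tied values all get the lowest rank of their group
min_rank = s1.rank(method='min')

# 'dense': like 'min', but the next distinct value gets the next rank with no gaps
dense_rank = s1.rank(method='dense')

print(min_rank.tolist())
print(dense_rank.tolist())
```

With 'min' both 7s share rank 6; with 'dense' they share rank 5, because the two 4s only use up one rank.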
