pandas basic data structures.md

import pandas as pd
import numpy as np

pandas

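1. Series data structure (a Series is a labeled one-dimensional array that can hold any data type: integers, strings, floats, Python objects, and so on; its axis labels are collectively called the index)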

1.1 Creating objects

1.1.1 From a dict: the dict's keys become the index and its values become the values
dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
pd.Series(dic)
4    4
5    5
a    1
b    2
c    3
dtype: int64
1.1.2 From an array (one-dimensional)
arr = np.random.randn(5)
arr
array([-0.07012061,  0.11267954, -0.39431225,  1.01689252, -1.09012858])
pd.Series(arr, index=['a','b','c','d','e'], dtype=object)  # np.object was removed in NumPy 1.24; use the built-in object instead
a   -0.0701206
b      0.11268
c    -0.394312
d      1.01689
e     -1.09013
dtype: object
1.1.3 From a scalar
pd.Series(10, index = range(4))  # when data is a scalar, pass an index; the value is repeated to match the length of the index
0    10
1    10
2    10
3    10
dtype: int64
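
As a compact recap of the three construction routes above, here is a minimal sketch that uses fixed values instead of random ones so the results are reproducible. The variable names (`from_dict`, `from_array`, `from_scalar`) are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a one-dimensional array, with explicit labels.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])

# From a scalar: the value is repeated once per index entry.
from_scalar = pd.Series(10, index=range(4))

print(from_dict['b'])        # label lookup on the dict-built Series
print(from_array['y'])       # label lookup on the array-built Series
print(from_scalar.tolist())  # four copies of the scalar
```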

1.2 Attributes

s = pd.Series(np.random.randn(5),name= "test")
s
0    1.157269
1   -0.869221
2   -1.288676
3   -0.313955
4   -0.366169
Name: test, dtype: float64
1.2.1 name
s.name
'test'
1.2.2 dtypes
s.dtypes
dtype('float64')
1.2.3 Quick summary statistics
s.describe()
count    5.000000
mean    -0.336150
std      0.925090
min     -1.288676
25%     -0.869221
50%     -0.366169
75%     -0.313955
max      1.157269
Name: test, dtype: float64
1.2.4 Sorting
s.sort_values()  # sort by value
2   -1.288676
1   -0.869221
4   -0.366169
3   -0.313955
0    1.157269
Name: test, dtype: float64
s.sort_index()  # sort by index
0    1.157269
1   -0.869221
2   -1.288676
3   -0.313955
4   -0.366169
Name: test, dtype: float64
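
Both sorting methods return a new Series and leave the original untouched, and both accept `ascending=False` for descending order. A small sketch with hand-picked values (the name `s_demo` is illustrative):

```python
import pandas as pd

s_demo = pd.Series([3.0, 1.0, 2.0], index=['x', 'y', 'z'], name='demo')

# Descending sort by value; the result is a new Series.
by_value_desc = s_demo.sort_values(ascending=False)

# Descending sort by index label.
by_index_desc = s_demo.sort_index(ascending=False)

print(by_value_desc.index.tolist())  # labels ordered by value, largest first
print(by_index_desc.index.tolist())  # labels in reverse alphabetical order
print(s_demo.index.tolist())         # the original order is unchanged
```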

1.3 Indexing

s = pd.Series(np.random.rand(5),index = ['a','b','c','d','e'])
s
a    0.801643
b    0.915875
c    0.759831
d    0.017935
e    0.989988
dtype: float64
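1.3.1 Positional indexing
s.iloc[0]  # plain s[0] on a label-indexed Series is deprecated since pandas 2.1; use .iloc for positions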
0.80164301518126946
1.3.2 Label indexing
s["a"]
0.80164301518126946
s[["a","c"]]
a    0.801643
c    0.759831
dtype: float64
1.3.3 Slicing
s[1:3]  # positional slice, written the same way as for a list (end point excluded)
b    0.915875
c    0.759831
dtype: float64
s["a":"b"]  # note: slicing by index label includes the end point
a    0.801643
b    0.915875
dtype: float64
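
The difference between the two slicing styles is easy to miss, so here is a minimal sketch that makes the position or label intent explicit with `.iloc` and `.loc`. The values and the name `s_demo` are illustrative.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Positional slice: the end position is excluded, as with a list.
by_position = s_demo.iloc[1:3]

# Label slice: the end label is included.
by_label = s_demo.loc['b':'c']

print(by_position.tolist())  # [20, 30]
print(by_label.tolist())     # [20, 30]
```

Both slices select the same two elements here, but `1:3` stops before position 3 while `'b':'c'` stops at label `'c'`.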
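1.3.4 Boolean indexing
s[s>0.8]  # Boolean indexing: write s[condition], where the condition is an expression that evaluates to a Boolean array, or a Boolean array itself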
a    0.801643
b    0.915875
e    0.989988
dtype: float64
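
Conditions can also be combined. A hedged sketch with fixed values (the names `s_demo`, `mask`, and `in_range` are illustrative): each condition needs its own parentheses, and `&` / `|` are used instead of `and` / `or`.

```python
import pandas as pd

s_demo = pd.Series([0.2, 0.9, 0.5, 0.85], index=list('abcd'))

# A Boolean Series with the same index as s_demo.
mask = s_demo > 0.8
selected = s_demo[mask]

# Two conditions combined element-wise with &.
in_range = s_demo[(s_demo > 0.3) & (s_demo < 0.9)]

print(selected.index.tolist())  # ['b', 'd']
print(in_range.index.tolist())  # ['c', 'd']
```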

1.4 Basic techniques

s = pd.Series(np.random.rand(50))
1.4.1 Viewing data
s.head()
0    0.782876
1    0.051283
2    0.649146
3    0.142833
4    0.383419
dtype: float64
s.tail()
45    0.755592
46    0.844982
47    0.726627
48    0.378381
49    0.201560
dtype: float64
1.4.2 Reindexing
s = pd.Series(np.random.rand(3), index = ['a','b','c'])
s
a    0.572041
b    0.638441
c    0.209887
dtype: float64
s1 = s.reindex(['c','b','a','d'])  # the label 'd' does not exist in s, so its value is NaN
s1
c    0.209887
b    0.638441
a    0.572041
d         NaN
dtype: float64
s2 = s.reindex(['c','b','a','d'], fill_value = 0)
s2
c    0.209887
b    0.638441
a    0.572041
d    0.000000
dtype: float64
1.4.3 Alignment
s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
s1
Jack     0.266121
Marry    0.899194
Tom      0.629672
dtype: float64
s2
Wang     0.350321
Jack     0.602482
Marry    0.081977
dtype: float64
s1+s2
Jack     0.868602
Marry    0.981171
Tom           NaN
Wang          NaN
dtype: float64
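
Labels present in only one operand produce NaN, as seen above. If a missing label should instead count as zero, `Series.add` with `fill_value` is one option. A minimal sketch with fixed values; the names `a` and `b` are illustrative.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['Jack', 'Marry', 'Tom'])
b = pd.Series([10.0, 20.0, 30.0], index=['Wang', 'Jack', 'Marry'])

# Plain addition aligns on labels; unmatched labels become NaN.
plain = a + b

# fill_value=0 substitutes 0 for a label missing from one side.
filled = a.add(b, fill_value=0)

print(plain['Jack'])       # 21.0
print(plain.isna().sum())  # 2
print(filled['Tom'])       # 3.0
print(filled['Wang'])      # 10.0
```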
1.4.4 Adding, modifying, and deleting
s = pd.Series(np.random.rand(5), index = list('ngjur'))
s
n    0.722115
g    0.999095
j    0.350186
u    0.763943
r    0.944230
dtype: float64
s.drop("n")
g    0.999095
j    0.350186
u    0.763943
r    0.944230
dtype: float64
s.drop(["g","n"])
j    0.350186
u    0.763943
r    0.944230
dtype: float64
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
pd.concat([s1, s2])  # Series.append was removed in pandas 2.0; pd.concat is the replacement
0    0.716950
1    0.382762
2    0.518129
3    0.849587
4    0.322931
n    0.321734
g    0.818017
j    0.129185
u    0.134461
r    0.327531
dtype: float64
s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
s['a'] = 100
s[['b','c']] = 200
print(s)
a    0.114684
b    0.491650
c    0.482090
dtype: float64
a    100.0
b    200.0
c    200.0
dtype: float64
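
When two Series are joined, each keeps its own index labels, which can leave a mix of integers and strings. The sketch below, with illustrative names `left` and `right`, shows `pd.concat` with and without `ignore_index=True`; neither input is modified.

```python
import pandas as pd

left = pd.Series([1, 2], index=[0, 1])
right = pd.Series([3, 4], index=['n', 'g'])

# Keep the original labels of both inputs.
joined = pd.concat([left, right])

# Discard the labels and renumber from 0.
renumbered = pd.concat([left, right], ignore_index=True)

print(joined.index.tolist())      # [0, 1, 'n', 'g']
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```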

2. DataFrame data structure (a DataFrame is a tabular data structure, that is, a labeled two-dimensional array with an index (row labels) and columns (column labels))

2.1 Creating objects

2.1.1 From a dict of arrays/lists
data1 = {'a':[1,2,3],
        'b':[3,4,5],
        'c':[5,6,7]}
pd.DataFrame(data1)
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7
2.1.2 From a dict of Series
data1 = {'one':pd.Series(np.random.rand(2)),
        'two':pd.Series(np.random.rand(3))}
pd.DataFrame(data1)
        one       two
0  0.063242  0.413140
1  0.738629  0.572936
2       NaN  0.153727
2.1.3 Directly from a two-dimensional array
ar = np.random.rand(9).reshape(3,3)
ar
array([[ 0.69920788,  0.63388493,  0.95545456],
       [ 0.85046889,  0.62151678,  0.95159924],
       [ 0.36499264,  0.09285466,  0.3064868 ]])
pd.DataFrame(ar,index = ['a', 'b', 'c'], columns = ['one','two','three'])
        one       two     three
a  0.699208  0.633885  0.955455
b  0.850469  0.621517  0.951599
c  0.364993  0.092855  0.306487
2.1.4 From a list of dicts
data = [{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 20}]
data
[{'one': 1, 'two': 2}, {'one': 5, 'three': 20, 'two': 10}]
pd.DataFrame(data)
   one  three  two
0    1    NaN    2
1    5   20.0   10
2.1.5 From a dict of dicts
data = {'Jack':{'math':90,'english':89,'art':78},
       'Marry':{'math':82,'english':95,'art':92},
       'Tom':{'math':78,'english':67}}
pd.DataFrame(data)
         Jack  Marry   Tom
art        78     92   NaN
english    89     95  67.0
math       90     82  78.0
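
In the dict-based constructors above, rows are matched by label rather than by position, and gaps are filled with NaN. A minimal sketch with hand-picked values (the name `df_demo` is illustrative) makes that visible by giving the two Series different indexes:

```python
import pandas as pd

data = {
    'one': pd.Series([1.0, 2.0], index=['a', 'b']),
    'two': pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c']),
}

# The row index is the union of both Series indexes.
df_demo = pd.DataFrame(data)

print(df_demo.shape)                     # (3, 2)
print(pd.isna(df_demo.loc['c', 'one']))  # True: 'one' has no value for row 'c'
```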

2.2 Indexing

2.2.1 Selecting columns
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
df
               a          b          c          d
one    44.519620  50.015625  30.079694  42.898252
two    10.109428  55.797363  53.288322   4.932462
three  82.247965  78.284970  69.452170  27.152858
# select directly by column name
df[["a","c"]]
               a          c
one    44.519620  30.079694
two    10.109428  53.288322
three  82.247965  69.452170
df.loc[:,["a","c"]]
               a          c
one    44.519620  30.079694
two    10.109428  53.288322
three  82.247965  69.452170
# integer positions also work
df.iloc[:,[0,2]]
               a          c
one    44.519620  30.079694
two    10.109428  53.288322
three  82.247965  69.452170
2.2.2 Selecting rows
df.loc["one"]
a    44.519620
b    50.015625
c    30.079694
d    42.898252
Name: one, dtype: float64
df.loc[["one","two"]]
             a          b          c          d
one  44.519620  50.015625  30.079694  42.898252
two  10.109428  55.797363  53.288322   4.932462
df.iloc[0]
a    44.519620
b    50.015625
c    30.079694
d    42.898252
Name: one, dtype: float64
df.iloc[[0,1]]
             a          b          c          d
one  44.519620  50.015625  30.079694  42.898252
two  10.109428  55.797363  53.288322   4.932462
2.2.3 Slicing
df
               a          b          c          d
one    44.519620  50.015625  30.079694  42.898252
two    10.109428  55.797363  53.288322   4.932462
three  82.247965  78.284970  69.452170  27.152858
# by label (end points included)
df.loc["one":"two","a":"c"]
             a          b          c
one  44.519620  50.015625  30.079694
two  10.109428  55.797363  53.288322
# by integer position (end points excluded)
df.iloc[0:2,0:3]
             a          b          c
one  44.519620  50.015625  30.079694
two  10.109428  55.797363  53.288322
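
The two slices above return the same block because `.loc` includes both end labels while `.iloc` excludes the end position. A small sketch with deterministic data (the name `df_demo` is illustrative) checks this directly:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=['one', 'two', 'three'],
    columns=['a', 'b', 'c', 'd'],
)

# Label slice: 'two' and 'c' are included.
by_label = df_demo.loc['one':'two', 'a':'c']

# Positional slice: row 2 and column 3 are excluded.
by_position = df_demo.iloc[0:2, 0:3]

print(by_label.equals(by_position))  # True
print(by_label.shape)                # (2, 3)
```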
2.2.4 Boolean filtering
df[df<40]
               a    b          c          d
one          NaN  NaN  30.079694        NaN
two    10.109428  NaN        NaN   4.932462
three        NaN  NaN        NaN  27.152858
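
Indexing with a Boolean DataFrame keeps the shape and masks the non-matching cells with NaN, as above. Indexing with a Boolean Series built from a single column filters whole rows instead. A hedged sketch with small integer data (the names `df_demo`, `masked`, and `rows` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'a': [44, 10, 82], 'b': [50, 55, 78]},
    index=['one', 'two', 'three'],
)

# Element-wise mask: same shape, NaN where the condition is False.
masked = df_demo[df_demo < 50]

# Row filter: keep only the rows where column 'a' is below 50.
rows = df_demo[df_demo['a'] < 50]

print(masked.shape)         # (3, 2)
print(rows.index.tolist())  # ['one', 'two']
```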

2.3 Basic techniques

2.3.1 Viewing data and transposing
df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,
                   columns = ['a','b'])
df
           a          b
0  93.694076  10.585479
1  20.906019   0.805435
2  60.688091  44.387455
3  94.554004  11.026580
4  51.196744  60.110108
5  49.554107  77.915304
6   4.947558  90.967949
7  13.152346  96.102279
df.head(2)
           a          b
0  93.694076  10.585479
1  20.906019   0.805435
df.tail(2)
           a          b
6   4.947558  90.967949
7  13.152346  96.102279
df.T
           0          1          2          3          4          5          6          7
a  93.694076  20.906019  60.688091  94.554004  51.196744  49.554107   4.947558  13.152346
b  10.585479   0.805435  44.387455  11.026580  60.110108  77.915304  90.967949  96.102279
2.3.2 Adding, modifying, and deleting values
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
df
           a          b          c          d
0  72.752063  26.046342  12.064456  98.062747
1  11.057977  80.180406  37.311464  36.185763
2  20.363733  44.369824  94.950827  55.851955
3  79.797658  53.622336  31.726099  83.414271
df["e"] = 20
df
           a          b          c          d   e
0  72.752063  26.046342  12.064456  98.062747  20
1  11.057977  80.180406  37.311464  36.185763  20
2  20.363733  44.369824  94.950827  55.851955  20
3  79.797658  53.622336  31.726099  83.414271  20
df[['a','c']] = 100
df
     a          b    c          d   e
0  100  26.046342  100  98.062747  20
1  100  80.180406  100  36.185763  20
2  100  44.369824  100  55.851955  20
3  100  53.622336  100  83.414271  20
# drop rows
df.drop([1,2])
     a          b    c          d   e
0  100  26.046342  100  98.062747  20
3  100  53.622336  100  83.414271  20
# drop columns
df.drop(["d"],axis=1)
     a          b    c   e
0  100  26.046342  100  20
1  100  80.180406  100  20
2  100  44.369824  100  20
3  100  53.622336  100  20
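
Note that `drop` returns a new DataFrame, which is why the dropped rows and column reappear in each output above. A minimal sketch (the names `df_demo`, `without_row`, and `without_col` are illustrative); `columns=[...]` is an equivalent spelling of `axis=1`:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Drop a row by its index label.
without_row = df_demo.drop([1])

# Drop a column; same effect as df_demo.drop(['b'], axis=1).
without_col = df_demo.drop(columns=['b'])

print(without_row.index.tolist())    # [0, 2]
print(without_col.columns.tolist())  # ['a']
print(df_demo.shape)                 # (3, 2): the original is unchanged
```

To keep the result, assign it back to a name, for example `df_demo = df_demo.drop(columns=['b'])`.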
2.3.3 Alignment
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df1
          A         B         C         D
0  1.089101  0.371419 -0.096348  0.397983
1 -0.979242 -0.849951  0.054136 -1.596409
2 -0.133808 -1.436406 -0.062871  0.376788
3 -0.676031 -0.157631 -0.533043 -0.510467
4  0.390883 -1.253727  0.177204 -0.002852
5  0.825071 -0.163355 -1.204615  0.742660
6  1.377831 -1.170601 -0.734310 -1.271898
7  1.163899  0.069660 -0.889569 -1.143764
8 -1.770280  0.073562 -1.331347  0.158275
9  0.769114 -1.269013 -0.830343 -0.615827
df2
          A         B         C
0  0.064734 -0.016315  0.251051
1  0.080493 -0.621427 -0.362038
2  0.552462 -0.429362 -0.145449
3  0.970827  2.155149 -0.748711
4 -0.641491 -1.133494  1.383980
5  0.540944  0.905777  0.703850
6 -2.282045 -0.097482 -1.760575
df1 + df2
          A         B         C   D
0  1.153836  0.355103  0.154703 NaN
1 -0.898749 -1.471378 -0.307902 NaN
2  0.418654 -1.865768 -0.208320 NaN
3  0.294796  1.997519 -1.281754 NaN
4 -0.250609 -2.387221  1.561184 NaN
5  1.366015  0.742422 -0.500765 NaN
6 -0.904214 -1.268082 -2.494885 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
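
Addition aligns on both row and column labels, so any cell that exists in only one frame becomes NaN. If such cells should be treated as zero instead, `DataFrame.add` with `fill_value=0` is one option. A hedged sketch with small fixed frames (the names `left` and `right` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
right = pd.DataFrame({'A': [10.0, 20.0]})

# Plain addition: row 2 and all of column B have no partner, so they are NaN.
plain = left + right

# fill_value=0 treats a cell missing from one frame as 0.
filled = left.add(right, fill_value=0)

print(plain['B'].isna().all())  # True
print(filled['A'].tolist())     # [11.0, 22.0, 3.0]
print(filled['B'].tolist())     # [4.0, 5.0, 6.0]
```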
2.3.4 Sorting
df= pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
df
           a          b          c          d
0   5.152197  49.079658  69.335822  75.125638
1  83.086432  55.414763   9.856352  18.925750
2  30.280855  77.819176   6.757983  59.269951
3  21.799211   0.693387  69.473753  84.004438
df.sort_values(["a"],ascending=True)
           a          b          c          d
0   5.152197  49.079658  69.335822  75.125638
3  21.799211   0.693387  69.473753  84.004438
2  30.280855  77.819176   6.757983  59.269951
1  83.086432  55.414763   9.856352  18.925750
df.sort_values(["a","c"])  # multi-column sort: sort by "a" first, then break ties by "c"
           a          b          c          d
0   5.152197  49.079658  69.335822  75.125638
3  21.799211   0.693387  69.473753  84.004438
2  30.280855  77.819176   6.757983  59.269951
1  83.086432  55.414763   9.856352  18.925750
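
With random floats, column "a" has no ties, so the second key never takes effect and the result matches the single-column sort. The sketch below uses small integers with deliberate ties to show the second key at work; the name `df_demo` and the values are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [2, 1, 2, 1], 'c': [9, 8, 7, 6]})

# Sort by 'a', then by 'c' within equal values of 'a'.
ordered = df_demo.sort_values(['a', 'c'])

# A separate direction per key: 'a' ascending, 'c' descending.
mixed = df_demo.sort_values(['a', 'c'], ascending=[True, False])

print(ordered.index.tolist())  # [3, 1, 2, 0]
print(mixed.index.tolist())    # [1, 3, 0, 2]
```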
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = [5,4,3,2],
                   columns = ['a','b','c','d'])
df
           a          b          c          d
5  71.176835  65.530367  52.849498  69.301327
4  68.153738  68.172008  46.080072  86.103846
3  86.816305  24.459884  53.673947  80.592007
2  81.356156  47.900072  85.548738  19.770766
df.sort_index()
           a          b          c          d
2  81.356156  47.900072  85.548738  19.770766
3  86.816305  24.459884  53.673947  80.592007
4  68.153738  68.172008  46.080072  86.103846
5  71.176835  65.530367  52.849498  69.301327
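
`sort_index` sorts the row labels in ascending order by default. As a small extension, the sketch below shows descending order and sorting the column labels with `axis=1`; the name `df_demo` and the values are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[5, 3, 4])

# Row labels in descending order.
rows_desc = df_demo.sort_index(ascending=False)

# Column labels in ascending order.
cols_sorted = df_demo.sort_index(axis=1)

print(rows_desc.index.tolist())      # [5, 4, 3]
print(cols_sorted.columns.tolist())  # ['a', 'b']
```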