Pandas Basics Review: DataFrame

The DataFrame Data Type

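  • A DataFrame is a tabular data type made up of several Series columns that all share one index.
  • It has both a row index and a column index.
    • The row index identifies the rows. It is called index and corresponds to axis 0 (axis=0).
    • The column index identifies the columns. It is called columns and corresponds to axis 1 (axis=1).
  • A DataFrame can be viewed as a two-dimensional labeled array.
  • Each column can hold a different data type.
  • Basic operations resemble those on a Series and work through the row and column indexes.
  • It is normally used for two-dimensional data. Higher-dimensional data can be expressed by nesting DataFrames, but this is rarely done.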
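
To make the two axes concrete, here is a minimal sketch. The toy frame and its labels (x, y, r1 to r3) are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical toy data: two columns with different dtypes, three labeled rows.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10.0, 20.0, 30.0]},
                  index=['r1', 'r2', 'r3'])

# axis=0 runs down the rows, so there is one result per column.
col_sums = df.sum(axis=0)

# axis=1 runs across the columns, so there is one result per row.
row_sums = df.sum(axis=1)

# Each column keeps its own dtype (integer for x, float for y).
dtypes = df.dtypes
```

Here col_sums is indexed by the column labels and row_sums by the row labels, which is a quick way to check which axis an operation collapsed.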

Creating a DataFrame

Creating a DataFrame from a Python list

import pandas as pd

df = pd.DataFrame([True, 1, 2.3, 'a', '你好'])  # 1-D list: a single column
df
      0
0  True
1     1
2   2.3
3     a
4    你好
df = pd.DataFrame([[True,1,2.3,'a','你好'],[1,2,3,4,5]])  # 2-D list: one row per inner list
df
      0  1    2  3   4
0  True  1  2.3  a  你好
1     1  2  3.0  4   5
# 3-D nested list (not recommended): each cell ends up holding a whole list
df = pd.DataFrame([[[True,1,2.3,'a','你好'],
                    [1,2,3,4,5]],
                   [[True,1,2.3,'a','你好'],
                    [1,2,3,4,5]]
                  ])
df
                        0                1
0  [True, 1, 2.3, a, 你好]  [1, 2, 3, 4, 5]
1  [True, 1, 2.3, a, 你好]  [1, 2, 3, 4, 5]
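
A short sketch of why the 3-D form is discouraged: the frame stays two-dimensional and every cell stores a Python list, so the columns fall back to the generic object dtype. The variable names below are illustrative.

```python
import pandas as pd

nested = [[[True, 1, 2.3, 'a', '你好'], [1, 2, 3, 4, 5]],
          [[True, 1, 2.3, 'a', '你好'], [1, 2, 3, 4, 5]]]
df3d = pd.DataFrame(nested)

# The frame is still 2 x 2; the third level lives inside the cells.
shape = df3d.shape

# A single cell is an ordinary Python list, not a number.
cell = df3d.iloc[0, 1]

# Columns holding lists are stored with the object dtype.
column_kinds = [dtype.kind for dtype in df3d.dtypes]
```

Because the cells are opaque objects, vectorized arithmetic and most column-wise operations no longer apply to them directly.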

Creating a DataFrame from a Python dict

df = pd.DataFrame({'one':[1,2,3,4],
                   'two':[9,8,7,6]})
df
   one  two
0    1    9
1    2    8
2    3    7
3    4    6
# Custom row index
df = pd.DataFrame({'one':[1,2,3,4],
                   'two':[9,8,7,6]},index = ['a','b','c','d'])
df
   one  two
a    1    9
b    2    8
c    3    7
d    4    6
df = pd.DataFrame({
    'A' : 1,
    'B' : 2.3,
    'C' : ['x','y',5]  # a list-like value sets the number of rows; the scalars are broadcast to every row
})
df
   A    B  C
0  1  2.3  x
1  1  2.3  y
2  1  2.3  5
dt = {
    'one' : pd.Series([1,2,3],index=['a','b','c']),
    'two' : pd.Series([9,8,7,6],index=['a','b','c','d',])
}
dt
{'one': a    1
 b    2
 c    3
 dtype: int64, 'two': a    9
 b    8
 c    7
 d    6
 dtype: int64}
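# 'one' and 'two' become the column labels and a, b, c, d the row labels.
# Each dict entry becomes one column; each label inside a Series becomes one row.
d = pd.DataFrame(dt)
d
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6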
# Data is aligned to the requested row and column labels; anything missing is filled with NaN
d_2 = pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
d_2
   two three
b    8   NaN
c    7   NaN
d    6   NaN
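
The outputs above show column one printed as 1.0, 2.0, 3.0 while column two stays integer. A minimal sketch of the reason, reusing the same dt dictionary: the missing label d forces a NaN into column one, and NaN is a float, so the whole column is promoted.

```python
import pandas as pd

dt = {
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd']),
}
d = pd.DataFrame(dt)

# Column 'one' has no value for label 'd', so it holds NaN there and becomes float.
one_kind = d['one'].dtype.kind

# Column 'two' covers every label and keeps its integer dtype.
two_kind = d['two'].dtype.kind

# isna() marks exactly the cells that alignment had to fill.
missing = d.isna()
```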

Creating a DataFrame from an ndarray

import numpy as np

df = pd.DataFrame(np.arange(10).reshape(2,5))  # row and column indexes are generated automatically
df
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9
# Custom row and column indexes (the values are random, so they differ on every run)
df = pd.DataFrame(np.random.randn(6,4),
                  index=[1,2,3,4,5,6],
                  columns=['a','b','c','d'])
df
          a         b         c         d
1  0.274340  0.296507  0.751198  0.763512
2  0.181134  0.675380  0.553695  0.632163
3 -0.059765  0.347702  1.138297 -0.143998
4 -1.370677 -0.951640  0.135964 -0.665875
5  1.490610  0.420539  0.628784  2.119896
6 -1.669737  1.167765  1.254722 -0.948624

Creating a DataFrame from Series

e = pd.DataFrame([pd.Series([1,2,3]),
                  pd.Series([9,8,7,6])],
                 index=['a','b'])
e
     0    1    2    3
a  1.0  2.0  3.0  NaN
b  9.0  8.0  7.0  6.0

DataFrame Attributes


# Keys: 姓名 = name, 性别 = gender, 年龄 = age, 地址 = address
di = {
    '姓名':['张三','李四','王五','赵六'],
    '性别':['男','女','女','男'],
    '年龄':[12,22,32,42],
    '地址':['北京','上海','广州','深圳']
}
di
{'地址': ['北京', '上海', '广州', '深圳'],
 '姓名': ['张三', '李四', '王五', '赵六'],
 '年龄': [12, 22, 32, 42],
 '性别': ['男', '女', '女', '男']}
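# The column order below comes from an older pandas that sorted dict keys;
# recent versions keep the dict's insertion order instead.
d = pd.DataFrame(di,index=['d1','d2','d3','d4'])
d
    地址  姓名  年龄 性别
d1  北京  张三  12  男
d2  上海  李四  22  女
d3  广州  王五  32  女
d4  深圳  赵六  42  男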
d.head()  # show the first rows (5 by default)
    地址  姓名  年龄 性别
d1  北京  张三  12  男
d2  上海  李四  22  女
d3  广州  王五  32  女
d4  深圳  赵六  42  男
d.tail(3)  # show the last 3 rows
    地址  姓名  年龄 性别
d2  上海  李四  22  女
d3  广州  王五  32  女
d4  深圳  赵六  42  男
d.info()  # summary: index, columns, non-null counts, dtypes, memory
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, d1 to d4
Data columns (total 4 columns):
地址    4 non-null object
姓名    4 non-null object
年龄    4 non-null int64
性别    4 non-null object
dtypes: int64(1), object(3)
memory usage: 160.0+ bytes
d.shape  # (number of rows, number of columns)
(4, 4)
d.dtypes  # data type of each column
地址    object
姓名    object
年龄     int64
性别    object
dtype: object
d.index  # row index
Index(['d1', 'd2', 'd3', 'd4'], dtype='object')
d.columns  # column index
Index(['地址', '姓名', '年龄', '性别'], dtype='object')
d.values  # the underlying values as an ndarray
array([['北京', '张三', 12, '男'],
       ['上海', '李四', 22, '女'],
       ['广州', '王五', 32, '女'],
       ['深圳', '赵六', 42, '男']], dtype=object)

DataFrame CRUD: Read, Create, Update, Delete

Read

list/ndarray-style access

dates = pd.date_range('20130101',periods=10)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=['A','B','C','D'])
df
                   A         B         C         D
2013-01-01  0.754077 -0.346202 -0.557050  0.778106
2013-01-02  0.103394 -1.051044 -0.413054  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
2013-01-04 -0.950517 -0.226887 -0.097138 -0.442010
2013-01-05  0.076178 -0.518970  1.142290 -0.952401
2013-01-06  1.371702 -1.028873 -1.470106 -0.113098
2013-01-07  0.126720 -0.251519 -2.212507  1.050036
2013-01-08 -1.246918  1.530266  1.761499  0.940741
2013-01-09  0.941099 -2.420932  1.927863 -0.549143
2013-01-10  1.951555 -0.264012 -0.171690  0.869293
# Select a column by its label
df['A']
2013-01-01    0.754077
2013-01-02    0.103394
2013-01-03    0.174730
2013-01-04   -0.950517
2013-01-05    0.076178
2013-01-06    1.371702
2013-01-07    0.126720
2013-01-08   -1.246918
2013-01-09    0.941099
2013-01-10    1.951555
Freq: D, Name: A, dtype: float64
df.A
2013-01-01    0.754077
2013-01-02    0.103394
2013-01-03    0.174730
2013-01-04   -0.950517
2013-01-05    0.076178
2013-01-06    1.371702
2013-01-07    0.126720
2013-01-08   -1.246918
2013-01-09    0.941099
2013-01-10    1.951555
Freq: D, Name: A, dtype: float64
df['A']['2013-01-01']  # column first, then row
0.75407705661157032
df.A['2013-01-01']
0.75407705661157032
df[['A','C']]
                   A         C
2013-01-01  0.754077 -0.557050
2013-01-02  0.103394 -0.413054
2013-01-03  0.174730  1.781379
2013-01-04 -0.950517 -0.097138
2013-01-05  0.076178  1.142290
2013-01-06  1.371702 -1.470106
2013-01-07  0.126720 -2.212507
2013-01-08 -1.246918  1.761499
2013-01-09  0.941099  1.927863
2013-01-10  1.951555 -0.171690

Pandas-specific access

.loc selects data by label (the custom index)

# Select a row
df.loc['2013-01-01']
A    0.754077
B   -0.346202
C   -0.557050
D    0.778106
Name: 2013-01-01 00:00:00, dtype: float64
# Select a column
df.loc[:,'A']
2013-01-01    0.754077
2013-01-02    0.103394
2013-01-03    0.174730
2013-01-04   -0.950517
2013-01-05    0.076178
2013-01-06    1.371702
2013-01-07    0.126720
2013-01-08   -1.246918
2013-01-09    0.941099
2013-01-10    1.951555
Freq: D, Name: A, dtype: float64
# Select a single value
df.loc['2013-01-01','A']  # row first, then column
0.75407705661157032
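
The single-value lookups shown so far (column first with df['A'][...], row first with .loc) return the same number. The sketch below compares them on a stand-in frame; the random seed and the variable names are assumptions for illustration. It also uses .at, the pandas accessor for one scalar by label, and shows that assigning through .loc writes into the frame in one step.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame above; the seed is an arbitrary assumption.
dates = pd.date_range('20130101', periods=10)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

day = pd.Timestamp('2013-01-01')

chained = df['A'][day]      # two lookups: the column, then the row
single = df.loc[day, 'A']   # one lookup: row label, then column label
scalar = df.at[day, 'A']    # scalar-only accessor, also by label

# A single .loc call is the safer form for assignment.
df.loc[day, 'A'] = 0.0
```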
# Select specific rows / columns
df.loc[[dates[0],dates[2]],:]  # specific rows
                   A         B         C         D
2013-01-01  0.754077 -0.346202 -0.557050  0.778106
2013-01-03  0.174730  2.056007  1.781379  1.643397
df.loc[:,['A','B']]  # specific columns
                   A         B
2013-01-01  0.754077 -0.346202
2013-01-02  0.103394 -1.051044
2013-01-03  0.174730  2.056007
2013-01-04 -0.950517 -0.226887
2013-01-05  0.076178 -0.518970
2013-01-06  1.371702 -1.028873
2013-01-07  0.126720 -0.251519
2013-01-08 -1.246918  1.530266
2013-01-09  0.941099 -2.420932
2013-01-10  1.951555 -0.264012
df.loc[[dates[0],dates[2]],['A','B']]  # specific rows and columns
                   A         B
2013-01-01  0.754077 -0.346202
2013-01-03  0.174730  2.056007
# Slicing
df.loc['2013-01-01':'2013-01-04',:]  # slice the rows
                   A         B         C         D
2013-01-01  0.754077 -0.346202 -0.557050  0.778106
2013-01-02  0.103394 -1.051044 -0.413054  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
2013-01-04 -0.950517 -0.226887 -0.097138 -0.442010
df.loc[:,'A':'C']  # slice the columns
                   A         B         C
2013-01-01  0.754077 -0.346202 -0.557050
2013-01-02  0.103394 -1.051044 -0.413054
2013-01-03  0.174730  2.056007  1.781379
2013-01-04 -0.950517 -0.226887 -0.097138
2013-01-05  0.076178 -0.518970  1.142290
2013-01-06  1.371702 -1.028873 -1.470106
2013-01-07  0.126720 -0.251519 -2.212507
2013-01-08 -1.246918  1.530266  1.761499
2013-01-09  0.941099 -2.420932  1.927863
2013-01-10  1.951555 -0.264012 -0.171690
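# Slice out a contiguous block: rows first, then columns.
# Label slices with .loc include BOTH endpoints (2013-01-04 and 'C' are part of the result).
df.loc['2013-01-01':'2013-01-04','A':'C']
                   A         B         C
2013-01-01  0.754077 -0.346202 -0.557050
2013-01-02  0.103394 -1.051044 -0.413054
2013-01-03  0.174730  2.056007  1.781379
2013-01-04 -0.950517 -0.226887 -0.097138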

.iloc selects data by integer position (the default index)

# Select a row
df.iloc[3]
A   -0.950517
B   -0.226887
C   -0.097138
D   -0.442010
Name: 2013-01-04 00:00:00, dtype: float64
# Select a column
df.iloc[:,2]
2013-01-01   -0.557050
2013-01-02   -0.413054
2013-01-03    1.781379
2013-01-04   -0.097138
2013-01-05    1.142290
2013-01-06   -1.470106
2013-01-07   -2.212507
2013-01-08    1.761499
2013-01-09    1.927863
2013-01-10   -0.171690
Freq: D, Name: C, dtype: float64
# Select a single value
df.iloc[1,2]
-0.41305425875508139
# Select specific rows / columns
df.iloc[[1,2,4],:]  # specific rows
                   A         B         C         D
2013-01-02  0.103394 -1.051044 -0.413054  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
2013-01-05  0.076178 -0.518970  1.142290 -0.952401
df.iloc[:,[0,2]]  # specific columns
                   A         C
2013-01-01  0.754077 -0.557050
2013-01-02  0.103394 -0.413054
2013-01-03  0.174730  1.781379
2013-01-04 -0.950517 -0.097138
2013-01-05  0.076178  1.142290
2013-01-06  1.371702 -1.470106
2013-01-07  0.126720 -2.212507
2013-01-08 -1.246918  1.761499
2013-01-09  0.941099  1.927863
2013-01-10  1.951555 -0.171690
df.iloc[[1,2,4],[0,2]]  # specific rows and columns: rows first, then columns
                   A         C
2013-01-02  0.103394 -0.413054
2013-01-03  0.174730  1.781379
2013-01-05  0.076178  1.142290
# Slicing
df.iloc[1:3,:]  # slice the rows
                   A         B         C         D
2013-01-02  0.103394 -1.051044 -0.413054  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
df.iloc[:,1:3]  # slice the columns
                   B         C
2013-01-01 -0.346202 -0.557050
2013-01-02 -1.051044 -0.413054
2013-01-03  2.056007  1.781379
2013-01-04 -0.226887 -0.097138
2013-01-05 -0.518970  1.142290
2013-01-06 -1.028873 -1.470106
2013-01-07 -0.251519 -2.212507
2013-01-08  1.530266  1.761499
2013-01-09 -2.420932  1.927863
2013-01-10 -0.264012 -0.171690
df.iloc[3:5,0:2]  # contiguous block by position: rows, then columns; start included, stop excluded
                   A         B
2013-01-04 -0.950517 -0.226887
2013-01-05  0.076178 -0.518970
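
The endpoint rules differ between the two accessors: positional slices with .iloc follow Python's half-open convention, while label slices with .loc include the stop label. A minimal sketch on a stand-in frame (the integer data is made up so the result is deterministic):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=10)
df = pd.DataFrame(np.arange(40).reshape(10, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Positions 3 and 4 for rows, 0 and 1 for columns: the stop value is excluded.
by_position = df.iloc[3:5, 0:2]

# The same block by label: both '2013-01-05' and 'B' are included.
by_label = df.loc['2013-01-04':'2013-01-05', 'A':'B']

same_block = by_position.equals(by_label)

# Same numeric bounds, different row counts.
rows_iloc = df.iloc[0:4].shape[0]
rows_loc = df.loc[dates[0]:dates[4]].shape[0]
```

With these bounds, rows_iloc is 4 while rows_loc is 5, because the label slice keeps the row at dates[4].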

Boolean indexing

# Filter rows by a condition on one column
df[df.A > 0]
                   A         B         C         D
2013-01-01  0.754077 -0.346202 -0.557050  0.778106
2013-01-02  0.103394 -1.051044 -0.413054  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
2013-01-05  0.076178 -0.518970  1.142290 -0.952401
2013-01-06  1.371702 -1.028873 -1.470106 -0.113098
2013-01-07  0.126720 -0.251519 -2.212507  1.050036
2013-01-09  0.941099 -2.420932  1.927863 -0.549143
2013-01-10  1.951555 -0.264012 -0.171690  0.869293
# Select with a boolean frame (where-style): values that fail the condition become NaN
b = df[df > 0]
b
                   A         B         C         D
2013-01-01  0.754077       NaN       NaN  0.778106
2013-01-02  0.103394       NaN       NaN  0.268955
2013-01-03  0.174730  2.056007  1.781379  1.643397
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05  0.076178       NaN  1.142290       NaN
2013-01-06  1.371702       NaN       NaN       NaN
2013-01-07  0.126720       NaN       NaN  1.050036
2013-01-08       NaN  1.530266  1.761499  0.940741
2013-01-09  0.941099       NaN  1.927863       NaN
2013-01-10  1.951555       NaN       NaN  0.869293
type(b['A']['2013-01-01'])
numpy.float64
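
Indexing with a boolean frame, as in df[df > 0], gives the same result as calling DataFrame.where with that condition. The sketch below checks this on a small made-up frame and also uses the second argument of where to supply a replacement value instead of NaN.

```python
import pandas as pd

# Hypothetical toy data with a mix of positive and negative values.
df = pd.DataFrame({'A': [1.5, -2.0, 0.5],
                   'B': [-1.0, 3.0, -0.5]})

by_mask = df[df > 0]          # boolean-frame indexing
by_where = df.where(df > 0)   # explicit where(): same shape, NaN where False

# where() can also fill the rejected cells with a chosen value.
filled = df.where(df > 0, 0.0)
```

Unlike row filtering with a boolean Series, both forms keep the original shape, which is why the rejected cells show up as NaN rather than disappearing.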
# Filter with isin():
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three','five','four','three','five']
df2
                   A         B         C         D      E
2013-01-01  0.754077 -0.346202 -0.557050  0.778106    one
2013-01-02  0.103394 -1.051044 -0.413054  0.268955    one
2013-01-03  0.174730  2.056007  1.781379  1.643397    two
2013-01-04 -0.950517 -0.226887 -0.097138 -0.442010  three
2013-01-05  0.076178 -0.518970  1.142290 -0.952401   four
2013-01-06  1.371702 -1.028873 -1.470106 -0.113098  three
2013-01-07  0.126720 -0.251519 -2.212507  1.050036   five
2013-01-08 -1.246918  1.530266  1.761499  0.940741   four
2013-01-09  0.941099 -2.420932  1.927863 -0.549143  three
2013-01-10  1.951555 -0.264012 -0.171690  0.869293   five
df2['E'].isin(['one','four'])
2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-05     True
2013-01-06    False
2013-01-07    False
2013-01-08     True
2013-01-09    False
2013-01-10    False
Freq: D, Name: E, dtype: bool
df2[df2['E'].isin(['one','four'])]
                   A         B         C         D     E
2013-01-01  0.754077 -0.346202 -0.557050  0.778106   one
2013-01-02  0.103394 -1.051044 -0.413054  0.268955   one
2013-01-05  0.076178 -0.518970  1.142290 -0.952401  four
2013-01-08 -1.246918  1.530266  1.761499  0.940741  four
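
The isin() mask is an ordinary boolean Series, so it can be inverted with ~ or combined with other conditions using & and |. A minimal sketch on made-up data follows; note the parentheses around each comparison, which Python's operator precedence requires.

```python
import pandas as pd

# Hypothetical toy data standing in for df2.
df2 = pd.DataFrame({'A': [0.5, -1.0, 2.0, -0.3],
                    'E': ['one', 'two', 'four', 'three']})

wanted = df2['E'].isin(['one', 'four'])

selected = df2[wanted]       # rows whose E is 'one' or 'four'
excluded = df2[~wanted]      # every other row

# Combine with a numeric condition: E matches AND A is positive.
both = df2[(df2['A'] > 0) & wanted]
```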

Create

s1 = pd.Series([1,2,3,4,5,6], 
               index=pd.date_range('20130102', periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
# Add a new column; the Series is aligned on the index, so unmatched rows get NaN
df2['F'] = s1
df2
                   A         B         C         D      E    F
2013-01-01  0.754077 -0.346202 -0.557050  0.778106    one  NaN
2013-01-02  0.103394 -1.051044 -0.413054  0.268955    one  1.0
2013-01-03  0.174730  2.056007  1.781379  1.643397    two  2.0
2013-01-04 -0.950517 -0.226887 -0.097138 -0.442010  three  3.0
2013-01-05  0.076178 -0.518970  1.142290 -0.952401   four  4.0
2013-01-06  1.371702 -1.028873 -1.470106 -0.113098  three  5.0
2013-01-07  0.126720 -0.251519 -2.212507  1.050036   five  6.0
2013-01-08 -1.246918  1.530266  1.761499  0.940741   four  NaN
2013-01-09  0.941099 -2.420932  1.927863 -0.549143  three  NaN
2013-01-10  1.951555 -0.264012 -0.171690  0.869293   five  NaN
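
Column F above picks up NaN because a Series is matched to the frame by label, not by position. A plain list behaves differently: it is assigned by position and must have exactly one value per row. The sketch below contrasts the two on a small made-up frame; the column names G and H are illustrative.

```python
import pandas as pd

idx = pd.date_range('20130101', periods=4)
df2 = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# A Series starting one day later: the first row has no matching label.
s1 = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))
df2['F'] = s1

# A list is placed by position and needs one value per row.
df2['G'] = [10, 20, 30, 40]

# A list of the wrong length is rejected instead of being padded.
try:
    df2['H'] = [1, 2, 3]
    length_error = None
except ValueError as exc:
    length_error = exc
```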

Update

# Update a whole column
df2.loc[:,'D']
2013-01-01    0.778106
2013-01-02    0.268955
2013-01-03    1.643397
2013-01-04   -0.442010
2013-01-05   -0.952401
2013-01-06   -0.113098
2013-01-07    1.050036
2013-01-08    0.940741
2013-01-09   -0.549143
2013-01-10    0.869293
Freq: D, Name: D, dtype: float64
df2.loc[:,'D'] = 5
df2
                   A         B         C  D      E    F
2013-01-01  0.754077 -0.346202 -0.557050  5    one  NaN
2013-01-02  0.103394 -1.051044 -0.413054  5    one  1.0
2013-01-03  0.174730  2.056007  1.781379  5    two  2.0
2013-01-04 -0.950517 -0.226887 -0.097138  5  three  3.0
2013-01-05  0.076178 -0.518970  1.142290  5   four  4.0
2013-01-06  1.371702 -1.028873 -1.470106  5  three  5.0
2013-01-07  0.126720 -0.251519 -2.212507  5   five  6.0
2013-01-08 -1.246918  1.530266  1.761499  5   four  NaN
2013-01-09  0.941099 -2.420932  1.927863  5  three  NaN
2013-01-10  1.951555 -0.264012 -0.171690  5   five  NaN
df2.iloc[1,3]
5
df2.iloc[1,3] = 10.1
df2
                   A         B         C     D      E    F
2013-01-01  0.754077 -0.346202 -0.557050   5.0    one  NaN
2013-01-02  0.103394 -1.051044 -0.413054  10.1    one  1.0
2013-01-03  0.174730  2.056007  1.781379   5.0    two  2.0
2013-01-04 -0.950517 -0.226887 -0.097138   5.0  three  3.0
2013-01-05  0.076178 -0.518970  1.142290   5.0   four  4.0
2013-01-06  1.371702 -1.028873 -1.470106   5.0  three  5.0
2013-01-07  0.126720 -0.251519 -2.212507   5.0   five  6.0
2013-01-08 -1.246918  1.530266  1.761499   5.0   four  NaN
2013-01-09  0.941099 -2.420932  1.927863   5.0  three  NaN
2013-01-10  1.951555 -0.264012 -0.171690   5.0   five  NaN
# Update through a boolean mask: flip the sign of every positive value
df3 = df.copy()
df3[df3 > 0] = -df3
df3
                   A         B         C         D
2013-01-01 -0.754077 -0.346202 -0.557050 -0.778106
2013-01-02 -0.103394 -1.051044 -0.413054 -0.268955
2013-01-03 -0.174730 -2.056007 -1.781379 -1.643397
2013-01-04 -0.950517 -0.226887 -0.097138 -0.442010
2013-01-05 -0.076178 -0.518970 -1.142290 -0.952401
2013-01-06 -1.371702 -1.028873 -1.470106 -0.113098
2013-01-07 -0.126720 -0.251519 -2.212507 -1.050036
2013-01-08 -1.246918 -1.530266 -1.761499 -0.940741
2013-01-09 -0.941099 -2.420932 -1.927863 -0.549143
2013-01-10 -1.951555 -0.264012 -0.171690 -0.869293
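
The CRUD heading also lists delete, but this section contains no delete example. As a hedged sketch only (the toy frame and names are assumptions, not the author's code), rows and columns can be removed with DataFrame.drop, which returns a new frame, or with del, which removes a column in place.

```python
import pandas as pd

# Hypothetical toy data.
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                   index=['a', 'b', 'c'])

# drop() returns a new frame and leaves df2 untouched.
without_b = df2.drop(columns=['B'])
without_row_a = df2.drop(index=['a'])

# del removes a column from df2 itself.
del df2['C']
```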