Pandas (1): Introduction to Data Structures

leotongxue1234

Article link: https://blog.csdn.net/zeroooorez/article/details/100926260

Series

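A Series is a one-dimensional labeled array.
It is an array-like object made up of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index.

a. Creating a simple Series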

import pandas as pd
t = pd.Series([1,2,3,4])
print(t)
print(type(t))

0    1
1    2
2    3
3    4
dtype: int64
<class 'pandas.core.series.Series'>

b. The values and index attributes of a Series

Both are read as attributes:

print(t.values)
print(t.index)

[1 2 3 4]
RangeIndex(start=0, stop=4, step=1)

Specifying custom index labels when creating a Series

t2 = pd.Series([1,2,3,4],index=list('abcd'))
print(t2)
print(list('abcd'))

Note: index takes a list with one label per value
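The snippet above prints no output, so here is a small sketch of what a custom index gives you. The variable names (s, labels, values, b_value) are chosen for illustration and are not part of the original article.

```python
import pandas as pd

# Build a Series with explicit string labels instead of the default 0..n-1.
s = pd.Series([1, 2, 3, 4], index=list("abcd"))

# Labels and values are stored side by side and can be read back separately.
labels = s.index.tolist()
values = s.tolist()

# A label looks up its value, much like a dict key.
b_value = s["b"]

print(labels)
print(values)
print(b_value)
```

The length of the index list has to match the number of values, otherwise pandas raises a ValueError.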

c. Creating a Series from a dict

The dict keys become the index

temp_dict  = {'name':'xiaohong','age':12,'tel':1234}
t3 = pd.Series(temp_dict)
print(t3)

name    xiaohong
age           12
tel         1234
dtype: object

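When an explicit index is passed along with the dict, labels that match a key take that key's value and labels with no matching key get NaN. Because NaN is a float, the dtype changes from int64 to float64 in the output below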

import string
a = {string.ascii_uppercase[i]:i for i in range(10)}
print(a)
t4 = pd.Series(a)
print(t4)
t5 = pd.Series(a, index=list(string.ascii_uppercase[5:15]))
print(t5)

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9}
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64
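The dtype change in the output above can be checked directly. This is a minimal sketch with a shortened dict; the names data, full, and partial are illustrative only.

```python
import pandas as pd

data = {"A": 0, "B": 1, "C": 2}

# All labels come from the dict, so the values stay integers.
full = pd.Series(data)

# 'D' is not a key of the dict, so its slot is filled with NaN,
# and the whole Series is upcast to float64 to hold it.
partial = pd.Series(data, index=["B", "C", "D"])

print(full.dtype)
print(partial.dtype)
print(partial)
```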

Any index label with no corresponding key in the dict yields NaN

t4 = pd.Series(temp_dict,index=['age','name','addr'])
print(t4)

age           12
name    xiaohong
addr         NaN
dtype: object

Detecting missing data
The top-level pandas functions isnull and notnull

pd.isnull(t4)
age     False
name    False
addr     True
dtype: bool

pd.notnull(t4)
age      True
name     True
addr    False
dtype: bool

The isnull and notnull methods of a Series return the same results

t4.isnull()
t4.notnull()
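The boolean results of isnull and notnull are mostly useful as masks. The following sketch, with illustrative variable names, counts the missing entries and filters them out; dropna is the built-in shortcut for the same filter.

```python
import pandas as pd

info = {"name": "xiaohong", "age": 12, "tel": 1234}
s = pd.Series(info, index=["age", "name", "addr"])

# True counts as 1, so summing the mask gives the number of missing entries.
n_missing = int(s.isnull().sum())

# A boolean mask keeps only the entries that are present.
present = s[s.notnull()]

# dropna() performs the same filtering in one call.
dropped = s.dropna()

print(n_missing)
print(present)
```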

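d. Slicing and indexing

Slicing: pass start, end, and optionally a step.
Indexing: for a single element, pass its label, or its position via iloc; for several elements, pass a list of labels or a list of positions.
Note: label slices include both endpoints, while positional slices include the start and exclude the end.
Note: on a Series with non-integer labels, plain [] with an integer used to fall back to a positional lookup. That fallback is deprecated in recent pandas releases, so the positional lookups below use iloc.

print(t3['name'])
print(t3.iloc[0])
print('*'*20)
print(t3[['name','age']])
print(t3.iloc[[0,2]])
print('*'*20)
print(t3[:2])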

xiaohong
xiaohong
********************
name    xiaohong
age           12
dtype: object
name    xiaohong
tel         1234
dtype: object
********************
name    xiaohong
age           12
dtype: object

t4 = pd.Series(a)  # rebuild the A-J Series, since t4 was reassigned above
print(t4[2:10:2])
print(t4.iloc[[2,3,4]])
print(t4[['A','B','C']])
print(t4['A':'C'])

C    2
E    4
G    6
I    8
dtype: int64
C    2
D    3
E    4
dtype: int64
A    0
B    1
C    2
dtype: int64
A    0
B    1
C    2
dtype: int64
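To make the endpoint rule explicit, the sketch below slices the same Series once by label and once by position, using loc and iloc so that the intent is unambiguous. The Series s and the result names are illustrative.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4], index=list("ABCDE"))

# Label slice: both 'A' and 'C' are included.
by_label = s.loc["A":"C"]

# Positional slice: position 0 is included, position 2 is excluded.
by_position = s.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```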

Both the Series object itself and its index have a name attribute

temp_dict  = {'name':'xiaohong','age':12,'tel':1234}
t3 = pd.Series(temp_dict)
t3.name = 'info'
t3.index.name = 'content'
print(t3)

content
name    xiaohong
age           12
tel         1234
Name: info, dtype: object

DataFrame

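A DataFrame is a two-dimensional container of Series.
It is a tabular data structure holding an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.

The row index labels the rows and is called index
The column index labels the columns and is called columns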
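The "dict of Series sharing one index" view can be seen by building a DataFrame from two Series whose labels only partly overlap. This is a sketch with made-up data; the names age, tel, and df are illustrative.

```python
import pandas as pd

age = pd.Series({"a": 12, "b": 31})
tel = pd.Series({"b": 1221, "c": 1000})

# Each Series becomes a column; the row index is the union of both indexes,
# and cells with no source value are filled with NaN.
df = pd.DataFrame({"age": age, "tel": tel})

print(df)
print(type(df["age"]))
```

Selecting a single column gives back a Series that carries the shared row index.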

a. Creating a simple DataFrame

import numpy as np
t1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(t1)

   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

t2 = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list('wxyz'))
print(t2)

   w  x   y   z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

b. Creating a DataFrame from a dict

Method 1: pass a dict of equal-length lists or NumPy arrays

temp_dict  = {'name':['xiaohong','xiaohei'],'age':[12,31],'tel':[1234,1221]}
t1 = pd.DataFrame(temp_dict)
print(t1)
       name  age   tel
0  xiaohong   12  1234
1   xiaohei   31  1221

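Method 2: pass a list of dicts; missing entries are filled with NaN. The output below appears to come from an older pandas release that sorted the columns alphabetically; recent releases keep the keys in insertion order (name, age, tel)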

temp_dict2 = [{'name':'xiaohong','age':12,'tel':1234},
              {'name':'xiaohei','age':13},
              {'name':'xiaogang','tel':1000},
              {'name':'xiaoda','age':14,'tel':100},
              {'name':'xiaoxiao','age':15,'tel':34},
              {'name':'xiaobai','age':12,'tel':119}
             ]
t2 = pd.DataFrame(temp_dict2)
print(t2)
    age      name     tel
0  12.0  xiaohong  1234.0
1  13.0   xiaohei     NaN
2   NaN  xiaogang  1000.0
3  14.0    xiaoda   100.0
4  15.0  xiaoxiao    34.0
5  12.0   xiaobai   119.0

If a sequence of columns is specified, the DataFrame's columns are arranged in that order

temp_dict  = {'name':['xiaohong','xiaohei'],'age':[12,31],'tel':[1234,1221]}
t1 = pd.DataFrame(temp_dict,columns=['name','tel','age'])
print(t1)

       name   tel  age
0  xiaohong  1234   12
1   xiaohei  1221   31

c. Nested dicts (a dict of dicts): the outer keys become the columns and the inner keys become the row index

temp_dict3 = {
              'name':{0:'xiaohong',1:'xiaohei',2:'xiaogang'},
              'age':{0:13,1:12,2:13},
              'tel':{0:123,1:134},
              }
t3 = pd.DataFrame(temp_dict3)
print(t3)

       name  age    tel
0  xiaohong   13  123.0
1   xiaohei   12  134.0
2  xiaogang   13    NaN
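As a quick check of how the two levels of keys are used, the sketch below builds a smaller nested dict and inspects the resulting axes. The names nested, df, and transposed are illustrative; T is the standard DataFrame transpose attribute.

```python
import pandas as pd

nested = {
    "name": {0: "xiaohong", 1: "xiaohei"},
    "tel": {0: 123},  # no entry for row 1
}

# Outer keys -> columns, inner keys -> row index.
df = pd.DataFrame(nested)

# Transposing swaps the two axes.
transposed = df.T

print(df)
print(transposed)
```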

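d. Basic DataFrame attributes

df.index      row index
df.columns    column index
df.shape      (number of rows, number of columns)
df.dtypes     data type of each column
df.ndim       number of dimensions
df.values     the data as a two-dimensional ndarray

print(t2.index)
print(t2.columns)
print(t2.shape)
print(t2.dtypes)
print(t2.ndim)
print('*'*40)
print(t2.values)
print('*'*40)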

RangeIndex(start=0, stop=6, step=1)
Index(['name', 'tel', 'age'], dtype='object')
(6, 3)
name     object
tel     float64
age     float64
dtype: object
2
****************************************
[['xiaohong' 1234.0 12.0]
 ['xiaohei' nan 13.0]
 ['xiaogang' 1000.0 nan]
 ['xiaoda' 100.0 14.0]
 ['xiaoxiao' 34.0 15.0]
 ['xiaobai' 119.0 12.0]]
****************************************

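A column can be retrieved as a Series either with dict-like notation or as an attribute;
a row can be retrieved by label or by position using loc and iloc (the older ix indexer was removed in pandas 1.0)

print(t2.tel)
print(t2['name'])
print(t2.loc[3])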
0    1234.0
1       NaN
2    1000.0
3     100.0
4      34.0
5     119.0
Name: tel, dtype: float64
0    xiaohong
1     xiaohei
2    xiaogang
3      xiaoda
4    xiaoxiao
5     xiaobai
Name: name, dtype: object
age         14
name    xiaoda
tel        100
Name: 3, dtype: object
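Because ix no longer exists in current pandas, row access has to go through loc (labels) or iloc (positions). With the default 0..n-1 index the two look identical, so this sketch uses a made-up index of 10, 20, 30 to show the difference. All names and data here are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["xiaohong", "xiaohei", "xiaogang"], "age": [12, 13, 14]},
    index=[10, 20, 30],
)

# loc looks up the row whose label is 20.
row_by_label = df.loc[20]

# iloc looks up the row at position 1 (the second row).
row_by_position = df.iloc[1]

# Both return the same row as a Series named after the row label.
print(row_by_label)
print(row_by_position.name)
```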

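e. Inspecting a DataFrame as a whole

df.head()  shows the first rows, five by default
df.tail()  shows the last rows, five by default
df.info()  summary: number of rows and columns, column index, non-null count per column, column dtypes, memory usage
df.describe()  quick statistics: count, mean, standard deviation, minimum, quartiles, maximum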

print(t2.head()) 
print(t2.head(1)) 

print(t2.tail(2))
    age      name     tel
0  12.0  xiaohong  1234.0
1  13.0   xiaohei     NaN
2   NaN  xiaogang  1000.0
3  14.0    xiaoda   100.0
4  15.0  xiaoxiao    34.0
    age      name     tel
0  12.0  xiaohong  1234.0
    age      name    tel
4  15.0  xiaoxiao   34.0
5  12.0   xiaobai  119.0

print(t2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
age     5 non-null float64
name    6 non-null object
tel     5 non-null float64
dtypes: float64(2), object(1)
memory usage: 224.0+ bytes
None

print(t2.describe())
            age        tel
count   5.00000     5.0000
mean   13.20000   497.4000
std     1.30384   572.5031
min    12.00000    34.0000
25%    12.00000   100.0000
50%    13.00000   119.0000
75%    14.00000  1000.0000
max    15.00000  1234.0000
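The result of describe is itself a pandas object, so individual statistics can be read out by label. The sketch below rebuilds only the age column from the example data and checks a few of the figures shown above; the variable names are illustrative.

```python
import pandas as pd

# The age column of the example, with one missing value.
df = pd.DataFrame({"age": [12, 13, None, 14, 15, 12]})

# describe() on a single column returns a Series indexed by statistic name.
stats = df["age"].describe()

# Missing values are skipped, so count is 5 rather than 6.
count = stats["count"]
mean = stats["mean"]
median = stats["50%"]

print(count, mean, median)
```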

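f. Slicing and indexing a DataFrame

df.loc selects rows (and optionally columns) by label
df.iloc selects rows (and optionally columns) by position

t = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list('wxyz'))
print(t)
print(t.loc['a','w'])
print(t.loc['a',['w','x']])
print(t.loc['a':'c',['w','z']])  # label slice: both endpoints inclusive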

   w  x   y   z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
0
w    0
x    1
Name: a, dtype: int32
   w   z
a  0   3
b  4   7
c  8  11

a  = pd.Series(range(3),index=[0,1,2])
b  = pd.Series(range(3),index=list('abc'))
print(a[0:2])  # positional slice: start inclusive, end exclusive
print(b['a':'c'])  # label slice: both endpoints inclusive
0    0
1    1
dtype: int64
a    0
b    1
c    2
dtype: int64

print(t.iloc[1])
print(t.iloc[:,[2,1]])
print(t.iloc[[2,0],[2,1]])
w    4
x    5
y    6
z    7
Name: b, dtype: int32
    y  x
a   2  1
b   6  5
c  10  9
    y  x
c  10  9
a   2  1
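The same endpoint rules apply on both axes of a DataFrame. As a small sketch, the two selections below pick the same 2x3 block, once with label slices and once with positional slices; the result names are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=list("abc"), columns=list("wxyz")
)

# Label slices include the end label on both axes.
by_label = t.loc["a":"b", "w":"y"]

# Positional slices exclude the end position on both axes.
by_position = t.iloc[0:2, 0:3]

print(by_label)
print(by_label.equals(by_position))
```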

Both the index and the columns of a DataFrame have a name attribute

t.index.name = 'x轴'
t.columns.name = 'y轴'
print(t)

y轴  w  x   y   z
x轴              
a   0  1   2   3
b   4  5   6   7
c   8  9  10  11

Index objects

Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index

import pandas as pd
t = pd.Series(range(3),index=list('abc'))
print(t.index)
print(t.index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

Index objects are immutable, which is what makes it safe to share one Index among multiple data structures

t.index[1] = 'd'
TypeError: Index does not support mutable operations
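A script that wants to change a label therefore cannot assign into the Index; one common alternative is to replace the whole index. The sketch below catches the TypeError and then assigns a new list of labels. The flag name mutated is illustrative.

```python
import pandas as pd

s = pd.Series(range(3), index=list("abc"))

# In-place item assignment on an Index is rejected.
try:
    s.index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# Assigning a complete new set of labels is allowed; pandas builds a new Index.
s.index = ["a", "d", "c"]

print(mutated)
print(s.index.tolist())
```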

import numpy as np
import pandas as pd
index = pd.Index(np.arange(3))
t1 = pd.Series([1,2,3],index=index)
t1.index is index

True

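Method	Description
append	Concatenates another Index object, producing a new Index
difference	Computes the set difference as a new Index (called diff in early pandas releases)
intersection	Computes the set intersection
union	Computes the set union
isin	Returns a boolean array indicating whether each value is contained in the passed collection
delete	Removes the element at position i, returning a new Index
drop	Removes the passed values, returning a new Index
insert	Inserts an element at position i, returning a new Index
is_monotonic_increasing	True if each element is greater than or equal to the previous one (is_monotonic in older pandas releases)
is_unique	True if the Index has no duplicate values
unique	Computes the array of unique values in the Index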

t2 = pd.Series(range(4),index=list('xyzw'))
t1.index.append(t2.index)

Index([0, 1, 2, 'x', 'y', 'z', 'w'], dtype='object')
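Several of the methods in the table behave like set operations. The sketch below runs a few of them on two small, made-up Index objects; every call returns a new object and leaves the originals unchanged.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

union = left.union(right)                # labels found in either Index
intersection = left.intersection(right)  # labels found in both
difference = left.difference(right)      # labels only in left
membership = left.isin(["a", "z"])       # one boolean per label of left

inserted = left.insert(1, "x")  # new Index with 'x' at position 1
deleted = left.delete(0)        # new Index without the element at position 0
dropped = left.drop("c")        # new Index without the label 'c'

print(union.tolist(), intersection.tolist(), difference.tolist())
print(membership.tolist())
print(inserted.tolist(), deleted.tolist(), dropped.tolist())
```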
