Reproducing Python for Data Analysis, 2nd Edition (Part 4)

三街打工人

Category: Python data analysis

Article link: https://blog.csdn.net/u010654659/article/details/104116723

import pandas as pd

from pandas import Series,DataFrame

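5.1 Introduction to pandas Data Structures

To work with pandas, you first need to be comfortable with its two main data structures: Series and DataFrame. They are not a universal solution for every problem, but they provide a solid, easy-to-use basis for most applications.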

Series

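A Series is a one-dimensional array-like object containing a sequence of values (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.
The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default integer index from 0 to N-1 (where N is the length of the data) is created. You can get the array representation and the index object of a Series through its values and index attributes, respectively.
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values from a Series.
If the data is stored in a Python dict, you can create a Series directly from that dict.
When you pass only a dict, the index of the resulting Series is made up of the dict's keys. In the output below the keys keep their insertion order; older pandas releases, such as the one the book was written against, returned them in sorted order. You can override the order by passing the keys, in the order you want, as the index argument.
The isnull and notnull functions in pandas can be used to detect missing data.
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.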

obj = pd.Series([4,7,-5,3])

obj

0    4
1    7
2   -5
3    3
dtype: int64

obj.values

array([ 4,  7, -5,  3], dtype=int64)

obj.index

RangeIndex(start=0, stop=4, step=1)

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

obj2*2-1

d     7
b    13
a   -11
c     5
dtype: int64

import numpy as np
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

print(obj3)
print(obj4)
obj3+obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64





California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
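The Series section mentions the name attribute, but none of the cells above exercise it. The following is a minimal sketch that rebuilds obj4 from the same data; the names 'population' and 'state', and the replacement labels on the second Series, are illustrative choices rather than anything required by pandas.

```python
import pandas as pd

# Same data as obj4 above: 'California' has no entry in sdata, so it becomes NaN.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Illustrative names; both show up in the printed representation.
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)

# A Series index can also be replaced as a whole by assignment.
obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)
```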

DataFrame

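A DataFrame is a rectangular table of data containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional arrays.
For large DataFrames, the head method selects only the first five rows.
If you specify a sequence of columns, the DataFrame's columns are arranged in that order.
If you pass a column that is not contained in the data, it appears with missing values in the result.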

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

frame

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

print(frame.head())
pd.DataFrame(data, columns=['year', 'state', 'pop'])

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

	year	state	pop
0	2000	Ohio	1.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four','five', 'six'])

frame2

	year	state	pop	debt
one	2000	Ohio	1.5	NaN
two	2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN
six	2003	Nevada	3.2	NaN

frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
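Because a DataFrame behaves like a dict of Series, a single column can be pulled out as a Series. The sketch below rebuilds frame2 from the same data and shows dict-like access next to attribute access. One caveat it illustrates: the 'pop' column here shares its name with the DataFrame.pop method, so attribute access does not return that column.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# Dict-like notation works for any column name.
state_col = frame2['state']

# Attribute access works only when the name is a valid Python identifier
# and does not collide with an existing DataFrame attribute or method.
year_col = frame2.year

# frame2.pop is the DataFrame.pop method, so the column needs brackets.
pop_col = frame2['pop']
print(state_col)
print(callable(frame2.pop))
```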

frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

frame2['debt'] = np.arange(6.)
frame2

	year	state	pop	debt
one	2000	Ohio	1.5	0.0
two	2001	Ohio	1.7	1.0
three	2002	Ohio	3.6	2.0
four	2001	Nevada	2.4	3.0
five	2002	Nevada	2.9	4.0
six	2003	Nevada	3.2	5.0

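When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, and missing values are inserted wherever a label is absent.
The del keyword deletes columns. You can also transpose a DataFrame (swap rows and columns) with syntax similar to that of a NumPy array.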

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

	year	state	pop	debt
one	2000	Ohio	1.5	NaN
two	2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7
six	2003	Nevada	3.2	NaN
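To make the length rule concrete, here is a small sketch on a hypothetical three-row frame (not the frame2 used above). Assigning a list of the wrong length is expected to raise ValueError, while assigning a shorter Series succeeds because it is aligned on labels.

```python
import pandas as pd

# Hypothetical small frame used only for this illustration.
frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list with 2 values cannot fill a 3-row column.
try:
    frame['debt'] = [1.0, 2.0]
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# A Series is aligned by label; rows without a matching label get NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])
print(frame)
```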

frame2['eastern'] = frame2.state == 'Ohio'
frame2

	year	state	pop	debt	eastern
one	2000	Ohio	1.5	NaN	True
two	2001	Ohio	1.7	-1.2	True
three	2002	Ohio	3.6	NaN	True
four	2001	Nevada	2.4	-1.5	False
five	2002	Nevada	2.9	-1.7	False
six	2003	Nevada	3.2	NaN	False

del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

	Nevada	Ohio
2001	2.4	1.7
2002	2.9	3.6
2000	NaN	1.5

frame3.T

	2001	2002	2000
Nevada	2.4	2.9	NaN
Ohio	1.7	3.6	1.5

If a DataFrame's index and columns have their name attributes set, these are displayed as well.

frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state	Nevada	Ohio
year
2001	2.4	1.7
2002	2.9	3.6
2000	NaN	1.5

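Index Objects

pandas's Index objects hold the axis labels and other metadata (such as the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index.
Index objects are immutable, so they cannot be modified by the user.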
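A short sketch of what immutability means in practice: item assignment on an Index is expected to raise TypeError, while methods such as append return a new Index and leave the original untouched. The labels used here are arbitrary.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# In-place modification of an Index is not allowed.
try:
    index[1] = 'd'
    assignment_failed = False
except TypeError as exc:
    assignment_failed = True
    print('TypeError:', exc)

# Operations that "change" an Index build a new one instead.
new_index = index.append(pd.Index(['d']))
print(list(index))      # the original labels are unchanged
print(list(new_index))

# An Index also supports membership tests, much like a fixed-size set.
print('b' in index)
```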

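Essential Functionality

Reindexing

An important method on pandas objects is reindex, which creates a new object with the data conformed to a new index. Index values that were not already present are introduced with missing values.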

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

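For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option does this; for example, ffill forward-fills the values.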

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0
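The cells above reindex the rows only. Columns can be reindexed with the columns keyword, as in the following sketch; the list of states is an arbitrary choice that includes one label ('Utah') absent from the frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# 'Utah' is not an existing column, so it is created and filled with NaN;
# 'Ohio' is not requested, so it is left out of the result.
states = ['Texas', 'Utah', 'California']
result = frame.reindex(columns=states)
print(result)

# reindex returns a new object; the original frame keeps its columns.
print(list(frame.columns))
```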

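Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. Because that can require a bit of munging and set logic, the drop method returns a new object with the indicated values deleted from the given axis.
With a DataFrame, index values can be deleted from either axis.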

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data.drop(['Colorado', 'Ohio'])

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

data.drop('two', axis=1)

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

data.drop(['two', 'four'], axis='columns')

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14
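Since drop returns a new object, the original is left as it was unless the result is assigned back. The sketch below checks that, and also shows what happens with a label that does not exist; the label 'z' is made up for the illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop does not modify obj; it returns a new Series without 'c'.
new_obj = obj.drop('c')
print('c' in obj.index, 'c' in new_obj.index)

# Dropping a label that is not present raises KeyError by default.
try:
    obj.drop('z')
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# errors='ignore' skips labels that are not found.
safe = obj.drop('z', errors='ignore')
print(safe)
```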

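Indexing, Selection, and Filtering

A Series can be indexed with [] much like a NumPy array, except that the index labels can be used instead of only integers.
Indexing a DataFrame with [] retrieves one or more columns by name, while a slice or a boolean array selects rows.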
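The cells that follow work on a DataFrame, so here is a brief sketch of the Series case first. It uses a throwaway Series with labels 'a' to 'd'. Note that slicing with labels includes the end point, unlike ordinary Python slicing.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

one_value = obj['b']              # single label
several = obj[['b', 'a', 'd']]    # list of labels, in the requested order
filtered = obj[obj < 2]           # boolean mask

# Label slices are inclusive of the end label 'c'.
label_slice = obj['b':'c'].copy()
print(label_slice)

# Assigning through a label slice modifies the matching section of obj.
obj['b':'c'] = 5
print(obj)
```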

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data[:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

data[data['three'] > 5]

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data < 5

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

data[data < 5] = 0
data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Selection with loc and iloc

Use axis labels (loc) or integer positions (iloc) to select a subset of the rows and columns of a DataFrame.

data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
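Both indexers also accept slices. The sketch below rebuilds the same 4x4 frame from scratch (so it does not depend on the earlier in-place assignment) and combines a label slice, a positional slice, and a boolean filter.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label slice on the rows (end label 'Utah' is included), one column.
by_label = data.loc[:'Utah', 'two']
print(by_label)

# Positional slice on the columns, then a boolean filter on the rows.
by_position = data.iloc[:, :3][data.three > 5]
print(by_position)
```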

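Integer Indexes

To keep things consistent, if an axis index contains integers, data selection is always label-oriented. For more precise handling, use loc (for labels) or iloc (for integer positions).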

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

ser2[-1]

2.0
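The ambiguity is easiest to see on a Series with the default integer index. In the sketch below, ser[-1] is treated as a lookup of the label -1 and fails, whereas loc and iloc state the intent explicitly. Recent pandas releases also warn about positional access through plain [] on a non-integer index, so ser2.iloc[-1] is a reasonable explicit alternative to the ser2[-1] call above.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index: 0, 1, 2

# With an integer index, -1 is looked up as a label, which does not exist.
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

label_slice = ser.loc[:1]    # labels 0 and 1 (end label included)
pos_slice = ser.iloc[:1]     # position 0 only (end position excluded)
print(label_slice)
print(pos_slice)

# On a non-integer index, iloc makes positional access explicit.
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
last = ser2.iloc[-1]
print(last)
```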

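Arithmetic and Data Alignment

An important pandas feature is the behavior of arithmetic between objects with different indexes. When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.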

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

	b	c	d
Ohio	0.0	1.0	2.0
Texas	3.0	4.0	5.0
Colorado	6.0	7.0	8.0

df2

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

df1+df2
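# A value is produced only where both the row label and the column label are shared; everywhere else the result is NaN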

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

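Arithmetic Methods with Fill Values

In arithmetic between differently indexed objects, you may want to fill in a special value, such as 0, when an axis label is found in one object but not in the other.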

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), 
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

df2

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

df1+df2

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

df1.add(df2, fill_value=0)

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0
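Two related tools are worth a quick sketch: the reversed arithmetic methods (those starting with r, such as rdiv) and the fill_value argument of reindex. The frame below uses values starting at 1 rather than 0, an assumption made here only to avoid a division by zero.

```python
import numpy as np
import pandas as pd

# Values start at 1 (unlike df1 above) so that 1 / df1 has no infinities.
df1 = pd.DataFrame(np.arange(1., 13.).reshape((3, 4)),
                   columns=list('abcd'))

# Each arithmetic method has a reversed counterpart: these two are equivalent.
inverse = 1 / df1
inverse_r = df1.rdiv(1)
print(inverse_r)

# reindex also accepts fill_value, here adding an 'e' column filled with 0.
widened = df1.reindex(columns=list('abcde'), fill_value=0)
print(widened)
```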

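Operations between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between a DataFrame and a Series is also well defined.
When we subtract arr[0] from arr, the subtraction is performed once for each row. This is called broadcasting.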

arr = np.arange(12.).reshape((3, 4))
arr-arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame-series

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame+series2

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN
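By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. To match on the row index and broadcast across the columns instead, one of the arithmetic methods can be used with an axis argument. A minimal sketch with the same frame, where series3 is a name introduced here:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Take one column; its index is the frame's row index.
series3 = frame['d']

# axis='index' matches on the row labels and broadcasts across the columns.
result = frame.sub(series3, axis='index')
print(result)
```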

Function Application and Mapping

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	-1.158749	-1.251239	-0.410391
Ohio	-0.189705	1.137946	1.385836
Texas	0.822484	0.281094	-0.813671
Oregon	-1.338115	-0.748630	0.186450

np.abs(frame)

	b	d	e
Utah	1.158749	1.251239	0.410391
Ohio	0.189705	1.137946	1.385836
Texas	0.822484	0.281094	0.813671
Oregon	1.338115	0.748630	0.186450

f = lambda x: x.max() - x.min()
frame.apply(f)

b    2.160599
d    2.389185
e    2.199507
dtype: float64

frame.apply(f, axis='columns')

Utah      0.840848
Ohio      1.575541
Texas     1.636155
Oregon    1.524565
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

	b	d	e
min	-1.338115	-1.251239	-0.813671
max	0.822484	1.137946	1.385836

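# Note: this name shadows Python's built-in format(); it is kept to match the book
format = lambda x: '%.2f' % x
# In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map
frame.applymap(format)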

	b	d	e
Utah	-1.16	-1.25	-0.41
Ohio	-0.19	1.14	1.39
Texas	0.82	0.28	-0.81
Oregon	-1.34	-0.75	0.19
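The element-wise method is called applymap because Series has a map method for the same purpose. The sketch below applies the formatting function to a single column; the numbers are copied from the 'e' column of the random frame shown above so that the result is reproducible, and fmt is a name chosen here to avoid shadowing the built-in format.

```python
import pandas as pd

# Values copied from column 'e' of the frame printed above.
e = pd.Series([-0.410391, 1.385836, -0.813671, 0.186450],
              index=['Utah', 'Ohio', 'Texas', 'Oregon'], name='e')

# Format each element as a string with two decimal places.
fmt = lambda x: '%.2f' % x
formatted = e.map(fmt)
print(formatted)
```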

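Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.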

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

# Data is sorted in ascending order by default, but can be sorted in descending order too
frame.sort_index(axis=1, ascending=False)
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

# Any missing values are sorted to the end of the Series by default
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option of sort_values.

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

	b	a
0	4	0
1	7	1
2	-3	0
3	2	1

frame.sort_values(by='b')

	b	a
2	-3	0
3	2	1
0	4	0
1	7	1

# To sort by multiple columns, pass a list of names
frame.sort_values(by=['a', 'b'])

	b	a
2	-3	0
0	4	0
3	2	1
1	7	1
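This section's heading also covers ranking, which the cells above do not show. As a brief sketch: rank assigns ranks from 1 up to the number of valid data points and, by default, breaks ties by giving each tied value the mean rank of its group. The sample values are arbitrary.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: ties share the average of the ranks they span.
average_ranks = obj.rank()
print(average_ranks)

# method='first' breaks ties by order of appearance in the data.
first_ranks = obj.rank(method='first')
print(first_ranks)

# Descending order, with ties assigned the maximum rank of their group.
desc_ranks = obj.rank(ascending=False, method='max')
print(desc_ranks)
```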

Axis Indexes with Duplicate Labels

The is_unique property of the index tells you whether its labels are unique.

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique

False
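Duplicate labels change what selection returns, which can complicate code that expects a fixed output type. A small sketch using the same Series: a label that appears more than once yields a Series, while a label that appears once yields a scalar.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# 'a' appears twice, so the result is a Series with both entries.
dup = obj['a']
print(dup)

# 'c' appears once, so the result is a scalar value.
single = obj['c']
print(single)
```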

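Summarizing and Computing Descriptive Statistics

Calling the DataFrame's sum method returns a Series containing the column sums. Passing axis='columns' or axis=1 sums across the columns instead, giving one total per row.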

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

print(df.sum())
df.sum(axis=1)

one    9.25
two   -5.80
dtype: float64





a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

# Cumulative sum
df.cumsum()

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8
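Missing values are skipped by these reductions by default. The sketch below, on the same df, shows the skipna option turning that off, plus idxmax, which returns the index label of the maximum instead of the value itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# With skipna=False, any NaN in a row makes that row's mean NaN.
row_means = df.mean(axis='columns', skipna=False)
print(row_means)

# idxmax gives the row label at which each column reaches its maximum.
max_labels = df.idxmax()
print(max_labels)

# describe produces several summary statistics in one call.
print(df.describe())
```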

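(The code cell for this part was an image that failed to load. The three tables below appear to be, in order, the most recent daily percent changes for four stock tickers, their correlation matrix, and their covariance matrix.)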

	AAPL	IBM	MSFT	GOOG
2020-01-23	0.004816	-0.007089	0.006156	0.000471
2020-01-24	-0.002882	-0.016169	-0.010077	-0.013413
2020-01-27	-0.029405	-0.013802	-0.016723	-0.022370
2020-01-28	0.028289	0.006709	0.019596	0.013013
2020-01-29	0.020932	-0.013329	0.015593	0.004179

	AAPL	IBM	MSFT	GOOG
AAPL	1.000000	0.399140	0.585232	0.535808
IBM	0.399140	1.000000	0.478459	0.415225
MSFT	0.585232	0.478459	1.000000	0.674048
GOOG	0.535808	0.415225	0.674048	1.000000

	AAPL	IBM	MSFT	GOOG
AAPL	0.000240	0.000080	0.000130	0.000125
IBM	0.000080	0.000166	0.000088	0.000080
MSFT	0.000130	0.000088	0.000206	0.000145
GOOG	0.000125	0.000080	0.000145	0.000226
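The three tables above are shown without the code that produced them. As a rough sketch of how tables of this shape can be computed, the example below uses pct_change, corr, and cov on a tiny made-up price table. The tickers 'AAA' and 'BBB' and all prices are invented for illustration and are unrelated to the figures above.

```python
import pandas as pd

# Invented closing prices for two hypothetical tickers.
prices = pd.DataFrame({'AAA': [10.0, 11.0, 12.1, 11.0],
                       'BBB': [20.0, 19.0, 20.9, 20.0]})

# Percent change from one row to the next; the first row has no predecessor.
returns = prices.pct_change()
print(returns)

# Pairwise correlation and covariance matrices of the columns.
corr = returns.corr()
cov = returns.cov()
print(corr)
print(cov)

# The same correlation for a single pair of columns.
pair = returns['AAA'].corr(returns['BBB'])
print(pair)
```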

Unique Values, Value Counts, and Membership