pandas Study Notes (Python)

ggwcr

Article link: https://blog.csdn.net/ggwcr/article/details/76974859

import pandas as pd

import numpy as np

from pandas import Series,DataFrame

obj = Series(range(5),index=['a','a','b','b','c'])

obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique   # check whether the index labels are unique

False

obj['a']              # a duplicated label returns every matching entry

a    0
a    1
dtype: int64

The same holds for a DataFrame:

df = DataFrame(np.random.randn(4,3),index=list('aabb'))

df

	0	1	2
a	0.599982	2.421799	0.081475
a	0.420616	2.265408	1.196068
b	-1.153728	-0.173130	-0.098733
b	0.540624	-0.286814	0.287023

df.loc['b']

	0	1	2
b	-1.153728	-0.173130	-0.098733
b	0.540624	-0.286814	0.287023
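
As a minimal sketch of how duplicate labels change what indexing returns, the Series from above can be queried with a repeated and a unique label. The variable names `dup`, `single` and `is_repeat` are illustrative only.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# A label that occurs more than once selects a Series of all matches
dup = obj['a']

# A label that occurs exactly once selects a scalar
single = obj['c']

# Index.duplicated() flags every repeat of a label after its first occurrence
is_repeat = obj.index.duplicated()

print(type(dup).__name__, single, is_repeat.tolist())
```

Because the return type depends on the data, code that indexes by label may want to check `obj.index.is_unique` first.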

## Summarizing and computing descriptive statistics

df = DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=list('abcd'),columns=['one','two'])

df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

By default, sum() adds down each column and returns one total per column

df.sum()

one    9.25
two   -5.80
dtype: float64

df.sum(axis = 1)            # sum across each row instead

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are skipped automatically; pass skipna=False to disable this

df.mean(axis=1,skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
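
The effect of skipping NA values can be sketched on the same DataFrame. This also shows why `df.sum(axis=1)` reports 0.00 for the all-NA row `c`: an empty sum is 0 unless a minimum number of valid values is required through `min_count`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

# Default: NA values are skipped, so row 'a' averages its single valid value
row_mean = df.mean(axis=1)

# skipna=False: any NA in a row makes that row's result NA
row_mean_strict = df.mean(axis=1, skipna=False)

# The sum of an all-NA row is 0.0 by default ...
row_sum = df.sum(axis=1)

# ... unless at least one valid value is required
row_sum_min1 = df.sum(axis=1, min_count=1)

print(row_mean, row_mean_strict, row_sum, row_sum_min1, sep='\n\n')
```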

df.idxmax()       # index label of the maximum value in each column

one    b
two    d
dtype: object

df.cumsum()     # cumulative sum down each column

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

describe produces several summary statistics in one call

df.describe()

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

For non-numeric data, describe produces a different set of summary statistics

obj = Series(['a','a','b','c']*4)

obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Unique values, value counts, and membership

obj = Series(['c','a','d','a','a','b','b','c','c'])

uniques = obj.unique()

uniques

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are returned in order of first appearance, not sorted; sort them afterwards if needed (uniques.sort()).

obj.value_counts()

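c    3
a    3
b    2
d    1
dtype: int64

The counts returned by value_counts are sorted in descending order. value_counts is also available as a top-level pandas function that accepts any array or sequence (recent pandas releases deprecate pd.value_counts in favor of the Series method).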

pd.value_counts(obj.values,sort=False)

a    3
c    3
b    2
d    1
dtype: int64

isin performs a vectorized membership test and is useful for selecting a subset of a Series or of a DataFrame column.

mask = obj.isin(['b','c'])

mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
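
The three operations above (unique values, value counts, membership) can be sketched together using only Series methods, which avoids the top-level `pd.value_counts`. The variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique() keeps first-appearance order; sorted() gives an ordered list
sorted_uniques = sorted(obj.unique())

# Frequency of each value, most frequent first
counts = obj.value_counts()

# Boolean mask for membership, then use it to select a subset
mask = obj.isin(['b', 'c'])
subset = obj[mask]

print(sorted_uniques)
print(counts)
print(subset)
```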

data = DataFrame({
        'Qu1':[1,3,4,3,4],
        'Qu2':[2,3,1,2,3],
        'Qu3':[1,5,2,4,4]
    })

data

	Qu1	Qu2	Qu3
0	1	2	1
1	3	3	5
2	4	1	2
3	3	2	4
4	4	3	4

# Count how often each of the values 1 to 5 appears in every column; fill missing counts with 0
result = data.apply(lambda col: col.value_counts()).fillna(0)

result

	Qu1	Qu2	Qu3
1	1.0	1.0	1.0
2	0.0	2.0	1.0
3	2.0	2.0	0.0
4	2.0	0.0	2.0
5	0.0	0.0	1.0
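
The counts come back as floats because the intermediate result contains NaN before `fillna` runs. A small sketch, assuming integer counts are wanted, casts them back after filling; `sort_index` is added only to make the row order explicit.

```python
import pandas as pd

data = pd.DataFrame({
    'Qu1': [1, 3, 4, 3, 4],
    'Qu2': [2, 3, 1, 2, 3],
    'Qu3': [1, 5, 2, 4, 4],
})

# Per-column value counts, aligned on the union of all observed values
result = (data.apply(lambda col: col.value_counts())
              .fillna(0)        # a value that never occurs has count 0
              .astype(int)      # counts are whole numbers
              .sort_index())

print(result)
```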

## Handling missing data

string_data = Series(['aardvark','artichoke',np.nan,'avocado'])

string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None is also treated as NA

string_data[0] = None

string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

string_data.isnull()

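0     True
1    False
2     True
3    False
dtype: bool

NA handling methods

Method	Description
dropna	Filter axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna	Fill in missing data with a specified value or an interpolation method (such as ffill or bfill)
isnull	Return an object of booleans indicating which values are missing
notnull	Negation of isnull

## Filtering out missing data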

from numpy import nan as NA

data = Series([1,NA,3.5,NA,7])

data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

# Boolean indexing achieves the same result
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

# By default, dropna drops every row that contains at least one missing value
data = DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])

cleaned = data.dropna()

cleaned

	0	1	2
0	1.0	6.5	3.0

# Passing how='all' drops only the rows that are entirely NA
data.dropna(how = 'all')

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
3	NaN	6.5	3.0

data[4] = NA
data

	0	1	2	4
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	6.5	3.0	NaN

# To drop columns in the same way, pass axis=1
data.dropna(axis=1,how='all')

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

df = DataFrame(np.random.randn(7,3));df

	0	1	2
0	-1.051300	-0.526329	-0.204891
1	-0.977547	-1.706029	0.946824
2	0.540648	-1.228170	-1.180031
3	-0.320932	-0.667305	0.239980
4	-0.303641	-1.096918	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

df.loc[:4,1] = NA;df

	0	1	2
0	-1.051300	NaN	-0.204891
1	-0.977547	NaN	0.946824
2	0.540648	NaN	-1.180031
3	-0.320932	NaN	0.239980
4	-0.303641	NaN	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

df.loc[:2,2] = NA;df

	0	1	2
0	-1.051300	NaN	NaN
1	-0.977547	NaN	NaN
2	0.540648	NaN	NaN
3	-0.320932	NaN	0.239980
4	-0.303641	NaN	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

df

	0	1	2
0	-1.051300	NaN	NaN
1	-0.977547	NaN	NaN
2	0.540648	NaN	NaN
3	-0.320932	NaN	0.239980
4	-0.303641	NaN	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

df.dropna(thresh=2)

	0	1	2
3	-0.320932	NaN	0.239980
4	-0.303641	NaN	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921
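
Since the frame above holds random numbers, here is a deterministic sketch of the main `dropna` options on a small fixed DataFrame. The variable names are illustrative; `subset` restricts the NA check to the listed columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3]])

# Default (how='any'): drop rows containing any NA
any_dropped = data.dropna()

# how='all': drop only rows in which every value is NA
all_dropped = data.dropna(how='all')

# thresh=2: keep rows that have at least two non-NA values
thresh_kept = data.dropna(thresh=2)

# subset=[0]: only column 0 is inspected when deciding what to drop
subset_kept = data.dropna(subset=[0])

for name, frame in [('any', any_dropped), ('all', all_dropped),
                    ('thresh', thresh_kept), ('subset', subset_kept)]:
    print(name, frame.index.tolist())
```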

help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Return object with labels on given axis omitted where alternately any
    or all of the data are missing

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, or tuple/list thereof
        Pass tuple or list to drop on multiple axes
    how : {'any', 'all'}
        * any : if any NA values are present, drop that label
        * all : if all values are NA, drop that label
    thresh : int, default None
        int value : require that many non-NA values
    subset : array-like
        Labels along other axis to consider, e.g. if you are dropping rows
        these would be a list of columns to include
    inplace : boolean, default False
        If True, do operation inplace and return None.

    Returns
    -------
    dropped : DataFrame

df.fillna(0)

	0	1	2
0	-1.051300	0.000000	0.000000
1	-0.977547	0.000000	0.000000
2	0.540648	0.000000	0.000000
3	-0.320932	0.000000	0.239980
4	-0.303641	0.000000	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

# Passing a dict to fillna fills each column with its own value; key 3 matches no column here, so column 2 keeps its NaN
df.fillna({1:0.5,3:-1})

	0	1	2
0	-1.051300	0.500000	NaN
1	-0.977547	0.500000	NaN
2	0.540648	0.500000	NaN
3	-0.320932	0.500000	0.239980
4	-0.303641	0.500000	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

fillna returns a new object by default, but it can also modify the existing object in place

_ = df.fillna(0,inplace=True)

df

	0	1	2
0	-1.051300	0.000000	0.000000
1	-0.977547	0.000000	0.000000
2	0.540648	0.000000	0.000000
3	-0.320932	0.000000	0.239980
4	-0.303641	0.000000	0.355744
5	-0.424176	1.880769	-0.013825
6	0.643725	0.301759	-1.520921

The interpolation methods available for reindex can also be used with fillna

df = DataFrame(np.random.randn(6,3))

df

	0	1	2
0	0.936874	0.226055	-0.008118
1	-1.885668	0.947839	-0.344767
2	-1.620408	-0.895714	1.133733
3	1.442455	0.959708	0.107022
4	-1.455846	0.572486	1.087657
5	1.189054	-1.623793	-0.334216

df.loc[2:,1] = NA; df.loc[4:,2] = NA

df

	0	1	2
0	0.936874	0.226055	-0.008118
1	-1.885668	0.947839	-0.344767
2	-1.620408	NaN	1.133733
3	1.442455	NaN	0.107022
4	-1.455846	NaN	NaN
5	1.189054	NaN	NaN

# ffill: propagate the last valid observation forward to the next valid one
df.ffill()

	0	1	2
0	0.936874	0.226055	-0.008118
1	-1.885668	0.947839	-0.344767
2	-1.620408	0.947839	1.133733
3	1.442455	0.947839	0.107022
4	-1.455846	0.947839	0.107022
5	1.189054	0.947839	0.107022

help(df.fillna)

Help on method fillna in module pandas.core.frame:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method

    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0, 1, 'index', 'columns'}
    inplace : boolean, default False
        If True, fill in place. Note: this will modify any
        other views on this object, (e.g. a no-copy slice for a column in a
        DataFrame).
    limit : int, default None
        If method is specified, this is the maximum number of consecutive
        NaN values to forward/backward fill. In other words, if there is
        a gap with more than this number of consecutive NaNs, it will only
        be partially filled. If method is not specified, this is the
        maximum number of entries along the entire axis where NaNs will be
        filled.
    downcast : dict, default is None
        a dict of item->dtype of what to downcast if possible,
        or the string 'infer' which will try to downcast to an appropriate
        equal type (e.g. float64 to int64 if possible)

    See Also
    --------
    reindex, asfreq

    Returns
    -------
    filled : DataFrame

df.ffill(limit=2)

	0	1	2
0	0.936874	0.226055	-0.008118
1	-1.885668	0.947839	-0.344767
2	-1.620408	0.947839	1.133733
3	1.442455	0.947839	0.107022
4	-1.455846	NaN	0.107022
5	1.189054	NaN	0.107022
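
A deterministic sketch of forward and backward filling on a short Series, using the `ffill` and `bfill` methods. With `limit=2`, only the first two values of a run of consecutive NaN are filled. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Carry the last valid value forward across the whole gap
forward = s.ffill()

# Fill at most two consecutive NaN values, leaving the third one missing
forward_limited = s.ffill(limit=2)

# Pull the next valid value backward instead
backward = s.bfill()

print(forward.tolist(), forward_limited.tolist(), backward.tolist())
```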

data = Series([1,NA,3.5,NA,7])

# Fill with the mean of the non-missing values
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
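
The same idea extends to a DataFrame: passing the Series returned by `df.mean()` to `fillna` fills each column with its own mean, because the fill values are aligned on column labels. This is a minimal sketch on made-up data; the columns `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column name: x -> 2.0, y -> 6.0
column_means = df.mean()

# Each NaN is replaced by the mean of the column it sits in
filled = df.fillna(column_means)

print(filled)
```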

Hierarchical indexing

data = Series(np.random.randn(10),
              index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

data

a  1   -0.520847
   2    0.858349
   3   -1.048257
b  1    0.281738
   2    0.757592
   3    0.032117
c  1    0.526343
   2   -2.281655
d  2   -0.017352
   3    0.047178
dtype: float64

data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

# Select by a label of the outer level
data['b']

1    0.281738
2    0.757592
3    0.032117
dtype: float64

data['b':'c']

b  1    0.281738
   2    0.757592
   3    0.032117
c  1    0.526343
   2   -2.281655
dtype: float64

data.loc[['b','d']]

b  1    0.281738
   2    0.757592
   3    0.032117
d  2   -0.017352
   3    0.047178
dtype: float64

data.loc[:,2]

a    0.858349
b    0.757592
c   -2.281655
d   -0.017352
dtype: float64

data

a  1   -0.520847
   2    0.858349
   3   -1.048257
b  1    0.281738
   2    0.757592
   3    0.032117
c  1    0.526343
   2   -2.281655
d  2   -0.017352
   3    0.047178
dtype: float64

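Hierarchical indexing plays an important role in reshaping data and in group-based operations. For example, the unstack method rearranges this Series into a DataFrame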

data.unstack()

	1	2	3
a	-0.520847	0.858349	-1.048257
b	0.281738	0.757592	0.032117
c	0.526343	-2.281655	NaN
d	NaN	-0.017352	0.047178

# stack is the inverse of unstack
data.unstack().stack()

a  1   -0.520847
   2    0.858349
   3   -1.048257
b  1    0.281738
   2    0.757592
   3    0.032117
c  1    0.526343
   2   -2.281655
d  2   -0.017352
   3    0.047178
dtype: float64
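
A deterministic sketch of the selection and reshaping steps above, using the values 0 to 9 instead of random numbers. The variable names are illustrative. `dropna()` is applied after `stack()` so that the result does not depend on whether the installed pandas version keeps the NaN cells created by `unstack()`.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Select every entry whose inner-level label is 2
inner = data.loc[:, 2]

# Move the inner level into the columns; missing combinations become NaN
wide = data.unstack()

# Back to a Series with a two-level index, discarding the NaN cells
roundtrip = wide.stack().dropna()

print(inner)
print(wide)
```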

frame = DataFrame(np.arange(12).reshape(4,3),index=[['a','a','b','b'],[1,2,1,2]],
                  columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

frame

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.index.names = ['key1','key2']

frame.columns.names = ['state','color']

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame['Ohio']

	color	Green	Red
key1	key2
a	1	0	1
a	2	3	4
b	1	6	7
b	2	9	10

swaplevel takes two level numbers or names and returns a new object with those levels interchanged (the data itself is unchanged)

frame.swaplevel('key1','key2')

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

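sortlevel sorts the data by the values in a single level (the sort is stable). It has been removed from current pandas; sort_index(level=...) is its replacement

# With two index levels, 0 and 1 refer to the outer and the inner level respectively
frame.sort_index(level=1)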

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.swaplevel(0,1)

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

frame.swaplevel(0,1).sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
1	b	6	7	8
2	a	3	4	5
2	b	9	10	11
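
Swapping and then sorting can be sketched end to end as follows, using `sort_index(level=...)`. The frame is rebuilt here so the snippet runs on its own; `swapped` is an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Make key2 the outer level, then sort the rows by the new outer level
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

print(swapped)
```

Sorting after a swap keeps rows that share the same outer label next to each other.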

## Summary statistics by level

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.groupby(level='key2').sum()

state	Ohio		Colorado
color	Green	Red	Green
key2
1	6	8	10
2	12	14	16

frame.T.groupby(level='color').sum().T

	color	Green	Red
key1	key2
a	1	2	1
a	2	8	4
b	1	14	7
b	2	20	10
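
Aggregating by an index level can also be written with `groupby(level=...)`, which is how current pandas expresses it. The sketch below reproduces both tables above; grouping the columns is done on the transposed frame and then transposed back, which is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 label
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color label
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```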

## Using a DataFrame's columns

frame = DataFrame({
        'a':range(7),
        'b':range(7,0,-1),
        'c':['one','one','one','two','two','two','two'],
        'd':[0,1,2,0,1,2,3]
        })

frame

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

A DataFrame's set_index function turns one or more of its columns into the row index and returns a new DataFrame

frame2 = frame.set_index(['c','d'])

frame2

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

By default those columns are removed from the DataFrame, but you can also keep them

frame.set_index(['c','d'],drop=False)

		a	b	c	d
c	d
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3

reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns

frame2.reset_index()

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1
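
A short sketch of the round trip: once columns `c` and `d` form the index, a row can be looked up by the tuple of its two labels, and `reset_index` restores the original columns. The variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

# Move columns c and d into a two-level row index
indexed = frame.set_index(['c', 'd'])

# Look up a single cell by its (c, d) labels and a column name
value = indexed.loc[('two', 1), 'a']

# Move the index levels back into columns and restore the column order
restored = indexed.reset_index()[['a', 'b', 'c', 'd']]

print(value)
print(restored)
```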

Integer indexing

ser = Series(np.arange(3.))

# You might expect this to return the last element, but it raises KeyError: the integer index labels are 0, 1, 2
ser[-1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
 in ()
      1 # You might expect this to return the last element, but it raises KeyError: the integer index labels are 0, 1, 2
----> 2 ser[-1]

C:\Anaconda\Anaconda2\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
    558     def __getitem__(self, key):
    559         try:
--> 560             result = self.index.get_value(self, key)
    561
    562             if not lib.isscalar(result):

C:\Anaconda\Anaconda2\lib\site-packages\pandas\indexes\base.pyc in get_value(self, series, key)
   1909         try:
   1910             return self._engine.get_value(s, k,
-> 1911                                           tz=getattr(series.dtype, 'tz', None))
   1912         except KeyError as e1:
   1913             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3234)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2931)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6527)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6465)()

KeyError: -1L

By contrast, with a non-integer index there is no such ambiguity

ser2 = Series(np.arange(3.),index=['a','b','c'])

ser2[-1]

2.0

ser.loc[:1]

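0    0.0
1    1.0
dtype: float64

If you need reliable position-based indexing regardless of the index type, use iloc (or iat for a single value). Older pandas offered the Series method iget_value and the DataFrame methods irow and icol for this; they were deprecated and have since been removed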

ser3 = Series(range(3),index=[-5,1,3])

ser3

-5    0
 1    1
 3    2
dtype: int64

ser3.iloc[2]

2

frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])

frame

	0	1
2	0	1
0	2	3
1	4	5

frame.iloc[1]

0    2
1    3
Name: 0, dtype: int32
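
To make the difference between label-based and position-based access concrete, here is a minimal sketch with integer index labels that deliberately do not match the positions. `loc` always looks up labels and `iloc` always looks up positions, so neither depends on the index type. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(range(3), index=[-5, 1, 3])

by_label = ser3.loc[-5]      # the entry whose label is -5
by_position = ser3.iloc[-1]  # the last entry, whatever its label

frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

row_at_position_1 = frame.iloc[1]    # second row (its label is 0)
row_with_label_1 = frame.loc[1]      # row labelled 1 (the third row)
column_at_position_1 = frame.iloc[:, 1]

print(by_label, by_position)
print(row_at_position_1.tolist(), row_with_label_1.tolist())
```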