pandas-cheat-sheet

Latest recommended article published 2020-12-17 11:30:55

ycycg

Tags: python

Article link: https://blog.csdn.net/weixin_42043940/article/details/107502063

Data structures

Series

Creation

>>> import pandas as pd
>>> s = pd.Series(data=list('一二三四五'), index=list('abcde'), name='This is a Series.')
>>> s
a    一
b    二
c    三
d    四
e    五
Name: This is a Series., dtype: object

Attributes

>>> s.values
array(['一', '二', '三', '四', '五'], dtype=object)

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> s.name
'This is a Series.'

>>> s.dtype
dtype('O')

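Selecting values

loc slices are closed on both ends (the end label is included).

iloc slices are half-open (the end position is excluded).

Both support NumPy-style fancy indexing, including single-label, multi-label, and combined row/column indexing, callable indexing (the callable returns the key to apply, such as a boolean mask), and boolean indexing.

[]

[] selects values much like a list and supports slicing.
Note that with a float index, the slice compares index labels rather than positions, and an unsorted index can raise an error. Later pandas releases deprecated this label-based behavior of [] on a float index, so prefer .loc for label-based slicing and .iloc for positional slicing.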

>>> s_int = pd.Series([1,2,3,4], index=[1,3,5,6])
>>> s_float = pd.Series([1,2,3,4], index=[1.,3.,5.,6.])
>>> s_int[2:]
5    3
6    4
dtype: int64
>>> s_float[2:]
3.0    2
5.0    3
6.0    4
dtype: int64
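
Because the meaning of a bare [] slice depends on the index dtype (and on the pandas version), it is usually safer to state the intent explicitly. The following is a minimal sketch reusing the two Series above; the variable names `by_position`, `by_label`, and `int_by_label` are only illustrative.

```python
import pandas as pd

s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

# Positional: skip the first two elements, whatever their labels are.
by_position = s_float.iloc[2:]

# Label-based: keep every label >= 2. The index is sorted, so the
# label 2 does not have to exist in the index.
by_label = s_float.loc[2:]

# The same label-based slice works on the integer-indexed Series.
int_by_label = s_int.loc[2:]

print(by_position.tolist())   # [3, 4]
print(by_label.tolist())      # [2, 3, 4]
print(int_by_label.tolist())  # [2, 3, 4]
```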

Selecting a single value

>>> s.loc['a']
'一'
>>> s.iloc[0]
'一'

Selecting a Series (slicing; note the difference between loc and iloc)

>>> s.loc['a':'c'] # note: the end label 'c' is included
a    一
b    二
c    三
Name: This is a Series., dtype: object
>>> s.iloc[0:2] 
a    一
b    二
Name: This is a Series., dtype: object
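
The off-by-one difference between the two slices is easy to miss. As a small sketch (the variable names are illustrative), the snippet below shows that a label slice over three labels corresponds to a positional slice with stop 3, not 2.

```python
import pandas as pd

s = pd.Series(list('一二三四五'), index=list('abcde'))

label_slice = s.loc['a':'c']  # labels a, b, c: the end label is included
pos_slice = s.iloc[0:2]       # positions 0 and 1: the end position is excluded

# To get the same three elements by position, the stop must be 3.
same_as_label = s.iloc[0:3]

print(len(label_slice), len(pos_slice))   # 3 2
print(label_slice.equals(same_as_label))  # True
```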

DataFrame

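Creation and adding a column

>>> import numpy as np
>>> df = pd.DataFrame(data=np.random.normal(size=(5, 3)), index=list('abcde'), columns=list('一二三'))
>>> # The new Series has the default index 0..4, which shares no labels with 'a'..'e', so column 四 is all NaN.
>>> df['四'] = pd.Series(['one', 'two', 'three', 'four', 'five'])
>>> df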
一	二	三	四
a	-0.849012	-0.613929	0.179685	NaN
b	-0.599515	0.076521	0.396257	NaN
c	1.011487	0.113627	0.285235	NaN
d	-1.409431	1.072680	0.443549	NaN
e	-1.348019	0.273230	0.118538	NaN
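
The NaN column above comes from index alignment: assigning a Series matches rows by label, not by position. A minimal sketch of two ways to get the intended values follows; the column names `misaligned`, `aligned`, and `by_position` are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(5, 3)),
                  index=list('abcde'), columns=list('一二三'))
words = ['one', 'two', 'three', 'four', 'five']

# A Series with the default 0..4 index shares no labels with 'a'..'e',
# so every row of the new column is NaN.
df['misaligned'] = pd.Series(words)

# Option 1: give the Series the same index as the frame.
df['aligned'] = pd.Series(words, index=df.index)

# Option 2: assign a plain list, which is matched by position.
df['by_position'] = words

print(df[['misaligned', 'aligned', 'by_position']])
```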

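Attributes

Note: the df.values and df.dtypes outputs below list only the three numeric columns, so they appear to have been captured before column 四 was added; the random values also differ between runs.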

>>> df.values
array([[ 0.09102421,  0.28402622, -0.06825601],
       [ 0.27612001,  0.86260381,  0.46526601],
       [ 1.50256847,  1.87798034,  0.40763852],
       [ 1.78042358,  0.93830879,  1.09419721],
       [-0.05455816, -0.18235814,  0.07744923]])
       
>>> df.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> df.columns
Index(['一', '二', '三', '四'], dtype='object')

>>> df.dtypes
一    float64
二    float64
三    float64
dtype: object

>>> df.shape
(5, 4)

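Selecting values

loc slices are closed on both ends (the end label is included).

iloc slices are half-open (the end position is excluded).

Both support NumPy-style fancy indexing, including single-column, multi-column, and combined row/column indexing, callable indexing (the callable returns the key to apply, such as a boolean mask), and boolean indexing.

[]

With [], a single label selects a column, while a slice selects rows.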

>>> df[:2]
一	二	三	四
a	1.540417	-0.178028	-0.200006	NaN
b	-1.528491	-0.924294	0.071763	NaN
>>> df['一']
a    1.540417
b   -1.528491
c    0.196924
d    0.011444
e    2.383037
Name: 一, dtype: float64

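How do you select multiple columns? Pass a list of column labels, e.g. df[['一', '二']]; callable indexing also works: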

>>> df[lambda x: ['一', '二']]
一	二
a	1.540417	-0.178028
b	-1.528491	-0.924294
c	0.196924	0.911356
d	0.011444	1.068396
e	2.383037	-0.115746
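
For comparison, here is a short sketch of three equivalent ways to take the same two columns. It uses a small deterministic frame rather than the random one above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=list('abcde'), columns=list('一二三'))

two_cols = df[['一', '二']]               # list of labels -> DataFrame
via_callable = df[lambda x: ['一', '二']]  # the callable's return value is used as the key
via_loc = df.loc[:, ['一', '二']]          # explicit form: all rows, two columns

print(two_cols.equals(via_callable), two_cols.equals(via_loc))  # True True
```

The callable form is mostly useful in method chains, where the intermediate frame has no name yet.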

Selecting a DataFrame (slicing; note the difference between loc and iloc)

>>> df.loc['a':'c', '一':'三'] # note: both end labels are included
一	二	三
a	-0.849012	-0.613929	0.179685
b	-0.599515	0.076521	0.396257
c	1.011487	0.113627	0.285235
>>> df.iloc[0:2, 0:2]
一	二
a	-0.849012	-0.613929
b	-0.599515	0.076521

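Selecting a Series (slicing; note the difference between loc and iloc)

A Series can be taken along a row or along a column, either in full or only in part.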

>>> df.loc['a']
一   -0.849012
二   -0.613929
三    0.179685
四         NaN
Name: a, dtype: object
>>> df.iloc[0]
一   -0.849012
二   -0.613929
三    0.179685
四         NaN
Name: a, dtype: object

Selecting a single value

>>> df.loc['a', '一']
-0.8490115761208608

>>> df.iloc[0, 0]
-0.8490115761208608

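Boolean indexing can combine conditions with `&`, `|`, and `~` (element-wise and, or, not)

Select the rows whose values satisfy a condition. Wrap each comparison in parentheses, because `&` and `|` bind more tightly than comparison operators.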

>>> df[(df['一']>0)&(df['二']<0)]
一	二	三	四
a	1.540417	-0.178028	-0.200006	NaN
e	2.383037	-0.115746	-1.618292	NaN

>>> df.loc[df['三']>0, df.columns=='二']
二
b	-0.924294
d	1.068396
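
As a small deterministic sketch (the frame and its column names `x` and `y` are made up for the illustration), the snippet below combines masks with each operator and shows why the Python keyword `and` cannot be used in their place.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, -2, 3, -4], 'y': [-1, -1, 1, 1]},
                  index=list('abcd'))

both = df[(df['x'] > 0) & (df['y'] < 0)]    # x positive AND y negative
either = df[(df['x'] > 0) | (df['y'] < 0)]  # x positive OR y negative
negated = df[~(df['x'] > 0)]                # NOT x positive

# Python's `and` needs a single truth value, which a Series does not have.
try:
    df[(df['x'] > 0) and (df['y'] < 0)]
    raised = False
except ValueError:
    raised = True

print(both.index.tolist())     # ['a']
print(either.index.tolist())   # ['a', 'b', 'c']
print(negated.index.tolist())  # ['b', 'd']
print(raised)                  # True
```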

For non-numeric values, use the isin method

>>> df1 = pd.DataFrame([list('abcdefg'), list('ABCDEFG'), list('壹贰叁肆伍陆柒')], index=['one', 'two', 'three'], columns=list('一二三四五六七'))
>>> df1
一	二	三	四	五	六	七
one	a	b	c	d	e	f	g
two	A	B	C	D	E	F	G
three	壹	贰	叁	肆	伍	陆	柒
>>> df1[df1.四.isin(['d', '肆']) & df1.七.isin(['G', '柒'])]
一	二	三	四	五	六	七
three	壹	贰	叁	肆	伍	陆	柒

Use `at` (by label) and `iat` (by position) to access a single element
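
A minimal sketch of scalar access on a small made-up frame; `at` takes row and column labels, `iat` takes integer positions, and both can also assign.

```python
import pandas as pd

df = pd.DataFrame({'一': [1.0, 2.0], '二': [3.0, 4.0]}, index=['a', 'b'])

v_label = df.at['a', '二']  # row label 'a', column label '二'
v_pos = df.iat[1, 0]        # second row, first column
df.at['b', '二'] = 40.0     # at/iat can also set a value

print(v_label, v_pos, df.loc['b', '二'])  # 3.0 2.0 40.0
```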

Renaming row or column labels (also works for Series)

Pass inplace=True to modify the object in place.

>>> df.rename(index={'a':'A'}, columns={'一':'壹'})
壹	二	三	四
A	-0.849012	-0.613929	0.179685	NaN
b	-0.599515	0.076521	0.396257	NaN
c	1.011487	0.113627	0.285235	NaN
d	-1.409431	1.072680	0.443549	NaN
e	-1.348019	0.273230	0.118538	NaN

Deleting

>>> df.drop(index='d', columns='二')
一	三	四
a	-0.849012	0.179685	NaN
b	-0.599515	0.396257	NaN
c	1.011487	0.285235	NaN
e	-1.348019	0.118538	NaN
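# Pass inplace=True to modify the object in place

>>> del df['一'] # in place

>>> df.pop('一') # in place; returns the removed column (an alternative to del: run one or the other)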
a   -0.847616
b   -0.646367
c   -0.144770
d   -1.296774
e   -0.965623
Name: 一, dtype: float64
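
The three deletion styles differ in whether they mutate the frame. The sketch below uses a small made-up frame to make the difference visible; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'一': [1, 2, 3], '二': [4, 5, 6], '三': [7, 8, 9]},
                  index=list('abc'))

smaller = df.drop(index='b', columns='二')  # returns a new object
shape_after_drop = df.shape                 # df itself is unchanged

popped = df.pop('一')  # removes the column from df and returns it
del df['二']           # removes the column from df, returns nothing

print(smaller.shape, shape_after_drop)  # (2, 2) (3, 3)
print(popped.tolist())                  # [1, 2, 3]
print(list(df.columns))                 # ['三']
```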

Selecting columns by dtype

df.select_dtypes(include=None, exclude=None)
Parameters
----------
include, exclude : scalar or list-like
    A selection of dtypes or strings to be included/excluded. At least
    one of these parameters must be supplied.

Returns
-------
DataFrame
    The subset of the frame including the dtypes in ``include`` and
    excluding the dtypes in ``exclude``.

Notes
-----
* To select all *numeric* types, use ``np.number`` or ``'number'``
* To select strings you must use the ``object`` dtype, but note that
  this will return *all* object dtype columns
* To select datetimes, use ``np.datetime64``, ``'datetime'`` or
  ``'datetime64'``
* To select timedeltas, use ``np.timedelta64``, ``'timedelta'`` or
  ``'timedelta64'``
* To select Pandas categorical dtypes, use ``'category'``
* To select Pandas datetimetz dtypes, use ``'datetimetz'`` or ``'datetime64[ns, tz]'``
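
A short sketch of the include/exclude parameters described above, on a made-up frame with one column per kind (the column names are hypothetical). Note that boolean columns do not count as `'number'`.

```python
import pandas as pd

df = pd.DataFrame({
    'n_int': [1, 2],
    'n_float': [0.5, 1.5],
    'text': ['x', 'y'],
    'flag': [True, False],
})

numeric = df.select_dtypes(include='number')      # int and float columns
non_numeric = df.select_dtypes(exclude='number')  # everything else
only_bool = df.select_dtypes(include='bool')

print(list(numeric.columns))      # ['n_int', 'n_float']
print(list(non_numeric.columns))  # ['text', 'flag']
print(list(only_bool.columns))    # ['flag']
```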

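Handling duplicates

df.duplicated(subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')

Returns a boolean Series marking the rows whose values in the given column(s) duplicate another row. With keep='first' (the default) the first occurrence is left unmarked; with keep='last' the last occurrence is left unmarked instead; keep=False marks every duplicate.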

>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 5)), columns=list('abcde'))
>>> df
	a	b	c	d	e
0	9	6	4	8	3
1	6	9	5	7	3
2	5	0	9	9	7
3	1	1	5	6	5
>>> df.duplicated('e')
0    False
1     True
2    False
3    False
dtype: bool

df.drop_duplicates(subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)

Returns a DataFrame with the duplicate rows removed.

>>> df.drop_duplicates('e')
	a	b	c	d	e
0	9	6	4	8	3
2	5	0	9	9	7
3	1	1	5	6	5
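
To make the effect of `keep` concrete, here is a small sketch that reuses columns a and e from the frame above (fixed values instead of random ones) and tries all three settings.

```python
import pandas as pd

df = pd.DataFrame({'a': [9, 6, 5, 1], 'e': [3, 3, 7, 5]})

first = df.duplicated('e')               # default keep='first'
last = df.duplicated('e', keep='last')   # the last occurrence is the one kept
every = df.duplicated('e', keep=False)   # mark all members of a duplicate group
kept_last = df.drop_duplicates('e', keep='last')

print(first.tolist())            # [False, True, False, False]
print(last.tolist())             # [True, False, False, False]
print(every.tolist())            # [True, True, False, False]
print(kept_last.index.tolist())  # [1, 2, 3]
```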

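Common functions

s.nunique(dropna=True)

>>> df['一'].nunique() # number of unique values in the Series
5

s.unique()

>>> df['一'].unique() # all unique values in the Series
array([-1.16909486,  2.67058491,  0.33543619, -0.97217834, -0.5271507 ])

s.count(level=None)

df.count(axis=0, level=None, numeric_only=False)
# Returns the number of non-missing values

s.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

# Shown here for a Series (DataFrame.value_counts is also available from pandas 1.1)
# Returns how many times each unique value occurs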
>>> s
a    一
b    二
c    三
d    四
e    五
Name: This is a Series., dtype: object
>>> s.value_counts()
四    1
一    1
三    1
五    1
二    1
Name: This is a Series., dtype: int64
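
Since every value in `s` is unique, the output above is not very telling. The sketch below uses a made-up Series with repeats and a missing value to show how nunique, count, and value_counts relate.

```python
import numpy as np
import pandas as pd

s = pd.Series(['x', 'y', 'x', np.nan, 'x'])

n_unique = s.nunique()                   # NaN is ignored by default
n_present = s.count()                    # number of non-missing values
counts = s.value_counts()                # occurrences of each unique value
shares = s.value_counts(normalize=True)  # fractions of the non-missing values

print(n_unique, n_present)        # 2 4
print(counts['x'], counts['y'])   # 3 1
print(shares['x'])                # 0.75
```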

df.describe(percentiles=None, include=None, exclude=None)

Returns summary statistics (by default for the numeric columns only).

>>> df.describe()
一	二	三
count	5.000000	5.000000	5.000000
mean	0.067519	-0.016268	0.726868
std	1.566170	0.630012	0.723670
min	-1.169095	-0.981688	-0.191667
25%	-0.972178	-0.279319	0.215404
50%	-0.527151	0.158416	0.893424
75%	0.335436	0.492137	1.072577
max	2.670585	0.529113	1.644603

df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)

Prints the columns, the number of non-missing values per column (non-null count), and the dtypes.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   一       5 non-null      float64
 1   二       5 non-null      float64
 2   三       5 non-null      float64
 3   四       0 non-null      object 
dtypes: float64(3), object(1)
memory usage: 360.0+ bytes

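df.idxmax(axis=0, skipna=True)

>>> df['一'].idxmax() # returns the index label of the maximum value
'b'

df.nlargest(n, columns, keep='first')

>>> df['一'].nlargest(3) # Series form (no columns argument): the n largest values with their index labels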
b    2.670585
c    0.335436
e   -0.527151
Name: 一, dtype: float64
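
The signatures above are the DataFrame versions, while the examples call the Series versions. The following sketch shows both side by side on a made-up frame (values rounded from the output above; column 二 is invented).

```python
import pandas as pd

df = pd.DataFrame({'一': [-1.2, 2.7, 0.3, -1.0, -0.5],
                   '二': [5, 4, 3, 2, 1]}, index=list('abcde'))

top_label = df['一'].idxmax()       # label of the largest value in one column
per_column = df.idxmax()           # one label per column
top3_series = df['一'].nlargest(3)  # Series method: no `columns` argument
top3_rows = df.nlargest(3, '一')    # DataFrame method: whole rows, ordered by column 一

print(top_label)                   # b
print(per_column.tolist())         # ['b', 'a']
print(top3_series.index.tolist())  # ['b', 'c', 'e']
print(top3_rows.index.tolist())    # ['b', 'c', 'e']
```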

s.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs,) -> Series

df.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs) -> Frame

# Clips values to the given bounds. lower and upper may also be Series.
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5
>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4
>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

>>> df['一'].replace(df.loc['a':'b', '一'].to_list(), [1, 2], inplace=False)
a    1.000000
b    2.000000
c    0.335436
d   -0.972178
e   -0.527151
Name: 一, dtype: float64

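df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

Sets one or more columns as the index.

keys: a column label, or a list of labels for a multi-level index
drop: whether to remove the columns that become the index
append: whether to append to the existing index; False replaces it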
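
A brief sketch of how the three parameters change the result, on a made-up frame with hypothetical columns `k1`, `k2`, and `v`.

```python
import pandas as pd

df = pd.DataFrame({'k1': ['x', 'y'], 'k2': [1, 2], 'v': [10.0, 20.0]})

dropped = df.set_index('k1')                # k1 becomes the index and leaves the columns
kept = df.set_index('k1', drop=False)       # k1 is both index and column
multi = df.set_index(['k1', 'k2'])          # several keys give a MultiIndex
appended = df.set_index('k1', append=True)  # the old index stays as the outer level

print(list(dropped.columns))       # ['k2', 'v']
print('k1' in kept.columns)        # True
print(multi.index.nlevels)         # 2
print(list(appended.index.names))  # [None, 'k1']
```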

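df.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Replaces every entry that does not satisfy the condition with NaN (or with other). Combined with dropna it can be used for selection, much like boolean indexing, except that where itself keeps the original shape.

df.mask(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Replaces every entry that satisfies the condition with NaN (or with other); it is the inverse of where.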
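
where and mask act element by element, not row by row. A minimal sketch on a made-up frame, also showing the `other` argument and the dropna combination mentioned above:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, -2, 3], 'y': [4, 5, -6]})

kept_pos = df.where(df > 0)         # entries failing the condition become NaN
hidden_pos = df.mask(df > 0)        # entries meeting the condition become NaN
filled = df.where(df > 0, other=0)  # use 0 instead of NaN
rows = df.where(df > 0).dropna()    # only rows where every entry passed

print(int(kept_pos.isna().sum().sum()))    # 2
print(int(hidden_pos.isna().sum().sum()))  # 4
print(filled['x'].tolist(), filled['y'].tolist())  # [1, 0, 3] [4, 5, 0]
print(rows.index.tolist())                 # [0]
```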

df.query(expr, inplace=False, **kwargs)

Selects rows with a string expression. The expression may also use in, and, or, not, and so on.

>>> df.query('(一<0)|(三<0)&(二<0)')
一	二	三	四
b	-0.410303	-0.285667	-2.524211	NaN
c	-1.536862	2.035034	0.349553	NaN
d	-0.033990	0.299646	0.106703	NaN
e	0.407070	-1.369581	-0.525784	NaN
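
In the expression above, `&` binds more tightly than `|`, so it reads as "一<0, or both 三<0 and 二<0". The sketch below checks this against the equivalent boolean mask on a made-up frame (columns `x`, `y`, `z` are hypothetical) and also shows `@` for referring to a local variable.

```python
import pandas as pd

df = pd.DataFrame({'x': [-1, 2, 3, -4],
                   'y': [1, -2, 3, -4],
                   'z': [1, -2, -3, 4]}, index=list('abcd'))

# query form: & is evaluated before |
q = df.query('(x < 0) | (z < 0) & (y < 0)')

# the same selection written as an explicit boolean mask
m = df[(df['x'] < 0) | ((df['z'] < 0) & (df['y'] < 0))]

# @name refers to a Python variable; `and` and `in` are allowed inside the string
threshold = 0
q_var = df.query('x > @threshold and y in [1, 3]')

print(q.index.tolist())      # ['a', 'b', 'd']
print(q.equals(m))           # True
print(q_var.index.tolist())  # ['c']
```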