Several efficient pandas functions for data analysis

Latest recommended article published on 2021-05-01 18:00:38

Neo的作战室


Article link: https://blog.csdn.net/qq_34741466/article/details/105081217

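Pandas is well suited to the following tasks:

Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data;
Size mutability: columns can be inserted into and deleted from a DataFrame or higher-dimensional objects;
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. align the data automatically;
Flexible group-by functionality for split-apply-combine operations on data sets, for both aggregating and transforming data;
Easy conversion of ragged, differently indexed data held in other Python and NumPy data structures into DataFrame objects;
Intelligent label-based slicing, indexing, and subsetting of large data sets;
Intuitive merging and joining of data sets;
Flexible reshaping and pivoting of data sets;
Hierarchical labeling of axes (an axis may carry multiple levels of labels);
Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving / loading data in the HDF5 format;
Time-series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, lagging, and so on.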

1.read_csv(nrows=n)

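Often we only need to load the first few rows of a .csv file and can load more later as needed. The nrows parameter of read_csv limits how many data rows are read.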

>>> import pandas as pd
>>> df = pd.read_csv('data.csv', nrows=10, index_col=0)
>>> df.head()

# 0	1	2	3	4	5	6	7	8	9	...	29	30	31	32	33	34	35	36	37	filename
# 0	12325.511892	7.092492e+06	7.069441e+06	7.119130e+06	7.087014e+06	2.645499e+09	0.977247	-0.251634	7709.052316	5.918416e+06	...	51234	0.400544	-1.046827	-0.283916	0.009277	-0.111152	0.155025	-0.009825	0.152713	7000.csv
# 1	8643.355279	6.239019e+06	6.216428e+06	6.254992e+06	6.240004e+06	2.857471e+09	-0.659959	-0.073468	17736.813086	5.212498e+06	...	68520	0.022398	-1.341807	0.908607	-0.016664	0.044525	-0.172820	0.006349	0.056972	7001.csv
# 2	43699.910307	6.678498e+06	6.566008e+06	6.745702e+06	6.670449e+06	2.738184e+09	0.134083	-0.581888	51102.185380	5.482646e+06	...	65369	0.168657	-1.514273	0.976057	-0.331418	0.171253	-0.299731	0.150363	-0.569118	7002.csv
# 3	15798.471167	6.160467e+06	6.150675e+06	6.205663e+06	6.150875e+06	2.618198e+09	1.380211	0.630973	7206.189166	5.202435e+06	...	51953	0.550111	-1.086101	-0.933335	0.421756	0.215704	-0.393008	-0.167108	0.256546	7003.csv
# 4	17275.368489	6.357542e+06	6.348864e+06	6.407177e+06	6.349065e+06	2.530302e+09	1.841131	1.885843	16357.722009	5.398000e+06	...	49288	0.476934	-1.050420	-0.949506	0.458661	0.239662	-0.436177	-0.265248	0.183004	7004.csv
# 5 rows × 39 columns
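The idea of loading a few rows first and the rest later can be sketched as follows. This is a minimal illustration, not the author's code: the in-memory CSV built with io.StringIO stands in for data.csv, and the column names id and value are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv: a header plus 10 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(10))

# Read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3, index_col=0)
print(head)

# Later, read the remaining rows: skip data rows 1..3 but keep the header (row 0).
rest = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=range(1, 4))
print(len(rest))

# Alternatively, iterate over the whole file in fixed-size chunks.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=4)
]
print(chunk_sizes)
```

For files that do not fit in memory, the chunksize variant is usually the more convenient of the two, since each chunk is an ordinary DataFrame.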

2.map()

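map() maps the values of a Series according to an input correspondence. Each value in the Series is replaced with another value, which may come from a function, a dict, or another Series.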

Series.map(self, arg, na_action=None)
Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters
arg : function, collections.abc.Mapping subclass or Series
Mapping correspondence.

na_action : {None, ‘ignore’}, default None
If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

Returns
Series
Same index as caller.

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0       cat
1       dog
2       NaN
3    rabbit
dtype: object
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0    kitten
1     puppy
2       NaN
3       NaN
dtype: object

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object


>>> s.map('I am a {}'.format, na_action='ignore')
0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object


>>> dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['India', 'USA', 'China', 'Russia'])
>>> dframe

               b         d         e
India  -0.198547  1.253089  0.304832
USA    -1.429178  0.972379  0.844679
China   1.231675  2.231760 -0.168356
Russia -0.769768 -0.888504  0.310137

>>> changefn = lambda x: x / 10  # applied element-wise to each value
>>> dframe['d'].map(changefn)
India     0.125309
USA       0.097238
China     0.223176
Russia   -0.088850
Name: d, dtype: float64
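A typical use of map() is recoding categories through a lookup table. The sketch below assumes a made-up size_to_cm dictionary and sample values; it shows that keys missing from the dict become NaN, and that na_action='ignore' keeps a function from being called on missing values.

```python
import numpy as np
import pandas as pd

sizes = pd.Series(["S", "M", np.nan, "XL"])

# Hypothetical lookup table; "XL" is deliberately absent.
size_to_cm = {"S": 90, "M": 100, "L": 110}

# Values not found in the dict (and missing values) map to NaN.
mapped = sizes.map(size_to_cm)
print(mapped)

# With na_action="ignore", len() is never called on the missing value.
lengths = sizes.map(len, na_action="ignore")
print(lengths)
```

Without na_action="ignore", the second call would pass the missing value to len() and fail, because a float NaN has no length.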

3.apply()

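apply() lets you pass in a function and apply it along an axis of a DataFrame, that is, to each column or each row. The Series version, Series.apply, applies the function to every value of a Series.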

DataFrame.apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwds)
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

Parameters
func : function
Function to apply to each column or row.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:

0 or ‘index’: apply function to each column.

1 or ‘columns’: apply function to each row.

raw : bool, default False
Determines if row or column is passed as a Series or ndarray object:

False : passes each row or column as a Series to the function.

True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None
These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.

‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

New in version 0.23.0.

args : tuple
Positional arguments to pass to func in addition to the array/series.

**kwds
Additional keyword arguments to pass as keywords arguments to func.

Returns
Series or DataFrame
Result of applying func along the given axis of the DataFrame.

>>> dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['India', 'USA', 'China', 'Russia'])
>>> dframe
               b         d         e
India  -0.904215 -0.764773  0.107618
USA     0.297659 -0.201756  0.194368
China  -0.436726 -0.811713 -1.461039
Russia -1.045325 -1.388524  2.413348
>>> fn = lambda x: x.max() - x.min()
>>> dframe.apply(fn)
b    1.342984
d    1.186768
e    3.874387
dtype: float64
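The axis, raw, and result_type parameters described above are easier to follow with fixed numbers than with random data. The following is a small sketch; the scores table and its names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per student, one column per subject.
scores = pd.DataFrame(
    {"math": [90, 60, 75], "art": [70, 80, 85]},
    index=["Ann", "Bob", "Cid"],
)

# axis=0 (default): the function receives each column as a Series.
col_range = scores.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_range = scores.apply(lambda row: row.max() - row.min(), axis=1)

# raw=True passes plain ndarrays, which suits NumPy reductions.
col_sum = scores.apply(np.sum, raw=True)

# result_type="expand" turns list-like results into columns.
min_max = scores.apply(
    lambda row: [row.min(), row.max()], axis=1, result_type="expand"
)

print(col_range, row_range, col_sum, min_max, sep="\n\n")
```

For simple reductions like these, built-in methods such as scores.max() - scores.min() are generally faster than apply(); apply() is most useful when the logic has no vectorized equivalent.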

4.isin()

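isin() checks whether each value belongs to a given set of values, which makes it a quick way to filter for the rows you need.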

Series.isin(self, values)
Check whether values are contained in Series.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.

Parameters
values : set or list-like
The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Returns
Series
Series of booleans indicating if each element is in values.

Raises
TypeError
If values is a string

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

>>> s.isin(['lama'])
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool
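Since isin() returns a boolean Series, it can be used directly as a row filter. A minimal sketch, assuming a made-up animals table with a legs column:

```python
import pandas as pd

# Hypothetical table for illustration.
animals = pd.DataFrame(
    {
        "animal": ["lama", "cow", "beetle", "hippo"],
        "legs": [4, 4, 6, 4],
    }
)

# Boolean mask: True where the animal is in the given list.
mask = animals["animal"].isin(["cow", "lama"])

selected = animals[mask]    # rows whose animal is in the list
excluded = animals[~mask]   # rows whose animal is NOT in the list

print(selected)
print(excluded)
```

The ~ operator negates the mask, which gives the "not in" filter without a separate method.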

5.copy()

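copy() makes a copy of a pandas object. When one DataFrame is assigned to another variable, both names refer to the same object, so a change made through one name also shows up through the other. To avoid this problem, use copy().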

DataFrame.copy(self: ~FrameOrSeries, deep: bool = True) → ~FrameOrSeries
Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters
deep : bool, default True
Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.

Returns
copy : Series or DataFrame
Object type matches caller.

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

>>> data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
>>> data
0       India
1    Pakistan
2       China
3    Mongolia
dtype: object
>>> d = data  # plain assignment: d and data refer to the same object
>>> d[0] = "uu"
>>> d
0          uu
1    Pakistan
2       China
3    Mongolia
dtype: object
>>> data
0          uu
1    Pakistan
2       China
3    Mongolia
dtype: object
>>> new = data.copy()
>>> new[1] = "sdf"
>>> new
0          uu
1         sdf
2       China
3    Mongolia
dtype: object
>>> data
0          uu
1    Pakistan
2       China
3    Mongolia
dtype: object
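The same contrast between plain assignment and copy() holds for a DataFrame. The sketch below uses invented sample data and only the default deep copy; how deep=False behaves depends on the pandas version and on whether Copy-on-Write is enabled, so it is left out here.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"country": ["India", "China"], "score": [1, 2]})

alias = df            # plain assignment: a second name for the same object
snapshot = df.copy()  # deep copy (deep=True is the default)

# Modify the original in place.
df.loc[0, "score"] = 100

print(alias is df)                # the alias is the very same object
print(alias.loc[0, "score"])      # so it sees the change
print(snapshot.loc[0, "score"])   # the copy keeps the old value
```

Taking a copy before an in-place modification is a simple way to keep the original data intact for later comparison.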