使用pandas处理表格数据时,会经常的用到Series.map,Series.apply,Pandas.apply三个方法,熟悉这几个方法的使用,可以极大的提高工作效率。
一、常用的处理表格的函数
pd.read_csv('xx.csv',encoding='utf-8')
pd.read_excel('xxx.xls',encoding='gbk')
to_csv('xxx.csv',encoding='gbk')
to_excel('xxx.xls',encoding='gbk')
pd.concat([df1,df2],ignore_index=True,axis=1)
二、pandas的遍历
pandas提供了iter*系列函数,来遍历DataFrame
二、Series方法
在jupyter notebook中的在某个方法后输入?,然后Ctrl+Enter会显示出文档
Series.map
Series.map的文档
Signature: a.map(arg, na_action=None)
Docstring:
Map values of Series according to input correspondence.
Used for substituting each value in a Series with another value,
that may be derived from a function, a ``dict`` or
a :class:`Series`.
Parameters
----------
arg : function, dict, or Series
Mapping correspondence.
na_action : {None, 'ignore'}, default None
If 'ignore', propagate NaN values, without passing them to the
mapping correspondence.
Returns
-------
Series
Same index as caller.
See Also
--------
Series.apply : For applying more complex functions on a Series.
DataFrame.apply : Apply a function row-/column-wise.
DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
Notes
-----
When ``arg`` is a dictionary, values in Series that are not in the
dictionary (as keys) are converted to ``NaN``. However, if the
dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
provides a method for default values), then this default is used
rather than ``NaN``.
Examples
--------
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0 cat
1 dog
2 NaN
3 rabbit
dtype: object
``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
dtype: object
It also accepts a function:
>>> s.map('I am a {}'.format)
0 I am a cat
1 I am a dog
2 I am a nan
3 I am a rabbit
dtype: object
To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:
>>> s.map('I am a {}'.format, na_action='ignore')
0 I am a cat
1 I am a dog
2 NaN
3 I am a rabbit
dtype: object
File: c:\anaconda3\lib\site-packages\pandas\core\series.py
Type: method
示例
#Series.map的参数为一个字典
data['gender1'] = data['gender'].map({'男':1,'女':0})
#Series.map的参数为一个函数
data['gender2'] = data['gender'].map(lambda x: 1 if x == '男' else 0)
data.head()
结果
Series.apply
文档
Signature: a.apply(func, convert_dtype=True, args=(), **kwds)
Docstring:
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series)
or a Python function that only works on single values.
Parameters
----------
func : function
Python function or NumPy ufunc to apply.
convert_dtype : bool, default True
Try to find better dtype for elementwise function results. If
False, leave as dtype=object.
args : tuple
Positional arguments passed to func after the series value.
**kwds
Additional keyword arguments passed to func.
Returns
-------
Series or DataFrame
If func returns a Series object the result will be a DataFrame.
See Also
--------
Series.map: For element-wise operations.
Series.agg: Only perform aggregating type operations.
Series.transform: Only perform transforming type operations.
Examples
--------
Create a series with typical summer temperatures for each city.
>>> s = pd.Series([20, 21, 12],
... index=['London', 'New York', 'Helsinki'])
>>> s
London 20
New York 21
Helsinki 12
dtype: int64
Square the values by defining a function and passing it as an
argument to ``apply()``.
>>> def square(x):
... return x ** 2
>>> s.apply(square)
London 400
New York 441
Helsinki 144
dtype: int64
Square the values by passing an anonymous function as an
argument to ``apply()``.
>>> s.apply(lambda x: x ** 2)
London 400
New York 441
Helsinki 144
dtype: int64
Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.
>>> def subtract_custom_value(x, custom_value):
... return x - custom_value
>>> s.apply(subtract_custom_value, args=(5,))
London 15
New York 16
Helsinki 7
dtype: int64
Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.
>>> def add_custom_values(x, **kwargs):
... for month in kwargs:
... x += kwargs[month]
... return x
>>> s.apply(add_custom_values, june=30, july=20, august=25)
London 95
New York 96
Helsinki 87
dtype: int64
Use a function from the Numpy library.
>>> s.apply(np.log)
London 2.995732
New York 3.044522
Helsinki 2.484907
dtype: float64
File: c:\anaconda3\lib\site-packages\pandas\core\series.py
Type: method
Series.agg
Signature: a.agg(func, axis=0, *args, **kwargs)
Docstring:
Aggregate using one or more operations over the specified axis.
.. versionadded:: 0.20.0
Parameters
----------
func : function, str, list or dict
Function to use for aggregating the data. If a function, must either
work when passed a Series or when passed to Series.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. ``[np.sum, 'mean']``
- dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index'}
Parameter needed for compatibility with DataFrame.
*args
Positional arguments to pass to `func`.
**kwargs
Keyword arguments to pass to `func`.
Returns
-------
DataFrame, Series or scalar
if DataFrame.agg is called with a single function, returns a Series
if DataFrame.agg is called with several functions, returns a DataFrame
if Series.agg is called with single function, returns a scalar
if Series.agg is called with several functions, returns a Series
See Also
--------
Series.apply : Invoke function on a Series.
Series.transform : Transform function producing a Series with like indexes.
Notes
-----
`agg` is an alias for `aggregate`. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
--------
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0 1
1 2
2 3
3 4
dtype: int64
>>> s.agg('min')
1
>>> s.agg(['min', 'max'])
min 1
max 4
dtype: int64
File: c:\anaconda3\lib\site-packages\pandas\core\series.py
Type: method
四、pandas方法
pandas.apply
Signature:
data.apply(
func,
axis=0,
broadcast=None,
raw=False,
reduce=None,
result_type=None,
args=(),
**kwds,
)
Docstring:
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is
either the DataFrame's index (``axis=0``) or the DataFrame's columns
(``axis=1``). By default (``result_type=None``), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the `result_type` argument.
Parameters
----------
func : function
Function to apply to each column or row.
axis : {0 or 'index', 1 or 'columns'}, default 0
Axis along which the function is applied:
* 0 or 'index': apply function to each column.
* 1 or 'columns': apply function to each row.
broadcast : bool, optional
Only relevant for aggregation functions:
* ``False`` or ``None`` : returns a Series whose length is the
length of the index or the number of columns (based on the
`axis` parameter)
* ``True`` : results will be broadcast to the original shape
of the frame, the original index and columns will be retained.
.. deprecated:: 0.23.0
This argument will be removed in a future version, replaced
by result_type='broadcast'.
raw : bool, default False
* ``False`` : passes each row or column as a Series to the
function.
* ``True`` : the passed function will receive ndarray objects
instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
reduce : bool or None, default None
Try to apply reduction procedures. If the DataFrame is empty,
`apply` will use `reduce` to determine whether the result
should be a Series or a DataFrame. If ``reduce=None`` (the
default), `apply`'s return value will be guessed by calling
`func` on an empty Series
(note: while guessing, exceptions raised by `func` will be
ignored).
If ``reduce=True`` a Series will always be returned, and if
``reduce=False`` a DataFrame will always be returned.
.. deprecated:: 0.23.0
This argument will be removed in a future version, replaced
by ``result_type='reduce'``.
result_type : {'expand', 'reduce', 'broadcast', None}, default None
These only act when ``axis=1`` (columns):
* 'expand' : list-like results will be turned into columns.
* 'reduce' : returns a Series if possible rather than expanding
list-like results. This is the opposite of 'expand'.
* 'broadcast' : results will be broadcast to the original shape
of the DataFrame, the original index and columns will be
retained.
The default behaviour (None) depends on the return value of the
applied function: list-like results will be returned as a Series
of those. However if the apply function returns a Series these
are expanded to columns.
.. versionadded:: 0.23.0
args : tuple
Positional arguments to pass to `func` in addition to the
array/series.
**kwds
Additional keyword arguments to pass as keywords arguments to
`func`.
Returns
-------
applied : Series or DataFrame
See Also
--------
DataFrame.applymap: For elementwise operations.
DataFrame.aggregate: Only perform aggregating type operations.
DataFrame.transform: Only perform transforming type operations.
Notes
-----
In the current implementation apply calls `func` twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if `func` has
side-effects, as they will take effect twice for the first
column/row.
Examples
--------
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
Using a numpy universal function (in this case the same as
``np.sqrt(df)``):
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Using a reducing function on either axis
>>> df.apply(np.sum, axis=0)
A 12
B 27
dtype: int64
>>> df.apply(np.sum, axis=1)
0 13
1 13
2 13
dtype: int64
Retuning a list-like will result in a Series
>>> df.apply(lambda x: [1, 2], axis=1)
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
Passing result_type='expand' will expand list-like results
to columns of a Dataframe
>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand')
0 1
0 1 2
1 1 2
2 1 2
Returning a Series inside the function is similar to passing
``result_type='expand'``. The resulting column names
will be the Series index.
>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
Passing ``result_type='broadcast'`` will ensure the same shape
result, whether list-like or scalar is returned by the function,
and broadcast it along the axis. The resulting column names will
be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
A B
0 1 2
1 1 2
2 1 2
File: c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type: method
pandas.applymap
Signature: data.applymap(func)
Docstring:
Apply a function to a Dataframe elementwise.
This method applies a function that accepts and returns a scalar
to every element of a DataFrame.
Parameters
----------
func : callable
Python function, returns a single value from a single value.
Returns
-------
DataFrame
Transformed DataFrame.
See Also
--------
DataFrame.apply : Apply a function along input axis of DataFrame.
Notes
-----
In the current implementation applymap calls `func` twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if `func` has
side-effects, as they will take effect twice for the first
column/row.
Examples
--------
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
0 1
0 1.000 2.120
1 3.356 4.567
>>> df.applymap(lambda x: len(str(x)))
0 1
0 3 4
1 5 5
Note that a vectorized version of `func` often exists, which will
be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2)
0 1
0 1.000000 4.494400
1 11.262736 20.857489
But it's better to avoid applymap in that case.
>>> df ** 2
0 1
0 1.000000 4.494400
1 11.262736 20.857489
File: c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type: method
pandas.agg
Signature: data.agg(func, axis=0, *args, **kwargs)
Docstring:
Aggregate using one or more operations over the specified axis.
.. versionadded:: 0.20.0
Parameters
----------
func : function, str, list or dict
Function to use for aggregating the data. If a function, must either
work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
- function
- string function name
- list of functions and/or function names, e.g. ``[np.sum, 'mean']``
- dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index', 1 or 'columns'}, default 0
If 0 or 'index': apply function to each column.
If 1 or 'columns': apply function to each row.
*args
Positional arguments to pass to `func`.
**kwargs
Keyword arguments to pass to `func`.
Returns
-------
DataFrame, Series or scalar
if DataFrame.agg is called with a single function, returns a Series
if DataFrame.agg is called with several functions, returns a DataFrame
if Series.agg is called with single function, returns a scalar
if Series.agg is called with several functions, returns a Series
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
`numpy` aggregation functions (`mean`, `median`, `prod`, `sum`, `std`,
`var`), where the default is to compute the aggregation of the flattened
array, e.g., ``numpy.mean(arr_2d)`` as opposed to ``numpy.mean(arr_2d,
axis=0)``.
`agg` is an alias for `aggregate`. Use the alias.
See Also
--------
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.core.groupby.GroupBy : Perform operations over groups.
pandas.core.resample.Resampler : Perform operations over resampled bins.
pandas.core.window.Rolling : Perform operations over rolling window.
pandas.core.window.Expanding : Perform operations over expanding window.
pandas.core.window.EWM : Perform operation over exponential weighted
window.
Notes
-----
`agg` is an alias for `aggregate`. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
--------
>>> df = pd.DataFrame([[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9],
... [np.nan, np.nan, np.nan]],
... columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
A B C
sum 12.0 15.0 18.0
min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
A B
max NaN 8.0
min 1.0 2.0
sum 12.0 NaN
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
dtype: float64
File: c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type: method