Python Data Analysis and Data Mining for Beginners — Chapter 2: pandas, Section 5: Getting Started with pandas

Getting Started with pandas

In [1]:
 
 
 
 
 
import pandas as pd
 
 
In [2]:
 
 
 
 
 
from pandas import Series, DataFrame
 
 
In [3]:
 
 
 
 
 
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
 
 
 

Introduction to pandas Data Structures

 

Series

In [4]:
 
 
 
 
 
obj = pd.Series([4, 7, -5, 3])
obj
 
 
Out[4]:
0    4
1    7
2   -5
3    3
dtype: int64
In [5]:
 
 
 
 
 
obj.values
obj.index  # like range(4)
 
 
Out[5]:
RangeIndex(start=0, stop=4, step=1)
In [6]:
 
 
 
 
 
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # set a custom index
obj2
obj2.index
 
 
Out[6]:
Index(['d', 'b', 'a', 'c'], dtype='object')
 
 
 
 
 
 
Init signature: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:    
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.
Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
 
In [7]:
 
 
 
 
 
obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]
 
 
Out[7]:
c    3
a   -5
d    6
dtype: int64
In [10]:
 
 
 
 
 
obj2[obj2 > 0]
 
 
Out[10]:
d    6
b    7
c    3
dtype: int64
In [11]:
 
 
 
 
 
obj2 * 2
np.exp(obj2)
 
 
Out[11]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
In [12]:
 
 
 
 
 
'b' in obj2
'e' in obj2
 
 
Out[12]:
False
In [13]:
 
 
 
 
 
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
 
 
Out[13]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [14]:
 
 
 
 
 
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
 
 
Out[14]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [17]:
 
 
 
 
 
pd.isnull(obj4)
 
 
Out[17]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
 
 
 
 
 
 
Signature: pd.isnull(obj)
Docstring:
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
Parameters
----------
arr : ndarray or object value
    Object to check for null-ness
Returns
-------
isna : array-like of bool or bool
    Array or bool indicating whether an object is null or if an array is
    given which of the element is null.
See also
--------
pandas.notna: boolean inverse of pandas.isna
pandas.isnull: alias of isna
 
In [18]:
 
 
 
 
 
pd.notnull(obj4)
 
 
Out[18]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
 
 
 
 
 
 
Signature: pd.notnull(obj)
Docstring:
Replacement for numpy.isfinite / -numpy.isnan which is suitable for use
on object arrays.
Parameters
----------
arr : ndarray or object value
    Object to check for *not*-null-ness
Returns
-------
notisna : array-like of bool or bool
    Array or bool indicating whether an object is *not* null or if an array
    is given which of the element is *not* null.
See also
--------
pandas.isna : boolean inverse of pandas.notna
pandas.notnull : alias of notna
 
In [ ]:
 
 
 
 
 
obj4.isnull()
 
 
In [19]:
 
 
 
 
 
obj3
 
 
Out[19]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [20]:
 
 
 
 
 
obj4
 
 
Out[20]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [21]:
 
 
 
 
 
obj3 + obj4  # values add where the indexes align
 
 
Out[21]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
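The NaN entries above come from labels present in only one operand. As a minimal sketch (reusing the `sdata` dictionary from above), `Series.add` with `fill_value` treats a one-sided gap as 0; note that California stays NaN because it is missing or NaN on *both* sides:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value substitutes 0 for a label that is missing (or NaN) on ONE
# side only; California is NaN on both sides, so it remains NaN
total = obj3.add(obj4, fill_value=0)
```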
In [23]:
 
 
 
 
 
obj4.name = 'population'
obj4.index.name = 'state'  # name the index
obj4
 
 
Out[23]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
In [24]:
 
 
 
 
 
obj
 
 
Out[24]:
0    4
1    7
2   -5
3    3
dtype: int64
In [25]:
 
 
 
 
 
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
obj
 
 
Out[25]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
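Index assignment replaces all labels at once, so the new list must match the length of the Series. A small sketch of the failure mode (the mismatched assignment is purely illustrative):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
# assigning an index of the wrong length raises ValueError and
# leaves the Series untouched
try:
    obj.index = ['a', 'b']
except ValueError:
    pass
```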
 

DataFrame

In [26]:
 
 
 
 
 
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
 
 
In [27]:
 
 
 
 
 
frame
 
 
Out[27]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003
In [29]:
 
 
 
 
 
frame.head()
 
 
Out[29]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
 
 
 
 
 
 
Signature: frame.head(n=5)
Docstring:
Return the first n rows.
Parameters
----------
n : int, default 5
    Number of rows to select.
Returns
-------
obj_head : type of caller
    The first n rows of the caller object.
 
In [30]:
 
 
 
 
 
pd.DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order
 
 
Out[30]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
 
 
 
 
 
 
Init signature: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Docstring:    
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1    int64
col2    int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2
    a   b   c   d   e
0   2   8   8   3   4
1   4   2   9   0   9
2   1   0   7   8   0
3   5   1   7   1   3
4   6   0   2   4   2
 
In [31]:
 
 
 
 
 
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],  # missing data shows as NaN
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2
 
 
Out[31]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
In [32]:
 
 
 
 
 
frame2.columns
 
 
Out[32]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [34]:
 
 
 
 
 
frame2['state']
 
 
Out[34]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
In [35]:
 
 
 
 
 
frame2.year  # attribute access only works when the column name is a valid identifier (no spaces)
 
 
Out[35]:
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
In [36]:
 
 
 
 
 
frame2.loc['three']
 
 
Out[36]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
In [37]:
 
 
 
 
 
frame2['debt'] = 16.5
frame2
 
 
Out[37]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5
In [39]:
 
 
 
 
 
frame2['debt'] = np.arange(6.)
frame2
 
 
Out[39]:
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
 
 
 
 
 
 
Docstring:
arange([start,] stop[, step,], dtype=None)
Return evenly spaced values within a given interval.
Values are generated within the half-open interval ``[start, stop)``
(in other words, the interval including `start` but excluding `stop`).
For integer arguments the function is equivalent to the Python built-in
`range <http://docs.python.org/lib/built-in-funcs.html>`_ function,
but returns an ndarray rather than a list.
When using a non-integer step, such as 0.1, the results will often not
be consistent.  It is better to use ``linspace`` for these cases.
Parameters
----------
start : number, optional
    Start of interval.  The interval includes this value.  The default
    start value is 0.
stop : number
    End of interval.  The interval does not include this value, except
    in some cases where `step` is not an integer and floating point
    round-off affects the length of `out`.
step : number, optional
    Spacing between values.  For any output `out`, this is the distance
    between two adjacent values, ``out[i+1] - out[i]``.  The default
    step size is 1.  If `step` is specified as a position argument,
    `start` must also be given.
dtype : dtype
    The type of the output array.  If `dtype` is not given, infer the data
    type from the other input arguments.
Returns
-------
arange : ndarray
    Array of evenly spaced values.
    For floating point arguments, the length of the result is
    ``ceil((stop - start)/step)``.  Because of floating point overflow,
    this rule may result in the last element of `out` being greater
    than `stop`.
See Also
--------
linspace : Evenly spaced numbers with careful handling of endpoints.
ogrid: Arrays of evenly spaced numbers in N-dimensions.
mgrid: Grid-shaped arrays of evenly spaced numbers in N-dimensions.
Examples
--------
>>> np.arange(3)
array([0, 1, 2])
>>> np.arange(3.0)
array([ 0.,  1.,  2.])
>>> np.arange(3,7)
array([3, 4, 5, 6])
>>> np.arange(3,7,2)
array([3, 5])
 
In [41]:
 
 
 
 
 
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2  # values are aligned on frame2's index; unmatched labels get NaN
 
 
Out[41]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
In [42]:
 
 
 
 
 
frame2['eastern'] = frame2.state == 'Ohio'
frame2
 
 
Out[42]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
In [43]:
 
 
 
 
 
del frame2['eastern']  # delete a column
frame2.columns
 
 
Out[43]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [45]:
 
 
 
 
 
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
 
 
In [46]:
 
 
 
 
 
frame3 = pd.DataFrame(pop)
frame3
 
 
Out[46]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
In [47]:
 
 
 
 
 
frame3.T  # transpose
 
 
Out[47]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
In [48]:
 
 
 
 
 
pd.DataFrame(pop, index=[2001, 2002, 2003])
 
 
Out[48]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
In [49]:
 
 
 
 
 
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
 
 
Out[49]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
In [50]:
 
 
 
 
 
frame3.index.name = 'year'; frame3.columns.name = 'state'  # name the index and the columns
frame3
 
 
Out[50]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
In [51]:
 
 
 
 
 
frame3.values
 
 
Out[51]:
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
In [52]:
 
 
 
 
 
frame2.values
 
 
Out[52]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
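The `dtype=object` above is the general rule: `.values` must pick a single NumPy dtype that can hold every column, so mixed columns fall back to `object`. A small sketch (using a cut-down `frame2` for illustration):

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Ohio'],
                       'pop': [1.5, 1.7]})
# mixed column dtypes force a single object ndarray...
mixed = frame2.values
# ...but selecting only numeric columns yields a numeric array
# (int64 and float64 upcast to float64)
numeric = frame2[['year', 'pop']].values
```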
 

Index Objects

In [53]:
 
 
 
 
 
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
 
 
Out[53]:
Index(['b', 'c'], dtype='object')
 

index[1] = 'd' # TypeError

In [54]:
 
 
 
 
 
labels = pd.Index(np.arange(3))
labels
 
 
Out[54]:
Int64Index([0, 1, 2], dtype='int64')
In [55]:
 
 
 
 
 
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels
 
 
Out[55]:
True
In [56]:
 
 
 
 
 
frame3
frame3.columns
 
 
Out[56]:
Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [57]:
 
 
 
 
 
'Ohio' in frame3.columns
 
 
Out[57]:
True
In [58]:
 
 
 
 
 
2003 in frame3.index
 
 
Out[58]:
False
In [59]:
 
 
 
 
 
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
 
 
Out[59]:
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
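Duplicate labels are legal, and selecting a duplicated label then returns every match as a Series rather than a scalar; `Index.is_unique` tells you whether this can happen. A quick sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'bar'])
# a duplicated label selects ALL matching entries as a Series
hits = s['foo']
```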
 

Essential Functionality

 

Reindexing

In [60]:
 
 
 
 
 
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
 
 
Out[60]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [62]:
 
 
 
 
 
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
 
 
Out[62]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
 
 
 
 
 
 
Signature: obj.reindex(index=None, **kwargs)
Docstring:
Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False
Parameters
----------
index : array-like, optional (should be specified using keywords)
    New labels / index to conform to. Preferably an Index object to
    avoid duplicating data
method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional
    method to use for filling holes in reindexed DataFrame.
    Please note: this is only  applicable to DataFrames/Series with a
    monotonically increasing/decreasing index.
    * default: don't fill gaps
    * pad / ffill: propagate last valid observation forward to next
      valid
    * backfill / bfill: use next valid observation to fill gap
    * nearest: use nearest valid observations to fill gap
copy : boolean, default True
    Return a new object, even if the passed indexes are the same
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
fill_value : scalar, default np.NaN
    Value to use for missing values. Defaults to NaN, but can be any
    "compatible" value
limit : int, default None
    Maximum number of consecutive elements to forward or backward fill
tolerance : optional
    Maximum distance between original and new labels for inexact
    matches. The values of the index at the matching locations most
    satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
    Tolerance may be a scalar value, which applies the same tolerance
    to all values, or list-like, which applies variable tolerance per
    element. List-like includes list, tuple, array, Series, and must be
    the same size as the index and its dtype must exactly match the
    index's type.
    .. versionadded:: 0.17.0
    .. versionadded:: 0.21.0 (list-like tolerance)
Examples
--------
``DataFrame.reindex`` supports two calling conventions
* ``(index=index_labels, columns=column_labels, ...)``
* ``(labels, axis={'index', 'columns'}, ...)``
We *highly* recommend using keyword arguments to clarify your
intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({
...      'http_status': [200,200,404,404,301],
...      'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...       index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned ``NaN``.
>>> new_index= ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...             'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to
the keyword ``fill_value``. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
``method`` to fill the ``NaN`` values.
>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
Or we can use "axis-style" keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
To further illustrate the filling functionality in
``reindex``, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
Suppose we decide to expand the dataframe to cover a wider
date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
The index entries that did not have a value in the original data frame
(for example, '2009-12-29') are by default filled with ``NaN``.
If desired, we can fill in the missing values using one of several
options.
For example, to backpropagate the last valid value to fill the ``NaN``
values, pass ``bfill`` as an argument to the ``method`` keyword.
>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
Please note that the ``NaN`` value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the ``NaN`` values present
in the original dataframe, use the ``fillna()`` method.
See the :ref:`user guide <basics.reindexing>` for more.
Returns
-------
 
In [63]:
 
 
 
 
 
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
 
 
Out[63]:
0      blue
2    purple
4    yellow
dtype: object
In [64]:
 
 
 
 
 
obj3.reindex(range(6), method='ffill')  # forward-fill the gaps introduced by the new index
 
 
Out[64]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
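The mirror-image option is `bfill`, which pulls the *next* valid observation backward. A minimal sketch with the same `obj3`:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# bfill fills each gap from the next valid label; labels past the
# last valid one stay NaN
filled = obj3.reindex(range(6), method='bfill')
```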
In [72]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
 
 
Out[72]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
In [66]:
 
 
 
 
 
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
 
 
Out[66]:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
In [73]:
 
 
 
 
 
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
 
 
Out[73]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
In [74]:
 
 
 
 
 
frame.loc[['a', 'b', 'c', 'd'], states]  # note: passing a list that contains missing labels will raise KeyError in the future; use .reindex instead
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  """Entry point for launching an IPython kernel.
Out[74]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
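As the warning says, label lists with missing entries should go through `reindex` instead of `loc`. A sketch of the supported equivalent:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
# reindex accepts labels that are absent and fills them with NaN
# instead of raising KeyError
result = frame.reindex(index=['a', 'b', 'c', 'd'],
                       columns=['Texas', 'Utah', 'California'])
```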
 

Dropping Entries from an Axis

In [91]:
 
 
 
 
 
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
 
 
Out[91]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [80]:
 
 
 
 
 
new_obj = obj.drop('c')
new_obj
 
 
Out[80]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
 
 
 
 
 
 
Signature: obj.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Docstring:
Return new object with labels in requested axis removed.
Parameters
----------
labels : single label or list-like
    Index or column labels to drop.
axis : int or axis name
    Whether to drop labels from the index (0 / 'index') or
    columns (1 / 'columns').
index, columns : single label or list-like
    Alternative to specifying `axis` (``labels, axis=1`` is
    equivalent to ``columns=labels``).
    .. versionadded:: 0.21.0
level : int or level name, default None
    For MultiIndex
inplace : bool, default False
    If True, do operation inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
    If 'ignore', suppress error and existing labels are dropped.
Returns
-------
dropped : type of caller
Examples
--------
>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
                      columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
Notes
 
In [79]:
 
 
 
 
 
obj.drop(['d', 'c'])  # drop several labels at once
 
 
Out[79]:
a    0.0
b    1.0
e    4.0
dtype: float64
In [87]:
 
 
 
 
 
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
 
 
Out[87]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [88]:
 
 
 
 
 
data.drop(['Colorado', 'Ohio'])
 
 
Out[88]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
In [89]:
 
 
 
 
 
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')
 
 
Out[89]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
In [92]:
 
 
 
 
 
obj.drop('c', inplace=True)
 
 
In [93]:
 
 
 
 
 
obj
 
 
Out[93]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
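A common pitfall with `inplace=True`: the method mutates the object and returns `None`, so assigning the result back discards your data. A small sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
# inplace=True mutates obj and returns None, so never write
# obj = obj.drop('c', inplace=True)
ret = obj.drop('c', inplace=True)
```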
 

Indexing, Selection, and Filtering

In [94]:
 
 
 
 
 
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
 
 
Out[94]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
In [95]:
 
 
 
 
 
obj['b']
 
 
Out[95]:
1.0
In [96]:
 
 
 
 
 
obj[1]
 
 
Out[96]:
1.0
In [97]:
 
 
 
 
 
obj[2:4]
 
 
Out[97]:
c    2.0
d    3.0
dtype: float64
In [98]:
 
 
 
 
 
obj[['b', 'a', 'd']]
 
 
Out[98]:
b    1.0
a    0.0
d    3.0
dtype: float64
In [99]:
 
 
 
 
 
obj[[1, 3]]
 
 
Out[99]:
b    1.0
d    3.0
dtype: float64
In [100]:
 
 
 
 
 
obj[obj < 2]
 
 
Out[100]:
a    0.0
b    1.0
dtype: float64
In [101]:
 
 
 
 
 
obj['b':'c']
 
 
Out[101]:
b    1.0
c    2.0
dtype: float64
In [105]:
 
 
 
 
 
obj['b':'c'] = 5
 
 
In [104]:
 
 
 
 
 
obj
 
 
Out[104]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
In [106]:
 
 
 
 
 
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
 
 
Out[106]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [107]:
 
 
 
 
 
data['two']
 
 
Out[107]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [108]:
 
 
 
 
 
data[['three', 'one']]
 
 
Out[108]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
In [109]:
 
 
 
 
 
data[:2]
 
 
Out[109]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
In [110]:
 
 
 
 
 
data[data['three'] > 5]
 
 
Out[110]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [111]:
 
 
 
 
 
data < 5
data[data < 5] = 0
data
 
 
Out[111]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
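The boolean assignment above mutates `data` in place. When you want the same effect without mutation, `DataFrame.where` keeps the entries that satisfy a condition and substitutes a value elsewhere, returning a new frame. A sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# where() keeps entries satisfying the condition and substitutes 0
# for the rest, returning a NEW frame instead of mutating data
clipped = data.where(data >= 5, 0)
```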
 
Selection with loc and iloc
In [112]:
 
 
 
 
 
data.loc['Colorado', ['two', 'three']]
 
 
Out[112]:
two      5
three    6
Name: Colorado, dtype: int32
In [113]:
 
 
 
 
 
data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]
 
 
Out[113]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9
In [114]:
 
 
 
 
 
data.loc[:'Utah', 'two']  # label-based selection
data.iloc[:, :3][data.three > 5]  # position-based selection, then a boolean filter
 
 
Out[114]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
 

Integer Indexes

 

ser = pd.Series(np.arange(3.))
ser
ser[-1]  # raises: -1 is ambiguous with a default integer index

In [115]:
 
 
 
 
 
ser = pd.Series(np.arange(3.))
 
 
In [116]:
 
 
 
 
 
ser
 
 
Out[116]:
0    0.0
1    1.0
2    2.0
dtype: float64
In [117]:
 
 
 
 
 
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
 
 
Out[117]:
2.0
In [118]:
 
 
 
 
 
ser[:1]
 
 
Out[118]:
0    0.0
dtype: float64
In [119]:
 
 
 
 
 
ser.loc[:1]
 
 
Out[119]:
0    0.0
1    1.0
dtype: float64
In [120]:
 
 
 
 
 
ser.iloc[:1]
 
 
Out[120]:
0    0.0
dtype: float64
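The two results above summarize the integer-index rules: `loc` slices by label and includes the endpoint, while `iloc` slices by position and excludes it. A compact sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer labels 0, 1, 2
# loc slices by LABEL and includes the endpoint...
by_label = ser.loc[:1]
# ...while iloc slices by POSITION and excludes it
by_pos = ser.iloc[:1]
```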
 

Arithmetic and Data Alignment

In [121]:
 
 
 
 
 
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
 
 
Out[121]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
In [122]:
 
 
 
 
 
s2
 
 
Out[122]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
In [123]:
 
 
 
 
 
s1 + s2  # labels present in only one operand produce NaN, not a one-sided value
 
 
Out[123]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
In [124]:
 
 
 
 
 
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
 
 
Out[124]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
In [125]:
 
 
 
 
 
df2
 
 
Out[125]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [126]:
 
 
 
 
 
df1 + df2
 
 
Out[126]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
In [127]:
 
 
 
 
 
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
 
 
Out[127]:
   A
0  1
1  2
In [128]:
 
 
 
 
 
df2
 
 
Out[128]:
   B
0  3
1  4
In [129]:
 
 
 
 
 
df1 - df2  # both the row and column labels must match; here nothing overlaps
 
 
Out[129]:
    A   B
0 NaN NaN
1 NaN NaN
 
Arithmetic methods with fill values
In [130]:
 
 
 
 
 
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
 
 
In [131]:
 
 
 
 
 
df2
 
 
Out[131]:
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
In [132]:
 
 
 
 
 
df2.loc[1, 'b'] = np.nan
 
 
In [133]:
 
 
 
 
 
df1
 
 
Out[133]:
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
In [134]:
 
 
 
 
 
df1 + df2
 
 
Out[134]:
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
In [135]:
 
 
 
 
 
df1.add(df2, fill_value=0)  # a label missing on one side is treated as 0
 
 
Out[135]:
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
 
 
 
 
 
 
Signature: df1.add(other, axis='columns', level=None, fill_value=None)
Docstring:
Addition of dataframe and other, element-wise (binary operator `add`).
Equivalent to ``dataframe + other``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
 
In [136]:
 
 
 
 
 
1 / df1
 
 
Out[136]:
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
In [138]:
 
 
 
 
 
df1.rdiv(1)
 
 
Out[138]:
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
 
 
 
 
 
 
Signature: df1.rdiv(other, axis='columns', level=None, fill_value=None)
Docstring:
Floating division of dataframe and other, element-wise (binary operator `rtruediv`).
Equivalent to ``other / dataframe``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
 
In [140]:
 
 
 
 
 
df1.reindex(columns=df2.columns, fill_value=0)
 
 
Out[140]:
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
In [143]:
 
 
 
 
 
df1.reindex(index=df2.index,columns=df2.columns, fill_value=np.pi)
 
 
Out[143]:
          a         b          c          d         e
0  0.000000  1.000000   2.000000   3.000000  3.141593
1  4.000000  5.000000   6.000000   7.000000  3.141593
2  8.000000  9.000000  10.000000  11.000000  3.141593
3  3.141593  3.141593   3.141593   3.141593  3.141593
 
Operations between DataFrame and Series
In [144]:
 
 
 
 
 
arr = np.arange(12.).reshape((3, 4))
arr
 
 
Out[144]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
In [145]:
 
 
 
 
 
arr[0]
 
 
Out[145]:
array([0., 1., 2., 3.])
In [146]:
 
 
 
 
 
arr - arr[0]
 
 
Out[146]:
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
In [147]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
 
 
In [149]:
 
 
 
 
 
series = frame.iloc[0]
frame
 
 
Out[149]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [150]:
 
 
 
 
 
series
 
 
Out[150]:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
In [151]:
 
 
 
 
 
frame - series
 
 
Out[151]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
In [152]:
 
 
 
 
 
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
 
 
Out[152]:
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
In [153]:
 
 
 
 
 
series3 = frame['d']
frame
 
 
Out[153]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [154]:
 
 
 
 
 
series3
 
 
Out[154]:
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
In [155]:
 
 
 
 
 
frame.sub(series3, axis='index')  # subtract, matching on the row labels
 
 
Out[155]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
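To see the axis switch in isolation, here is a small sketch: with `axis='index'` the Series index is matched against the row labels, so the subtraction broadcasts across the columns instead of down the rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']

# Match on row labels; each row has its 'd' value subtracted from every column
result = frame.sub(series3, axis='index')
```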
 
 
 
 
 
 
Signature: frame.sub(other, axis='columns', level=None, fill_value=None)
Docstring:
Subtraction of dataframe and other, element-wise (binary operator `sub`).
Equivalent to ``dataframe - other``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
DataFrame.rsub
 
 

Function Application and Mapping

In [156]:
 
 
 
 
 
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
 
 
Out[156]:
               b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221
In [158]:
 
 
 
 
 
np.abs(frame)  # element-wise absolute value
 
 
Out[158]:
               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221
 
 
 
 
 
 
Call signature:  np.abs(*args, **kwargs)
Type:            ufunc
String form:     <ufunc 'absolute'>
File:            c:\users\qq123\anaconda3\lib\site-packages\numpy\__init__.py
Docstring:      
absolute(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])
Calculate the absolute value element-wise.
Parameters
----------
x : array_like
    Input array.
out : ndarray, None, or tuple of ndarray and None, optional
    A location into which the result is stored. If provided, it must have
    a shape that the inputs broadcast to. If not provided or `None`,
    a freshly-allocated array is returned. A tuple (possible only as a
    keyword argument) must have length equal to the number of outputs.
where : array_like, optional
    Values of True indicate to calculate the ufunc at that position, values
    of False indicate to leave the value in the output alone.
**kwargs
    For other keyword-only arguments, see the
    :ref:`ufunc docs <ufuncs.kwargs>`.
Returns
-------
absolute : ndarray
    An ndarray containing the absolute value of
    each element in `x`.  For complex input, ``a + ib``, the
    absolute value is :math:`\sqrt{ a^2 + b^2 }`.
Examples
--------
>>> x = np.array([-1.2, 1.2])
>>> np.absolute(x)
array([ 1.2,  1.2])
>>> np.absolute(1.2 + 1j)
1.5620499351813308
Plot the function over ``[-10, 10]``:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(start=-10, stop=10, num=101)
>>> plt.plot(x, np.absolute(x))
>>> plt.show()
Plot the function over the complex plane:
>>> xx = x + 1j * x[:, np.newaxis]
>>> plt.imshow(np.abs(xx), extent=[-10, 10, -10, 10], cmap='gray')
>>> plt.show()
Class docstring:
Functions that operate element by element on whole arrays.
To see the documentation for a specific ufunc, use `info`.  For
example, ``np.info(np.sin)``.  Because ufuncs are written in C
(for speed) and linked into Python with NumPy's ufunc facility,
Python's help() function finds this page whenever help() is called
on a ufunc.
A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
Calling ufuncs:
===============
op(*x[, out], where=True, **kwargs)
Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
The broadcasting rules are:
* Dimensions of length 1 may be prepended to either array.
* Arrays may be repeated along dimensions of length 1.
Parameters
----------
*x : array_like
    Input arrays.
out : ndarray, None, or tuple of ndarray and None, optional
    Alternate array object(s) in which to put the result; if provided, it
    must have a shape that the inputs broadcast to. A tuple of arrays
    (possible only as a keyword argument) must have length equal to the
    number of outputs; use `None` for outputs to be allocated by the ufunc.
where : array_like, optional
    Values of True indicate to calculate the ufunc at that position, values
    of False indicate to leave the value in the output alone.
**kwargs
    For other keyword-only arguments, see the :ref:`ufunc docs <ufuncs.kwargs>`.
Returns
-------
r : ndarray or tuple of ndarray
    `r` will have the shape that the arrays in `x` broadcast to; if `out` is
    provided, `r` will be equal to `out`. If the function has more than one
    output, then the result will be a tuple of arrays.
 
In [160]:
 
 
 
 
 
f = lambda x: x.max() - x.min()
frame.apply(f)  # max minus min for each column
 
 
Out[160]:
b    1.802165
d    1.684034
e    2.689627
dtype: float64
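A compact sketch of the two directions of `apply` (using a deterministic frame rather than the random one above): by default the function receives each column; with `axis=1` (or `axis='columns'`) it receives each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'))
f = lambda x: x.max() - x.min()

per_column = frame.apply(f)         # one value per column
per_row = frame.apply(f, axis=1)    # one value per row; axis='columns' is equivalent
```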
In [161]:
 
 
 
 
 
frame.apply(f, axis=1)  # max minus min for each row
 
 
Out[161]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
In [162]:
 
 
 
 
 
frame.apply(f, axis='columns')
 
 
Out[162]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
In [164]:
 
 
 
 
 
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
 
 
Out[164]:
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406
In [165]:
 
 
 
 
 
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis=1)
 
 
Out[165]:
             min       max
Utah   -0.519439  0.478943
Ohio   -0.555730  1.965781
Texas   0.092908  0.769023
Oregon -1.296221  1.246435
In [166]:
 
 
 
 
 
format = lambda x: '%.2f' % x  # format to two decimal places
frame.applymap(format)
 
 
Out[166]:
             b     d      e
Utah     -0.20  0.48  -0.52
Ohio     -0.56  1.97   1.39
Texas     0.09  0.28   0.77
Oregon    1.25  1.01  -1.30
In [167]:
 
 
 
 
 
frame['e'].map(format)
 
 
Out[167]:
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
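A small sketch of the element-wise pair: `Series.map` formats one column, while the frame-wide equivalent is `applymap` (note that newer pandas, 2.1 and later, renames it to `DataFrame.map`).

```python
import pandas as pd

frame = pd.DataFrame({'e': [-0.52, 1.39]}, index=['Utah', 'Ohio'])
fmt = lambda x: '%.2f' % x

# Series.map applies fmt to each element of one column
formatted = frame['e'].map(fmt)
# For every element of a DataFrame, use applymap (DataFrame.map in pandas >= 2.1)
```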
 

Sorting and Ranking

In [168]:
 
 
 
 
 
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
 
 
Out[168]:
a    1
b    2
c    3
d    0
dtype: int64
In [171]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
 
 
Out[171]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
 
 
 
 
 
 
Signature: frame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
Docstring:
Sort object by labels (along an axis)
Parameters
----------
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
    if not None, sort on values in specified index level(s)
ascending : boolean, default True
    Sort ascending vs. descending
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end.
     Not implemented for MultiIndex.
sort_remaining : bool, default True
    if true and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level
Returns
-------
sorted_obj : DataFrame
File:      c:\users\qq123\anaconda3\lib\site-packages\pandas\core\frame.py
Type:      method
 
In [170]:
 
 
 
 
 
frame.sort_index(axis=1)  # sort by column labels
 
 
Out[170]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
In [172]:
 
 
 
 
 
frame.sort_index(axis=1, ascending=False)  # descending order
 
 
Out[172]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
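The three `sort_index` variants above can be sketched together:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

by_rows = frame.sort_index()                      # sort the row labels
by_cols = frame.sort_index(axis=1)                # sort the column labels
desc = frame.sort_index(axis=1, ascending=False)  # descending column order
```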
In [ ]:
 
 
 
 
 
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
 
 
In [ ]:
 
 
 
 
 
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
 
 
In [173]:
 
 
 
 
 
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
 
 
Out[173]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
In [174]:
 
 
 
 
 
frame.sort_values(by='b')
 
 
Out[174]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
 
 
 
 
 
 
Signature: frame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Docstring:
Sort by the values along either axis
.. versionadded:: 0.17.0
Parameters
----------
by : str or list of str
    Name or list of names which refer to the axis items.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis to direct sorting
ascending : bool or list of bool, default True
     Sort ascending vs. descending. Specify list for multiple sort
     orders.  If this is a list of bools, must match the length of
     the by.
inplace : bool, default False
     if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end
Returns
-------
sorted_obj : DataFrame
Examples
--------
>>> df = pd.DataFrame({
...     'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2' : [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
3   NaN  8    4
4   D    7    2
5   C    4    3
Sort by col1
>>> df.sort_values(by=['col1'])
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
    col1 col2 col3
1   A    1    1
0   A    2    0
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
    col1 col2 col3
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
3   NaN  8    4
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first')
    col1 col2 col3
3   NaN  8    4
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
 
In [176]:
 
 
 
 
 
frame.sort_values(by=['a', 'b'])
 
 
Out[176]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
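A sketch of multi-column sorting: rows are ordered by `'a'` first, and ties within `'a'` are broken by `'b'`.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first, breaking ties with 'b'
result = frame.sort_values(by=['a', 'b'])
```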
In [175]:
 
 
 
 
 
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
 
 
In [179]:
 
 
 
 
 
obj.rank()
 
 
Out[179]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [180]:
 
 
 
 
 
obj.rank(pct=True)
 
 
Out[180]:
0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64
 

Signature: obj.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
Docstring:
Compute numerical data ranks (1 through n) along axis. Equal values are
assigned a rank that is the average of the ranks of those values.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    index to direct ranking
method : {'average', 'min', 'max', 'first', 'dense'}
    * average: average rank of group
    * min: lowest rank in group
    * max: highest rank in group
    * first: ranks assigned in order they appear in the array
    * dense: like 'min', but rank always increases by 1 between groups
numeric_only : boolean, default None
    Include only float, int, boolean data. Valid only for DataFrame or
    Panel objects
na_option : {'keep', 'top', 'bottom'}
    * keep: leave NA values where they are
    * top: smallest rank if ascending
    * bottom: smallest rank if descending
ascending : boolean, default True
    False for ranks by high (1) to low (N)
pct : boolean, default False
    Computes percentage rank of data

Returns

In [181]:
 
 
 
 
 
obj.rank(method='first')
 
 
Out[181]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
In [182]:
 
 
 
 
 
# ties take the highest rank in the group
obj.rank(ascending=False, method='max')
 
 
Out[182]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
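The three tie-breaking strategies shown above can be sketched side by side (the expected ranks match the outputs printed earlier in this section):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                  # ties share the average of their ranks
first = obj.rank(method='first')  # ties broken by order of appearance
top = obj.rank(ascending=False, method='max')  # descending; ties take the highest rank
```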
In [183]:
 
 
 
 
 
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis='columns')  # rank the values within each row
 
 
Out[183]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
 

Axis Indexes with Duplicate Labels

In [184]:
 
 
 
 
 
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
 
 
Out[184]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
In [185]:
 
 
 
 
 
obj.index.is_unique  # check whether the index labels are unique
 
 
Out[185]:
False
In [187]:
 
 
 
 
 
obj['a']
 
 
Out[187]:
a    0
a    1
dtype: int64
In [188]:
 
 
 
 
 
obj['c']
 
 
Out[188]:
4
In [189]:
 
 
 
 
 
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']
 
 
Out[189]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
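The key consequence of duplicate labels can be sketched briefly: selecting a duplicated label returns a Series (or a DataFrame slice), while a unique label returns a scalar (or a single row).

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

assert not obj.index.is_unique
a_rows = obj['a']   # Series with the two 'a' entries
c_val = obj['c']    # scalar, since 'c' appears once
```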
 

Summarizing and Computing Descriptive Statistics

In [190]:
 
 
 
 
 
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
 
 
Out[190]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
In [198]:
 
 
 
 
 
df.sum()
 
 
Out[198]:
one    9.25
two   -5.80
dtype: float64
 
 
 
 
 
 
Signature: df.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Docstring:
 
In [192]:
 
 
 
 
 
df.sum(axis='columns')
 
 
Out[192]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [193]:
 
 
 
 
 
df.mean(axis='columns', skipna=False)  # do not skip NaN values
 
 
Out[193]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
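The `skipna` behavior above can be sketched in one place: reductions ignore NaN by default, and `skipna=False` makes any NaN poison the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()  # NaN values are skipped by default
row_means = df.mean(axis='columns', skipna=False)  # any NaN makes the row NaN
```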
In [196]:
 
 
 
 
 
df.idxmax()
 
 
Out[196]:
one    b
two    d
dtype: object
In [197]:
 
 
 
 
 
df.idxmax(axis=1)
 
 
Out[197]:
a    one
b    one
c    NaN
d    one
dtype: object
 
 
 
 
 
 
Signature: df.idxmax(axis=0, skipna=True)
Docstring:
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    0 or 'index' for row-wise, 1 or 'columns' for column-wise
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA.
Raises
------
ValueError
    * If the row/column is empty
Returns
-------
idxmax : Series
Notes
-----
This method is the DataFrame version of ``ndarray.argmax``.
See Also
--------
Series.idxmax
 
In [200]:
 
 
 
 
 
df.cumsum()
 
 
Out[200]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
 
 
 
 
 
 
Signature: df.cumsum(axis=None, skipna=True, *args, **kwargs)
Docstring:
Return cumulative sum over requested axis.
Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
Returns
-------
cumsum : Series
 
In [202]:
 
 
 
 
 
df.describe()
 
 
Out[202]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
 

Signature: df.describe(percentiles=None, include=None, exclude=None)
Docstring:
Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should fall between 0
    and 1. The default is [.25, .5, .75], which returns the 25th, 50th,
    and 75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the
  provided data types.
  To limit the result to numeric types submit
  ``numpy.number``. To limit it instead to object columns submit
  the ``numpy.object`` data type. Strings
  can also be used in the style of
  ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
  select pandas categorical columns, use ``'category'``
- None (default) : The result will include all numeric columns.

exclude : list-like of dtypes or None (default), optional
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

- A list-like of dtypes : Excludes the provided data types
  from the result. To exclude numeric types submit
  ``numpy.number``. To exclude object columns submit the data
  type ``numpy.object``. Strings can also be used in the style of
  ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
  exclude pandas categorical columns, use ``'category'``
- None (default) : The result will exclude nothing.

Returns
-------
summary : Series/DataFrame of summary statistics

Notes
-----
For numeric data, the result's index will include ``count``, ``mean``,
``std``, ``min``, ``max`` as well as lower, 50 and upper percentiles. By
default the lower percentile is 25 and the upper percentile is 75. The
50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result's index will
include ``count``, ``unique``, ``top``, and ``freq``. The ``top`` is the
most common value. The ``freq`` is the most common value's frequency.
Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields are returned.

>>> df = pd.DataFrame({'object': ['a', 'b', 'c'],
...                    'numeric': [1, 2, 3],
...                    'categorical': pd.Categorical(['d', 'e', 'f'])
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0

See Also
--------
DataFrame.count, DataFrame.max, DataFrame.min, DataFrame.mean,
DataFrame.std, DataFrame.select_dtypes

In [203]:
 
 
 
 
 
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
 
 
Out[203]:
count     16
unique     3
top        a
freq       8
dtype: object
 

Correlation and Covariance

 

conda install pandas-datareader

In [204]:
 
 
 
 
 
price = pd.read_pickle('examples/yahoo_price.pkl')
volume = pd.read_pickle('examples/yahoo_volume.pkl')
 
 
 

import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})

In [205]:
 
 
 
 
 
returns = price.pct_change()
returns.tail()
 
 
Out[205]:
                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
In [207]:
 
 
 
 
 
returns['MSFT'].corr(returns['IBM'])  # correlation
 
 
Out[207]:
0.4997636114415116
In [208]:
 
 
 
 
 
returns['MSFT'].cov(returns['IBM'])  # covariance
 
 
Out[208]:
8.870655479703549e-05
In [209]:
 
 
 
 
 
returns.MSFT.corr(returns.IBM)
 
 
Out[209]:
0.4997636114415116
In [214]:
 
 
 
 
 
returns.corr()
 
 
Out[214]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000
In [212]:
 
 
 
 
 
returns.cov()
 
 
Out[212]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
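The pipeline above (prices → percent-change returns → correlation/covariance) can be sketched without the pickled data, using a tiny made-up price frame with hypothetical tickers X and Y:

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up tickers
price = pd.DataFrame({'X': [10.0, 11.0, 12.0], 'Y': [20.0, 19.0, 21.0]})
returns = price.pct_change()  # period-over-period percent change; first row is NaN

c = returns['X'].corr(returns['Y'])  # pairwise correlation, NaN rows dropped
v = returns['X'].cov(returns['Y'])   # pairwise covariance
```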
In [ ]:
 
 
 
 
 
returns.corrwith(returns.IBM)
 
 
In [ ]:
 
 
 
 
 
returns.corrwith(volume)
 
 
 

Unique Values, Value Counts, and Membership

In [215]:
 
 
 
 
 
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
 
 
In [216]:
 
 
 
 
 
uniques = obj.unique()
uniques
 
 
Out[216]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [217]:
 
 
 
 
 
obj.value_counts()
 
 
Out[217]:
c    3
a    3
b    2
d    1
dtype: int64
In [218]:
 
 
 
 
 
pd.value_counts(obj.values, sort=False)
 
 
Out[218]:
b    2
a    3
c    3
d    1
dtype: int64
In [ ]:
 
 
 
 
 
obj
mask = obj.isin(['b', 'c'])
mask
obj[mask]
 
 
In [220]:
 
 
 
 
 
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
 
 
Out[220]:
array([0, 2, 1, 1, 0, 2], dtype=int64)
In [221]:
 
 
 
 
 
unique_vals
 
 
Out[221]:
0    c
1    b
2    a
dtype: object
In [222]:
 
 
 
 
 
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
 
 
Out[222]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [225]:
 
 
 
 
 
result = data.apply(pd.value_counts).fillna(0)
result
 
 
Out[225]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
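The per-column histogram trick above can be sketched as follows; the `Series.value_counts` method is used here instead of the top-level `pd.value_counts` helper (which newer pandas deprecates), and missing counts are filled with 0:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Rows of the result are observed values; entries are per-column counts
result = data.apply(lambda s: s.value_counts()).fillna(0)
```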
 

Conclusion

In [226]:
 
 
 
 
 
pd.options.display.max_rows = PREVIOUS_MAX_ROWS
 
 
Reposted from: https://www.cnblogs.com/romannista/p/10689353.html
