Python Data Analysis and Data Mining for Beginners — Chapter 2: pandas, Section 5: Getting Started with pandas

Getting Started with pandas

In [1]:
 
 
 
 
 
import pandas as pd
 
 
In [2]:
 
 
 
 
 
from pandas import Series, DataFrame
 
 
In [3]:
 
 
 
 
 
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
 
 
 

Introduction to pandas Data Structures

 

Series

In [4]:
 
 
 
 
 
obj = pd.Series([4, 7, -5, 3])
obj
 
 
Out[4]:
0    4
1    7
2   -5
3    3
dtype: int64
In [5]:
 
 
 
 
 
obj.values
obj.index  # like range(4)
 
 
Out[5]:
RangeIndex(start=0, stop=4, step=1)
In [6]:
 
 
 
 
 
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # set a custom index
obj2
obj2.index
 
 
Out[6]:
Index(['d', 'b', 'a', 'c'], dtype='object')
 
 
 
 
 
 
Init signature: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:    
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.
Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
 
In [7]:
 
 
 
 
 
obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]
 
 
Out[7]:
c    3
a   -5
d    6
dtype: int64
In [10]:
 
 
 
 
 
obj2[obj2 > 0]
 
 
Out[10]:
d    6
b    7
c    3
dtype: int64
In [11]:
 
 
 
 
 
obj2 * 2
np.exp(obj2)
 
 
Out[11]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
In [12]:
 
 
 
 
 
'b' in obj2
'e' in obj2
 
 
Out[12]:
False
In [13]:
 
 
 
 
 
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
 
 
Out[13]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [14]:
 
 
 
 
 
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
 
 
Out[14]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [17]:
 
 
 
 
 
pd.isnull(obj4)
 
 
Out[17]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
 
 
 
 
 
 
Signature: pd.isnull(obj)
Docstring:
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
Parameters
----------
arr : ndarray or object value
    Object to check for null-ness
Returns
-------
isna : array-like of bool or bool
    Array or bool indicating whether an object is null or if an array is
    given which of the element is null.
See also
--------
pandas.notna: boolean inverse of pandas.isna
pandas.isnull: alias of isna
 
In [18]:
 
 
 
 
 
pd.notnull(obj4)
 
 
Out[18]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
 
 
 
 
 
 
Signature: pd.notnull(obj)
Docstring:
Replacement for numpy.isfinite / -numpy.isnan which is suitable for use
on object arrays.
Parameters
----------
arr : ndarray or object value
    Object to check for *not*-null-ness
Returns
-------
notisna : array-like of bool or bool
    Array or bool indicating whether an object is *not* null or if an array
    is given which of the element is *not* null.
See also
--------
pandas.isna : boolean inverse of pandas.notna
pandas.notnull : alias of notna
 
In [ ]:
 
 
 
 
 
obj4.isnull()
 
 
In [19]:
 
 
 
 
 
obj3
 
 
Out[19]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [20]:
 
 
 
 
 
obj4
 
 
Out[20]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
In [21]:
 
 
 
 
 
obj3 + obj4  # values add where the indexes align
 
 
Out[21]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
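The NaN entries above come from labels present in only one operand. As a minimal sketch (reusing the `sdata` dictionary from above), `Series.add` with `fill_value` treats a one-sided gap as 0; note that California stays NaN because it is missing or NaN on *both* sides:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value substitutes 0 for a label that is missing (or NaN) on ONE
# side only; California is NaN on both sides, so it remains NaN
total = obj3.add(obj4, fill_value=0)
```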
In [23]:
 
 
 
 
 
obj4.name = 'population'
obj4.index.name = 'state'  # name the index
obj4
 
 
Out[23]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
In [24]:
 
 
 
 
 
obj
 
 
Out[24]:
0    4
1    7
2   -5
3    3
dtype: int64
In [25]:
 
 
 
 
 
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
obj
 
 
Out[25]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
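Index assignment replaces all labels at once, so the new list must match the length of the Series. A small sketch of the failure mode (the mismatched assignment is purely illustrative):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
# assigning an index of the wrong length raises ValueError and
# leaves the Series untouched
try:
    obj.index = ['a', 'b']
except ValueError:
    pass
```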
 

DataFrame

In [26]:
 
 
 
 
 
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
 
 
In [27]:
 
 
 
 
 
frame
 
 
Out[27]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003
In [29]:
 
 
 
 
 
frame.head()
 
 
Out[29]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
 
 
 
 
 
 
Signature: frame.head(n=5)
Docstring:
Return the first n rows.
Parameters
----------
n : int, default 5
    Number of rows to select.
Returns
-------
obj_head : type of caller
    The first n rows of the caller object.
 
In [30]:
 
 
 
 
 
pd.DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order
 
 
Out[30]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
 
 
 
 
 
 
Init signature: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Docstring:    
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1    int64
col2    int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2
    a   b   c   d   e
0   2   8   8   3   4
1   4   2   9   0   9
2   1   0   7   8   0
3   5   1   7   1   3
4   6   0   2   4   2
 
In [31]:
 
 
 
 
 
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],  # missing data shows as NaN
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2
 
 
Out[31]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
In [32]:
 
 
 
 
 
frame2.columns
 
 
Out[32]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [34]:
 
 
 
 
 
frame2['state']
 
 
Out[34]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
In [35]:
 
 
 
 
 
frame2.year  # attribute access only works when the column name is a valid identifier (no spaces)
 
 
Out[35]:
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
In [36]:
 
 
 
 
 
frame2.loc['three']
 
 
Out[36]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
In [37]:
 
 
 
 
 
frame2['debt'] = 16.5
frame2
 
 
Out[37]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5
In [39]:
 
 
 
 
 
frame2['debt'] = np.arange(6.)
frame2
 
 
Out[39]:
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
 
 
 
 
 
 
Docstring:
arange([start,] stop[, step,], dtype=None)
Return evenly spaced values within a given interval.
Values are generated within the half-open interval ``[start, stop)``
(in other words, the interval including `start` but excluding `stop`).
For integer arguments the function is equivalent to the Python built-in
`range <http://docs.python.org/lib/built-in-funcs.html>`_ function,
but returns an ndarray rather than a list.
When using a non-integer step, such as 0.1, the results will often not
be consistent.  It is better to use ``linspace`` for these cases.
Parameters
----------
start : number, optional
    Start of interval.  The interval includes this value.  The default
    start value is 0.
stop : number
    End of interval.  The interval does not include this value, except
    in some cases where `step` is not an integer and floating point
    round-off affects the length of `out`.
step : number, optional
    Spacing between values.  For any output `out`, this is the distance
    between two adjacent values, ``out[i+1] - out[i]``.  The default
    step size is 1.  If `step` is specified as a position argument,
    `start` must also be given.
dtype : dtype
    The type of the output array.  If `dtype` is not given, infer the data
    type from the other input arguments.
Returns
-------
arange : ndarray
    Array of evenly spaced values.
    For floating point arguments, the length of the result is
    ``ceil((stop - start)/step)``.  Because of floating point overflow,
    this rule may result in the last element of `out` being greater
    than `stop`.
See Also
--------
linspace : Evenly spaced numbers with careful handling of endpoints.
ogrid: Arrays of evenly spaced numbers in N-dimensions.
mgrid: Grid-shaped arrays of evenly spaced numbers in N-dimensions.
Examples
--------
>>> np.arange(3)
array([0, 1, 2])
>>> np.arange(3.0)
array([ 0.,  1.,  2.])
>>> np.arange(3,7)
array([3, 4, 5, 6])
>>> np.arange(3,7,2)
array([3, 5])
 
In [41]:
 
 
 
 
 
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2  # values are aligned on frame2's index; unmatched labels get NaN
 
 
Out[41]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
In [42]:
 
 
 
 
 
frame2['eastern'] = frame2.state == 'Ohio'
frame2
 
 
Out[42]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
In [43]:
 
 
 
 
 
del frame2['eastern']  # delete a column
frame2.columns
 
 
Out[43]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [45]:
 
 
 
 
 
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
 
 
In [46]:
 
 
 
 
 
frame3 = pd.DataFrame(pop)
frame3
 
 
Out[46]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
In [47]:
 
 
 
 
 
frame3.T  # transpose
 
 
Out[47]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
In [48]:
 
 
 
 
 
pd.DataFrame(pop, index=[2001, 2002, 2003])
 
 
Out[48]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
In [49]:
 
 
 
 
 
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
 
 
Out[49]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
In [50]:
 
 
 
 
 
frame3.index.name = 'year'; frame3.columns.name = 'state'  # name the index and the columns
frame3
 
 
Out[50]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
In [51]:
 
 
 
 
 
frame3.values
 
 
Out[51]:
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
In [52]:
 
 
 
 
 
frame2.values
 
 
Out[52]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
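The `dtype=object` above is the general rule: `.values` must pick a single NumPy dtype that can hold every column, so mixed columns fall back to `object`. A small sketch (using a cut-down `frame2` for illustration):

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Ohio'],
                       'pop': [1.5, 1.7]})
# mixed column dtypes force a single object ndarray...
mixed = frame2.values
# ...but selecting only numeric columns yields a numeric array
# (int64 and float64 upcast to float64)
numeric = frame2[['year', 'pop']].values
```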
 

Index Objects

In [53]:
 
 
 
 
 
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
 
 
Out[53]:
Index(['b', 'c'], dtype='object')
 

index[1] = 'd' # TypeError

In [54]:
 
 
 
 
 
labels = pd.Index(np.arange(3))
labels
 
 
Out[54]:
Int64Index([0, 1, 2], dtype='int64')
In [55]:
 
 
 
 
 
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels
 
 
Out[55]:
True
In [56]:
 
 
 
 
 
frame3
frame3.columns
 
 
Out[56]:
Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [57]:
 
 
 
 
 
'Ohio' in frame3.columns
 
 
Out[57]:
True
In [58]:
 
 
 
 
 
2003 in frame3.index
 
 
Out[58]:
False
In [59]:
 
 
 
 
 
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
 
 
Out[59]:
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
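Duplicate labels are legal, and selecting a duplicated label then returns every match as a Series rather than a scalar; `Index.is_unique` tells you whether this can happen. A quick sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'bar'])
# a duplicated label selects ALL matching entries as a Series
hits = s['foo']
```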
 

Essential Functionality

 

Reindexing

In [60]:
 
 
 
 
 
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
 
 
Out[60]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [62]:
 
 
 
 
 
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
 
 
Out[62]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
 
 
 
 
 
 
Signature: obj.reindex(index=None, **kwargs)
Docstring:
Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False
Parameters
----------
index : array-like, optional (should be specified using keywords)
    New labels / index to conform to. Preferably an Index object to
    avoid duplicating data
method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional
    method to use for filling holes in reindexed DataFrame.
    Please note: this is only  applicable to DataFrames/Series with a
    monotonically increasing/decreasing index.
    * default: don't fill gaps
    * pad / ffill: propagate last valid observation forward to next
      valid
    * backfill / bfill: use next valid observation to fill gap
    * nearest: use nearest valid observations to fill gap
copy : boolean, default True
    Return a new object, even if the passed indexes are the same
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
fill_value : scalar, default np.NaN
    Value to use for missing values. Defaults to NaN, but can be any
    "compatible" value
limit : int, default None
    Maximum number of consecutive elements to forward or backward fill
tolerance : optional
    Maximum distance between original and new labels for inexact
    matches. The values of the index at the matching locations most
    satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
    Tolerance may be a scalar value, which applies the same tolerance
    to all values, or list-like, which applies variable tolerance per
    element. List-like includes list, tuple, array, Series, and must be
    the same size as the index and its dtype must exactly match the
    index's type.
    .. versionadded:: 0.17.0
    .. versionadded:: 0.21.0 (list-like tolerance)
Examples
--------
``DataFrame.reindex`` supports two calling conventions
* ``(index=index_labels, columns=column_labels, ...)``
* ``(labels, axis={'index', 'columns'}, ...)``
We *highly* recommend using keyword arguments to clarify your
intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({
...      'http_status': [200,200,404,404,301],
...      'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...       index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned ``NaN``.
>>> new_index= ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...             'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to
the keyword ``fill_value``. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
``method`` to fill the ``NaN`` values.
>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
Or we can use "axis-style" keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
To further illustrate the filling functionality in
``reindex``, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
Suppose we decide to expand the dataframe to cover a wider
date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
The index entries that did not have a value in the original data frame
(for example, '2009-12-29') are by default filled with ``NaN``.
If desired, we can fill in the missing values using one of several
options.
For example, to backpropagate the last valid value to fill the ``NaN``
values, pass ``bfill`` as an argument to the ``method`` keyword.
>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
Please note that the ``NaN`` value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the ``NaN`` values present
in the original dataframe, use the ``fillna()`` method.
See the :ref:`user guide <basics.reindexing>` for more.
Returns
-------
 
In [63]:
 
 
 
 
 
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
 
 
Out[63]:
0      blue
2    purple
4    yellow
dtype: object
In [64]:
 
 
 
 
 
obj3.reindex(range(6), method='ffill')  # forward-fill the gaps introduced by the new index
 
 
Out[64]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
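The mirror-image option is `bfill`, which pulls the *next* valid observation backward. A minimal sketch with the same `obj3`:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# bfill fills each gap from the next valid label; labels past the
# last valid one stay NaN
filled = obj3.reindex(range(6), method='bfill')
```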
In [72]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
 
 
Out[72]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
In [66]:
 
 
 
 
 
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
 
 
Out[66]:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
In [73]:
 
 
 
 
 
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
 
 
Out[73]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
In [74]:
 
 
 
 
 
frame.loc[['a', 'b', 'c', 'd'], states]  # note: passing a list that contains missing labels will raise KeyError in the future; use .reindex instead
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  """Entry point for launching an IPython kernel.
Out[74]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
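As the warning says, label lists with missing entries should go through `reindex` instead of `loc`. A sketch of the supported equivalent:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
# reindex accepts labels that are absent and fills them with NaN
# instead of raising KeyError
result = frame.reindex(index=['a', 'b', 'c', 'd'],
                       columns=['Texas', 'Utah', 'California'])
```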
 

Dropping Entries from an Axis

In [91]:
 
 
 
 
 
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
 
 
Out[91]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [80]:
 
 
 
 
 
new_obj = obj.drop('c')
new_obj
 
 
Out[80]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
 
 
 
 
 
 
Signature: obj.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Docstring:
Return new object with labels in requested axis removed.
Parameters
----------
labels : single label or list-like
    Index or column labels to drop.
axis : int or axis name
    Whether to drop labels from the index (0 / 'index') or
    columns (1 / 'columns').
index, columns : single label or list-like
    Alternative to specifying `axis` (``labels, axis=1`` is
    equivalent to ``columns=labels``).
    .. versionadded:: 0.21.0
level : int or level name, default None
    For MultiIndex
inplace : bool, default False
    If True, do operation inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
    If 'ignore', suppress error and existing labels are dropped.
Returns
-------
dropped : type of caller
Examples
--------
>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
                      columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
Notes
 
In [79]:
 
 
 
 
 
obj.drop(['d', 'c'])  # drop several labels at once
 
 
Out[79]:
a    0.0
b    1.0
e    4.0
dtype: float64
In [87]:
 
 
 
 
 
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
 
 
Out[87]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [88]:
 
 
 
 
 
data.drop(['Colorado', 'Ohio'])
 
 
Out[88]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
In [89]:
 
 
 
 
 
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')
 
 
Out[89]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
In [92]:
 
 
 
 
 
obj.drop('c', inplace=True)
 
 
In [93]:
 
 
 
 
 
obj
 
 
Out[93]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
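A common pitfall with `inplace=True`: the method mutates the object and returns `None`, so assigning the result back discards your data. A small sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
# inplace=True mutates obj and returns None, so never write
# obj = obj.drop('c', inplace=True)
ret = obj.drop('c', inplace=True)
```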
 

Indexing, Selection, and Filtering

In [94]:
 
 
 
 
 
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
 
 
Out[94]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
In [95]:
 
 
 
 
 
obj['b']
 
 
Out[95]:
1.0
In [96]:
 
 
 
 
 
obj[1]
 
 
Out[96]:
1.0
In [97]:
 
 
 
 
 
obj[2:4]
 
 
Out[97]:
c    2.0
d    3.0
dtype: float64
In [98]:
 
 
 
 
 
obj[['b', 'a', 'd']]
 
 
Out[98]:
b    1.0
a    0.0
d    3.0
dtype: float64
In [99]:
 
 
 
 
 
obj[[1, 3]]
 
 
Out[99]:
b    1.0
d    3.0
dtype: float64
In [100]:
 
 
 
 
 
obj[obj < 2]
 
 
Out[100]:
a    0.0
b    1.0
dtype: float64
In [101]:
 
 
 
 
 
obj['b':'c']
 
 
Out[101]:
b    1.0
c    2.0
dtype: float64
In [105]:
 
 
 
 
 
obj['b':'c'] = 5
 
 
In [104]:
 
 
 
 
 
obj
 
 
Out[104]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
In [106]:
 
 
 
 
 
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
 
 
Out[106]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [107]:
 
 
 
 
 
data['two']
 
 
Out[107]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [108]:
 
 
 
 
 
data[['three', 'one']]
 
 
Out[108]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
In [109]:
 
 
 
 
 
data[:2]
 
 
Out[109]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
In [110]:
 
 
 
 
 
data[data['three'] > 5]
 
 
Out[110]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [111]:
 
 
 
 
 
data < 5
data[data < 5] = 0
data
 
 
Out[111]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
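The boolean assignment above mutates `data` in place. When you want the same effect without mutation, `DataFrame.where` keeps the entries that satisfy a condition and substitutes a value elsewhere, returning a new frame. A sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
# where() keeps entries satisfying the condition and substitutes 0
# for the rest, returning a NEW frame instead of mutating data
clipped = data.where(data >= 5, 0)
```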
 
Selection with loc and iloc
In [112]:
 
 
 
 
 
data.loc['Colorado', ['two', 'three']]
 
 
Out[112]:
two      5
three    6
Name: Colorado, dtype: int32
In [113]:
 
 
 
 
 
data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]
 
 
Out[113]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9
In [114]:
 
 
 
 
 
data.loc[:'Utah', 'two']  # label-based selection
data.iloc[:, :3][data.three > 5]  # position-based selection, then a boolean filter
 
 
Out[114]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
 

Integer Indexes

 

ser = pd.Series(np.arange(3.))
ser
ser[-1]  # raises: -1 is ambiguous with a default integer index

In [115]:
 
 
 
 
 
ser = pd.Series(np.arange(3.))
 
 
In [116]:
 
 
 
 
 
ser
 
 
Out[116]:
0    0.0
1    1.0
2    2.0
dtype: float64
In [117]:
 
 
 
 
 
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
 
 
Out[117]:
2.0
In [118]:
 
 
 
 
 
ser[:1]
 
 
Out[118]:
0    0.0
dtype: float64
In [119]:
 
 
 
 
 
ser.loc[:1]
 
 
Out[119]:
0    0.0
1    1.0
dtype: float64
In [120]:
 
 
 
 
 
ser.iloc[:1]
 
 
Out[120]:
0    0.0
dtype: float64
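The two results above summarize the integer-index rules: `loc` slices by label and includes the endpoint, while `iloc` slices by position and excludes it. A compact sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer labels 0, 1, 2
# loc slices by LABEL and includes the endpoint...
by_label = ser.loc[:1]
# ...while iloc slices by POSITION and excludes it
by_pos = ser.iloc[:1]
```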
 

Arithmetic and Data Alignment

In [121]:
 
 
 
 
 
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
 
 
Out[121]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
In [122]:
 
 
 
 
 
s2
 
 
Out[122]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
In [123]:
 
 
 
 
 
s1 + s2  # labels present in only one operand produce NaN, not a one-sided value
 
 
Out[123]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
In [124]:
 
 
 
 
 
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
 
 
Out[124]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
In [125]:
 
 
 
 
 
df2
 
 
Out[125]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [126]:
 
 
 
 
 
df1 + df2
 
 
Out[126]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
In [127]:
 
 
 
 
 
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
 
 
Out[127]:
   A
0  1
1  2
In [128]:
 
 
 
 
 
df2
 
 
Out[128]:
   B
0  3
1  4
In [129]:
 
 
 
 
 
df1 - df2  # both the row and column labels must match; here nothing overlaps
 
 
Out[129]:
    A   B
0 NaN NaN
1 NaN NaN
 
Arithmetic methods with fill values
In [130]:
 
 
 
 
 
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
 
 
In [131]:
 
 
 
 
 
df2
 
 
Out[131]:
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
In [132]:
 
 
 
 
 
df2.loc[1, 'b'] = np.nan
 
 
In [133]:
 
 
 
 
 
df1
 
 
Out[133]:
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
In [134]:
 
 
 
 
 
df1 + df2
 
 
Out[134]:
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
In [135]:
 
 
 
 
 
df1.add(df2, fill_value=0)  # a label missing on one side is treated as 0
 
 
Out[135]:
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
 
 
 
 
 
 
Signature: df1.add(other, axis='columns', level=None, fill_value=None)
Docstring:
Addition of dataframe and other, element-wise (binary operator `add`).
Equivalent to ``dataframe + other``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
 
In [136]:
 
 
 
 
 
1 / df1
 
 
Out[136]:
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
In [138]:
 
 
 
 
 
df1.rdiv(1)
 
 
Out[138]:
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
 
 
 
 
 
 
Signature: df1.rdiv(other, axis='columns', level=None, fill_value=None)
Docstring:
Floating division of dataframe and other, element-wise (binary operator `rtruediv`).
Equivalent to ``other / dataframe``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
 
In [140]:
 
 
 
 
 
df1.reindex(columns=df2.columns, fill_value=0)
 
 
Out[140]:
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
In [143]:
 
 
 
 
 
df1.reindex(index=df2.index,columns=df2.columns, fill_value=np.pi)
 
 
Out[143]:
          a         b          c          d         e
0  0.000000  1.000000   2.000000   3.000000  3.141593
1  4.000000  5.000000   6.000000   7.000000  3.141593
2  8.000000  9.000000  10.000000  11.000000  3.141593
3  3.141593  3.141593   3.141593   3.141593  3.141593
 
Operations between DataFrame and Series
In [144]:
 
 
 
 
 
arr = np.arange(12.).reshape((3, 4))
arr
 
 
Out[144]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
In [145]:
 
 
 
 
 
arr[0]
 
 
Out[145]:
array([0., 1., 2., 3.])
In [146]:
 
 
 
 
 
arr - arr[0]
 
 
Out[146]:
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
In [147]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
 
 
In [149]:
 
 
 
 
 
series = frame.iloc[0]
frame
 
 
Out[149]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [150]:
 
 
 
 
 
series
 
 
Out[150]:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
In [151]:
 
 
 
 
 
frame - series
 
 
Out[151]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
In [152]:
 
 
 
 
 
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
 
 
Out[152]:
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
In [153]:
 
 
 
 
 
series3 = frame['d']
frame
 
 
Out[153]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [154]:
 
 
 
 
 
series3
 
 
Out[154]:
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
In [155]:
 
 
 
 
 
frame.sub(series3, axis='index')  # subtract, matching on the row labels
 
 
Out[155]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
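To see the axis switch in isolation, here is a small sketch: with `axis='index'` the Series index is matched against the row labels, so the subtraction broadcasts across the columns instead of down the rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']

# Match on row labels; each row has its 'd' value subtracted from every column
result = frame.sub(series3, axis='index')
```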
 
 
 
 
 
 
Signature: frame.sub(other, axis='columns', level=None, fill_value=None)
Docstring:
Subtraction of dataframe and other, element-wise (binary operator `sub`).
Equivalent to ``dataframe - other``, but with support to substitute a fill_value for
missing data in one of the inputs.
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
    For Series input, axis to match Series index on
fill_value : None or float value, default None
    Fill missing (NaN) values with this value. If both DataFrame
    locations are missing, the result will be missing
level : int or name
    Broadcast across a level, matching Index values on the
    passed MultiIndex level
Notes
-----
Mismatched indices will be unioned together
Returns
-------
result : DataFrame
See also
--------
DataFrame.rsub
 
 

Function Application and Mapping

In [156]:
 
 
 
 
 
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
 
 
Out[156]:
               b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221
In [158]:
 
 
 
 
 
np.abs(frame)  # element-wise absolute value
 
 
Out[158]:
               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221
 
 
 
 
 
 
Call signature:  np.abs(*args, **kwargs)
Type:            ufunc
String form:     <ufunc 'absolute'>
File:            c:\users\qq123\anaconda3\lib\site-packages\numpy\__init__.py
Docstring:      
absolute(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])
Calculate the absolute value element-wise.
Parameters
----------
x : array_like
    Input array.
out : ndarray, None, or tuple of ndarray and None, optional
    A location into which the result is stored. If provided, it must have
    a shape that the inputs broadcast to. If not provided or `None`,
    a freshly-allocated array is returned. A tuple (possible only as a
    keyword argument) must have length equal to the number of outputs.
where : array_like, optional
    Values of True indicate to calculate the ufunc at that position, values
    of False indicate to leave the value in the output alone.
**kwargs
    For other keyword-only arguments, see the
    :ref:`ufunc docs <ufuncs.kwargs>`.
Returns
-------
absolute : ndarray
    An ndarray containing the absolute value of
    each element in `x`.  For complex input, ``a + ib``, the
    absolute value is :math:`\sqrt{ a^2 + b^2 }`.
Examples
--------
>>> x = np.array([-1.2, 1.2])
>>> np.absolute(x)
array([ 1.2,  1.2])
>>> np.absolute(1.2 + 1j)
1.5620499351813308
Plot the function over ``[-10, 10]``:
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(start=-10, stop=10, num=101)
>>> plt.plot(x, np.absolute(x))
>>> plt.show()
Plot the function over the complex plane:
>>> xx = x + 1j * x[:, np.newaxis]
>>> plt.imshow(np.abs(xx), extent=[-10, 10, -10, 10], cmap='gray')
>>> plt.show()
Class docstring:
Functions that operate element by element on whole arrays.
To see the documentation for a specific ufunc, use `info`.  For
example, ``np.info(np.sin)``.  Because ufuncs are written in C
(for speed) and linked into Python with NumPy's ufunc facility,
Python's help() function finds this page whenever help() is called
on a ufunc.
A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
Calling ufuncs:
===============
op(*x[, out], where=True, **kwargs)
Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
The broadcasting rules are:
* Dimensions of length 1 may be prepended to either array.
* Arrays may be repeated along dimensions of length 1.
Parameters
----------
*x : array_like
    Input arrays.
out : ndarray, None, or tuple of ndarray and None, optional
    Alternate array object(s) in which to put the result; if provided, it
    must have a shape that the inputs broadcast to. A tuple of arrays
    (possible only as a keyword argument) must have length equal to the
    number of outputs; use `None` for outputs to be allocated by the ufunc.
where : array_like, optional
    Values of True indicate to calculate the ufunc at that position, values
    of False indicate to leave the value in the output alone.
**kwargs
    For other keyword-only arguments, see the :ref:`ufunc docs <ufuncs.kwargs>`.
Returns
-------
r : ndarray or tuple of ndarray
    `r` will have the shape that the arrays in `x` broadcast to; if `out` is
    provided, `r` will be equal to `out`. If the function has more than one
    output, then the result will be a tuple of arrays.
 
In [160]:
 
 
 
 
 
f = lambda x: x.max() - x.min()
frame.apply(f)  # max minus min for each column
 
 
Out[160]:
b    1.802165
d    1.684034
e    2.689627
dtype: float64
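A compact sketch of the two directions of `apply` (using a deterministic frame rather than the random one above): by default the function receives each column; with `axis=1` (or `axis='columns'`) it receives each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'))
f = lambda x: x.max() - x.min()

per_column = frame.apply(f)         # one value per column
per_row = frame.apply(f, axis=1)    # one value per row; axis='columns' is equivalent
```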
In [161]:
 
 
 
 
 
frame.apply(f, axis=1)  # max minus min for each row
 
 
Out[161]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
In [162]:
 
 
 
 
 
frame.apply(f, axis='columns')
 
 
Out[162]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
In [164]:
 
 
 
 
 
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
 
 
Out[164]:
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406
In [165]:
 
 
 
 
 
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis=1)
 
 
Out[165]:
             min       max
Utah   -0.519439  0.478943
Ohio   -0.555730  1.965781
Texas   0.092908  0.769023
Oregon -1.296221  1.246435
In [166]:
 
 
 
 
 
format = lambda x: '%.2f' % x  # format to two decimal places
frame.applymap(format)
 
 
Out[166]:
             b     d      e
Utah     -0.20  0.48  -0.52
Ohio     -0.56  1.97   1.39
Texas     0.09  0.28   0.77
Oregon    1.25  1.01  -1.30
In [167]:
 
 
 
 
 
frame['e'].map(format)
 
 
Out[167]:
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
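A small sketch of the element-wise pair: `Series.map` formats one column, while the frame-wide equivalent is `applymap` (note that newer pandas, 2.1 and later, renames it to `DataFrame.map`).

```python
import pandas as pd

frame = pd.DataFrame({'e': [-0.52, 1.39]}, index=['Utah', 'Ohio'])
fmt = lambda x: '%.2f' % x

# Series.map applies fmt to each element of one column
formatted = frame['e'].map(fmt)
# For every element of a DataFrame, use applymap (DataFrame.map in pandas >= 2.1)
```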
 

Sorting and Ranking

In [168]:
 
 
 
 
 
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
 
 
Out[168]:
a    1
b    2
c    3
d    0
dtype: int64
In [171]:
 
 
 
 
 
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
 
 
Out[171]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
 
 
 
 
 
 
Signature: frame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
Docstring:
Sort object by labels (along an axis)
Parameters
----------
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
    if not None, sort on values in specified index level(s)
ascending : boolean, default True
    Sort ascending vs. descending
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end.
     Not implemented for MultiIndex.
sort_remaining : bool, default True
    if true and sorting by level and index is multilevel, sort by other
    levels too (in order) after sorting by specified level
Returns
-------
sorted_obj : DataFrame
File:      c:\users\qq123\anaconda3\lib\site-packages\pandas\core\frame.py
Type:      method
 
In [170]:
 
 
 
 
 
frame.sort_index(axis=1)  # sort by column labels
 
 
Out[170]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
In [172]:
 
 
 
 
 
frame.sort_index(axis=1, ascending=False)  # descending order
 
 
Out[172]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
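The three `sort_index` variants above can be sketched together:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

by_rows = frame.sort_index()                      # sort the row labels
by_cols = frame.sort_index(axis=1)                # sort the column labels
desc = frame.sort_index(axis=1, ascending=False)  # descending column order
```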
In [ ]:
 
 
 
 
 
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
 
 
In [ ]:
 
 
 
 
 
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
 
 
In [173]:
 
 
 
 
 
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
 
 
Out[173]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
In [174]:
 
 
 
 
 
frame.sort_values(by='b')
 
 
Out[174]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
 
 
 
 
 
 
Signature: frame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Docstring:
Sort by the values along either axis
.. versionadded:: 0.17.0
Parameters
----------
by : str or list of str
    Name or list of names which refer to the axis items.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis to direct sorting
ascending : bool or list of bool, default True
     Sort ascending vs. descending. Specify list for multiple sort
     orders.  If this is a list of bools, must match the length of
     the by.
inplace : bool, default False
     if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more
     information.  `mergesort` is the only stable algorithm. For
     DataFrames, this option is only applied when sorting on a single
     column or label.
na_position : {'first', 'last'}, default 'last'
     `first` puts NaNs at the beginning, `last` puts NaNs at the end
Returns
-------
sorted_obj : DataFrame
Examples
--------
>>> df = pd.DataFrame({
...     'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2' : [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
3   NaN  8    4
4   D    7    2
5   C    4    3
Sort by col1
>>> df.sort_values(by=['col1'])
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
    col1 col2 col3
1   A    1    1
0   A    2    0
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
    col1 col2 col3
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
3   NaN  8    4
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first')
    col1 col2 col3
3   NaN  8    4
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
 
In [176]:
 
 
 
 
 
frame.sort_values(by=['a', 'b'])
 
 
Out[176]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
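A sketch of multi-column sorting: rows are ordered by `'a'` first, and ties within `'a'` are broken by `'b'`.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first, breaking ties with 'b'
result = frame.sort_values(by=['a', 'b'])
```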
In [175]:
 
 
 
 
 
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
 
 
In [179]:
 
 
 
 
 
obj.rank()
 
 
Out[179]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [180]:
 
 
 
 
 
obj.rank(pct=True)
 
 
Out[180]:
0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64
 

Signature: obj.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
Docstring:
Compute numerical data ranks (1 through n) along axis. Equal values are
assigned a rank that is the average of the ranks of those values.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    index to direct ranking
method : {'average', 'min', 'max', 'first', 'dense'}
    * average: average rank of group
    * min: lowest rank in group
    * max: highest rank in group
    * first: ranks assigned in order they appear in the array
    * dense: like 'min', but rank always increases by 1 between groups
numeric_only : boolean, default None
    Include only float, int, boolean data. Valid only for DataFrame or
    Panel objects
na_option : {'keep', 'top', 'bottom'}
    * keep: leave NA values where they are
    * top: smallest rank if ascending
    * bottom: smallest rank if descending
ascending : boolean, default True
    False for ranks by high (1) to low (N)
pct : boolean, default False
    Computes percentage rank of data

Returns

In [181]:
 
 
 
 
 
obj.rank(method='first')
 
 
Out[181]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
In [182]:
 
 
 
 
 
# ties take the highest rank in the group
obj.rank(ascending=False, method='max')
 
 
Out[182]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
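The three tie-breaking strategies shown above can be sketched side by side (the expected ranks match the outputs printed earlier in this section):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                  # ties share the average of their ranks
first = obj.rank(method='first')  # ties broken by order of appearance
top = obj.rank(ascending=False, method='max')  # descending; ties take the highest rank
```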
In [183]:
 
 
 
 
 
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis='columns')  # rank the values within each row
 
 
Out[183]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
 

Axis Indexes with Duplicate Labels

In [184]:
 
 
 
 
 
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
 
 
Out[184]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
In [185]:
 
 
 
 
 
obj.index.is_unique  # check whether the index labels are unique
 
 
Out[185]:
False
In [187]:
 
 
 
 
 
obj['a']
 
 
Out[187]:
a    0
a    1
dtype: int64
In [188]:
 
 
 
 
 
obj['c']
 
 
Out[188]:
4
In [189]:
 
 
 
 
 
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']
 
 
Out[189]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
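The key consequence of duplicate labels can be sketched briefly: selecting a duplicated label returns a Series (or a DataFrame slice), while a unique label returns a scalar (or a single row).

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

assert not obj.index.is_unique
a_rows = obj['a']   # Series with the two 'a' entries
c_val = obj['c']    # scalar, since 'c' appears once
```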
 

Summarizing and Computing Descriptive Statistics

In [190]:
 
 
 
 
 
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
 
 
Out[190]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
In [198]:
 
 
 
 
 
df.sum()
 
 
Out[198]:
one    9.25
two   -5.80
dtype: float64
 
 
 
 
 
 
Signature: df.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Docstring:
 
In [192]:
 
 
 
 
 
df.sum(axis='columns')
 
 
Out[192]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [193]:
 
 
 
 
 
df.mean(axis='columns', skipna=False)  # do not skip NaN values
 
 
Out[193]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
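The `skipna` behavior above can be sketched in one place: reductions ignore NaN by default, and `skipna=False` makes any NaN poison the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()  # NaN values are skipped by default
row_means = df.mean(axis='columns', skipna=False)  # any NaN makes the row NaN
```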
In [196]:
 
 
 
 
 
df.idxmax()
 
 
Out[196]:
one    b
two    d
dtype: object
In [197]:
 
 
 
 
 
df.idxmax(axis=1)
 
 
Out[197]:
a    one
b    one
c    NaN
d    one
dtype: object
 
 
 
 
 
 
Signature: df.idxmax(axis=0, skipna=True)
Docstring:
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    0 or 'index' for row-wise, 1 or 'columns' for column-wise
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA.
Raises
------
ValueError
    * If the row/column is empty
Returns
-------
idxmax : Series
Notes
-----
This method is the DataFrame version of ``ndarray.argmax``.
See Also
--------
Series.idxmax
 
In [200]:
 
 
 
 
 
df.cumsum()
 
 
Out[200]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
 
 
 
 
 
 
Signature: df.cumsum(axis=None, skipna=True, *args, **kwargs)
Docstring:
Return cumulative sum over requested axis.
Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
Returns
-------
cumsum : Series
 
In [202]:
 
 
 
 
 
df.describe()
 
 
Out[202]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
 

Signature: df.describe(percentiles=None, include=None, exclude=None)
Docstring:
Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should fall between 0
    and 1. The default is [.25, .5, .75], which returns the 25th, 50th,
    and 75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the
  provided data types.
  To limit the result to numeric types submit
  ``numpy.number``. To limit it instead to object columns submit
  the ``numpy.object`` data type. Strings
  can also be used in the style of
  ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
  select pandas categorical columns, use ``'category'``
- None (default) : The result will include all numeric columns.

exclude : list-like of dtypes or None (default), optional
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

- A list-like of dtypes : Excludes the provided data types
  from the result. To exclude numeric types submit
  ``numpy.number``. To exclude object columns submit the data
  type ``numpy.object``. Strings can also be used in the style of
  ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
  exclude pandas categorical columns, use ``'category'``
- None (default) : The result will exclude nothing.

Returns
-------
summary : Series/DataFrame of summary statistics

Notes
-----
For numeric data, the result's index will include ``count``, ``mean``,
``std``, ``min``, ``max`` as well as lower, 50 and upper percentiles. By
default the lower percentile is 25 and the upper percentile is 75. The
50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result's index will
include ``count``, ``unique``, ``top``, and ``freq``. The ``top`` is the
most common value. The ``freq`` is the most common value's frequency.
Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields are returned.

>>> df = pd.DataFrame({'object': ['a', 'b', 'c'],
...                    'numeric': [1, 2, 3],
...                    'categorical': pd.Categorical(['d', 'e', 'f'])
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0

See Also
--------
DataFrame.count, DataFrame.max, DataFrame.min, DataFrame.mean,
DataFrame.std, DataFrame.select_dtypes

In [203]:
 
 
 
 
 
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
 
 
Out[203]:
count     16
unique     3
top        a
freq       8
dtype: object
 

Correlation and Covariance

 

conda install pandas-datareader

In [204]:
 
 
 
 
 
price = pd.read_pickle('examples/yahoo_price.pkl')
volume = pd.read_pickle('examples/yahoo_volume.pkl')
 
 
 

import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})

In [205]:
 
 
 
 
 
returns = price.pct_change()
returns.tail()
 
 
Out[205]:
                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
In [207]:
 
 
 
 
 
returns['MSFT'].corr(returns['IBM'])  # correlation
 
 
Out[207]:
0.4997636114415116
In [208]:
 
 
 
 
 
returns['MSFT'].cov(returns['IBM'])  # covariance
 
 
Out[208]:
8.870655479703549e-05
In [209]:
 
 
 
 
 
returns.MSFT.corr(returns.IBM)
 
 
Out[209]:
0.4997636114415116
In [214]:
 
 
 
 
 
returns.corr()
 
 
Out[214]:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000
In [212]:
 
 
 
 
 
returns.cov()
 
 
Out[212]:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
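The pipeline above (prices → percent-change returns → correlation/covariance) can be sketched without the pickled data, using a tiny made-up price frame with hypothetical tickers X and Y:

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up tickers
price = pd.DataFrame({'X': [10.0, 11.0, 12.0], 'Y': [20.0, 19.0, 21.0]})
returns = price.pct_change()  # period-over-period percent change; first row is NaN

c = returns['X'].corr(returns['Y'])  # pairwise correlation, NaN rows dropped
v = returns['X'].cov(returns['Y'])   # pairwise covariance
```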
In [ ]:
 
 
 
 
 
returns.corrwith(returns.IBM)
 
 
In [ ]:
 
 
 
 
 
returns.corrwith(volume)
 
 
 

Unique Values, Value Counts, and Membership

In [215]:
 
 
 
 
 
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
 
 
In [216]:
 
 
 
 
 
uniques = obj.unique()
uniques
 
 
Out[216]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [217]:
 
 
 
 
 
obj.value_counts()
 
 
Out[217]:
c    3
a    3
b    2
d    1
dtype: int64
In [218]:
 
 
 
 
 
pd.value_counts(obj.values, sort=False)
 
 
Out[218]:
b    2
a    3
c    3
d    1
dtype: int64
In [ ]:
 
 
 
 
 
obj
mask = obj.isin(['b', 'c'])
mask
obj[mask]
 
 
In [220]:
 
 
 
 
 
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
 
 
Out[220]:
array([0, 2, 1, 1, 0, 2], dtype=int64)
In [221]:
 
 
 
 
 
unique_vals
 
 
Out[221]:
0    c
1    b
2    a
dtype: object
In [222]:
 
 
 
 
 
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
 
 
Out[222]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [225]:
 
 
 
 
 
result = data.apply(pd.value_counts).fillna(0)
result
 
 
Out[225]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
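The per-column histogram trick above can be sketched as follows; the `Series.value_counts` method is used here instead of the top-level `pd.value_counts` helper (which newer pandas deprecates), and missing counts are filled with 0:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Rows of the result are observed values; entries are per-column counts
result = data.apply(lambda s: s.value_counts()).fillna(0)
```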
 

Conclusion

In [226]:
 
 
 
 
 
pd.options.display.max_rows = PREVIOUS_MAX_ROWS
 
 
Reposted from: https://www.cnblogs.com/romannista/p/10689353.html
