[Pandas Cookbook] Study Notes, Chapter 5: Exploratory Data Analysis

dive more into …: to discuss in more depth
exploratory data analysis: the process of sifting through the data and trying to make sense of the individual columns and the relationships between them.
literally: virtually, practically (used for emphasis)

what is ‘parsing dates’……

divine more about the data

Columns of object type may hold strings or categorical data, but they can also hold numeric-like values that need a little nudging before they become numeric.
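A minimal sketch of that "nudging", assuming the numeric-like values are stored as text. The column name and values below are made up for illustration and are not taken from vehicles.csv.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, plus one entry that is not a number.
raw = pd.Series(["12", "7.5", "3", "n/a"], dtype=object, name="displ_text")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising,
# so the result is a float column that numeric methods can work with.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print("failed to parse:", numeric.isna().sum())
```

Counting the NaN values afterwards is a quick way to see how many entries were not numeric after all.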

In[2]: import pandas as pd
In[3]: import numpy as np
In[4]: fueleco=pd.read_csv("vehicles.csv",nrows=3)
In[5]: fueleco.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   barrels08        3 non-null      float64
 1   barrelsA08       3 non-null      float64
 2   charge120        3 non-null      float64
 3   charge240        3 non-null      float64
 4   city08           3 non-null      int64  
 5   city08U          3 non-null      float64
 6   cityA08          3 non-null      int64  
 7   cityA08U         3 non-null      float64
 8   cityCD           3 non-null      float64
 9   cityE            3 non-null      float64
 10  cityUF           3 non-null      float64
 11  co2              3 non-null      int64  
 12  co2A             3 non-null      int64  
 13  co2TailpipeAGpm  3 non-null      float64
 14  co2TailpipeGpm   3 non-null      float64
 15  comb08           3 non-null      int64  
 16  comb08U          3 non-null      float64
 17  combA08          3 non-null      int64  
 18  combA08U         3 non-null      float64
 19  combE            3 non-null      float64
 20  combinedCD       3 non-null      float64
 21  combinedUF       3 non-null      float64
 22  cylinders        3 non-null      int64  
 23  displ            3 non-null      float64
 24  drive            3 non-null      object 
 25  engId            3 non-null      int64  
 26  eng_dscr         3 non-null      object 
 27  feScore          3 non-null      int64  
 28  fuelCost08       3 non-null      int64  
 29  fuelCostA08      3 non-null      int64  
 30  fuelType         3 non-null      object 
 31  fuelType1        3 non-null      object 
 32  ghgScore         3 non-null      int64  
 33  ghgScoreA        3 non-null      int64  
 34  highway08        3 non-null      int64  
 35  highway08U       3 non-null      float64
 36  highwayA08       3 non-null      int64  
 37  highwayA08U      3 non-null      float64
 38  highwayCD        3 non-null      float64
 39  highwayE         3 non-null      float64
 40  highwayUF        3 non-null      float64
 41  hlv              3 non-null      int64  
 42  hpv              3 non-null      int64  
 43  id               3 non-null      int64  
 44  lv2              3 non-null      int64  
 45  lv4              3 non-null      int64  
 46  make             3 non-null      object 
 47  model            3 non-null      object 
 48  mpgData          3 non-null      object 
 49  phevBlended      3 non-null      bool   
 50  pv2              3 non-null      int64  
 51  pv4              3 non-null      int64  
 52  range            3 non-null      int64  
 53  rangeCity        3 non-null      float64
 54  rangeCityA       3 non-null      float64
 55  rangeHwy         3 non-null      float64
 56  rangeHwyA        3 non-null      float64
 57  trany            3 non-null      object 
 58  UCity            3 non-null      float64
 59  UCityA           3 non-null      float64
 60  UHighway         3 non-null      float64
 61  UHighwayA        3 non-null      float64
 62  VClass           3 non-null      object 
 63  year             3 non-null      int64  
 64  youSaveSpend     3 non-null      int64  
 65  guzzler          1 non-null      object 
 66  trans_dscr       1 non-null      object 
 67  tCharger         0 non-null      float64
 68  sCharger         0 non-null      float64
 69  atvType          0 non-null      float64
 70  fuelType2        0 non-null      float64
 71  rangeA           0 non-null      float64
 72  evMotor          0 non-null      float64
 73  mfrCode          0 non-null      float64
 74  c240Dscr         0 non-null      float64
 75  charge240b       3 non-null      float64
 76  c240bDscr        0 non-null      float64
 77  createdOn        3 non-null      object 
 78  modifiedOn       3 non-null      object 
 79  startStop        0 non-null      float64
 80  phevCity         3 non-null      int64  
 81  phevHwy          3 non-null      int64  
 82  phevComb         3 non-null      int64  
dtypes: bool(1), float64(41), int64(28), object(13)
memory usage: 2.0+ KB
In[6]: fueleco=pd.read_csv("vehicles.csv")
D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py:3155: DtypeWarning: Columns (70,71,72,73,74,76,79) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In[7]: fueleco=pd.read_csv("vehicles.csv",usecols=list(range(0,70,1)))
In[8]: fueleco.mean()
Out[8]: 
barrels08             17.442712
barrelsA08             0.219276
charge120              0.000000
charge240              0.029630
city08                18.077799
city08U                5.040648
cityA08                0.569883
cityA08U               0.416097
cityCD                 0.000560
cityE                  0.225181
cityUF                 0.000975
co2                   72.538989
co2A                   5.543950
co2TailpipeAGpm       17.826864
co2TailpipeGpm       470.704841
comb08                20.323828
comb08U                5.652724
combA08                0.631160
combA08U               0.453725
combE                  0.230912
combinedCD             0.000459
combinedUF             0.000959
cylinders              5.729105
displ                  3.309829
engId               8582.377382
feScore                0.122580
fuelCost08          2242.470781
fuelCostA08           91.335260
ghgScore               0.120866
ghgScoreA             -0.923889
highway08             24.208588
highway08U             6.712736
highwayA08             0.736452
highwayA08U            0.523423
highwayCD              0.000343
highwayE               0.238526
highwayUF              0.000938
hlv                    2.029539
hpv                   10.411243
id                 19662.541188
lv2                    1.834812
lv4                    6.155930
phevBlended            0.001458
pv2                   13.649574
pv4                   33.883711
range                  0.500243
rangeCity              0.458375
rangeCityA             0.050978
rangeHwy               0.450392
rangeHwyA              0.046958
UCity                 22.789421
UCityA                 0.723139
UHighway              33.884375
UHighwayA              1.009562
year                2000.635406
youSaveSpend       -3459.572645
dtype: float64
In[9]: fueleco.describe(include='object')
Out[9]: 
                    drive eng_dscr fuelType  ... tCharger sCharger atvType
count               37912    23431    39101  ...     5816      738    3204
unique                  7      545       14  ...        1        1       8
top     Front-Wheel Drive    (FFS)  Regular  ...        T        S     FFV
freq                13653     8827    25620  ...     5816      738    1383
[4 rows x 14 columns]
In[10]: fueleco.make.value_counts()
Out[10]: 
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ... 
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64
In[11]: fueleco.make.nunique()
Out[11]: 134
In[12]: fueleco.select_dtypes("int64")
Out[12]: 
       city08  cityA08  co2  co2A  comb08  ...  pv2  pv4  range  year  youSaveSpend
0          19        0   -1    -1      21  ...    0    0      0  1985         -1750
1           9        0   -1    -1      11  ...    0    0      0  1985        -10500
2          23        0   -1    -1      27  ...    0    0      0  1985           250
3          10        0   -1    -1      11  ...    0    0      0  1985        -10500
4          17        0   -1    -1      19  ...    0   90      0  1993         -4750
       ...      ...  ...   ...     ...  ...  ...  ...    ...   ...           ...
39096      19        0   -1    -1      22  ...    0   90      0  1993         -1500
39097      20        0   -1    -1      23  ...    0   90      0  1993         -1000
39098      18        0   -1    -1      21  ...    0   90      0  1993         -1750
39099      18        0   -1    -1      21  ...    0   90      0  1993         -1750
39100      16        0   -1    -1      18  ...    0   90      0  1993         -5500
[39101 rows x 24 columns]
In[13]: fueleco.select_dtypes("int64").describe().T
Out[13]: 
                count          mean           std  ...      50%      75%      max
city08        39101.0     18.077799      6.970672  ...     17.0     20.0    150.0
cityA08       39101.0      0.569883      4.297124  ...      0.0      0.0    145.0
co2           39101.0     72.538989    163.252019  ...     -1.0     -1.0    847.0
co2A          39101.0      5.543950     55.956932  ...     -1.0     -1.0    713.0
comb08        39101.0     20.323828      6.882807  ...     20.0     23.0    136.0
combA08       39101.0      0.631160      4.395797  ...      0.0      0.0    133.0
engId         39101.0   8582.377382  17606.675590  ...    202.0   4401.0  69102.0
feScore       39101.0      0.122580      2.516348  ...     -1.0     -1.0     10.0
fuelCost08    39101.0   2242.470781    601.273869  ...   2250.0   2500.0   6850.0
fuelCostA08   39101.0     91.335260    479.485802  ...      0.0      0.0   3850.0
ghgScore      39101.0      0.120866      2.512612  ...     -1.0     -1.0     10.0
ghgScoreA     39101.0     -0.923889      0.651017  ...     -1.0     -1.0      8.0
highway08     39101.0     24.208588      7.128070  ...     24.0     27.0    122.0
highwayA08    39101.0      0.736452      4.694207  ...      0.0      0.0    121.0
hlv           39101.0      2.029539      5.959735  ...      0.0      0.0     49.0
hpv           39101.0     10.411243     28.167271  ...      0.0      0.0    195.0
id            39101.0  19662.541188  11413.329199  ...  19552.0  29555.0  39483.0
lv2           39101.0      1.834812      4.407887  ...      0.0      0.0     41.0
lv4           39101.0      6.155930      9.698101  ...      0.0     13.0     55.0
pv2           39101.0     13.649574     31.214466  ...      0.0      0.0    194.0
pv4           39101.0     33.883711     45.991687  ...      0.0     91.0    192.0
range         39101.0      0.500243      9.742080  ...      0.0      0.0    335.0
year          39101.0   2000.635406     10.690422  ...   2001.0   2010.0   2018.0
youSaveSpend  39101.0  -3459.572645   3010.284617  ...  -3500.0  -1500.0   5250.0
[24 rows x 8 columns]
In[14]: # iinfo function in numpy will show the limit for integer types
In[15]: np.iinfo(np.int8)
Out[15]: iinfo(min=-128, max=127, dtype=int8)
In[16]: np.iinfo(int16)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-33fbc0c72155>", line 1, in <module>
    np.iinfo(int16)
NameError: name 'int16' is not defined
In[17]: np.iinfo(np.int16)
Out[17]: iinfo(min=-32768, max=32767, dtype=int16)
In[18]: # 'city08' and 'comb08' never go above 150, so they fit comfortably in int16
In[19]: fueleco[['city08','comb08']].astype('int16').info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
In[20]: fueleco['city08','comb08'].info(memory_usage='deep')
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('city08', 'comb08')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-e5f2d55949d1>", line 1, in <module>
    fueleco['city08','comb08'].info(memory_usage='deep')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('city08', 'comb08')
In[21]: fueleco[['city08','comb08']].info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int64
 1   comb08  39101 non-null  int64
dtypes: int64(2)
memory usage: 611.1 KB
In[22]: # so just modify the type of 'city08' and 'comb08' into 'int16'
In[23]: fueleco[['city08','comb08']].assign()
Out[23]: 
       city08  comb08
0          19      21
1           9      11
2          23      27
3          10      11
4          17      19
       ...     ...
39096      19      22
39097      20      23
39098      18      21
39099      18      21
39100      16      18
[39101 rows x 2 columns]
In[24]: (fueleco[['city08','comb08']]
   ...:     .assin(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.in16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-15d636284b16>", line 1, in <module>
    (fueleco[['city08','comb08']]
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'assin'
In[25]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.in16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-fccec6400831>", line 3, in <module>
    comb08=fueleco.comb08.astype(np.in16)
  File "D:\PyCharm2020\python2020\lib\site-packages\numpy\__init__.py", line 214, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'in16'
In[26]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype('int16'),
   ...:            comb08=fueleco.comb08.astype('in16')
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-fe1ee5143ca8>", line 3, in <module>
    comb08=fueleco.comb08.astype('in16')
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 5874, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\internals\blocks.py", line 626, in astype
    dtype = pandas_dtype(dtype)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\dtypes\common.py", line 1799, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'in16' not understood
In[27]: (fueleco[['city08','comb08']]
   ...:     .assign(city08=fueleco.city08.astype(np.int16),
   ...:            comb08=fueleco.comb08.astype(np.int16)
   ...:            )
   ...:  .info(memory_usage='deep')
   ...:  )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
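The range check done by eye above can also be scripted. This is only a sketch on a toy frame with made-up values; `fits` is a hypothetical helper, not a pandas or NumPy function. `pd.to_numeric` with `downcast="integer"` is a real pandas option that picks the smallest signed integer type that holds the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mileage columns; the values are made up.
df = pd.DataFrame({"city08": [19, 9, 23, 150], "comb08": [21, 11, 27, 136]})

def fits(series, dtype):
    """Return True if every value of `series` lies within the range of `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

print(fits(df.city08, np.int8))    # int8 tops out at 127
print(fits(df.city08, np.int16))

# Let pandas choose the smallest safe integer type for each column.
small = df.apply(pd.to_numeric, downcast="integer")
print(small.dtypes)
```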
In[28]: fueleco['make','model'].nunique()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('make', 'model')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-bd185d23b85a>", line 1, in <module>
    fueleco['make','model'].nunique()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: ('make', 'model')
In[29]: fueleco[['make','model']].nunique()
Out[29]: 
make      134
model    3816
dtype: int64
In[30]: # 'make' has low cardinality, so convert it to 'category' to save memory

In[33]: (
   ...:     fueleco[['make']]
   ...:     .assign(make=fueleco.make.astype('category')
   ...: ).info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   make    39101 non-null  category
dtypes: category(1)
memory usage: 89.5 KB
In[34]: (fueleco[['model']]
   ...:     .assign(model=fueleco.model.astype('category'))
   ...:  .info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   model   39101 non-null  category
dtypes: category(1)
memory usage: 465.7 KB
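How much the category type saves depends on cardinality. The sketch below uses synthetic strings (not the real make and model columns) to compare deep memory usage before and after the conversion; `deep_bytes` is a small helper defined here for the comparison.

```python
import pandas as pd

n = 10_000
# Low cardinality: 4 distinct labels repeated many times (made-up values).
low = pd.Series(["Ford", "Honda", "Tesla", "BMW"] * (n // 4), dtype=object)
# High cardinality: every value is distinct.
high = pd.Series([f"model-{i}" for i in range(n)], dtype=object)

def deep_bytes(s):
    """Memory of the Series in bytes, including the Python string objects."""
    return s.memory_usage(deep=True)

for name, s in [("low", low), ("high", high)]:
    as_cat = s.astype("category")
    print(name, "object:", deep_bytes(s), "category:", deep_bytes(as_cat))
```

A category column stores each distinct string once plus a small integer code per row, so the saving shrinks as the number of distinct values approaches the number of rows.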

In[36]: fueleco.make.value_counts()
Out[36]: 
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ... 
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64

In[37]: # the summary above has too many values; keep the top 6 and collapse the remaining values into "other"
In[38]: top_n=fueleco.make.value_counts().index[:6]
In[39]: fueleco.value_counts()
Out[39]: Series([], dtype: int64)
In[40]: fueleco.make.value_counts()
Out[40]: 
Chevrolet                      3900
Ford                           3208
Dodge                          2557
GMC                            2442
Toyota                         1976
                               ... 
Shelby                            1
Grumman Allied Industries         1
Qvale                             1
Volga Associated Automobile       1
Goldacre                          1
Name: make, Length: 134, dtype: int64

In[42]: top_n
Out[42]: Index(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Toyota', 'BMW'], dtype='object')

In[44]: (fueleco.assign(
   ...:     make=fueleco.make.where(fueleco.make.isin(top_n),"other"))
   ...:     .make.value_counts())
Out[44]: 
other        23211
Chevrolet     3900
Ford          3208
Dodge         2557
GMC           2442
Toyota        1976
BMW           1807
Name: make, dtype: int64
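The same collapse on a tiny made-up Series makes it easier to see what `.where` and `.isin` are doing: `.isin` builds a Boolean mask, and `.where` keeps the value wherever the mask is True and substitutes the replacement everywhere else.

```python
import pandas as pd

# Made-up values, small enough to check by hand.
make = pd.Series(["Ford", "Ford", "Ford", "BMW", "BMW", "Qvale", "Shelby"])

top_n = make.value_counts().index[:2]               # the 2 most frequent labels
collapsed = make.where(make.isin(top_n), "Other")   # keep top labels, replace the rest
counts = collapsed.value_counts()

print(counts)
```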
In[45]: # determine the number and percent of missing values
In[46]: fueleco.drive.isna().sum()
Out[46]: 1189
In[47]: fueleco.isna().mean()
Out[47]: 
barrels08     0.000000
barrelsA08    0.000000
charge120     0.000000
charge240     0.000000
city08        0.000000
                ...   
guzzler       0.940283
trans_dscr    0.615176
tCharger      0.851257
sCharger      0.981126
atvType       0.918058
Length: 70, dtype: float64
In[48]: fueleco.drive.isna().mean()
Out[48]: 0.030408429451932176
In[49]: fueleco.drive.isna().mean()*100
Out[49]: 3.0408429451932175
In[50]: # use .nunique method to determine cardinality 
In[51]: fueleco.drive.nunique()
Out[51]: 7
In[52]: # pick out the columns with data types that are object 
In[53]: fueleco.select_dtypes('object').columns
Out[53]: 
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
       'mpgData', 'trany', 'VClass', 'guzzler', 'trans_dscr', 'tCharger',
       'sCharger', 'atvType'],
      dtype='object')

In[55]: import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
top_n=fueleco.make.value_counts().index[:6]
(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),'Other'))
 .make.value_counts().plot.bar(ax=ax))
Out[58]: <AxesSubplot:>
fig.savefig("c5-catpan.png",dpi=300)

pd.cut bins a numeric column into equal-width bins or into bins whose edges we specify; pd.qcut (quantile cut) bins it at quantiles, so each bin holds roughly the same number of rows. With these functions we can treat numeric columns as categories by binning them.
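A short sketch of the difference between the two, on made-up mileage values with one extreme outlier. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up values; 150 plays the role of an outlier.
mpg = pd.Series([6, 10, 14, 15, 16, 17, 18, 19, 21, 23, 27, 150], name="city08")

# pd.cut with an integer: bins of equal width over the range of the data.
equal_width = pd.cut(mpg, bins=4)
# pd.cut with explicit edges (right-inclusive by default): (0, 15], (15, 25], (25, 200].
custom = pd.cut(mpg, bins=[0, 15, 25, 200], labels=["low", "mid", "high"])
# pd.qcut: edges placed at quantiles, so the rows are spread evenly across bins.
quartiles = pd.qcut(mpg, q=4)

print(equal_width.value_counts().sort_index())
print(custom.value_counts().sort_index())
print(quartiles.value_counts().sort_index())
```

With skewed data the equal-width bins put almost every row into the first bin, while the quantile bins stay balanced.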

continuous data

import seaborn as sns
fig,ax=plt.subplots(figsize=(10,8))
sns.countplot(y='make',data=(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),"Other"))))
Out[66]: <AxesSubplot:xlabel='count', ylabel='make'>
fig.savefig("c5-catsns.png",dpi=300)

# rows where 'drive' is missing
fueleco[fueleco.drive.isna()]
Out[69]: 
       barrels08  barrelsA08  charge120  ...  tCharger  sCharger  atvType
7138    0.240000         0.0        0.0  ...       NaN       NaN       EV
8144    0.312000         0.0        0.0  ...       NaN       NaN       EV
8147    0.270000         0.0        0.0  ...       NaN       NaN       EV
18215  15.695714         0.0        0.0  ...       NaN       NaN      NaN
18216  14.982273         0.0        0.0  ...       NaN       NaN      NaN
          ...         ...        ...  ...       ...       ...      ...
23023   0.240000         0.0        0.0  ...       NaN       NaN       EV
23024   0.546000         0.0        0.0  ...       NaN       NaN       EV
23026   0.426000         0.0        0.0  ...       NaN       NaN       EV
23031   0.426000         0.0        0.0  ...       NaN       NaN       EV
23034   0.204000         0.0        0.0  ...       NaN       NaN       EV
[1189 rows x 70 columns]

# by default, .value_counts does not show missing values; pass dropna=False to include them
fueleco.drive.value_counts(dropna=False)
Out[71]: 
Front-Wheel Drive             13653
Rear-Wheel Drive              13284
4-Wheel or All-Wheel Drive     6648
All-Wheel Drive                2401
4-Wheel Drive                  1221
NaN                            1189
2-Wheel Drive                   507
Part-time 4-Wheel Drive         198
Name: drive, dtype: int64
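A small sketch of the effect of `dropna`, on made-up drive labels. `normalize=True` is another real `value_counts` parameter; it reports shares instead of counts, which ties in with the `.isna().mean()` percentage computed earlier.

```python
import numpy as np
import pandas as pd

# Made-up labels with two missing entries.
drive = pd.Series(
    ["Front", "Rear", np.nan, "Front", np.nan, "Rear", "Front", "4WD"],
    dtype=object,
)

without_na = drive.value_counts()               # NaN is left out
with_na = drive.value_counts(dropna=False)      # NaN gets its own row
shares = drive.value_counts(dropna=False, normalize=True)

print(without_na)
print(with_na)
print(shares)
```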

fig,ax=plt.subplots(figsize=(10,8))
fueleco.drive.value_counts(dropna=False).plot.bar(ax=ax)
Out[73]: <AxesSubplot:>
fig.savefig('c5-aa',dpi=300)

",".join('abcd')
Out[76]: 'a,b,c,d'


fueleco.city08.sample(5,random_state=40)
Out[80]: 
4643     16
1483     15
34149    21
563      14
2364     19
Name: city08, dtype: int64
fueleco.city08.sample(5,random_state=42)
Out[81]: 
4217     11
1736     21
36029    16
37631    16
1668     17
Name: city08, dtype: int64
# use pandas to plot a histogram
import matplotlib.pyplot as plt
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax)
Out[85]: <AxesSubplot:>
fig.savefig('hist.png',dpi=300)
# the plot looks heavily skewed; the long tail stretches the range, so each of the 10 default bins is wide
# increase the number of bins to see whether the wide bins are hiding any behavior
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax,bins=30)
Out[90]: <AxesSubplot:>
fig.savefig("hist-bins-30.png",dpi=300)
# use seaborn to create a distribution plot, which includes a histogram, a kernel density estimate (KDE), and a rug plot
sns.distplot(fueleco.city08,rug=True,ax=ax)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2056: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
Out[96]: <AxesSubplot:xlabel='city08', ylabel='Density'>
fig.savefig('rugplot.png',dpi=300)

[160(181/627)]

An introduction to the seaborn plotting functions boxplot, boxenplot, and violinplot

[162(183/627)]
Checking graphically whether the data follow a normal distribution:

fig,ax=plt.subplots(nrows=3,figsize=(10,8))
sns.boxplot(fueleco.city08,ax=ax[0])
sns.violinplot(fueleco.city08,ax=ax[1])
sns.boxenplot(fueleco.city08,ax=ax[2])
fig.savefig('subplots_nrows3.png',dpi=300)
from scipy import stats
stats.kstest(fueleco.city08,cdf='norm')
Out[104]: KstestResult(statistic=0.9999999990134123, pvalue=0.0)
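A note on reading that result: with `cdf='norm'` and no extra arguments, `kstest` compares the data against the standard normal N(0, 1). Values around 18 all sit far in the right tail of N(0, 1), so the statistic comes out close to 1 regardless of the shape of the data. The sketch below uses synthetic data (not the real city08 column) and passes the sample mean and standard deviation through `args`, assuming the question is "does the shape look normal?".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic data: normal with mean 18 and std 7, only for illustration.
x = rng.normal(loc=18, scale=7, size=5_000)

# Compared with N(0, 1): the statistic is near 1 even though x is normal.
raw = stats.kstest(x, "norm")

# Compared with N(sample mean, sample std): the statistic is small for normal data.
fitted = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(raw.statistic, raw.pvalue)
print(fitted.statistic, fitted.pvalue)
```

Because the mean and std are estimated from the same sample, the p-value of the second test is only approximate; the statistic is still a useful indication of how far the shape is from normal.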
fig,ax=plt.subplots(figsize=(10,8))
stats.probplot(fueleco.city08,plot=ax)
Out[106]: 
((array([-4.1352692 , -3.92687024, -3.81314873, ...,  3.81314873,
          3.92687024,  4.1352692 ]),
  array([  6,   6,   6, ..., 137, 138, 150], dtype=int64)),
 (5.385946629915974, 18.077798521776934, 0.772587941459713))
fig.savefig('probplot.png',dpi=300)

Comparing continuous values across categories

# make a mask for the brands we want 
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"])
mask
Out[112]: 
0        False
1        False
2        False
3        False
4        False
         ...  
39096    False
39097    False
39098    False
39099    False
39100    False
Name: make, Length: 39101, dtype: bool
type(mask)
Out[113]: pandas.core.series.Series
fueleco[mask]
Out[114]: 
       barrels08  barrelsA08  charge120  ...  tCharger  sCharger  atvType
20     20.600625         0.0        0.0  ...       NaN       NaN      NaN
21     20.600625         0.0        0.0  ...       NaN       NaN      NaN
22     25.354615         0.0        0.0  ...       NaN       NaN      NaN
56     15.695714         0.0        0.0  ...       NaN       NaN      NaN
57     17.347895         0.0        0.0  ...       NaN       NaN      NaN
          ...         ...        ...  ...       ...       ...      ...
39016  13.733750         0.0        0.0  ...       NaN       NaN      NaN
39017  17.347895         0.0        0.0  ...       NaN       NaN      NaN
39018  15.695714         0.0        0.0  ...       NaN       NaN      NaN
39023  14.982273         0.0        0.0  ...       NaN       NaN      NaN
39025  13.733750         0.0        0.0  ...       NaN       NaN      NaN
[5986 rows x 70 columns]
fueleco[mask].groupby("make").city08.agg(["mean","std"])
Out[115]: 
            mean       std
make                      
BMW    17.817377  7.372907
Ford   16.853803  6.701029
Honda  24.372973  9.154064
Tesla  92.826087  5.538970
# the groupby operation above reports the mean and std of the city08 column for each make
# visualize the city08 values for each make with seaborn
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box')
g.ax.figure.savefig("c5-catbox.png",dpi=300)
# one drawback of a box plot is that, while it indicates the spread of the data, it does not reveal how many samples are in each make
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"]) # Boolean Series: each element says whether that row's make is one of these four
fueleco[mask].groupby("make").city08.count()
Out[123]: 
make
BMW      1807
Ford     3208
Honda     925
Tesla      46
Name: city08, dtype: int64
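The count can also be put in the same table as the mean and std, so the spread and the sample size are read together. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, small enough to check by hand.
df = pd.DataFrame({
    "make": ["Ford", "Ford", "Ford", "Tesla", "Honda", "Honda"],
    "city08": [15, 17, 19, 92, 24, 30],
})

# One row per make, one column per aggregation.
summary = df.groupby("make").city08.agg(["count", "mean", "std"])
print(summary)
```

A group with a single row gets NaN for std, which is itself a hint that the group is too small to say much about its spread.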
# facet the grid by another feature
# break each of these new plots into its own graph by using the col parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018])
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018],col_wrap=2)
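# col is the column to facet on; col_order sets the order of the panels
# embed the new dimension in the same plot by using the hue parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',hue='year',hue_order=[2012,2014,2016,2018])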

Comparing two continuous columns

# Comparing two continuous columns
# if you have two columns with a high correlation to one another, you can often drop one of them as redundant
# look at the covariance of the two columns if they are on the same scale
fueleco.city08.cov(fueleco.highway08)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-be0843446ebc>", line 1, in <module>
    fueleco.city08.cov(fueleco.highway08)
NameError: name 'fueleco' is not defined
import pandas as pd
import numpy as np
fueleco=pd.read_csv("vehicles.csv",usecols=list(range(1,70,1)))
fueleco.city08.cov(fueleco.highway08)
Out[9]: 46.33326023673624
fueleco.city08.cov(fueleco.comb08)
Out[10]: 47.419946678190776
fueleco.city08.cov(fueleco.cylinders)
Out[11]: -5.931560263764768
# Pearson correlation between the two numbers
fueleco.city08.corr(fueleco.highway08)
Out[13]: 0.932494506228495
# use pandas to scatter plot the relationship
import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax)
Out[17]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[18]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fig.savefig('scatterplot_alpha0.01.png',dpi=300)
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[21]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.1)
# Pearson correlation is intended to show the strength of a linear relationship
# if the continuous columns do not have a linear relationship, another option is Spearman correlation
# this number also varies from -1 to 1
# it measures whether the relationship is monotonic and does not presume that it is linear
# it uses the rank of each number rather than the number itself; if you are not sure whether there is a linear relationship between your columns, this is a better metric to use
fueleco.city08.corr(fueleco.barrelsA08,method='spearman')
Out[31]: -0.08476703673460519
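To see the difference between the two coefficients, here is a sketch on synthetic data where the relationship is perfectly monotonic but far from linear. Spearman is computed by hand as the Pearson correlation of the ranks, which is what `method='spearman'` amounts to.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but along a curve, not a line.
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2)

pearson = x.corr(y)                    # method="pearson" is the default
spearman = x.rank().corr(y.rank())     # Pearson on the ranks = Spearman

print(round(pearson, 3), round(spearman, 3))
```

Spearman reports a perfect monotonic relationship, while Pearson is pulled well below 1 by the curvature.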
# Pearson correlation tells us how strongly two columns move together linearly, on a fixed scale from -1 to 1
# covariance also tells us how the values vary together, but its size depends on the units of the columns
# a heatmap is a great way to look at correlations in aggregate
# scatter plots are another way to visualize the relationships between continuous variables
# set the alpha parameter to a value less than or equal to 0.5 to make the points transparent, so overlapping points stay visible
# now add more dimensions to the scatter plot
# using the relplot function, we can color the dots by year and size them by the number of barrels the vehicle consumes
# in this case, we go from two dimensions to four
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8)
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8,col='make',col_order=['Ford',"Tesla"])

Comparing categorical values with categorical values

# continuous columns can be converted into categorical columns by binning the values
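One common way to compare two categorical columns is a cross-tabulation of counts. The sketch below is an assumption about what this section is getting at, not a copy of the book's code: it bins a made-up mileage column with `pd.cut` and then counts rows for each (make, bin) pair with `pd.crosstab`. The bin edges and labels are arbitrary.

```python
import pandas as pd

# Made-up rows, small enough to check by hand.
df = pd.DataFrame({
    "make": ["Ford", "Ford", "Honda", "Honda", "Tesla", "Ford"],
    "city08": [12, 18, 24, 30, 92, 15],
})

# Turn the continuous column into three ordered categories.
mpg_band = pd.cut(df.city08, bins=[0, 16, 25, 200], labels=["low", "mid", "high"])

# Rows are makes, columns are mileage bands, cells are row counts.
table = pd.crosstab(df.make, mpg_band)
print(table)
```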

[179(200/627)] I did not really understand what this part is doing...

# if you use seaborn, you can add multiple dimensions by setting 'hue' or 'col'

Using the pandas profiling library

[187(208/627)]
