dive more into …: to discuss in greater depth
Exploratory data analysis (EDA): the process of sifting through the data and trying to make sense of the individual columns and the relationships between them.
literally: virtually, almost
What is 'parsing dates'?
divine more about the data
Object columns may hold strings or categorical data, but they could also hold numeric-like values that need to be nudged a little so that they become numeric.
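A minimal sketch of the last two notes, using a small made-up frame rather than vehicles.csv (the column names `price` and `created` are hypothetical). `pd.to_numeric` nudges numeric-like strings into numbers, and "parsing dates" means converting date strings into real datetime values, here with `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical toy data: both columns start out as strings
raw = pd.DataFrame({
    "price": ["1200", "950", "n/a"],                        # numeric-like strings
    "created": ["2013-01-01", "2013-02-15", "2014-07-04"],  # date strings
})

cleaned = raw.assign(
    # errors="coerce" turns anything that is not a number into NaN
    price=pd.to_numeric(raw["price"], errors="coerce"),
    # parsing dates: the strings become datetime values
    created=pd.to_datetime(raw["created"], format="%Y-%m-%d"),
)

print(cleaned.dtypes)
print(cleaned["created"].dt.year.tolist())
```

`pd.read_csv` also accepts a `parse_dates` argument that performs the same conversion while loading.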
In[2]: import pandas as pd
In[3]: import numpy as np
In[4]: fueleco=pd.read_csv("vehicles.csv",nrows=3)
In[5]: fueleco.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 83 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 barrels08 3 non-null float64
1 barrelsA08 3 non-null float64
2 charge120 3 non-null float64
3 charge240 3 non-null float64
4 city08 3 non-null int64
5 city08U 3 non-null float64
6 cityA08 3 non-null int64
7 cityA08U 3 non-null float64
8 cityCD 3 non-null float64
9 cityE 3 non-null float64
10 cityUF 3 non-null float64
11 co2 3 non-null int64
12 co2A 3 non-null int64
13 co2TailpipeAGpm 3 non-null float64
14 co2TailpipeGpm 3 non-null float64
15 comb08 3 non-null int64
16 comb08U 3 non-null float64
17 combA08 3 non-null int64
18 combA08U 3 non-null float64
19 combE 3 non-null float64
20 combinedCD 3 non-null float64
21 combinedUF 3 non-null float64
22 cylinders 3 non-null int64
23 displ 3 non-null float64
24 drive 3 non-null object
25 engId 3 non-null int64
26 eng_dscr 3 non-null object
27 feScore 3 non-null int64
28 fuelCost08 3 non-null int64
29 fuelCostA08 3 non-null int64
30 fuelType 3 non-null object
31 fuelType1 3 non-null object
32 ghgScore 3 non-null int64
33 ghgScoreA 3 non-null int64
34 highway08 3 non-null int64
35 highway08U 3 non-null float64
36 highwayA08 3 non-null int64
37 highwayA08U 3 non-null float64
38 highwayCD 3 non-null float64
39 highwayE 3 non-null float64
40 highwayUF 3 non-null float64
41 hlv 3 non-null int64
42 hpv 3 non-null int64
43 id 3 non-null int64
44 lv2 3 non-null int64
45 lv4 3 non-null int64
46 make 3 non-null object
47 model 3 non-null object
48 mpgData 3 non-null object
49 phevBlended 3 non-null bool
50 pv2 3 non-null int64
51 pv4 3 non-null int64
52 range 3 non-null int64
53 rangeCity 3 non-null float64
54 rangeCityA 3 non-null float64
55 rangeHwy 3 non-null float64
56 rangeHwyA 3 non-null float64
57 trany 3 non-null object
58 UCity 3 non-null float64
59 UCityA 3 non-null float64
60 UHighway 3 non-null float64
61 UHighwayA 3 non-null float64
62 VClass 3 non-null object
63 year 3 non-null int64
64 youSaveSpend 3 non-null int64
65 guzzler 1 non-null object
66 trans_dscr 1 non-null object
67 tCharger 0 non-null float64
68 sCharger 0 non-null float64
69 atvType 0 non-null float64
70 fuelType2 0 non-null float64
71 rangeA 0 non-null float64
72 evMotor 0 non-null float64
73 mfrCode 0 non-null float64
74 c240Dscr 0 non-null float64
75 charge240b 3 non-null float64
76 c240bDscr 0 non-null float64
77 createdOn 3 non-null object
78 modifiedOn 3 non-null object
79 startStop 0 non-null float64
80 phevCity 3 non-null int64
81 phevHwy 3 non-null int64
82 phevComb 3 non-null int64
dtypes: bool(1), float64(41), int64(28), object(13)
memory usage: 2.0+ KB
In[6]: fueleco=pd.read_csv("vehicles.csv")
D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py:3155: DtypeWarning: Columns (70,71,72,73,74,76,79) have mixed types.Specify dtype option on import or set low_memory=False.
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
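The warning says that pandas inferred more than one type for the same column. A sketch of the two remedies the message names, run on an in-memory CSV with a hypothetical column `code` (not the real file):

```python
import io
import pandas as pd

# Hypothetical CSV text: the column 'code' mixes digits and letters
csv_text = "id,code\n1,7\n2,X9\n3,12\n"

# Remedy 1: state the dtype up front so no inference is needed
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# Remedy 2: read the whole file in one pass before inferring types
one_pass = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(as_text["code"].tolist())
```

The session below sidesteps the problem instead: `usecols` loads only columns 0 to 69, which leaves out the columns named in the warning.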
In[7]: fueleco=pd.read_csv("vehicles.csv",usecols=list(range(0,70,1)))
In[8]: fueleco.mean()
Out[8]:
barrels08 17.442712
barrelsA08 0.219276
charge120 0.000000
charge240 0.029630
city08 18.077799
city08U 5.040648
cityA08 0.569883
cityA08U 0.416097
cityCD 0.000560
cityE 0.225181
cityUF 0.000975
co2 72.538989
co2A 5.543950
co2TailpipeAGpm 17.826864
co2TailpipeGpm 470.704841
comb08 20.323828
comb08U 5.652724
combA08 0.631160
combA08U 0.453725
combE 0.230912
combinedCD 0.000459
combinedUF 0.000959
cylinders 5.729105
displ 3.309829
engId 8582.377382
feScore 0.122580
fuelCost08 2242.470781
fuelCostA08 91.335260
ghgScore 0.120866
ghgScoreA -0.923889
highway08 24.208588
highway08U 6.712736
highwayA08 0.736452
highwayA08U 0.523423
highwayCD 0.000343
highwayE 0.238526
highwayUF 0.000938
hlv 2.029539
hpv 10.411243
id 19662.541188
lv2 1.834812
lv4 6.155930
phevBlended 0.001458
pv2 13.649574
pv4 33.883711
range 0.500243
rangeCity 0.458375
rangeCityA 0.050978
rangeHwy 0.450392
rangeHwyA 0.046958
UCity 22.789421
UCityA 0.723139
UHighway 33.884375
UHighwayA 1.009562
year 2000.635406
youSaveSpend -3459.572645
dtype: float64
In[9]: fueleco.describe(include='object')
Out[9]:
drive eng_dscr fuelType ... tCharger sCharger atvType
count 37912 23431 39101 ... 5816 738 3204
unique 7 545 14 ... 1 1 8
top Front-Wheel Drive (FFS) Regular ... T S FFV
freq 13653 8827 25620 ... 5816 738 1383
[4 rows x 14 columns]
In[10]: fueleco.make.value_counts()
Out[10]:
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
...
Shelby 1
Grumman Allied Industries 1
Qvale 1
Volga Associated Automobile 1
Goldacre 1
Name: make, Length: 134, dtype: int64
In[11]: fueleco.make.nunique()
Out[11]: 134
In[12]: fueleco.select_dtypes("int64")
Out[12]:
city08 cityA08 co2 co2A comb08 ... pv2 pv4 range year youSaveSpend
0 19 0 -1 -1 21 ... 0 0 0 1985 -1750
1 9 0 -1 -1 11 ... 0 0 0 1985 -10500
2 23 0 -1 -1 27 ... 0 0 0 1985 250
3 10 0 -1 -1 11 ... 0 0 0 1985 -10500
4 17 0 -1 -1 19 ... 0 90 0 1993 -4750
... ... ... ... ... ... ... ... ... ... ...
39096 19 0 -1 -1 22 ... 0 90 0 1993 -1500
39097 20 0 -1 -1 23 ... 0 90 0 1993 -1000
39098 18 0 -1 -1 21 ... 0 90 0 1993 -1750
39099 18 0 -1 -1 21 ... 0 90 0 1993 -1750
39100 16 0 -1 -1 18 ... 0 90 0 1993 -5500
[39101 rows x 24 columns]
In[13]: fueleco.select_dtypes("int64").describe().T
Out[13]:
count mean std ... 50% 75% max
city08 39101.0 18.077799 6.970672 ... 17.0 20.0 150.0
cityA08 39101.0 0.569883 4.297124 ... 0.0 0.0 145.0
co2 39101.0 72.538989 163.252019 ... -1.0 -1.0 847.0
co2A 39101.0 5.543950 55.956932 ... -1.0 -1.0 713.0
comb08 39101.0 20.323828 6.882807 ... 20.0 23.0 136.0
combA08 39101.0 0.631160 4.395797 ... 0.0 0.0 133.0
engId 39101.0 8582.377382 17606.675590 ... 202.0 4401.0 69102.0
feScore 39101.0 0.122580 2.516348 ... -1.0 -1.0 10.0
fuelCost08 39101.0 2242.470781 601.273869 ... 2250.0 2500.0 6850.0
fuelCostA08 39101.0 91.335260 479.485802 ... 0.0 0.0 3850.0
ghgScore 39101.0 0.120866 2.512612 ... -1.0 -1.0 10.0
ghgScoreA 39101.0 -0.923889 0.651017 ... -1.0 -1.0 8.0
highway08 39101.0 24.208588 7.128070 ... 24.0 27.0 122.0
highwayA08 39101.0 0.736452 4.694207 ... 0.0 0.0 121.0
hlv 39101.0 2.029539 5.959735 ... 0.0 0.0 49.0
hpv 39101.0 10.411243 28.167271 ... 0.0 0.0 195.0
id 39101.0 19662.541188 11413.329199 ... 19552.0 29555.0 39483.0
lv2 39101.0 1.834812 4.407887 ... 0.0 0.0 41.0
lv4 39101.0 6.155930 9.698101 ... 0.0 13.0 55.0
pv2 39101.0 13.649574 31.214466 ... 0.0 0.0 194.0
pv4 39101.0 33.883711 45.991687 ... 0.0 91.0 192.0
range 39101.0 0.500243 9.742080 ... 0.0 0.0 335.0
year 39101.0 2000.635406 10.690422 ... 2001.0 2010.0 2018.0
youSaveSpend 39101.0 -3459.572645 3010.284617 ... -3500.0 -1500.0 5250.0
[24 rows x 8 columns]
In[14]: # the iinfo function in numpy shows the limits of integer types
In[15]: np.iinfo(np.int8)
Out[15]: iinfo(min=-128, max=127, dtype=int8)
In[16]: # np.iinfo(int16) raises NameError: name 'int16' is not defined; the type must be written as np.int16
In[17]: np.iinfo(np.int16)
Out[17]: iinfo(min=-32768, max=32767, dtype=int16)
In[18]: # 'city08' and 'comb08' do not go above 150, so int16 (max 32767) is enough while int8 (max 127) is not
In[19]: fueleco[['city08','comb08']].astype('int16').info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 39101 non-null int16
1 comb08 39101 non-null int16
dtypes: int16(2)
memory usage: 152.9 KB
In[20]: # fueleco['city08','comb08'] raises KeyError: ('city08', 'comb08'); selecting several columns needs a list, i.e. double brackets
In[21]: fueleco[['city08','comb08']].info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 39101 non-null int64
1 comb08 39101 non-null int64
dtypes: int64(2)
memory usage: 611.1 KB
In[22]: # so convert 'city08' and 'comb08' to int16
In[23]: fueleco[['city08','comb08']].assign()
Out[23]:
city08 comb08
0 19 21
1 9 11
2 23 27
3 10 11
4 17 19
... ...
39096 19 22
39097 20 23
39098 18 21
39099 18 21
39100 16 18
[39101 rows x 2 columns]
In[24]: # three attempts failed on typos: .assin (AttributeError: 'DataFrame' object has no attribute 'assin'), np.in16 (AttributeError: module 'numpy' has no attribute 'in16'), and 'in16' (TypeError: data type 'in16' not understood)
In[27]: (fueleco[['city08','comb08']]
...: .assign(city08=fueleco.city08.astype(np.int16),
...: comb08=fueleco.comb08.astype(np.int16)
...: )
...: .info(memory_usage='deep')
...: )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 39101 non-null int16
1 comb08 39101 non-null int16
dtypes: int16(2)
memory usage: 152.9 KB
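Rather than checking `np.iinfo` by hand, `pd.to_numeric` can choose the smallest integer type that holds the data. A sketch on made-up values whose minimum (6) and maximum (150) match what the session reports for city08:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column such as city08
mpg = pd.Series([19, 9, 23, 10, 17, 150, 6], dtype="int64")

# downcast="integer" picks the smallest signed integer dtype that fits
small = pd.to_numeric(mpg, downcast="integer")

print(small.dtype)                        # int16, because 150 > 127
print(mpg.memory_usage(index=False),      # bytes used as int64
      small.memory_usage(index=False))    # bytes used after the downcast
```

The values are unchanged; only the storage width shrinks from 8 bytes to 2 bytes per row, which is in line with the drop from 611.1 KB to 152.9 KB above.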
In[28]: # fueleco['make','model'] raises KeyError: ('make', 'model'); multiple columns must be passed as a list
In[29]: fueleco[['make','model']].nunique()
Out[29]:
make 134
model 3816
dtype: int64
In[30]: # 'make' has low cardinality, so convert it to 'category' to save memory
In[33]: (
...: fueleco[['make']]
...: .assign(make=fueleco.make.astype('category')
...: ).info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 39101 non-null category
dtypes: category(1)
memory usage: 89.5 KB
In[34]: (fueleco[['model']]
...: .assign(model=fueleco.model.astype('category'))
...: .info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 39101 non-null category
dtypes: category(1)
memory usage: 465.7 KB
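The two results above (89.5 KB for make with 134 unique values, 465.7 KB for model with 3816) suggest that the saving depends on cardinality. A sketch on synthetic data, with a hypothetical helper `deep_bytes`, comparing string storage with category storage for a low-cardinality column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic column with only 5 distinct labels (low cardinality)
low = pd.Series(rng.choice(["Ford", "BMW", "Honda", "Tesla", "GMC"], size=n))

def deep_bytes(s):
    """Memory of the values only, counting the string contents."""
    return s.memory_usage(index=False, deep=True)

before = deep_bytes(low)
after = deep_bytes(low.astype("category"))
print(before, after)
```

A category column stores each distinct label once plus a small integer code per row, so the more distinct labels there are, the smaller the benefit.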
In[36]: fueleco.make.value_counts()
Out[36]:
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
...
Shelby 1
Grumman Allied Industries 1
Qvale 1
Volga Associated Automobile 1
Goldacre 1
Name: make, Length: 134, dtype: int64
In[37]: # the summary above has too many values; keep the top 6 and collapse the remaining values into 'other'
In[38]: top_n=fueleco.make.value_counts().index[:6]
In[39]: fueleco.value_counts()
Out[39]: Series([], dtype: int64)
In[42]: top_n
Out[42]: Index(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Toyota', 'BMW'], dtype='object')
In[44]: (fueleco.assign(
...: make=fueleco.make.where(fueleco.make.isin(top_n),"other"))
...: .make.value_counts())
Out[44]:
other 23211
Chevrolet 3900
Ford 3208
Dodge 2557
GMC 2442
Toyota 1976
BMW 1807
Name: make, dtype: int64
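The same collapse can be wrapped in a small reusable function. This is a sketch only: `collapse_rare` is a hypothetical helper name, and the data are made up.

```python
import pandas as pd

def collapse_rare(s, n, other="Other"):
    """Keep the n most frequent values of s; replace the rest with `other`."""
    top = s.value_counts().index[:n]
    return s.where(s.isin(top), other)

# Made-up data: 'a' x5, 'b' x3, 'c' x2, 'd' x1
makes = pd.Series(list("aaaaabbbccd"))
collapsed = collapse_rare(makes, n=2)
print(collapsed.value_counts())
```

With the real data the call would be `collapse_rare(fueleco.make, 6)`, assuming `fueleco` is loaded as in the session.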
In[45]: # determine the number and percent of missing values
In[46]: fueleco.drive.isna().sum()
Out[46]: 1189
In[47]: fueleco.isna().mean()
Out[47]:
barrels08 0.000000
barrelsA08 0.000000
charge120 0.000000
charge240 0.000000
city08 0.000000
...
guzzler 0.940283
trans_dscr 0.615176
tCharger 0.851257
sCharger 0.981126
atvType 0.918058
Length: 70, dtype: float64
In[48]: fueleco.drive.isna().mean()
Out[48]: 0.030408429451932176
In[49]: fueleco.drive.isna().mean()*100
Out[49]: 3.0408429451932175
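Counts and percentages can be gathered into one table and sorted so the worst columns come first. A sketch on a made-up frame (the values are invented; only the column names echo the dataset):

```python
import pandas as pd

# Made-up frame with a few gaps
df = pd.DataFrame({
    "drive": ["FWD", None, "RWD", "FWD"],
    "guzzler": [None, None, None, "G"],
    "year": [1985, 1993, 2001, 2010],
})

summary = pd.DataFrame({
    "n_missing": df.isna().sum(),            # number of missing values
    "pct_missing": df.isna().mean() * 100,   # share of missing values in percent
}).sort_values("pct_missing", ascending=False)

print(summary)
```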
In[50]: # use .nunique method to determine cardinality
In[51]: fueleco.drive.nunique()
Out[51]: 7
In[52]: # pick out the columns with data types that are object
In[53]: fueleco.select_dtypes('object').columns
Out[53]:
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
'mpgData', 'trany', 'VClass', 'guzzler', 'trans_dscr', 'tCharger',
'sCharger', 'atvType'],
dtype='object')
In[55]: import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
top_n=fueleco.make.value_counts().index[:6]
(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),'Other'))
.make.value_counts().plot.bar(ax=ax))
Out[58]: <AxesSubplot:>
fig.savefig("c5-catpan.png",dpi=300)
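pd.cut: bins a numeric column into equal-width bins, or into bins whose edges we specify.
pd.qcut (quantile cut): bins a numeric column into quantiles, so each bin holds roughly the same number of rows.
With these functions we can treat numeric columns as categories by binning them.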
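A sketch of the difference between the two binning functions on made-up values with one large outlier:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the range 1..100 is split into 3 intervals of equal width
width_bins = pd.cut(values, bins=3)

# Equal-frequency bins: 4 quantile bins with the same number of rows each
quantile_bins = pd.qcut(values, q=4)

print(width_bins.value_counts(sort=False))
print(quantile_bins.value_counts(sort=False))

# labels=False returns the integer bin number instead of an interval
width_codes = pd.cut(values, bins=3, labels=False)
quantile_codes = pd.qcut(values, q=4, labels=False)
```

With equal-width bins the outlier leaves seven of the eight values in the first bin, while the quantile bins hold two values each.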
Continuous data
import seaborn as sns
fig,ax=plt.subplots(figsize=(10,8))
sns.countplot(y='make',data=(fueleco.assign(make=fueleco.make.where(fueleco.make.isin(top_n),"Other"))))
Out[66]: <AxesSubplot:xlabel='count', ylabel='make'>
fig.savefig("c5-catsns.png",dpi=300)
# rows where 'drive' is missing
fueleco[fueleco.drive.isna()]
Out[69]:
barrels08 barrelsA08 charge120 ... tCharger sCharger atvType
7138 0.240000 0.0 0.0 ... NaN NaN EV
8144 0.312000 0.0 0.0 ... NaN NaN EV
8147 0.270000 0.0 0.0 ... NaN NaN EV
18215 15.695714 0.0 0.0 ... NaN NaN NaN
18216 14.982273 0.0 0.0 ... NaN NaN NaN
... ... ... ... ... ... ...
23023 0.240000 0.0 0.0 ... NaN NaN EV
23024 0.546000 0.0 0.0 ... NaN NaN EV
23026 0.426000 0.0 0.0 ... NaN NaN EV
23031 0.426000 0.0 0.0 ... NaN NaN EV
23034 0.204000 0.0 0.0 ... NaN NaN EV
[1189 rows x 70 columns]
# by default, .value_counts does not show missing values; pass dropna=False to include them
fueleco.drive.value_counts(dropna=False)
Out[71]:
Front-Wheel Drive 13653
Rear-Wheel Drive 13284
4-Wheel or All-Wheel Drive 6648
All-Wheel Drive 2401
4-Wheel Drive 1221
NaN 1189
2-Wheel Drive 507
Part-time 4-Wheel Drive 198
Name: drive, dtype: int64
fig,ax=plt.subplots(figsize=(10,8))
fueleco.drive.value_counts(dropna=False).plot.bar(ax=ax)
Out[73]: <AxesSubplot:>
fig.savefig('c5-aa',dpi=300)
",".join('abcd')
Out[76]: 'a,b,c,d'
fueleco.city08.sample(5,random_state=40)
Out[80]:
4643 16
1483 15
34149 21
563 14
2364 19
Name: city08, dtype: int64
fueleco.city08.sample(5,random_state=42)
Out[81]:
4217 11
1736 21
36029 16
37631 16
1668 17
Name: city08, dtype: int64
# use pandas to plot a histogram
import matplotlib.pyplot as plt
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax)
Out[85]: <AxesSubplot:>
fig.savefig('hist.png',dpi=300)
# the plot looks very skewed, so increase the number of bins to see whether the skew is hiding any behavior
# (the long tail stretches the range, which makes each of the default 10 bins wider)
fig,ax=plt.subplots(figsize=(10,8))
fueleco.city08.hist(ax=ax,bins=30)
Out[90]: <AxesSubplot:>
fig.savefig("hist-bins-30.png",dpi=300)
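# use seaborn to create a distribution plot, which includes a histogram, a kernel density estimate (KDE), and a rug plot
fig,ax=plt.subplots(figsize=(10,8))  # fresh figure so the plot is not drawn over the previous histogram
sns.distplot(fueleco.city08,rug=True,ax=ax)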
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
D:\PyCharm2020\python2020\lib\site-packages\seaborn\distributions.py:2056: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
warnings.warn(msg, FutureWarning)
Out[96]: <AxesSubplot:xlabel='city08', ylabel='Density'>
fig.savefig('rugplot.png',dpi=300)
[160(181/627)]
Introduction to the seaborn plotting functions: boxplot, boxenplot, and violin plots
[162(183/627)]
Checking graphically whether the data follow a normal distribution:
fig,ax=plt.subplots(nrows=3,figsize=(10,8))
# pass the column by keyword (x=) rather than positionally
sns.boxplot(x=fueleco.city08,ax=ax[0])
sns.violinplot(x=fueleco.city08,ax=ax[1])
sns.boxenplot(x=fueleco.city08,ax=ax[2])
fig.savefig('subplots_nroes3.png',dpi=300)
from scipy import stats
stats.kstest(fueleco.city08,cdf='norm')
Out[104]: KstestResult(statistic=0.9999999990134123, pvalue=0.0)
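A statistic of almost exactly 1 is what one would expect for any data centred near 18, because `cdf='norm'` with no further arguments means the standard normal with mean 0 and standard deviation 1. The sketch below uses a synthetic sample (not the vehicle data) to show the effect of standardising first. Estimating the mean and standard deviation from the same sample makes the reported p-value only approximate, so treat this as an illustration rather than a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, genuinely normal sample centred far from 0 (mean 18, sd 3)
sample = rng.normal(loc=18, scale=3, size=2000)

# cdf="norm" with no extra arguments is the standard normal N(0, 1)
raw = stats.kstest(sample, cdf="norm")

# Standardise first so the comparison is about shape, not location or scale
z = (sample - sample.mean()) / sample.std(ddof=1)
standardised = stats.kstest(z, cdf="norm")

print(raw.statistic, standardised.statistic)
```

Even normal data looks maximally non-normal in the raw comparison, so the result above says little about the shape of city08 by itself; the probability plot that follows is the more informative check.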
fig,ax=plt.subplots(figsize=(10,8))
stats.probplot(fueleco.city08,plot=ax)
Out[106]:
((array([-4.1352692 , -3.92687024, -3.81314873, ..., 3.81314873,
3.92687024, 4.1352692 ]),
array([ 6, 6, 6, ..., 137, 138, 150], dtype=int64)),
(5.385946629915974, 18.077798521776934, 0.772587941459713))
fig.savefig('proplor.png',dpi=300)
Comparing continuous values across categories
# make a mask for the brands we want
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"])
mask
Out[112]:
0 False
1 False
2 False
3 False
4 False
...
39096 False
39097 False
39098 False
39099 False
39100 False
Name: make, Length: 39101, dtype: bool
type(mask)
Out[113]: pandas.core.series.Series
fueleco[mask]
Out[114]:
barrels08 barrelsA08 charge120 ... tCharger sCharger atvType
20 20.600625 0.0 0.0 ... NaN NaN NaN
21 20.600625 0.0 0.0 ... NaN NaN NaN
22 25.354615 0.0 0.0 ... NaN NaN NaN
56 15.695714 0.0 0.0 ... NaN NaN NaN
57 17.347895 0.0 0.0 ... NaN NaN NaN
... ... ... ... ... ... ...
39016 13.733750 0.0 0.0 ... NaN NaN NaN
39017 17.347895 0.0 0.0 ... NaN NaN NaN
39018 15.695714 0.0 0.0 ... NaN NaN NaN
39023 14.982273 0.0 0.0 ... NaN NaN NaN
39025 13.733750 0.0 0.0 ... NaN NaN NaN
[5986 rows x 70 columns]
fueleco[mask].groupby("make").city08.agg(["mean","std"])
Out[115]:
mean std
make
BMW 17.817377 7.372907
Ford 16.853803 6.701029
Honda 24.372973 9.154064
Tesla 92.826087 5.538970
# the groupby operation above reports the mean and std of the city08 column for each make
# visualize the city08 values for each make with seaborn
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box')
g.ax.figure.savefig("c5-catbox.png",dpi=300)
# one drawback of a boxplot is that, while it indicates the spread of the data, it does not reveal how many samples are in each make
mask=fueleco.make.isin(["Ford","Honda","Tesla","BMW"]) # Boolean Series: each element says whether that row's make is one of these four values
fueleco[mask].groupby("make").city08.count()
Out[123]:
make
BMW 1807
Ford 3208
Honda 925
Tesla 46
Name: city08, dtype: int64
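The sample size can be reported next to the mean and standard deviation in a single aggregation. A sketch on made-up rows standing in for `fueleco[mask]`:

```python
import pandas as pd

# Made-up rows standing in for fueleco[mask]
cars = pd.DataFrame({
    "make": ["BMW", "BMW", "Ford", "Ford", "Ford", "Tesla"],
    "city08": [18, 20, 15, 17, 19, 92],
})

# One row per make: centre, spread, and number of samples
stats_by_make = cars.groupby("make")["city08"].agg(["mean", "std", "count"])
print(stats_by_make)
```

A group with a single row, like Tesla here, gets a missing std, which is itself a hint that the group is too small to summarise.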
# facet the grid by another feature
# break each of these new plots into its own graph by using the col parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018])
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',col='year',col_order=[2012,2014,2016,2018],col_wrap=2)
# col is the column to facet on; col_order sets the order of the panels
# embed the new dimension in the same plot by using the hue parameter
g=sns.catplot(x="make",y="city08",data=fueleco[mask],kind='box',hue='year',hue_order=[2012,2014,2016,2018])
Comparing two continuous columns
# if you have two columns with a high correlation to one another, you can often drop one of them as redundant
# compute the covariance of two columns when they are on the same scale
# fueleco is not defined in a fresh session (NameError: name 'fueleco' is not defined), so reload the data first
import pandas as pd
import numpy as np
fueleco=pd.read_csv("vehicles.csv",usecols=list(range(1,70,1)))
fueleco.city08.cov(fueleco.highway08)
Out[9]: 46.33326023673624
fueleco.city08.cov(fueleco.comb08)
Out[10]: 47.419946678190776
fueleco.city08.cov(fueleco.cylinders)
Out[11]: -5.931560263764768
# Pearson correlation between the two columns
fueleco.city08.corr(fueleco.highway08)
Out[13]: 0.932494506228495
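Covariance and Pearson correlation are linked: the correlation is the covariance divided by the product of the two standard deviations. Using the numbers already shown, 46.333 / (6.970672 × 7.128070) ≈ 0.9325, which agrees with the output above. A sketch of the same identity on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up paired measurements
city = pd.Series([19, 9, 23, 10, 17, 30])
highway = pd.Series([25, 14, 33, 15, 23, 38])

cov = city.cov(highway)
# Pearson r is the covariance divided by both standard deviations
r_manual = cov / (city.std() * highway.std())
r_pandas = city.corr(highway)

print(cov, r_manual, r_pandas)
```

This is why the covariance is only comparable between pairs of columns on the same scale, while the correlation always lies between -1 and 1.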
# use pandas to scatter plot the relationship
import matplotlib.pyplot as plt
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax)
Out[17]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[18]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fig.savefig('scatterplot_alpha0.01.png',dpi=300)
fig,ax=plt.subplots(figsize=(8,8))
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.01)
Out[21]: <AxesSubplot:xlabel='city08', ylabel='highway08'>
fueleco.plot.scatter(x="city08",y="highway08",ax=ax,alpha=0.1)
# Pearson correlation is intended to show the strength of a linear relationship.
# if the continuous columns do not have a linear relationship, another option is to use Spearman correlation
# this number also varies from -1 to 1
# it measures whether the relationship is monotonic and doesn't presume that it is linear
# it uses the rank of each number rather than the number itself; if you are not sure whether there is a linear relationship between your columns, this is a better metric to use
fueleco.city08.corr(fueleco.barrelsA08,method='spearman')
Out[31]: -0.08476703673460519
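A sketch of the rank idea on made-up data: the relationship below is perfectly monotonic but strongly curved, so Spearman reports 1 while Pearson does not. The last line computes Spearman by hand as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)            # strictly increasing, but far from a straight line

pearson = x.corr(y)                        # penalised by the curvature
spearman = x.corr(y, method="spearman")    # only the ordering matters
rank_pearson = x.rank().corr(y.rank())     # Spearman computed by hand

print(pearson, spearman, rank_pearson)
```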
# Pearson correlation tells us how strongly two values move together along a straight line; it does not show that one value impacts the other
# covariance lets us know how these values vary together
# a heatmap is a great way to look at correlations in aggregate
# scatter plots are another way to visualize the relationships between continuous variables
# set alpha parameter to a value less than or equal to 0.5, which makes the points transparent
# now, add more dimensions to a scatter plot
# using the relplot function, we can color the dots by year and size them by the number of barrels the vehicle consumes
# in this case, we go from 2 dimensions to 4 dimensions
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8)
res=sns.relplot(x="city08",y="highway08",data=fueleco.assign(cylinders=fueleco.cylinders.fillna(0)),hue='year',size='barrelsA08',alpha=0.5,height=8,col='make',col_order=['Ford',"Tesla"])
Comparing categorical values with categorical values
# continuous columns can be converted into categorical columns by binning the values
[179(200/627)] I did not really follow what this part is doing...
# if you use seaborn, you can add multiple dimensions by setting 'hue' or 'col'
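One common way to compare two categorical columns is a cross-tabulation of how often each pair of values occurs together. The sketch below is an assumption about what helps here, not a reproduction of the book passage; the rows and the band labels `low`, `medium`, `high` are made up. It bins a continuous column with `pd.cut` and then counts the pairs with `pd.crosstab`.

```python
import pandas as pd

# Made-up rows; 'city08' is continuous, 'drive' is categorical
cars = pd.DataFrame({
    "drive": ["FWD", "FWD", "RWD", "RWD", "AWD", "FWD"],
    "city08": [30, 22, 14, 16, 19, 12],
})

# Bin the continuous column with explicit edges and readable labels
mpg_band = pd.cut(cars["city08"], bins=[0, 15, 25, 200],
                  labels=["low", "medium", "high"])

# Count how often each pair of categories occurs together
table = pd.crosstab(cars["drive"], mpg_band)
print(table)
```

Each cell is the number of rows with that drive type and that mileage band, so the table is the categorical counterpart of a scatter plot.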
Using the pandas profiling library
[187(208/627)]