Feature Engineering 1

Latest recommended article published on 2022-10-28 15:17:40

Up_梅子酒

Category: Feature Engineering  Tags: python

Article link: https://blog.csdn.net/eerywh/article/details/107648021

Chapter 2: Summary of the Levels of Data

import os 

os.listdir()

['.config', 'sample_data']

!git clone https://github.com/**********/Feature-Engineering-Made-Easy.git

Cloning into 'Feature-Engineering-Made-Easy'...
remote: Enumerating objects: 63, done.
remote: Total 63 (delta 0), reused 0 (delta 0), pack-reused 63
Unpacking objects: 100% (63/63), done.
Checking out files: 100% (62/62), done.

import numpy as np
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline 
plt.style.use('fivethirtyeight')

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

customer = pd.read_csv('/content/Feature-Engineering-Made-Easy/data/2013_SFO_Customer_survey.csv')
customer.shape

(3535, 95)

customer.head()

	RESPNUM	CCGID	RUN	INTDATE	GATE	STRATA	PEAK	METHOD	AIRLINE	FLIGHT	DEST	DESTGEO	DESTMARK	ARRTIME	DEPTIME	Q2PURP1	Q2PURP2	Q2PURP3	Q2PURP4	Q2PURP5	Q2PURP6	Q3GETTO1	Q3GETTO2	Q3GETTO3	Q3GETTO4	Q3GETTO5	Q3GETTO6	Q3PARK	Q4BAGS	Q4BUY	Q4FOOD	Q4WIFI	Q5FLYPERYR	Q6TENURE	SAQ	Q7A_ART	Q7B_FOOD	Q7C_SHOPS	Q7D_SIGNS	Q7E_WALK	...	Q9C_CLNRENT	Q9D_CLNFOOD	Q9E_CLNBATH	Q9F_CLNWHOLE	Q9COM1	Q9COM2	Q9COM3	Q10SAFE	Q10COM1	Q10COM2	Q10COM3	Q11A_USEWEB	Q11B_USESFOAPP	Q11C_USEOTHAPP	Q11D_USESOCMED	Q11E_USEWIFI	Q12COM1	Q12COM2	Q12COM3	Q13_WHEREDEPART	Q13_RATEGETTO	Q14A_FIND	Q14B_SECURITY	Q15_PROBLEMS	Q15COM1	Q15COM2	Q15COM3	Q16_REGION	Q17_CITY	Q17_ZIP	Q17_COUNTRY	HOME	Q18_AGE	Q19_SEX	Q20_INCOME	Q21_HIFLYER	Q22A_USESJC	Q22B_USEOAK	LANG	WEIGHT
0	1	1	1215	2	12	1	1	1	21	1437	49	1	1	8:34 AM	9:25 AM	1	8.0	NaN	NaN	NaN	NaN	2	10.0	NaN	NaN	NaN	NaN	NaN	2	1	2	2	6	2.0	1	3	4	3	3	3	...	3	3	4	4	NaN	NaN	NaN	5	1.0	NaN	NaN	2	2	2	2	2	NaN	NaN	NaN	5	3	3	3	2	NaN	NaN	NaN	1	SAN FRANCISCO	94131.0	US	1	2	1	1	2	2	1	1	0.553675
1	2	2	1215	2	12	1	1	1	21	1437	49	1	1	8:00 AM	9:25 AM	1	8.0	NaN	NaN	NaN	NaN	2	10.0	NaN	NaN	NaN	NaN	NaN	2	2	2	2	6	4.0	1	4	4	4	4	4	...	6	4	4	4	NaN	NaN	NaN	5	1.0	NaN	NaN	2	2	2	2	3	NaN	NaN	NaN	2	3	5	5	2	NaN	NaN	NaN	1	CONCORD	94521.0	US	5	6	1	0	3	2	1	1	0.553675
2	3	3	1215	2	12	1	1	1	21	1437	49	1	1	7:00 AM	9:25 AM	1	8.0	NaN	NaN	NaN	NaN	2	10.0	NaN	NaN	NaN	NaN	NaN	2	2	2	2	4	4.0	1	3	4	4	2	4	...	3	3	3	3	NaN	NaN	NaN	3	1.0	NaN	NaN	2	2	2	2	2	NaN	NaN	NaN	5	3	3	3	2	NaN	NaN	NaN	1	SAN FRANCISCO	94134.0	US	1	4	2	2	3	2	2	1	0.553675
3	4	4	1215	2	12	1	1	1	21	1437	49	1	1	7:30 AM	9:25 AM	1	8.0	NaN	NaN	NaN	NaN	1	10.0	NaN	NaN	NaN	NaN	1.0	1	2	1	2	3	4.0	2	3	3	3	4	4	...	5	5	5	5	NaN	NaN	NaN	5	NaN	NaN	NaN	2	2	2	2	2	NaN	NaN	NaN	5	3	5	5	2	NaN	NaN	NaN	1	NaN	NaN	US	90	4	1	2	2	2	2	1	0.553675
4	5	5	1215	2	12	1	1	1	21	1437	49	1	1	6:30 AM	9:25 AM	1	8.0	NaN	NaN	NaN	NaN	8	10.0	NaN	NaN	NaN	NaN	NaN	2	1	1	1	2	3.0	2	3	3	2	3	5	...	5	5	5	5	87.0	NaN	NaN	5	5.0	NaN	NaN	2	2	2	2	1	1.0	5.0	NaN	3	5	4	3	2	NaN	NaN	NaN	3	HUNTINGTON BEACH	92646.0	US	10	3	1	3	1	0	1	1	0.553675

5 rows × 95 columns

Ordinal level

art_rating = customer['Q7A_ART']
art_rating.describe()

count    3535.000000
mean        4.300707
std         1.341445
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         6.000000
Name: Q7A_ART, dtype: float64

art_ratings = art_rating[(art_rating>=1) & (art_rating<=5)]
art_ratings = art_ratings.astype(str)
art_ratings.describe()

count     2656
unique       5
top          4
freq      1066
Name: Q7A_ART, dtype: object

art_ratings.value_counts().plot(kind = 'bar');

[Figure: bar chart of art_ratings value counts]

art_ratings.value_counts().plot(kind = 'box')

[Figure: box plot of art_ratings value counts]
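At the ordinal level the categories have a rank order, but the distance between ranks is not defined, so the mode and the median are the natural summaries. The following is a minimal sketch on hypothetical toy ratings (not the survey data); treating 0 and 6 as non-answers is an assumption that mirrors the 1 to 5 filter used above.

```python
import pandas as pd

# Hypothetical toy ratings; 0 and 6 are assumed to be non-answers
toy_ratings = pd.Series([4, 5, 3, 4, 0, 6, 2, 4, 5, 1])

# Keep only valid ordinal answers (1 through 5)
valid = toy_ratings[(toy_ratings >= 1) & (toy_ratings <= 5)]

# An ordered categorical tells pandas that 1 < 2 < 3 < 4 < 5
ordered = pd.Categorical(valid, categories=[1, 2, 3, 4, 5], ordered=True)

# Mode and median are meaningful summaries for ordinal data
mode_value = valid.mode().iloc[0]
median_value = valid.median()

# Counts listed in rank order instead of by frequency
counts_in_order = pd.Series(ordered).value_counts().sort_index()
print(mode_value, median_value)
print(counts_in_order)
```

Sorting the counts by rank rather than by frequency keeps the bar chart readable as a distribution over the rating scale.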

Interval level

import zipfile

Dataset = "GlobalLandTemperaturesByCity.csv"

# Will unzip the files so that you can see them..
with zipfile.ZipFile("/content/Feature-Engineering-Made-Easy/data/"+Dataset+".zip","r") as z:
    z.extractall(".")

climate = pd.read_csv('/content/GlobalLandTemperaturesByCity.csv')
climate.head()

	dt	AverageTemperature	AverageTemperatureUncertainty	City	Country	Latitude	Longitude
0	1743-11-01	6.068	1.737	Århus	Denmark	57.05N	10.33E
1	1743-12-01	NaN	NaN	Århus	Denmark	57.05N	10.33E
2	1744-01-01	NaN	NaN	Århus	Denmark	57.05N	10.33E
3	1744-02-01	NaN	NaN	Århus	Denmark	57.05N	10.33E
4	1744-03-01	NaN	NaN	Århus	Denmark	57.05N	10.33E

climate.dropna(axis = 0,inplace = True)

climate.shape

(8235082, 7)

climate.isnull().sum()

dt                               0
AverageTemperature               0
AverageTemperatureUncertainty    0
City                             0
Country                          0
Latitude                         0
Longitude                        0
dtype: int64
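As a small illustration of what `dropna(axis=0)` does, the sketch below rebuilds three rows from the head of the climate table by hand (only two columns, chosen for brevity) and drops the row that contains a missing value. The frame name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Three rows copied from the head of the climate table, two columns only
toy = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-04-01"],
    "AverageTemperature": [6.068, np.nan, 5.788],
})

# Count missing values per column before cleaning
missing_before = toy.isnull().sum()

# axis=0 drops every row that has at least one missing value
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)
print(missing_before)
print(cleaned)
```

The surviving rows keep their original index labels, which is why the index in the cleaned climate data jumps from 0 to 5.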

climate.AverageTemperature.plot(kind='hist');

[Figure: histogram of AverageTemperature]

climate.dt.head()

0    1743-11-01
5    1744-04-01
6    1744-05-01
7    1744-06-01
8    1744-07-01
Name: dt, dtype: object

climate.dt = pd.to_datetime(climate.dt)

climate.dt.head()

0   1743-11-01
5   1744-04-01
6   1744-05-01
7   1744-06-01
8   1744-07-01
Name: dt, dtype: datetime64[ns]

climate['year'] = climate.dt.map(lambda value: value.year)

climate.head()

	dt	AverageTemperature	AverageTemperatureUncertainty	City	Country	Latitude	Longitude	year
0	1743-11-01	6.068	1.737	Århus	Denmark	57.05N	10.33E	1743
5	1744-04-01	5.788	3.624	Århus	Denmark	57.05N	10.33E	1744
6	1744-05-01	10.644	1.283	Århus	Denmark	57.05N	10.33E	1744
7	1744-06-01	14.051	1.347	Århus	Denmark	57.05N	10.33E	1744
8	1744-07-01	16.082	1.396	Århus	Denmark	57.05N	10.33E	1744
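Once a column has a datetime dtype, the year can also be read through the vectorized `.dt` accessor instead of a per-row lambda. A minimal sketch on a standalone Series (the names `dates`, `parsed` and `years` are illustrative only):

```python
import pandas as pd

# A few date strings in the same format as the climate data
dates = pd.Series(["1743-11-01", "1744-04-01", "1744-05-01"])

# Parse the strings into datetime values
parsed = pd.to_datetime(dates)

# The .dt accessor exposes datetime components for the whole Series at once
years = parsed.dt.year
print(years.tolist())
```

Because the column in this notebook is itself named `dt`, the equivalent call on the climate frame would read `climate['dt'].dt.year`.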

# Only look at the United States
climate_sub_us = climate.loc[climate['Country']== 'United States']

climate_sub_us.head()

	dt	AverageTemperature	AverageTemperatureUncertainty	City	Country	Latitude	Longitude	year
47555	1820-01-01	2.101	3.217	Abilene	United States	32.95N	100.53W	1820
47556	1820-02-01	6.926	2.853	Abilene	United States	32.95N	100.53W	1820
47557	1820-03-01	10.767	2.395	Abilene	United States	32.95N	100.53W	1820
47558	1820-04-01	17.989	2.202	Abilene	United States	32.95N	100.53W	1820
47559	1820-05-01	21.809	2.036	Abilene	United States	32.95N	100.53W	1820

# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised on the filtered slice
climate_sub_us = climate_sub_us.assign(Century=climate_sub_us['year'] // 100 + 1)

climate_sub_us['AverageTemperature'].hist(by=climate_sub_us['Century'],
 sharex=True, sharey=True, 
 figsize=(10, 10),
 bins=20);

[Figure: histograms of AverageTemperature for the United States, one panel per century]

climate_sub_us.groupby('Century')['AverageTemperature'].mean().plot(kind = 'line');

[Figure: line plot of mean AverageTemperature by century]

century_changes = climate_sub_us.groupby('Century')['AverageTemperature'].mean()

century_changes

Century
18    12.073243
19    13.662870
20    14.386622
21    15.197692
Name: AverageTemperature, dtype: float64
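The century bucketing and the per-century mean can be checked by hand on a tiny frame. This is a sketch with made-up temperatures (hypothetical values, chosen so the means are easy to verify), using the same "integer part of year / 100, plus one" rule as the Century column.

```python
import pandas as pd

# Hypothetical yearly temperatures, chosen so the group means are easy to check
toy = pd.DataFrame({
    "year": [1799, 1800, 1850, 1900, 1999, 2000, 2013],
    "AverageTemperature": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
})

# Integer part of year / 100, plus one
toy["Century"] = toy["year"] // 100 + 1

# Mean temperature per century bucket
century_means = toy.groupby("Century")["AverageTemperature"].mean()
print(century_means)
```

Under this rule a year ending in 00 (1800, 1900, 2000) is counted in the following century, which is worth keeping in mind when reading the per-century averages.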

# Plot two columns of data at the interval level
x = climate_sub_us['year']
y = climate_sub_us['AverageTemperature']
fig,ax = plt.subplots(figsize = (12,5))
ax.scatter(x,y)
plt.show()

[Figure: scatter plot of AverageTemperature against year]

climate_sub_us.groupby('year')['AverageTemperature'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fd022540c50>

[Figure: line plot of mean AverageTemperature by year]

# Smooth the plot with a rolling mean
climate_sub_us.groupby('year')['AverageTemperature'].mean().rolling(10).mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fd022121fd0>

[Figure: line plot of the 10-year rolling mean of yearly AverageTemperature]
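To see what `rolling(n).mean()` computes, here is a minimal sketch with a window of 3 on five hypothetical yearly values (the notebook uses a window of 10 on the real data). Each output value is the mean of the current value and the two before it, and the first `n - 1` positions are NaN because the window is not yet full.

```python
import pandas as pd

# Hypothetical yearly averages indexed by year
yearly = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 12.0],
    index=[2001, 2002, 2003, 2004, 2005],
)

# Mean over a sliding window of the current and two previous years
smoothed = yearly.rolling(3).mean()
print(smoothed)
```

A wider window gives a smoother curve at the cost of more leading NaN values and a slower response to real changes.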

Ratio level

salary_ranges = pd.read_csv('/content/Feature-Engineering-Made-Easy/data/Salary_Ranges_by_Job_Classification.csv')

salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].map(lambda value: value.replace('$',''))
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].astype(float)
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)

salary_ranges.head()

	SetID	Job Code	Eff Date	Sal End Date	Salary SetID	Sal Plan	Step	Biweekly High Rate	Biweekly Low Rate	Union Code	Pay Type
0	COMMN	0109	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	0.0	$0.00	330	C
1	COMMN	0110	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	15.0	$15.00	323	D
2	COMMN	0111	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	25.0	$25.00	323	D
3	COMMN	0112	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	50.0	$50.00	323	D
4	COMMN	0114	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	100.0	$100.00	323	M

salary_ranges.groupby('Grade')['Biweekly High Rate'].mean().sort_values(ascending = False).head(20).plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7fd020094400>

[Figure: bar chart of the 20 highest mean Biweekly High Rate values by Grade]

salary_ranges.head()

	SetID	Job Code	Eff Date	Sal End Date	Salary SetID	Sal Plan	Step	Biweekly High Rate	Biweekly Low Rate	Union Code	Pay Type
0	COMMN	0109	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	0.0	$0.00	330	C
1	COMMN	0110	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	15.0	$15.00	323	D
2	COMMN	0111	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	25.0	$25.00	323	D
3	COMMN	0112	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	50.0	$50.00	323	D
4	COMMN	0114	07/01/2009 12:00:00 AM	06/30/2010 12:00:00 AM	COMMN	SFM	1	100.0	$100.00	323	M

fig = plt.figure(figsize=(15,5))
ax = fig.gca()
salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values('Biweekly High Rate',ascending = False).head(20).plot.bar(stacked =False,ax = ax,color = 'darkorange')
ax.set_title('Top 20 Grade by Mean Biweekly High Rate')

Text(0.5, 1.0, 'Top 20 Grade by Mean Biweekly High Rate')

[Figure: Top 20 Grade by Mean Biweekly High Rate]

fig = plt.figure(figsize=(15,5))
ax = fig.gca()
salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values('Biweekly High Rate',ascending = False).tail(20).plot.bar(stacked =False,ax = ax,color = 'darkorange')
ax.set_title('Bottom 20 Grade by Mean Biweekly High Rate')

Text(0.5, 1.0, 'Bottom 20 Grade by Mean Biweekly High Rate')

[Figure: Bottom 20 Grade by Mean Biweekly High Rate]

sorted_df =salary_ranges.groupby('Grade')[['Biweekly High Rate']].mean().sort_values('Biweekly High Rate',ascending = False)

sorted_df.head()

	Biweekly High Rate
Grade
9186F	12120.77
0390F	11255.00
0140H	10843.00
0140F	10630.00
0395F	10376.00

sorted_df.iloc[0]

Biweekly High Rate    12120.77
Name: 9186F, dtype: float64

sorted_df.iloc[0, 0] / sorted_df.iloc[-1, 0]

13.931919540229886
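Data at the ratio level has a true zero, so dividing one value by another is meaningful. The sketch below repeats the steps of this section on a hypothetical four-row frame (the grades and rates are made up): strip the dollar sign, convert to float, average by grade, and divide the highest mean by the lowest.

```python
import pandas as pd

# Hypothetical pay data; the column names follow the salary table above
toy = pd.DataFrame({
    "Grade": ["A", "A", "B", "C"],
    "Biweekly High Rate": ["$100.00", "$300.00", "$50.00", "$1000.00"],
})

# Remove the literal dollar sign (regex=False) and convert to a number
toy["Biweekly High Rate"] = (
    toy["Biweekly High Rate"].str.replace("$", "", regex=False).astype(float)
)

# Mean rate per grade, highest first
by_grade = (
    toy.groupby("Grade")[["Biweekly High Rate"]]
    .mean()
    .sort_values("Biweekly High Rate", ascending=False)
)

# Ratio of the highest mean rate to the lowest mean rate
ratio = by_grade.iloc[0, 0] / by_grade.iloc[-1, 0]
print(by_grade)
print(ratio)
```

If the lowest mean were 0 the division would not produce a usable ratio, so it is worth checking the minimum before dividing.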
