Using Machine Learning to Test Your "Heart-Flutter" Index

Downloading the Dataset

[1]:

 

!wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data%20Key.doc
--2020-08-12 11:06:06--  https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data%20Key.doc
Resolving pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)... 47.95.85.22
Connecting to pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)|47.95.85.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161792 (158K) [application/msword]
Saving to: 'Speed Dating Data Key.doc'

100%[======================================>] 161,792     --.-K/s   in 0.07s   

2020-08-12 11:06:06 (2.08 MB/s) - 'Speed Dating Data Key.doc' saved [161792/161792]

[2]:

 

!wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data.csv
--2020-08-12 11:06:07--  https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data.csv
Resolving pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)... 47.95.85.22
Connecting to pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)|47.95.85.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5192296 (5.0M) [text/csv]
Saving to: 'Speed Dating Data.csv'

100%[======================================>] 5,192,296   18.7MB/s   in 0.3s   

2020-08-12 11:06:08 (18.7 MB/s) - 'Speed Dating Data.csv' saved [5192296/5192296]

Installing and Importing the Required Packages

[3]:

 

!pip install palettable --user
Collecting palettable
  Downloading https://mirrors.aliyun.com/pypi/packages/ca/46/5198aa24e61bb7eef28d06cb69e56bfa1942f4b6807d95a0b5ce361fe09b/palettable-3.3.0-py2.py3-none-any.whl (111kB)
    100% |################################| 112kB 1.1MB/s 
Installing collected packages: palettable
Successfully installed palettable-3.3.0
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

[4]:

 

!pip install imbalanced-learn --user
Collecting imbalanced-learn
  Downloading https://mirrors.aliyun.com/pypi/packages/c8/81/8db4d87b03b998fda7c6f835d807c9ae4e3b141f978597b8d7f31600be15/imbalanced_learn-0.7.0-py3-none-any.whl (167kB)
    100% |################################| 174kB 11.0MB/s 
Requirement already satisfied: scikit-learn>=0.23 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: joblib>=0.11 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: numpy>=1.13.3 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/admin/.local/lib/python3.6/site-packages (from scikit-learn>=0.23->imbalanced-learn)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.7.0
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

[6]:

 

!pip install seaborn --user
Collecting seaborn
  Downloading https://mirrors.aliyun.com/pypi/packages/c7/e6/54aaaafd0b87f51dfba92ba73da94151aa3bc179e5fe88fc5dfb3038e860/seaborn-0.10.1-py3-none-any.whl (215kB)
    100% |################################| 225kB 13.2MB/s 
Requirement already satisfied: matplotlib>=2.1.2 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: numpy>=1.13.3 in /home/admin/.local/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: pandas>=0.22.0 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: scipy>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.22.0->seaborn)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib>=2.1.2->seaborn)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn)
Installing collected packages: seaborn
Successfully installed seaborn-0.10.1
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

[8]:

 

# importing packages
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
import imblearn
from palettable.colorbrewer.qualitative import Pastel1_3

EDA: Exploratory Data Analysis

[9]:

 

df = pd.read_csv('Speed Dating Data.csv', encoding='gbk')
df.head()

[9]:

   iid   id  gender  idg  condtn  wave  round  position  positin1  order  ...  attr3_3  sinc3_3  intel3_3  fun3_3  amb3_3  attr5_3  sinc5_3  intel5_3  fun5_3  amb5_3
0    1  1.0       0    1       1     1     10         7       NaN      4  ...      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN     NaN     NaN
1    1  1.0       0    1       1     1     10         7       NaN      3  ...      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN     NaN     NaN
2    1  1.0       0    1       1     1     10         7       NaN     10  ...      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN     NaN     NaN
3    1  1.0       0    1       1     1     10         7       NaN      5  ...      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN     NaN     NaN
4    1  1.0       0    1       1     1     10         7       NaN      7  ...      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN     NaN     NaN

5 rows × 195 columns

[10]:

 

print(df.shape)
(8378, 195)

[11]:

 

# Compute the percentage of missing values for each feature
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({
    'column_name': df.columns,
    'percent_missing': percent_missing
})

[12]:

 

missing_value_df.sort_values(by='percent_missing', ascending=False).head(10)

[12]:

          column_name  percent_missing
num_in_3     num_in_3        92.026737
numdat_3     numdat_3        82.143710
expnum         expnum        78.515159
sinc7_2       sinc7_2        76.665075
amb7_2         amb7_2        76.665075
shar7_2       shar7_2        76.438291
attr7_2       attr7_2        76.318931
intel7_2     intel7_2        76.318931
fun7_2         fun7_2        76.318931
amb5_3         amb5_3        75.936978
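
Several columns are missing in more than three quarters of the rows and carry little usable signal. A minimal sketch for dropping them (the 75% cutoff below is an illustrative assumption, not a threshold used in the original analysis):

# Hypothetical cleanup step: drop every column whose missing rate exceeds 75%.
threshold = 75.0  # assumed cutoff, adjust as needed
cols_to_drop = missing_value_df.loc[missing_value_df.percent_missing > threshold, 'column_name'].tolist()
df_reduced = df.drop(columns=cols_to_drop)
print(df_reduced.shape)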

How Many People Found a Match Through Speed Dating

[13]:

 

# How many people found a match through speed dating
plt.subplots(figsize=(3, 3), dpi=110)
# Prepare the data
size_of_groups = df.match.value_counts().values
single_percentage = round(size_of_groups[0] / sum(size_of_groups) * 100, 2)
matched_percentage = round(size_of_groups[1] / sum(size_of_groups) * 100, 2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']

# Create the pie chart
plt.pie(
    size_of_groups, 
    labels=names, 
    labeldistance=1.2, 
    colors=Pastel1_3.hex_colors
)
plt.show()


[14]:

 

# How many women found a match through speed dating
plt.subplots(figsize=(3, 3), dpi=110)
# Prepare the data
size_of_groups = df[df.gender == 0].match.value_counts().values
single_percentage = round(size_of_groups[0] / sum(size_of_groups) * 100, 2)
matched_percentage = round(size_of_groups[1] / sum(size_of_groups) * 100, 2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']

# Create the pie chart
plt.pie(
    size_of_groups, 
    labels=names, 
    labeldistance=1.2, 
    colors=Pastel1_3.hex_colors
)
plt.show()

[15]:

 

# How many men found a match through speed dating
plt.subplots(figsize=(3, 3), dpi=110)
# Prepare the data
size_of_groups = df[df.gender == 1].match.value_counts().values
single_percentage = round(size_of_groups[0] / sum(size_of_groups) * 100, 2)
matched_percentage = round(size_of_groups[1] / sum(size_of_groups) * 100, 2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']

# Create the pie chart
plt.pie(
    size_of_groups, 
    labels=names, 
    labeldistance=1.2, 
    colors=Pastel1_3.hex_colors
)
plt.show()

Age Distribution

[16]:

 

# Age distribution (drop rows with missing age before plotting)
age = df[np.isfinite(df['age'])]['age']
plt.hist(age,bins=35)
plt.xlabel('Age')
plt.ylabel('Frequency')

[16]:

Text(0, 0.5, 'Frequency')

[17]:

 

date_df = df[[
    'iid', 'gender', 'pid', 'match', 'int_corr', 'samerace', 'age_o',
       'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb',
       'pf_o_sha', 'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'like_o',
       'prob_o', 'met_o', 'age', 'race', 'imprace', 'imprelig', 'goal', 'date',
       'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining',
       'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
       'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'attr1_1',
       'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'attr3_1', 'sinc3_1',
       'fun3_1', 'intel3_1', 'dec', 'attr', 'sinc', 'intel', 'fun', 'like',
       'prob', 'met'
]]

[18]:

 

# heatmap
plt.subplots(figsize=(20,15))
ax = plt.axes()
ax.set_title("Correlation Heatmap")
corr = date_df.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9662c23f98>
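
The heatmap is easier to act on if you also pull out the single column of correlations against the target. A small follow-up sketch (not part of the original notebook) that ranks features by absolute correlation with match:

# Rank the candidate features by the strength of their correlation with 'match'.
match_corr = corr['match'].drop('match').abs().sort_values(ascending=False)
print(match_corr.head(10))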

Model Building

Data Preparation

[19]:

 

# Preparing the data: keep only the partner ratings and the match label.
# Use .copy() so dropna() modifies an independent DataFrame rather than a
# view of df, which avoids pandas' SettingWithCopyWarning.
clean_df = df[['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'match']].copy()
clean_df.dropna(inplace=True)
X = clean_df[['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o']]
y = clean_df['match']

[20]:

 

oversample = imblearn.over_sampling.SVMSMOTE()
X, y = oversample.fit_resample(X, y)
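
SVMSMOTE synthesizes new minority-class samples so that both classes are equally represented before training. A quick sanity check, added here as an assumption rather than taken from the original notebook, to confirm the class balance before and after resampling:

# Compare the class distribution of the original labels with the resampled ones.
from collections import Counter
print('original:', Counter(clean_df['match']))
print('resampled:', Counter(y))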

[21]:

 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

Model Training

[22]:

 

 
# logistic regression classification model
model = LogisticRegression(C=1, random_state=0)
lrc = model.fit(X_train, y_train)
predict_train_lrc = lrc.predict(X_train)
predict_test_lrc = lrc.predict(X_test)
print('Training Accuracy:', metrics.accuracy_score(y_train, predict_train_lrc))
print('Validation Accuracy:', metrics.accuracy_score(y_test, predict_test_lrc))
Training Accuracy: 0.765040825096691
Validation Accuracy: 0.7555841924398625
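
RandomForestClassifier and GradientBoostingClassifier are imported above but never trained. A minimal sketch for comparing them on the same split (the hyperparameters below are illustrative assumptions, not tuned values):

# Train the two tree-based models imported earlier and report validation accuracy.
for name, clf in [
    ('RandomForest', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('GradientBoosting', GradientBoostingClassifier(random_state=0)),
]:
    clf.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    print(name, 'Validation Accuracy:', acc)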

Test Results

[24]:

 

lrc.predict_proba([[8.0,6.0,7.0,7.0,6.0,8.0,]])

[24]:

array([[0.29710471, 0.70289529]])
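
The two numbers are the predicted probabilities for class 0 (no match) and class 1 (match), so this hypothetical rating profile is scored at roughly a 70% chance of a match. Because the training data were oversampled, accuracy alone can be misleading; a short sketch (not in the original notebook) for checking per-class precision and recall on the held-out set:

# Per-class precision/recall and the confusion matrix on the test split.
print(metrics.classification_report(y_test, predict_test_lrc))
print(metrics.confusion_matrix(y_test, predict_test_lrc))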
