A Written Test from a Machine Learning Company

我是无名的我

Category: sklearn. Tags: machine learning

Article link: https://blog.csdn.net/qq_39821554/article/details/88580770

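I applied to a company and they sent me a written test, so I worked through it. If the company objects, I will take this post down promptly.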

Question 1:

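Use whichever classification algorithm you think works best (e.g., SVM, RF, ANN, etc.) to build a classification model on the Test1 data, and use 5-fold cross-validation to compute the AUC or F-measure of the predictions. If your computer is not powerful enough, 3-fold is acceptable. Algorithms that already include cross-validation internally do not need extra cross-validation to stabilize the result. (PS: for students outside computer science, taking Python or R as an example, reference code for the various classification algorithms and for computing AUC can be found online.)
Optimize the two parameters you consider most important in this algorithm with a grid search, and plot the process. Output a 3D plot of the parameter optimization (2 parameter variables + the 5- or 3-fold AUC or F-measure value). The question included a schematic of such a plot, which is not reproduced here.
The submission must include:
a) Code that runs as-is (mind the file paths).
b) The plot of the parameter optimization process.
c) A short write-up.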
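
To make the evaluation step in the question concrete, here is a minimal sketch of 5-fold cross-validation that reports both AUC and F-measure. It runs on synthetic data from `make_classification` as a stand-in for the Test1 files, and the SVC settings are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the Test1 data (assumption: binary labels).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold stratified split, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score the same folds with both metrics named in the question.
cv_result = cross_validate(
    SVC(kernel='rbf', gamma='scale', C=1.0),  # illustrative, untuned settings
    X_demo, y_demo, cv=cv, scoring=['roc_auc', 'f1'],
)

mean_auc = cv_result['test_roc_auc'].mean()
mean_f1 = cv_result['test_f1'].mean()
print(f"mean AUC = {mean_auc:.3f}, mean F1 = {mean_f1:.3f}")
```

Stratified folds keep the class ratio roughly constant in every fold, which tends to make per-fold AUC less noisy on small data sets.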

My code

from sklearn.svm import SVC
import numpy as np
import pandas as pd
from sklearn import model_selection


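'''
Following related articles, I chose an SVM with the RBF kernel and treated gamma and C as its
two most important parameters. The model is built and fitted with the sklearn library.
C ranges from 10^(-4) to 10^(4); gamma takes the values [0.1, 0.2, 0.4, 0.6,
0.8, 1.6, 3.2, 6.4, 12.8]. Cross-validation uses sklearn's model_selection module,
with 3 folds and AUC as the scoring metric. The plot is drawn with matplotlib.pyplot.
Looking at the maximum roc_auc (max(roc_auc) = 0.980669), the best parameters are
gamma = 1.6, C = 204.082.
'''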

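# Load features and labels; neither file has a header row.
Test1_features = pd.read_csv('Test1_features.dat', header=None)
Test1_labels = pd.read_csv('Test1_labels.dat', header=None)

X = Test1_features
y = Test1_labels.values.ravel()  # flatten to 1-D; sklearn warns when y is a column vector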

# C is sampled linearly (50 evenly spaced points), not on a log scale.
C_test = np.linspace(10**-4, 10**4, 50)
gamma_test = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.6, 3.2, 6.4, 12.8])

# Grid search: mean 3-fold cross-validated AUC for every (gamma, C) pair.
results = []
for g in gamma_test:
    for c in C_test:
        clf = SVC(gamma=g, C=c)
        scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='roc_auc', n_jobs=-1)  # cross-validation
        results.append((g, c, scores.mean()))
auc_scores = pd.DataFrame(results, columns=['gamma', 'C', 'auc'])

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(auc_scores['gamma'].to_numpy(),
                auc_scores['C'].to_numpy(),
                auc_scores['auc'].to_numpy())
ax.set_xlabel('gamma')
ax.set_ylabel('C')
ax.set_zlabel('auc_scores')  # pyplot has no zlabel(); the z label is set on the 3D axes
plt.show()
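
The double loop above can also be written with sklearn's `GridSearchCV`, which runs the same per-combination cross-validation and keeps the scores in `cv_results_`. The sketch below is only an illustration: it uses synthetic data from `make_classification`, and the small log-spaced grid for `C` is an assumption, not the grid used in this post. Pivoting the scores into a gamma-by-C table gives the layout a surface plot expects.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the Test1 data (assumption).
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

# Hypothetical grid: log-spaced C covers several orders of magnitude evenly.
param_grid = {
    'C': np.logspace(-2, 2, 5),
    'gamma': [0.01, 0.1, 1.0],
}

# 3-fold cross-validated AUC for every (C, gamma) combination.
search = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='roc_auc', cv=3)
search.fit(X_demo, y_demo)

# Collect the scores and reshape them into a gamma x C table.
cv_table = pd.DataFrame({
    'gamma': np.asarray(search.cv_results_['param_gamma'], dtype=float),
    'C': np.asarray(search.cv_results_['param_C'], dtype=float),
    'auc': search.cv_results_['mean_test_score'],
})
score_grid = cv_table.pivot(index='gamma', columns='C', values='auc')

print(search.best_params_, round(search.best_score_, 4))
print(score_grid.shape)
```

The rows and columns of `score_grid` can be turned into coordinate arrays with `np.meshgrid` and passed to `ax.plot_surface` to draw the optimization surface.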

Question 2:

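Run an association rule analysis on the Test2 data. The mined rules must satisfy the following conditions:
a) The right-hand side of a rule contains exactly one item, namely the last column of Test2_data, "label" (e.g. A -> label, BC -> label, AD -> label).
b) The left-hand side contains at most two items.
c) Support of 0.1 or above and confidence of 0.7 or above.
The submission must include:
a) Code that runs as-is (mind the file paths).
b) A short write-up that lists the rules found.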
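
As a small worked example of the two thresholds, the sketch below computes support and confidence for a rule `lhs -> Label` on a made-up 10-row table. It assumes the usual definitions: support is the fraction of rows where the left-hand side and the label are all 1, and confidence is that count divided by the number of rows where the left-hand side is all 1. The helper name `rule_metrics` and the toy columns are hypothetical.

```python
import pandas as pd

# Made-up binary data, for illustration only.
toy = pd.DataFrame({
    'A':     [1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    'B':     [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    'Label': [1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
})

def rule_metrics(df, lhs, rhs='Label'):
    """Return (support, confidence) of the rule lhs -> rhs on 0/1 data."""
    lhs_mask = (df[list(lhs)] == 1).all(axis=1)   # rows where every lhs item is 1
    both_mask = lhs_mask & (df[rhs] == 1)         # rows where lhs and rhs are all 1
    support = both_mask.mean()                    # fraction of all rows
    confidence = both_mask.sum() / lhs_mask.sum() # fraction of the lhs rows
    return float(support), float(confidence)

print(rule_metrics(toy, ['A']))       # A -> Label
print(rule_metrics(toy, ['A', 'B']))  # AB -> Label
```

Here `A -> Label` holds in 4 of the 5 rows that contain A, so its support is 0.4 and its confidence 0.8; both clear the 0.1 and 0.7 thresholds in the question.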

import numpy as np
import pandas as pd
import itertools
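'''
I did not find a ready-made Python library that fit these requirements, so I wrote the code
myself, following the Apriori literature and the constraints in the question.
1
    Go through every column, count the rows in which that column and Label are both selected,
    compute the support and confidence, and keep the first-stage frequent itemsets.

2
    Using itertools.combinations, pair up the columns kept in the first stage. Along the same
    lines as stage 1, count the rows in which both columns and Label are selected (three columns)
    and the rows in which both columns are selected (without Label), compute the support and
    confidence, and keep the second-stage frequent itemsets.
3
    Set min_support = 0.2 and min_confidence = 1.
    The resulting association rule is {('VTYPE_1', 'Label'): [0.23583333333333334, 1.0]},
    where the first value is the support and the second is the confidence.
'''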
dataSet = pd.read_csv('Test2_Data.csv')
# Note: these thresholds are stricter than the 0.1 and 0.7 stated in the question.
min_support = 0.2
min_confidence = 1
n_rows = dataSet.shape[0]  # named n_rows so the built-in len() is not shadowed

all_columns = dataSet.columns.values.tolist()  # column names of dataSet
lhs_columns = [x for x in all_columns if x != 'Label']  # drop Label: the question puts it on the right-hand side
label_mask = dataSet['Label'] == 1

'''
Stage 1: for each column, count the rows where the column and Label are both 1,
then compute the support and confidence.
'''
rule = dict()
frequent_columns = []
for column in lhs_columns:
    column_mask = dataSet[column] == 1
    count_column_label = (column_mask & label_mask).sum()
    support = np.divide(count_column_label, n_rows)
    confidence = np.divide(count_column_label, column_mask.sum())
    if (support >= min_support) and (confidence >= min_confidence):
        rule.setdefault((column, 'Label'), [support, confidence])
        frequent_columns.append(column)

'''
Stage 2: for each pair of columns kept in stage 1, count (1) the rows where both columns are 1
and (2) the rows where both columns and Label are 1, then compute the support and confidence.
'''
for columns in itertools.combinations(frequent_columns, 2):
    pair_mask = (dataSet[list(columns)] == 1).all(axis=1)
    count_pair = pair_mask.sum()
    count_pair_label = (pair_mask & label_mask).sum()
    support = np.divide(count_pair_label, n_rows)
    confidence = np.divide(count_pair_label, count_pair)
    if (support >= min_support) and (confidence >= min_confidence):
        rule.setdefault((columns, 'Label'), [support, confidence])

print(rule)
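
One caveat about the code above: it builds two-column candidates only from columns whose own rule already passed both thresholds. Pruning on support is safe, because a pair can never be more frequent than either of its members, but confidence does not behave that way: `AB -> Label` can be confident even when `A -> Label` and `B -> Label` are not. The sketch below is a hedged alternative, not the method used in this post. It simply enumerates every left-hand side of up to two items; the function name `mine_label_rules`, its default thresholds (taken from the question), and the toy data are assumptions for illustration.

```python
import itertools
import pandas as pd

def mine_label_rules(df, min_support=0.1, min_confidence=0.7,
                     label='Label', max_lhs=2):
    """Return {lhs_tuple: (support, confidence)} for rules lhs -> label on 0/1 data."""
    n_rows = len(df)
    label_mask = df[label] == 1
    items = [c for c in df.columns if c != label]
    rules = {}
    for size in range(1, max_lhs + 1):
        for lhs in itertools.combinations(items, size):
            lhs_mask = (df[list(lhs)] == 1).all(axis=1)
            lhs_count = int(lhs_mask.sum())
            if lhs_count == 0:
                continue  # confidence is undefined when the lhs never occurs
            both_count = int((lhs_mask & label_mask).sum())
            support = both_count / n_rows
            confidence = both_count / lhs_count
            if support >= min_support and confidence >= min_confidence:
                rules[lhs] = (support, confidence)
    return rules

# Made-up data in which neither A nor B alone predicts Label, but A and B together do.
toy = pd.DataFrame({
    'A':     [1, 1, 1, 1, 0, 0, 0, 0],
    'B':     [1, 1, 0, 0, 1, 1, 0, 0],
    'Label': [1, 1, 0, 0, 0, 0, 1, 0],
})
rules = mine_label_rules(toy)
print(rules)
```

On this toy table both single-column rules have confidence 0.5 and are rejected, while `('A', 'B')` is kept with support 0.25 and confidence 1.0, a rule the two-stage filter would never have tested.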