Chapter 2 Exercises | MNIST Dataset | Titanic Dataset | Image Augmentation

代码魔法师！

Column: 机器学习实战 (Machine Learning in Practice). Tags: python, neural networks, machine learning

Article link: https://blog.csdn.net/lijiamingccc/article/details/119982574

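Three exercises

1. Build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set
2. Image augmentation: preprocess the images to improve MNIST recognition accuracy
3. Titanic dataset: survival prediction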

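1. Build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set

For a detailed description of the MNIST dataset, see Chapter 2 of this column. Anyone who has studied machine learning or deep learning will likely be familiar with it already.

The knn_clf classifier we already have exposes several hyperparameters. We use grid search to find the best combination, then evaluate the best model on the test set.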

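# 1. Build a classifier for MNIST that exceeds 97% accuracy on the test set
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# x_train_lit, y_train_lit, x_test_lit and y_test_lit are assumed to be the
# MNIST train/test splits prepared earlier in this column.
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
# Best parameters found: {'weights': 'distance', 'n_neighbors': 4}

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=2, verbose=2)
grid_search.fit(x_train_lit, y_train_lit)

print(grid_search.best_params_)
# {'n_neighbors': 4, 'weights': 'distance'}

print(grid_search.best_score_)
# 0.9716166666666666

# By default GridSearchCV refits the best model on the whole training set,
# so predict() uses that refitted model.
y_pred = grid_search.predict(x_test_lit)
print(accuracy_score(y_test_lit, y_pred))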
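
To make the search itself less of a black box, here is a minimal pure-Python sketch of the idea behind grid search: try every combination in the grid, score each one with k-fold cross-validation, and keep the best. The toy data, the brute-force kNN, and the helper names (knn_predict, cv_accuracy, grid_search) are illustrative assumptions, not scikit-learn's implementation.

```python
from collections import defaultdict
from itertools import product


def knn_predict(train, point, n_neighbors, weights):
    """Classify `point` by a vote among its nearest neighbours in `train`."""
    # train is a list of ((x, y), label) pairs
    by_distance = sorted(
        (((px - point[0]) ** 2 + (py - point[1]) ** 2) ** 0.5, label)
        for (px, py), label in train
    )
    votes = defaultdict(float)
    for dist, label in by_distance[:n_neighbors]:
        # "uniform": one vote each; "distance": closer neighbours count more
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)


def cv_accuracy(data, n_folds, **params):
    """Mean validation accuracy over n_folds interleaved splits."""
    scores = []
    for fold in range(n_folds):
        valid = data[fold::n_folds]
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, p, **params) == y for p, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / n_folds


def grid_search(data, param_grid, n_folds=2):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    results = {}
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results[values] = cv_accuracy(data, n_folds, **params)
    best = max(results, key=results.get)
    return dict(zip(names, best)), results[best]


# Made-up toy data: two well-separated clusters of nine points each
toy = [((x, y), 0) for x in range(3) for y in range(3)]
toy += [((x + 10, y + 10), 1) for x in range(3) for y in range(3)]

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
best_params, best_score = grid_search(toy, param_grid)
print(best_params, best_score)
```

With a 2 x 3 grid and 2 folds this trains and scores 12 models, which is why the real search on MNIST is slow even with cv=2.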

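2. Image augmentation: preprocess the images to improve MNIST recognition accuracy

Define a shift_image function that takes dx and dy.
Positive values move the image down by dy pixels and right by dx pixels;
negative values move it in the opposite direction.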

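# scipy.ndimage.interpolation is a deprecated namespace; import from scipy.ndimage
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    # Move the image down by dy pixels and right by dx pixels;
    # negative values move it in the opposite direction.
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])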
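
The direction convention is easy to get backwards, so here is a small sketch that mimics it on a 5x5 array using plain NumPy slicing rather than scipy. The helper name shift_zero_fill is hypothetical; it only handles whole-pixel shifts and is meant to show which way positive dx and dy move the content.

```python
import numpy as np


def shift_zero_fill(image, dx, dy):
    """Shift a 2-D array right by dx and down by dy, filling vacated cells with 0."""
    h, w = image.shape
    out = np.zeros_like(image)
    # Overlapping source/destination windows; clamped so that shifts larger
    # than the image simply produce an all-zero result.
    src_rows = slice(max(0, -dy), max(0, min(h, h - dy)))
    dst_rows = slice(max(0, dy), max(0, min(h, h + dy)))
    src_cols = slice(max(0, -dx), max(0, min(w, w - dx)))
    dst_cols = slice(max(0, dx), max(0, min(w, w + dx)))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out


image = np.zeros((5, 5), dtype=int)
image[2, 2] = 1  # a single "ink" pixel in the centre (row 2, column 2)

down = shift_zero_fill(image, dx=0, dy=1)   # pixel moves to row 3
left = shift_zero_fill(image, dx=-1, dy=0)  # pixel moves to column 1
print(np.argwhere(down), np.argwhere(left))
```

Rows are axis 0, which is why shift_image passes the offsets as [dy, dx] rather than [dx, dy].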

Take the image at index 100 as an example and shift it down and to the left:

image = x_train_lit[100]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

Visualize the result:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

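Result
[Figure: the original digit, the version shifted down, and the version shifted left]
The result looks reasonable, although the 5-pixel downward shift goes a little too far. In practice, keep the shifts small, especially in the vertical direction.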

import numpy as np

# Copy the original training set into lists
X_train_augmented = [image for image in x_train_lit]
y_train_augmented = [label for label in y_train_lit]

# Add a one-pixel shift in each of the four directions to the training data
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(x_train_lit, y_train_lit):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

# Convert the lists back to NumPy arrays for faster computation
X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

# Create a new classifier with the best parameters found by the grid search
# knn_clf = grid_search.best_estimator_
knn_clf = KNeighborsClassifier(**grid_search.best_params_)

# Train on the augmented set (five times the size of the original)
knn_clf.fit(X_train_augmented, y_train_augmented)

y_pred = knn_clf.predict(x_test_lit)
# Fixed: the original scored against y_test, which is not defined here
print(accuracy_score(y_test_lit, y_pred))

This raises the accuracy by about 0.5%.

3. Titanic dataset: survival prediction

Data download:
https://www.kaggle.com/c/titanic

About the dataset:
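The attributes have the following meanings:
Survived: the target; 0 means the passenger did not survive, 1 means the passenger survived.
Pclass: passenger class.
Name, Sex, Age: name, sex, and age.
SibSp: number of siblings/spouses aboard.
Parch: number of parents/children aboard.
Ticket: ticket ID.
Fare: price paid (in pounds).
Cabin: cabin number.
Embarked: where the passenger boarded the Titanic.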

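Configure pandas to display all columns

import pandas as pd

# Use the full option name; the short form "max_columns" is ambiguous in newer pandas versions
pd.set_option("display.max_columns", None)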

Load the dataset

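import os

# "tantic" is the directory name used by the author; adjust it to your own layout
TITANIC_PATH = os.path.join("datasets", "tantic")

def load_titanic_data(filename, filepath=TITANIC_PATH):
    csv_path = os.path.join(filepath, filename)
    return pd.read_csv(csv_path)

# Read the files
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")


# Check which columns have missing values
train_data.info()


# Summary statistics for the numeric attributes
train_data.describe()

# Number of survivors vs. non-survivors
train_data["Survived"].value_counts()

# Number of passengers in 1st, 2nd and 3rd class
train_data["Pclass"].value_counts()

# Number of male vs. female passengers
train_data["Sex"].value_counts()

# Port of embarkation: C=Cherbourg, Q=Queenstown, S=Southampton
train_data["Embarked"].value_counts()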

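1. The raw age value on its own is of limited use. Grouping ages into buckets may be more informative, for example if younger passengers turn out to have been more likely to survive.
2. Likewise, SibSp and Parch are not very informative individually. Combining them into a single count of relatives aboard may work better.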

train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
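
The expression `Age // 15 * 15` maps each age to the lower edge of its 15-year bucket. A minimal pure-Python sketch, using made-up (age, survived) pairs rather than the real data, shows the mapping and how a survival rate per bucket could then be computed. The helper name age_bucket is hypothetical.

```python
import math
from collections import defaultdict


def age_bucket(age, width=15):
    """Lower edge of the `width`-year bucket that `age` falls into."""
    return age // width * width


# Made-up (age, survived) pairs for illustration only
passengers = [(4.0, 1), (22.0, 0), (26.0, 1), (38.0, 1), (35.0, 0), (61.0, 0)]

# Collect the survival flags per bucket, then average them
flags_by_bucket = defaultdict(list)
for age, survived in passengers:
    flags_by_bucket[age_bucket(age)].append(survived)

survival_rate = {
    bucket: sum(flags) / len(flags)
    for bucket, flags in sorted(flags_by_bucket.items())
}
print(survival_rate)

# A missing age stays missing: NaN propagates through // and *
print(math.isnan(age_bucket(math.nan)))
```

On the real DataFrame the same per-bucket average can be obtained with `train_data.groupby("AgeBucket")["Survived"].mean()`.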

Drop the rows that have a missing value in 'AgeBucket' or 'RelativesOnboard' (AgeBucket is missing wherever Age is missing)

train_data.dropna(axis=0,subset = ["AgeBucket", "RelativesOnboard"],inplace=True)

Create a DataFrame attribute selector

from sklearn.base import BaseEstimator,TransformerMixin

class DataFrameSelector(BaseEstimator,TransformerMixin):
    def __init__(self,attribute_name):
        self.attribute_name = attribute_name
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        return X[self.attribute_name]

Process the numeric attributes, filling missing values with the median

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])
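
As a rough illustration of what median imputation does, the NumPy sketch below learns one median per column from the training rows and reuses those same medians on new rows. The helper names fit_medians and apply_medians and the numbers are made up; this is not SimpleImputer's code, only the same idea.

```python
import numpy as np


def fit_medians(X):
    """Per-column median, ignoring missing (NaN) entries."""
    return np.nanmedian(X, axis=0)


def apply_medians(X, medians):
    """Replace each NaN with the median learned for its column."""
    return np.where(np.isnan(X), medians, X)


# Made-up rows with two columns: Age, Fare
X_train = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [38.0, np.nan],
    [26.0, 8.05],
])
medians = fit_medians(X_train)           # Age -> 26.0, Fare -> 8.05
X_train_filled = apply_medians(X_train, medians)

# New data is filled with the medians learned from the training set
X_new = np.array([[np.nan, np.nan]])
X_new_filled = apply_medians(X_new, medians)
print(medians, X_new_filled)
```

Learning the medians once and reusing them is the reason the imputer lives inside the pipeline: the test set is then filled with training-set statistics.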

Process the categorical (string) attributes

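# Replace missing values with the most frequent value of each column
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # For every column, take the most frequent value
        # (value_counts() sorts by count in descending order)
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        # fillna accepts a Series indexed by column name,
        # so each column is filled with its own value
        return X.fillna(self.most_frequent_)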

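# Pipeline for the categorical attributes
from sklearn.preprocessing import OneHotEncoder
cat_pipeline = Pipeline([
        # passenger class, sex, port of embarkation
        ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
        ("imputer", MostFrequentImputer()),
        # scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ])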
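
The imputer above relies on fillna accepting a Series whose index holds column names. A small sketch on a made-up DataFrame (the values are not from the Titanic files) shows how each column ends up filled with its own most frequent value.

```python
import pandas as pd

# Made-up categorical data with one missing value per column
df = pd.DataFrame({
    "Sex": ["male", "female", "male", None],
    "Embarked": ["S", "C", None, "S"],
})

# value_counts() ignores missing values and sorts by count, descending,
# so index[0] is the most frequent category of each column.
most_frequent = pd.Series(
    {col: df[col].value_counts().index[0] for col in df.columns}
)

# The Series index (column names) tells fillna which value goes where
filled = df.fillna(most_frequent)
print(filled)
```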

Combine the numeric and categorical pipelines

from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(
    transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
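
FeatureUnion runs each pipeline on the same input and places their outputs side by side. The NumPy sketch below only illustrates the resulting shape. It assumes the categorical columns take 3 (Pclass), 2 (Sex) and 3 (Embarked) distinct values; the zero-filled arrays are stand-ins, not real features.

```python
import numpy as np

n_passengers = 5  # arbitrary number of rows for the illustration

# Stand-in for the numeric pipeline output: Age, SibSp, Parch, Fare
num_features = np.zeros((n_passengers, 4))

# Stand-in for the one-hot output: Pclass (3) + Sex (2) + Embarked (3) columns
cat_features = np.zeros((n_passengers, 3 + 2 + 3))

# Horizontal concatenation: same rows, columns placed next to each other
combined = np.hstack([num_features, cat_features])
print(combined.shape)
```

The row count is unchanged and the column counts add up, so under these assumptions the model sees 12 features per passenger.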

Prepare the training features and labels

x_train_prep = preprocess_pipeline.fit_transform(train_data)
# print(x_train_prep)  # a numpy.ndarray
y_train = train_data["Survived"]

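Create a random forest model

# Random forest model, evaluated with 10-fold cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest_clf = RandomForestClassifier(n_estimators=300)
forest_scores = cross_val_score(forest_clf, x_train_prep, y_train, cv=10)
print(forest_scores.mean())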

The accuracy is roughly 80%.
