The Rise of a Single Simp: Using an MLP to Train a Personalized Outfit-Recommendation Neural Network for the Girl You Like

Article link: https://blog.csdn.net/qq_35357274/article/details/121023988

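Someone once told me that a guy who was pursuing her used to send her the weather forecast every single day. But couldn't she check the forecast herself, or just look up at the sky? So, guided by two principles, "simp long enough and you end up with everything" and "laziness is the greatest driver of human progress", I trained a personalized outfit-recommendation neural network for the girl I like.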

The whole project is open-sourced on GitHub: https://github.com/Balding-Lee/PyTorch-MLP-for-personalized-dress-matching

1 Data Collection

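The data was scraped from http://www.tianqihoubao.com/lishi/chengdu/month/202001.html, covering January 2020 through October 2021. Because the site's server is unreliable, the data for September 2020 could not be fetched. That leaves 575 records in total. I labeled these 575 records by hand against 11 clothing classes. Since a person wears more than one item per day (a top, pants, and shoes), the label vector of each record contains several 1s.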
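The monthly pages follow the URL pattern shown above, so the list of pages to scrape can be generated instead of typed by hand. A minimal sketch, assuming every month uses the same `.../month/YYYYMM.html` pattern as the example URL; `month_urls` is a hypothetical helper and is not taken from the repository:

```python
# Assumption: all monthly pages share the pattern of the example URL.
BASE = "http://www.tianqihoubao.com/lishi/chengdu/month/{:04d}{:02d}.html"


def month_urls(start=(2020, 1), end=(2021, 10)):
    """Return one URL per month from start to end, both inclusive."""
    urls = []
    year, month = start
    while (year, month) <= end:
        urls.append(BASE.format(year, month))
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return urls


urls = month_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

The list still contains the September 2020 page; a real scraper would have to tolerate a failed request for it, as happened here.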

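The scraped data looks like this:
[Figure: raw data]
The features are: season, daily high temperature, daily low temperature, morning weather, and evening weather.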

2 Data Processing

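The main goal of data processing is to turn the data into something that can be fed to the network's input layer. Besides basic splitting and stripping of irrelevant characters, the features that need special handling are season, morning weather, and evening weather.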

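All of these are discrete (categorical) features, so the simplest approach is one-hot encoding. First, dates are assigned to seasons with the simplest possible rule: months 11 - 02 are winter, 02 - 05 are spring, 05 - 08 are summer, and 08 - 11 are autumn. This is not very accurate; to do better, one would scrape the actual dates of solar terms such as the Beginning of Spring (立春) and the Beginning of Autumn (立秋) for each year. The weather values need no special treatment: the morning and evening values are simply merged and deduplicated. After this step, the season and weather vocabularies are: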

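seasons = ['春', '夏', '秋', '冬']  # spring, summer, autumn, winter
# moderate rain, cloudy, severe rainstorm, heavy rain, light rain, sunny, rainstorm, overcast, showers, thundershowers
weathers = ['中雨', '多云', '大暴雨', '大雨', '小雨', '晴', '暴雨', '阴', '阵雨', '雷阵雨']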
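The date rule described above can be sketched as a small function. The post does not say which season a boundary month (2, 5, 8, 11) belongs to, so this sketch assumes each range includes its first month and excludes its last; `month_to_season` is a hypothetical name, not the repository's function:

```python
def month_to_season(month):
    """Map a month number (1-12) to a season label.

    Assumption: ranges are half-open, so February starts spring,
    May starts summer, August starts autumn, November starts winter.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if 2 <= month < 5:
        return '春'  # spring
    if 5 <= month < 8:
        return '夏'  # summer
    if 8 <= month < 11:
        return '秋'  # autumn
    return '冬'      # winter: November, December, January


print([month_to_season(m) for m in range(1, 13)])
```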

Since we will need the one-hot codes again later, we do not apply sklearn's OneHotEncoder directly to the whole dataset. Instead, we build a mapping between ids and one-hot codes:

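import numpy as np
from sklearn.preprocessing import OneHotEncoder


def get_id_char_mapping(char_list):
    """
    Build the mappings between ids and tokens
    :param char_list: list
            list of tokens
    :return idx2char: dict
            {id1: 'char1', id2: 'char2', ...}
            mapping from id to token
    :return char2idx: dict
            {'char1': id1, 'char2': id2, ...}
            mapping from token to id
    """
    idx2char, char2idx = {}, {}
    # Deduplicate. Note: the iteration order of a set of strings can differ
    # between Python runs, so save the mapping if it has to be reproducible.
    char_set = set(char_list)
    for i, char_ in enumerate(char_set):
        idx2char[i] = char_
        char2idx[char_] = i

    return idx2char, char2idx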


def get_seq2idx(sequence, char2idx):
    """
    Map a sequence of tokens to ids
    :param sequence: list
            sequence of tokens
    :param char2idx: dict
            {'char1': id1, 'char2': id2, ...}
            mapping from token to id
    :return sequence2idx: list
            the sequence with every token replaced by its id
    """
    sequence2idx = []
    for char_ in sequence:
        sequence2idx.append(char2idx[char_])

    return sequence2idx


def onehot_encode_seq(onehot_encoder, sequence):
    """
    One-hot encode a sequence of ids
    :param onehot_encoder: ndarray
            one-hot lookup table (row i is the code for id i)
    :param sequence: list
            sequence of ids to encode
    :return onehot: ndarray
            the one-hot encoded sequence
    """
    onehot = np.zeros((len(sequence), len(onehot_encoder)))

    for i, id_ in enumerate(sequence):
        onehot[i] = onehot_encoder[id_]

    return onehot


def encode_data(seasons, weather_mornings, weather_nights):
    """
    Encode the data: turn seasons and weather into one-hot vectors
    season lookup table: shape: (4, 4)
    weather lookup table: shape: (10, 10)
    :param seasons: list
            season of each day
    :param weather_mornings: list
            morning weather of each day
    :param weather_nights: list
            evening weather of each day
    :return season_onehot: ndarray
            shape: (num_days, 4)
            one-hot encoding of the seasons
    :return weather_mornings_onehot: ndarray
            shape: (num_days, 10)
            one-hot encoding of the morning weather
    :return weather_nights_onehot: ndarray
            shape: (num_days, 10)
            one-hot encoding of the evening weather
    """
    onehot_encoder = OneHotEncoder()  # one-hot encoder

    idx2season, season2idx = get_id_char_mapping(seasons)
    season_onehot_encoder = onehot_encoder.fit_transform(
        np.array(list(idx2season.keys())).reshape(-1, 1)
    ).toarray()  # one-hot lookup table for the seasons
    season_seq2idx = get_seq2idx(seasons, season2idx)  # map the sequence to ids

    # Use the id -> one-hot mapping to encode the whole sequence
    season_onehot = onehot_encode_seq(season_onehot_encoder, season_seq2idx)

    # Morning and evening weather share one vocabulary
    weather = []
    weather.extend(weather_mornings)
    weather.extend(weather_nights)
    idx2weather, weather2idx = get_id_char_mapping(weather)
    weather_onehot_encoder = onehot_encoder.fit_transform(
        np.array(list(idx2weather.keys())).reshape(-1, 1)
    ).toarray()

    weather_mornings_seq2idx = get_seq2idx(weather_mornings, weather2idx)
    weather_nights_seq2idx = get_seq2idx(weather_nights, weather2idx)
    weather_mornings_onehot = onehot_encode_seq(weather_onehot_encoder,
                                                weather_mornings_seq2idx)
    weather_nights_onehot = onehot_encode_seq(weather_onehot_encoder,
                                              weather_nights_seq2idx)

    return season_onehot, weather_mornings_onehot, weather_nights_onehot

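This is the conventional approach: first build a mapping between ids and tokens, then use that mapping to obtain the one-hot codes. Taking the seasons as an example, we get: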

idx2season = {0: '夏', 1: '秋', 2: '春', 3: '冬'}
season2idx = {'夏': 0, '秋': 1, '春': 2, '冬': 3}
season_onehot_encoder = array([[1., 0., 0., 0.],
                               [0., 1., 0., 0.],
                               [0., 0., 1., 0.],
                               [0., 0., 0., 1.]])

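In other words, the one-hot code for 夏 (summer) is season_onehot_encoder[0], and that 0 is exactly the value of 夏 in season2idx. With this mapping in hand, the whole input sequence can be one-hot encoded: first map every season in the dataset to its id, then use the id to look up the one-hot code in season_onehot_encoder. The weather features work the same way. The only difference is that weather_mornings and weather_nights each contain a few labels the other lacks, so to keep the input layout consistent, the vocabulary is built from the two lists merged and deduplicated, while the two features are still encoded separately.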
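The shared-vocabulary idea can be illustrated without numpy or sklearn. The following is a plain-Python sketch, not the repository's code: `build_vocab`, `one_hot`, and `encode` are hypothetical helpers, the sample data is made up, and the vocabulary is sorted (unlike the `set`-based mapping above) so that ids stay the same from run to run:

```python
def build_vocab(items):
    """Map each distinct token to an id; sorting keeps the ids reproducible."""
    return {token: i for i, token in enumerate(sorted(set(items)))}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    row = [0.0] * size
    row[index] = 1.0
    return row


def encode(sequence, token2idx):
    """One-hot encode every token of `sequence` with a shared vocabulary."""
    size = len(token2idx)
    return [one_hot(token2idx[token], size) for token in sequence]


# Made-up sample: '小雨' appears only in the morning, '阴' only in the evening.
mornings = ['晴', '小雨', '多云']
nights = ['多云', '阴', '晴']

weather2idx = build_vocab(mornings + nights)  # merged, then deduplicated
morning_onehot = encode(mornings, weather2idx)
night_onehot = encode(nights, weather2idx)

print(len(weather2idx))
print(morning_onehot[0] == night_onehot[2])
```

Because both features are encoded against the same vocabulary, 晴 gets the same vector whether it occurs in the morning or in the evening, and both encodings have the same width.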

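Once all of this is done, the separately processed pieces have to be concatenated into the input-layer data. Consider the dimensions. For a single day, the input has 26 = 4 + 2 + 10 + 10 dimensions: 4 for the one-hot season, the two 10s for the one-hot morning and evening weather, and 2 for the two scalars, the daily high and low temperatures. The shape of each group is: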

season_onehot: (575, 4)
highest_temps: (575, 1)
lowest_temps: (575, 1)
weather_mornings_onehot: (575, 10)
weather_nights_onehot: (575, 10)

Here 575 is the number of samples (one per day). The natural choice is to concatenate the arrays horizontally:

inputs = np.hstack((season_onehot, highest_temps, lowest_temps,
                    weather_mornings_onehot, weather_nights_onehot))

After concatenation, inputs has shape:

inputs: (575, 26)

3 Model Definition and Training

3.1 Model Definition

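[Figure: model framework]
The embedding (the input encoding) was covered in Section 2. As for the hidden layers, the first has 128 units and the last has 12. Between them I tried three variants: no extra layer; one 64-unit layer; and a 64-unit layer followed by a 32-unit layer. Experiments showed that the $128 \times 64 \times 12$ combination works best. The final 12-unit hidden layer is mainly meant to learn the probabilities of tops, pants, and shoes in spring, summer, autumn, and winter (3 garment types times 4 seasons). The model is defined as follows: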

from torch import nn


class MLP(nn.Module):

    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.linear1 = nn.Linear(num_inputs, 128)
        self.linear_add1 = nn.Linear(128, 64)
        # self.linear_add2 = nn.Linear(64, 32)
        self.linear2 = nn.Linear(64, 12)
        self.linear3 = nn.Linear(12, num_outputs)
        self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(0.01)
        # Normalize over the class (last) dimension; passing dim explicitly
        # avoids the implicit-dimension deprecation warning
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, inputs):
        """
        Forward pass
        :param inputs: tensor
                shape: (batch_size, 26)
        :return: tensor
                shape: (batch_size, 11)
        """
        out1 = self.sigmoid(self.linear1(inputs))
        out1 = self.dropout(out1)
        out_add1 = self.sigmoid(self.linear_add1(out1))
        out_add1 = self.dropout(out_add1)
        # out_add2 = self.sigmoid(self.linear_add2(out_add1))
        # out_add2 = self.dropout(out_add2)
        out2 = self.sigmoid(self.linear2(out_add1))
        out2 = self.dropout(out2)

        return self.softmax(self.linear3(out2))

The hidden layers use sigmoid activations and the output layer uses softmax. Because the dataset is so small, dropout is set to just 0.01.

3.2 Evaluation Metric and Loss Function

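Because this is a multi-label classification problem, the cross-entropy loss normally used for single-label classification cannot be applied directly; for details see my previous post, "PyTorch Study Notes (5): Cross-Entropy Error RuntimeError: 1D target tensor expected, multi-target not supported". So here I fall back on the most traditional choice, mean squared error, as the loss function.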
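As a concrete illustration of that loss on a multi-hot target, here is a plain-Python sketch of mean squared error averaged over all elements, which is what `nn.MSELoss` computes with its default `reduction='mean'`. The helper name `mse_loss` and the numbers are made up for illustration; the post does not show the training code, so whether it calls `nn.MSELoss` itself is an assumption:

```python
def mse_loss(y_hat, y):
    """Mean of the squared element-wise differences between two vectors."""
    if len(y_hat) != len(y):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


# Toy 3-class case: the first two items are worn, the third is not.
y_hat = [0.7, 0.2, 0.1]  # softmax-like output, sums to 1
y = [1, 1, 0]            # multi-hot target

loss = mse_loss(y_hat, y)
print(loss)
```

Unlike a class-index target, the multi-hot vector `y` has the same shape as the prediction, which is why an element-wise loss accepts it without complaint.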

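Likewise, the usual sklearn metrics do not fit this kind of problem. (average_precision_score in sklearn.metrics does handle multi-label classification, see the official documentation for sklearn.metrics.average_precision_score, but a metric that is not accuracy just felt odd to me.) So I defined my own accuracy metric.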

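Since the goal is to recommend what to wear given the weather, I set a threshold $\epsilon$: classes whose output is at least $\epsilon$ are recommended, and those below $\epsilon$ are not. With 11 classes, spreading the probability evenly gives each class $1/11 \approx 9\%$, so I set $\epsilon = 0.1$. The accuracy is computed as follows: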

def get_accuracy(y_hat, y, epsilon):
    """
    Compute the accuracy
    Compare every element of y_hat with the threshold, then compare the result with y
    :param y_hat: tensor
            predictions
    :param y: tensor
            ground-truth labels
    :param epsilon: float
            threshold
    :return: float
            accuracy
    """
    return ((y_hat >= epsilon).float() == y).float().mean().item()

This code is easiest to understand through an example:

y_hat = tensor([2.7865e-05, 7.7470e-06, 5.3148e-01, 3.0976e-04, 1.9971e-05,
                3.3148e-06, 1.3452e-01, 6.2689e-02, 1.3991e-01, 1.3103e-01, 5.4364e-06])

y = tensor([0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0])

(y_hat >= epsilon) = tensor([False, False,  True, False, False, False,  True, False,  True,  True, False])

(y_hat >= epsilon).float() = tensor([0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.])

((y_hat >= epsilon).float() == y) = tensor([ True, False, False,  True, False,
                                             True, False, False,  True, False,  True])

((y_hat >= epsilon).float() == y).float() = tensor([1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1.])

((y_hat >= epsilon).float() == y).float().mean() = tensor(0.4545)
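Two consequences of pairing a softmax output with a fixed threshold are worth checking by hand. The softmax outputs sum to 1, so a perfectly uniform output gives every class 1/11, which is below 0.1, and nothing is recommended; for the same reason no more than 10 of the 11 classes can reach 0.1 at once. The sketch below uses only the standard library; `softmax`, `recommend`, and `accuracy` are illustrative stand-ins for the PyTorch operations, and the logits are made up:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recommend(probs, epsilon=0.1):
    """Indices of the classes whose probability is at least epsilon."""
    return [i for i, p in enumerate(probs) if p >= epsilon]


def accuracy(probs, target, epsilon=0.1):
    """Fraction of classes where the thresholded prediction equals the label."""
    predicted = [1 if p >= epsilon else 0 for p in probs]
    hits = sum(int(p == t) for p, t in zip(predicted, target))
    return hits / len(target)


uniform = softmax([0.0] * 11)                  # every class gets 1/11
peaked = softmax([3.0, 3.0, 3.0] + [0.0] * 8)  # three clearly preferred classes

print(recommend(uniform))
print(recommend(peaked))
print(accuracy(peaked, [1, 1, 1] + [0] * 8))
```

Note that this accuracy counts every class separately, so correctly leaving out an item that was not worn scores as a hit just like correctly recommending one that was.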

3.3 Training the Model

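Training itself is unremarkable; it is the standard training procedure. The one thing to note is that, because the dataset is small, I split it 6 : 2 : 2 into training, validation, and test sets. train_test_split has no built-in way to produce a validation set, so I split in two steps: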

X_train, X_dt, y_train, y_dt = train_test_split(inputs, labels, test_size=0.4,
                                                random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_dt, y_dt, test_size=0.5,
                                                random_state=0)
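To see what the two-step split yields for 575 samples, here is a standard-library sketch of the same idea: hold out 40% first, then cut the held-out part in half. It is only a stand-in for `train_test_split` (the shuffling differs, so the actual rows in each set will not match), and `three_way_split` is a hypothetical helper:

```python
import random


def three_way_split(n_samples, seed=0):
    """Split indices 6:2:2 by holding out 40% and halving the held-out part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_holdout = n_samples * 4 // 10       # 40% kept aside for dev + test
    n_dev = n_holdout // 2                # half of the held-out part

    dev = indices[:n_dev]
    test = indices[n_dev:n_holdout]
    train = indices[n_holdout:]
    return train, dev, test


train_idx, dev_idx, test_idx = three_way_split(575)
print(len(train_idx), len(dev_idx), len(test_idx))
```

For 575 samples this gives 345 training, 115 validation, and 115 test indices.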

3.4 Model Evaluation

For the training and validation accuracy and loss, I show the results of layer2:

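From top to bottom: training loss, training accuracy, validation loss, validation accuracy. The training loss oscillates heavily, or put differently, it barely decreases (the y-axis spans less than $10\%$). This is a symptom of underfitting, which I attribute to the small amount of data.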

The loss and accuracy on the test set are:

layer1: test accuracy 0.756522, test loss 0.188675
layer2: test accuracy 0.766798, test loss 0.186148
layer3: test accuracy 0.739130, test loss 0.194097

So on the test set, the network with three hidden layers (layer2) performs best.
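Because `get_accuracy` averages over every (sample, class) pair, each test accuracy corresponds to a whole number of correct per-label decisions. A small arithmetic check, assuming the test set holds 115 samples (20% of 575) with 11 labels each; that sample count is inferred from the split ratio and is not stated in the post:

```python
N_TEST = 115     # assumption: 20% of the 575 samples
N_CLASSES = 11

# Test accuracies as reported above
reported = {'layer1': 0.756522, 'layer2': 0.766798, 'layer3': 0.739130}

total = N_TEST * N_CLASSES  # number of individual label decisions
correct = {name: round(acc * total) for name, acc in reported.items()}

for name, hits in correct.items():
    print(f"{name}: {hits}/{total} = {hits / total:.6f}")
```

Under that assumption the recovered fractions reproduce the reported figures to six decimals, and the gap between the best and worst variants amounts to a few dozen label decisions, which is worth keeping in mind when ranking the three architectures.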

4 Testing the API

Once the model was trained, I wrapped it in an interface to try it out:

model = MLP(26, 11)
model.load_state_dict(torch.load('./data/parameters_layer2.pkl'))
model.eval()
with torch.no_grad():
    pred = model(input_)

dress_idx = torch.nonzero((pred >= epsilon).float())  # indices of the non-zero elements

print('今日适合穿: ', end='')
for idx in dress_idx:
    print(titles[idx], end=' ')

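The inputs and outputs for several test cases are shown below. The program takes and prints Chinese labels: 今日适合穿 means "suitable to wear today"; the clothing items are T恤（短） (short-sleeved T-shirt), 牛仔裤 (jeans), 帆布鞋 (canvas shoes), 老爹鞋 (dad sneakers), 卫衣 (sweatshirt), 羽绒服 (down jacket), and 毛衣 (sweater).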

python mlp.py -s 夏 -hi 34 -l 28 -m 晴 -n 晴
今日适合穿: T恤（短） 牛仔裤 帆布鞋 老爹鞋

python mlp.py -s 春 -hi 20 -l 18 -m 晴 -n 多云
今日适合穿: T恤（短） 牛仔裤 帆布鞋 老爹鞋

python mlp.py -s 秋 -hi 14 -l 11 -m 小雨 -n 多云
今日适合穿: 卫衣 牛仔裤 老爹鞋

python mlp.py -s 冬 -hi 5 -l 1 -m 多云 -n 阵雨
今日适合穿: 羽绒服 毛衣
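The command lines above suggest how the flags map onto the 26-dimensional input. The following is a hedged sketch of that mapping, not the repository's `mlp.py`: the short flags are taken from the examples, while the long option names, `build_parser`, `build_input`, and the vocabulary order are assumptions. In particular, a real interface must reuse the exact id mapping that was used during training:

```python
import argparse

# Assumption: vocabulary order as listed in Section 2; the trained model
# may use a different id order.
SEASONS = ['春', '夏', '秋', '冬']
WEATHERS = ['中雨', '多云', '大暴雨', '大雨', '小雨', '晴', '暴雨', '阴', '阵雨', '雷阵雨']


def one_hot(value, vocabulary):
    """One-hot encode `value` against an ordered vocabulary."""
    return [1.0 if item == value else 0.0 for item in vocabulary]


def build_parser():
    parser = argparse.ArgumentParser(description='Outfit recommendation (sketch)')
    parser.add_argument('-s', '--season', choices=SEASONS, required=True)
    parser.add_argument('-hi', '--highest', type=float, required=True)
    parser.add_argument('-l', '--lowest', type=float, required=True)
    parser.add_argument('-m', '--morning', choices=WEATHERS, required=True)
    parser.add_argument('-n', '--night', choices=WEATHERS, required=True)
    return parser


def build_input(args):
    """Concatenate the features in the order season, high, low, morning, night."""
    return (one_hot(args.season, SEASONS)
            + [args.highest, args.lowest]
            + one_hot(args.morning, WEATHERS)
            + one_hot(args.night, WEATHERS))


# Same arguments as the first example command
args = build_parser().parse_args(
    ['-s', '夏', '-hi', '34', '-l', '28', '-m', '晴', '-n', '晴'])
input_ = build_input(args)
print(len(input_))
```

The resulting list has 4 + 2 + 10 + 10 = 26 entries and could be converted to a float tensor before being passed to the model.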