Movies on MLOps

Introduction

Classifying movies is always super cool and useful. I, myself, have built at least five successful businesses around it, so I wanted to share an end-to-end example of how you can go from being an average Netflix rater to making millions of dollars on your skillset. First, we will go through the preprocessing using PySpark on the MLOps platform, and then we will continue by training awesome models that we can deploy so that millions of users can pay for the ratings.

Note! Subscription tiers at about $29/month have worked best for me in the past.

Inspecting the data

The dataset can be found here. I’m going to go with the ratings, metadata and keywords for this classifier. The columns that I’m interested in and will work with are:

  • ratings.csv — all of them
  • metadata.csv — [id, budget, genres, popularity, runtime, revenue, original_language, production_companies, vote_count, vote_average]
  • keywords.csv — all of them

These are the columns with the fewest null values and the most information. I’m not going to do any Pandas data analysis here, because Kaggle has already done it for me. Personally, I keep all my cool datasets on S3, so importing them on the MLOps platform is easy: just add the S3 paths under “Create Datasource”. It should look something like this:

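For reference, the datasource inputs are simply the S3 locations of the three raw CSV files, along these lines (the bucket and prefix are made up for illustration):

s3://my-movie-data/the-movies-dataset/ratings.csv
s3://my-movie-data/the-movies-dataset/metadata.csv
s3://my-movie-data/the-movies-dataset/keywords.csv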

Let’s get crackin’

First things first. This is my __main__.

Importing the DataFrames might seem a bit obscure, but since I generated the code from the console, I get the UUID names of the version-tracked tables. After reading the data, we basically call a function that does all our transformations, and then we write the resulting DataFrame to disk, specifying the format and label columns (for statistics and training). Using the MLOps read and write functions also allows you to trigger this job on a schedule, and it will then only process data it has not yet seen (pipeline gold).

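As a rough sketch of the flow described above (the SparkProcessor method names read_dataframe and write_dataframe, their arguments, and the table names are illustrative assumptions, not the exact MLOps SDK API):

if __name__ == "__main__":
    # Sketch only: read_dataframe/write_dataframe are hypothetical stand-ins
    # for the SDK calls that the console generates.
    mlops = SparkProcessor()

    # The console-generated code references the version-tracked tables
    # by their UUID names; placeholders are used here.
    df_keywords = mlops.read_dataframe("<keywords-table-uuid>")
    df_ratings = mlops.read_dataframe("<ratings-table-uuid>")
    df_meta = mlops.read_dataframe("<metadata-table-uuid>")

    # All transformations happen in one function.
    df_out = my_transformations(df_keywords, df_ratings, df_meta)

    # Write the result, specifying format and label columns
    # (used for statistics and training).
    mlops.write_dataframe(
        df_out,
        format="parquet",  # assumed
        label_columns=[f"label[{i}]" for i in range(4)],  # assumed names of the one-hot rating columns
    )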

[Image: Selecting datasources in the MLOps platform allows you to generate template code for your PySpark job.]
Note! Saving this dataset in Parquet instead of CSV saves roughly two orders of magnitude of disk space. Initial dataset size: ~700 MB; transformed CSV: ~900 MB; transformed Parquet: 14 MB.

Oh btw, don’t forget to add some imports; it might help.

from pyspark.sql import functions as f
from pyspark.sql import DataFrame, SparkSession
from pyspark.ml.feature import Imputer, Bucketizer, StringIndexer, VectorAssembler, OneHotEncoder, StandardScaler
from pyspark.ml import Pipeline
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    DoubleType
)
import sys
import random
import json
from mlops.processing.spark import SparkProcessor

Transformations

So to summarize what I want to accomplish with the data:

  1. Get the IDs from all the nested columns [genres, keywords, production_companies], sort them numerically, and then stringify them so that, for example, a combination of genre_1 + genre_2 can be seen as a feature (a small worked example follows after this list).
  2. Convert all string columns [original_language, + columns from (1)] to numerical classes.
  3. Replace all empty values (zeroes in our case) with the mean of each column.
  4. Normalize and scale all feature columns.
  5. One-hot encode my ratings (labels), since I will be using TensorFlow’s CategoricalCrossentropy as the loss function. This is optional; if you are using SparkML or Scikit-learn, you could just be satisfied with having a single column of integers instead.
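
To make steps (1) and (2) concrete, here is what a single genres value goes through (the sample values are illustrative, not taken from the actual dataset):

# genres column (raw string):  "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
# extract the IDs and sort:    [18, 35]
# stringify with array_join:   "18_35"
# StringIndexer (step 2):      e.g. 7.0, i.e. one numerical class per unique combination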

Preprocessing code

So let’s start with the boring, annoying part: figuring out how to parse these poorly structured Python dicts (they could at least have bothered to do a json.dumps()). So this is my oh-my-god helper function:

def parse_json_array(data: str):
    # The raw columns contain Python-repr dicts rather than valid JSON,
    # so patch the string into something json.loads can handle.
    if data is not None:
        data_cleaned = data.replace(
            "None", "null").replace("'", '"').replace('\r', '').replace('\n', '').replace('""', '"').replace('\xa0', '')
        try:
            data_parsed = json.loads(data_cleaned)
            # Keep only the 'id' of each nested item.
            data = []
            for item in data_parsed:
                data.append(item['id'])
            return data
        except Exception:
            # Anything unparseable is treated as an empty list.
            return []
    else:
        return []




def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    return f.udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)
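
The to_array helper only comes into play at the very end, to flatten the one-hot encoded rating (a Spark ML vector) into a plain array that can be split into ordinary columns. As a quick, local sanity check of the conversion logic (not part of the job itself):

from pyspark.ml.linalg import SparseVector

v = SparseVector(4, {2: 1.0})   # a one-hot encoded rating class
print(v.toArray().tolist())     # [0.0, 0.0, 1.0, 0.0]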

So now that we’ve got that out of the way, let’s dive into the my_transformations() function. I’m going to start with the parsing. Basically, what I will do is run a UDF for each JSON column and pass each row to the above function, so we go from a stringified Python dict to a Spark ArrayType filled with integers (the IDs). I will then sort these integers with the array_sort function so that genre_1+genre_2 is the same as genre_2+genre_1:

def my_transformations(df_keywords: DataFrame, df_ratings: DataFrame, df_meta: DataFrame) -> DataFrame:
    json_array_schema = ArrayType(IntegerType())
    udf_id_parser = f.udf(lambda x: parse_json_array(x), json_array_schema)


    df_kw_parsed = df_keywords.select(
        [
            f.col('id').cast('int'),
            f.array_sort(udf_id_parser(df_keywords["keywords"])).alias(
                "keyword_ids")
        ]
    ).withColumn('size', f.size(f.col('keyword_ids'))).filter(f.col('size') >= 1)


    df_meta_parsed = df_meta.select(
        [
            f.col('id').cast('int'),
            'original_language',
            'popularity',
            'runtime',
            'revenue',
            'vote_average',
            'vote_count',
            f.array_sort(udf_id_parser(df_meta['production_companies'])).alias(
                'production_company_ids'),
            f.col('budget').cast('int'),
            f.array_sort(udf_id_parser(df_meta["genres"])).alias("genre_ids")


        ]
    ).withColumn('size', f.size(f.col('genre_ids'))).filter(f.col('size') >= 1)


    # Read ratings and cast columns
    df_ratings = df_ratings.select(
        f.col('movieid').alias('id'),
        f.col('rating').cast('float'),
        f.col('timestamp').cast('int')
    ).na.drop()

As you can see above, I’m also using my own special “null value” handler at the end of df_kw_parsed and df_meta_parsed to drop all rows whose arrays are empty.

Next up, I will join my three DataFrames together, repartition the data evenly onto the Spark executors and then cast a bunch of columns. You will also notice that I use the Spark built-in function array_join , which will help me stringify my arrays so that I can create integer classes out of them later.

# Continuing inside my_transformations()


df_joined = df_ratings.join(
    f.broadcast(df_meta_parsed.join(
        df_kw_parsed,
        on='id',
        how='left'
    )),
    on='id',
    how='inner'
).repartition(20)


df_joined = df_joined.select(
    'id',
    'rating',
    'original_language',
    'budget',
    'timestamp',
    f.col('popularity').cast('float'),
    f.col('vote_count').cast('float'),
    f.col('vote_average').cast('float'),
    f.col('runtime').cast('float'),
    f.col('revenue').cast('float'),
    f.array_join(df_joined['production_company_ids'],
                 delimiter='_').alias('production_company_join'),
    f.array_join(df_joined['genre_ids'],
                 delimiter='_').alias('genre_join'),
    f.array_join(df_joined['keyword_ids'], delimiter='_').alias('kw_join')
)

Once that hurdle is over, I will create my first Spark Pipeline, which will take all my string columns and turn each unique value into its own integer value, as well as use the Imputer function to fill zero values with the mean. I will also take my ratings column and use Bucketizer. Since the ratings are a bit skewed in number of samples, I will use this function to bucket the ratings into broader classes, reducing the number of classes from 10 to 4.

# Continuing inside my_transformations()


index_pipeline = Pipeline(stages=[
    Imputer(
        inputCols=['budget', 'popularity', 'vote_count',
                   'vote_average', 'runtime', 'revenue'],
        outputCols=['budget', 'popularity', 'vote_count',
                    'vote_average', 'runtime', 'revenue'],
        missingValue=0.0
    ),
    StringIndexer(inputCol='genre_join',
                  outputCol='genreIndex').setHandleInvalid('skip'),
    StringIndexer(inputCol='production_company_join',
                  outputCol='pcIndex').setHandleInvalid('skip'),
    StringIndexer(inputCol='kw_join',
                  outputCol='kwIndex').setHandleInvalid('skip'),
    StringIndexer(inputCol='original_language',
                  outputCol='langIndex').setHandleInvalid('skip'),
    Bucketizer(splits=[0.0, 2.5, 3.5, 4.5, 5.0], inputCol='rating',
               outputCol='ratingIndex').setHandleInvalid('skip')
])
df_indexed = index_pipeline.fit(df_joined).transform(df_joined).select(
    'budget', 'popularity', 'timestamp', 'vote_count', 'vote_average', 'runtime', 'revenue', 'pcIndex', 'genreIndex', 'kwIndex', 'langIndex', 'ratingIndex'
)
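
For reference, with Spark’s Bucketizer semantics (each bucket is closed on the left and open on the right, except the last one, which also includes its upper bound), and assuming the usual 0.5 to 5.0 half-star scale in ratings.csv, the splits above collapse the ratings into four classes:

# ratings 0.5 - 2.0  ->  ratingIndex 0.0   (bucket [0.0, 2.5))
# ratings 2.5 - 3.0  ->  ratingIndex 1.0   (bucket [2.5, 3.5))
# ratings 3.5 - 4.0  ->  ratingIndex 2.0   (bucket [3.5, 4.5))
# ratings 4.5 - 5.0  ->  ratingIndex 3.0   (bucket [4.5, 5.0])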

Alright, we have two things left to do: scaling and one-hot encoding. So let’s start with scaling, which I will do for all features. I will use the Spark StandardScaler, which normalizes and then scales, because I’m lazy. And honestly, because we are doing movie ratings.

# Continuing inside my_transformations()


unlist = f.udf(lambda x: round(float(list(x)[0]), 3), DoubleType())


# Iterating over columns to be scaled
for i in ['budget', 'popularity', 'timestamp', 'vote_count', 'vote_average', 'runtime', 'revenue', 'pcIndex', "genreIndex", "kwIndex", 'langIndex']:
    assembler = VectorAssembler(inputCols=[i], outputCol=i+"_vect")
    assembler.setHandleInvalid('skip')
    scaler = StandardScaler(
        inputCol=i+"_vect", outputCol=i+"_scaled", withMean=True)
    pipeline = Pipeline(stages=[assembler, scaler])
    df_indexed = pipeline.fit(df_indexed).transform(df_indexed).withColumn(
        i+"_scaled", unlist(i+"_scaled")).drop(i+"_vect")

Finally, we can run our last pipeline, with the OneHotEncoder. There is just one little obstacle here: the output is a SparseVector, and we want it as columns, since we are going to save it to disk.

# Continuing inside my_transformations()


encode_pipeline = Pipeline(stages=[
    OneHotEncoder(inputCol='ratingIndex', outputCol='rating_hot')
])
df_encoded = encode_pipeline.fit(df_indexed).transform(df_indexed)


return df_encoded.withColumn("label", to_array(f.col("rating_hot"))).select(
    ['budget_scaled',
     'timestamp_scaled',
     'vote_count_scaled',
     'vote_average_scaled',
     'popularity_scaled',
     'runtime_scaled',
     'revenue_scaled',
     'pcIndex_scaled',
     'genreIndex_scaled',
     'kwIndex_scaled',
     'langIndex_scaled',
     'ratingIndex'] + [f.col("label")[i] for i in range(4)]
)

Notice that I use the to_array helper function defined earlier. Finally, I will return to __main__ and write to disk.

Now that my script is done, I want to pop it into the MLOps platform, so I do “Create Dataset” and choose 4 standard workers (that’s a lot of compute power for this little princess dataset). I will create a full subset right away and do an 80/10/10 split on train/val/test. I also won’t calculate any column metrics, since I know they are normalized and scaled, and also because we are doing movie ratings. Here is the result:

[Image: the resulting dataset job in the MLOps console, showing execution time and cost.]

As we can see, this bad boy took us about 35 minutes to execute and cost me about a dollar. Good for me that I’m 100% certain this will generate a kick-ass classifier in the next part!

If I’m interested, I can inspect the CloudWatch logs directly via the Logs button. I can also check out the PySpark code directly, which is nice if there are multiple people on a team working towards a common goal. You can then easily pick up where someone else left off.

[Image: the Logs and PySpark Code views in the MLOps console.]

Well that was a handful. Now let’s move on and build a model!

Training a TensorFlow model

Alright, so as you know, the heavy lifting is over, and building models is just a walk in the park once the data is tip-top. So, straight from the Datasets view, we can select our dataset and click “Create Model”, which will look something like this:

[Image: the Create Model view in the MLOps platform.]

If you are a Data Scientist, and have read this far, you probably agree that using a DNN to fit a model around this dataset is a tad too much. But who has time for that kind of philosophical thinking!

I will use a 120–120–4 DNN with CategoricalCrossentropy in the Keras API. All of this I will run on a GPU instance for 30 epochs. And this is all the code we need:

import tensorflow as tf
from mlops.training.tensorflow import TensorFlowTrainer




def my_network(mlops, labels):
    inputs = tf.keras.Input(shape=(mlops.train_data.shape[1],))
    hidden_1 = tf.keras.layers.Dense(120, activation="relu")(inputs)
    hidden_2 = tf.keras.layers.Dense(120, activation="relu")(hidden_1)
    outputs = tf.keras.layers.Dense(labels)(hidden_2)
    optimizer = tf.keras.optimizers.Nadam(
        mlops.hyperparameters["learning_rate"])
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer=optimizer,
        loss=mlops.loss,
        metrics=mlops.metrics,
    )
    return model




if __name__ == "__main__":
    mlops = TensorFlowTrainer()
    mlops.read_data(number_of_label_cols=4, label_pos='end')
    model = my_network(mlops, 4)
    model.fit(
        mlops.train_data,
        mlops.train_labels.astype('float64'),
        batch_size=mlops.hyperparameters["batch_size"],
        epochs=mlops.hyperparameters["epochs"],
        validation_data=(mlops.val_data, mlops.val_labels.astype('float64')),
        callbacks=[mlops.callback],
        shuffle=True
    )
    mlops.predict(model)
    mlops.save_model(model)

As you can see, the most important things are baked into the MLOps SDK after submitting through the console. You access all the hyperparameters through mlops.hyperparameters. The best thing is, if I wanted to run hyperparameter optimization on this dataset, the total number of lines of code I would have to change would be zero. This allows you to experiment at a fast pace, doing many iterations on smaller datasets, and then quickly scale up once you are ready to build your production model. In addition, using the mlops.callback class, all metrics are automatically fed back to the console so that you and your colleagues can collaborate and iterate together.

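For illustration, mlops.hyperparameters is accessed like a dictionary; given the keys used in the script above, it might look something like this after submitting the job from the console (the values here are made up, apart from the 30 epochs mentioned earlier):

# Hypothetical contents of mlops.hyperparameters:
# {
#     "learning_rate": 0.001,
#     "batch_size": 256,
#     "epochs": 30,
# }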

[Image: training metrics reported back to the MLOps console.]

Ooooh! Fantastic results. A stunning accuracy of 25% on a 4-class problem. But what could have gone wrong? Must have been those lazy data engineers. Oh wait…

If you can see what’s wrong — leave a comment! The first correct solution wins a movie (rating).

Click here if you want to learn more about the MLOps platform!

Translated from: https://medium.com/@petter_28583/movies-on-mlops-dee93698a1ea
