让我们用sql估算缺失值

最新推荐文章于 2023-03-11 14:01:29 发布

weixin_26720761

最新推荐文章于 2023-03-11 14:01:29 发布

阅读量1k

点赞数 1

文章标签： sql

原文链接：https://towardsdatascience.com/lets-impute-missing-values-with-sql-e350c89d9e2a

版权

本文展示了如何在没有Python或R的情况下，利用Oracle Cloud的SQL Developer Web处理机器学习中的缺失值问题。通过创建免费的OLTP数据库，作者以著名的Titanic数据集为例，演示了如何上传数据并根据理解调整数据类型来处理缺失值。

摘要由CSDN通过智能技术生成

Missing values are a huge problem in machine learning. In a day and age when machine learning can be done directly in the database, one wonders how to perform adequate data preparation with SQL, without other programming languages, such as Python and R. Today we’ll see just how easy it is.

缺失值是机器学习中的一个巨大问题。在当今可以直接在数据库中进行机器学习的时代，人们想知道如何在没有其他编程语言(例如Python和R)的情况下使用SQL进行适当的数据准备。今天我们将看到它是如此的容易。

We’ll use Oracle Cloud for the purpose of this article, as it’s free and can be used without any downloads and installations on your machine — through the SQL Developer Web. If you decide to follow along, create a free OLTP database, and go to Service Console — Development — SQL Developer Web.

出于本文的目的，我们将使用Oracle Cloud ，因为它是免费的，并且可以通过SQL Developer Web在计算机上进行任何下载和安装而无需任何下载和安装。如果您决定继续，请创建一个免费的OLTP数据库，然后转到Service Console-Development-SQL Developer Web 。

With regards to the dataset, we’ll use the well-known Titanic dataset for two reasons:

关于数据集，出于两个原因，我们将使用众所周知的Titanic数据集：

It’s simple and easy to understand
简单易懂
It contains enough missing values for us to play with
它包含足够的缺失值供我们使用

Once you have the dataset downloaded, you can use the Upload Data functionality of SQL Developer Web to create the table and upload data:

下载数据集后，可以使用SQL Developer Web的Upload Data功能创建表并上传数据：

Change data types using your best judgment and you’re ready to roll!

使用您的最佳判断来更改数据类型，您就可以准备开始工作！

准备与探索 (Preparation and exploration)

I don’t want to mess anything up with the source table, called titanic, so let’s make a copy of it:

我不想弄乱名为titanic的源表，因此让我们对其进行复制：

CREATE TABLE cp_titanic AS 
SELECT * FROM titanic;

Let’s just make a quick Select to verify everything is as it should be:

让我们快速选择一下，以验证所有内容均正确无误：

SELECT * FROM cp_titanic;

Awesome! If you’ve done any work with this dataset, you know the Age column is the most problematic one when it comes to missing values. Let’s just check how many there are:

太棒了！如果您对此数据集进行过任何处理，就可以知道Age列是涉及缺失值时最有问题的列。让我们检查一下有多少：

SELECT COUNT(*) AS num_age_missing 
FROM cp_titanic 
WHERE age IS NULL;

Now that’s a lot if we take into account that the dataset has less than 1000 rows. For that reason, we need to figure out how to fill these missing values and do so in an easy and automated fashion.

现在，如果我们考虑到数据集的行数少于1000，那就很多了。因此，我们需要找出如何填充这些缺失的值，并以一种简单而自动化的方式进行。

插补时间 (Imputation time)

To handle missing data imputation, we’ll create a PL/SQL procedure. Don’t worry if you don’t know what that is, as it will be utterly simple to follow along. Before we do anything in code, let’s list things this procedure should do:

为了处理丢失的数据插补，我们将创建一个PL/SQL过程。如果您不知道这是什么，请不要担心，因为它非常简单。在执行代码中的任何操作之前，让我们列出此过程应执行的操作：

Obtain a value used for imputation (mean/median/mode will do the trick for this simple example)
获取用于插补的值(平均值/中位数/模式将为该简单示例提供技巧)
Make an update to the table — replace NULL values with the calculated value
更新表-将NULL值替换为计算值

Great! This shouldn’t be hard to do. We’ll also be accepting some input from the user — 3 parameters to be more precise:

大！这应该不难做到。我们还将接受用户的一些输入-更精确的3个参数：

Table name — text representation of a table where missing data is located
表名 -缺少数据的表的文本表示形式
Attribute — a column which contains missing values
属性 -包含缺失值的列
Impute method — a way on which imputation is done — either mean, median, or mode
插补方法 (一种插补方法)，均值，中位数或众数

And that’s all we have to know to get started. Let’s create a procedure with what we know so far:

这就是我们开始所需要知道的。让我们用到目前为止所知道的来创建一个过程：

CREATE OR REPLACE PROCEDURE impute_missing(
    in_table_name    IN VARCHAR2,
    in_attribute     IN VARCHAR2,
    in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
BEGIN
END;
/

Awesome! Below the IS keyword we’ll declare some other variables — for holding things such as impute function that will be used, a statement for getting the average value, the average value itself, and an update statement. Our procedure now looks as follows:

太棒了！在IS关键字下面，我们将声明其他一些变量-用于保存诸如将要使用的插补函数，获取平均值的语句，平均值本身以及更新语句之类的内容。现在，我们的过程如下所示：

CREATE OR REPLACE PROCEDURE impute_missing(
    in_table_name    IN VARCHAR2,
    in_attribute     IN VARCHAR2,
    in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
    impute_func   VARCHAR2(16);
    get_avg_stmt  VARCHAR2(128);
    avg_value     NUMBER;
    update_stmt   VARCHAR2(128);
BEGIN
END;
/

The remaining of the code will be located between the BEGIN and END keywords.

其余代码将位于BEGIN和END关键字之间。

估算方法 (Impute method)

As discussed earlier, our procedure can handle missing value imputation by using mean, median, or mode statistical functions. Also, those are values that the user can provide for the in_impute_method parameter.

如前所述，我们的过程可以通过使用均值，中位数或众数统计函数来处理缺失值的估算。同样，这些是用户可以为in_impute_method参数提供的in_impute_method 。

The only problem is — these statistical functions are called a bit differently in SQL. We can use the CASE operator to handle this and store the result into impute_func variable. Here’s the code:

唯一的问题是-这些统计函数在SQL中的调用方式有所不同。我们可以使用CASE运算符来处理此问题，并将结果存储到impute_func变量中。这是代码：

CASE in_impute_method
    WHEN ‘mean’   THEN impute_func := ‘AVG’;
    WHEN ‘median’ THEN impute_func := ‘MEDIAN’;
    WHEN ‘mode’   THEN impute_func := ‘STATS_MODE’;
    ELSE RAISE_APPLICATION_ERROR(-20001, ‘Invalid impute method!’);
END CASE;

That wasn’t hard, was it?

那不难，不是吗？

We’re almost halfway done, and the following parts are also plain simple.

我们几乎完成了一半，以下部分也很简单。

获得平均值 (Get the average value)

To get the average value, we’ll have to use something known as a Dynamic SQL. This means we won’t hardcode the SQL statements, but instead, the statements get created based on provided user input.

要获得平均值，我们必须使用一种称为Dynamic SQL的方法 。这意味着我们不会对SQL语句进行硬编码，而是根据提供的用户输入来创建语句。

Let’s see how to create a Dynamic query for getting the average value, with respect to parameters the user has entered (table name, attribute, and impute method):

让我们看看如何针对用户输入的参数(表名称，属性和插补方法)创建动态查询以获取平均值：

get_avg_stmt :=q’[SELECT ]’|| impute_func|| q’[(]’|| in_attribute|| q’[) FROM ]’|| in_table_name;DBMS_OUTPUT.PUT_LINE(‘get_avg_stmt = ‘ || get_avg_stmt);

It might look and feel a bit strange until you get used to this, but you’ll see later what it all boils down to. This DBMS_OUTPUT line isn’t required, but it prints the get_avg_stmt to the console once we run the procedure.

除非您习惯了它，否则它的外观可能会有些奇怪，但是稍后您将看到所有内容。此DBMS_OUTPUT行不是必需的，但是一旦我们运行该过程，它将把get_avg_stmt打印到控制台。

This isn’t enough though, as we still need to execute this statement and store its result in the avg_value variable. It’s simple to do:

但是，这还不够，因为我们仍然需要执行该语句并将其结果存储在avg_value变量中。这很简单：

BEGIN
    EXECUTE IMMEDIATE get_avg_stmt INTO avg_value;
END;DBMS_OUTPUT.PUT_LINE(‘avg_value = ‘ || avg_value);

This part is now done, and now we should somehow make an update to the table to actually fill the missing values. Let’s see how.

现在已经完成了这一部分，现在我们应该以某种方式对表进行更新以实际填充缺失的值。让我们看看如何。

归因 (Imputation)

If you’ve understood the previous section, you’ll understand this one. Once again, we need to create a SQL statement dynamically and execute it. The only difference is that this time results won’t be stored in a variable, as that makes no sense for an UPDATE statement. The table is modified in-place, instead:

如果您已了解上一节，则将了解这一部分。再一次，我们需要动态创建一个SQL语句并执行它。唯一的区别是，这次的结果将不会存储在变量中，因为这对于UPDATE语句没有意义。该表被原地修改，而是：

update_stmt :=
     q’[UPDATE ]’
  || in_table_name
  || q’[ SET ]’
  || in_attribute
  || q’[ = ]’
  || avg_value
  || q’[ WHERE ]’
  || in_attribute
  || q’[ IS NULL]’;DBMS_OUTPUT.PUT_LINE(‘update_stmt = ‘ || update_stmt);BEGIN 
    EXECUTE IMMEDIATE update_stmt;
END;COMMIT;

The only new thing here is the COMMIT keyword. It is used to end the current transaction and make all of the changes permanent. If you’re not sure which statements have to be committed, here’s an explanation. You’re welcome.

这里唯一的新事物是COMMIT关键字。它用于结束当前事务并使所有更改永久生效。如果您不确定必须提交哪些语句，请执行以下说明。别客气。

And that’s it basically, we’ve done everything we had to.

基本上就是这样，我们已经完成了所有必须要做的事情。

回顾 (Recap)

Here’s the entire procedure, just in case you’ve missed something:

这是整个过程，以防万一您错过了一些东西：

CREATE OR REPLACE PROCEDURE impute_missing(
    in_table_name    IN VARCHAR2,
    in_attribute     IN VARCHAR2,
    in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
    impute_func   VARCHAR2(16);
    get_avg_stmt  VARCHAR2(128);
    avg_value     NUMBER;
    update_stmt   VARCHAR2(128);
BEGIN
    CASE in_impute_method
        WHEN ‘mean’   THEN impute_func := ‘AVG’;
        WHEN ‘median’ THEN impute_func := ‘MEDIAN’;
        WHEN ‘mode’   THEN impute_func := ‘STATS_MODE’;
        ELSE RAISE_APPLICATION_ERROR(-20001, ‘Invalid impute method!’);
    END CASE;    get_avg_stmt :=q’[SELECT ]’|| impute_func|| q’[(]’|| in_attribute|| q’[) FROM ]’|| in_table_name;
    DBMS_OUTPUT.PUT_LINE(‘get_avg_stmt = ‘ || get_avg_stmt);    BEGIN EXECUTE IMMEDIATE get_avg_stmt INTO avg_value;
    END;
    DBMS_OUTPUT.PUT_LINE(‘avg_value = ‘ || avg_value);    update_stmt :=
         q’[UPDATE ]’
      || in_table_name
      || q’[ SET ]’
      || in_attribute
      || q’[ = ]’
      || avg_value
      || q’[ WHERE ]’
      || in_attribute
      || q’[ IS NULL]’;
    DBMS_OUTPUT.PUT_LINE(‘update_stmt = ‘ || update_stmt);    BEGIN EXECUTE IMMEDIATE update_stmt;
    END;
    COMMIT;END;
/

And now, let’s test the thing!

现在，让我们测试一下！

测试中 (Testing)

We’ve done all the hard work, and now we can test if everything works as advertised. To do so, we’ll impute the missing values of age column with its median value. Here’s how:

我们已经完成了所有艰苦的工作，现在我们可以测试是否一切都如广告中所述。为此，我们将使用age列的中位数估算age列的缺失值。这是如何做：

BEGIN
    impute_missing(‘cp_titanic’, ‘age’, ‘median’);
END;
/

And that’s it! If we execute this block of code, we’ll get some output to the console, due to all DBMS_OUTPUT calls in the procedure:

就是这样！如果执行此代码块，由于该过程中的所有DBMS_OUTPUT调用，我们将获得一些输出到控制台：

Awesome! If we once again check for the number of missing values, we can see that everything worked out as it should:

太棒了！如果我们再次检查缺失值的数量，我们可以看到一切都按预期进行：

SELECT COUNT(*) AS num_age_missing 
FROM cp_titanic 
WHERE age IS NULL;

有什么可以改进的？ (What can be improved?)

Our procedure does the job, but that doesn’t mean it can’t be improved further. Here are a couple of ideas:

我们的程序可以完成工作，但这并不意味着它无法进一步改进。这里有一些想法：

Round the result — calling AVG on pretty much anything will result in a lot of decimal spaces
对结果取整-在几乎所有内容上调用AVG都会导致很多小数位
Have the option for the user to enter the value for imputation — instead of using mean/median/mode
用户可以选择输入插补值，而不是使用均值/中位数/众数

This is not a definite list, so feel free to tailor the procedure to your needs.

这不是一个明确的列表，因此请随时根据您的需要定制该过程。

Join my private email list for more helpful insights.

加入我的私人电子邮件列表，以获取更多有用的见解。