Logistic Regression in Python - Preparing Data

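Before creating a logistic regression classifier in Python, the data must be preprocessed, including converting it with one-hot encoding. This article shows how to encode the data, understand the data mapping, and drop the 'unknown' values in preparation for building the model.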


To create the classifier, we must prepare the data in the format required by the classifier-building module. We do this with one-hot encoding.

Encoding Data

We will discuss what encoding the data means shortly. First, let us run the code. Run the following command in the code window.


In [10]: # Create a one-hot encoding of the categorical columns.
data = pd.get_dummies(df, columns=['job', 'marital', 'default', 'housing', 'loan', 'poutcome'])

As the comment says, the statement above creates a one-hot encoding of the data. To see what it produced, examine the resulting DataFrame, called “data”, by printing its first few records.


In [11]: data.head()

You will see the following output:

[Figure: Created Data]

To understand this data, list the column names by running the data.columns command, as shown below:


In [12]: data.columns
Out[12]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'job_unknown', 'marital_divorced', 'marital_married', 'marital_single', 
'marital_unknown', 'default_no', 'default_unknown', 'default_yes', 
'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
'loan_unknown', 'loan_yes', 'poutcome_failure', 'poutcome_nonexistent', 
'poutcome_success'], dtype='object')

We will now explain how the get_dummies command performs the one-hot encoding. The first column in the newly generated DataFrame is the “y” field, which indicates whether the client has subscribed to a term deposit (TD). Next, look at the encoded columns. The first encoded column is “job”. In the original data, the “job” column takes many possible values, such as “admin.”, “blue-collar”, “entrepreneur”, and so on. For each possible value, a new column is created, named with the original column name as a prefix followed by the value.

Thus, we have columns called “job_admin.”, “job_blue-collar”, and so on. For each encoded field in the original DataFrame, the new DataFrame contains one column per value that the field takes in the original data. Examine the list of columns carefully to understand how the data is mapped to the new DataFrame.
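The naming scheme can be seen on a small scale with a sketch like the one below. The `toy_df` DataFrame and its values are made up for illustration and stand in for the tutorial's `df`; `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy stand-in for the tutorial's `df`; the values are invented for illustration.
toy_df = pd.DataFrame({
    "y": [0, 1, 0],
    "job": ["admin.", "blue-collar", "unknown"],
    "marital": ["married", "single", "married"],
})

# Encode only the listed columns; "y" is passed through unchanged.
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to booleans).
toy_data = pd.get_dummies(toy_df, columns=["job", "marital"], dtype=int)

# Each new column is named <original column>_<value>.
print(list(toy_data.columns))
print(toy_data)
```

Columns that are not listed in `columns` come first in the result, followed by one group of indicator columns per encoded field.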

Understanding Data Mapping

To understand the generated data, let us print the entire DataFrame by entering data. Part of the output is shown below.


In [13]: data

[Figure: Understanding Data Mapping]

The screen above shows the first twelve rows. If you scroll down further, you will see that the mapping is done for all the rows.
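Rather than scrolling, the mapping can also be checked programmatically. The following is a minimal sketch on made-up data (`toy_df` and `encoded_cols` are illustrative names, not part of the tutorial): right after encoding, every row should have exactly one 1 within each group of indicator columns.

```python
import pandas as pd

# Toy stand-in for the tutorial's `df`; the values are invented for illustration.
toy_df = pd.DataFrame({
    "job": ["admin.", "blue-collar", "unknown", "admin."],
    "housing": ["yes", "no", "yes", "unknown"],
})
encoded_cols = ["job", "housing"]
toy_data = pd.get_dummies(toy_df, columns=encoded_cols, dtype=int)

# For each encoded field, select its indicator columns by prefix and
# confirm that every row has exactly one of them set.
exactly_one = {}
for col in encoded_cols:
    group = toy_data.filter(regex=f"^{col}_")
    exactly_one[col] = bool((group.sum(axis=1) == 1).all())

print(exactly_one)
```

This check holds only immediately after encoding; once indicator columns are removed, some rows may have no 1 left in a group.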

Part of the output from further down the DataFrame is shown here for quick reference.

[Figure: Quick Reference]

To understand the mapped data, let us examine the first row.

[Figure: Mapped Data]

It shows that this customer has not subscribed to a TD, as indicated by the value in the “y” field. It also shows that this customer is a “blue-collar” customer. Scrolling horizontally, you will see that the customer has a “housing” loan and has taken no personal “loan”.
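Reading a row this way amounts to finding, within each group of indicator columns, the one that is set. A rough sketch of that lookup follows; `decode_row` is a hypothetical helper written for this illustration, and the single row is made up to mirror the customer described above.

```python
import pandas as pd

# One made-up encoded row resembling the customer described above.
toy_row_df = pd.DataFrame({
    "y": [0],
    "job_admin.": [0], "job_blue-collar": [1],
    "housing_no": [0], "housing_yes": [1],
    "loan_no": [1], "loan_yes": [0],
})

def decode_row(row, prefixes):
    """Hypothetical helper: map each prefix back to the value whose indicator is 1."""
    decoded = {}
    for prefix in prefixes:
        group = row[[c for c in row.index if c.startswith(prefix + "_")]]
        # idxmax returns the label of the largest entry; if no indicator
        # were set it would return the first label, so this assumes one is.
        decoded[prefix] = group.idxmax()[len(prefix) + 1:]
    return decoded

first = decode_row(toy_row_df.iloc[0], ["job", "housing", "loan"])
print(first)
```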

After one-hot encoding, the data needs some further processing before we can start building the model.

Dropping the “unknown”

If you examine the columns in the mapped DataFrame, you will find a few columns ending with “unknown”. For example, examine the column at index 12 with the following command:


In [14]: data.columns[12]
Out[14]: 'job_unknown'

This column flags customers whose job is unknown. There is little point in including such columns in our analysis and model building, so all columns ending in “unknown” should be dropped. This is done with the following command:


In [15]: data.drop(data.columns[[12, 16, 18, 21, 24]], axis=1, inplace=True)

Make sure that you specify the correct column numbers. If in doubt, you can check a column's name at any time by indexing data.columns with its position, as described earlier.
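Because positional indices are easy to get wrong, one alternative is to select the columns by name instead. This is a sketch of that variation on made-up data, not the tutorial's own command; `toy_data` stands in for the tutorial's `data`.

```python
import pandas as pd

# Toy stand-in for the encoded `data`; the values are invented for illustration.
toy_data = pd.DataFrame({
    "y": [0, 1],
    "job_admin.": [1, 0],
    "job_unknown": [0, 1],
    "marital_married": [1, 0],
    "marital_unknown": [0, 1],
})

# Collect every column whose name ends with "_unknown" and drop them by name,
# so the result does not depend on column positions.
unknown_cols = [c for c in toy_data.columns if c.endswith("_unknown")]
cleaned = toy_data.drop(columns=unknown_cols)

print(unknown_cols)
print(list(cleaned.columns))
```

Selecting by name keeps working if the column order changes, for example when a different set of fields is encoded.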

After dropping the unwanted columns, you can examine the final list of columns, as shown in the output below:


In [16]: data.columns
Out[16]: Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 
'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 
'job_services', 'job_student', 'job_technician', 'job_unemployed',
'marital_divorced', 'marital_married', 'marital_single', 'default_no', 
'default_yes', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes',
'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success'], 
dtype='object')

At this point, our data is ready for model building.

Translated from: https://www.tutorialspoint.com/logistic_regression_in_python/logistic_regression_in_python_preparing_data.htm
