Logistic Regression in Python - Restructuring Data

Latest recommended article published on 2024-05-06 14:24:47

cunzai1985

Tags: python, java, big data, data analysis, machine learning

Original link: https://www.tutorialspoint.com/logistic_regression_in_python/logistic_regression_in_python_restructuring_data.htm

Whenever an organization conducts a survey, it tries to collect as much information as possible from the customer, on the idea that the information will be useful to the organization in one way or another at a later point. To solve the problem at hand, we have to pick out the information that is directly relevant to it.

Displaying All Fields

Now, let us see how to select the data fields useful to us. Run the following statement in the code editor.


In [6]: print(list(df.columns))

You will see the following output −


['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 
'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 
'euribor3m', 'nr_employed', 'y']

The output shows the names of all the columns in the dataset. The last column, "y", is a binary flag indicating whether the customer has a term deposit with the bank. The values of this field are described as either "y" or "n", although the df.head() output later in this section shows them encoded as 1 and 0. You can read the description and purpose of each column in the banks-name.txt file that was downloaded as part of the data.

Eliminating Unwanted Fields

Examining the column names, you will see that some of the fields have no significance to the problem at hand. For example, fields such as month, day_of_week, and campaign are of no use to us. We will eliminate these fields from the DataFrame. To drop columns, we use the drop method as shown below:


In [8]: # Drop the columns that are not needed, selected by position.
        df.drop(df.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]],
                axis=1, inplace=True)

The command drops the columns at positions 0, 3, 7, 8, and so on. To confirm that a position refers to the column you expect, run the following statement before the drop, since the positions shift once columns are removed:


In [7]: df.columns[9]
Out[7]: 'day_of_week'

This prints the column name for the given index.
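Checking one position at a time is slow when fourteen positions are involved. As a minimal sketch, the positions can all be listed at once with enumerate. The empty DataFrame below is a hypothetical stand-in that only reproduces the column names printed earlier; with the real data you would use the loaded df directly.

```python
import pandas as pd

# Hypothetical stand-in: an empty frame with the column names shown above.
columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
           'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
           'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
           'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']
df = pd.DataFrame(columns=columns)

# Pair every column name with its zero-based position.
positions = dict(enumerate(df.columns))
for index, name in positions.items():
    print(index, name)

# Positions passed to the drop call in this section.
to_drop = [0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]

# Names that would be removed, and names that would remain.
dropped_names = [positions[i] for i in to_drop]
kept_names = [name for i, name in positions.items() if i not in to_drop]
print(kept_names)
```

Comparing kept_names against the columns you intend to keep is a quick way to catch an off-by-one position before modifying the data in place.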

After dropping the columns that are not required, examine the data with the head method. The screen output is shown here:


In [9]: df.head()
Out[9]:
           job  marital  default housing loan     poutcome  y
0  blue-collar  married  unknown     yes   no  nonexistent  0
1   technician  married       no      no   no  nonexistent  0
2   management   single       no     yes   no      success  1
3     services  married       no      no   no  nonexistent  0
4      retired  married       no     yes   no      success  1

Now we have only the fields that we consider important for our data analysis and prediction. This is the step where the data scientist's judgment matters: the data scientist has to select the appropriate columns for model building.
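Selecting columns by position is fragile if the column order ever changes. As an alternative sketch (not the approach used in this tutorial), the same kind of selection can be written with column names. The rows below are made up for illustration and only a few of the real columns are included.

```python
import pandas as pd

# Made-up rows; only the column names are taken from the dataset.
df = pd.DataFrame({
    'age': [44, 53, 28],
    'job': ['blue-collar', 'technician', 'management'],
    'marital': ['married', 'married', 'single'],
    'month': ['aug', 'nov', 'jun'],
    'poutcome': ['nonexistent', 'nonexistent', 'success'],
    'y': [0, 0, 1],
})

# Option 1: drop unwanted columns by name. This returns a new frame
# and leaves df unchanged because inplace is not set.
trimmed = df.drop(columns=['age', 'month'])

# Option 2: list the columns to keep, which documents the choice explicitly.
keep = ['job', 'marital', 'poutcome', 'y']
selected = df[keep]

print(list(trimmed.columns))
print(list(selected.columns))
```

Listing the columns to keep tends to read better when the kept set is small, as it is here, while dropping by name is more convenient when only a few columns are removed.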

For example, the type of job may not seem at first glance to merit inclusion, but it is a very useful field. Not all types of customers will open a term deposit (TD). Lower-income customers may not open TDs, while higher-income customers will usually park their excess money in them. The type of job therefore becomes significantly relevant in this scenario. Likewise, carefully select the columns that you feel will be relevant to your analysis.
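One way to sanity-check such an intuition is to compare the share of term-deposit holders across job types. The following is a rough sketch on made-up rows, assuming y is encoded as 0 or 1 as in the head output above; the numbers are illustrative and say nothing about the real bank data.

```python
import pandas as pd

# Made-up sample; values are illustrative, not taken from the bank data.
df = pd.DataFrame({
    'job': ['retired', 'retired', 'services', 'services',
            'management', 'management'],
    'y': [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 target within each group is the share of customers
# in that group who hold a term deposit.
rate_by_job = df.groupby('job')['y'].mean().sort_values(ascending=False)
print(rate_by_job)
```

A column whose groups show clearly different rates is a plausible candidate to keep, though a difference in a small sample is only a hint and not evidence of predictive value.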

In the next chapter, we will prepare our data for building the model.
