数据分析实例1（英文报告）--预测未来收入--SAS 逻辑回归--1994年美国人口普查数据

本文链接：https://blog.csdn.net/m0_47089011/article/details/105533444

该博客通过分析1994年美国人口普查数据，利用SAS进行逻辑回归，探讨了年龄、性别、种族、教育水平等因素如何影响个人年收入。研究发现，年龄、投资、教育年限、工作小时数、工作类别、婚姻状况、种族和国籍等因素对收入有显著影响。尽管随机森林和梯度提升模型在准确性上优于逻辑回归，但逻辑回归由于其解释性而仍然受到青睐。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Prediction of Future Income by Using Logistic Regression

Matthew LaFrance， Yu Zhang

1 Introduction

Many factors could influence a person’s annual income, for example, age, gender, race, level of education, marriage status, nationality, etc. The authors tried to fit four models of influential factors that based on a census dataset and to find a precise prediction of annual income.

The dataset was found at the UCI Machine Learning Repository. The data consists of 48,844 observations from a 1994 U.S. Census. The target variable, “Salary”, has two levels: >50k and <=50k. There are 8 categorical features and 5
numeric features consisting of demographic, educational, and occupational information. Table 1 is the variables in SAS version.

Table 1 Variables in the Income Dataset
在这里插入图片描述
The target variable “Salary” is notably unbalanced (in table 2). As a result, we noted that using raw accuracy as a success metric could potentially be misleading because a model that classifies all observations as <=50k would achieve roughly 76% accuracy.

Table 2 Salary Overview
在这里插入图片描述

2 Data Preprocessing

2.1 Missing Values

3,622 observations contained missing values primarily in the “occupation” column. Because the occupation column had many factor levels, we decided many imputation methods wouldn’t retain enough information to justify the increase in bias, so we decided to take only complete cases. As a result, our conclusions may be biased, and we assume that the deleted observations had missing values at random. For future work, it may be worthwhile to explore other methods of handling the missing data. After deletion, the final dataset had 45,222 observations.

2.2 Multicollinearity Checks

None of our numeric features showed strong correlations between each other (see table 3). As a result, we were not particularly concerned about multicollinearity.

Table 3 The Result of Correlation Checks
在这里插入图片描述

2.3 Exploratory Data Analysis and Feature Engineering

In our initial looks at the data, we were able to make several noteworthy observations which will be detailed below by variables and summarized at the end.

2.3.1 Capital Gains and Capital Losses

Figure 1 Overview of Capital Gains
在这里插入图片描述
Figure 2 Overview of Capital Losses

Regarding the capital gains and loss variables, it is worth noting that most individuals in our dataset do not have any investments (see Figure 1 and Figure 2).

2.3.2 Native Country
As expected, most observations are U.S. natives (see Figure 3). The native country variable consists of many factor levels. In order to avoid having too many dummy variables later on, we decided it would be necessary to rebin this feature into a “0, and 1 ” indicator variable of being a native-born citizen.
在这里插入图片描述
Figure 3 Overview of Native Countries

2.3.3 Marital Status
In looking into the marital status feature, we noticed several different levels all representing married (see Figure 4). We decided to combine all these levels into one level, “married”, for more convenient interpretation. Additionally, it was worth noting that married individuals appear to more consistently make greater than 50k.
在这里插入图片描述
Figure 4 Overview of Marital Status