Prediction of Future Income by Using Logistic Regression
Matthew LaFrance, Yu Zhang
1 Introduction
Many factors could influence a person’s annual income, for example, age, gender, race, level of education, marriage status, nationality, etc. The authors tried to fit four models of influential factors that based on a census dataset and to find a precise prediction of annual income.
The dataset was found at the UCI Machine Learning Repository. The data consists of 48,844 observations from a 1994 U.S. Census. The target variable, “Salary”, has two levels: >50k and <=50k. There are 8 categorical features and 5
numeric features consisting of demographic, educational, and occupational information. Table 1 is the variables in SAS version.
Table 1 Variables in the Income Dataset
The target variable “Salary” is notably unbalanced (in table 2). As a result, we noted that using raw accuracy as a success metric could potentially be misleading because a model that classifies all observations as <=50k would achieve roughly 76% accuracy.
Table 2 Salary Overview
2 Data Preprocessing
2.1 Missing Values
3,622 observations contained missing values primarily in the “occupation” column. Because the occupation column had many factor levels, we decided many imputation methods wouldn’t retain enough information to justify the increase in bias, so we decided to take only complete cases. As a result, our conclusions may be biased, and we assume that the deleted observations had missing values at random. For future work, it may be worthwhile to explore other methods of handling the missing data. After deletion, the final dataset had 45,222 observations.
2.2 Multicollinearity Checks
None of our numeric features showed strong correlations between each other (see table 3). As a result, we were not particularly concerned about multicollinearity.
Table 3 The Result of Correlation Checks
2.3 Exploratory Data Analysis and Feature Engineering
In our initial looks at the data, we were able to make several noteworthy observations which will be detailed below by variables and summarized at the end.
2.3.1 Capital Gains and Capital Losses
Figure 1 Overview of Capital Gains
Figure 2 Overview of Capital Losses
Regarding the capital gains and loss variables, it is worth noting that most individuals in our dataset do not have any investments (see Figure 1 and Figure 2).
2.3.2 Native Country
As expected, most observations are U.S. natives (see Figure 3). The native country variable consists of many factor levels. In order to avoid having too many dummy variables later on, we decided it would be necessary to rebin this feature into a “0, and 1 ” indicator variable of being a native-born citizen.
Figure 3 Overview of Native Countries
2.3.3 Marital Status
In looking into the marital status feature, we noticed several different levels all representing married (see Figure 4). We decided to combine all these levels into one level, “married”, for more convenient interpretation. Additionally, it was worth noting that married individuals appear to more consistently make greater than 50k.
Figure 4 Overview of Marital Status
2.3.4 Education Level
As expected with e