# necessary to remove rows with incorrect labels in test dataset
# (.copy() avoids SettingWithCopyWarning when "Target" is re-encoded below)
data_test = data_test[
    (data_test["Target"] == " >50K.") | (data_test["Target"] == " <=50K.")
].copy()
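# encode target variable as integer
data_train.loc[data_train["Target"] == " <=50K", "Target"] = 0
data_train.loc[data_train["Target"] == " >50K", "Target"] = 1
data_test.loc[data_test["Target"] == " <=50K.", "Target"] = 0
data_test.loc[data_test["Target"] == " >50K.", "Target"] = 1
# cast to int so "Target" is not treated as an object (categorical) column later
data_train["Target"] = data_train["Target"].astype(int)
data_test["Target"] = data_test["Target"].astype(int)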
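At first glance the first statement looks redundant, but it keeps only the test rows whose label is exactly " >50K." or " <=50K." and discards rows with malformed labels.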
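The same filter-then-encode idea can be written more compactly with `isin` and `map`. This is only a sketch on a made-up frame; `toy_test`, `valid_labels` and `toy_clean` are illustrative names, not part of the original notes.

```python
import pandas as pd

# Toy stand-in for the test set (assumption: the real data_test is loaded elsewhere).
toy_test = pd.DataFrame(
    {
        "Age": [25, 38, 52, 41],
        "Target": [" <=50K.", " >50K.", None, " <=50K."],
    }
)

valid_labels = [" <=50K.", " >50K."]

# Keep only rows whose label is one of the two valid strings.
toy_clean = toy_test[toy_test["Target"].isin(valid_labels)].copy()

# Map the string labels to integers in one step.
toy_clean["Target"] = toy_clean["Target"].map({" <=50K.": 0, " >50K.": 1}).astype(int)

print(toy_clean)
```

The row with the missing label is dropped, and the remaining labels become 0 or 1.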
data_test.describe(include="all").T
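Transposing with `.T` gives one row per column of the data, which is usually easier to read and closer to the layout people expect.
The training set and the test set are inspected separately.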
# choose categorical and continuous features from data
categorical_columns = [
    c for c in data_train.columns if data_train[c].dtype.name == "object"
]
numerical_columns = [
    c for c in data_train.columns if data_train[c].dtype.name != "object"
]
print("categorical_columns:", categorical_columns)
print("numerical_columns:", numerical_columns)
Split the columns into categorical and numerical variables so that each group can be handled on its own.
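A possible alternative to the list comprehensions is `DataFrame.select_dtypes`. The sketch below uses a made-up frame `toy`; the column names are only examples.

```python
import pandas as pd

# Toy frame (assumption): one explicitly object-typed column and two numeric ones.
toy = pd.DataFrame(
    {
        "Age": [25, 38],
        "Workclass": pd.Series(["Private", "State-gov"], dtype="object"),
        "Hours_per_week": [40, 50],
    }
)

# Object-typed columns are treated as categorical.
categorical = toy.select_dtypes(include="object").columns.tolist()
# Numeric columns are treated as continuous.
numerical = toy.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical)
print("numerical:", numerical)
```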
# fill missing data using statistics computed on the training set
for c in categorical_columns:
    train_mode = data_train[c].mode()[0]
    data_train[c] = data_train[c].fillna(train_mode)
    data_test[c] = data_test[c].fillna(train_mode)
for c in numerical_columns:
    train_median = data_train[c].median()
    data_train[c] = data_train[c].fillna(train_median)
    data_test[c] = data_test[c].fillna(train_median)
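Missing values in categorical variables are filled with the mode; missing values in numerical variables are filled with the median (not the mean). Both statistics come from the training set.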
data_train = pd.concat(
    [data_train[numerical_columns], pd.get_dummies(data_train[categorical_columns])],
    axis=1,
)
data_test = pd.concat(
    [data_test[numerical_columns], pd.get_dummies(data_test[categorical_columns])],
    axis=1,
)
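A remarkably neat way to do dummy-variable (one-hot) encoding; my earlier approach was much clumsier.
The test set may lack some categories that appear in the training set, so its dummy columns can differ from the training columns. Use a set difference of the two column lists to find the feature that is missing, then add it to the test set as a feature whose values are all 0.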
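The set-difference step can be sketched as follows. The frames and category values are made up for illustration; only the technique (difference of column sets, then an all-zero column) comes from the notes.

```python
import pandas as pd

# Toy data (assumption): the category "Holland" appears only in the training set.
toy_train = pd.DataFrame({"Age": [25, 38, 52], "Country": ["US", "Holland", "US"]})
toy_test = pd.DataFrame({"Age": [30, 41], "Country": ["US", "US"]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Columns present in the training set but absent from the test set.
missing = set(train_enc.columns) - set(test_enc.columns)

# Add each missing feature to the test set as an all-zero column.
for col in missing:
    test_enc[col] = 0

# Put the test columns in the same order as the training columns.
test_enc = test_enc[train_enc.columns]

print(missing)
print(test_enc.columns.tolist())
```

If you prefer a single call, `test_enc.reindex(columns=train_enc.columns, fill_value=0)` should give the same column layout.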
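`read_html` returns a collection (a list) of all the tables it finds; pick the one you want from that list.
Finding the right one takes some very old-fashioned counting by hand, which is oddly fun.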