A project on GitHub that I am following along with to learn.
Data Preprocessing | Day 1
The practice data looks like this:
Country | Age | Salary | Purchased |
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
Read into a Jupyter Notebook, the feature columns (everything except Purchased) look like this:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
There are 6 steps in total:
1. Import the libraries
numpy
pandas
2. Import the dataset
pd.read_csv()
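Steps 1 and 2 could look roughly like the sketch below. The notes do not name the CSV file, so the table is inlined as text and read through `io.StringIO`; the `CSV_TEXT` name and the column slicing are my own assumptions, not necessarily what the project does.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the CSV file (assumption: the real file name is not given
# in these notes, so the table is inlined as text).
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# pd.read_csv accepts a path or any file-like object.
dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Features: every column except the last. Target: the last column.
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

print(X)
print(Y)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.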
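3. Handle missing data
sklearn.preprocessing's Imputer (removed in scikit-learn 0.22; current versions use sklearn.impute.SimpleImputer instead): fill missing values with the column mean, using fit and transform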
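A minimal sketch of the mean imputation, written against `sklearn.impute.SimpleImputer`, which is the class current scikit-learn releases provide for this. The explicit cast to `float` is my own precaution for the object-typed array, not something the notes prescribe.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix as shown in the notebook output above.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
    ["Germany", 40.0, np.nan],
    ["France", 35.0, 58000.0],
    ["Spain", np.nan, 52000.0],
    ["France", 48.0, 79000.0],
    ["Germany", 50.0, 83000.0],
    ["France", 37.0, 67000.0],
], dtype=object)

# Only the numeric columns (Age, Salary) are imputed.
numeric = X[:, 1:3].astype(float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(numeric)               # learns the per-column means
X[:, 1:3] = imputer.transform(numeric)       # replaces each NaN with its column mean

print(X)
```

The missing Age becomes the mean of the nine known ages, and the missing Salary the mean of the nine known salaries.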
4. Encode categorical data
sklearn.preprocessing's LabelEncoder and OneHotEncoder: encode with fit_transform
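One way the encoding step might be written with current scikit-learn is sketched below. Applying `OneHotEncoder` to the Country column and `LabelEncoder` to the Purchased column is my assumption about how the two encoders are split; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Country column, reshaped to 2-D because OneHotEncoder expects
# input of shape (n_samples, n_features).
countries = np.array([
    "France", "Spain", "Germany", "Spain", "Germany",
    "France", "Spain", "France", "Germany", "France",
]).reshape(-1, 1)

# Purchased column (the target).
purchased = np.array([
    "No", "Yes", "No", "No", "Yes",
    "Yes", "No", "Yes", "No", "Yes",
])

# One-hot encode the countries. The default output is a sparse matrix,
# so convert it to a dense array for inspection.
onehot = OneHotEncoder()
country_dummies = onehot.fit_transform(countries).toarray()

# Map the Yes/No labels to integers.
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(purchased)

print(onehot.categories_[0])
print(country_dummies)
print(Y)
```

One-hot encoding avoids implying an order among the countries, which a plain integer code would.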
5. Split the dataset into a training set and a test set
sklearn.model_selection's train_test_split
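A small sketch of the split. The 80/20 ratio (`test_size=0.2`) and `random_state=0` are assumptions chosen for illustration, and the feature matrix here is a placeholder rather than the real preprocessed data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (10 samples, 2 columns), standing in for the
# preprocessed feature matrix.
X = np.arange(20, dtype=float).reshape(10, 2)
# Encoded Purchased labels for the 10 rows.
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state makes the
# shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 10 rows, this leaves 8 for training and 2 for testing.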
6. Feature scaling
sklearn.preprocessing's StandardScaler: fit_transform
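Standardization might be sketched as follows. The split of rows into training and test sets below is hypothetical, and fitting the scaler on the training set only (then reusing it on the test set) is a common convention I am assuming here, not something the notes spell out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary for a hypothetical training/test split of the complete rows.
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [35.0, 58000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
])
X_test = np.array([[37.0, 67000.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation from
# the training data and scales it to zero mean and unit variance.
X_train_scaled = scaler.fit_transform(X_train)
# The test data is scaled with the statistics learned from training.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Scaling puts Age and Salary on comparable ranges, so the much larger salary values do not dominate distance-based models.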