Dataset
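The dataset used in this article contains various car-related attributes, such as engine displacement, vehicle weight, and acceleration. We use these attributes to predict where each car was made: North America, Europe, or Asia. The target therefore has three class labels, unlike the binary classification problems covered earlier.
- The dataset is a txt file rather than a csv file, and it has no header row: the column names are not part of the data, so the file holds values only. We therefore read it with the general-purpose read_table() function and supply the column names ourselves. The columns are: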
mpg – Miles per gallon, Continuous.
cylinders – Number of cylinders in the motor, Integer, Ordinal, and Categorical.
displacement – Size of the motor, Continuous.
horsepower – Horsepower produced, Continuous.
weight – Weight of the car, Continuous.
acceleration – Acceleration, Continuous.
year – Year the car was built, Integer and Categorical.
origin – 1=North America, 2=Europe, 3=Asia. Integer and Categorical
car_name – Name of the Car, will not be needed in this analysis.
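The origin coding listed above is what makes this a three-class problem. As a minimal sketch, the integer codes can be mapped to readable labels and counted per class. The sample values below are hypothetical and only illustrate the mapping; they are not taken from the dataset.

```python
import pandas

# Mapping taken from the column description above
ORIGIN_LABELS = {1: "North America", 2: "Europe", 3: "Asia"}

# Hypothetical sample of origin codes, for illustration only
origin_sample = pandas.Series([1, 1, 3, 2, 1, 3], name="origin")

# Translate the integer codes into region names
origin_named = origin_sample.map(ORIGIN_LABELS)

# Count how many rows fall into each of the three classes
class_counts = origin_named.value_counts()
print(class_counts)
```

The same two calls would apply to the origin column of the real DataFrame once it has been loaded.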
- read_table returns a DataFrame object, stored here in auto:
import pandas
import numpy as np
# Filename
auto_file = "auto.txt"
# Column names, not included in file
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
         'year', 'origin', 'car_name']
# Read in file
# Delimited by an arbitrary number of whitespaces
# (sep=r"\s+" replaces delim_whitespace=True, which is deprecated since pandas 2.2)
auto = pandas.read_table(auto_file, sep=r"\s+", names=names)
# Show the first 5 rows of the dataset
print(auto.head())
'''
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0   18          8           307       130.0    3504          12.0    70
1   15          8           350       165.0    3693          11.5    70
2   18          8           318       150.0    3436          11.0    70
3   16          8           304       150.0    3433          12.0    70
4   17          8           302       140.0    3449          10.5    70

   origin                   car_name
0       1  chevrolet chevelle malibu
1       1          buick skylark 320
2       1         plymouth satellite
3       1              amc rebel sst
4       1                ford torino
'''
print(auto.describe())
'''
              mpg   cylinders  displacement       weight  acceleration  \
count  398.000000  398.000000    398.000000   398.000000    398.000000
mean    23.514573    5.454774    193.425879  2970.424623     15.568090
std      7.815984    1.701004    104.269838   846.841774      2.757689
min      9.000000    3.000000     68.000000  1613.000000      8.000000
25%     17.500000    4.000000    104.250000  2223.750000     13.825000
50%     23.000000    4.000000    148.500000  2803.500000     15.500000
75%     29.000000    8.000000    262.000000  3608.000000     17.175000
max     46.600000    8.000000    455.000000  5140.000000     24.800000

             year      origin
count  398.000000  398.000000
mean    76.010050    1.572864
std      3.697627    0.802055
min     70.000000    1.000000
25%     73.000000    1.000000
50%     76.000000    1.000000
75%     79.000000    2.000000
max     82.000000    3.000000
'''
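The summary above lists only some of the columns because describe() by default reports numeric columns only. A small sketch of that behaviour follows, using a hypothetical three-row sample with a reduced set of columns. The "?" entry is an assumed stand-in for a non-numeric value and shows how a single such entry keeps a column out of the summary.

```python
import io
import pandas

# Hypothetical whitespace-delimited sample; "?" makes horsepower non-numeric
sample_text = (
    "18.0 130.0 3504\n"
    "25.0 ?     2046\n"
    "24.0 95.00 2372\n"
)

sample = pandas.read_table(
    io.StringIO(sample_text),
    sep=r"\s+",
    names=["mpg", "horsepower", "weight"],
)

# horsepower is read as text because of the "?" entry
print(sample.dtypes)

# describe() summarises the numeric columns only, so horsepower is absent
summary = sample.describe()
print(summary.columns.tolist())
```

Checking auto.dtypes on the full DataFrame is a quick way to see which columns were not parsed as numbers.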
Clean Dataset
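- auto contains missing values and an irrelevant column, so it needs cleaning first. car_name is not useful for this analysis. In addition, horsepower does not appear in the describe() summary, which suggests it has missing values. Inspecting the dataset confirms this: missing values are marked with ? in the file.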
# Delete the column car_name
del auto["car_name"]
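The "?" placeholder described above can be counted before deciding how to handle it. The following is a minimal sketch on a hypothetical horsepower series (the values are illustrative, not taken from the file). It counts the placeholders and then coerces the column to numbers with pandas.to_numeric, which is one possible approach and not necessarily the one used in this article.

```python
import pandas

# Hypothetical horsepower values as they appear when the column is read as text
horsepower_raw = pandas.Series(
    ["130.0", "?", "95.00", "?", "88.00"], name="horsepower"
)

# Count the rows that carry the "?" placeholder
n_missing = (horsepower_raw == "?").sum()

# Coerce to numbers; entries that cannot be parsed, including "?", become NaN
horsepower_num = pandas.to_numeric(horsepower_raw, errors="coerce")

print(n_missing, horsepower_num.isna().sum())
```

An alternative is to pass na_values="?" to read_table so the placeholder is treated as missing at load time.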