Car Classification: Multiclass Classification

Latest recommended article published on 2024-09-01 08:47:43

mmい

Category: Machine Learning

Article link: https://blog.csdn.net/zm714981790/article/details/51243955

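Summary (generated automatically by CSDN): This post analyzes several attributes of cars, such as mpg and cylinders, and uses multiclass classification to predict whether a car comes from North America, Europe, or Asia. It covers data preprocessing, the use of dummy variables, and logistic regression for multiclass classification. It also discusses evaluation metrics, including the confusion matrix, average accuracy, precision, recall, and F-score, and evaluates the model with the sklearn library.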

Dataset

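The dataset used in this post contains various car-related measurements, such as engine displacement, vehicle weight, and acceleration. We use them to predict where each car comes from: North America, Europe, or Asia. There are three class labels here, which makes this different from the binary classification problems covered earlier.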

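The dataset is a txt file rather than a csv file, and it has no header row: the file contains only the data, so the column names are not part of it. We therefore read it with the general-purpose read_table() function and supply the column names ourselves. The columns are: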

mpg – Miles per gallon, Continuous.
cylinders – Number of cylinders in the motor, Integer, Ordinal, and Categorical.
displacement – Size of the motor, Continuous.
horsepower – Horsepower produced, Continuous.
weight – Weight of the car, Continuous.
acceleration – Acceleration, Continuous.
year – Year the car was built, Integer and Categorical.
origin – 1=North America, 2=Europe, 3=Asia. Integer and Categorical
car_name – Name of the Car, will not be needed in this analysis.

After the data is read with read_table, the returned auto is a DataFrame object:

import pandas
import numpy as np

# Filename
auto_file = "auto.txt"

# Column names, not included in file
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 
         'year', 'origin', 'car_name']

# Read in file 
# Delimited by an arbitrary number of whitespaces 
# (sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas)
auto = pandas.read_table(auto_file, sep=r"\s+", names=names)

# Show the first 5 rows of the dataset
print(auto.head())
'''
  mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0   18          8           307      130.0    3504          12.0    70   
1   15          8           350      165.0    3693          11.5    70   
2   18          8           318      150.0    3436          11.0    70   
3   16          8           304      150.0    3433          12.0    70   
4   17          8           302      140.0    3449          10.5    70   

   origin                   car_name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
'''
print(auto.describe())
'''
              mpg   cylinders  displacement       weight  acceleration  \
count  398.000000  398.000000    398.000000   398.000000    398.000000   
mean    23.514573    5.454774    193.425879  2970.424623     15.568090   
std      7.815984    1.701004    104.269838   846.841774      2.757689   
min      9.000000    3.000000     68.000000  1613.000000      8.000000   
25%     17.500000    4.000000    104.250000  2223.750000     13.825000   
50%     23.000000    4.000000    148.500000  2803.500000     15.500000   
75%     29.000000    8.000000    262.000000  3608.000000     17.175000   
max     46.600000    8.000000    455.000000  5140.000000     24.800000   

             year      origin  
count  398.000000  398.000000  
mean    76.010050    1.572864  
std      3.697627    0.802055  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  
'''

Clean Dataset

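Because auto contains missing values and an irrelevant column, the data needs cleaning first. First, car_name is not relevant to the analysis. Second, horsepower does not appear in the describe() output above. describe() summarizes only numeric columns by default, so this suggests the column was not read as numbers, probably because of missing values. Inspecting the dataset confirms this: missing values are marked with "?" in the file.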

# Delete the column car_name
del auto["car_name"]
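
The handling of the "?" entries in horsepower is not shown here, so the following is only a minimal sketch of one common approach, not necessarily the steps the original post takes. It assumes the file layout described above. The sample rows are illustrative: the first three are taken from the head() output, and the two rows containing "?" are made up for the demonstration.

```python
import io

import pandas as pd

# Small sample in the same whitespace-delimited layout as auto.txt.
# The last two rows are hypothetical and exist only to carry the "?" marker.
sample = '''18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
25.0 4 98.00 ? 2046. 19.0 71 1 "hypothetical car a"
21.0 6 200.0 ? 2875. 17.0 74 1 "hypothetical car b"
'''

names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
         'acceleration', 'year', 'origin', 'car_name']

auto = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=names)
del auto["car_name"]

# The "?" entries force pandas to read horsepower as strings rather than
# numbers, which is why describe() leaves the column out.
n_missing = (auto["horsepower"] == "?").sum()

# Convert to numeric ("?" becomes NaN), then drop the rows without a value.
auto["horsepower"] = pd.to_numeric(auto["horsepower"], errors="coerce")
clean = auto.dropna(subset=["horsepower"]).reset_index(drop=True)

print(n_missing)
print(clean.shape)
```

An alternative with the same effect is to pass na_values="?" when reading the file, so that horsepower is parsed as a float column from the start. Dropping rows is the simplest choice; whether it is appropriate depends on how many rows are affected.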
