Exploratory Data Analysis of the Used-Car Market

Latest recommended article published on 2024-06-18 10:25:52

Great1414


Category: Data Analysis. Tags: BeautifulSoup, pandas, matplotlib, sklearn

Article link: https://blog.csdn.net/weixin_41512727/article/details/80041915


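Project description: In recent years, as the automotive industry has developed, the used-car market has become increasingly active. Using data collected from the used-car market, this project studies and analyzes the factors that affect used-car prices.

Project responsibilities:

1. Collect used-car market data and preprocess the dataset.

2. Use visual analysis to identify the factors that influence used-car prices.

3. For the key factors, analyze how each one affects used-car prices.

4. Produce the analysis charts and write the analysis report.

Step 1: Data acquisition. Scrape the information for every used car listed: 1. Find the target link for each car brand. 2. Collect the links of all listing pages under each brand. 3. Use those links to extract the details of every used car. 4. Save each car and its details for later analysis.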

## ********************************** Step 1: Scrape all used-car brands **********************************
# Import third-party packages
import requests
from bs4 import BeautifulSoup
import time

# Set the request headers
headers = {
    'Accept':'*/*',
    'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
}
# URL of the used-car home page; fetch it and parse the HTML
url = 'http://shanghai.taoche.com/all/'
res = requests.get(url, headers = headers).text
soup = BeautifulSoup(res,'html.parser')

# Scrape each brand name and its link
car_brands = soup.findAll('div',{'class':'brand-name'})
car_brands = [j for i in car_brands for j in i]
brands = [i.text for i in car_brands]
urls = ['http://shanghai.taoche.com' + i['href'] for i in car_brands]


## ********************************** Step 2: Collect the target links of all listing pages **********************************
# Create empty lists to hold the target links to scrape
target_urls = []
target_brands = []

for b,u in zip(brands,urls):
    # Read the total number of pages from each brand's listing page
    res = requests.get(u, headers = headers).text
    soup = BeautifulSoup(res,'html.parser')
    
    if len(soup.findAll('div',{'class':'the-pages'})) == 0:
        pages = 1
    else:
        pages = int([page.text for page in soup.findAll('div',{'class':'the-pages'})[0].findAll('a')][-2])
    time.sleep(3)
    
    for i in range(1,pages + 1):
        target_brands.append(b)
        target_urls.append(u+'?page='+str(i)+'#pagetag')
        

## ********************************** Step 3: Collect the used-car details **********************************
# Create empty lists to store the data
brand = []
title = []
boarding_time = []
km = []
discharge = []
sec_price = []
new_price = []

# Send a request to each link
for b,u in zip(target_brands,target_urls):
    
    res = requests.get(u, headers = headers).text
    soup = BeautifulSoup(res,'html.parser')
    
    try:
        details = soup.findAll('div',{'class':'item_details'})
        prices = soup.findAll('div',{'class':'item_price'})
        # Parse every field into page-local lists first, so an IndexError
        # part-way through cannot leave the result lists with unequal lengths
        page_title = [i.findAll('a')[0]['title'] for i in details]
        # Registration date, mileage, and emission standard of each car
        info = [i.findAll('li') for i in soup.findAll('ul',{'class':'ul_news'})]
        page_time = [i[0].text[4:] for i in info]
        page_km = [i[1].text[4:] for i in info]
        page_discharge = [i[3].text[4:] for i in info]
        page_sec = [float(i.findAll('h2')[0].text[:-1]) for i in prices]
        page_new = [i.findAll('p')[0].text.split('\xa0')[0][5:].strip() for i in prices]
    except IndexError:
        pass
    else:
        # Repeat the brand once per car on this page
        brand.extend([b] * len(page_title))
        title.extend(page_title)
        boarding_time.extend(page_time)
        km.extend(page_km)
        discharge.extend(page_discharge)
        sec_price.extend(page_sec)
        new_price.extend(page_new)
    # Pause for 3 seconds between requests
    time.sleep(3)

    
## ********************************** Step 4: Save the scraped data **********************************
# Export the data
import pandas as pd
cars_info = pd.DataFrame([brand,title,boarding_time,km,discharge,sec_price,new_price]).T
cars_info = cars_info.rename(columns={0:'Brand',1:'Name',2:'Boarding_time',3:'Km',4:'Discharge',5:'Sec_price',6:'New_price'})
cars_info.to_csv('second_cars_info.csv', index=False)
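
The scraper above reads the page count from the second-to-last pagination link, which works as long as the last link is a "next page" button. As a rough sketch of a slightly more defensive variant, the hypothetical helper `last_page_number` below takes the largest numeric link text instead, and `build_page_urls` reproduces the `?page=N#pagetag` pattern used in step 2. Both function names, the sample link texts, and the example URL are assumptions made for illustration; they are not taken from the site.

```python
def last_page_number(link_texts):
    """Return the largest page number among pagination link texts.

    Falls back to 1 when no link text is purely numeric, which mirrors
    the single-page case handled in the scraper.
    """
    stripped = (text.strip() for text in link_texts)
    numbers = [int(text) for text in stripped if text.isdecimal()]
    return max(numbers, default=1)


def build_page_urls(brand_url, pages):
    """Build one listing URL per page, using the same pattern as step 2."""
    return [f"{brand_url}?page={i}#pagetag" for i in range(1, pages + 1)]


# Hypothetical pagination texts: page numbers, an ellipsis, and a "next" button
sample_links = ['1', '2', '3', '...', '12', '下一页']
pages = last_page_number(sample_links)
urls = build_page_urls('http://example.com/brand/', pages)
```

In the scraper, the list of link texts would come from the `a` tags inside the `the-pages` div, as in the original loop.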

Step 2: Data cleaning. Preprocess the scraped data.

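Looking at the scraped table, the overall data structure is clear: the variables are the car brand, car model, registration date, mileage, emission standard, used-car price, and the reference price of the same model when new. Some problems are also apparent: the registration date contains the value '未上牌' (not yet registered), and the mileage, new-car price, and registration date are all stored as strings. The data therefore needs preprocessing.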
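
As a minimal sketch of how the string-valued mileage and new-car price might be converted to numbers, the hypothetical helper `parse_wan` below pulls the leading number out of values such as `9.00万公里` and `50.89万`. It assumes every usable value has the form "number followed by 万" (units of 10,000), as in the sample rows shown later; any value in another format returns `None` and would need separate handling.

```python
import re

# A number (integer or decimal) immediately followed by 万
_WAN_PATTERN = re.compile(r'\s*(\d+(?:\.\d+)?)万')


def parse_wan(text):
    """Return the leading number of a '<number>万...' string as a float.

    The result is expressed in units of 万 (10,000). Strings that do not
    start with '<number>万' return None.
    """
    match = _WAN_PATTERN.match(text)
    return float(match.group(1)) if match else None


km_value = parse_wan('9.00万公里')    # mileage, in 万 km
price_value = parse_wan('50.89万')    # new-car price, in 万 yuan
```

With pandas, the same helper could be applied column-wise, for example `cars.Km.map(parse_wan)`.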

    In [12]: 
  

# Import third-party modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Configure matplotlib to render Chinese characters
plt.rcParams['font.sans-serif'] = 'Microsoft YaHei'
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('ggplot')

# Load the data
cars = pd.read_csv('C:/Users/Administrator/Desktop/second_cars_info.csv')

#******************** Part 1: Data preprocessing *****************************
# Share of used cars marked '未上牌' (not yet registered)
N = np.sum(cars.Boarding_time == '未上牌')
Ratio = N/cars.shape[0]
Ratio

      Out[12]: 
    

0.00824395000443223

    In [13]: 
  

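# Unregistered cars are a very small share (about 0.8%), so we simply drop them
# .copy() avoids a SettingWithCopyWarning when new columns are added later
cars = cars.loc[cars.Boarding_time != '未上牌',:].copy()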

    In [14]: 
  

# Renumber the rows from 0 after dropping the unregistered cars
cars = cars.reset_index(drop=True)
# Extract the year and month from the registration date
cars['year'] = cars.Boarding_time.str[:4].astype('int')
# str.extract returns the captured group directly, so no list unpacking is needed
cars['month'] = cars.Boarding_time.str.extract('年(.*?)月', expand=False).astype('int')

# Number of months from the registration date to March 2018, counting both end months
cars['diff_months'] = (2018-cars.year)*12 + (3-cars.month) + 1
# Show the first 5 rows
cars.head()

      Out[14]: 
    

	Brand	Name	Boarding_time	Km	Discharge	Sec_price	New_price	year	month	diff_months
0	奥迪	奥迪A6L 2006款 2.4 CVT 舒适型	2006年8月	9.00万公里	国3	6.90	50.89万	2006	8	140
1	奥迪	奥迪A6L 2007款 2.4 CVT 舒适型	2007年1月	8.00万公里	国4	8.88	50.89万	2007	1	135
2	奥迪	奥迪A6L 2004款 2.4L 技术领先型
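
The `diff_months` values can be checked by hand. For the first row, registered in 2006年8月, the formula gives (2018 - 2006) * 12 + (3 - 8) + 1 = 144 - 5 + 1 = 140, which matches the table. The sketch below repeats the same arithmetic in plain Python; the helper name `months_since_registration` and its default reference date of March 2018 are illustrative assumptions based on the formula above.

```python
import re


def months_since_registration(boarding_time, ref_year=2018, ref_month=3):
    """Months from a 'YYYY年M月' registration date to the reference month.

    Both the registration month and the reference month are counted,
    hence the + 1. Returns None for strings in any other format.
    """
    match = re.fullmatch(r'(\d{4})年(\d{1,2})月', boarding_time.strip())
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    return (ref_year - year) * 12 + (ref_month - month) + 1


first_row = months_since_registration('2006年8月')
second_row = months_since_registration('2007年1月')
```

Here `first_row` and `second_row` reproduce the 140 and 135 shown in the first two rows of the output.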
