Predicting New York City Taxi Trip Duration

Goodsta

Tags: machine learning

Article link: https://blog.csdn.net/wong2016/article/details/87548314

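This case study uses R and about 1.5 million New York City taxi trip records to build a predictive model and to explore how trip duration relates to passenger count, vendor, date/time, and other factors. The data are preprocessed and analyzed through feature engineering and an XGBoost regression model. The analysis finds associations between trip duration and features such as passenger count and vendor ID, and it also handles outliers.

(Summary generated automatically by CSDN.)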

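Live interactive platform for detailed machine learning case studies:
Machine Learning Training Camp (to join the group, contact QQ: 2279055353)
Preview of the next live case: using big data to predict fluctuations in product sales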

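Case Overview

This case study asks us to build a model that predicts New York City taxi trip duration from the attributes of each trip. The dataset comes from Google Cloud Platform. The code is written in R.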

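Our solution proceeds in three steps:

Visualize the dataset, engineer new features, and check for outliers.
Add external datasets.
Fit an XGBoost regression model.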

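Data Description

The data consist of about 1.5 million training observations in train.csv and about 630,000 test observations in test.csv. Each row represents one taxi trip.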

Introduction

Loading R packages and functions

First, we load the required R packages.

library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('alluvial') # visualisation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('lubridate') # date and time
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps
library('xgboost') # modelling
library('caret') # modelling

Next, we define a multi-plot helper function that will be used for the visualizations.

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
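
To make the default layout easier to picture, here is a minimal base-R sketch of the layout step alone. The values of numPlots and cols are example values chosen for illustration, not taken from the article. It reuses the same matrix expression that multiplot evaluates when layout is NULL, then looks up the grid cell for one plot.

```r
# Illustration only: numPlots and cols are example values, not from the article.
numPlots <- 3
cols <- 2

# Same expression multiplot uses when layout is NULL.
layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
                 ncol = cols, nrow = ceiling(numPlots / cols))
print(layout)

# Grid cell (row, col) in which plot 3 would be drawn.
pos <- as.data.frame(which(layout == 3, arr.ind = TRUE))
print(pos)
```

Because matrix() fills column by column, with cols = 2 the third plot is placed in the top-right cell rather than the bottom-left one. To fill row by row, pass an explicit layout built with byrow = TRUE, as the comment block at the top of the function describes.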

Loading the data

Here we use the fread function from the data.table package to speed up reading the data.

train <- as_tibble(fread('../input/nyc-taxi-trip-duration/train.csv'))
test <- as_tibble(fread('../input/nyc-taxi-trip-duration/test.csv'))
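
The same read-and-convert pattern can be tried without the competition files. The sketch below is an illustration only: it writes a hypothetical two-row CSV to a temporary file (the column names and values are made up for this example), reads it with fread, and converts the result with as_tibble, the current name for the deprecated as.tibble.

```r
library(data.table)  # fread
library(tibble)      # as_tibble

# Hypothetical two-row file; column names and values are made up for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,vendor_id,passenger_count,trip_duration",
  "a1,1,1,455",
  "a2,2,3,1280"
), toy_path)

# Read quickly with fread, then convert the data.table to a tibble.
toy <- as_tibble(fread(toy_path))
print(toy)

# Clean up the temporary file.
unlink(toy_path)
```

Converting to a tibble keeps the fast parsing of fread while giving the compact printing and dplyr-friendly behaviour used in the rest of the analysis.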

Inspecting the data

Let's look at the distributions and variable types in the training and test sets, taking the training set as an example:

summary(train)

glimpse(train)
