Data Analysis and Modeling with Spark and pandas: Predicting Bike-Sharing Rental Counts


Abstract


This project is a modeling and prediction project based on cloud computing.

We used the historical bike-sharing usage data set for the Washington area from Kaggle, analyzed and modeled the data with Apache Spark and Python, and finally predicted bike-sharing rental demand in the Washington area.

Data Set Overview: The selected data set consists of a training set and a test set:

  • The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month.

  • The training set contains 12 attributes, including datetime, season, holiday and others. The test set lacks the casual, registered and count attributes.

Model evaluation and application: Three models are used in this project: Multiple Linear Regression, K Nearest Neighbor and Random Forest. According to each model's model.score function, the scores on the data set are 0.3893, 0.1919 and 0.9926, respectively. Random Forest scores highest, so we chose it to predict the test set and saved the prediction output as test_pred.csv.
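For scikit-learn regressors such as LinearRegression and RandomForestRegressor, score returns the coefficient of determination R², while for classifiers such as KNeighborsClassifier it returns mean accuracy. The following is a minimal sketch of the R² formula only, not the project's evaluation code; the helper name r2_score and the five sample counts (taken from the first training rows shown in section 3.2) are used purely for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hourly counts from the first five training rows (illustrative sample only)
y_true = [16, 40, 32, 13, 1]

# Predicting every value exactly gives the best possible score
perfect = r2_score(y_true, y_true)

# Always predicting the mean explains none of the variance
mean_value = sum(y_true) / len(y_true)
baseline = r2_score(y_true, [mean_value] * len(y_true))

print(perfect, baseline)
```

A score close to 1 on the data the model was fitted on does not by itself show that the model generalizes, so a held-out split (for example via train_test_split) is the usual place to read this number.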

Through this project, we have solved the problems proposed at the beginning, gained a new understanding of data processing and model analysis, and also found new problems and challenges in the process of project execution:

  1. The number of bike rentals is not determined by a single variable; several characteristics determine it jointly. Many of these characteristics are also correlated with each other (such as temperature, humidity, windspeed and atemp). Some variables are interference terms that are not directly related to the number of rentals (their correlation with it is very low). Exploratory data analysis is therefore essential during modeling.

  2. Data preprocessing is critical, because it directly affects the overall analysis and the model's predictions. Simply checking for missing values, removing duplicates and removing outliers often does not significantly improve the model fit. It is more effective to study the visualization results to understand the intrinsic characteristics of the data, and then to transform the data according to the requirements of the model (for example with a log transform), so that the predictions become more accurate.

  3. In this project, we also encountered a number of problems. For example, we could not apply method operations to individual columns and rows of Spark's DataFrame, which caused trouble in the initial data analysis and visualization. Our analysis and modeling process also has deficiencies, for example:

    • Duplicate values, missing values and so on in the original data set need to be fully handled before modeling;
    • After modeling, we found that for some variables the impact on bike rentals is not well captured by average values; cumulative values should be used instead.
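The log transform mentioned in item 2 can be sketched as follows. This is an illustration under assumptions, not the project's actual preprocessing: the source only says "log", so the choice of log1p and its inverse expm1 from NumPy is ours, and the counts are a small sample copied from the training rows shown in section 3.2.

```python
import numpy as np

# A few hourly counts from the training rows in section 3.2 (values 1..110)
counts = np.array([16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110])

# log1p(x) = log(1 + x) compresses large values and is defined at x = 0
log_counts = np.log1p(counts)

# expm1 inverts log1p, so predictions made on the log scale can be mapped back
restored = np.expm1(log_counts)

print(log_counts.round(3))
print(np.allclose(restored, counts))
```

If a model is trained on the transformed target, its predictions have to be passed through expm1 before they are written out as rental counts.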

Packages imported:

  • pyspark.sql.functions
  • pyspark.context (SparkContext)
  • pyspark.sql.session (SparkSession)
  • seaborn
  • matplotlib.pyplot
  • warnings
  • numpy
  • pandas
  • datetime (datetime)
  • sklearn.ensemble (RandomForestRegressor)
  • sklearn.neighbors (KNeighborsClassifier)
  • sklearn.linear_model (LinearRegression)
  • sklearn.model_selection (train_test_split)

1. Introduction


1.1 Project background

The bike-sharing system is a way of renting bicycles. Registration, renting and returning the bikes are all done through the self-service terminal network of the whole city, which can automatically obtain the bike rental and return data.
Through this system, people can rent bikes in one place and return them to different places.

1.2 Project requirements

The data generated by the system records each bike's ride time, departure point, arrival point and usage time.
In this project, we used historical usage data to analyze how natural and human factors such as weather and time affect the number of shared bike rentals, in order to predict the demand for shared bike rentals in the Washington area.

1.3 Methods and techniques

In this project, we mainly used the Apache Spark cloud computing platform and the Python processing platform, relying on basic Spark DataFrame and pandas DataFrame operations. Exploratory data analysis covered data cleaning, data description, inspecting distributions, comparing relationships between variables, and summarizing the data. We also used data visualization: charts present the results of the exploratory analysis, give a more intuitive view of the real distribution of the data, and reveal hidden patterns, which helps in finding a model that suits the data.
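As a small illustration of comparing relationships between variables, the sketch below computes a Pearson correlation matrix with pandas. It is not the project's own analysis: the eight rows are copied from the train.show() output in section 3.2, and such a tiny sample only demonstrates the call, not the correlations of the full data set.

```python
import pandas as pd

# Eight hourly rows copied from the train.show() output in section 3.2
sample = pd.DataFrame(
    {
        "temp": [9.84, 9.84, 8.2, 13.12, 15.58, 17.22, 18.86, 17.22],
        "atemp": [14.395, 12.88, 12.88, 17.425, 19.695, 21.21, 22.725, 21.21],
        "humidity": [81, 75, 86, 76, 76, 77, 72, 88],
        "windspeed": [0.0, 6.0032, 0.0, 0.0, 16.9979, 19.0012, 19.9995, 16.9979],
        "count": [16, 1, 3, 14, 36, 84, 94, 35],
    }
)

# Pairwise Pearson correlation between all numeric columns
corr = sample.corr()
print(corr.round(2))

# In this sample temp and atemp are strongly correlated (coefficient above 0.9)
print(corr.loc["temp", "atemp"] > 0.9)
```

A matrix like this can be passed to a seaborn heatmap for the visual inspection described above.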

In the process of modeling and analysis, we used Multiple Linear Regression, Random Forest and KNN:

  1. Multiple Linear Regression: When several factors affect the dependent variable, multiple regression analysis can model how those independent variables jointly affect it. Multiple regression analysis is a statistical method that takes one variable as the dependent variable and one or more variables as independent variables, establishes a linear or nonlinear quantitative relationship among them, and estimates it from sample data.
  2. Random Forest: A random forest is an ensemble of decision trees; its basic unit is the decision tree. For classification, the output is the mode of the classes predicted by the individual trees; for regression, as with RandomForestRegressor, it is the average of the trees' predictions. Random forests run efficiently on large data sets and can handle samples with high-dimensional features without dimensionality reduction, and they often achieve high accuracy.
  3. KNN: KNN is the k-nearest-neighbor classification algorithm, in which each sample is represented by its k nearest neighbors: if most of the k most similar samples in feature space belong to a certain category, the sample is assigned to that category as well. KNN is better suited to classifying class domains with large sample sizes; it is prone to errors in class domains with small sample sizes.
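The majority-vote rule described for KNN can be sketched in plain Python. This is a toy illustration, not the KNeighborsClassifier implementation used in the project: the function knn_predict, the (temp, humidity) points and the "low"/"high" demand labels are all hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by Euclidean distance and keep the labels of the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (temp, humidity) -> demand level
points = [(9.0, 80), (10.0, 78), (9.5, 85), (18.0, 72), (19.0, 70), (17.5, 77)]
labels = ["low", "low", "low", "high", "high", "high"]

warm_hour = knn_predict(points, labels, (18.5, 74))
cold_hour = knn_predict(points, labels, (9.2, 82))
print(warm_hour, cold_hour)
```

Because the rental count is a number rather than a category, a regression variant that averages the neighbors' values may fit this task better than a majority vote.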

2. Problem Definition


In this project, we need to explore the question: "What factors affect the use of shared bikes?"

We need to predict the total rental number of shared bikes through characteristic values such as weather in the test set.
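Since the prediction has to be made from the test set, only columns that exist in both files can be used as model inputs. The sketch below derives that feature list from the column names reported by train.dtypes and test.dtypes in section 3.2; the variable names are ours and chosen only for illustration.

```python
# Column names as listed by train.dtypes and test.dtypes in section 3.2
train_columns = ["datetime", "season", "holiday", "workingday", "weather", "temp",
                 "atemp", "humidity", "windspeed", "casual", "registered", "count"]
test_columns = ["datetime", "season", "holiday", "workingday", "weather", "temp",
                "atemp", "humidity", "windspeed"]

# The value to predict is the total rental number
target = "count"

# Only columns present in both sets can serve as model inputs
features = [c for c in train_columns if c in test_columns]

# Columns that exist only in the training set
train_only = [c for c in train_columns if c not in test_columns]

print(features)
print(train_only)
```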

3. Data


Before analyzing the data, we need to have a certain understanding of the data in the data set, which will help us to choose the appropriate model later.

In this module, we present the following aspects:

  1. Data source and attribute interpretation;
  2. Data type;
  3. The main structure of the dataset;
  4. Data set size (rows and columns);
  5. Descriptive statistics for the data set;

3.1 Data source

The source of the data set is https://www.kaggle.com/c/bike-sharing-demand/data. The data set consists of a training set and a test set. The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month. The training set contains 12 attributes, and the test set lacks the casual, registered and count attributes.
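The split rule above can be expressed as a simple check on the day of the month. The following is a minimal sketch using only the standard library; the function name belongs_to_training_set is hypothetical, and the timestamp format is the one visible in the datetime column in section 3.2.

```python
from datetime import datetime


def belongs_to_training_set(timestamp: str) -> bool:
    """True when the day of the month is 1..19, per the split described above."""
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").day
    return day <= 19


# First training row, first test row, and the latest training timestamp
print(belongs_to_training_set("2011-01-01 00:00:00"))
print(belongs_to_training_set("2011-01-20 00:00:00"))
print(belongs_to_training_set("2012-12-19 23:00:00"))
```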

The following table shows the names of the 12 attributes and their explanations:

Attribute    Explanation
datetime     Timestamp (year-month-day hour)
season       1: spring; 2: summer; 3: autumn; 4: winter
holiday      Is it a holiday? 0: no; 1: yes
workingday   Is it a working day? 0: no; 1: yes
weather      1: sunny; 2: cloudy; 3: light rain or light snow; 4: bad weather (heavy rain, hail or blizzard)
temp         Actual temperature (Celsius)
atemp        Apparent ("feels like") temperature (Celsius)
humidity     Humidity
windspeed    Wind speed
casual       Number of rentals by unregistered users
registered   Number of rentals by registered users
count        Total number of rentals
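The coded attributes can be decoded with plain dictionaries, and the training rows shown in section 3.2 are consistent with count being the sum of casual and registered. The sketch below illustrates both points; the dictionary names and the shortened weather labels are ours, and the three rows are the first three rows of the training set.

```python
# Code-to-label mappings taken from the attribute table above
SEASONS = {1: "spring", 2: "summer", 3: "autumn", 4: "winter"}
WEATHER = {1: "sunny", 2: "cloudy", 3: "light rain or light snow", 4: "bad weather"}

# First three training rows: (season, weather, casual, registered, count)
rows = [(1, 1, 3, 13, 16), (1, 1, 8, 32, 40), (1, 1, 5, 27, 32)]

# Check that the total equals the sum of its two components in every row
totals_match = all(casual + registered == count
                   for _, _, casual, registered, count in rows)

# Replace the numeric codes with readable labels
decoded = [(SEASONS[season], WEATHER[weather], count)
           for season, weather, _, _, count in rows]

print(totals_match)
print(decoded)
```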

3.2 Data set format

Import the packages and load the data:
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
# Start a local Spark context and wrap it in a session
sc = SparkContext('local')
sqlContext = SparkSession(sc)
# getOrCreate() returns an existing session if one is active, otherwise it creates one
spark = SparkSession.builder.appName('Final_project').getOrCreate()
# Read both CSV files with a header row, letting Spark infer the column types
train = spark.read.csv('file:///home/ljm/project/train.csv', header = True, inferSchema = True)
test = spark.read.csv('file:///home/ljm/project/test.csv', header = True, inferSchema = True)
Data types in the training set:
train.dtypes
[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double'),
 ('casual', 'int'),
 ('registered', 'int'),
 ('count', 'int')]
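The datetime column is reported as a string in the output above, so time-based factors such as the hour or the weekday have to be derived from it before they can be analyzed. The following is a sketch using only the standard library; the function time_features and the choice of derived fields are assumptions for illustration, not the project's own feature set.

```python
from datetime import datetime


def time_features(timestamp: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM:SS' string into calendar features."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
    }


# A timestamp from the first day of the training set
features_13h = time_features("2011-01-01 13:00:00")
print(features_13h)
```

The same parsing can be done on a whole pandas column with pd.to_datetime once the Spark DataFrame has been converted.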
First rows of the training set:
train.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00|     1|      0|         0|      1| 9.84|14.395|      81|      0.0|     3|        13|   16|
|2011-01-01 01:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     8|        32|   40|
|2011-01-01 02:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     5|        27|   32|
|2011-01-01 03:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     3|        10|   13|
|2011-01-01 04:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     0|         1|    1|
|2011-01-01 05:00:00|     1|      0|         0|      2| 9.84| 12.88|      75|   6.0032|     0|         1|    1|
|2011-01-01 06:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     2|         0|    2|
|2011-01-01 07:00:00|     1|      0|         0|      1|  8.2| 12.88|      86|      0.0|     1|         2|    3|
|2011-01-01 08:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     1|         7|    8|
|2011-01-01 09:00:00|     1|      0|         0|      1|13.12|17.425|      76|      0.0|     8|         6|   14|
|2011-01-01 10:00:00|     1|      0|         0|      1|15.58|19.695|      76|  16.9979|    12|        24|   36|
|2011-01-01 11:00:00|     1|      0|         0|      1|14.76|16.665|      81|  19.0012|    26|        30|   56|
|2011-01-01 12:00:00|     1|      0|         0|      1|17.22| 21.21|      77|  19.0012|    29|        55|   84|
|2011-01-01 13:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.9995|    47|        47|   94|
|2011-01-01 14:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.0012|    35|        71|  106|
|2011-01-01 15:00:00|     1|      0|         0|      2|18.04| 21.97|      77|  19.9995|    40|        70|  110|
|2011-01-01 16:00:00|     1|      0|         0|      2|17.22| 21.21|      82|  19.9995|    41|        52|   93|
|2011-01-01 17:00:00|     1|      0|         0|      2|18.04| 21.97|      82|  19.0012|    15|        52|   67|
|2011-01-01 18:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     9|        26|   35|
|2011-01-01 19:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     6|        31|   37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
Data types in the test set:
test.dtypes
[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double')]
First rows of the test set:
test.show()
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|2011-01-20 00:00:00|     1|      0|         1|      1|10.66|11.365|      56|  26.0027|
|2011-01-20 01:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 02:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 03:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 04:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 05:00:00|     1|      0|         1|      1| 9.84|11.365|      60|  15.0013|
|2011-01-20 06:00:00|     1|      0|         1|      1| 9.02|10.605|      60|  15.0013|
|2011-01-20 07:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  15.0013|
|2011-01-20 08:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  19.0012|
|2011-01-20 09:00:00|     1|      0|         1|      2| 9.84|11.365|      52|  15.0013|
|2011-01-20 10:00:00|     1|      0|         1|      1|10.66|11.365|      48|  19.9995|
|2011-01-20 11:00:00|     1|      0|         1|      2|11.48|13.635|      45|  11.0014|
|2011-01-20 12:00:00|     1|      0|         1|      2| 12.3|16.665|      42|      0.0|
|2011-01-20 13:00:00|     1|      0|         1|      2|11.48|14.395|      45|   7.0015|
|2011-01-20 14:00:00|     1|      0|         1|      2| 12.3| 15.15|      45|   8.9981|
|2011-01-20 15:00:00|     1|      0|         1|      2|13.12| 15.91|      45|   12.998|
|2011-01-20 16:00:00|     1|      0|         1|      2| 12.3| 15.15|      49|   8.9981|
|2011-01-20 17:00:00|     1|      0|         1|      2| 12.3| 15.91|      49|   7.0015|
|2011-01-20 18:00:00|     1|      0|         1|      2|10.66| 12.88|      56|   12.998|
|2011-01-20 19:00:00|     1|      0|         1|      1|10.66|11.365|      56|  22.0028|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
only showing top 20 rows

3.3 Data set size

print("Train dataset:", "The number of rows:", train.count(), "\n", "              The number of columns:", len(train.columns))
print("Test dataset :", "The number of rows:", test.count(), "\n", "              The number of columns:", len(test.columns))
Train dataset: The number of rows: 10886 
               The number of columns: 12
Test dataset : The number of rows: 6493 
               The number of columns: 9
Descriptive statistics for each column (training set):
train.describe().show()
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|summary|           datetime|            season|            holiday|        workingday|           weather|              temp|            atemp|          humidity|         windspeed|           casual|        registered|             count|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|  count|              10886|             10886|              10886|             10886|             10886|             10886|            10886|             10886|             10886|            10886|             10886|             10886|
|   mean|               null|2.5066139996325556|0.02856880396839978|0.6808745177291935| 1.418427337865148|20.230859819952173|23.65508405291192| 61.88645967297446|12.799395406945093|36.02195480433584| 155.5521771082124|191.57413191254824|
| stddev|               null|1.1161743093443237|0.16659885062470944|0.4661591687997361|0.6338385858190968| 7.791589843987573| 8.47460062648494|19.245033277394704|  8.16453732683871|49.96047657264955|151.03903308192452|181.14445383028493|
|    min|2011-01-01 00:00:00|                 1|                  0|                 0|                 1|              0.82|             0.76|                 0|               0.0|                0|                 0|                 1|
|    max|2012-12-19 23:00:00|                 4|                  1|                 1|                 4|              41.0|           45.455|               100|           56.9969|              367|               886|               977|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
Descriptive statistics for each column (test set):
test.describe().show()
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|summary|           datetime|            season|             holiday|         workingday|           weather|              temp|             atemp|         humidity|        windspeed|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|  count|               6493|              6493|                6493|               6493|              6493|              6493|              6493|             6493|             6493|
|   mean|               null|  2.49330047743724|0.0