Data Analysis and Modeling with Spark and pandas: Predicting Bike-Sharing Rental Counts

Latest recommended article published on 2024-05-07 04:02:19

Limerencebb

Tags: spark, machine learning, data mining, python, linux

Article link: https://blog.csdn.net/Blankkk23/article/details/116983802

Abstract

This project is a modeling and prediction project based on cloud computing.

We used the historical bike-sharing usage data set for the Washington area from Kaggle, analyzed and modeled the data with Apache Spark and Python, and finally predicted bike-sharing rental demand in the Washington area.

Data Set Overview: The selected data set consists of a training set and a test set:

The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month.
The training set contains 12 attributes, including datetime, season and holiday. The test set lacks the casual, registered and count attributes.

Model evaluation and application: Three models are used: Multiple Linear Regression, K-Nearest Neighbors and Random Forest. According to the model.score function of each model, their scores on the data set are 0.3893, 0.1919 and 0.9926, respectively. Random Forest scores highest, so we choose it to predict the test set and save the prediction output as test_pred.csv.
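In scikit-learn, score returns the coefficient of determination R² for regressors such as LinearRegression and RandomForestRegressor, and mean accuracy for classifiers such as KNeighborsClassifier. The sketch below computes both quantities by hand on made-up numbers (not project data). The helper names r2_score_manual and accuracy_manual are hypothetical, and the article does not state whether the reported scores were computed on the training data or on a held-out split.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity a regressor's score() reports."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_manual(y_true, y_pred):
    """Fraction of exact matches, the quantity a classifier's score() reports."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Made-up hourly counts and predictions, for illustration only.
actual = [16, 40, 32, 13, 1]
predicted = [18, 35, 30, 15, 4]

print(r2_score_manual(actual, predicted))   # close predictions give a high R^2
print(accuracy_manual(actual, predicted))   # but no prediction matches exactly
```

Because the two metrics measure different things, the three reported scores may not be directly comparable: exact-match accuracy on a count target with many distinct values tends to be low even when predictions are close.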

Through this project, we have solved the problems proposed at the beginning, gained a new understanding of data processing and model analysis, and also found new problems and challenges in the process of project execution:

The number of bike rentals is not determined by a single variable; several features jointly determine it. Many of these features are also related to each other (such as temperature, humidity, windspeed and atemp) and influence one another. Some variables are noise terms with very low correlation with the number of rentals. Exploratory data analysis is therefore essential during modeling.
Data preprocessing is critical because it directly affects the overall analysis and the model's predictions. Simply checking for missing values, removing duplicates and removing outliers often does not noticeably improve model fit. It is more effective to use visualization to understand the intrinsic characteristics of the data and then process the data accordingly (for example, with a log transform or other data conversion), so that the model's predictions become more accurate.
In this project, we also encountered a number of problems. For example, Spark's DataFrame did not let us apply methods directly to individual columns and rows, which caused difficulties in the initial data analysis and visualization. Our analysis and modeling process also has several deficiencies, for example:
- Duplicate values, missing values and similar issues in the original data set need to be fully handled before modeling;
- After modeling, we found that for some variables the effect on bike rentals should not be summarized by the mean; the cumulative value should be used instead;
- …
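As a minimal sketch of the log processing mentioned above, the snippet below applies log(1 + x) to the hourly counts from the first 16 training rows shown in section 3.2 and then inverts the transform. Using log1p/expm1 specifically is an assumption made here; the article only says "log".

```python
import math

# Hourly rental counts from the first 16 training rows shown in section 3.2.
counts = [16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110]

# log1p(x) = log(1 + x) stays defined if a count were 0 and compresses large values.
log_counts = [math.log1p(c) for c in counts]

# expm1 is the inverse, used to map predictions on the log scale back to counts.
restored = [math.expm1(v) for v in log_counts]

print(max(counts) / min(counts))          # spread on the raw scale
print(max(log_counts) / min(log_counts))  # much smaller spread after the transform
```

A model trained on the transformed target would predict on the log scale, so its outputs need the inverse transform before being reported as rental counts.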

Packages imported: pyspark.sql.functions, pyspark.context (SparkContext), pyspark.sql.session (SparkSession), seaborn, matplotlib.pyplot, warnings, numpy, pandas, datetime (datetime), sklearn.ensemble (RandomForestRegressor), sklearn.neighbors (KNeighborsClassifier), sklearn.linear_model (LinearRegression), sklearn.model_selection (train_test_split)

1. Introduction

1.1 Project background

A bike-sharing system is a way of renting bicycles. Registration, rental and return are all handled through a city-wide network of self-service terminals, which automatically records rental and return data.
Through this system, people can rent a bike in one place and return it in another.

1.2 Project requirements

The data generated by the system records each bike's ride time, departure point, arrival point and duration of use.
In this project, we analyzed the impact of natural and human factors such as weather and time on the number of shared bike rentals, based on historical usage data, in order to predict the demand for shared bike rentals in the Washington area.

1.3 Methods and techniques

In this project, we mainly used the Apache Spark cloud computing platform and Python. We used the basic operations of both Spark DataFrames and pandas DataFrames. Exploratory data analysis covered data cleaning, data description, inspecting distributions, comparing relationships between variables, and summarizing the data. We also used data visualization, presenting the results of the exploratory analysis as charts to understand the real distribution of the data more intuitively and to reveal hidden patterns, which in turn guided the choice of a suitable model.
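One common way to compare the relationship between two variables is the Pearson correlation coefficient. The sketch below computes it in plain Python for temp and atemp, using the 20 training rows displayed in section 3.2. The helper name pearson is hypothetical, and this is an illustration rather than the project's own analysis code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# temp and atemp from the 20 training rows displayed in section 3.2.
temp = [9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.2, 9.84, 13.12,
        15.58, 14.76, 17.22, 18.86, 18.86, 18.04, 17.22, 18.04, 17.22, 17.22]
atemp = [14.395, 13.635, 13.635, 14.395, 14.395, 12.88, 13.635, 12.88, 14.395, 17.425,
         19.695, 16.665, 21.21, 22.725, 22.725, 21.97, 21.21, 21.97, 21.21, 21.21]

print(round(pearson(temp, atemp), 3))
```

A value close to 1 for a pair of features suggests they carry largely the same information, which is worth knowing before fitting a linear model.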

In the process of modeling and analysis, we used Multiple Linear Regression, Random Forest and KNN:

Multiple Linear Regression: When several factors affect the dependent variable, multiple regression analysis models how multiple independent variables influence one dependent variable. It is a statistical method that takes one variable as the dependent variable and one or more others as independent variables, establishes a linear (or nonlinear) quantitative relationship among them, and estimates that relationship from sample data.
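In symbols, the linear case can be written as below. The notation (y_i for the response, x_{ij} for the j-th predictor of sample i, β_j for the coefficients, ε_i for the error term) is introduced here for illustration and does not appear in the original project. The second expression is the ordinary least squares criterion, which is what scikit-learn's LinearRegression minimizes.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

In this project, y_i would be the rental count for hour i and the x_{ij} would be features such as temp, humidity and windspeed.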

Random Forest: A random forest is an ensemble of decision trees, with the decision tree as its basic unit. For classification, the output class is the mode of the classes predicted by the individual trees; for regression, as in this project, the output is the average of the trees' predictions. Random forests run efficiently on large data sets and can handle high-dimensional inputs without dimensionality reduction, and they often achieve high accuracy.
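The aggregation step can be illustrated in a few lines. The per-tree outputs below are invented for the example and are not produced by a trained model; the sketch only shows how individual tree outputs are combined into one forest output.

```python
from statistics import mean, mode

# Hypothetical outputs of five individual trees for one sample.
tree_class_votes = ["high", "low", "high", "high", "low"]  # classification forest
tree_count_preds = [30.0, 36.0, 33.0, 41.0, 35.0]          # regression forest

# Classification: the forest returns the most common class among the trees.
forest_class = mode(tree_class_votes)

# Regression: the forest returns the average of the trees' numeric predictions.
forest_count = mean(tree_count_preds)

print(forest_class, forest_count)
```

Since the target here is a rental count, the regression form is the one that matches the RandomForestRegressor listed among the imported packages.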

KNN: KNN is the k-nearest-neighbor classification algorithm, in which each sample is represented by its k nearest neighbors: if most of the k most similar samples in feature space belong to a certain category, the sample is assigned to that category. KNN is better suited to classes with large sample sizes and is prone to errors for classes with few samples.
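A minimal sketch of the majority-vote rule described above, in plain Python. The function name knn_predict, the toy (temp, humidity) points and the "low"/"high" labels are all hypothetical and chosen only to make the example readable.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most frequent label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy (temp, humidity) points with made-up demand labels, for illustration only.
points = [(9.0, 80.0), (10.0, 75.0), (18.0, 72.0), (19.0, 77.0), (17.0, 82.0)]
labels = ["low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (18.5, 75.0), k=3))
```

Because the distance mixes features measured on different scales, feature scaling is usually worth considering before applying KNN.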

2. Problem Definition

In this project, we need to explore the question: "What factors affect the use of shared bikes?"

We need to predict the total number of shared bike rentals from the feature values in the test set, such as weather.

3. Data

Before analyzing the data, we need to have a certain understanding of the data in the data set, which will help us to choose the appropriate model later.

In this module, we present the following aspects:

Data source and attribute interpretation;
Data type;
The main structure of the dataset;
Data set size (rows and columns);
Descriptive statistics for the data set;

3.1 Data source

The source of the data set is https://www.kaggle.com/c/bike-sharing-demand/data. The data set consists of a training set and a test set. The training set consists of data from the first 19 days of each month, and the test set consists of data from the 20th day of each month to the end of the month. The training set contains 12 attributes, and the test set lacks the casual, registered and count attributes.

The following table shows the names of the 12 attributes and their explanations:

Attribute	Explanation
datetime	Timestamp: year-month-day hour
season	1: spring; 2: summer; 3: autumn; 4: winter
holiday	Is it a holiday? 0: no; 1: yes
workingday	Is it a working day? 0: no; 1: yes
weather	1: sunny; 2: cloudy; 3: light rain or light snow; 4: bad weather (heavy rain, hail or blizzard)
temp	Actual temperature (Celsius)
atemp	Feels-like temperature (Celsius)
humidity	Humidity
windspeed	Wind speed
casual	Number of rentals by unregistered users
registered	Number of rentals by registered users
count	Total number of rentals
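Since datetime combines several time components, it can be useful to split it into separate features. The sketch below does this with the standard library, assuming the 'YYYY-MM-DD HH:MM:SS' format seen in the data. The function name time_features and the in_training_range flag are hypothetical, added here only to illustrate the 19-day split.

```python
from datetime import datetime


def time_features(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into calendar features."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "weekday": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "in_training_range": dt.day <= 19,  # first 19 days of each month
    }


print(time_features("2011-01-01 00:00:00"))
print(time_features("2011-01-20 00:00:00")["in_training_range"])
```

Features such as hour and weekday are often more informative for demand modeling than the raw timestamp string.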

3.2 Data set format

Import the packages and load the data:

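# Note: the wildcard import brings in Spark column functions, some of which
# shadow Python built-ins of the same name (for example sum, min, max, round).
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

# Start a local Spark context and wrap it in a session.
sc = SparkContext('local')
sqlContext = SparkSession(sc)
# getOrCreate() returns an existing session if one is already active.
spark = SparkSession.builder.appName('Final_project').getOrCreate()
# Read both CSV files with a header row, letting Spark infer the column types.
train = spark.read.csv('file:///home/ljm/project/train.csv', header = True, inferSchema = True)
test = spark.read.csv('file:///home/ljm/project/test.csv', header = True, inferSchema = True)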

Data type display in training set:

train.dtypes

[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double'),
 ('casual', 'int'),
 ('registered', 'int'),
 ('count', 'int')]

Partial training data set display:

train.show()

+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|casual|registered|count|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
|2011-01-01 00:00:00|     1|      0|         0|      1| 9.84|14.395|      81|      0.0|     3|        13|   16|
|2011-01-01 01:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     8|        32|   40|
|2011-01-01 02:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     5|        27|   32|
|2011-01-01 03:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     3|        10|   13|
|2011-01-01 04:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     0|         1|    1|
|2011-01-01 05:00:00|     1|      0|         0|      2| 9.84| 12.88|      75|   6.0032|     0|         1|    1|
|2011-01-01 06:00:00|     1|      0|         0|      1| 9.02|13.635|      80|      0.0|     2|         0|    2|
|2011-01-01 07:00:00|     1|      0|         0|      1|  8.2| 12.88|      86|      0.0|     1|         2|    3|
|2011-01-01 08:00:00|     1|      0|         0|      1| 9.84|14.395|      75|      0.0|     1|         7|    8|
|2011-01-01 09:00:00|     1|      0|         0|      1|13.12|17.425|      76|      0.0|     8|         6|   14|
|2011-01-01 10:00:00|     1|      0|         0|      1|15.58|19.695|      76|  16.9979|    12|        24|   36|
|2011-01-01 11:00:00|     1|      0|         0|      1|14.76|16.665|      81|  19.0012|    26|        30|   56|
|2011-01-01 12:00:00|     1|      0|         0|      1|17.22| 21.21|      77|  19.0012|    29|        55|   84|
|2011-01-01 13:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.9995|    47|        47|   94|
|2011-01-01 14:00:00|     1|      0|         0|      2|18.86|22.725|      72|  19.0012|    35|        71|  106|
|2011-01-01 15:00:00|     1|      0|         0|      2|18.04| 21.97|      77|  19.9995|    40|        70|  110|
|2011-01-01 16:00:00|     1|      0|         0|      2|17.22| 21.21|      82|  19.9995|    41|        52|   93|
|2011-01-01 17:00:00|     1|      0|         0|      2|18.04| 21.97|      82|  19.0012|    15|        52|   67|
|2011-01-01 18:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     9|        26|   35|
|2011-01-01 19:00:00|     1|      0|         0|      3|17.22| 21.21|      88|  16.9979|     6|        31|   37|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+------+----------+-----+
only showing top 20 rows
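A quick sanity check on the rows displayed above: count appears to be the sum of casual and registered. The sketch below verifies this for the 20 displayed rows only; whether it holds for the whole file would need to be checked on the full data set.

```python
# (casual, registered, count) copied from the 20 rows displayed above.
rows = [
    (3, 13, 16), (8, 32, 40), (5, 27, 32), (3, 10, 13), (0, 1, 1),
    (0, 1, 1), (2, 0, 2), (1, 2, 3), (1, 7, 8), (8, 6, 14),
    (12, 24, 36), (26, 30, 56), (29, 55, 84), (47, 47, 94), (35, 71, 106),
    (40, 70, 110), (41, 52, 93), (15, 52, 67), (9, 26, 35), (6, 31, 37),
]

# Collect every row where casual + registered differs from count.
mismatches = [r for r in rows if r[0] + r[1] != r[2]]

print(len(mismatches))
```

If the relationship holds throughout, casual and registered are components of the target rather than independent predictors, which is consistent with their absence from the test set.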

Data type display in testing set:

test.dtypes

[('datetime', 'string'),
 ('season', 'int'),
 ('holiday', 'int'),
 ('workingday', 'int'),
 ('weather', 'int'),
 ('temp', 'double'),
 ('atemp', 'double'),
 ('humidity', 'int'),
 ('windspeed', 'double')]

Partial testing data set display:

test.show()

+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|           datetime|season|holiday|workingday|weather| temp| atemp|humidity|windspeed|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
|2011-01-20 00:00:00|     1|      0|         1|      1|10.66|11.365|      56|  26.0027|
|2011-01-20 01:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 02:00:00|     1|      0|         1|      1|10.66|13.635|      56|      0.0|
|2011-01-20 03:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 04:00:00|     1|      0|         1|      1|10.66| 12.88|      56|  11.0014|
|2011-01-20 05:00:00|     1|      0|         1|      1| 9.84|11.365|      60|  15.0013|
|2011-01-20 06:00:00|     1|      0|         1|      1| 9.02|10.605|      60|  15.0013|
|2011-01-20 07:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  15.0013|
|2011-01-20 08:00:00|     1|      0|         1|      1| 9.02|10.605|      55|  19.0012|
|2011-01-20 09:00:00|     1|      0|         1|      2| 9.84|11.365|      52|  15.0013|
|2011-01-20 10:00:00|     1|      0|         1|      1|10.66|11.365|      48|  19.9995|
|2011-01-20 11:00:00|     1|      0|         1|      2|11.48|13.635|      45|  11.0014|
|2011-01-20 12:00:00|     1|      0|         1|      2| 12.3|16.665|      42|      0.0|
|2011-01-20 13:00:00|     1|      0|         1|      2|11.48|14.395|      45|   7.0015|
|2011-01-20 14:00:00|     1|      0|         1|      2| 12.3| 15.15|      45|   8.9981|
|2011-01-20 15:00:00|     1|      0|         1|      2|13.12| 15.91|      45|   12.998|
|2011-01-20 16:00:00|     1|      0|         1|      2| 12.3| 15.15|      49|   8.9981|
|2011-01-20 17:00:00|     1|      0|         1|      2| 12.3| 15.91|      49|   7.0015|
|2011-01-20 18:00:00|     1|      0|         1|      2|10.66| 12.88|      56|   12.998|
|2011-01-20 19:00:00|     1|      0|         1|      1|10.66|11.365|      56|  22.0028|
+-------------------+------+-------+----------+-------+-----+------+--------+---------+
only showing top 20 rows

3.3 Data set size

print("Train dataset:", "The number of rows:", train.count(), "\n", "              The number of columns:", len(train.columns))
print("Test dataset :", "The number of rows:", test.count(), "\n", "              The number of columns:", len(test.columns))

Train dataset: The number of rows: 10886 
               The number of columns: 12
Test dataset : The number of rows: 6493 
               The number of columns: 9
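The difference of three columns can be made explicit by comparing the column lists. In Spark, train.columns and test.columns are plain Python lists, so the sketch below uses literal lists copied from the dtypes output in section 3.2 in place of the DataFrames.

```python
# Column names copied from the dtypes output of the two data sets.
train_columns = ["datetime", "season", "holiday", "workingday", "weather", "temp",
                 "atemp", "humidity", "windspeed", "casual", "registered", "count"]
test_columns = ["datetime", "season", "holiday", "workingday", "weather", "temp",
                "atemp", "humidity", "windspeed"]

# Columns present only in the training set, in their original order.
train_only = [c for c in train_columns if c not in test_columns]

# Columns available in both sets, i.e. candidates for model inputs.
shared = [c for c in train_columns if c in test_columns]

print(train_only)
print(len(shared))
```

Only the shared columns can serve as model inputs when predicting on the test set.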

Descriptive statistics for each column (training set):

train.describe().show()

+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|summary|           datetime|            season|            holiday|        workingday|           weather|              temp|            atemp|          humidity|         windspeed|           casual|        registered|             count|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
|  count|              10886|             10886|              10886|             10886|             10886|             10886|            10886|             10886|             10886|            10886|             10886|             10886|
|   mean|               null|2.5066139996325556|0.02856880396839978|0.6808745177291935| 1.418427337865148|20.230859819952173|23.65508405291192| 61.88645967297446|12.799395406945093|36.02195480433584| 155.5521771082124|191.57413191254824|
| stddev|               null|1.1161743093443237|0.16659885062470944|0.4661591687997361|0.6338385858190968| 7.791589843987573| 8.47460062648494|19.245033277394704|  8.16453732683871|49.96047657264955|151.03903308192452|181.14445383028493|
|    min|2011-01-01 00:00:00|                 1|                  0|                 0|                 1|              0.82|             0.76|                 0|               0.0|                0|                 0|                 1|
|    max|2012-12-19 23:00:00|                 4|                  1|                 1|                 4|              41.0|           45.455|               100|           56.9969|              367|               886|               977|
+-------+-------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+
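To make the summary rows concrete, the sketch below reproduces the same five statistics in plain Python for a small sample: the count column of the 20 training rows displayed in section 3.2. The stddev row in Spark's describe() is the sample standard deviation (n - 1 denominator), which matches statistics.stdev.

```python
from statistics import mean, stdev

# count column of the 20 training rows displayed in section 3.2.
counts = [16, 40, 32, 13, 1, 1, 2, 3, 8, 14,
          36, 56, 84, 94, 106, 110, 93, 67, 35, 37]

summary = {
    "count": len(counts),
    "mean": mean(counts),
    "stddev": stdev(counts),  # sample standard deviation (n - 1 denominator)
    "min": min(counts),
    "max": max(counts),
}

print(summary)
```

These values describe only the first 20 hours, so they differ from the full-table figures above.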

Descriptive statistics for each column (test set):

test.describe().show()

+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|summary|           datetime|            season|             holiday|         workingday|           weather|              temp|             atemp|         humidity|        windspeed|
+-------+-------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|  count|               6493|              6493|                6493|               6493|              6493|              6493|              6493|             6493|             6493|
|   mean|               null|  2.49330047743724|0.0
