Automate Data Preparation Using Google Colab: Read and Process Citi Bike Data in a Zip File

From time to time I get requests from colleagues to process some large data files and report some statistics from the data. Since they rely on Excel as their main data processing/analysis tool and don’t use Python, R, or SQL, reading and processing data files with more than 1,048,576 rows (the row capacity of Excel) becomes a problem.

To process data in a zip file, normally I would download the zip file, unzip it to get the CSV file, and then import the CSV file into SQL, R, or Python for processing. Once I have the results, I need to output them to CSV files and send the files to the colleagues who requested the work. This whole procedure gets cumbersome when I have to repeat it every month as new data comes in.

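Sketched in Python, that manual routine might look roughly like the snippet below. The file names and the statistic computed are placeholders; the actual processing depends on what is requested.

import zipfile
import pandas as pd

# A minimal sketch of the manual, local workflow (placeholder file names)
with zipfile.ZipFile("tripdata.zip") as z:      # the downloaded zip file
    z.extractall("tripdata")                    # unzip to get the CSV file

df = pd.read_csv("tripdata/tripdata.csv")       # import the CSV into Python
result = df.describe()                          # compute the requested statistics (placeholder)
result.to_csv("result.csv")                     # output the results to a CSV file to send back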

Google Colab makes automation possible

Google Colab (i.e., Google Colaboratory) provides a platform to share Python code and run it in a web browser. It is especially helpful for sharing a Python solution with non-Python users.

Here is an example of a Python program that uses Google Colab to automate the data preparation of monthly Citi Bike data. The program reads the zip file directly from the Citi Bike online portal and processes the data to report the total number of Citi Bike trips for each calendar day. Users only need to provide the URL of the data file they want to process, and download/save the results to their local computers at the end.

Citi Bike Data

The Citi Bike website publishes usage data every month and maintains historical data going back to June 2013. This is a good data source for analyzing micromobility trends in the New York metropolitan region.

The number of records in a Citi Bike data file can be very large. For instance, the data for the month of August 2020 contains more than two million records, which exceeds the row capacity of Excel.

The Python Program for Processing Citi Bike Data

First, go to the Citi Bike trip data portal and copy the link address of the file you would like to process:

https://s3.amazonaws.com/tripdata/index.html

To run the code, you need to sign in to your Google account. Once signed in, you can run the Python code by clicking on the arrow at the top left corner of the code window.

Paste the data file link address into the input box “Data url: ” and hit “Enter”. The code downloads the zip file from that address, reads the data line by line, and saves it into a DataFrame.

data_address = input('Data url: ')
url = urllib.request.urlopen(data_address)  # open the zip file directly from the URL
data = []
df = pd.DataFrame()
with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():  # loop over the files inside the archive
        for line in my_zip_file.open(contained_file).readlines():
            s = str(line, 'utf-8')       # decode each raw line of bytes
            s = re.sub(r"\n", "", s)     # drop the trailing newline
            s = re.sub(r"\"", "", s)     # drop the quotation marks around fields
            line_s = s.split(",")        # split the line into fields
            data.append(line_s)

df = pd.DataFrame(data)

Things get easier and more fun once we have the data stored in a DataFrame. The first row of the original data contains the variable names (or column names). Use the data in the first row as column names for the DataFrame, then remove the first row. Then aggregate the DataFrame by the “starttime” of each trip to count the number of trips for each calendar day, and print the result to the output console.

col_name = df.iloc[0].astype(str)
df.columns = col_name
df = df.drop([0])
df['startdate']=df['starttime'].astype('datetime64[ns]').dt.date
date_count = df.groupby('startdate').count()[['bikeid']]
date_count.columns = ['count']
print(date_count)

Once the result is output to the console, download the result to your local computer by specifying the output file name in the input box “Output File Name:” at the bottom of the console.

output_file_name = input('Output File Name: ')
date_count.to_csv(output_file_name+'.csv')
files.download(output_file_name+'.csv')

The complete code is as follows:

from io import BytesIO
from zipfile import ZipFile
import pandas as pd
import urllib.request
import re
from google.colab import files

data_address = input('Data url: ')
url = urllib.request.urlopen(data_address)
data = []
df = pd.DataFrame()
with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        for line in my_zip_file.open(contained_file).readlines():
            s = str(line, 'utf-8')
            s = re.sub(r"\n", "", s)
            s = re.sub(r"\"", "", s)
            line_s = s.split(",")
            data.append(line_s)

df = pd.DataFrame(data)
col_name = df.iloc[0].astype(str)
df.columns = col_name
df = df.drop([0])
df['startdate']=df['starttime'].astype('datetime64[ns]').dt.date
date_count = df.groupby('startdate').count()[['bikeid']]
date_count.columns = ['count']
print(date_count)

output_file_name = input('Output File Name: ')
date_count.to_csv(output_file_name+'.csv')
files.download(output_file_name+'.csv')
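
As a side note, when the zip archive contains exactly one CSV file, pandas can often skip the unzipping step entirely: read_csv accepts a URL and infers zip compression from the “.zip” extension. The sketch below is a simplified alternative to the line-by-line reader above, assuming a single-CSV archive with the same column names; if the archive contains extra files or folders, the ZipFile approach above is the safer choice.

import pandas as pd

# Simplified sketch, assuming the zip archive at the URL holds exactly one CSV file
data_address = input('Data url: ')
df = pd.read_csv(data_address)  # pandas infers zip compression from the ".zip" extension

df['startdate'] = pd.to_datetime(df['starttime']).dt.date
date_count = df.groupby('startdate').count()[['bikeid']]
date_count.columns = ['count']
print(date_count)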

Translated from: https://towardsdatascience.com/automate-data-preparation-using-google-colab-read-and-process-citi-bike-data-in-zip-file-2ab34dd33205
