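Overview
Use Python to explore data from the bike share systems of three major US cities: Chicago, New York City, and Washington, DC. Write code to import the data and answer interesting questions about it by computing descriptive statistics. Also write a script that takes raw input and creates an interactive experience in the terminal to present these statistics.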
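Bike Share Data
Over the past decade, bike share systems have grown in number and in popularity in cities around the world. A bike share system lets users rent a bicycle for a short period for a fee. Users can borrow a bike at point A and return it at point B, or return it to the same location if they just want to go for a ride. Each bike can serve several users per day.
Thanks to rapid advances in information technology, users can easily access a dock in the system to unlock or return a bike. These technologies also produce a wealth of data that can be used to explore how bike share systems are used.
In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.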
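The Datasets
Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six core columns:
Start Time (e.g., 2017-01-01 00:07:57)
End Time (e.g., 2017-01-01 00:20:53)
Trip Duration (in seconds, e.g., 776)
Start Station (e.g., Broadway & Barry Ave)
End Station (e.g., Sedgwick St & North Ave)
User Type (Subscriber/Registered or Customer/Casual)
The Chicago and New York City files also contain the following two columns (see the images below for the data format):
Gender
Birth Year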
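The column layout described above can be checked when a file is loaded. The following is a minimal sketch, not part of the project template: `load_trips`, `available_optional_columns`, and `SAMPLE_CSV` are hypothetical names, and the single sample row reuses the example values from the column list, with a made-up Gender and Birth Year added for illustration.

```python
import io

import pandas as pd

CORE_COLUMNS = ['Start Time', 'End Time', 'Trip Duration',
                'Start Station', 'End Station', 'User Type']
OPTIONAL_COLUMNS = ['Gender', 'Birth Year']  # Chicago and New York City only

# Illustrative one-row file; the Gender and Birth Year values are invented.
SAMPLE_CSV = (
    "Start Time,End Time,Trip Duration,Start Station,End Station,"
    "User Type,Gender,Birth Year\n"
    "2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,"
    "Sedgwick St & North Ave,Subscriber,Male,1985\n"
)


def load_trips(source):
    """Read a trip file and fail early if a core column is missing."""
    df = pd.read_csv(source, parse_dates=['Start Time', 'End Time'])
    missing = [col for col in CORE_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError('missing core columns: {}'.format(missing))
    return df


def available_optional_columns(df):
    """Return the optional columns that this city's file actually has."""
    return [col for col in OPTIONAL_COLUMNS if col in df.columns]


df = load_trips(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
print(available_optional_columns(df))
```

In the real project the `source` argument would be a path such as `chicago.csv`; checking for the optional columns up front avoids a `KeyError` when the Washington file is analyzed.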
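Questions
1. Which month is most common among the start times (the Start Time column)?
2. Which day of the week (e.g., Monday, Tuesday) is most common among the start times?
3. Which hour of the day is most common among the start times?
4. What is the total trip duration (Trip Duration), and what is the average trip duration?
5. Which start station (Start Station) is most popular, and which end station (End Station) is most popular?
6. Which trip is most popular (that is, which combination of start station and end station)?
7. How many users are there of each user type?
8. How many users are there of each gender?
9. What are the earliest, most recent, and most common birth years?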
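Most of these questions map onto one or two pandas calls. The sketch below shows one possible mapping on a tiny invented dataset; `summarize` and the sample trips are hypothetical and only illustrate the idea. The sample has no Gender or Birth Year column, so questions 8 and 9 are skipped by the column check, as they would be for Washington.

```python
import io

import pandas as pd

# Four invented trips; station names are placeholders.
SAMPLE_CSV = (
    "Start Time,Trip Duration,Start Station,End Station,User Type\n"
    "2017-01-02 08:05:00,600,A St,B St,Subscriber\n"
    "2017-01-09 08:30:00,300,A St,B St,Subscriber\n"
    "2017-02-07 17:10:00,900,C St,A St,Customer\n"
    "2017-01-16 08:45:00,600,A St,C St,Subscriber\n"
)


def summarize(df):
    """Answer the questions with mode, sum, mean, groupby and value_counts."""
    start = df['Start Time']
    result = {
        'month': int(start.dt.month.mode()[0]),            # question 1
        'day': start.dt.day_name().mode()[0],              # question 2
        'hour': int(start.dt.hour.mode()[0]),              # question 3
        'total_seconds': int(df['Trip Duration'].sum()),   # question 4
        'mean_seconds': float(df['Trip Duration'].mean()),
        'start_station': df['Start Station'].mode()[0],    # question 5
        'end_station': df['End Station'].mode()[0],
        # question 6: the most frequent (start, end) pair
        'trip': df.groupby(['Start Station', 'End Station']).size().idxmax(),
        'user_types': df['User Type'].value_counts().to_dict(),  # question 7
    }
    if 'Gender' in df.columns:                             # question 8
        result['genders'] = df['Gender'].value_counts().to_dict()
    if 'Birth Year' in df.columns:                         # question 9
        years = df['Birth Year'].dropna()
        result['birth_years'] = (int(years.min()), int(years.max()),
                                 int(years.mode()[0]))
    return result


trips = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=['Start Time'])
summary = summarize(trips)
for name, value in summary.items():
    print(name, value)
```

Note that `mode()` returns every tied value in sorted order, so taking element `[0]` silently picks the smallest one when there is a tie.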
---------------------
import time
import pandas as pd
import numpy as np

# maps each supported city to its data file
CITY_DATA = {
    'chicago': 'chicago.csv',
    'new york city': 'new_york_city.csv',
    'washington': 'washington.csv',
}
# accepted answers for the month and day prompts; 'all' means no filter
months = ['january', 'february', 'march', 'april', 'may', 'june', 'all']
weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'all']

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('Hello! Let\'s explore some US bikeshare data!')
    # get user input for city (chicago, new york city, washington);
    # input_Arg keeps prompting until the answer is one of the allowed values
    inputCity = input_Arg(
        '\nPlease enter a city (chicago, new york city, washington).\n',
        '\nInvalid input. Please enter a city (chicago, new york city, washington).\n',
        CITY_DATA.keys())
    # get user input for month (all, january, february, ... , june)
    inputMonth = input_Arg(
        '\nPlease enter a month (all, january, february, ... , june).\n',
        '\nInvalid input. Please enter a month (all, january, february, ... , june).\n',
        months)
    # get user input for day of week (all, monday, tuesday, ... sunday)
    inputDay = input_Arg(
        '\nPlease enter a day of week (all, monday, tuesday, ... sunday).\n',
        '\nInvalid input. Please enter a day of week (all, monday, tuesday, ... sunday).\n',
        weekdays)
    print('-'*40)
    return inputCity, inputMonth, inputDay

def input_Arg(input_print, error_print, enterable_list):
    """Prompts until the user enters one of the values in enterable_list."""
    # normalize the answer so that ' Chicago ' matches 'chicago'
    ret = input(input_print).lower().strip()
    while ret not in enterable_list:
        ret = input(error_print).lower().strip()
    return ret

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])
    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    # extract month and day of week from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    # dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
    df['day_of_week'] = df['Start Time'].dt.day_name()
    # filter by month if applicable
    if month != 'all':
        # position in the months list plus one gives the month number (january -> 1)
        month_number = months.index(month) + 1
        # filter by month to create the new dataframe
        df = df[df['month'] == month_number]
    # filter by day of week if applicable
    if day != 'all':
        # day_name() yields capitalized names such as 'Monday'
        df = df[df['day_of_week'] == day.title()]
    return df

def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()
    # display the most common month (as a number, 1 = January)
    popular_month = df['month'].mode()[0]
    print('The most common month is:', popular_month)
    # display the most common day of week
    popular_day_of_week = df['day_of_week'].mode()[0]
    print('The most common day of week is:', popular_day_of_week)
    # display the most common start hour; computed without adding a column,
    # so the (possibly filtered) dataframe is not modified
    popular_hour = df['Start Time'].dt.hour.mode()[0]
    print('The most common start hour is:', popular_hour)
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()
    # display most commonly used start station
    start_station = df['Start Station'].mode()[0]
    print('The most common start station is:\n', start_station)
    # display most commonly used end station
    end_station = df['End Station'].mode()[0]
    print('The most common end station is:\n', end_station)
    # display most frequent combination of start station and end station:
    # count the rows per (start, end) pair and take the pair with the highest count
    trip = df.groupby(['Start Station', 'End Station']).size().idxmax()
    print("The most frequent combination of start station and end station trip is {} to {}".format(trip[0], trip[1]))
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    print('\nCalculating Trip Duration...\n')
    start_time = time.time()
    # display total travel time: sum() adds up the durations,
    # whereas count() would only return the number of trips
    total = df['Trip Duration'].sum()
    print("\nTotal travel time (seconds) is:\n", total)
    # display mean travel time, taken from the same Trip Duration column
    mean_time = df['Trip Duration'].mean()
    print("\nMean travel time (seconds) is:\n", mean_time)
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def user_stats(df):
    """Displays statistics on bikeshare users."""
    print('\nCalculating User Stats...\n')
    start_time = time.time()
    # display counts of user types
    user_types = df['User Type'].value_counts()
    print("\nCounts of user types:\n", user_types)
    # display counts of gender; the Washington file has no Gender column
    if 'Gender' in df.columns:
        gender = df['Gender'].value_counts()
        print("\nCounts of gender:\n", gender)
    else:
        print("\nNo gender data is available for this city.")
    # display earliest, most recent, and most common year of birth;
    # the Washington file has no Birth Year column either
    if 'Birth Year' in df.columns:
        earliest = int(df['Birth Year'].min())
        recent = int(df['Birth Year'].max())
        common = int(df['Birth Year'].mode()[0])
        print("\nEarliest year of birth is:\n", earliest)
        print("\nMost recent year of birth is:\n", recent)
        print("\nMost common year of birth is:\n", common)
    else:
        print("\nNo birth year data is available for this city.")
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)
        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)
        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break


if __name__ == "__main__":
    main()
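
The month and day filter in `load_data` can be tried without the CSV files or the interactive prompts. The following is a small sketch under that assumption: `filter_trips`, `MONTHS`, and the three sample start times are hypothetical and are not part of the script above.

```python
import pandas as pd

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june']


def filter_trips(df, month='all', day='all'):
    """Keep only the trips that start in the given month and on the given day."""
    start = pd.to_datetime(df['Start Time'])
    mask = pd.Series(True, index=df.index)
    if month != 'all':
        # list position plus one gives the month number (january -> 1)
        mask &= start.dt.month == MONTHS.index(month) + 1
    if day != 'all':
        # day_name() yields capitalized names such as 'Monday'
        mask &= start.dt.day_name() == day.title()
    return df[mask]


# Invented start times: a Monday and a Tuesday in January, a Monday in February.
trips = pd.DataFrame({'Start Time': ['2017-01-02 08:05:00',
                                     '2017-01-03 09:00:00',
                                     '2017-02-06 17:10:00']})
print(len(filter_trips(trips, month='january')))
print(len(filter_trips(trips, day='monday')))
print(len(filter_trips(trips, month='february', day='monday')))
```

Building one boolean mask and indexing once keeps the input dataframe untouched, which makes the filter easy to check in isolation.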