Udacity Data Analysis: Exploring US Bikeshare Data

诚王

Article link: https://blog.csdn.net/u010173059/article/details/84705133

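Overview
Use Python to explore data from the bike share systems of three major US cities: Chicago, New York City, and Washington, DC. Write code to import the data and answer interesting questions about it by computing descriptive statistics. Also write a script that takes raw input and creates an interactive experience in the terminal to present these statistics.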

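Bike Share Data
Over the past decade, bike share systems have grown in number and popularity in cities around the world. They let users rent a bicycle for a short time at a set price. Users can pick up a bike at point A and return it at point B, or return it to the same location if they just want to go for a ride. Each bike can serve several users per day.

Thanks to rapid advances in information technology, users can easily access a dock in the system to unlock or return a bike. These technologies also generate a wealth of data that can be used to explore how the systems are used.

In this project, you will use data provided by Motivate, a bike share system provider operating in many major US cities, to uncover bike share usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.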

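The Datasets
Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six core columns:

Start Time (e.g. 2017-01-01 00:07:57)
End Time (e.g. 2017-01-01 00:20:53)
Trip Duration (in seconds, e.g. 776)
Start Station (e.g. Broadway & Barry Ave)
End Station (e.g. Sedgwick St & North Ave)
User Type (Subscriber/Registered or Customer/Casual)

The Chicago and New York City files also contain the following two columns (the data format was shown in an image in the original post):

Gender
Birth Year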
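
The column layout described above can be checked in code before any statistics are computed. The following is a minimal sketch and not part of the original script. It builds a tiny in-memory sample in which the first row reuses the example values from the column list and everything else (the Gender and Birth Year values, and the whole second row) is invented for illustration. The names SAMPLE_CSV, CORE_COLUMNS, OPTIONAL_COLUMNS and describe_columns are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample laid out like the Chicago / New York City files.
# Row 1 reuses the example values from the column list; the rest is made up.
SAMPLE_CSV = io.StringIO(
    "Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year\n"
    "2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,"
    "Sedgwick St & North Ave,Subscriber,Male,1985\n"
    "2017-01-01 09:00:00,2017-01-01 09:05:00,300,Sedgwick St & North Ave,"
    "Broadway & Barry Ave,Customer,Female,1990\n"
)

CORE_COLUMNS = ['Start Time', 'End Time', 'Trip Duration',
                'Start Station', 'End Station', 'User Type']
OPTIONAL_COLUMNS = ['Gender', 'Birth Year']


def describe_columns(df):
    """Return (missing core columns, optional columns that are present)."""
    missing_core = [c for c in CORE_COLUMNS if c not in df.columns]
    available_optional = [c for c in OPTIONAL_COLUMNS if c in df.columns]
    return missing_core, available_optional


df = pd.read_csv(SAMPLE_CSV, parse_dates=['Start Time', 'End Time'])
missing_core, available_optional = describe_columns(df)

# Dropping the optional columns mimics a file that only has the six core columns.
core_only = df.drop(columns=OPTIONAL_COLUMNS)

# Trip Duration is stored in seconds and should match End Time - Start Time.
first_trip_seconds = (df['End Time'] - df['Start Time']).dt.total_seconds().iloc[0]
```

Checking for the optional columns up front lets the statistics code skip the gender and birth-year questions for a city whose file does not include them, instead of failing on a missing column.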

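Questions
1. Which month is most common in the start times (the Start Time column)?
2. Which day of the week (e.g. Monday, Tuesday) is most common in the start times?
3. Which hour of the day is most common in the start times?
4. What is the total trip duration, and what is the average trip duration (Trip Duration)?
5. Which start station (Start Station) is most popular, and which end station (End Station) is most popular?
6. Which trip is most popular (that is, which combination of start station and end station)?
7. How many users are there of each user type?
8. How many users are there of each gender?
9. What are the earliest, most recent, and most common birth years?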
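
Several of these questions reduce to "find the most frequent value", which pandas exposes as mode(). The sketch below is one possible approach to questions 1, 2, 3 and 6 on a three-row frame whose values are invented for illustration (the station names A and B are placeholders, not real stations).

```python
import pandas as pd

# Invented sample trips; only the columns needed for questions 1-3 and 6.
trips = pd.DataFrame({
    'Start Time': pd.to_datetime([
        '2017-01-02 08:05:00',  # a Monday
        '2017-01-09 08:40:00',  # a Monday
        '2017-02-07 17:15:00',  # a Tuesday
    ]),
    'Start Station': ['A', 'A', 'B'],
    'End Station': ['B', 'B', 'A'],
})

start = trips['Start Time']

# mode() returns a Series of the most frequent value(s); [0] takes the first.
most_common_month = start.dt.month.mode()[0]
most_common_day = start.dt.day_name().mode()[0]
most_common_hour = start.dt.hour.mode()[0]

# Count each (start, end) pair; idxmax() returns the pair with the highest count.
most_common_trip = trips.groupby(['Start Station', 'End Station']).size().idxmax()

print(most_common_month, most_common_day, most_common_hour, most_common_trip)
```

When several values tie for most frequent, mode() returns all of them in sorted order, so taking [0] picks the smallest one rather than reporting the tie.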
---------------------

import time
import pandas as pd
import numpy as np

CITY_DATA = { 'chicago': 'chicago.csv',
              'new york city': 'new_york_city.csv',
              'washington': 'washington.csv' }
months = ['january', 'february', 'march', 'april', 'may', 'june', 'all']
weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'all']


def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('Hello! Let\'s explore some US bikeshare data!')
    # TO DO: get user input for city (chicago, new york city, washington). HINT: Use a while loop to handle invalid inputs
    # input_Arg holds the while loop and keeps prompting until the answer is valid
    inputCity = input_Arg('\nPlease enter a city (chicago, new york city, washington).\n',
                          '\nInvalid input. Please enter a city (chicago, new york city, washington).\n',
                          CITY_DATA.keys())
    # TO DO: get user input for month (all, january, february, ... , june)
    inputMonth = input_Arg('\nPlease enter a month (all, january, february, ... , june).\n',
                           '\nInvalid input. Please enter a month (all, january, february, ... , june).\n',
                           months)
    # TO DO: get user input for day of week (all, monday, tuesday, ... sunday)
    inputDay = input_Arg('\nPlease enter a day of week (all, monday, tuesday, ... sunday).\n',
                         '\nInvalid input. Please enter a day of week (all, monday, tuesday, ... sunday).\n',
                         weekdays)
    print('-'*40)
    return inputCity, inputMonth, inputDay


def input_Arg(input_print, error_print, enterable_list):
    """
    Prompts with input_print, then re-prompts with error_print until the
    answer (lower-cased, surrounding whitespace stripped) is in enterable_list.
    """
    ret = input(input_print).lower().strip()
    while ret not in enterable_list:
        ret = input(error_print).lower().strip()
    return ret


def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """

    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])

    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])

    # extract month and day of week from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    # dt.weekday_name was removed in pandas 1.0; dt.day_name() is the replacement
    df['day_of_week'] = df['Start Time'].dt.day_name()

    # filter by month if applicable
    if month != 'all':
        # the module-level months list starts with january..june in order,
        # so list index + 1 is the calendar month number
        month = months.index(month) + 1

        # filter by month to create the new dataframe
        df = df[df["month"] == month]

    # filter by day of week if applicable
    if day != 'all':
        # day names in the column are title-cased (e.g. "Monday")
        df = df[df["day_of_week"] == day.title()]
    return df


def time_stats(df):
    """Displays statistics on the most frequent times of travel."""

    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()

    # TO DO: display the most common month
    popular_month = df['month'].mode()[0]
    print('The most common month is:', popular_month)

    # TO DO: display the most common day of week
    popular_day_of_week = df['day_of_week'].mode()[0]
    print('The most common day of week is:', popular_day_of_week)

    # TO DO: display the most common start hour
    # computed without adding a column, since df may be a filtered slice
    popular_hour = df['Start Time'].dt.hour.mode()[0]
    print('The most common hour is:', popular_hour)

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)


def station_stats(df):
    """Displays statistics on the most popular stations and trip."""

    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()

    # TO DO: display most commonly used start station
    start_station = df['Start Station'].mode()[0]
    print('The most common start_station is:\n', start_station)

    # TO DO: display most commonly used end station
    end_station = df['End Station'].mode()[0]
    print('The most common end_station is:\n', end_station)

    # TO DO: display most frequent combination of start station and end station trip
    # count each (start, end) pair; idxmax returns the pair with the largest count
    Trip = df.groupby(['Start Station', 'End Station']).size().idxmax()
    print("The most frequent combination of start station and end station trip is {} to {}".format(Trip[0], Trip[1]))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)


def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""

    print('\nCalculating Trip Duration...\n')
    start_time = time.time()

    # TO DO: display total travel time
    # Trip Duration is in seconds; sum() adds the durations (count() only counts rows)
    total = df["Trip Duration"].sum()
    print("\nTotal duration (seconds) is:\n", total)

    # TO DO: display mean travel time
    # computed from the timestamps without adding a column to df
    mean_time = (pd.to_datetime(df["End Time"]) - df["Start Time"]).mean()
    print("\nMean travel time is:\n", mean_time)

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)


def user_stats(df):
    """Displays statistics on bikeshare users."""

    print('\nCalculating User Stats...\n')
    start_time = time.time()

    # TO DO: Display counts of user types
    user_types = df['User Type'].value_counts()
    print("\nThe results of counts of user types is:\n",user_types)
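    # TO DO: Display counts of gender
    # only the Chicago and New York City files have Gender and Birth Year,
    # so check for the column instead of swallowing every exception
    if 'Gender' in df.columns:
        gender = df['Gender'].value_counts()
        print("\nThe results of counts of gender is:\n", gender)
    else:
        print("\nNo gender data available for this city.")
    # TO DO: Display earliest, most recent, and most common year of birth
    if 'Birth Year' in df.columns:
        earliest = df['Birth Year'].min()
        recent = df['Birth Year'].max()
        common = df['Birth Year'].mode()[0]
        print("\nEarliest year of birth is:\n", earliest)
        print("\nMost recent year of birth is:\n", recent)
        print("\nMost common year of birth is:\n", common)
    else:
        print("\nNo birth year data available for this city.")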
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)


def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)

        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)

        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break


if __name__ == "__main__":
    main()
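
Question 4 asks for the total and the average trip duration. As a small standalone sketch, assuming Trip Duration holds seconds as in the dataset description, both can be taken straight from that column and printed in a more readable form. The sample values and the helper format_seconds are hypothetical and not part of the script above.

```python
import pandas as pd

# Invented durations in seconds; 776 is the example value from the dataset description.
durations = pd.Series([776, 300, 1024], name='Trip Duration')

total_seconds = int(durations.sum())    # total time ridden, in seconds
mean_seconds = float(durations.mean())  # average trip length, in seconds


def format_seconds(seconds):
    """Render a number of seconds as H:MM:SS (hypothetical helper)."""
    seconds = int(round(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print('Total travel time:', format_seconds(total_seconds))
print('Mean travel time:', format_seconds(mean_seconds))
```

Note that count() on the same column would return the number of trips (3 here), not the time ridden, so it is worth checking which aggregate a "total" actually uses.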