Introduction to Data Science in Python Week 2

Introduction

When using pandas, Stock Overflow is the best place to ask questions related to pandas.
OTHER Sources:
Learning the Pandas Library by Matt Harrison
planet python.org or it’s Twitter @PlanetPython
Data Skeptic Podcast

The Series Data Structure

See the documentation of Series

import pandas as pd
pd.Series?
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
>>> 0	Tiger
	1	Bear
	2	Moose
	dtype: object

None type in pandas

animals=['Tiger', 'Bear', None] #此时index为2处返回None
animals=[1, 2, None] #此时index为2处返回NaN,dtype为float64
# NaN != None

The index value can be set to the keys from our dictionary.

import pandas as pd
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
>>>	Golf	Scotland
	Sumo	Japan
	Hockey	NaN
	dtype:	object

Querying a Series

iloc and loc

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.iloc[3]
>>> 'South Korea'
s.loc['Golf']
>>> 'Scotland'

iloc and loc are not methods, they are attributes. iloc 直接返回的是value。

s[3]
>>>'South Korea'
s['Golf']
>>>'Scotland'

返回value里index相对应的内容的方法
s.iloc[第几个] (返回value里第几个)
s.loc[index]
s[第几个] (返回value里第几个)
s[index]

sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0] #This won't call s.iloc[0] as one might expect, it generates an error instead
>>>Trackback

Using Numpy’s vectorization to increase manipulation speed.

s = pd.Series([100.00, 120.00, 101.00, 3.00])
import numpy as np
total = np.sum(s)
print(total)
>>>324.0

Jupyter Notebook 中比较非vectorize和vectorize二者的时间

#this creates a big series of random numbers
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.loc[label]= value+2
>>> 1.44 s ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
>>> 288 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

There’s an example where index values are not unique, and this makes data frames different, conceptually, that a relational database might be.

original_sports = pd.Series({'Archery': 'Bhutan',
						'Golf': 'Scotland',
                        'Sumo': 'Japan',
                        'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
all_countries = original_sports.append(cricket_loving_countries)
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	dtype:		object

all_countries
>>>	Archery		Bhutan
	Golf		Scotland
	Sumo		Japan
	Taekwondo	South Korea
	Cricket		Australia
	Cricket		Barbados
	Cricket		Pakistan
	Cricket		England
	dtype:		object

all_countries.loc['Cricket']
>>>	Cricket		Australia
	Cricket     Barbados
	Cricket     Pakistan
	Cricket		England
	dtype:		object

The original series values are not change
When using ‘Cricket’ as the index, we don’t get a single value, but a series itself.

The DataFrame Data Structure

The data frame is conceptually 2 dimensional series of objects with an index and multiple columns, with each column having a label.

How to create a data frame data structure

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()

And it returns
在这里插入图片描述

df.loc['Store 1']

it returns a table of first two rows as well as the head of the table.

For the purchase records from the pet store, how would you get a list of all items which had been purchased (regardless of where they might have been purchased, or by whom)?

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

# Your code here
df['Item Purchased']

If we are only interested in Store1 cost, we can write

df.loc['Store 1','Cost']

loc和iloc用作 row 的处理。
If we want to do a column selection and get a list of cost

df['Cost']

Here is an example of we are asking for all of the name and cost values for all stores using the .loc operator.

df.loc[:,['Name', 'Cost']]

Droping data

df.drop('Store 1') #only remain store 2.
#use this don't change the original data, it returns a copy. Instead, you should make a copy and do operations on the copied one.
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

Drop has two interesting optional parameters. The first is called in place, and if it’s set to true, the DataFrame will be updated in place, instead of a copy being returned. The second parameter is the axes, which should be dropped. By default, this value is 0, indicating the row axes. But you could change it to 1 if you want to drop a column.

del copy_df['Name']
#it acts immediately on copy_df and does't return a copy

If we want to add a new column

df['Location'] = None
#it can broadcast

Eg. For the purchase records from the pet store, how would you update the DataFrame, applying a discount of 20% across all the values in the ‘Cost’ column?

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

# Your answer here
df['Cost']=df['Cost']*0.8
df

DataFrame Indexing and Loading

!cat olympics.csv #this works on Mac and Linux, not on windows
df = pd.read_csv('olympics.csv')
df.head()

it returns 在这里插入图片描述

however, we can set the column and row by using

df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()

and it returns
在这里插入图片描述

We can also rename columns

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True) 

df.head()

and it returns
在这里插入图片描述

Querying a DataFrame

Boolean mask
Bollean masks are created by applying operators directly to the pandas series or DataFrame objects.
在这里插入图片描述

If we only want to a table return the countries who won gold

only_gold = df[df['Gold'] > 0]
only_gold.head() #return first few rows
only_gold #return all the table

len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]) #returns the number of countries who won Gold in Summer or in Winter
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)] #returns a table of countries who won Gold in winter and never won Gold in summer.
#这里需要加上括号,因为masks有运算顺序

Eg. Write a query to return all of the names of people who bought products worth more than $3.00.

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])


# Your code here
df['Name'][df['Cost']>3]

Indexing Dataframes

The index is essentially a row level label, and we know that rows correspond to axis zero.
One option for setting an index is to use the set_index function. This function takes a list of columns and promotes those columns to an index. It is a destructive process, it doesn’t keep the current index.

df['country'] = df.index
df = df.set_index('Gold')
df.head()

在这里插入图片描述

We can use

df=df.reset_index()

to create a default index.
An other example:

df=df.read_csv('census.csv') #read csv file
df.head()
df['SUMLEV'].unique()#find unique values in column 'SUMLEV'
df=df[df['SUMLEV'] ==50]# generate table with SUMLEV equals to 50
df.head()
columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
df = df[columns_to_keep] #generate table with these columns
df.head()
df = df.set_index(['STNAME', 'CTYNAME'])# Use STNAME and CTYNAME as two indexes
df.head()
df.loc['Michigan', 'Washtenaw County'] #When you have multiple indexes, you should consider the order.
df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')] ] #return the table with STNAME Michigan as the first index and CTYNAME Washtenaw County and Wayne County as the second index.

Eg. Reindex the purchase records DataFrame to be indexed hierarchically, first by store, then by person. Name these indexes ‘Location’ and ‘Name’. Then add a new entry to it with the value of:
Name: ‘Kevyn’, Item Purchased: ‘Kitty Food’, Cost: 3.00 Location: ‘Store 2’.

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

# Your answer here
df = df.set_index([df.index, 'Name'])
df.index.names = ['Location', 'Name'] #此处用df.index.names 更改index名称
df = df.append(pd.Series(data={'Cost': 3.00, 'Item Purchased': 'Kitty Food'}, name=('Store 2', 'Kevyn'))) #使用name来表示index, 注意优先级。
df

Missing Values

One of the handy functions that Pandas has for working with missing values is the filling function, fillna.
This function takes a number of parameters, for instance, you could pass in a single value which is called a scalar value to change all of the missing data to one value.

df.fillna?
Signature: df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Use

fillna(method='ffill')

将上一行的值填充到NaN的行中。
在做一些运算,例如计算平均值时,这些NaN value通常不会被考虑在内。

Homework

Assignment 2 - Pandas Introduction
All questions are weighted the same in this assignment.

Part 1

The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.

import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()

Question 0 (Example)
What is the first country in df?

This function should return a Series.

# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero() 

Question 1
Which country has won the most gold medals in summer games?

This function should return a single string value.

import pandas as pd
def answer_one():
    return df['Gold'].idxmax()
answer_one()

Question 2
Which country had the biggest difference between their summer and winter gold medal counts?

This function should return a single string value.

def answer_two():
    return (df['Gold']-df['Gold.1']).idxmax()
answer_two()

Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?
Only include countries that have won at least 1 gold in both summer and winter.

This function should return a single string value.

def answer_three():
    only_gold=df.where((df['Gold']> 0) & (df['Gold.1'] > 0))
    only_gold = only_gold.dropna()
    relative = abs((only_gold['Gold']-only_gold['Gold.1'])/only_gold['Gold.2'])
    return relative.idxmax()
answer_three()

Question 4
Write a function that creates a Series called “Points” which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created, with the country names as indices.

This function should return a Series named Points of length 146

def answer_four():
    df['points']=df['Gold.2']*3+df['Silver.2']*2+df['Bronze.2']*1
    return df['points']
answer_four()
#len(df['points'])

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You’ll need this for future questions too…)

This function should return a single string value.

census_df = pd.read_csv('census.csv')
census_df.head()
def answer_five():
    new_df = census_df[census_df['SUMLEV'] == 50]
    return new_df.groupby('STNAME').count()['SUMLEV'].idxmax()
answer_five()

Question 6
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.

This function should return a list of string values.

def answer_six():
    new_df=census_df[census_df['SUMLEV']==50]
    highest_population=new_df.sort_values('CENSUS2010POP',ascending=False).groupby('STNAME').head(3)
    return highest_population.groupby('STNAME').sum().sort_values('CENSUS2010POP',ascending=False).head(3).index.tolist()
answer_six()

Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.

This function should return a single string value.

def answer_seven():
    new_df = census_df[census_df['SUMLEV'] == 50][[6, 9, 10, 11, 12, 13, 14]]
    new_df["MaxDiff"] = abs(new_df.max(axis=1) - new_df.min(axis=1))
    most_change = new_df.sort_values(by=["MaxDiff"], ascending = False)
    return most_change.iloc[0][0]
answer_seven()

Question 8
In this datafile, the United States is broken up into four regions using the “REGION” column.

Create a query that finds the counties that belong to regions 1 or 2, whose name starts with ‘Washington’, and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.

This function should return a 5x2 DataFrame with the columns = [‘STNAME’, ‘CTYNAME’] and the same index ID as the census_df (sorted ascending by index).

def answer_eight():
    counties = census_df[census_df['SUMLEV'] == 50]
    region = counties[(counties['REGION'] == 1) | (counties['REGION'] == 2)]
    start_washington = region[region['CTYNAME'].str.startswith("Washington")]
    grow = start_washington[start_washington['POPESTIMATE2015'] > start_washington['POPESTIMATE2014']]
    return grow[['STNAME', 'CTYNAME']]

answer_eight()
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值