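DataCamp Data Scientist series: Intermediate Python study notes (001)
Personal note: I started with Python in January 2017, and my learning since then has been on and off: learn, forget, relearn. A few days ago I picked up the basic Python operations again through DataCamp and realized I should write the key points down as systematic notes. That is how these notes came about. I plan to write SQL study notes later as well. Studying alone is hard! Keep going, and keep writing!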
1.Dictionaries & Pandas
2.Logic, Control Flow and Filtering
3.Loops
4. Functions to explore further
1.Dictionaries & Pandas
1.1 Basic dictionary operations
1.1.1 Motivation for dictionaries (index lookup with lists)
Hint:
- Use the index() method on countries to find the index of 'germany'. Store this index as ind_ger. (Uses index() to find the position of an element.)
Code
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
# Get index of 'germany': ind_ger
ind_ger=countries.index('germany')
# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])
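As a small sketch of why dictionaries are the natural next step, the two parallel lists above can be paired into a single dictionary with zip(). The variable name europe is borrowed from the later exercises; it is not part of this one.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Pair each country with its capital and build a dictionary in one step
europe = dict(zip(countries, capitals))

# A single key lookup replaces the index() call plus the list indexing
print(europe['germany'])
```

With the dictionary, the lookup no longer depends on the two lists staying in the same order.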
1.1.2 Access dictionary (creating and accessing a dictionary)
Hint:
Example: europe['france']
- Check out which keys are in europe by calling the keys() method on europe. Print out the result. (Prints all keys of the dictionary.)
- Print out the value that belongs to the key 'norway'. (Prints the value stored under the key 'norway'.)
Code
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Print out the keys in europe
print(europe.keys())
# Print out value that belongs to key 'norway'
print(europe['norway'])
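Output (from an older Python version; since Python 3.7 dictionaries keep insertion order, so the keys print as spain, france, germany, norway)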
dict_keys(['norway', 'spain', 'france', 'germany'])
oslo
1.1.3 Dictionary Manipulation 1 (adding entries)
Hint:
- Add the key 'italy' with the value 'rome' to europe. (Adds a key-value pair.)
- To assert that 'italy' is now a key in europe, print out 'italy' in europe. (Checks whether 'italy' is in the dictionary.)
- Add another key:value pair to europe: 'poland' is the key, 'warsaw' is the corresponding value. (Adds another key-value pair.)
Code
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Add italy to europe
europe['italy']='rome'
# Print out italy in europe
print('italy' in europe)
# Add poland to europe
europe['poland']='warsaw'
# Print europe
print(europe)
1.1.4 Dictionary Manipulation 2 (updating and deleting)
Hint:
- Update the value of a key
- Delete a key-value pair from the dictionary
Code
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
'australia':'vienna' }
# Update capital of germany
europe['germany']='berlin'
# Remove australia
del europe['australia']
# Print europe
print(europe)
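Besides del, a dictionary can also be changed with pop() and read safely with get(). The following is a minimal sketch on a shortened version of the dictionary; the names removed, missing and capital are only illustrative.

```python
europe = {'spain': 'madrid', 'germany': 'bonn', 'australia': 'vienna'}

# Assigning to an existing key overwrites its value
europe['germany'] = 'berlin'

# pop() removes the key and returns the value that was stored under it
removed = europe.pop('australia')

# A default argument avoids a KeyError when the key is already gone
missing = europe.pop('australia', None)

# get() reads a value without raising if the key does not exist
capital = europe.get('italy', 'unknown')

print(europe, removed, missing, capital)
```

del raises a KeyError for a missing key, so pop() with a default is the more forgiving choice when the key may not exist.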
1.1.5 Dictionariception (nested dictionaries)
Hint:
- Use chained square brackets to print the capital of France
- Create a new dictionary
- Nest the new dictionary inside the first one
Code
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
'france': { 'capital':'paris', 'population':66.03 },
'germany': { 'capital':'berlin', 'population':80.62 },
'norway': { 'capital':'oslo', 'population':5.084 } }
# Print out the capital of France
print(europe['france']['capital'])
# Create sub-dictionary data
data={'capital':'rome','population':59.83}
# Add data to europe under key 'italy'
europe['italy']=data
# Print europe
print(europe)
Output
paris
{'italy': {'population': 59.83, 'capital': 'rome'}, 'norway': {'population': 5.084, 'capital': 'oslo'}, 'spain': {'population': 46.77, 'capital': 'madrid'}, 'france': {'population': 66.03, 'capital': 'paris'}, 'germany': {'population': 80.62, 'capital': 'berlin'}}
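A nested dictionary can be walked with items(). The sketch below reuses the data from this exercise; populations and largest are illustrative names that do not appear in the course.

```python
europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
    'germany': {'capital': 'berlin', 'population': 80.62},
    'norway': {'capital': 'oslo', 'population': 5.084},
}

# Each value is itself a dictionary, so index it a second time
for country, info in europe.items():
    print(f"{country}: capital={info['capital']}, population={info['population']}")

# Flatten one field of the inner dictionaries into a plain dictionary
populations = {country: info['population'] for country, info in europe.items()}

# Key with the largest population value
largest = max(populations, key=populations.get)
print(largest)
```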
1.2 pandas DataFrame basics
1.2.1 Dictionary to DataFrame (1) (dict to DataFrame)
Hint:
Use pd.DataFrame() to turn your dict into a DataFrame called cars. (Converts the dictionary into a DataFrame.)
Code
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Import pandas as pd
import pandas as pd
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict={'country':names,'drives_right':dr,'cars_per_cap':cpc}
# Build a DataFrame cars from my_dict: cars
cars=pd.DataFrame(my_dict)
# Print cars
print(cars)
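Output (older pandas versions sorted the dictionary keys alphabetically; recent versions keep the insertion order of the keys)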
cars_per_cap country drives_right
0 809 United States True
1 731 Australia False
2 588 Japan False
3 18 India False
4 200 Russia True
5 70 Morocco True
6 45 Egypt True
1.2.2 Dictionary to DataFrame (2) (setting the row index)
Hint:
list → dict → DataFrame
Specify the row labels by setting cars.index equal to row_labels. (Sets the row index of the DataFrame.)
Code
import pandas as pd
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index=row_labels
# Print cars again
print(cars)
Output
cars_per_cap country drives_right
0 809 United States True
1 731 Australia False
2 588 Japan False
3 18 India False
4 200 Russia True
5 70 Morocco True
6 45 Egypt True
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
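The row labels can also be supplied when the DataFrame is built, through the index argument of pd.DataFrame(). This is a sketch of that alternative, not the form the exercise asks for.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Pass the labels at construction time instead of assigning cars.index afterwards
cars = pd.DataFrame(
    {'country': names, 'drives_right': dr, 'cars_per_cap': cpc},
    index=row_labels,
)
print(cars)
```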
1.2.3 CSV to DataFrame (1)
Hint:
pd.read_csv()
Code
import pandas as pd
#Import the cars.csv data: cars
cars=pd.read_csv('cars.csv')
#Print out cars
print(cars)
Output
Unnamed: 0 cars_per_cap country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JPN 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1.2.4 CSV to DataFrame (2) (setting the row index)
Hint:
Use the first column as the row index.
Code
# Import pandas as pd
import pandas as pd
# Fix import by including index_col
cars = pd.read_csv('cars.csv',index_col=0)
# Print out cars
print(cars)
Output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
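To try both variants without the course file, the CSV can be simulated in memory. The text in csv_text below is an assumption about what cars.csv looks like, reconstructed from the printed output; the real file may differ.

```python
import io
import pandas as pd

# Assumed contents of cars.csv: the first header cell is empty
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True
"""

# Without index_col the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 turns that first column into the row labels
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars.index.tolist())
```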
1.3 pandas DataFrame filtering
1.3.1 Square Brackets (1)
Hint:
- Use single square brackets to print out the country column of cars as a Pandas Series. (Selects one column as a Series.)
- Use double square brackets to print out the country column of cars as a Pandas DataFrame. (Selects one column as a DataFrame.)
- Use double square brackets to print out a DataFrame with both the country and drives_right columns of cars, in this order. (Selects two columns as a DataFrame.)
Code
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars)
### output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
# Print out country column as Pandas Series
print(cars['country'])
### output
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
# Print out country column as Pandas DataFrame
print(cars[['country']])
### output
country
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
# Print out DataFrame with country and drives_right columns
print(cars[['country','drives_right']])
### output
country drives_right
US United States True
AUS Australia False
JPN Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
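The difference between single and double brackets is easiest to see from the type and shape of the result. A minimal sketch on a three-row version of cars (built inline here so it runs without cars.csv):

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JPN'],
)

# Single brackets: one column comes back as a one-dimensional Series
as_series = cars['country']
print(type(as_series).__name__, as_series.shape)

# Double brackets: a list of column names comes back as a DataFrame
as_frame = cars[['country']]
print(type(as_frame).__name__, as_frame.shape)
```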
1.3.2 Square Brackets (2) (selecting specific DataFrame rows)
Hint:
- Select the first 3 observations from cars and print them out. (Selects the first 3 rows.)
- Select the fourth, fifth and sixth observation, corresponding to row indexes 3, 4 and 5, and print them out. (Selects the rows at positions 3, 4 and 5.)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars)
### output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
# Print out first 3 observations
print(cars[0:3])
###output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
# Print out fourth, fifth and sixth observation
print(cars.iloc[[3,4,5]])
###output
cars_per_cap country drives_right
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
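The fourth to sixth rows can be selected in several equivalent ways. The sketch below rebuilds cars inline (so it runs without cars.csv) and compares a plain slice, an iloc slice and an iloc list.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

# An integer slice in square brackets selects rows by position; the end is excluded
by_slice = cars[3:6]

# The same selection written with iloc, as a slice and as an explicit list
by_iloc_slice = cars.iloc[3:6]
by_iloc_list = cars.iloc[[3, 4, 5]]

print(by_slice)
```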
1.3.3 loc and iloc (1) (selecting DataFrame rows by row position or row label)
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
Hint:
The following commands are equivalent:
cars.loc['RU']
cars.iloc[4]
Both return a pandas.core.series.Series.
In [8]: cars.loc['RU']
Out[8]:
cars_per_cap 200
country Russia
drives_right True
Name: RU, dtype: object
In [9]: cars.iloc[4]
Out[9]:
cars_per_cap 200
country Russia
drives_right True
Name: RU, dtype: object
In [10]: type(cars.loc['RU'])
Out[10]: pandas.core.series.Series
Hint:
The following commands are equivalent:
cars.loc[['RU']]
cars.iloc[[4]]
In [5]: cars.loc[['RU']]
Out[5]:
cars_per_cap country drives_right
RU 200 Russia True
In [6]: type(cars.loc[['RU']])
Out[6]: pandas.core.frame.DataFrame
In [7]: cars.iloc[[4]]
Out[7]:
cars_per_cap country drives_right
RU 200 Russia True
Hint:
The following commands are equivalent:
cars.loc[['RU', 'AUS']]
cars.iloc[[4, 1]]
In [13]: cars.loc[['RU', 'AUS']]
Out[13]:
cars_per_cap country drives_right
RU 200 Russia True
AUS 731 Australia False
In [14]: cars.iloc[[4, 1]]
Out[14]:
cars_per_cap country drives_right
RU 200 Russia True
AUS 731 Australia False
1.3.4 loc and iloc (2) (selecting a region of rows and columns)
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
Hint:
The following commands are equivalent:
cars.loc['IN', 'cars_per_cap']
cars.iloc[3, 0]
In [1]: cars.loc['IN', 'cars_per_cap']
Out[1]: 18
In [2]: cars.iloc[3, 0]
Out[2]: 18
Hint:
The following commands are equivalent:
cars.loc[['IN', 'RU'], 'cars_per_cap']
cars.iloc[[3, 4], 0]
In [3]: cars.loc[['IN', 'RU'], 'cars_per_cap']
Out[3]:
IN 18
RU 200
Name: cars_per_cap, dtype: int64
In [4]: cars.iloc[[3, 4], 0]
Out[4]:
IN 18
RU 200
Name: cars_per_cap, dtype: int64
Hint:
The following commands are equivalent:
cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]
cars.iloc[[3, 4], [0, 1]]
In [5]: cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]
Out[5]:
cars_per_cap country
IN 18 India
RU 200 Russia
In [6]: cars.iloc[[3, 4], [0, 1]]
Out[6]:
cars_per_cap country
IN 18 India
RU 200 Russia
# Print out drives_right value of Morocco (double brackets return a 1x1 DataFrame)
print(cars.loc[['MOR'],['drives_right']])
# output
drives_right
MOR True
# Print sub-DataFrame
print(cars.loc[['RU','MOR'],['country','drives_right']])
#output
country drives_right
RU Russia True
MOR Morocco True
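Two details of loc and iloc are easy to miss: plain labels return a scalar while lists return a DataFrame, and label slices include their end point while position slices do not. A sketch on an inline copy of cars (the variable names are illustrative):

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

# Plain row and column labels return the single stored value
value = cars.loc['MOR', 'drives_right']

# Lists of labels return a DataFrame, even for one row and one column
frame = cars.loc[['MOR'], ['drives_right']]

# Label slices with loc include the end label
label_slice = cars.loc['IN':'MOR']

# Position slices with iloc exclude the end position
pos_slice = cars.iloc[3:5]

print(value, frame.shape, label_slice.index.tolist(), pos_slice.index.tolist())
```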
1.3.5 loc and iloc (3) (selecting one or more columns across all rows)
Hint:
The following commands are equivalent:
cars.loc[:, 'country']
cars.iloc[:, 1]
In [2]: cars.loc[:, 'country']
Out[2]:
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
In [3]: cars.iloc[:, 1]
Out[3]:
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
Hint:
The following commands are equivalent:
cars.loc[:, ['country', 'drives_right']]
cars.iloc[:, [1, 2]]
In [4]: cars.loc[:, ['country','drives_right']]
Out[4]:
country drives_right
US United States True
AUS Australia False
JPN Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
In [5]: cars.iloc[:, [1, 2]]
Out[5]:
country drives_right
US United States True
AUS Australia False
JPN Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
In [6]: # Print out drives_right column as Series
... print(cars.loc[:,'drives_right'])
###output
US True
AUS False
JPN False
IN False
RU True
MOR True
EG True
Name: drives_right, dtype: bool
In [7]: # Print out drives_right column as DataFrame
... print(cars.loc[:,['drives_right']])
###output
drives_right
US True
AUS False
JPN False
IN False
RU True
MOR True
EG True
In [8]: # Print out cars_per_cap and drives_right as DataFrame
... print(cars.loc[:,['cars_per_cap','drives_right']])
###output
cars_per_cap drives_right
US 809 True
AUS 731 False
JPN 588 False
IN 18 False
RU 200 True
MOR 70 True
EG 45 True
2. Logic, Control Flow and Filtering
2.1 Boolean operators with array
2.1.1 Boolean operators with Numpy (using boolean values in element-wise array comparisons)
Hint:
To use these operators with Numpy, you will need np.logical_and(), np.logical_or() and np.logical_not(). Here's an example on the my_house and your_house arrays from before to give you an idea:
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house>18.5,my_house<10))
#output
[False True False True]
# Both my_house and your_house smaller than 11
print(np.logical_and(my_house<11,your_house<11))
#output
[False False False True]
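For boolean NumPy arrays, the operators |, & and ~ do the same element-wise work as np.logical_or(), np.logical_and() and np.logical_not(). A minimal sketch on the same arrays; the parentheses are required because these operators bind more tightly than the comparisons.

```python
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# Element-wise OR: greater than 18.5 or smaller than 10
either = (my_house > 18.5) | (my_house < 10)

# Element-wise AND: both houses smaller than 11 at the same position
both = (my_house < 11) & (your_house < 11)

# Element-wise NOT: rooms in my_house that are not smaller than 11
not_small = ~(my_house < 11)

print(either, both, not_small)
```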
2.2 Filtering pandas DataFrames (filtering a DataFrame by value conditions)
Data overview
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
#output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
2.2.1 Driving right
Task: select the countries where people drive on the right.
Step 1:
- Extract the drives_right column as a Pandas Series and store it as dr.
In [4]: dr=cars['drives_right']
In [5]: dr
Out[5]:
US True
AUS False
JPN False
IN False
RU True
MOR True
EG True
Name: drives_right, dtype: bool
Step 2:
- Use dr, a boolean Series, to subset the cars DataFrame. Store the resulting selection in sel.
In [6]: # Use dr to subset cars: sel
... sel=cars[dr==True]
In [7]: sel
Out[7]:
cars_per_cap country drives_right
US 809 United States True
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
Convert the code on the right to a one-liner that calculates the variable sel as before.
The steps above can be condensed into one line:
In [1]: sel= cars[cars['drives_right']]
print(sel)
##output
cars_per_cap country drives_right
US 809 United States True
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
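A boolean column can serve directly as a filter, and ~ inverts it to obtain the complementary rows. The sketch below uses an inline copy of cars; the names right and left are illustrative.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

# The boolean column is already a mask, so comparing it with True is not needed
drives_right_mask = cars['drives_right']
right = cars[drives_right_mask]

# ~ flips every value in the mask: countries that drive on the left
left = cars[~drives_right_mask]

print(right.index.tolist())
print(left.index.tolist())
```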
2.2.2 Cars per capita (1)
Task: select the countries whose cars_per_cap value is greater than 500.
Steps:
- Select the cars_per_cap column from cars as a Pandas Series and store it as cpc.
- Use cpc in combination with a comparison operator and 500. You want to end up with a boolean Series that's True if the corresponding country has a cars_per_cap of more than 500 and False otherwise. Store this boolean Series as many_cars.
- Use many_cars to subset cars, similar to what you did before. Store the result as car_maniac.
- Print out car_maniac to see if you got it right.
Code
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Create car_maniac: observations that have a cars_per_cap over 500
cpc=cars['cars_per_cap']
many_cars=cpc>500
car_maniac=cars[many_cars]
# Print car_maniac
print(car_maniac)
Output
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
2.2.3 Cars per capita (2) (filtering on multiple conditions)
Task:
Use the code sample in the hint below to create a DataFrame medium that includes all the observations of cars that have a cars_per_cap between 100 and 500.
Print out medium.
Hint:
Remember about np.logical_and(), np.logical_or() and np.logical_not(). The code sample from the exercise:
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 10, cpc < 80)
medium = cars[between]
Code
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Import numpy, you'll need this
import numpy as np
# Create medium: observations with cars_per_cap between 100 and 500
cpc=cars['cars_per_cap']
between=np.logical_and(cpc<500,cpc>100)
medium=cars[between]
# Print medium
print(medium)
Output
cars_per_cap country drives_right
RU 200 Russia True
A look at the intermediate variables
In [2]: cpc
Out[2]:
US 809
AUS 731
JPN 588
IN 18
RU 200
MOR 70
EG 45
Name: cars_per_cap, dtype: int64
In [3]: between
Out[3]:
US False
AUS False
JPN False
IN False
RU True
MOR False
EG False
Name: cars_per_cap, dtype: bool
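The same range filter can be written without np.logical_and(), either with the & operator or with DataFrame.query(). This is a sketch of those two alternatives on an inline copy of cars, not the form the exercise expects.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

cpc = cars['cars_per_cap']

# & combines two boolean Series element by element; parentheses are required
medium = cars[(cpc > 100) & (cpc < 500)]

# query() expresses the same condition as a string over column names
medium_q = cars.query('100 < cars_per_cap < 500')

print(medium)
```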
3. Loops
3.1 Loop over Numpy array
3.1.1 Loop over Numpy array (iterating over and printing 1D and 2D arrays)
In [3]: np_baseball[:6]
Out[3]:
array([[ 74, 180],
[ 74, 215],
[ 72, 210],
[ 72, 210],
[ 73, 188],
[ 69, 176]])
In [4]: np_height[:6]
Out[4]: array([74, 74, 72, 72, 73, 69])
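# Import numpy as np
import numpy as np
# For loop over np_height (1D array)
for x in np_height:
    print('%s inches' % x)
# For loop over np_baseball (2D array): np.nditer visits every element
# in memory order, so a column-major array prints its first column first
for i in np.nditer(np_baseball):
    print(i)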
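The order in which np.nditer() visits a 2D array depends on the array's memory layout unless an order is requested. The sketch below uses the first three rows of np_baseball shown above; whether the course array is row-major or column-major is not known from these notes.

```python
import numpy as np

np_baseball = np.array([[74, 180],
                        [74, 215],
                        [72, 210]])

# Default iteration follows memory order; np.array builds a row-major array
row_major = [int(v) for v in np.nditer(np_baseball)]

# A column-major copy of the same data is visited column by column
col_major_copy = np.asfortranarray(np_baseball)
col_first = [int(v) for v in np.nditer(col_major_copy)]

# order='F' requests column-by-column iteration regardless of the layout
forced = [int(v) for v in np.nditer(np_baseball, order='F')]

print(row_major)
print(col_first)
print(forced)
```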
3.2 Loop over DataFrame
3.2.1 Loop over DataFrame1
Hint: iterate over each row by its index label
for lab, row in brics.iterrows() :
Data overview
In [2]: cars
Out[2]:
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
Code
# Iterate over rows of cars
for lab, row in cars.iterrows():
    print(lab)
    print(row)
Output
US
cars_per_cap 809
country United States
drives_right True
Name: US, dtype: object
AUS
cars_per_cap 731
country Australia
drives_right False
Name: AUS, dtype: object
JPN
cars_per_cap 588
country Japan
drives_right False
Name: JPN, dtype: object
IN
cars_per_cap 18
country India
drives_right False
Name: IN, dtype: object
RU
cars_per_cap 200
country Russia
drives_right True
Name: RU, dtype: object
MOR
cars_per_cap 70
country Morocco
drives_right True
Name: MOR, dtype: object
EG
cars_per_cap 45
country Egypt
drives_right True
Name: EG, dtype: object
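Since each row produced by iterrows() is a Series, a single field can be read from it by column name. A minimal sketch on a three-row copy of cars; the list lines is only there to collect the printed text.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JPN'],
)

lines = []
for lab, row in cars.iterrows():
    # lab is the row label, row is a Series indexed by the column names
    line = lab + ': ' + str(row['cars_per_cap'])
    lines.append(line)
    print(line)
```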
3.2.2 Add column (1) (adding a column)
- Use a for loop to add a new column, named COUNTRY, that contains an uppercase version of the country names in the "country" column. You can use the string method upper() for this.
- To see if your code worked, print out cars. Don't indent this code, so that it's not part of the for loop.
Code
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, 'COUNTRY'] = row['country'].upper()
# Print cars
print(cars)
Output
cars_per_cap country drives_right COUNTRY
US 809 United States True UNITED STATES
AUS 731 Australia False AUSTRALIA
JPN 588 Japan False JAPAN
IN 18 India False INDIA
RU 200 Russia True RUSSIA
MOR 70 Morocco True MOROCCO
EG 45 Egypt True EGYPT
3.2.3 Add column (2) (adding a column with apply())
Use apply().
Code
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)
Output
cars_per_cap country drives_right COUNTRY
US 809 United States True UNITED STATES
AUS 731 Australia False AUSTRALIA
JPN 588 Japan False JAPAN
IN 18 India False INDIA
RU 200 Russia True RUSSIA
MOR 70 Morocco True MOROCCO
EG 45 Egypt True EGYPT
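For string columns, pandas also offers the .str accessor, which applies a string method to the whole column without apply(). A sketch of that alternative on a three-row copy of cars, compared against the apply() version:

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JPN'],
)

# .str.upper() uppercases every value in the column in one call
cars['COUNTRY'] = cars['country'].str.upper()

# The apply() version from the exercise gives the same values
via_apply = cars['country'].apply(str.upper)

print(cars)
```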
4. Functions to explore further
How to use apply()
That's all for now; I will keep improving these notes!