pandas Study Notes (Part 4)

Note: this tutorial is part of a series, and this chapter continues on from Part 1.

15 Boolean Indexing

15.1 Loading the data

import pandas as pd

# Use movie_title as the index column
movies = pd.read_csv("./pandasLearnData/movie.csv",index_col="movie_title")
movies.head(5)
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Avatar | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0
Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000
The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0

5 rows × 27 columns

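15.2 Building a boolean index

# Find the movies that run longer than two hours
movie_2_hours = movies.duration > 120
print("Type:",type(movie_2_hours))
print("<"+"="*75+">")
print(movie_2_hours)
# The comparison operator returns a Series of booleans indexed by movie_title.
# This Series is the boolean index (also called a boolean mask).
Type: <class 'pandas.core.series.Series'>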
<===========================================================================>
movie_title
Avatar                                       True
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Name: duration, Length: 4916, dtype: bool

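15.3 Counting boolean values

# Count the movies that run longer than two hours
movie_2_hours.sum()
1039
# Proportion of movies that run longer than two hours
movie_2_hours.mean()
0.2113506916192026
# The duration column contains missing values, and NaN > 120 evaluates to False,
# so the proportion above is computed over all 4916 rows
print("duration has missing values:",movies.duration.isnull().any())
# Drop the missing values first, then compute the proportion
movies.duration.dropna().gt(120).mean()
duration has missing values: True
0.21199755152009794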
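
The effect of missing values on the proportion can be reproduced on a small example. The sketch below uses a made-up Series of five durations (an assumption for illustration, not the movie dataset) and compares the mean over all rows with the mean over the known values only.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing.
duration = pd.Series([178.0, 95.0, np.nan, 130.0, 100.0])

over_2h = duration > 120                 # NaN > 120 evaluates to False
count = int(over_2h.sum())               # each True counts as 1
share_all = over_2h.mean()               # denominator includes the NaN row
share_known = duration.dropna().gt(120).mean()  # denominator excludes it

print(count, share_all, share_known)
```

Both proportions count the same two movies; they differ only in the denominator (five rows versus four known durations).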

15.4 Comparing two columns of the same DataFrame

# Movies where the second-billed actor has more Facebook likes than the first-billed actor
movies.actor_1_facebook_likes < movies.actor_2_facebook_likes
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Length: 4916, dtype: bool

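15.5 The difference between any() and all()

# all() is True only when every value is True
print(pd.Series([True,True,True]).all())
# all() returns False as soon as one value is False
print(pd.Series([True,False,True]).all())
True
False
# any() is False only when every value is False
print(pd.Series([False,False,False]).any())
# any() returns True as soon as one value is True
print(pd.Series([False,False,True]).any())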
False
True

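15.6 Combining boolean indexes with logical operators

Note: pandas combines boolean Series with the bitwise operators & (and), | (or) and ~ (not). These bind more tightly than comparison operators such as > and <, so each comparison must be wrapped in parentheses.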
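
To see why the parentheses matter, here is a minimal sketch on a made-up three-row frame with integer columns (the column names `duration` and `likes` and the values are assumptions for illustration). Without parentheses Python evaluates `120 & df["likes"]` first and then treats the rest as a chained comparison, which ends in a ValueError because a Series has no single truth value.

```python
import pandas as pd

# Hypothetical data, integer columns only.
df = pd.DataFrame({"duration": [178, 95, 130],
                   "likes": [1000, 40000, 11000]})

# With parentheses each comparison is evaluated first, then combined.
ok = (df["duration"] > 120) & (df["likes"] > 1000)

# Without parentheses the expression is parsed as
#   df["duration"] > (120 & df["likes"]) > 1000
# and the chained comparison needs bool(Series), which raises.
try:
    df["duration"] > 120 & df["likes"] > 1000
    raised = False
except ValueError:
    raised = True

print(ok.tolist(), raised)
```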

15.6.1 AND

# Movies longer than two hours whose first-billed actor has more than 1000 Facebook likes
(movies["duration"] > 120) & (movies["actor_1_facebook_likes"] > 1000)
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Length: 4916, dtype: bool

15.6.2 OR

# Movies longer than 120 minutes or shorter than 100 minutes
(movies["duration"] < 100) | (movies["duration"] > 120)
movie_title
Avatar                                       True
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                                True
A Plague So Pleasant                         True
Shanghai Calling                            False
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool

15.6.3 NOT

# Movies whose duration is not greater than 120 minutes
~ (movies["duration"] > 120)
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                                True
A Plague So Pleasant                         True
Shanghai Calling                             True
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool
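
One subtlety of negation is how it treats missing values. The sketch below, on a made-up Series with one NaN (an assumption for illustration), shows that `~(s > 120)` marks the missing row as True, while the direct comparison `s <= 120` marks it as False.

```python
import numpy as np
import pandas as pd

# Hypothetical durations; the last one is missing.
duration = pd.Series([178.0, 95.0, np.nan], index=["a", "b", "c"])

not_over = ~(duration > 120)   # NaN > 120 is False, so its negation is True
at_most = duration <= 120      # NaN <= 120 is False

print(not_over.tolist(), at_most.tolist())
```

If rows with an unknown duration should not be selected, the direct comparison (or an extra `notnull()` condition) is the safer choice.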

15.6.4 Membership test (isin)

# Movies whose duration is one of the listed values
movies["duration"].isin([90.0,100.0,120.0])
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                             True
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool
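
As a rough illustration, `isin` gives the same mask as a chain of `==` comparisons joined with `|`, and it can be negated with `~` to exclude the listed values. The Series below is made up for the example.

```python
import pandas as pd

# Hypothetical durations in minutes.
duration = pd.Series([90.0, 100.0, 110.0, 120.0, 150.0])
wanted = [90.0, 100.0, 120.0]

by_isin = duration.isin(wanted)
by_or = (duration == 90.0) | (duration == 100.0) | (duration == 120.0)

# Negating the mask keeps the values that are NOT in the list.
excluded = duration[~by_isin]

print(by_isin.tolist(), excluded.tolist())
```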

15.7 Selecting data with a boolean index

# Select the rows whose duration is greater than two hours
movies[movie_2_hours]
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Avatar | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0
Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000
The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Intolerance: Love's Struggle Throughout the Ages | Black and White | D.W. Griffith | 69.0 | 123.0 | ... | 22.0 | 8.0 | 1.33 | 691
The Big Parade | Black and White | King Vidor | 48.0 | 151.0 | ... | 12.0 | 8.3 | 1.33 | 226
Ordet | Black and White | Carl Theodor Dreyer | 54.0 | 126.0 | ... | 0.0 | 8.1 | 1.37 | 863
The Ridges | NaN | Brandon Landers | NaN | 143.0 | ... | 19.0 | 3.0 | NaN | 33

1039 rows × 27 columns

# Rows where the first-billed actor has more than 10,000 Facebook likes and the movie runs longer than two hours
movies[(movies["actor_1_facebook_likes"] > 10000) & movie_2_hours]
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0
Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000
The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000
Spider-Man 3 | Color | Sam Raimi | 392.0 | 156.0 | ... | 11000.0 | 6.2 | 2.35 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
That Thing You Do! | Color | Tom Hanks | 75.0 | 149.0 | ... | 9000.0 | 6.9 | 1.37 | 0
Stonewall | Color | Roland Emmerich | 74.0 | 129.0 | ... | 463.0 | 4.5 | 2.35 | 0
The Good, the Bad and the Ugly | Color | Sergio Leone | 181.0 | 142.0 | ... | 34.0 | 8.9 | 2.35 | 20000
Rocky | Color | John G. Avildsen | 141.0 | 145.0 | ... | 1000.0 | 8.1 | 1.33 | 0

435 rows × 27 columns
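
The same selection pattern can be tried on a tiny frame. The sketch below uses four made-up rows (titles A to D and their values are assumptions for illustration): two masks are built separately, combined with `&`, and passed to the indexing operator.

```python
import pandas as pd

# Hypothetical stand-in for the movie table.
movies_demo = pd.DataFrame(
    {"duration": [178, 95, 148, 90],
     "actor_1_facebook_likes": [1000, 40000, 11000, 500]},
    index=pd.Index(["A", "B", "C", "D"], name="movie_title"),
)

long_movie = movies_demo["duration"] > 120
popular_lead = movies_demo["actor_1_facebook_likes"] > 10000

# Only rows where both masks are True are kept.
selected = movies_demo[long_movie & popular_lead]

print(selected)
```

The number of rows returned always equals the number of True values in the combined mask.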

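15.8 Extensions

15.8.1 Using label indexing instead of boolean indexing

15.8.1.1 Boolean indexing
# Find every school in Texas
# (college_data is not loaded in this chapter; the output below shows it indexed by INSTNM)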
college_data[college_data["STABBR"] == "TX"]
INSTNM | CITY | STABBR | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP
Abilene Christian University | Abilene | TX | 0.0 | 0.0 | ... | 0.5527 | 0.0381 | 40200 | 25985
Alvin Community College | Alvin | TX | 0.0 | 0.0 | ... | 0.0625 | 0.2841 | 34500 | 6750
Amarillo College | Amarillo | TX | 0.0 | 0.0 | ... | 0.1573 | 0.3431 | 31700 | 10950
Angelina College | Lufkin | TX | 0.0 | 0.0 | ... | 0.0000 | 0.2603 | 26900 | PrivacySuppressed
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Strayer University-San Antonio | San Antonio | TX | NaN | NaN | ... | NaN | NaN | NaN | 36173.5
Strayer University-Stafford | Stafford | TX | NaN | NaN | ... | NaN | NaN | NaN | 36173.5
Vantage College | El Paso | TX | NaN | NaN | ... | NaN | NaN | NaN | 9500
Excel Learning Center-San Antonio South | San Antonio | TX | NaN | NaN | ... | NaN | NaN | NaN | 12125

472 rows × 26 columns

15.8.1.2 Label indexing
# First set the STABBR column as the row index
college_data2 = college_data.set_index("STABBR")
college_data2.loc["TX"]
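STABBR | CITY | HBCU | MENONLY | WOMENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP
TX | Abilene | 0.0 | 0.0 | 0.0 | ... | 0.5527 | 0.0381 | 40200 | 25985
TX | Alvin | 0.0 | 0.0 | 0.0 | ... | 0.0625 | 0.2841 | 34500 | 6750
TX | Amarillo | 0.0 | 0.0 | 0.0 | ... | 0.1573 | 0.3431 | 31700 | 10950
TX | Lufkin | 0.0 | 0.0 | 0.0 | ... | 0.0000 | 0.2603 | 26900 | PrivacySuppressed
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
TX | San Antonio | NaN | NaN | NaN | ... | NaN | NaN | NaN | 36173.5
TX | Stafford | NaN | NaN | NaN | ... | NaN | NaN | NaN | 36173.5
TX | El Paso | NaN | NaN | NaN | ... | NaN | NaN | NaN | 9500
TX | San Antonio | NaN | NaN | NaN | ... | NaN | NaN | NaN | 12125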

472 rows × 25 columns

15.8.1.3 Comparing the speed of the two methods
%timeit college_data[college_data["STABBR"] == "TX"]
%timeit college_data2.loc["TX"]
# In this run label indexing took about half the time of boolean indexing (559 µs vs 1.12 ms)
1.12 ms ± 84.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
559 µs ± 52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
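
That the two approaches return the same rows can be checked on a small frame. The sketch below uses four made-up schools (names and values are assumptions for illustration). Note that `set_index("STABBR")` replaces the previous index, so the label-based result no longer carries the school names, which is also why the output above has one column fewer.

```python
import pandas as pd

# Hypothetical stand-in for the college table, indexed by INSTNM.
colleges = pd.DataFrame(
    {"STABBR": ["TX", "CA", "TX", "NY"],
     "CITY": ["Abilene", "Stanford", "Alvin", "Ithaca"]},
    index=pd.Index(["School A", "School B", "School C", "School D"],
                   name="INSTNM"),
)

# Boolean indexing keeps the INSTNM index and the STABBR column.
by_mask = colleges[colleges["STABBR"] == "TX"]

# Label indexing: STABBR becomes the index, the old INSTNM index is dropped.
by_label = colleges.set_index("STABBR").loc["TX"]

print(by_mask["CITY"].tolist(), by_label["CITY"].tolist())
```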

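15.8.2 Selecting with a sorted index

15.8.2.1 Before sorting
# Check whether the index is already sorted
# (is_monotonic was removed in pandas 2.0; is_monotonic_increasing is the equivalent)
college_data2.index.is_monotonic_increasing
False
15.8.2.2 After sorting
college_data3 = college_data2.sort_index(ascending=True)
15.8.2.3 Speed before and after sorting
%timeit college_data2.loc["TX"]
%timeit college_data3.loc["TX"]
# In this run the lookup on the sorted index was roughly three times faster (179 µs vs 570 µs)
570 µs ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
179 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)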
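
The sortedness check itself is easy to try out. The following sketch uses a made-up Series with duplicate state labels (an assumption for illustration) and shows the flag before and after `sort_index`.

```python
import pandas as pd

# Hypothetical data with an unsorted, non-unique index.
s = pd.Series([1, 2, 3, 4], index=["TX", "CA", "TX", "NY"])

before = s.index.is_monotonic_increasing   # "TX" comes before "CA"
s_sorted = s.sort_index()
after = s_sorted.index.is_monotonic_increasing

# The same label lookup works on both; only the lookup strategy differs.
tx_values = sorted(s_sorted.loc["TX"].tolist())

print(before, after, tx_values)
```

Duplicate labels do not prevent an index from counting as monotonic increasing, since equal neighbours are allowed.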

15.8.3 Selecting with a unique index

15.8.3.1 Checking whether the index is unique
college_data.index.is_unique
True
15.8.3.2 Comparing the speed again
%timeit college_data.loc["Stanford University"]
college_data4 = college_data.sort_index()
%timeit college_data4.loc["Stanford University"]
# With a unique index the lookup takes about the same time whether or not the index is sorted
157 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
159 µs ± 855 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
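
Uniqueness also changes the shape of what `.loc` returns. In the sketch below, built on two made-up frames (names and values are assumptions for illustration), a label from a unique index yields a single row as a Series, while a repeated label yields a DataFrame.

```python
import pandas as pd

# Hypothetical frames: one with a unique index, one with a repeated label.
unique_idx = pd.DataFrame({"CITY": ["Stanford", "Ithaca"]},
                          index=["School B", "School D"])
dup_idx = pd.DataFrame({"CITY": ["Abilene", "Alvin", "Ithaca"]},
                       index=["TX", "TX", "NY"])

row = unique_idx.loc["School B"]   # one match: a Series
rows = dup_idx.loc["TX"]           # several matches: a DataFrame

print(type(row).__name__, type(rows).__name__)
```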

15.8.4 Using the query method for readability

c = "Color"
movies.query("100 <= duration <= 120 and actor_1_facebook_likes > 10000 and color==@c")
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
The Golden Compass | Color | Chris Weitz | 251.0 | 113.0 | ... | 6000.0 | 6.1 | 2.35 | 0
Alice in Wonderland | Color | Tim Burton | 451.0 | 108.0 | ... | 25000.0 | 6.5 | 1.85 | 24000
X-Men: The Last Stand | Color | Brett Ratner | 334.0 | 104.0 | ... | 808.0 | 6.8 | 2.35 | 0
Monsters University | Color | Dan Scanlon | 376.0 | 104.0 | ... | 779.0 | 7.3 | 1.85 | 44000
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Slaughter Rule | Color | Alex Smith | 17.0 | 112.0 | ... | 1000.0 | 6.1 | 2.35 | 183
Now Is Good | Color | Ol Parker | 48.0 | 103.0 | ... | 766.0 | 7.2 | 2.35 | 0
Chasing Amy | Color | Kevin Smith | 147.0 | 113.0 | ... | 1000.0 | 7.3 | 1.85 | 0
The Grace Card | Color | David G. Evans | 25.0 | 101.0 | ... | 21.0 | 6.4 | NaN | 0

543 rows × 27 columns
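
For comparison, the query string above corresponds to a chain of parenthesised masks. The sketch below checks this on a made-up four-row frame (values are assumptions for illustration); `@c` refers to the Python variable `c` from the surrounding scope.

```python
import pandas as pd

# Hypothetical stand-in for the movie table.
demo = pd.DataFrame({
    "duration": [113.0, 178.0, 104.0, 101.0],
    "actor_1_facebook_likes": [16000.0, 1000.0, 26000.0, 11000.0],
    "color": ["Color", "Color", "Color", "Black and White"],
})

c = "Color"
by_query = demo.query(
    "100 <= duration <= 120 and actor_1_facebook_likes > 10000 and color == @c"
)

# The same filter written with explicit boolean masks.
by_mask = demo[
    (demo["duration"] >= 100)
    & (demo["duration"] <= 120)
    & (demo["actor_1_facebook_likes"] > 10000)
    & (demo["color"] == c)
]

print(list(by_query.index))
```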

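# DataFrame.where keeps the original shape; every row that fails the condition is set to NaN
movies.where(movies.duration < 60)
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Avatar | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
Pirates of the Caribbean: At World's End | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
Spectre | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
The Dark Knight Rises | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Following | Color | NaN | 43.0 | 43.0 | ... | 593.0 | 7.5 | 16.0 | 32000.0
A Plague So Pleasant | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
Shanghai Calling | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
My Date with Drew | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN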

4916 rows × 27 columns

15.8.5 Querying a Series with where

ac1_fb_likes = movies["actor_1_facebook_likes"].dropna()
# where returns a Series of the same size; values that fail the condition are replaced with the value of the other parameter, which defaults to NaN
ac1_fb_likes.where(ac1_fb_likes > 300,other=-1).where(ac1_fb_likes < 10000,other=-1)
movie_title
Avatar                                      1000.0
Pirates of the Caribbean: At World's End      -1.0
Spectre                                       -1.0
The Dark Knight Rises                         -1.0
                                             ...  
The Following                                841.0
A Plague So Pleasant                          -1.0
Shanghai Calling                             946.0
My Date with Drew                             -1.0
Name: actor_1_facebook_likes, Length: 4909, dtype: float64
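
The two chained `where` calls can also be written as one call with a combined condition. The sketch below checks that on a made-up Series (values are assumptions for illustration) and shows the default fill when `other` is omitted.

```python
import pandas as pd

# Hypothetical like counts.
likes = pd.Series([1000.0, 40000.0, 841.0, 0.0])

# Two chained calls, as in the text above.
chained = likes.where(likes > 300, other=-1).where(likes < 10000, other=-1)

# One call with both conditions combined.
combined = likes.where((likes > 300) & (likes < 10000), other=-1)

# Without `other`, failing values become NaN.
default_fill = likes.where(likes > 300)

print(chained.tolist())
```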

15.8.6 The mask operation on a DataFrame

# mask does not delete any data; it sets every column of the rows that satisfy the condition to NaN
movies.mask(movies.actor_1_facebook_likes <= 1000)
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Avatar | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0.0
Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000.0
The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Following | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
A Plague So Pleasant | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
Shanghai Calling | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN
My Date with Drew | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN

4916 rows × 27 columns

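# Combining mask with dropna is a flexible way to drop unwanted rows
# how="all" drops a row only when every column is NaN; the default how="any" drops a row as soon as one column is NaN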
movies.mask(movies.actor_1_facebook_likes <= 1000).dropna(how="all")
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0.0
Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000.0
The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000.0
Spider-Man 3 | Color | Sam Raimi | 392.0 | 156.0 | ... | 11000.0 | 6.2 | 2.35 | 0.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
Cheap Thrills | Color | E.L. Katz | 193.0 | 88.0 | ... | 982.0 | 6.8 | 2.35 | 0.0
Happy Christmas | Color | Joe Swanberg | 65.0 | 82.0 | ... | 969.0 | 5.6 | 1.85 | 812.0
Counting | Color | Jem Cohen | 12.0 | 111.0 | ... | NaN | 6.0 | 1.78 | 5.0
Smiling Fish & Goat on Fire | Color | Kevin Jordan | 21.0 | 90.0 | ... | 467.0 | 7.6 | 1.85 | 0.0

1966 rows × 27 columns
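
A small experiment helps relate `mask` to the other tools in this chapter. The sketch below, on a made-up four-row frame (values are assumptions for illustration), checks that `mask(cond)` matches `where(~cond)`, and compares `mask` plus `dropna(how="all")` with plain boolean indexing. One visible difference is that masking turns integer columns into floats, which is why the like counts above print as 0.0 rather than 0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie table.
demo = pd.DataFrame(
    {"actor_1_facebook_likes": [1000.0, 40000.0, np.nan, 11000.0],
     "movie_facebook_likes": [33000, 0, 5, 85000]},
    index=["A", "B", "C", "D"],
)

cond = demo["actor_1_facebook_likes"] <= 1000   # NaN <= 1000 is False

masked = demo.mask(cond)            # rows where cond is True become NaN
same_as_where = demo.where(~cond)   # where keeps rows where ~cond is True

kept = masked.dropna(how="all")     # drop the rows that are entirely NaN
filtered = demo[~cond]              # plain boolean indexing

print(list(kept.index), kept["movie_facebook_likes"].dtype,
      filtered["movie_facebook_likes"].dtype)
```

Row C survives in both results: its like count is missing, so the condition is False and the row is never masked.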

15.8.7 Selecting with booleans, integers and labels

# Select by a boolean condition, this time through loc
condition = movies.actor_1_facebook_likes < 1000
movies.loc[condition]
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0
John Carter | Color | Andrew Stanton | 462.0 | 132.0 | ... | 632.0 | 6.6 | 2.35 | 24000
Tangled | Color | Nathan Greno | 324.0 | 100.0 | ... | 553.0 | 7.8 | 1.85 | 29000
Quantum of Solace | Color | Marc Forster | 403.0 | 106.0 | ... | 412.0 | 6.7 | 2.35 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Following | Color | NaN | 43.0 | 43.0 | ... | 593.0 | 7.5 | 16.00 | 32000
A Plague So Pleasant | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 0.0 | 6.3 | NaN | 16
Shanghai Calling | Color | Daniel Hsia | 14.0 | 100.0 | ... | 719.0 | 6.3 | 2.35 | 660
My Date with Drew | Color | Jon Gunn | 43.0 | 90.0 | ... | 23.0 | 6.6 | 1.85 | 456

2514 rows × 27 columns

# Check that the result matches plain boolean indexing
movies.loc[condition].equals(movies[condition])
True
# Select by position with iloc
# iloc does not accept a boolean Series, so pass its underlying values, a NumPy array of bools
movies.iloc[condition.values]
movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0
John Carter | Color | Andrew Stanton | 462.0 | 132.0 | ... | 632.0 | 6.6 | 2.35 | 24000
Tangled | Color | Nathan Greno | 324.0 | 100.0 | ... | 553.0 | 7.8 | 1.85 | 29000
Quantum of Solace | Color | Marc Forster | 403.0 | 106.0 | ... | 412.0 | 6.7 | 2.35 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Following | Color | NaN | 43.0 | 43.0 | ... | 593.0 | 7.5 | 16.00 | 32000
A Plague So Pleasant | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 0.0 | 6.3 | NaN | 16
Shanghai Calling | Color | Daniel Hsia | 14.0 | 100.0 | ... | 719.0 | 6.3 | 2.35 | 660
My Date with Drew | Color | Jon Gunn | 43.0 | 90.0 | ... | 23.0 | 6.6 | 1.85 | 456

2514 rows × 27 columns

# Use a boolean array to select the columns of a given dtype
movies.loc[:,movies.dtypes == "object"]
# Likewise, iloc needs the underlying array (.values) instead of the Series
movie_title | color | director_name | actor_2_name | genres | ... | movie_imdb_link | language | country | content_rating
Avatar | Color | James Cameron | Joel David Moore | Action|Adventure|Fantasy|Sci-Fi | ... | http://www.imdb.com/title/tt0499549/?ref_=fn_t... | English | USA | PG-13
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | Orlando Bloom | Action|Adventure|Fantasy | ... | http://www.imdb.com/title/tt0449088/?ref_=fn_t... | English | USA | PG-13
Spectre | Color | Sam Mendes | Rory Kinnear | Action|Adventure|Thriller | ... | http://www.imdb.com/title/tt2379713/?ref_=fn_t... | English | UK | PG-13
The Dark Knight Rises | Color | Christopher Nolan | Christian Bale | Action|Thriller | ... | http://www.imdb.com/title/tt1345836/?ref_=fn_t... | English | USA | PG-13
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
The Following | Color | NaN | Valorie Curry | Crime|Drama|Mystery|Thriller | ... | http://www.imdb.com/title/tt2071645/?ref_=fn_t... | English | USA | TV-14
A Plague So Pleasant | Color | Benjamin Roberds | Maxwell Moody | Drama|Horror|Thriller | ... | http://www.imdb.com/title/tt2107644/?ref_=fn_t... | English | USA | NaN
Shanghai Calling | Color | Daniel Hsia | Daniel Henney | Comedy|Drama|Romance | ... | http://www.imdb.com/title/tt2070597/?ref_=fn_t... | English | USA | PG-13
My Date with Drew | Color | Jon Gunn | Brian Herzlinger | Documentary | ... | http://www.imdb.com/title/tt0378407/?ref_=fn_t... | English | USA | PG

4916 rows × 11 columns
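
Column selection by dtype can be tried on a small frame as well. The sketch below uses made-up data whose text columns are explicitly created with object dtype (an assumption, so that the comparison with "object" behaves the same on newer pandas versions that may infer a dedicated string dtype). It compares `loc` with a boolean Series, `iloc` with the underlying array, and `select_dtypes`, which is a built-in alternative not used in the text above.

```python
import pandas as pd

# Hypothetical frame; text columns are forced to object dtype.
demo = pd.DataFrame({
    "color": pd.Series(["Color", "Color"], dtype=object),
    "duration": [178.0, 169.0],
    "director_name": pd.Series(["James Cameron", "Gore Verbinski"],
                               dtype=object),
})

is_object = demo.dtypes == "object"          # boolean Series, one entry per column

by_loc = demo.loc[:, is_object]              # loc accepts the boolean Series
by_iloc = demo.iloc[:, is_object.values]     # iloc needs the plain bool array
by_select = demo.select_dtypes(include=[object])

print(list(by_loc.columns))
```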
