注:本教程为系列教程此章节接前面第一弹
跳转到专题地址
9.选取多个DataFrame
9.1 用列表选取多列
movies[["actor_1_facebook_likes","actor_2_facebook_likes","actor_3_facebook_likes"]]
| actor_1_facebook_likes | actor_2_facebook_likes | actor_3_facebook_likes |
---|
0 | 1000.0 | 936.0 | 855.0 |
1 | 40000.0 | 5000.0 | 1000.0 |
2 | 11000.0 | 393.0 | 161.0 |
3 | 27000.0 | 23000.0 | 23000.0 |
... | ... | ... | ... |
4912 | 841.0 | 593.0 | 319.0 |
4913 | 0.0 | 0.0 | 0.0 |
4914 | 946.0 | 719.0 | 489.0 |
4915 | 86.0 | 23.0 | 16.0 |
4916 rows × 3 columns
9.2 选取单列
movies[["actor_1_facebook_likes"]]
| actor_1_facebook_likes |
---|
0 | 1000.0 |
1 | 40000.0 |
2 | 11000.0 |
3 | 27000.0 |
... | ... |
4912 | 841.0 |
4913 | 0.0 |
4914 | 946.0 |
4915 | 86.0 |
4916 rows × 1 columns
!!!注意在这里要区别下面的这种取法!!!
print(type(movies["actor_1_facebook_likes"]))
print(type(movies[["actor_1_facebook_likes"]]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
9.3 通过类型选择DataFrame
movies.select_dtypes(include=["int","float"])
| num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|
0 | 723.0 | 178.0 | 0.0 | 855.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
1 | 302.0 | 169.0 | 563.0 | 1000.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
2 | 602.0 | 148.0 | 0.0 | 161.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
3 | 813.0 | 164.0 | 22000.0 | 23000.0 | ... | 23000.0 | 8.5 | 2.35 | 164000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4912 | 43.0 | 43.0 | NaN | 319.0 | ... | 593.0 | 7.5 | 16.00 | 32000 |
4913 | 13.0 | 76.0 | 0.0 | 0.0 | ... | 0.0 | 6.3 | NaN | 16 |
4914 | 14.0 | 100.0 | 0.0 | 489.0 | ... | 719.0 | 6.3 | 2.35 | 660 |
4915 | 43.0 | 90.0 | 16.0 | 16.0 | ... | 23.0 | 6.6 | 1.85 | 456 |
4916 rows × 16 columns
9.4 通过过滤器选取DataFrame
9.4.1 like 过滤法
movies.filter(like="actor_1")
| actor_1_facebook_likes | actor_1_name |
---|
0 | 1000.0 | CCH Pounder |
1 | 40000.0 | Johnny Depp |
2 | 11000.0 | Christoph Waltz |
3 | 27000.0 | Tom Hardy |
... | ... | ... |
4912 | 841.0 | Natalie Zea |
4913 | 0.0 | Eva Boehnke |
4914 | 946.0 | Alan Ruck |
4915 | 86.0 | John August |
4916 rows × 2 columns
9.4.2 items 选取
movies.filter(items=["actor_1_name","actor_1_facebook_likes"])
| actor_1_name | actor_1_facebook_likes |
---|
0 | CCH Pounder | 1000.0 |
1 | Johnny Depp | 40000.0 |
2 | Christoph Waltz | 11000.0 |
3 | Tom Hardy | 27000.0 |
... | ... | ... |
4912 | Natalie Zea | 841.0 |
4913 | Eva Boehnke | 0.0 |
4914 | Alan Ruck | 946.0 |
4915 | John August | 86.0 |
4916 rows × 2 columns
9.4.3 正则选取
movies.filter(regex="^(?!actor)")
| color | director_name | num_critic_for_reviews | duration | ... | title_year | imdb_score | aspect_ratio | movie_facebook_likes |
---|
0 | Color | James Cameron | 723.0 | 178.0 | ... | 2009.0 | 7.9 | 1.78 | 33000 |
1 | Color | Gore Verbinski | 302.0 | 169.0 | ... | 2007.0 | 7.1 | 2.35 | 0 |
2 | Color | Sam Mendes | 602.0 | 148.0 | ... | 2015.0 | 6.8 | 2.35 | 85000 |
3 | Color | Christopher Nolan | 813.0 | 164.0 | ... | 2012.0 | 8.5 | 2.35 | 164000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4912 | Color | NaN | 43.0 | 43.0 | ... | NaN | 7.5 | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 2013.0 | 6.3 | NaN | 16 |
4914 | Color | Daniel Hsia | 14.0 | 100.0 | ... | 2012.0 | 6.3 | 2.35 | 660 |
4915 | Color | Jon Gunn | 43.0 | 90.0 | ... | 2004.0 | 6.6 | 1.85 | 456 |
4916 rows × 22 columns
10.按照给定的列顺序排序
disc_core = ['movie_title','title_year', 'content_rating','genres']
disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
cont_finance = ['budget','gross']
cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
cont_other = ['imdb_score','duration', 'aspect_ratio','facenumber_in_poster']
new_col_order = disc_core + disc_people + disc_other + cont_fb + cont_finance + cont_num_reviews + cont_other
print(set(new_col_order) == set(movies.columns))
movies[new_col_order]
True
| movie_title | title_year | content_rating | genres | ... | imdb_score | duration | aspect_ratio | facenumber_in_poster |
---|
0 | Avatar | 2009.0 | PG-13 | Action|Adventure|Fantasy|Sci-Fi | ... | 7.9 | 178.0 | 1.78 | 0.0 |
1 | Pirates of the Caribbean: At World's End | 2007.0 | PG-13 | Action|Adventure|Fantasy | ... | 7.1 | 169.0 | 2.35 | 0.0 |
2 | Spectre | 2015.0 | PG-13 | Action|Adventure|Thriller | ... | 6.8 | 148.0 | 2.35 | 1.0 |
3 | The Dark Knight Rises | 2012.0 | PG-13 | Action|Thriller | ... | 8.5 | 164.0 | 2.35 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4912 | The Following | NaN | TV-14 | Crime|Drama|Mystery|Thriller | ... | 7.5 | 43.0 | 16.00 | 1.0 |
4913 | A Plague So Pleasant | 2013.0 | NaN | Drama|Horror|Thriller | ... | 6.3 | 76.0 | NaN | 0.0 |
4914 | Shanghai Calling | 2012.0 | PG-13 | Comedy|Drama|Romance | ... | 6.3 | 100.0 | 2.35 | 5.0 |
4915 | My Date with Drew | 2004.0 | PG | Documentary | ... | 6.6 | 90.0 | 1.85 | 0.0 |
4916 rows × 28 columns
11 操作整个DataFrame
11.1 获取DataFrame的行和列数
movies.shape
(4916, 28)
11.2 获取DataFrame的个数
movies.size
137648
11.3 获取DataFrame的长度
len(movies)
4916
11.4 获取DataFrame的维度
movies.ndim
2
11.5 统计DataFrame中各种类型的个数
movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
...
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
11.6 取DataFrame中各列的最小值
movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
...
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
11.7 取DataFrame中各列的最大值
movies.max()
num_critic_for_reviews 813.0
duration 511.0
director_facebook_likes 23000.0
actor_3_facebook_likes 23000.0
...
actor_2_facebook_likes 137000.0
imdb_score 9.5
aspect_ratio 16.0
movie_facebook_likes 349000.0
Length: 16, dtype: float64
11.8 取DataFrame中各列的平均值
movies.mean()
num_critic_for_reviews 137.988905
duration 107.090798
director_facebook_likes 691.014541
actor_3_facebook_likes 631.276313
...
actor_2_facebook_likes 1621.923516
imdb_score 6.437429
aspect_ratio 2.222349
movie_facebook_likes 7348.294142
Length: 16, dtype: float64
11.9 取DataFrame中各列的中位数
movies.median()
num_critic_for_reviews 108.00
duration 103.00
director_facebook_likes 48.00
actor_3_facebook_likes 366.00
...
actor_2_facebook_likes 593.00
imdb_score 6.60
aspect_ratio 2.35
movie_facebook_likes 159.00
Length: 16, dtype: float64
11.10 取DataFrame中各列的分位数
movies.quantile([0.5,0.3])
| num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|
0.5 | 108.0 | 103.0 | 48.0 | 366.0 | ... | 593.0 | 6.6 | 2.35 | 159.0 |
0.3 | 60.0 | 95.0 | 11.0 | 176.0 | ... | 345.0 | 6.0 | 1.85 | 0.0 |
2 rows × 16 columns
11.11 取DataFrame中各列的标准差
movies.std()
num_critic_for_reviews 120.239379
duration 25.286015
director_facebook_likes 2832.954125
actor_3_facebook_likes 1625.874802
...
actor_2_facebook_likes 4011.299523
imdb_score 1.127802
aspect_ratio 1.402940
movie_facebook_likes 19206.016458
Length: 16, dtype: float64
11.12 取DataFrame中各列的和
movies.sum()
num_critic_for_reviews 671592.00
duration 524852.00
director_facebook_likes 3326544.00
actor_3_facebook_likes 3088835.00
...
actor_2_facebook_likes 7952291.00
imdb_score 31646.40
aspect_ratio 10200.58
movie_facebook_likes 36124214.00
Length: 16, dtype: float64
补充说明:上面的统计函数可以传入skip = bool 选择是否跳过空值
11.13 获取DataFrame中各列的统计信息
movies.describe()
| num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|
count | 4867.000000 | 4901.000000 | 4814.000000 | 4893.000000 | ... | 4903.000000 | 4916.000000 | 4590.000000 | 4916.000000 |
mean | 137.988905 | 107.090798 | 691.014541 | 631.276313 | ... | 1621.923516 | 6.437429 | 2.222349 | 7348.294142 |
std | 120.239379 | 25.286015 | 2832.954125 | 1625.874802 | ... | 4011.299523 | 1.127802 | 1.402940 | 19206.016458 |
min | 1.000000 | 7.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.600000 | 1.180000 | 0.000000 |
25% | 49.000000 | 93.000000 | 7.000000 | 132.000000 | ... | 277.000000 | 5.800000 | 1.850000 | 0.000000 |
50% | 108.000000 | 103.000000 | 48.000000 | 366.000000 | ... | 593.000000 | 6.600000 | 2.350000 | 159.000000 |
75% | 191.000000 | 118.000000 | 189.750000 | 633.000000 | ... | 912.000000 | 7.200000 | 2.350000 | 2000.000000 |
max | 813.000000 | 511.000000 | 23000.000000 | 23000.000000 | ... | 137000.000000 | 9.500000 | 16.000000 | 349000.000000 |
8 rows × 16 columns
11.14 判断DataFrame中的空值
movies.isnull()
| color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|
0 | False | False | False | False | ... | False | False | False | False |
1 | False | False | False | False | ... | False | False | False | False |
2 | False | False | False | False | ... | False | False | False | False |
3 | False | False | False | False | ... | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4912 | False | True | False | False | ... | False | False | False | False |
4913 | False | False | False | False | ... | False | False | True | False |
4914 | False | False | False | False | ... | False | False | False | False |
4915 | False | False | False | False | ... | False | False | False | False |
4916 rows × 28 columns
11.15 判断DataFrame中的非空值
movies.notnull()
| color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|
0 | True | True | True | True | ... | True | True | True | True |
1 | True | True | True | True | ... | True | True | True | True |
2 | True | True | True | True | ... | True | True | True | True |
3 | True | True | True | True | ... | True | True | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4912 | True | False | True | True | ... | True | True | True | True |
4913 | True | True | True | True | ... | True | True | False | True |
4914 | True | True | True | True | ... | True | True | True | True |
4915 | True | True | True | True | ... | True | True | True | True |
4916 rows × 28 columns
11.16 判断DataFrame中的值是否全为真
movies.notnull().all()
color False
director_name False
num_critic_for_reviews False
duration False
...
actor_2_facebook_likes False
imdb_score True
aspect_ratio False
movie_facebook_likes True
Length: 28, dtype: bool
11.17 在进行对有空值的列进行统计时,填充空值后才能统计
movies[["director_name"]].fillna("").max()
director_name Étienne Faure
dtype: object
12.DataFrame算数运算
12.1准备数据
matrix = pd.DataFrame([np.random.randint(0,10,3) for i in range(3)])
matrix
12.2 加法运算
12.2.1 操作符方式
matrix + 5
12.2.2 函数方式
matrix.add(5)
12.3 减法运算
12.3.1 操作符方法
matrix - 5
12.3.2 函数方法
matrix.sub(5)
12.4 乘法运算
12.4.1 操作符方法
matrix * 5
12.4.2 函数方法
matrix.mul(5)
12.5 除法运算
12.5.1 操作符方法
matrix / 5
| 0 | 1 | 2 |
---|
0 | 0.4 | 0.0 | 1.0 |
1 | 1.6 | 0.0 | 1.2 |
2 | 1.2 | 1.0 | 1.8 |
12.5.2 函数方法
matrix.div(5)
| 0 | 1 | 2 |
---|
0 | 0.4 | 0.0 | 1.0 |
1 | 1.6 | 0.0 | 1.2 |
2 | 1.2 | 1.0 | 1.8 |
12.6 整除运算
12.6.1 操作符方法
matrix // 5
12.6.2 函数方法
matrix.floordiv(5)
12.7 取模运算
12.7.1 操作符方法
matrix % 5
12.7.2 函数方法
matrix.mod(5)
13.DataFrame比较运算
13.1 大于
13.1.1 算数方法
matrix > 5
| 0 | 1 | 2 |
---|
0 | False | False | False |
1 | True | False | True |
2 | True | False | True |
13.1.2 函数方法
matrix.gt(5)
| 0 | 1 | 2 |
---|
0 | False | False | False |
1 | True | False | True |
2 | True | False | True |
13.2 大于等于
13.2.1 算数方法
matrix >= 5
| 0 | 1 | 2 |
---|
0 | False | False | True |
1 | True | False | True |
2 | True | True | True |
13.2.2 函数方法
matrix.ge(5)
| 0 | 1 | 2 |
---|
0 | False | False | True |
1 | True | False | True |
2 | True | True | True |
13.3 小于
13.3.1 算数方法
matrix < 5
| 0 | 1 | 2 |
---|
0 | True | True | False |
1 | False | True | False |
2 | False | False | False |
13.3.2 函数方法
matrix.lt(5)
| 0 | 1 | 2 |
---|
0 | True | True | False |
1 | False | True | False |
2 | False | False | False |
13.4 小于等于
13.4.1 算数方法
matrix <= 5
| 0 | 1 | 2 |
---|
0 | True | True | True |
1 | False | True | False |
2 | False | True | False |
13.4.2 函数方法
matrix.le(5)
| 0 | 1 | 2 |
---|
0 | True | True | True |
1 | False | True | False |
2 | False | True | False |
13.5 等于
13.5.1 算数方法
matrix == 5
| 0 | 1 | 2 |
---|
0 | False | False | True |
1 | False | False | False |
2 | False | True | False |
13.5.2 函数方法
matrix.eq(5)
| 0 | 1 | 2 |
---|
0 | False | False | True |
1 | False | False | False |
2 | False | True | False |
13.6 不等于
13.6.1 算数方法
matrix != 5
| 0 | 1 | 2 |
---|
0 | True | True | False |
1 | True | True | True |
2 | True | False | True |
13.6.2 函数方法
matrix.ne(5)
| 0 | 1 | 2 |
---|
0 | True | True | False |
1 | True | True | True |
2 | True | False | True |