Pandas CookBook -- 02DataFrame基础操作

Pandas基础操作

简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option('max_columns',5 , 'max_rows', 5)

1 选取多个DataFrame列

1.1 用列表选取多个列

movie = pd.read_csv('data/movie.csv')
cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director
actor_1_nameactor_2_nameactor_3_namedirector_name
0CCH PounderJoel David MooreWes StudiJames Cameron
1Johnny DeppOrlando BloomJack DavenportGore Verbinski
...............
4914Alan RuckDaniel HenneyEliza CoupeDaniel Hsia
4915John AugustBrian HerzlingerJon GunnJon Gunn

4916 rows × 4 columns

1.2 使用select_dtypes选取类型

select_dtypes(include=None, exclude=None)

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns,See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
movie.shape
(4916, 28)

1.2.1 选取整数列

movie.select_dtypes(include=['int']).head()
num_voted_userscast_total_facebook_likesmovie_facebook_likes
0886204483433000
1471220483500
22758681170085000
31144337106759164000
481430

1.2.2 选取非整数列

movie.select_dtypes(exclude=['int']).head()
colordirector_name...imdb_scoreaspect_ratio
0ColorJames Cameron...7.91.78
1ColorGore Verbinski...7.12.35
2ColorSam Mendes...6.82.35
3ColorChristopher Nolan...8.52.35
4NaNDoug Walker...7.1NaN

5 rows × 25 columns

1.2.3 通过filter函数过滤选取多列

filter(items=None, like=None, regex=None, axis=None)

  • items : list-like
    • List of info axis to restrict to (must not all be present)
    • 传递个列名或行名列表
  • like : string
    • Keep info axis where “arg in col == True”
    • 类似Python里面字符串的find()函数,col.find(arg)
  • regex : string (regular expression)
    • Keep info axis with re.search(regex, col) == True

通过filter()函数过滤选取多列

movie.filter(like='facebook').head()
director_facebook_likesactor_3_facebook_likes...actor_2_facebook_likesmovie_facebook_likes
00.0855.0...936.033000
1563.01000.0...5000.00
20.0161.0...393.085000
322000.023000.0...23000.0164000
4131.0NaN...12.00

5 rows × 6 columns

通过正则表达式选取多列

movie.filter(regex='\d').head()
actor_3_facebook_likesactor_2_name...actor_3_nameactor_2_facebook_likes
0855.0Joel David Moore...Wes Studi936.0
11000.0Orlando Bloom...Jack Davenport5000.0
2161.0Rory Kinnear...Stephanie Sigman393.0
323000.0Christian Bale...Joseph Gordon-Levitt23000.0
4NaNRob Walker...NaN12.0

5 rows × 6 columns

filter()函数,传递列表到参数items,选取多列

movie.filter(items=['actor_1_name', 'actor_3_name']).head()
actor_1_nameactor_3_name
0CCH PounderWes Studi
1Johnny DeppJack Davenport
2Christoph WaltzStephanie Sigman
3Tom HardyJoseph Gordon-Levitt
4Doug WalkerNaN

2 DataFrame上操作

2.1 基本方法

数据的个数 数据集的维度 数据集的长度

 movie.shape,movie.size,movie.ndim
((4916, 28), 137648, 2)

各个列的非空值的个数

movie.count()
color                   4897
director_name           4814
                        ... 
aspect_ratio            4590
movie_facebook_likes    4916
Length: 28, dtype: int64

2.2 统计信息

movie.shape
(4916, 28)

2.2.1 最大 最小值

2.2.1.1 数值类型
# min max quantile
movie_min = movie.min()
movie_min.name = '最小值'
movie_min
num_critic_for_reviews    1.00
duration                  7.00
                          ... 
aspect_ratio              1.18
movie_facebook_likes      0.00
Name: 最小值, Length: 16, dtype: float64

计算是默认会跳过缺失值的,可设置skipna=False使其包含缺失,但这样不具有意义

movie.min(skipna=False)
num_critic_for_reviews    NaN
duration                  NaN
                         ... 
aspect_ratio              NaN
movie_facebook_likes      0.0
Length: 16, dtype: float64
2.2.1.2 字符串类型

当字符串类型的列包含缺失值时,聚合方法min、max、sum,不会返回任何值。

movie[['color', 'movie_title', 'color']].max()
Series([], dtype: float64)

要让pandas强行返回每列的值,必须填入缺失值。下面填入的是空字符串

movie[['color', 'movie_title', 'color']].fillna('').max()
color             Color
movie_title    Æon Flux
color             Color
dtype: object

2.2.2 统计信息

2.2.2.1 数值型

使用percentiles参数指定分位数

movie.describe(percentiles=[.01, .3, .99])
num_critic_for_reviewsduration...aspect_ratiomovie_facebook_likes
count4867.0000004901.000000...4590.0000004916.000000
mean137.988905107.090798...2.2223497348.294142
..................
99%546.680000189.000000...4.00000093850.000000
max813.000000511.000000...16.000000349000.000000

9 rows × 16 columns

2.2.2.2 字符串型
movie.select_dtypes(include='object').describe()
colordirector_name...countrycontent_rating
count48974814...49114616
unique22397...6518
topColorSteven Spielberg...USAR
freq469326...37102067

4 rows × 12 columns

2.3 方法的组合

使用isnull方法将每个值转变为布尔值

movie.isnull().head()
colordirector_name...aspect_ratiomovie_facebook_likes
0FalseFalse...FalseFalse
1FalseFalse...FalseFalse
2FalseFalse...FalseFalse
3FalseFalse...FalseFalse
4TrueFalse...TrueFalse

5 rows × 28 columns

sum统计布尔值,返回的是Series

movie.isnull().sum().head()
color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64

对这个Series再使用sum,返回整个DataFrame的缺失值的个数,返回值是个标量

movie.isnull().sum().sum()
2654

判断整个DataFrame有没有缺失值,方法是连着使用两个any

movie.isnull().any().any()
True

2.4 运算符

行索引名设为INSTNM,用UGDS_过滤出本科生的种族比例

college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_
UGDS_WHITEUGDS_BLACK...UGDS_NRAUGDS_UNKN
INSTNM
Alabama A & M University0.03330.9353...0.00590.0138
University of Alabama at Birmingham0.59220.2600...0.01790.0100
..................
Bay Area Medical Academy - San Jose Satellite LocationNaNNaN...NaNNaN
Excel Learning Center-San Antonio SouthNaNNaN...NaNNaN

7535 rows × 9 columns

college_ugds_的数值类型都是float,可以进行整数运算

college_ugds_.dtypes
UGDS_WHITE    float64
UGDS_BLACK    float64
               ...   
UGDS_NRA      float64
UGDS_UNKN     float64
Length: 9, dtype: object

2.4.1 加减乘除

college_ugds_.head() + .00501
UGDS_WHITEUGDS_BLACK...UGDS_NRAUGDS_UNKN
INSTNM
Alabama A & M University0.038310.94031...0.010910.01881
University of Alabama at Birmingham0.597210.26501...0.022910.01501
Amridge University0.304010.42421...0.005010.27651
University of Alabama in Huntsville0.703810.13051...0.038210.04001
Alabama State University0.020810.92581...0.029310.01871

5 rows × 9 columns

2.4.2 计算样例数据的百分比

2.4.2.1 方式一
college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
UGDS_WHITEUGDS_BLACK...UGDS_NRAUGDS_UNKN
INSTNM
Alabama A & M University0.030.94...0.010.01
University of Alabama at Birmingham0.590.26...0.020.01
Amridge University0.300.42...0.000.27
University of Alabama in Huntsville0.700.13...0.030.04
Alabama State University0.020.92...0.020.01

5 rows × 9 columns

2.4.2.2 方式二
college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
UGDS_WHITEUGDS_BLACK...UGDS_NRAUGDS_UNKN
INSTNM
Alabama A & M University0.030.94...0.010.01
University of Alabama at Birmingham0.590.26...0.020.01
Amridge University0.300.42...0.000.27
University of Alabama in Huntsville0.700.13...0.030.04
Alabama State University0.020.92...0.020.01

5 rows × 9 columns

2.4.2.3 方式三
college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
UGDS_WHITEUGDS_BLACK...UGDS_NRAUGDS_UNKN
INSTNM
Alabama A & M University0.030.94...0.010.01
University of Alabama at Birmingham0.590.26...0.020.01
Amridge University0.300.42...0.000.27
University of Alabama in Huntsville0.700.13...0.030.04
Alabama State University0.020.92...0.020.01

5 rows × 9 columns

3 比较缺失值

Pandas使用NumPy NaN(np.nan)对象表示缺失值。这是一个不等于自身的特殊对象:

np.nan == np.nan
False

所有和np.nan的比较都返回False,除了不等于:

5 > np.nan
False
5 != np.nan
True

无法通过直接比较比较,含有缺失值的df是否一致

movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654

比较两个DataFrame最直接的方法是使用equals()方法

from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)

转载于:https://www.cnblogs.com/shiyushiyu/p/9734621.html

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值