pandas courses 2

Data Types and Missing Values

Data Types

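The data type for a column in a DataFrame or a Series is known as the dtype. You can use the dtype property to get the type of a specific column. For instance, the dtype of the price column in the reviews DataFrame: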

reviews.price.dtype

Alternatively, the dtypes property returns the dtype of every column in the DataFrame:

reviews.dtypes

Columns consisting entirely of strings do not get their own type; they are instead given the object type.

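A column of one type can be converted into another with the astype() method, wherever such a conversion makes sense. For example, we may transform the points column from its existing int64 data type into a float64 data type: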

reviews.points.astype('float64')
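
The dtype checks and the conversion above can be tried on a small stand-in table. This is only a sketch: the reviews DataFrame built here is hypothetical and its values are made up, since the real dataset is not included in these notes.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (values are made up).
reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [15.0, 20.0, 14.5],
})

# dtype of a single column
price_dtype = reviews.price.dtype
print(price_dtype)  # float64

# dtype of every column at once
print(reviews.dtypes)

# astype() returns a new Series; the original column is not modified
points_as_float = reviews.points.astype("float64")
print(points_as_float.dtype)  # float64
```

Because astype() does not change reviews in place, the result has to be assigned back (for example to reviews["points"]) if the converted column should be kept.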

Missing Values

Entries missing values are given the value NaN, short for “Not a Number”. For technical reasons these NaN values are always of the float64 dtype.

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used thusly:

reviews[pd.isnull(reviews.country)]
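
A minimal sketch of this kind of selection, again using a hypothetical reviews table with made-up values. It also shows the float64 point from above: an integer column that contains a missing entry ends up stored as float64.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (values are made up).
reviews = pd.DataFrame({
    "country": ["Italy", np.nan, "US", None],
    "points": [87, 88, 90, 91],
})

# Rows where country is missing (both NaN and None count as missing)
missing_country = reviews[pd.isnull(reviews.country)]

# Rows where country is present
has_country = reviews[pd.notnull(reviews.country)]

# Number of missing entries in the column
n_missing = reviews.country.isnull().sum()
print(n_missing)  # 2

# Integers plus a missing value are stored as float64
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)  # float64
```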

Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":

reviews.region_2.fillna("Unknown")
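
The sketch below tries a few fill strategies on hypothetical data. Only the constant fill comes from the notes above; filling with the column mean and forward-filling with ffill() are extra illustrations, not something the course text prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical columns (values are made up).
region_2 = pd.Series(["Napa", np.nan, "Sonoma", np.nan], name="region_2")
price = pd.Series([10.0, np.nan, 30.0], name="price")

# Strategy 1: replace every NaN with a constant
filled_constant = region_2.fillna("Unknown")

# Strategy 2: replace NaN with the column mean (mean() skips NaN by default)
filled_mean = price.fillna(price.mean())

# Strategy 3: carry the last valid value forward
filled_forward = region_2.ffill()

print(filled_constant.tolist())  # ['Napa', 'Unknown', 'Sonoma', 'Unknown']
print(filled_mean.tolist())      # [10.0, 20.0, 30.0]
print(filled_forward.tolist())   # ['Napa', 'Napa', 'Sonoma', 'Sonoma']
```

Like astype(), fillna() returns a new object, so region_2 itself still contains its NaN entries afterwards.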

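Sometimes the value to replace is not missing at all. For example, to change every occurrence of the Twitter handle @kerinokeefe to @kerino, one option is the replace() method: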

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

The replace() method is worth mentioning here because it’s handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
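
As a sketch of the sentinel case, the hypothetical column below mixes real handles with placeholder strings. Converting the placeholders to NaN is one possible cleanup, shown here as an assumption rather than as the course's own recipe; it lets isnull() and fillna() treat them like any other missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column (values are made up apart from the handle in the text).
handles = pd.Series(["@kerinokeefe", "Unknown", "@vossroger", "Undisclosed"])

# Replace one ordinary value with another
renamed = handles.replace("@kerinokeefe", "@kerino")

# Turn sentinel strings into real missing values
cleaned = handles.replace(["Unknown", "Undisclosed"], np.nan)

print(renamed.tolist()[0])         # @kerino
print(cleaned.isnull().tolist())   # [False, True, False, True]
```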

Renaming and Combining

Renaming

The first function we’ll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:

reviews.rename(columns={'points': 'score'})

rename() can also rename index values, by passing an index argument instead of columns:

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names. For example:

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
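
The three renaming calls can be compared side by side on a small hypothetical reviews table (values are made up). The sketch also checks that none of them modifies the original DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame (values are made up).
reviews = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "points": [87, 88],
})

# Rename a column
renamed_cols = reviews.rename(columns={"points": "score"})

# Rename individual index labels
renamed_rows = reviews.rename(index={0: "firstEntry", 1: "secondEntry"})

# Name the row axis and the column axis themselves
named_axes = (
    reviews.rename_axis("wines", axis="rows")
           .rename_axis("fields", axis="columns")
)

print(list(renamed_cols.columns))  # ['country', 'score']
print(list(renamed_rows.index))    # ['firstEntry', 'secondEntry']
print(named_axes.index.name, named_axes.columns.name)  # wines fields
print(list(reviews.columns))       # ['country', 'points']
```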

Combining

The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])
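
Since the CSV files are not part of these notes, here is a sketch of the same call on two tiny hypothetical frames with made-up values. By default concat() stacks rows and keeps each frame's original index labels; passing ignore_index=True, an optional extra not used in the text above, renumbers the rows instead.

```python
import pandas as pd

# Hypothetical stand-ins for the two YouTube tables (values are made up).
canadian = pd.DataFrame({"title": ["A", "B"], "views": [100, 200]})
british = pd.DataFrame({"title": ["C"], "views": [300]})

# Stack the rows; original index labels are kept, so 0 appears twice
stacked = pd.concat([canadian, british])
print(list(stacked.index))  # [0, 1, 0]

# Renumber the rows from 0
renumbered = pd.concat([canadian, british], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
```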

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this weren’t true (because, say, we’d renamed them beforehand) we wouldn’t need them.
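
A sketch of the join on two tiny hypothetical frames (values are made up). One detail worth seeing: join() performs a left join by default, so rows of the left frame with no match on the right are kept, with NaN in the right-hand columns. Passing how="inner", an addition not used in the text above, keeps only the rows present in both.

```python
import pandas as pd

# Hypothetical stand-ins for the two YouTube tables (values are made up).
canadian = pd.DataFrame({
    "title": ["A", "B"],
    "trending_date": ["18.01.01", "18.01.01"],
    "views": [100, 200],
})
british = pd.DataFrame({
    "title": ["A", "C"],
    "trending_date": ["18.01.01", "18.01.01"],
    "views": [150, 300],
})

left = canadian.set_index(["title", "trending_date"])
right = british.set_index(["title", "trending_date"])

# Default left join: video B has no UK match, so views_UK is NaN for it
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")
print(list(joined.columns))  # ['views_CAN', 'views_UK']

# Inner join: only videos trending on the same day in both tables
both = left.join(right, lsuffix="_CAN", rsuffix="_UK", how="inner")
print(len(both))  # 1
```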
