Data Types and Missing Values
Data Types
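The data type for a column in a DataFrame or a Series is known as the dtype. The dtype property returns the type of a specific column, for instance the price column of the reviews DataFrame: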
reviews.price.dtype
The dtypes property returns the dtype of every column in the DataFrame:
reviews.dtypes
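Columns consisting entirely of strings do not get their own type; they are instead given the object type.
The astype() method converts a column from one type to another wherever such a conversion makes sense. For example, we may transform the points column from its existing int64 data type into a float64 data type: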
reviews.points.astype('float64')
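The three calls above can be tried on a small stand-in table. This is a minimal sketch: the reviews DataFrame built here is hypothetical sample data, not the real wine reviews dataset, and the exact dtype reported for string columns may depend on the pandas version installed.

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame used in these notes.
reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [None, 15.0, 14.0],
})

print(reviews.price.dtype)  # dtype of a single column
print(reviews.dtypes)       # dtype of every column

# astype() returns a new Series; the reviews DataFrame itself is unchanged.
points_as_float = reviews.points.astype("float64")
print(points_as_float.dtype)
```

Because astype() returns a new object, the result has to be assigned back (for example to reviews["points"]) if the conversion should persist.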
Missing Values
Entries with missing values are given the value NaN, short for “Not a Number”. For technical reasons, these NaN values are always of the float64 dtype.
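One visible consequence of NaN being a float is that an otherwise integer column is stored as float64 as soon as it contains a missing entry. The sketch below illustrates this with two made-up Series, assuming the default NumPy-backed dtypes.

```python
import numpy as np
import pandas as pd

# Made-up data: the same numbers with and without a missing entry.
whole = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, np.nan])

print(whole.dtype)     # an integer dtype
print(with_gap.dtype)  # float64, because NaN is itself a float
```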
Pandas provides some methods specific to missing data. To select NaN entries, use pd.isnull() (or its companion, pd.notnull()). It is used like this:
reviews[pd.isnull(reviews.country)]
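As a rough illustration of the selection above, the following sketch builds a tiny hypothetical reviews table with two missing countries and filters it with both functions. The column values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing country entries.
reviews = pd.DataFrame({
    "country": ["Italy", np.nan, "US", None],
    "points": [87, 88, 90, 91],
})

# pd.isnull() returns a boolean Series, which is then used as a row mask.
missing_country = reviews[pd.isnull(reviews.country)]
has_country = reviews[pd.notnull(reviews.country)]

print(missing_country)
print(len(missing_country), len(has_country))
```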
Replacing missing values is a common operation. Pandas provides a handy method for this: fillna(). fillna() offers a few different strategies for handling missing data. For example, we can simply replace each NaN with "Unknown":
reviews.region_2.fillna("Unknown")
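A small sketch of the same call on invented data may help. The region names below are placeholders, and the Series stands in for the region_2 column.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the region_2 column.
region_2 = pd.Series(
    ["Willamette Valley", np.nan, "Sonoma", np.nan], name="region_2"
)

# fillna() returns a new Series; region_2 still contains its NaN entries.
filled = region_2.fillna("Unknown")

print(filled.tolist())
print(region_2.isnull().sum())
```

As with astype(), the filled result must be assigned back to the DataFrame column if the change should be kept.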
A non-null value can be replaced as well. One way to do this is the replace() method; for example, to change the Twitter handle @kerinokeefe to @kerino:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
The replace() method is worth mentioning here because it’s handy for replacing missing data that has been given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
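Both uses of replace() can be sketched on a short, made-up Series of handles. Turning the sentinel strings into NaN is one possible cleanup choice assumed here for illustration; it lets the missing-data tools such as pd.isnull() and fillna() see those entries.

```python
import numpy as np
import pandas as pd

# Made-up handles, including two sentinel strings that really mean "missing".
handles = pd.Series(["@kerinokeefe", "@vossroger", "Unknown", "Undisclosed"])

# Replace one specific value with another.
renamed = handles.replace("@kerinokeefe", "@kerino")

# Replace several sentinel values with NaN in a single call.
cleaned = handles.replace(["Unknown", "Undisclosed", "Invalid"], np.nan)

print(renamed.tolist())
print(cleaned.isnull().tolist())
```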
Renaming and Combining
Renaming
The first function we’ll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:
reviews.rename(columns={'points': 'score'})
rename() can also change index values when given the index keyword parameter instead of columns:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
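The two rename() calls can be checked on a two-row table. This is only a sketch, and the reviews DataFrame below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the reviews DataFrame.
reviews = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 88]})

# Rename a column; labels not listed in the mapping are left alone.
by_column = reviews.rename(columns={"points": "score"})

# Rename individual row labels.
by_index = reviews.rename(index={0: "firstEntry", 1: "secondEntry"})

print(list(by_column.columns))
print(list(by_index.index))
```

Both calls return new DataFrames, so reviews keeps its original labels unless the result is assigned back.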
Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names. For example:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
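To see what rename_axis() actually changes, the sketch below applies it to a small hypothetical table and then reads back the name attribute of each axis. It spells the row axis as "index"; the data values are invented.

```python
import pandas as pd

# Hypothetical two-row stand-in for the reviews DataFrame.
reviews = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 88]})

# Name the row axis "wines" and the column axis "fields".
labeled = reviews.rename_axis("wines", axis="index").rename_axis(
    "fields", axis="columns"
)

print(labeled.index.name)    # name of the row axis
print(labeled.columns.name)  # name of the column axis
```

Only the axis names change; the row labels and column labels themselves stay the same.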
Combining
The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis. For example, two DataFrames with the same columns can be stacked row-wise:
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
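Since the CSV files may not be at hand, here is a minimal sketch with two tiny invented tables that share the same columns. It also shows that concat() keeps each input's original row labels by default, and that passing ignore_index=True renumbers them.

```python
import pandas as pd

# Invented stand-ins for the Canadian and British video tables.
canadian = pd.DataFrame({"title": ["A", "B"], "views": [10, 20]})
british = pd.DataFrame({"title": ["C"], "views": [30]})

# Stack the rows; the original row labels are kept, so 0 appears twice.
stacked = pd.concat([canadian, british])

# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([canadian, british], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```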
The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both the British and Canadian datasets. If this weren’t true (because, say, we’d renamed them beforehand), we wouldn’t need them.
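The join can be sketched on two small invented tables that share one (title, trending_date) pair. One detail worth seeing in the output: join() performs a left join by default, so rows that exist only in the left table are kept with NaN in the right-hand columns. Passing how="inner" is one way to keep only the rows present in both tables.

```python
import pandas as pd

# Invented stand-ins for the Canadian and British video tables.
canadian = pd.DataFrame({
    "title": ["A", "B"],
    "trending_date": ["18.01.01", "18.01.01"],
    "views": [10, 20],
})
british = pd.DataFrame({
    "title": ["B", "C"],
    "trending_date": ["18.01.01", "18.01.01"],
    "views": [25, 30],
})

left = canadian.set_index(["title", "trending_date"])
right = british.set_index(["title", "trending_date"])

# Default left join: every row of left is kept; the suffixes
# disambiguate the shared "views" column.
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

# Inner join: only index entries found in both tables survive.
both = left.join(right, lsuffix="_CAN", rsuffix="_UK", how="inner")

print(joined)
print(both)
```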