1. Theory
1. Use the dtype property to check a column's data type (dtypes lists the type of every column)
reviews.price.dtype
dtype('float64')
reviews.dtypes
country object
description object
...
variety object
winery object
Length: 13, dtype: object
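A minimal sketch of the same check on a small, made-up DataFrame. The name toy and its values are assumptions for illustration, not part of the reviews dataset. Both dtype and dtypes are attributes, so they are read without parentheses.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews DataFrame.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [float("nan"), 15.0, 14.0],
})

price_dtype = toy.price.dtype   # dtype of a single column
all_dtypes = toy.dtypes         # Series mapping each column name to its dtype

print(price_dtype)
print(all_dtypes)
```

A column that holds a missing value alongside numbers is stored as floating point, because NaN is itself a float.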
2. Use the astype function to convert a column to another data type
reviews.points.astype('float64')
0 87.0
1 87.0
...
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
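One point the output above does not show is that astype returns a new Series and leaves the original column untouched. The sketch below illustrates this on a made-up column (toy is a hypothetical name).

```python
import pandas as pd

# Hypothetical toy data; the real reviews.points column is also integer-valued.
toy = pd.DataFrame({"points": [87, 87, 90]})

points_float = toy.points.astype("float64")  # new Series with converted values

print(points_float.dtype)   # dtype of the converted copy
print(toy.points.dtype)     # the original column keeps its integer dtype
```

To keep the converted values, assign the result back to a column, for example toy["points"] = points_float.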
3. Use the isnull function to find rows with missing values
reviews[pd.isnull(reviews.country)]
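The selection above works in two steps: pd.isnull builds a boolean mask, and indexing the DataFrame with that mask keeps only the rows where it is True. A small sketch on made-up data (toy and its values are assumptions), with pd.notnull shown as the complementary check:

```python
import pandas as pd

# Hypothetical toy data with one missing country.
toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "points": [87, 85, 90],
})

mask = pd.isnull(toy.country)               # True where country is missing
missing_country = toy[mask]                 # rows with no country
has_country = toy[pd.notnull(toy.country)]  # rows that do have a country

print(missing_country)
```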
4. Use the fillna function to fill in missing values
reviews.region_2.fillna("Unknown")
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
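As a quick illustration on made-up data (toy is a hypothetical name), fillna replaces only the missing entries and returns a new Series. The source column still contains its missing values unless the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data: two of the three region_2 entries are missing.
toy = pd.DataFrame({"region_2": [None, "Willamette Valley", None]})

filled = toy.region_2.fillna("Unknown")   # missing entries become "Unknown"

print(filled.tolist())
print(toy.region_2.isnull().sum())        # the original column is unchanged
```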
5. Use the replace function to substitute one value for another
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
0 @kerino
1 @vossroger
...
129969 @vossroger
129970 @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object
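It may help to note that, called with two plain strings as above, Series.replace matches whole values rather than substrings. The sketch below uses a made-up Series (handles, including the third entry, is an assumption) to show that a value that merely contains the search string is left alone.

```python
import pandas as pd

# Hypothetical handles; the last one only contains the search string.
handles = pd.Series(["@kerinokeefe", "@vossroger", "@kerinokeefe2"])

replaced = handles.replace("@kerinokeefe", "@kerino")

print(replaced.tolist())
```

Only the exact match is rewritten; substring replacement would need a different tool such as the .str.replace accessor method.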
2. Practice
1. What is the data type of the points column in the dataset?
dtype = reviews.points.dtype
2. Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.
point_strings = reviews.points.astype('str')
3. Sometimes the price column is null. How many reviews in the dataset are missing a price?
n_missing_prices = pd.isnull(reviews.price).sum()
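The count works because summing a boolean Series treats True as 1 and False as 0. A minimal sketch on made-up prices (toy is a hypothetical name), with an equivalent row-filtering approach for comparison:

```python
import pandas as pd

# Hypothetical toy data: two of the four prices are missing.
toy = pd.DataFrame({"price": [float("nan"), 15.0, float("nan"), 13.0]})

n_missing = pd.isnull(toy.price).sum()        # each True counts as 1
n_missing_alt = len(toy[toy.price.isnull()])  # same count via row filtering

print(n_missing, n_missing_alt)
```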
4. What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:
Unknown 21247
Napa Valley 4480
...
Bardolino Superiore 1
Primitivo del Tarantino 1
Name: region_1, Length: 1230, dtype: int64
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
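Two details of this answer can be checked on a small, made-up column (toy is a hypothetical name). By default value_counts skips missing values, which is why fillna has to come first, and it already returns counts in descending order, so the explicit sort mainly states the intent.

```python
import pandas as pd

# Hypothetical toy data: three of the six region_1 entries are missing.
toy = pd.DataFrame({
    "region_1": ["Napa Valley", None, "Etna", None, "Napa Valley", None],
})

without_fill = toy.region_1.value_counts()               # missing values are dropped
counts = toy.region_1.fillna("Unknown").value_counts()   # missing values are counted

print(without_fill)
print(counts)
```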