Describing the data
1. For numerical variables:
reviews.points.describe()
Out:
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64
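2. For other (non-numerical) variables, describe() reports only the statistics that apply:
reviews.taster_name.describe()
Out: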
count 103727
unique 19
top Roger Voss
freq 25514
Name: taster_name, dtype: object
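The two kinds of summary can be compared side by side on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and only stand in for `reviews`.

```python
import pandas as pd

# Hypothetical toy data standing in for the `reviews` DataFrame.
toy = pd.DataFrame({
    "points": [85, 90, 95, 90],
    "taster_name": ["Roger Voss", "Kerin O'Keefe", "Roger Voss", None],
})

# Numerical column: count, mean, std, min, quartiles, max.
numeric_summary = toy.points.describe()

# Text column: count (non-missing), unique, top (most frequent), freq.
text_summary = toy.taster_name.describe()

print(numeric_summary)
print(text_summary)
```

Note that count in the text summary covers only the non-missing entries, which is why it can be smaller than the number of rows.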
3. List the unique values
countries = reviews.country.unique()
4. Count how many times each value occurs
reviews.country.value_counts()
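A short sketch of both calls on a made-up Series (the values are hypothetical, not taken from `reviews`):

```python
import pandas as pd

# Hypothetical values standing in for reviews.country.
country = pd.Series(
    ["Italy", "France", "Italy", "US", "Italy", "US"], name="country"
)

# unique() returns the distinct values in order of first appearance.
unique_countries = country.unique()

# value_counts() returns counts sorted from most to least frequent;
# missing values are left out by default.
counts = country.value_counts()

print(unique_countries)
print(counts)
```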
Working with types
1. dtype
Use .dtype to get the data type of a single column (a Series) of a DataFrame:
reviews.points.dtype
Out:
dtype('int64')
Use .dtypes to get the data types of all columns in a DataFrame:
reviews.dtypes
Out:
country object
description object
designation object
points int64
price float64
province object
region_1 object
region_2 object
taster_name object
taster_twitter_handle object
title object
variety object
winery object
dtype: object
2. Converting data types
Use .astype():
reviews.points.astype(str)
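A minimal sketch of checking and converting types, assuming a made-up `toy` frame in place of `reviews`. The point worth noticing is that astype() returns a new Series and leaves the original column untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for the `reviews` DataFrame.
toy = pd.DataFrame({"points": [87, 90], "price": [15.0, 20.0]})

print(toy.points.dtype)   # dtype of one column
print(toy.dtypes)         # dtypes of every column

# Convert to another type; the result is a new Series.
as_float = toy.points.astype("float64")
as_text = toy.points.astype(str)

# To keep the converted values, assign them back explicitly, e.g.
# toy["points"] = toy.points.astype("float64")
```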
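3. Replacing data
Detecting NaN (missing) values: pd.isnull() and its complement pd.notnull() return a boolean mask
pd.isnull()
pd.notnull()
Filling missing values: .fillna()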
reviews.region_2.fillna("Unknown")
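The masks returned by pd.isnull() and pd.notnull() are typically used to select rows. A small sketch, with a made-up `toy` frame in place of `reviews`:

```python
import pandas as pd

# Hypothetical toy data standing in for the `reviews` DataFrame.
toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "region_2": [None, "Napa", None],
})

# Boolean masks used for row selection.
missing_country = toy[pd.isnull(toy.country)]
known_country = toy[pd.notnull(toy.country)]

# fillna() returns a new Series; toy.region_2 itself is not modified.
filled = toy.region_2.fillna("Unknown")

print(missing_country)
print(filled)
```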
Replacing values
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
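One detail that is easy to miss: with two plain string arguments, Series.replace() swaps whole values, not substrings. A sketch on made-up handles:

```python
import pandas as pd

# Hypothetical handles standing in for reviews.taster_twitter_handle.
handles = pd.Series(["@kerinokeefe", "@vossroger", "@kerinokeefe_fan"])

# Only entries equal to "@kerinokeefe" are replaced;
# "@kerinokeefe_fan" merely contains it and is left alone.
renamed = handles.replace("@kerinokeefe", "@kerino")

print(renamed)
```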
Renaming rows and columns
Columns:
reviews.rename(columns={'points': 'score'})
Rows:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
Renaming the axes themselves (the name of the row index and the name of the column index):
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
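The difference between rename() and rename_axis() is easier to see on a small frame: rename() changes individual row or column labels, while rename_axis() names the axis as a whole. A sketch with a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy data standing in for the `reviews` DataFrame.
toy = pd.DataFrame({"points": [87, 90], "price": [15.0, 20.0]})

# rename() changes individual labels on either axis.
renamed = toy.rename(
    columns={"points": "score"},
    index={0: "firstEntry", 1: "secondEntry"},
)

# rename_axis() sets the name of each axis; parentheses allow the
# method chain to span several lines.
labelled = (
    toy.rename_axis("wines", axis="rows")
       .rename_axis("fields", axis="columns")
)

print(renamed)
print(labelled)
```

Both methods return a new DataFrame, so `toy` keeps its original labels.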
Combining
1. concat()
Requirement:
The DataFrames should have the same fields (columns); any column missing from one of them is filled with NaN.
gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
# Concatenate the two DataFrames
combined_products = pd.concat([movie_products,gaming_products])
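A small sketch of what concat() does to the index. The frames and the column names `name` and `total_mentions` are made up for illustration; they are not taken from the CSV files above.

```python
import pandas as pd

# Hypothetical frames with identical columns.
movies = pd.DataFrame({"name": ["A", "B"], "total_mentions": [10, 5]})
games = pd.DataFrame({"name": ["C"], "total_mentions": [7]})

# Rows are stacked; each frame keeps its own index labels,
# so labels can repeat in the result.
combined = pd.concat([movies, games])

# ignore_index=True builds a fresh 0..n-1 index instead.
combined_reset = pd.concat([movies, games], ignore_index=True)

print(combined)
print(combined_reset)
```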
2. join()
Requirement: the DataFrames may differ, but they must share a common index, because rows are matched on it
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
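lsuffix / rsuffix: suffixes appended to column names that appear in both the left and the right DataFrame (can be omitted when no column names overlap)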
powerlifting_combined = powerlifting_meets.set_index('MeetID').join(powerlifting_competitors.set_index('MeetID'))
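A minimal sketch of when the suffixes matter, using two made-up frames that share the column name `views`. Without suffixes pandas raises a ValueError for overlapping columns; with them, each side gets its own column.

```python
import pandas as pd

# Hypothetical frames indexed by title, both with a "views" column.
left = pd.DataFrame({"title": ["a", "b"], "views": [100, 200]}).set_index("title")
right = pd.DataFrame({"title": ["b", "c"], "views": [30, 40]}).set_index("title")

# Overlapping column names and no suffixes: join() refuses.
raised = False
try:
    left.join(right)
except ValueError:
    raised = True

# With suffixes the overlap is resolved. The default is a left join,
# so only the index labels of `left` are kept.
joined = left.join(right, lsuffix="_CAN", rsuffix="_UK")

print(joined)
```

Rows of `left` with no match in `right` get NaN in the right-hand columns.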