[Kaggle Study Notes] | Pandas

Continuing_Road

Category: Kaggle Study Notes. Tags: python, data analysis, data mining

Article link: https://blog.csdn.net/Continuing_Road/article/details/105134217

Creating, Reading, and Saving

pandas has two core data structures: DataFrame and Series.

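import pandas as pd

# DataFrame: a table; each dict key becomes a column, and index sets the row labels
df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                   'Sue': ['Pretty good.', 'Bland.']},
                  index=['Product A', 'Product B'])
df.loc['Product A']  # select the row labeled 'Product A'
# Series: a single labeled sequence of values, optionally with a name
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')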
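
To show how the two structures relate, here is a minimal sketch that lines up two Series as columns of one DataFrame. The second Series (`other`, named 'Product B') is made up purely for illustration.

```python
import pandas as pd

sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

# Hypothetical second product, added only for illustration
other = pd.Series([10, 20, 30], index=sales.index, name='Product B')

# Series that share an index line up as columns; each Series name becomes a column name
table = pd.concat([sales, other], axis='columns')

print(table.shape)                       # (3, 2)
print(list(table.columns))               # ['Product A', 'Product B']
print(table['Product A'].equals(sales))  # True: a column is itself a Series
```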

# Read winemag-data_first150k.csv, using its first column as the index
reviews = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv', index_col=0)
reviews.head()  # preview the first five rows
# Save the animals DataFrame (assumed to exist already) to cows_and_goats.csv
animals.to_csv("cows_and_goats.csv")
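
The save-then-read cycle can be tried without any files on disk. The sketch below writes to an in-memory buffer instead of a path; the contents of `animals` are invented here, since the notes do not show them.

```python
import io
import pandas as pd

# Hypothetical stand-in for the animals DataFrame
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

buffer = io.StringIO()
animals.to_csv(buffer)   # the index is written out as the first column
buffer.seek(0)           # rewind so the buffer can be read from the start

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)

print(list(restored.index))              # ['Year 1', 'Year 2']
print(restored.loc['Year 2', 'Goats'])   # 19
```

Without `index_col=0`, the saved index would come back as an ordinary data column.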

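Indexing, Selecting, and Assigning

# Simple, commonly used ways to index
reviews.country        # a column, via attribute access
reviews['country'][0]  # the country value in the row labeled 0

reviews.set_index("title")  # use the title column as the index (returns a new DataFrame)
reviews['index_backwards'] = range(len(reviews), 0, -1)  # assign values to a column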
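
One detail worth checking is that `set_index()` returns a new DataFrame, while column assignment changes the DataFrame in place. A small sketch on a made-up three-row table (the values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the reviews table
reviews = pd.DataFrame({'title': ['Wine X', 'Wine Y', 'Wine Z'],
                        'points': [87, 90, 92]})

by_title = reviews.set_index('title')                    # new DataFrame; reviews keeps its index
reviews['index_backwards'] = range(len(reviews), 0, -1)  # adds a column to reviews itself

print(list(reviews.index))               # [0, 1, 2]
print(list(by_title.index))              # ['Wine X', 'Wine Y', 'Wine Z']
print(list(reviews['index_backwards']))  # [3, 2, 1]
```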

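loc and iloc are pandas' own indexing accessors.

loc: label-based selection. It picks data by index labels, column names, or a boolean condition.

iloc: position-based selection. 0:4 selects the four rows at positions 0, 1, 2, and 3; note that the range is half-open, so the end position is excluded.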

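reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]  # all rows, three columns by name
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]  # rows from Italy or with points >= 90
reviews.loc[reviews.country.isin(['Italy', 'France'])]  # rows whose country is Italy or France
reviews.loc[reviews.price.notnull()]  # rows whose price is not NaN

reviews.iloc[1:3, 0]  # rows at positions 1 and 2 of the first column (position 0)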
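
The difference between the two accessors is easiest to see on slices: iloc excludes the end position, while a loc slice includes the end label. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical five-row table with the default 0..4 index
reviews = pd.DataFrame({'country': ['Italy', 'France', 'US', 'Italy', 'Spain'],
                        'points': [87, 91, 85, 93, 88]})

by_position = reviews.iloc[0:4]  # positions 0, 1, 2, 3 (end excluded)
by_label = reviews.loc[0:4]      # labels 0 through 4 (end included)

print(len(by_position))  # 4
print(len(by_label))     # 5

# Boolean conditions go through loc; each condition needs its own parentheses
picked = reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
print(list(picked.index))  # [0, 1, 3]
```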

Summary Functions and Maps

reviews.points.describe()  # summary statistics
reviews.points.mean()  # mean
reviews.taster_name.unique()  # the distinct values in the column
reviews.taster_name.value_counts()  # how many times each value occurs

# map() returns a new Series in which each value is the result of applying the function to the original value
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
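
As a quick check of what map() produces, the sketch below centers a tiny hand-made Series and compares the result with plain Series arithmetic, which gives the same values here. The three numbers are made up.

```python
import pandas as pd

points = pd.Series([80, 90, 100], name='points')
mean = points.mean()  # 90.0

centered = points.map(lambda p: p - mean)  # new Series; points itself is unchanged
vectorized = points - mean                 # same values via ordinary arithmetic

print(centered.tolist())            # [-10.0, 0.0, 10.0]
print(centered.equals(vectorized))  # True
print(points.tolist())              # [80, 90, 100]
```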

# apply() calls a custom function on each row to transform the whole DataFrame;
# it returns a new DataFrame and does not modify the original reviews
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
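
The claim that apply() leaves the original untouched can be verified on a two-row table. This is a sketch with invented values; the function is the same as the one above.

```python
import pandas as pd

# Hypothetical two-row table
reviews = pd.DataFrame({'country': ['Italy', 'France'], 'points': [80, 100]})
review_points_mean = reviews.points.mean()  # 90.0

def remean_points(row):
    # row is a Series holding one row; 'points' is one of its labels
    row.points = row.points - review_points_mean
    return row

remeaned = reviews.apply(remean_points, axis='columns')

print(remeaned.points.tolist())  # [-10.0, 10.0]
print(reviews.points.tolist())   # [80, 100]
```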

Grouping and Sorting

reviews.groupby('points').points.count()  # group by points and count the rows in each group
reviews.groupby('points').price.min()  # group by points and find the minimum price in each group

# Group by country and province, and keep the highest-scoring row in each group
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
reviews.groupby(['country']).price.agg([len, min, max])  # group by country and compute len, min, and max of price

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])  # the result has a multi-level index
countries_reviewed.reset_index()  # turn the multi-level index back into ordinary columns

countries_reviewed.sort_values(by='len')  # sort by len, ascending
countries_reviewed.sort_values(by='len', ascending=False)  # sort by len, descending
countries_reviewed.sort_values(by=['country', 'len'])  # sort by several keys: country first, then len
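
The grouping and sorting steps can be traced on a six-row table. This is a sketch on invented data, with two deliberate departures from the snippets above: aggregations are given as string names ('count', 'min', 'max') rather than the builtins, and the best row per group is picked with idxmax() plus loc instead of groupby().apply(). Note that 'count' skips NaN values, whereas len counts every row.

```python
import pandas as pd

# Hypothetical miniature of the reviews table
reviews = pd.DataFrame({
    'country': ['Italy', 'Italy', 'France', 'France', 'France', 'France'],
    'province': ['Tuscany', 'Tuscany', 'Bordeaux', 'Alsace', 'Bordeaux', 'Bordeaux'],
    'points': [88, 92, 90, 85, 95, 91],
    'price': [20.0, 35.0, 50.0, 18.0, 80.0, 60.0],
})

# One row per country, one column per aggregation
stats = reviews.groupby('country').price.agg(['count', 'min', 'max'])
print(stats.loc['France'].tolist())  # [4.0, 18.0, 80.0]

# Grouping by two columns gives a multi-level index; reset_index() flattens it
counts = reviews.groupby(['country', 'province']).size().rename('n')
flat = counts.reset_index().sort_values(by='n', ascending=False)
print(flat['province'].tolist())  # ['Bordeaux', 'Tuscany', 'Alsace']

# Highest-scoring row per country
best = reviews.loc[reviews.groupby('country').points.idxmax()]
print(best.points.tolist())  # [95, 92]
```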

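Data Types and Missing Values

reviews.price.dtype  # check the data type of a column
reviews.points.astype('float64')  # convert the column to another data type

reviews[pd.isnull(reviews.country)]  # rows where country is missing
reviews.region_2.fillna("Unknown")  # replace missing values with "Unknown"
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")  # replace one value with another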
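
Each of these calls returns a new object rather than changing reviews. A minimal sketch on a made-up three-row table shows the results and confirms that the original column still has its missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing entries
reviews = pd.DataFrame({'country': ['Italy', None, 'US'],
                        'region_2': [np.nan, 'Napa', np.nan],
                        'points': [87, 90, 92]})

missing_country = reviews[pd.isnull(reviews.country)]
filled = reviews.region_2.fillna('Unknown')
as_float = reviews.points.astype('float64')

print(list(missing_country.index))      # [1]
print(filled.tolist())                  # ['Unknown', 'Napa', 'Unknown']
print(as_float.dtype)                   # float64
print(reviews.region_2.isnull().sum())  # 2
```

To keep a result, assign it back, for example `reviews['region_2'] = reviews.region_2.fillna('Unknown')`.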

Renaming and Combining

reviews.rename(columns={'points': 'score'})  # rename the points column to score
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})  # rename index labels; set_index() is usually used instead
reindexed = reviews.rename_axis('wines', axis='rows')  # give the row index a name
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')  # name both the row axis and the column axis
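
The difference between renaming labels (rename) and naming an axis (rename_axis) is easy to mix up. The sketch below runs the calls above on a hypothetical two-row table and prints what each one changes:

```python
import pandas as pd

# Hypothetical two-row table
reviews = pd.DataFrame({'points': [87, 90], 'price': [20.0, 35.0]})

renamed = reviews.rename(columns={'points': 'score'})
relabeled = reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
named = reviews.rename_axis('wines', axis='rows').rename_axis('fields', axis='columns')

print(list(renamed.columns))                 # ['score', 'price']
print(list(relabeled.index))                 # ['firstEntry', 'secondEntry']
print(named.index.name, named.columns.name)  # wines fields
print(list(reviews.columns))                 # ['points', 'price']
```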

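pd.concat([canadian_youtube, british_youtube])  # stack tables that have the same columns, i.e. add rows
left.join(right, lsuffix='_CAN', rsuffix='_UK')  # combine tables on a shared index, i.e. add columns; the two suffix arguments are needed when column names collide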
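
The two ways of combining can be compared on a pair of tiny tables. The data below is invented; the notes do not show how `left` and `right` are built, so this sketch assumes they are the two tables indexed by a shared `title` column.

```python
import pandas as pd

# Hypothetical miniature tables
canadian_youtube = pd.DataFrame({'title': ['A', 'B'], 'views': [100, 200]})
british_youtube = pd.DataFrame({'title': ['B', 'C'], 'views': [150, 300]})

# concat stacks rows and keeps each table's original index labels
stacked = pd.concat([canadian_youtube, british_youtube])
print(stacked.shape)        # (4, 2)
print(list(stacked.index))  # [0, 1, 0, 1]

# join matches rows by index; the suffixes tell the two views columns apart
left = canadian_youtube.set_index('title')
right = british_youtube.set_index('title')
joined = left.join(right, lsuffix='_CAN', rsuffix='_UK')

print(list(joined.columns))                # ['views_CAN', 'views_UK']
print(list(joined.index))                  # ['A', 'B']
print(joined['views_UK'].isna().tolist())  # [True, False]
```

By default join() keeps every row of the left table, so title 'A', which has no match on the right, gets NaN in views_UK.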