Pandas

拉普拉斯之妖

Category: Data Analysis

Article link: https://blog.csdn.net/weixin_45603120/article/details/109202682

Creating, Reading and Writing

There are two core objects in pandas: the DataFrame and the Series.

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
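The two objects are closely related: selecting one column of a DataFrame yields a Series that shares the DataFrame's index. A minimal sketch reusing the toy data above (the variable names ratings and bob are illustrative, not from the original notes):

```python
import pandas as pd

# Same toy data as the DataFrame example above.
ratings = pd.DataFrame(
    {"Bob": ["I liked it.", "It was awful."],
     "Sue": ["Pretty good.", "Bland."]},
    index=["Product A", "Product B"],
)

# Selecting a single column returns a Series named after that column.
bob = ratings["Bob"]

print(type(bob).__name__)    # the class of the selected column
print(bob.loc["Product B"])  # look up one value by its index label
```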

Reading data

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()

index_col=0: use the first column of the CSV file as the index instead of letting pandas generate a new one.

wine_reviews.shape
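Writing is the mirror image of reading: DataFrame.to_csv() saves a frame, and read_csv() with index_col=0 restores its index. The sketch below assumes a hypothetical two-row frame and an in-memory buffer in place of the Kaggle file, so it runs anywhere:

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the wine reviews file.
reviews_small = pd.DataFrame(
    {"country": ["Italy", "France"], "points": [87, 92]},
    index=[10, 11],
)

# Write to an in-memory text buffer instead of a real file path.
buffer = io.StringIO()
reviews_small.to_csv(buffer)  # the index is written as the first column

# Rewind and read it back, reusing the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored.shape)  # (rows, columns)
```

With a real file, pass a path such as "reviews.csv" (a hypothetical name) to both calls instead of the buffer.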

Indexing, Selecting & Assigning

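reviews = wine_reviews  # the examples below refer to the same DataFrame as reviews
reviews.country     # attribute-style access
reviews['country']  # dictionary-style access; also works for column names containing spaces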

Index-based selection

reviews.iloc[:, 0]
reviews.iloc[[0, 1, 2], 0]
reviews.iloc[-5:] # the last five rows

Label-based selection

reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
df.loc['Apples':'Potatoes']

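Note that iloc slices are half-open (the end position is excluded), while loc slices are closed (the end label is included).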
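The difference is easiest to see on a frame with the default integer index, where the same slice 0:2 returns a different number of rows. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..3.
fruit = pd.DataFrame({"item": ["Apples", "Bananas", "Cherries", "Potatoes"],
                      "price": [1.0, 2.0, 3.0, 4.0]})

by_position = fruit.iloc[0:2]  # positions 0 and 1; position 2 is excluded
by_label = fruit.loc[0:2]      # labels 0, 1 and 2; the end label is included

# With string labels, loc also includes the end label.
by_name = fruit.set_index("item").loc["Apples":"Cherries"]

print(len(by_position), len(by_label), len(by_name))
```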

Manipulating the index

reviews.set_index("title")
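set_index() returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [90, 85]})

# The result is a new frame; wines itself keeps its integer index.
indexed = wines.set_index("title")

print(indexed.loc["Wine B", "points"])  # label-based lookup by title
print("title" in wines.columns)         # the original still has the column
```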

Conditional selection

reviews.country == 'Italy' # produces a Series of True/False booleans, one per record
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()] # commonly used to filter out missing values; isnull() selects the opposite
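To make the effect of each filter concrete, here is a sketch on a hypothetical four-row frame. Each condition is a boolean Series, and the parentheses around each comparison are required because & and | bind more tightly than == and >=.

```python
import pandas as pd

# Hypothetical miniature of the reviews data.
reviews_small = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, None, 12.0, 60.0],
})

is_italy = reviews_small.country == "Italy"  # boolean Series, one value per row

top_italian = reviews_small.loc[is_italy & (reviews_small.points >= 90)]  # both conditions
either = reviews_small.loc[is_italy | (reviews_small.points >= 90)]       # at least one
european = reviews_small.loc[reviews_small.country.isin(["Italy", "France"])]
priced = reviews_small.loc[reviews_small.price.notnull()]                 # drops the missing price

print(len(top_italian), len(either), len(european), len(priced))
```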

Assigning data

reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Summary Functions and Maps

Summary functions

reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

reviews.points.mean()
reviews.taster_name.unique()
reviews.taster_name.value_counts()
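On a text column these functions ignore missing values when counting. A sketch on a hypothetical four-entry Series (the names are sample values, not results from the real dataset):

```python
import pandas as pd

# Hypothetical taster column with one missing entry.
tasters = pd.Series(["Roger Voss", "Roger Voss", None, "Kerin O'Keefe"],
                    name="taster_name")

summary = tasters.describe()     # count, unique, top, freq for text data
counts = tasters.value_counts()  # occurrences per value, most frequent first

print(summary["count"], summary["unique"], summary["top"], summary["freq"])
print(counts.index[0], counts.iloc[0])
```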

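map and apply

Series.map() transforms one value at a time and returns a new Series. DataFrame.apply() with axis='columns' calls the function once per row and, when the function returns the row, produces a new DataFrame. Neither modifies the original data.

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)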

def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Sometimes you can also operate on the columns directly:

reviews.points - review_points_mean
reviews.country + " - " + reviews.region_1
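The three approaches can be compared side by side on a hypothetical three-row frame. This is a sketch rather than the original notebook code; in particular, the row function copies the row before changing it, which is an added precaution.

```python
import pandas as pd

# Hypothetical miniature of the reviews data.
reviews_small = pd.DataFrame({
    "country": ["Italy", "France", "US"],
    "region_1": ["Etna", "Alsace", "Napa"],
    "points": [90, 86, 88],
})
review_points_mean = reviews_small.points.mean()

# 1. map: element-wise over one Series.
centered_map = reviews_small.points.map(lambda p: p - review_points_mean)

# 2. apply: row-wise over the whole DataFrame.
def remean_points(row):
    row = row.copy()  # avoid touching the row object handed in by pandas
    row["points"] = row["points"] - review_points_mean
    return row

centered_apply = reviews_small.apply(remean_points, axis="columns")

# 3. Direct operators: same result as map, usually the fastest option.
centered_vec = reviews_small.points - review_points_mean
labels = reviews_small.country + " - " + reviews_small.region_1

print(list(centered_vec))
print(list(labels))
```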

Grouping and Sorting

Groupwise analysis

reviews.groupby('points').price.min()
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
reviews.groupby(['country']).price.agg([len, min, max])
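A sketch of the same grouping ideas on hypothetical data. It deliberately swaps in string aggregation names ("count", "min", "max") for the built-in functions, since recent pandas versions may warn about the latter, and uses first() in place of the lambda. Note that "count" skips missing values whereas len does not, and first() returns the first non-missing value; with no missing data, as here, the results agree.

```python
import pandas as pd

# Hypothetical miniature of the reviews data.
reviews_small = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "France"],
    "winery": ["A", "A", "B", "C", "C"],
    "title": ["A1", "A2", "B1", "C1", "C2"],
    "price": [10.0, 30.0, 20.0, 50.0, 40.0],
})

# Several aggregations at once: one row per country, one column per statistic.
price_stats = reviews_small.groupby("country").price.agg(["count", "min", "max"])

# First title listed for each winery.
first_titles = reviews_small.groupby("winery").title.first()

print(price_stats)
print(first_titles)
```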

Multi-indexes

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index() # convert the multi-index back to a regular single-level index

Sorting

sort_values() defaults to an ascending sort, where the lowest values go first.

countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:

countries_reviewed.sort_index()
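The multi-index and sorting steps can be traced end to end on a hypothetical six-row frame. As an assumption of this sketch, the aggregation uses the string name "count" instead of len, so the resulting column is called 'count' rather than 'len'.

```python
import pandas as pd

# Hypothetical miniature of the reviews data.
reviews_small = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "France", "France", "US"],
    "province": ["Tuscany", "Tuscany", "Sicily", "Alsace", "Alsace", "Oregon"],
    "description": ["a", "b", "c", "d", "e", "f"],
})

# Grouping by two columns produces a two-level (multi) index.
counts = reviews_small.groupby(["country", "province"]).description.agg(["count"])

# reset_index() turns the index levels back into ordinary columns.
flat = counts.reset_index()

by_count = flat.sort_values(by="count", ascending=False)     # largest groups first
by_country_then_count = flat.sort_values(by=["country", "count"])

print(counts.index.nlevels)
print(by_country_then_count)
```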

Data Types and Missing Values

The data type for a column in a DataFrame or a Series is known as the dtype.

reviews.price.dtype
reviews.dtypes

Type conversion

reviews.points.astype('float64')

Missing data

reviews[pd.isnull(reviews.country)]
reviews.region_2.fillna("Unknown")
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
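Each of these calls returns a new object rather than changing reviews in place. A sketch on a hypothetical three-row frame (the values are illustrative sample data):

```python
import pandas as pd

# Hypothetical miniature of the reviews data with some missing entries.
reviews_small = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "region_2": [None, "Napa", None],
    "points": [90, 85, 88],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger", "@kerinokeefe"],
})

points_float = reviews_small.points.astype("float64")            # int64 -> float64
missing_country = reviews_small[pd.isnull(reviews_small.country)]  # rows with no country
region_filled = reviews_small.region_2.fillna("Unknown")         # fill in missing values
handles = reviews_small.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

print(points_float.dtype)
print(list(region_filled))
print(list(handles))
```

To keep a result, assign it back to the column, for example reviews_small["region_2"] = region_filled.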
