Pandas

Creating, Reading and Writing

There are two core objects in pandas: the DataFrame and the Series.

import pandas as pd

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
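
To show how the two core objects relate, here is a minimal sketch on the same made-up review table (the variable names `reviews_df` and `bob` are illustrative, not from the original). Selecting a single column of a DataFrame gives back a Series that shares the DataFrame's index.

```python
import pandas as pd

# Small made-up review table, mirroring the example above.
reviews_df = pd.DataFrame(
    {"Bob": ["I liked it.", "It was awful."],
     "Sue": ["Pretty good.", "Bland."]},
    index=["Product A", "Product B"],
)

# Selecting one column yields a Series with the same row labels.
bob = reviews_df["Bob"]
print(type(bob).__name__)   # Series
print(list(bob.index))      # ['Product A', 'Product B']
print(reviews_df.shape)     # (2, 2)
```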

Reading data

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()

index_col=0: use the dataset's own first column as the index instead of letting pandas create a new one.

wine_reviews.shape
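
The effect of index_col=0 can be checked without the Kaggle file. The sketch below uses a hypothetical in-memory CSV (`csv_text` is made up for illustration) and also shows the writing half of this section with `to_csv`, which returns a string when no path is given.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,90\n"

# Without index_col, the unnamed first column is read as ordinary data.
plain = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, that column becomes the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.shape)    # (2, 3)
print(indexed.shape)  # (2, 2)

# Write back to CSV text and read it again to confirm the round trip.
out = indexed.to_csv()
round_trip = pd.read_csv(io.StringIO(out), index_col=0)
```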

Indexing, Selecting & Assigning

reviews.country
reviews['country']

Index-based selection

reviews.iloc[:, 0]
reviews.iloc[[0, 1, 2], 0]
reviews.iloc[-5:] # the last five rows

Label-based selection

reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
df.loc['Apples':'Potatoes']

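Note that iloc slices are half-open (the end position is excluded), while loc slices are closed (both end labels are included).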
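
The slicing difference is easy to miss when the index is made of integers, so here is a small sketch on made-up data (`df` and `df_int` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({"qty": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_position = df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = df.loc["a":"c"]   # labels 'a', 'b' and 'c'; the end label is included
print(len(by_position))  # 2
print(len(by_label))     # 3

# With a default integer index the same slice bounds give different results.
df_int = pd.DataFrame({"qty": [10, 20, 30, 40]})
print(len(df_int.iloc[0:2]))  # 2
print(len(df_int.loc[0:2]))   # 3
```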

Manipulating the index

reviews.set_index("title")
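
One detail worth knowing: set_index returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on made-up rows (the titles and scores are invented):

```python
import pandas as pd

reviews = pd.DataFrame({"title": ["Wine X", "Wine Y"], "points": [87, 90]})

by_title = reviews.set_index("title")

print(reviews.index.tolist())            # [0, 1]
print(by_title.index.tolist())           # ['Wine X', 'Wine Y']
print(by_title.loc["Wine Y", "points"])  # 90
```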

Conditional selection

reviews.country == 'Italy' # produces a Series of True/False values, one per record
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()] # often used to filter out missing values; isnull() is its counterpart
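
The selections above can be tried on a four-row stand-in for the wine data (all values below are made up). The parentheses around each comparison matter, because & and | bind more tightly than == and >= in Python.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, None, 12.0, 50.0],
})

is_italy = reviews.country == "Italy"   # boolean Series
high = reviews.points >= 90             # boolean Series

both = reviews.loc[is_italy & high]     # Italian AND at least 90 points
either = reviews.loc[is_italy | high]   # Italian OR at least 90 points
some = reviews.loc[reviews.country.isin(["Italy", "France"])]
priced = reviews.loc[reviews.price.notnull()]

print(len(both), len(either), len(some), len(priced))  # 1 3 3 3
```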

Assigning data

reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)
0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
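
As a small-scale check of both assignment styles, the sketch below uses a three-row made-up table: a constant is broadcast to every row, while an iterable must have one value per row.

```python
import pandas as pd

reviews = pd.DataFrame({"points": [87, 90, 92]})

# A single value is repeated for every row.
reviews["critic"] = "everyone"
# An iterable of matching length is assigned element by element.
reviews["index_backwards"] = range(len(reviews), 0, -1)

print(reviews["index_backwards"].tolist())  # [3, 2, 1]
```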

Summary Functions and Maps

Summary functions

reviews.points.describe()
count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64
reviews.taster_name.describe()
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
reviews.points.mean()
reviews.taster_name.unique()
reviews.taster_name.value_counts()
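
To see what these summaries return for text data, here is a sketch on a four-element made-up Series with one missing value (the names are invented). describe() and value_counts() skip the missing entry, while unique() keeps it.

```python
import pandas as pd

tasters = pd.Series(["Roger", "Kerin", "Roger", None], name="taster_name")

summary = tasters.describe()      # count, unique, top, freq for text data
counts = tasters.value_counts()   # occurrences per value, missing excluded
distinct = tasters.unique()       # distinct values, missing included

print(summary["count"], summary["top"], summary["freq"])  # 3 Roger 2
print(counts["Roger"], counts["Kerin"])                   # 2 1
print(len(distinct))                                      # 3
```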

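map and apply
Series.map returns a Series. DataFrame.apply with axis='columns' returns a DataFrame when the function returns a whole row, as remean_points does here.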

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Sometimes you can also operate on the columns directly:

reviews.points - review_points_mean
reviews.country + " - " + reviews.region_1
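
The three approaches can be compared side by side on a small made-up table. This is a sketch only: the data is invented, and remean_points here copies the row before changing it, which is an addition to the original function.

```python
import pandas as pd

reviews = pd.DataFrame({
    "points": [80, 90, 100],
    "country": ["Italy", "US", "France"],
})
review_points_mean = reviews.points.mean()  # 90.0

# map: element by element over one Series, returns a Series.
mapped = reviews.points.map(lambda p: p - review_points_mean)

def remean_points(row):
    row = row.copy()  # work on a copy so the input row is left alone
    row["points"] = row["points"] - review_points_mean
    return row

# apply with axis='columns': called once per row, returns a DataFrame here.
applied = reviews.apply(remean_points, axis="columns")

# Direct arithmetic on the column gives the same numbers as map.
vectorized = reviews.points - review_points_mean

print(mapped.tolist())            # [-10.0, 0.0, 10.0]
print(type(applied).__name__)     # DataFrame
```

The direct form is usually the one to reach for first, since it avoids calling a Python function per element.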

Grouping and Sorting

Groupwise analysis

reviews.groupby('points').price.min()
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
reviews.groupby(['country']).price.agg([len, min, max])
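
A runnable sketch of the grouped aggregation on made-up prices follows. It swaps the built-ins len, min and max for the string names "count", "min" and "max", since some recent pandas versions warn when built-ins are passed. Note that "count" skips missing values whereas len does not, so the two differ when a group contains NaN.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "price": [10.0, 30.0, 20.0, 25.0, 60.0],
})

# One row per country, one column per aggregation.
summary = reviews.groupby("country").price.agg(["count", "min", "max"])

print(summary.loc["Italy", "count"])  # 2
print(summary.loc["US", "max"])       # 60.0
```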

Multi-indexes

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index() # convert back to a regular single-level index
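
To make the multi-index visible, the sketch below groups three made-up reviews by two columns and then flattens the result. The string "count" is used in place of len here, which is a substitution and gives the same numbers only when no descriptions are missing.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US"],
    "province": ["Tuscany", "Tuscany", "California"],
    "description": ["a", "b", "c"],
})

countries_reviewed = (
    reviews.groupby(["country", "province"]).description.agg(["count"])
)
print(type(countries_reviewed.index).__name__)  # MultiIndex

# reset_index moves both index levels back into ordinary columns.
flat = countries_reviewed.reset_index()
print(list(flat.columns))  # ['country', 'province', 'count']
```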

Sorting
sort_values() defaults to an ascending sort, where the lowest values go first.

countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:

countries_reviewed.sort_index()
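
The sorting calls can be tried on a small made-up table whose index is deliberately out of order (`counts` and its values are illustrative).

```python
import pandas as pd

counts = pd.DataFrame(
    {"country": ["US", "Italy", "France"], "len": [5, 2, 9]},
    index=[2, 0, 1],
)

asc = counts.sort_values(by="len")                    # lowest first (default)
desc = counts.sort_values(by="len", ascending=False)  # highest first
by_index = counts.sort_index()                        # order by index labels

print(asc["len"].tolist())           # [2, 5, 9]
print(desc["len"].tolist())          # [9, 5, 2]
print(by_index["country"].tolist())  # ['Italy', 'France', 'US']
```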

Data Types and Missing Values

The data type for a column in a DataFrame or a Series is known as the dtype.

reviews.price.dtype
reviews.dtypes

Type conversion

reviews.points.astype('float64')
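
A short sketch of dtype inspection and conversion on made-up values: astype returns a converted copy, so the original column keeps its dtype unless the result is assigned back.

```python
import pandas as pd

reviews = pd.DataFrame({"points": [87, 90], "price": [15.0, None]})

print(reviews.points.dtype)  # int64
print(reviews.price.dtype)   # float64

as_float = reviews.points.astype("float64")
print(as_float.dtype)        # float64
print(reviews.points.dtype)  # int64 (the original column is unchanged)
```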

Missing data

reviews[pd.isnull(reviews.country)]
reviews.region_2.fillna("Unknown")
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
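
The three calls above can be checked on a made-up three-row table. Each returns a new object, so the original `reviews` still contains its missing values afterwards.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "region_2": [None, "Napa", None],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger", "@kerinokeefe"],
})

# Rows whose country is missing.
missing_country = reviews[pd.isnull(reviews.country)]
# Missing regions replaced by a placeholder.
filled = reviews.region_2.fillna("Unknown")
# Every exact match of one value replaced by another.
renamed = reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

print(len(missing_country))  # 1
print(filled.tolist())       # ['Unknown', 'Napa', 'Unknown']
print(renamed.tolist())      # ['@kerino', '@vossroger', '@kerino']
```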