Creating, Reading and Writing
There are two core objects in pandas: the DataFrame and the Series.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Reading data
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
index_col=0: use the first column of the file as the index instead of letting pandas create a new one.
wine_reviews.shape
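To see what index_col=0 changes, here is a small sketch that reads the same CSV text twice. The in-memory CSV and its values are made up for illustration and only stand in for the real file:

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column is an unnamed row index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n"

# Without index_col, the first column is kept as ordinary data.
with_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_default.shape)          # (3, 3)
print(list(with_default.columns))  # ['Unnamed: 0', 'country', 'points']
print(with_index.shape)            # (3, 2)
```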
Indexing, Selecting & Assigning
reviews.country # reviews is assumed to be the DataFrame loaded above as wine_reviews
reviews['country']
Index-based selection
reviews.iloc[:, 0]
reviews.iloc[[0, 1, 2], 0]
reviews.iloc[-5:] # last five rows
Label-based selection
reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
df.loc['Apples':'Potatoes']
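Note: iloc slices are half-open (the end position is excluded), while loc slices are closed (the end label is included).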
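The slicing difference is easiest to see on a frame with the default integer index, where the same slice 0:2 gives different results. A minimal sketch; the DataFrame df and its column x are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3.
df = pd.DataFrame({"x": [10, 20, 30, 40]})

by_position = df.iloc[0:2]  # positions 0 and 1; position 2 is excluded
by_label = df.loc[0:2]      # labels 0, 1 and 2; label 2 is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```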
Manipulating the index
reviews.set_index("title")
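set_index() returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch with a made-up two-row frame (wines and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [87, 90]})

# The title column becomes the index of the returned frame.
by_title = wines.set_index("title")

print(by_title.loc["Wine B", "points"])  # 90
print("title" in wines.columns)          # True: the original is unchanged
```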
Conditional selection
reviews.country == 'Italy' # produces a Series of True/False booleans, one per record, based on its country
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()] # commonly used to filter out missing values; the companion method is isnull()
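The conditions above can be checked on a small frame. This is only a sketch, and the four rows of wines are made up for illustration. The parentheses around each comparison are required because & and | bind more tightly than == and >=.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data; one price is missing.
wines = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, None, 12.0, 60.0],
})

is_italian = wines.country == "Italy"  # boolean Series, one value per row

good_italian = wines.loc[is_italian & (wines.points >= 90)]     # both must hold
italian_or_good = wines.loc[is_italian | (wines.points >= 90)]  # either may hold
priced = wines.loc[wines.price.notnull()]                       # drop missing prices

print(list(good_italian.index))     # [0]
print(list(italian_or_good.index))  # [0, 2, 3]
print(list(priced.index))           # [0, 2, 3]
```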
Assigning data
reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64
Summary Functions and Maps
Summary functions
reviews.points.describe()
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64
reviews.taster_name.describe()
count 103727
unique 19
top Roger Voss
freq 25514
Name: taster_name, dtype: object
reviews.points.mean()
reviews.taster_name.unique()
reviews.taster_name.value_counts()
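describe() adapts to the column type: numeric columns get count, mean and quartiles, while string columns get count, unique, top and freq. A small sketch on made-up Series (the values and taster names are hypothetical):

```python
import pandas as pd

# Hypothetical numeric and string columns; one taster name is missing.
points = pd.Series([80, 90, 100, 90], name="points")
tasters = pd.Series(["Taster A", "Taster A", None, "Taster B"], name="taster_name")

numeric_summary = points.describe()
text_summary = tasters.describe()
counts = tasters.value_counts()  # missing values are not counted by default

print(numeric_summary["mean"])  # 90.0
print(text_summary["count"])    # 3: non-missing entries only
print(text_summary["top"])      # Taster A
print(counts["Taster A"])       # 2
```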
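map and apply
map() returns a new Series. apply() on a DataFrame returns a new DataFrame when the function returns each row, as below (or a Series when it returns a scalar). Neither modifies the original data.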
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
Sometimes you can operate on the columns directly, which is usually faster:
reviews.points - review_points_mean
reviews.country + " - " + reviews.region_1
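The three approaches give the same numbers. The sketch below compares them on a made-up two-row frame (wines and its values are hypothetical). Unlike the version above, remean_points here copies the row before changing it, so the original frame is never touched.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "US"],
    "region_1": ["Etna", "Napa"],
    "points": [88, 92],
})
review_points_mean = wines.points.mean()  # 90.0

# 1. map: element by element on one column, returns a Series.
via_map = wines.points.map(lambda p: p - review_points_mean)

# 2. apply: row by row on the whole frame, returns a DataFrame.
def remean_points(row):
    row = row.copy()  # work on a copy of the row
    row["points"] = row["points"] - review_points_mean
    return row

via_apply = wines.apply(remean_points, axis="columns")

# 3. Direct arithmetic on whole columns.
direct = wines.points - review_points_mean
labels = wines.country + " - " + wines.region_1

print(list(via_map))  # [-2.0, 2.0]
print(list(direct))   # [-2.0, 2.0]
print(list(labels))   # ['Italy - Etna', 'US - Napa']
```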
Grouping and Sorting
Groupwise analysis
reviews.groupby('points').price.min()
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
reviews.groupby(['country']).price.agg([len, min, max])
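A sketch of the same three groupby patterns on made-up data (wines and its values are hypothetical). Two deliberate substitutions: the aggregations are named by the strings "count", "min" and "max" rather than the Python built-ins, and the first title is taken with apply on the selected column rather than on the whole group. Note that "count" skips missing prices, whereas len counts every row.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "US"],
    "winery": ["A", "A", "B"],
    "title": ["A 2010", "A 2011", "B 2012"],
    "points": [88, 88, 92],
    "price": [20.0, 15.0, 40.0],
})

# Cheapest wine in each points bracket.
cheapest_by_points = wines.groupby("points").price.min()

# First title listed for each winery.
first_title = wines.groupby("winery").title.apply(lambda s: s.iloc[0])

# Several summary statistics per country at once.
price_stats = wines.groupby("country").price.agg(["count", "min", "max"])

print(cheapest_by_points.loc[88])      # 15.0
print(first_title.loc["A"])            # A 2010
print(price_stats.loc["Italy", "max"]) # 20.0
```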
Multi-indexes
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index() # convert the multi-index back to a regular single-level index
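Grouping by more than one column produces a MultiIndex, and reset_index() turns its levels back into ordinary columns. A minimal sketch on made-up rows (the data is hypothetical, and "count" is used in place of len):

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "US"],
    "province": ["Sicily", "Sicily", "California"],
    "description": ["Bright.", "Earthy.", "Ripe."],
})

countries_reviewed = wines.groupby(["country", "province"]).description.agg(["count"])
flat = countries_reviewed.reset_index()

print(isinstance(countries_reviewed.index, pd.MultiIndex))  # True
print(list(flat.columns))  # ['country', 'province', 'count']
```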
Sorting
sort_values() defaults to an ascending sort, where the lowest values go first.
countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])
To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:
countries_reviewed.sort_index()
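The three sorts can be compared on a small frame by looking at the resulting row order. A sketch with made-up values; the frame stats and its shuffled index are hypothetical:

```python
import pandas as pd

# Hypothetical frame with a deliberately shuffled index.
stats = pd.DataFrame(
    {"country": ["Italy", "US", "Italy"], "len": [3, 5, 8]},
    index=[2, 0, 1],
)

by_len_desc = stats.sort_values(by="len", ascending=False)  # largest len first
by_country_len = stats.sort_values(by=["country", "len"])   # country, then len
by_index = stats.sort_index()                               # row labels ascending

print(list(by_len_desc.index))     # [1, 0, 2]
print(list(by_country_len.index))  # [2, 1, 0]
print(list(by_index.index))        # [0, 1, 2]
```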
Data Types and Missing Values
The data type for a column in a DataFrame or a Series is known as the dtype.
reviews.price.dtype
reviews.dtypes
Type conversion
reviews.points.astype('float64')
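astype() returns a converted copy, so the original column keeps its dtype. A short sketch on a made-up frame (wines and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
wines = pd.DataFrame({"points": [87, 90], "price": [15.0, None]})

print(wines.points.dtype)  # int64
print(wines.price.dtype)   # float64: a missing value forces a float column

# Convert a copy of the column; wines.points itself is unchanged.
as_float = wines.points.astype("float64")
print(as_float.dtype)      # float64
```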
Missing data
reviews[pd.isnull(reviews.country)]
reviews.region_2.fillna("Unknown")
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
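Both fillna() and replace() return a new Series rather than changing the data in place. A minimal sketch on made-up rows (the frame wines, its handle column and the second handle are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the reviews data, with missing entries.
wines = pd.DataFrame({
    "country": ["Italy", None],
    "region_2": [None, "Napa"],
    "handle": ["@kerinokeefe", "@someone"],
})

missing_country = wines[pd.isnull(wines.country)]  # rows with no country
filled = wines.region_2.fillna("Unknown")          # missing values replaced
replaced = wines.handle.replace("@kerinokeefe", "@kerino")

print(list(missing_country.index))  # [1]
print(list(filled))                 # ['Unknown', 'Napa']
print(list(replaced))               # ['@kerino', '@someone']
```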