group这个函数可以在python中实现类似excel表格中数据透视表的可视化效果
ignore_index
和reset_index()
的区别:
前者是方法中的一个参数,后者是一个方法。
前者是新dataframe从0开始计数,后者是将分组后的数据(数据透视表形式)划为面板数据
idxmax()
这个方法可以找到当前列最大值对应的索引
9.5号发现个set_index()
方法,这个方法是设置索引方便join进行表连接。reset_index()和set_index()是完全不同是两个方法。
文章目录
方法介绍
group函数用法
1.实现value_counts()的效果
reviews.groupby('points').points.count() #计算不同数据的出现次数
2.计算分组情况下各组price的最小值是多少
reviews.groupby('points').price.min()
3.分组后通过lambda函数计算各组title的第一行数据
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
4.分组后通过lambda函数计算各组中points最高的index对应的行
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
5.功能类似于describe()函数的agg()函数的用法
长度、最小值、最大值、分位数等
reviews.groupby(['country']).price.agg([len, min, max])
6.多重索引以及reset_index()转为面板数据
mutil-index其实就是分组了,通过reset_index()方法可以让分组后(形似数据透视表)的数据转为面板数据
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
mi = countries_reviewed.index
type(mi)
#pandas.core.indexes.multi.MultiIndex
countries_reviewed.reset_index()
sort_values和sort_index排序函数
countries_reviewed.sort_values(by='len')
countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_index()
countries_reviewed.sort_values(by=['country', 'len'])
习题
1.Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.
reviews_written = reviews.groupby('taster_twitter_handle').size()
#或者
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
2.What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).
best_rating_per_price = reviews.groupby('price').apply(lambda df : df.loc[df.points.idxmax()])
best_rating_per_price = pd.Series(best_rating_per_price['points'],index=best_rating_per_price['price']
3.What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).
price_extremes = reviews.groupby('variety').price.agg([min,max])
4.What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
sorted_varieties = price_extremes.sort_values(by=['min','max'],ascending=False)
5.Create a Series whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.
reviewer_mean_ratings = reviews.groupby('taster_name').points.apply(lambda x:sum(x)/len(x))
#或者
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
6.What combination of countries and varieties are most common? Create a Series whose index is a MultiIndexof {country, variety} pairs. For example, a pinot noir produced in the US should map to {“US”, “Pinot Noir”}. Sort the values in the Series in descending order based on wine count.
country_variety_counts = reviews.groupby(['country','variety']).variety.count().sort_values(ascending = False)