pandas course from kaggle (1)

David_blog

Tags: python

Article link: https://blog.csdn.net/m0_53155317/article/details/124918889

pandas courses

Creating data

DataFrame
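A minimal sketch of creating a DataFrame from a dictionary of lists. The column names and values are made-up illustration data, not taken from the course dataset.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values.
fruits = pd.DataFrame({"Apples": [30, 35], "Bananas": [21, 34]})

# The optional index argument supplies row labels instead of the default 0, 1, ...
fruit_sales = pd.DataFrame(
    {"Apples": [30, 35], "Bananas": [21, 34]},
    index=["2017 Sales", "2018 Sales"],
)
print(fruit_sales)
```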

Series
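A Series can be thought of as a single column of a DataFrame. The sketch below builds one from a list; the labels and the name are hypothetical values chosen for illustration.

```python
import pandas as pd

# index supplies the row labels; name plays the role of a column name.
sales = pd.Series(
    [30, 35, 40],
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
    name="Product A",
)
print(sales)
```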

reading data files

reading csv
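A small sketch of pd.read_csv. To keep it self-contained it reads from an in-memory string with made-up rows; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "country,points,price\nItaly,87,15.0\nPortugal,90,\nUS,92,65.0\n"
wines = pd.read_csv(io.StringIO(csv_text))

print(wines.shape)   # (number of rows, number of columns)
print(wines.head())  # first five rows (here, all three)
```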

writing csv

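to_csv is called on a DataFrame instance (here one named df), with the output file path as its argument. The path below is only a placeholder:

df.to_csv('path/to/output.csv')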
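As a hedged illustration of writing CSV, the sketch below calls to_csv on a small made-up DataFrame. It writes to an in-memory buffer rather than a file, which is a choice made here so that nothing touches the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"country": ["Italy", "France"], "points": [87, 90]})

# to_csv accepts a file path or a file-like object.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that the index is written as the first, unnamed column by default.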

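using a column as the index

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()

In this code, index_col=0 tells pandas to use column 0 of the CSV file as the DataFrame's index instead of creating a new default index. The column is not discarded; it becomes the row labels.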
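To see what index_col=0 changes, this sketch reads the same made-up CSV text twice, once with the default settings and once with index_col=0. The CSV content is hypothetical.

```python
import io
import pandas as pd

# The first column has no header: it is a row index saved by an earlier to_csv call.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,90\n"

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())  # contains an extra 'Unnamed: 0' column
print(indexed_read.columns.tolist())  # only the real data columns
```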

indexing, selecting & assigning

selecting

For a DataFrame named reviews, if we want to show the column named country, we can access it as an attribute:

reviews.country

As with a Python dictionary, we can also access a column's values using the indexing operator []:

reviews['country']

to drill down to a single specific value, we need only use the indexing operator [] once more:

reviews['country'][0]
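The access styles described above can be compared on a small stand-in for reviews (the rows here are made up):

```python
import pandas as pd

reviews = pd.DataFrame({"country": ["Italy", "Portugal", "US"], "points": [87, 90, 92]})

by_attribute = reviews.country   # attribute access
by_key = reviews["country"]      # dictionary-style access; also works for names containing spaces

# The second [0] is a lookup on the default integer index of the resulting Series.
first_country = reviews["country"][0]
print(first_country)
```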

indexing

index-based selection:

selecting data based on its numerical position in the data

iloc[row,column]

to select the first row of data in a DataFrame, we may use the following:

reviews.iloc[0]

to get a column with iloc, we can do the following:

reviews.iloc[:,0]

The : operator, which comes from native Python, means everything. When combined with other selectors, however, it can be used to indicate a range of values.

For example, to select the country column from just the first, second, and third row, we would do:

reviews.iloc[:3,0]

to select just the second and third entries, we would do:

reviews.iloc[1:3,0]

it is also possible to pass a list:

reviews.iloc[[0,1,2],0]

to select the last five rows of the dataset, we can use a negative position:

reviews.iloc[-5:]
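The iloc calls above can be tried on a six-row stand-in for reviews. The data is made up; only the selection patterns matter.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "France", "Spain", "Chile"],
    "points": [87, 90, 92, 88, 85, 91],
})

first_row = reviews.iloc[0]              # Series holding the first row
first_column = reviews.iloc[:, 0]        # every row of column 0 (country)
first_three = reviews.iloc[:3, 0]        # positions 0, 1, 2
second_third = reviews.iloc[1:3, 0]      # positions 1, 2
from_list = reviews.iloc[[0, 1, 2], 0]   # explicit list of positions
last_five = reviews.iloc[-5:]            # negative positions count from the end
```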

label-based selection

The loc operator selects by the data's index label, not by its position.

To get the country of the first entry in reviews, we would do the following:

reviews.loc[0,'country']

loc also accepts a list of column labels:

reviews.loc[:,['taster_name', 'taster_twitter_handle', 'points']]

iloc and loc use slightly different slicing schemes: iloc follows the Python convention and excludes the last element of a range, while loc includes it. This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
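The slicing difference can be checked directly. This sketch builds a made-up DataFrame whose index runs from 0 to 1000 and compares the lengths:

```python
import pandas as pd

df = pd.DataFrame({"value": range(1001)})  # index labels 0..1000

by_position = df.iloc[0:1000]     # end position is excluded
by_label = df.loc[0:1000]         # end label is included
by_label_trimmed = df.loc[0:999]  # one lower to match iloc

print(len(by_position), len(by_label), len(by_label_trimmed))
```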

manipulating the index

reviews.set_index("title")

This returns a new DataFrame that uses the title column as its index; reviews itself is left unchanged.
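A short sketch of set_index on made-up data, showing that the chosen column becomes the row labels and that the original DataFrame keeps its columns:

```python
import pandas as pd

reviews = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011"],
    "country": ["Italy", "Portugal"],
})

by_title = reviews.set_index("title")  # new DataFrame; reviews is not modified
print(by_title.loc["Wine B 2011", "country"])
```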

Conditional selection

checking if it is Italy or not:

reviews.country == 'Italy'

This result can then be used inside of loc to select the relevant data:

reviews.loc[reviews.country == 'Italy']

We can use the ampersand (&) to bring the two questions together:

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

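We can also replace & with |, which means or. You can't use the Python keywords and and or here, because they do not operate element-wise on a Series.

If you need both | and & in one filter, note that & binds more tightly than |. One way to keep the grouping explicit is to filter in two steps: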

top_oceania_wines = reviews.loc[(reviews.country=='Australia') | (reviews.country=='New Zealand')].loc[(reviews.points>=95)&(reviews.points<=100)]
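The effect of operator precedence can be seen on a small made-up table. The sketch compares an ungrouped expression, an explicitly parenthesized one, and the two-step filter; the variable names are illustrative only.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand", "Italy"],
    "points": [96, 88, 95, 90, 97],
})

is_au = reviews.country == "Australia"
is_nz = reviews.country == "New Zealand"
top = (reviews.points >= 95) & (reviews.points <= 100)

ungrouped = reviews.loc[is_au | is_nz & top]    # parsed as is_au | (is_nz & top)
grouped = reviews.loc[(is_au | is_nz) & top]    # the intended filter
chained = reviews.loc[is_au | is_nz].loc[top]   # two-step version, same rows as grouped
```

The ungrouped version keeps the 88-point Australian wine, which is usually not what is wanted.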

isin lets you select data whose value “is in” a list of values.

reviews.loc[reviews.country.isin(['Italy', 'France'])]

isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN).

For example, to filter out wines lacking a price tag in the dataset:

reviews.loc[reviews.price.notnull()]
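A compact sketch of isin, notnull and isnull on made-up rows, two of which have no price:

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "price": [15.0, float("nan"), 65.0, float("nan")],
})

italy_or_france = reviews.loc[reviews.country.isin(["Italy", "France"])]
priced = reviews.loc[reviews.price.notnull()]    # rows that have a price
unpriced = reviews.loc[reviews.price.isnull()]   # rows whose price is NaN
```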

assign

Assigning here means adding a new column to a DataFrame, or overwriting an existing one.

Assigning data to a DataFrame is easy. For example, you can assign a constant value to every row:

reviews['critic'] = 'everyone'
reviews['critic']
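Besides a constant, pandas also accepts an iterable with one value per row. The sketch below shows both on a made-up DataFrame; the index_backwards column name is only an illustration.

```python
import pandas as pd

reviews = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

reviews["critic"] = "everyone"                            # constant, broadcast to every row
reviews["index_backwards"] = range(len(reviews), 0, -1)   # iterable, one value per row
print(reviews)
```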

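A note on combining | and & in one expression: & binds more tightly than |, so the following is read as Australia OR (New Zealand AND points between 95 and 100), and it returns every Australian wine regardless of its score:

reviews.loc[(reviews.country=='Australia') | (reviews.country=='New Zealand') & (reviews.points>=95)&(reviews.points<=100)]

Filtering in two steps applies the points condition to both countries:

reviews.loc[(reviews.country=='Australia') | (reviews.country=='New Zealand')].loc[(reviews.points>=95)&(reviews.points<=100)]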

Summary Functions and Maps

Extract insights from your data.

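To find the index label of the maximum value, use idxmax(). Here it locates the wine with the best points-to-price ratio: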

bargain_index = (reviews.points / reviews.price).idxmax()
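A worked example of idxmax on three made-up wines. The ratio is computed element-wise, and idxmax returns the index label of the largest ratio, which can then be passed to loc.

```python
import pandas as pd

reviews = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 86, 88],
    "price": [45.0, 4.0, 22.0],
})

ratio = reviews.points / reviews.price   # 2.0, 21.5, 4.0
bargain_index = ratio.idxmax()           # index label of the largest ratio
bargain_wine = reviews.loc[bargain_index, "title"]
print(bargain_wine)
```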

Summary functions

Pandas provides many simple “summary functions” (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

reviews.points.describe()

To see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function:

reviews.points.mean()

To see a list of unique values we can use the unique() function:

reviews.taster_name.unique()

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:

reviews.taster_name.value_counts()
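The four summary functions above can be tried together on a tiny stand-in for reviews. The taster names are made-up placeholders.

```python
import pandas as pd

reviews = pd.DataFrame({
    "points": [87, 90, 92, 87],
    "taster_name": ["Kerin", "Roger", "Kerin", "Paul"],
})

summary = reviews.points.describe()          # count, mean, std, min, quartiles, max
mean_points = reviews.points.mean()
tasters = reviews.taster_name.unique()       # distinct values in order of first appearance
counts = reviews.taster_name.value_counts()  # how often each value occurs
print(counts)
```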

Maps

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. It returns a new Series where all the values have been transformed by your function. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
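The same remeaning can be checked on three made-up scores, where the arithmetic is easy to follow by hand:

```python
import pandas as pd

points = pd.Series([87, 90, 93])
points_mean = points.mean()                        # 90.0
centered = points.map(lambda p: p - points_mean)   # new Series; points is unchanged
print(centered.tolist())
```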

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don’t modify the original data they’re called on. If we look at the first row of reviews, we can see that it still has its original points value.
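A sketch that runs the row-wise apply() on a two-row stand-in for reviews and then checks that the original points are untouched. The data is made up.

```python
import pandas as pd

reviews = pd.DataFrame({"country": ["Italy", "US"], "points": [88, 92]})
review_points_mean = reviews.points.mean()  # 90.0

def remean_points(row):
    # row is a Series holding one row of the DataFrame
    row.points = row.points - review_points_mean
    return row

remeaned = reviews.apply(remean_points, axis="columns")

print(remeaned.points.tolist())  # transformed copy
print(reviews.points.tolist())   # original values
```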

Pandas will also understand what to do if we apply operators such as + or - between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

reviews.country + " - " + reviews.region_1
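A small check of element-wise string concatenation, using two made-up rows:

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "US"],
    "region_1": ["Etna", "Napa Valley"],
})

# The scalar " - " is broadcast; the two Series are combined row by row.
combined = reviews.country + " - " + reviews.region_1
print(combined.tolist())
```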

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.

tropical=reviews.description.map(lambda d: "tropical" in d).sum() # number of descriptions that contain "tropical" (each True counts as 1 when summed)
fruity=reviews.description.map(lambda d: "fruity" in d).sum()
descriptor_counts = pd.Series([tropical, fruity], index=['tropical', 'fruity'])
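The counting approach can be verified on a few made-up descriptions. Note that it counts descriptions containing the word, not the total number of occurrences, and that "fruit" alone does not match "fruity".

```python
import pandas as pd

reviews = pd.DataFrame({"description": [
    "Aromas of tropical fruit and broom.",
    "Ripe and fruity, with a smooth finish.",
    "Soft, fruity and tropical on the palate.",
    "Bright and fruity.",
]})

tropical = reviews.description.map(lambda d: "tropical" in d).sum()
fruity = reviews.description.map(lambda d: "fruity" in d).sum()
descriptor_counts = pd.Series([tropical, fruity], index=["tropical", "fruity"])
print(descriptor_counts)
```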

We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

star_ratings = reviews.points.map(lambda x: 1 if x < 85 else 2 if x < 95 else 3)
# the lambda chains two conditional expressions: 1 star below 85, 2 stars below 95, otherwise 3
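For readers who find the chained lambda hard to parse, here is an equivalent sketch with a named function (stars is a hypothetical helper name), checked at the boundary scores:

```python
import pandas as pd

def stars(points):
    """Translate a score into 1, 2 or 3 stars."""
    if points >= 95:
        return 3
    if points >= 85:
        return 2
    return 1

points = pd.Series([80, 85, 94, 95, 100])
star_ratings = points.map(stars)
from_lambda = points.map(lambda x: 1 if x < 85 else 2 if x < 95 else 3)
print(star_ratings.tolist())
```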

Grouping and sorting

Scale up your level of insight. The more complex the dataset, the more this matters.

groupwise analysis

One function we’ve been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:

reviews.groupby('points').points.count()

groupby() created a group of reviews which allotted the same point values to the given wines. Then, for each of these groups, we grabbed the points column and counted how many times it appeared. value_counts() is just a shortcut to this groupby() operation.

We can use any of the summary functions we’ve used before with this data. For example, to get the cheapest wine in each point value category, we can do the following:

reviews.groupby('points').price.min()
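A sketch on five made-up rows that compares the groupby count with value_counts (sorted by index so the two line up) and then takes the minimum price per group:

```python
import pandas as pd

reviews = pd.DataFrame({
    "points": [87, 87, 90, 90, 90],
    "price": [15.0, 12.0, 30.0, 25.0, 40.0],
})

counts_by_group = reviews.groupby("points").points.count()
counts_by_shortcut = reviews.points.value_counts().sort_index()
cheapest = reviews.groupby("points").price.min()
print(cheapest)
```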

You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the apply() method, and we can then manipulate the data in any way we see fit. For example, here’s one way of selecting the name of the first wine reviewed from each winery in the dataset:

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

You can also group by more than one column. For example, here’s how we would pick out the best wine by country and province:

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
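Both apply() patterns can be tried on a small made-up table. For brevity this sketch groups by a single column, winery. Some recent pandas versions may emit a deprecation warning about grouping columns when apply() is used this way.

```python
import pandas as pd

reviews = pd.DataFrame({
    "winery": ["W1", "W1", "W2"],
    "title": ["W1 Red 2012", "W1 White 2014", "W2 Rose 2015"],
    "points": [88, 92, 90],
})

# Each group arrives in the lambda as a DataFrame slice.
first_titles = reviews.groupby("winery").apply(lambda df: df.title.iloc[0])

# Pick the full row with the highest points within each group.
best = reviews.groupby("winery").apply(lambda df: df.loc[df.points.idxmax()])
print(best)
```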

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

reviews.groupby(['country']).price.agg([len, min, max])
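A variant of the same summary on made-up data, passing function names as strings instead of the builtins, since newer pandas releases may warn when the builtin min and max are passed. Note one difference: "count" ignores missing prices, whereas len counts every row.

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "price": [10.0, 20.0, 15.0, float("nan"), 65.0],
})

summary = reviews.groupby("country").price.agg(["count", "min", "max"])
print(summary)
```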

Multi-Indexes

A multi-index differs from a regular index in that it has multiple levels. For example:

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

In general, the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

countries_reviewed.reset_index()
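A sketch on made-up rows showing the two-level index produced by grouping on two columns, and the flat table that reset_index() returns:

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "province": ["Tuscany", "Sicily", "California", "California", "Oregon"],
    "description": ["a", "b", "c", "d", "e"],
})

countries_reviewed = reviews.groupby(["country", "province"]).description.agg([len])
flat = countries_reviewed.reset_index()  # index levels become ordinary columns

print(countries_reviewed.index.nlevels)
print(flat)
```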

Looking again at countries_reviewed we can see that grouping returns data in index order, not in value order.

sorting

To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this.

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

sort_values() defaults to an ascending sort, where the lowest values go first. However, most of the time we want a descending sort, where the higher numbers go first. That goes thusly:

countries_reviewed.sort_values(by='len', ascending=False)

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:

countries_reviewed.sort_index()

Finally, know that you can sort by more than one column at a time:

countries_reviewed.sort_values(by=['country', 'len'])
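The sorting calls in this section can be compared side by side on a four-row table. The counts are made up.

```python
import pandas as pd

counts = pd.DataFrame({
    "country": ["US", "Italy", "US", "Italy"],
    "province": ["Oregon", "Tuscany", "California", "Sicily"],
    "len": [5, 7, 36, 3],
})

ascending = counts.sort_values(by="len")                     # lowest first (default)
descending = counts.sort_values(by="len", ascending=False)   # highest first
by_index = descending.sort_index()                           # back to index order
by_two = counts.sort_values(by=["country", "len"])           # country first, then len
```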

When you want to group first and then sort by group size, use .size(), which returns a Series of group sizes that can be sorted directly. For example:

country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values()
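A quick check of the size-then-sort pattern on made-up rows, here sorted in descending order so the most common pairs come first:

```python
import pandas as pd

reviews = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy", "France"],
    "variety": ["Pinot Noir", "Pinot Noir", "Chardonnay",
                "Sangiovese", "Sangiovese", "Chardonnay"],
})

# size() counts the rows in each (country, variety) group.
country_variety_counts = (
    reviews.groupby(["country", "variety"]).size().sort_values(ascending=False)
)
print(country_variety_counts)
```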
