pandas Data Processing


Pandas


*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning, and transformation operations that are critical when working with data in Python.

*pandas* builds upon *numpy* and *scipy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and to data ingestion, this notebook covers the following key features of *pandas*:

* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion, and aggregation of data
* Merging multiple datasets using DataFrames
* Working with timestamps and time-series data

**Additional Recommended Resources:**

* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Let's get started with our first *pandas* notebook!

Import Libraries
import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures: *Series* and *DataFrames*.

pandas Series

A *pandas Series* is a one-dimensional labeled array.
ser = pd.Series(data = [100, 'foo', 300, 'bar', 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser.index
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
ser.loc[['nancy','bob']]
nancy    300
bob      foo
dtype: object
ser.iloc[[4, 3, 1]]
eric    500
dan     bar
bob     foo
dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
ser * 2
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object
ser[['nancy', 'eric']] ** 2
nancy     90000
eric     250000
dtype: object
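
The Series examples above mix two ways of selecting elements. The following minimal sketch, which reuses the same sample values, contrasts label-based access (`.loc`) with position-based access (`.iloc`). The variable names `by_label`, `by_position`, and `subset` are introduced here for illustration only.

```python
import pandas as pd

ser = pd.Series([100, 'foo', 300, 'bar', 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

# .loc selects by index label
by_label = ser.loc['nancy']

# .iloc selects by 0-based integer position
by_position = ser.iloc[2]

# Both expressions point at the same element in this Series
print(by_label, by_position)

# A list of positions returns a new Series in the requested order
subset = ser.iloc[[4, 3, 1]]
print(list(subset.index))
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether an integer is meant as a label or as a position.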

pandas DataFrame

A *pandas DataFrame* is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
df.index
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
df.columns
Index(['one', 'two'], dtype='object')
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
         one     two
dancy    NaN  4444.0
ball   200.0   222.0
apple  100.0   111.0
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
          two five
dancy  4444.0  NaN
ball    222.0  NaN
apple   111.0  NaN

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
   alex  alice  dora  ema  joe
0   1.0    NaN   NaN  NaN  2.0
1   NaN   20.0  10.0  5.0  NaN
pd.DataFrame(data, index=['orange', 'red'])
        alex  alice  dora  ema  joe
orange   1.0    NaN   NaN  NaN  2.0
red      NaN   20.0  10.0  5.0  NaN
pd.DataFrame(data, columns=['joe', 'dora','alice'])
   joe  dora  alice
0  2.0   NaN    NaN
1  NaN  10.0   20.0

Basic DataFrame operations

df
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
df['one']
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
          one     two    three
apple   100.0   111.0  11100.0
ball    200.0   222.0  44400.0
cerill    NaN   333.0      NaN
clock   300.0     NaN      NaN
dancy     NaN  4444.0      NaN
df['flag'] = df['one'] > 250
df
          one     two    three   flag
apple   100.0   111.0  11100.0  False
ball    200.0   222.0  44400.0  False
cerill    NaN   333.0      NaN  False
clock   300.0     NaN      NaN   True
dancy     NaN  4444.0      NaN  False
three = df.pop('three')
three
apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64
df
          one     two   flag
apple   100.0   111.0  False
ball    200.0   222.0  False
cerill    NaN   333.0  False
clock   300.0     NaN   True
dancy     NaN  4444.0  False
del df['two']
df
          one   flag
apple   100.0  False
ball    200.0  False
cerill    NaN  False
clock   300.0   True
dancy     NaN  False
df.insert(2, 'copy_of_one', df['one'])
df
          one   flag  copy_of_one
apple   100.0  False        100.0
ball    200.0  False        200.0
cerill    NaN  False          NaN
clock   300.0   True        300.0
dancy     NaN  False          NaN
df['one_upper_half'] = df['one'][:2]
df
          one   flag  copy_of_one  one_upper_half
apple   100.0  False        100.0           100.0
ball    200.0  False        200.0           200.0
cerill    NaN  False          NaN             NaN
clock   300.0   True        300.0             NaN
dancy     NaN  False          NaN             NaN

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

## Download the Dataset

Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform.

Here are the links to the data source and location:

* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder.

Let us look at the files in this dataset using the UNIX command ls.
# Note: Adjust the name of the folder to match your local directory
# These shell commands are for Linux (or another UNIX-like system)
!ls ./movielens
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • ratings.csv : userId, movieId, rating, timestamp
  • tags.csv : userId, movieId, tag, timestamp
  • movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.
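
For readers who have not downloaded the dataset yet, here is a minimal sketch of what `read_csv` does, using a small in-memory sample instead of a file. The three rows follow the `movies.csv` layout described above. The `sample_csv` and `sample_movies` names are illustrative and not part of the course notebook.

```python
import io
import pandas as pd

# A three-row in-memory stand-in for movies.csv (same column layout)
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# read_csv accepts a path or any file-like object; the first line becomes the header
sample_movies = pd.read_csv(sample_csv)

print(sample_movies.shape)   # (rows, columns)
print(sample_movies.dtypes)  # column types inferred from the data
```

The same call works with a file path such as `'./movielens/movies.csv'` once the data is in place.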

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
   userId  movieId            tag   timestamp
0      18     4141    Mark Waters  1240597180
1      65      208      dark hero  1368150078
2      65      353      dark hero  1368150079
3      65      521  noir thriller  1368149983
4      65      592      dark hero  1368150078
# The timestamp column holds integer epoch seconds, so it is read as-is and parsed later with pd.to_datetime
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()
   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

# Extract the 0th row: notice that it is in fact a Series

row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId              18
movieId           4141
tag        Mark Waters
Name: 0, dtype: object
row_0.index
Index(['userId', 'movieId', 'tag'], dtype='object')
row_0['userId']
18
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
'first_row'

DataFrames

tags.head()
   userId  movieId            tag
0      18     4141    Mark Waters
1      65      208      dark hero
2      65      353      dark hero
3      65      521  noir thriller
4      65      592      dark hero
tags.index
RangeIndex(start=0, stop=465564, step=1)
tags.columns
Index(['userId', 'movieId', 'tag'], dtype='object')
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]
      userId  movieId                tag
0         18     4141        Mark Waters
11        65     1783      noir thriller
2000     910    68554  conspiracy theory

Descriptive Statistics

Let's look at how the ratings are distributed!

ratings['rating'].describe()
count    2.000026e+07
mean     3.525529e+00
std      1.051989e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64
ratings.describe()
             userId       movieId        rating
count  2.000026e+07  2.000026e+07  2.000026e+07
mean   6.904587e+04  9.041567e+03  3.525529e+00
std    4.003863e+04  1.978948e+04  1.051989e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    3.439500e+04  9.020000e+02  3.000000e+00
50%    6.914100e+04  2.167000e+03  3.500000e+00
75%    1.036370e+05  4.770000e+03  4.000000e+00
max    1.384930e+05  1.312620e+05  5.000000e+00
ratings['rating'].mean()
3.5255285642993797
ratings.mean()
userId     69045.872583
movieId     9041.567330
rating         3.525529
dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.051988919275684
ratings['rating'].mode()
0    4.0
dtype: float64
ratings.corr()
           userId   movieId    rating
userId   1.000000 -0.000850  0.001175
movieId -0.000850  1.000000  0.002606
rating   0.001175  0.002606  1.000000
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0           False
1           False
2           False
3           False
4           False
            ...
20000258    False
20000259    False
20000260    False
20000261    False
20000262    False
Name: rating, dtype: bool
False
filter_2 = ratings['rating'] > 0
filter_2.all()
True
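
The two filters above rely on `any()` and `all()` to reduce a boolean Series to a single answer. A minimal sketch on a handful of made-up ratings (the values and the names `sample_ratings`, `above_max`, `positive`, and `n_high` are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up ratings, used only to illustrate the reduction methods
sample_ratings = pd.Series([3.5, 4.0, 0.5, 5.0], name='rating')

# Element-wise comparisons return a boolean Series of the same length
above_max = sample_ratings > 5
positive = sample_ratings > 0

# any(): is at least one value True?  all(): is every value True?
print(above_max.any())
print(positive.all())

# Summing a boolean Series counts the True entries
n_high = int((sample_ratings >= 4.0).sum())
print(n_high)
```

This is a quick way to sanity-check value ranges without printing millions of rows.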

Data Cleaning: Handling Missing Data

movies.shape
(27278, 3)
# Does any column contain NULL values?

movies.isnull().any()
movieId    False
title      False
genres     False
dtype: bool

That's nice! No NULL values!

ratings.shape
(20000263, 3)
# Does any column contain NULL values?

ratings.isnull().any()
userId     False
movieId    False
rating     False
dtype: bool

That's nice! No NULL values!

tags.shape
(465564, 3)
# Does any column contain NULL values?

tags.isnull().any()
userId     False
movieId    False
tag         True
dtype: bool

We have some tags which are NULL.

tags = tags.dropna()
# Check again: does any column contain NULL values?

tags.isnull().any()
userId     False
movieId    False
tag        False
dtype: bool
tags.shape
(465548, 3)

That's nice! No NULL values! Notice that the number of rows has decreased.
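
`isnull().any()` only says whether a column has missing values. A small sketch of how one might also count them and drop rows based on a specific column; the missing tag in this sample is made up for illustration, and `sample_tags`, `missing_per_column`, and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Small stand-in for the tags table; the missing tag is invented for this sketch
sample_tags = pd.DataFrame({
    'userId': [18, 65, 65],
    'movieId': [4141, 208, 353],
    'tag': ['Mark Waters', np.nan, 'dark hero'],
})

# Count missing values per column instead of only flagging them
missing_per_column = sample_tags.isnull().sum()
print(missing_per_column)

# Drop only the rows where 'tag' is missing
cleaned = sample_tags.dropna(subset=['tag'])
print(cleaned.shape)
```

Passing `subset=` keeps rows that are missing values in other, less important columns.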

Data Visualization

%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
0      Mark Waters
1        dark hero
2        dark hero
3    noir thriller
4        dark hero
Name: tag, dtype: object
movies[['title','genres']].head()
   title                               genres
0  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1  Jumanji (1995)                      Adventure|Children|Fantasy
2  Grumpier Old Men (1995)             Comedy|Romance
3  Waiting to Exhale (1995)            Comedy|Drama|Romance
4  Father of the Bride Part II (1995)  Comedy
ratings[-10:]
          userId  movieId  rating
20000253  138493    60816     4.5
20000254  138493    61160     4.0
20000255  138493    65682     4.5
20000256  138493    66762     4.5
20000257  138493    68319     4.5
20000258  138493    68954     4.5
20000259  138493    69526     4.5
20000260  138493    69644     3.0
20000261  138493    70286     5.0
20000262  138493    71619     2.5
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
Venice Film Festival Winner 2002        1
based on the life of Buford Pusser      1
but no way as good as the other two     1
see                                     1
Jeffrey Kimball                         1
tolerable                               1
Fake History - Don’t Believe a Thing    1
Boy                                     1
urlaub                                  1
conservative                            1
Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]
     userId  movieId  rating
68        1     2021     4.0
69        1     2100     4.0
70        1     2118     4.0
71        1     2138     4.0
72        1     2140     4.0
73        1     2143     4.0
74        1     2173     4.0
75        1     2174     4.0
76        1     2193     4.0
79        1     2288     4.0
80        1     2291     4.0
81        1     2542     4.0
82        1     2628     4.0
90        1     2762     4.0
92        1     2872     4.0
94        1     2944     4.0
96        1     2959     4.0
97        1     2968     4.0
101       1     3081     4.0
102       1     3153     4.0
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]
     movieId  title                                       genres
310      313  Swan Princess, The (1994)                   Animation|Children
360      364  Lion King, The (1994)                       Adventure|Animation|Children|Drama|Musical|IMAX
388      392  Secret Adventures of Tom Thumb, The (1993)  Adventure|Animation
547      551  Nightmare Before Christmas, The (1993)      Animation|Children|Fantasy|Musical
553      558  Pagemaster, The (1994)                      Action|Adventure|Animation|Children|Fantasy
582      588  Aladdin (1992)                              Adventure|Animation|Children|Comedy|Musical
588      594  Snow White and the Seven Dwarfs (1937)      Animation|Children|Drama|Fantasy|Musical
589      595  Beauty and the Beast (1991)                 Animation|Children|Fantasy|Musical|Romance|IMAX
590      596  Pinocchio (1940)                            Animation|Children|Fantasy|Musical
604      610  Heavy Metal (1981)                          Action|Adventure|Animation|Horror|Sci-Fi
movies[is_animation].head(15)
     movieId  title                                       genres
0          1  Toy Story (1995)                            Adventure|Animation|Children|Comedy|Fantasy
12        13  Balto (1995)                                Adventure|Animation|Children
47        48  Pocahontas (1995)                           Animation|Children|Drama|Musical|Romance
236      239  Goofy Movie, A (1995)                       Animation|Children|Comedy|Romance
241      244  Gumby: The Movie (1995)                     Animation|Children
310      313  Swan Princess, The (1994)                   Animation|Children
360      364  Lion King, The (1994)                       Adventure|Animation|Children|Drama|Musical|IMAX
388      392  Secret Adventures of Tom Thumb, The (1993)  Adventure|Animation
547      551  Nightmare Before Christmas, The (1993)      Animation|Children|Fantasy|Musical
553      558  Pagemaster, The (1994)                      Action|Adventure|Animation|Children|Fantasy
582      588  Aladdin (1992)                              Adventure|Animation|Children|Comedy|Musical
588      594  Snow White and the Seven Dwarfs (1937)      Animation|Children|Drama|Fantasy|Musical
589      595  Beauty and the Beast (1991)                 Animation|Children|Fantasy|Musical|Romance|IMAX
590      596  Pinocchio (1940)                            Animation|Children|Fantasy|Musical
604      610  Heavy Metal (1981)                          Action|Adventure|Animation|Horror|Sci-Fi

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
        movieId
rating
0.5      239125
1.0      680732
1.5      279252
2.0     1430997
2.5      883398
3.0     4291193
3.5     2200156
4.0     5561926
4.5     1534824
5.0     2898660
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
           rating
movieId
1        3.921240
2        3.211977
3        3.151040
4        2.861393
5        3.064592
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
         rating
movieId
1         49695
2         22243
3         12735
4          2756
5         12161
# Pick a column as the grouping key, then aggregate per group, much like GROUP BY in SQL
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
         rating
movieId
131254        1
131256        1
131258        1
131260        1
131262        1
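
The mean and the count per movie were computed above in two separate passes. As a minimal sketch, both can be obtained in one `groupby` call with `agg`. The five ratings below are made up for illustration, and the names `sample` and `summary` are assumptions.

```python
import pandas as pd

# Made-up ratings for two movies, used only to illustrate agg()
sample = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 3.0, 5.0, 4.0, 3.0],
})

# Group by movie, then compute several aggregates of 'rating' at once
summary = sample.groupby('movieId')['rating'].agg(['mean', 'count'])
print(summary)
```

Having the count next to the mean makes it easy to spot averages that rest on very few ratings, such as the movies with a single rating in the tail above.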

Merge Dataframes

tags.head()
   userId  movieId            tag
0      18     4141    Mark Waters
1      65      208      dark hero
2      65      353      dark hero
3      65      521  noir thriller
4      65      592      dark hero
movies.head()
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy
t = movies.merge(tags, on='movieId', how='inner')
t.head()
# Run ?movies.merge in the notebook for the detailed documentation
   movieId  title             genres                                       userId  tag
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    1644  Watched
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    1741  computer animation
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    1741  Disney animated feature
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    1741  Pixar animation
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    1741  Téa Leoni does not star in this movie

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
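
The merge above uses `how='inner'`, which keeps only movies that have at least one tag. A minimal sketch of how the result differs from `how='left'`, on two tiny tables; the tag `'sequel'` and the names `movies_small`, `tags_small`, `inner`, and `left` are made up for illustration.

```python
import pandas as pd

movies_small = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})
# Movie 2 deliberately has no tag; the 'sequel' tag is invented for this sketch
tags_small = pd.DataFrame({
    'movieId': [1, 1, 3],
    'tag': ['Pixar animation', 'computer animation', 'sequel'],
})

# inner: only movieIds present in both tables
inner = movies_small.merge(tags_small, on='movieId', how='inner')

# left: every movie is kept; movies without tags get NaN in 'tag'
left = movies_small.merge(tags_small, on='movieId', how='left')

print(len(inner), len(left))
```

A left join is the safer choice when untagged movies should still appear in the result.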


Combine aggregation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean().rename(columns={'rating':'avg_rating'})  # passing columns= renames the column
del avg_ratings['userId']
avg_ratings.head()
   movieId  avg_rating
0        1    3.921240
1        2    3.211977
2        3    3.151040
3        4    2.861393
4        5    3.064592
# The column is already named 'avg_rating'. Calling rename() without columns= targets row labels, so no further rename is needed.
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
   movieId  title                               genres                                       avg_rating
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy  3.921240
1        2  Jumanji (1995)                      Adventure|Children|Fantasy                   3.211977
2        3  Grumpier Old Men (1995)             Comedy|Romance                               3.151040
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance                         2.861393
4        5  Father of the Bride Part II (1995)  Comedy                                       3.064592
is_highly_rated = box_office['avg_rating'] >= 4.0

box_office[is_highly_rated][-5:]
       movieId  title                                            genres                    avg_rating
26737   131250  No More School (2000)                            Comedy                    4.0
26738   131252  Forklift Driver Klaus: The First Day on the Jo…  Comedy|Horror             4.0
26739   131254  Kein Bund für’s Leben (2007)                     Comedy                    4.0
26740   131256  Feuer, Eis & Dosenbier (2002)                    Comedy                    4.0
26743   131262  Innocence (2014)                                 Adventure|Fantasy|Horror  4.0
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]
   movieId  title                               genres                                       avg_rating
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy  3.921240
2        3  Grumpier Old Men (1995)             Comedy|Romance                               3.151040
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance                         2.861393
4        5  Father of the Bride Part II (1995)  Comedy                                       3.064592
6        7  Sabrina (1995)                      Comedy|Romance                               3.366484
box_office[is_comedy & is_highly_rated][-5:]
       movieId  title                                            genres                                       avg_rating
26736   131248  Brother Bear 2 (2006)                            Adventure|Animation|Children|Comedy|Fantasy  4.0
26737   131250  No More School (2000)                            Comedy                                       4.0
26738   131252  Forklift Driver Klaus: The First Day on the Jo…  Comedy|Horror                                4.0
26739   131254  Kein Bund für’s Leben (2007)                     Comedy                                       4.0
26740   131256  Feuer, Eis & Dosenbier (2002)                    Comedy                                       4.0
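
The last query combines two boolean masks with `&`. A minimal sketch of the three element-wise operators (`&`, `|`, `~`) on a made-up table; the titles 'A', 'B', 'C', their ratings, and the `sample` name are placeholders for illustration only.

```python
import pandas as pd

# Placeholder movies; titles and ratings are invented for this sketch
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ['Comedy|Romance', 'Drama', 'Comedy'],
    'avg_rating': [4.2, 4.5, 3.1],
})

is_comedy = sample['genres'].str.contains('Comedy')
is_highly_rated = sample['avg_rating'] >= 4.0

both = sample[is_comedy & is_highly_rated]    # AND: comedy and highly rated
either = sample[is_comedy | is_highly_rated]  # OR: at least one condition holds
not_comedy = sample[~is_comedy]               # NOT: invert the mask

print(both['title'].tolist())
```

When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, for example `sample[(sample['avg_rating'] >= 4.0) & is_comedy]`, because `&` binds more tightly than `>=`.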

Vectorized String Operations

movies.head()
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy


Split ‘genres’ into multiple columns

movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
   0          1          2         3       4        5     6     7     8     9
0  Adventure  Animation  Children  Comedy  Fantasy  None  None  None  None  None
1  Adventure  Children   Fantasy   None    None     None  None  None  None  None
2  Comedy     Romance    None      None    None     None  None  None  None  None
3  Comedy     Drama      Romance   None    None     None  None  None  None  None
4  Comedy     None       None      None    None     None  None  None  None  None
5  Action     Crime      Thriller  None    None     None  None  None  None  None
6  Comedy     Romance    None      None    None     None  None  None  None  None
7  Adventure  Children   None      None    None     None  None  None  None  None
8  Action     None       None      None    None     None  None  None  None  None
9  Action     Adventure  Thriller  None    None     None  None  None  None  None


Add a new column for comedy genre flag

movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
   0          1          2         3       4        5     6     7     8     9     isComedy
0  Adventure  Animation  Children  Comedy  Fantasy  None  None  None  None  None  True
1  Adventure  Children   Fantasy   None    None     None  None  None  None  None  False
2  Comedy     Romance    None      None    None     None  None  None  None  None  True
3  Comedy     Drama      Romance   None    None     None  None  None  None  None  True
4  Comedy     None       None      None    None     None  None  None  None  None  True
5  Action     Crime      Thriller  None    None     None  None  None  None  None  False
6  Comedy     Romance    None      None    None     None  None  None  None  None  True
7  Adventure  Children   None      None    None     None  None  None  None  None  False
8  Action     None       None      None    None     None  None  None  None  None  False
9  Action     Adventure  Thriller  None    None     None  None  None  None  None  False


Extract year from title e.g. (1995)

movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)  # str.extract takes a regular expression; the raw string keeps the backslashes intact
movies.tail()
       movieId  title                          genres                    year
27273   131254  Kein Bund für’s Leben (2007)   Comedy                    2007
27274   131256  Feuer, Eis & Dosenbier (2002)  Comedy                    2002
27275   131258  The Pirates (2014)             Adventure                 2014
27276   131260  Rentun Ruusu (2001)            (no genres listed)        2001
27277   131262  Innocence (2014)               Adventure|Fantasy|Horror  2014


More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
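
The pattern used above captures whatever sits inside the last pair of parentheses. As a hedged alternative sketch, a stricter pattern that only accepts four digits avoids picking up non-year text, and `pd.to_numeric` turns the result into numbers. The title `'Untitled'` is a made-up example of a title without a year, and `titles`, `year`, and `year_num` are illustrative names.

```python
import pandas as pd

# Two real titles from the tables above plus one invented title without a year
titles = pd.Series(['Toy Story (1995)', 'Pagemaster, The (1994)', 'Untitled'])

# Capture exactly four digits inside parentheses; no match yields a missing value
year = titles.str.extract(r'\((\d{4})\)', expand=False)

# Extracted values are strings; convert them to numbers for sorting or arithmetic
year_num = pd.to_numeric(year)

print(year.tolist())
```

Numeric years sort and plot correctly, whereas string years are compared character by character.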

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the *tags.csv* dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
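
To make the epoch-seconds definition concrete, here is a minimal sketch that converts the first two timestamps from tags.csv by hand. The names `epoch_seconds` and `parsed` are illustrative.

```python
import pandas as pd

# The first two timestamp values shown in tags.csv
epoch_seconds = pd.Series([1240597180, 1368150078])

# unit='s' tells pandas the integers count seconds since 1970-01-01 00:00:00 UTC
parsed = pd.to_datetime(epoch_seconds, unit='s')
print(parsed)

# Once parsed, date components are available through the .dt accessor
print(parsed.dt.year.tolist())
```

Without `unit='s'`, pandas would interpret the integers as nanoseconds and place every value within seconds of January 1, 1970.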

tags.head(5)
   userId  movieId            tag   timestamp
0      18     4141    Mark Waters  1240597180
1      65      208      dark hero  1368150078
2      65      353      dark hero  1368150079
3      65      521  noir thriller  1368149983
4      65      592      dark hero  1368150078
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')  # parse epoch seconds into datetimes

Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the hardware


tags['parsed_time'].dtype
dtype('<M8[ns]')
tags.head(2)
   userId  movieId          tag   timestamp         parsed_time
0      18     4141  Mark Waters  1240597180 2009-04-24 18:19:40
1      65      208    dark hero  1368150078 2013-05-10 01:41:18

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape
((465564, 5), (12130, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]
        userId  movieId              tag   timestamp         parsed_time
333932  100371     2788     monty python  1135429210 2005-12-24 13:00:10
333927  100371     1732    coen brothers  1135429236 2005-12-24 13:00:36
333924  100371     1206  stanley kubrick  1135429248 2005-12-24 13:00:48
333923  100371     1193   jack nicholson  1135429371 2005-12-24 13:02:51
333939  100371     5004    peter sellers  1135429399 2005-12-24 13:03:19
333922  100371       47   morgan freeman  1135429412 2005-12-24 13:03:32
333921  100371       47        brad pitt  1135429412 2005-12-24 13:03:32
333936  100371     4011        brad pitt  1135429431 2005-12-24 13:03:51
333937  100371     4011      guy ritchie  1135429431 2005-12-24 13:03:51
333920  100371       32     bruce willis  1135429442 2005-12-24 13:04:02

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
       movieId  rating
26739   131254     4.0
26740   131256     4.0
26741   131258     2.5
26742   131260     3.0
26743   131262     4.0
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined[['movieId', 'rating']].corr()  # restrict to the numeric columns; title, genres, and year are strings
          movieId    rating
movieId  1.000000 -0.090369
rating  -0.090369  1.000000
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()  # as_index=False keeps 'year' as a regular column; otherwise it becomes the index
yearly_average[:10]
   year    rating
0  1891  3.000000
1  1893  3.375000
2  1894  3.071429
3  1895  3.125000
4  1896  3.183036
5  1898  3.850000
6  1899  3.625000
7  1900  3.166667
8  1901  5.000000
9  1902  3.738189
import matplotlib.pyplot as plt
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
plt.show()

Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

Note: these are notes from the UCSD course Python for Data Science on edX.
