pandas 数据处理

最新推荐文章于 2023-02-14 21:11:11 发布

孤鸿子_

最新推荐文章于 2023-02-14 21:11:11 发布

阅读量3.6k

点赞数

CC 4.0 BY-SA版权

分类专栏： programing-language 机器学习文章标签： python 数据处理 pandas

本文链接：https://blog.csdn.net/Dylan_Frank/article/details/75318739

programing-language 同时被 2 个专栏收录

29 篇文章

订阅专栏

机器学习

17 篇文章

订阅专栏

Pandas

*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. *pandas* build upon *numpy* and *scipy* providing easy-to-use data structures and data manipulation functions with integrated indexing. The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are: * Generating descriptive statistics on data * Data cleaning using built in pandas functions * Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data * Merging multiple datasets using dataframes * Working with timestamps and time-series data **Additional Recommended Resources:** * *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/ * *Python for Data Analysis* by Wes McKinney * *Python Data Science Handbook* by Jake VanderPlas Let’s get started with our first *pandas* notebook! Import Libraries

import pandas as pd

Introduction to pandas Data Structures

*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

*pandas Series* one-dimensional labeled array.

ser = pd.Series(data = [100, 'foo', 300, 'bar', 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric'])

ser

tom 100 bob foo nancy 300 dan bar eric 500 dtype: object

ser.index

Index([‘tom’, ‘bob’, ‘nancy’, ‘dan’, ‘eric’], dtype=’object’)

ser.loc[['nancy','bob']]

nancy 300 bob foo dtype: object

ser[[4, 3, 1]]

eric 500 dan bar bob foo dtype: object

ser.iloc[2]

300

'bob' in ser

True

ser

tom 100 bob foo nancy 300 dan bar eric 500 dtype: object

ser * 2

tom 200 bob foofoo nancy 600 dan barbar eric 1000 dtype: object

ser[['nancy', 'eric']] ** 2

nancy 90000 eric 250000 dtype: object

pandas DataFrame

*pandas DataFrame* is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

df = pd.DataFrame(d)
print(df)

one two apple 100.0 111.0 ball 200.0 222.0 cerill NaN 333.0 clock 300.0 NaN dancy NaN 4444.0

df.index

Index([‘apple’, ‘ball’, ‘cerill’, ‘clock’, ‘dancy’], dtype=’object’)

df.columns

Index([‘one’, ‘two’], dtype=’object’)

pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

	one	two
dancy	NaN	4444.0
ball	200.0	222.0
apple	100.0	111.0

pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

	two	five
dancy	4444.0	NaN
ball	222.0	NaN
apple	111.0	NaN

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

pd.DataFrame(data)

	alex	alice	dora	ema	joe
0	1.0	NaN	NaN	NaN	2.0
1	NaN	20.0	10.0	5.0	NaN

pd.DataFrame(data, index=['orange', 'red'])

	alex	alice	dora	ema	joe
orange	1.0	NaN	NaN	NaN	2.0
red	NaN	20.0	10.0	5.0	NaN

pd.DataFrame(data, columns=['joe', 'dora','alice'])

	joe	dora	alice
0	2.0	NaN	NaN
1	NaN	10.0	20.0

Basic DataFrame operations

df

	one	two
apple	100.0	111.0
ball	200.0	222.0
cerill	NaN	333.0
clock	300.0	NaN
dancy	NaN	4444.0

df['one']

apple 100.0 ball 200.0 cerill NaN clock 300.0 dancy NaN Name: one, dtype: float64

df['three'] = df['one'] * df['two']
df

	one	two	three
apple	100.0	111.0	11100.0
ball	200.0	222.0	44400.0
cerill	NaN	333.0	NaN
clock	300.0	NaN	NaN
dancy	NaN	4444.0	NaN

df['flag'] = df['one'] > 250
df

	one	two	three	flag
apple	100.0	111.0	11100.0	False
ball	200.0	222.0	44400.0	False
cerill	NaN	333.0	NaN	False
clock	300.0	NaN	NaN	True
dancy	NaN	4444.0	NaN	False

three = df.pop('three')

three

apple 11100.0 ball 44400.0 cerill NaN clock NaN dancy NaN Name: three, dtype: float64

df

	one	two	flag
apple	100.0	111.0	False
ball	200.0	222.0	False
cerill	NaN	333.0	False
clock	300.0	NaN	True
dancy	NaN	4444.0	False

del df['two']

df

	one	flag
apple	100.0	False
ball	200.0	False
cerill	NaN	False
clock	300.0	True
dancy	NaN	False

df.insert(2, 'copy_of_one', df['one'])
df

	one	flag	copy_of_one
apple	100.0	False	100.0
ball	200.0	False	200.0
cerill	NaN	False	NaN
clock	300.0	True	300.0
dancy	NaN	False	NaN

df['one_upper_half'] = df['one'][:2]
df

	one	flag	copy_of_one	one_upper_half
apple	100.0	False	100.0	100.0
ball	200.0	False	200.0	200.0
cerill	NaN	False	NaN	NaN
clock	300.0	True	300.0	NaN
dancy	NaN	False	NaN	NaN

Case Study: Movie Data Analysis

This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*. ## Download the Dataset Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints. Here are the links to the data source and location: * **Data Source: ** MovieLens web site (filename: ml-20m.zip) * **Location:** https://grouplens.org/datasets/movielens/ Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. Let us look at the files in this dataset using the UNIX command ls.

# Note: Adjust the name of the folder to match your local directory
#linux 使用
!ls ./movielens

!cat ./movielens/movies.csv | wc -l

!head -5 ./movielens/ratings.csv

Use Pandas to Read the Dataset

In this notebook, we will be using three CSV files:

ratings.csv : userId,movieId,rating, timestamp
tags.csv : userId,movieId, tag, timestamp
movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

	userId	movieId	tag	timestamp
0	18	4141	Mark Waters	1240597180
1	65	208	dark hero	1368150078
2	65	353	dark hero	1368150079
3	65	521	noir thriller	1368149983
4	65	592	dark hero	1368150078

ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()

	userId	movieId	rating	timestamp
0	1	2	3.5	1112486027
1	1	29	3.5	1112484676
2	1	32	3.5	1112484819
3	1	47	3.5	1112484727
4	1	50	3.5	1112484580

# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

#Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)

pandas.core.series.Series

print(row_0)

userId 18 movieId 4141 tag Mark Waters Name: 0, dtype: object

row_0.index

Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)

row_0['userId']

'rating' in row_0

False

row_0.name

row_0 = row_0.rename('first_row')
row_0.name

‘first_row’

DataFrames

tags.head()

	userId	movieId	tag
0	18	4141	Mark Waters
1	65	208	dark hero
2	65	353	dark hero
3	65	521	noir thriller
4	65	592	dark hero

tags.index

RangeIndex(start=0, stop=465564, step=1)

tags.columns

Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)

# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]

	userId	movieId	tag
0	18	4141	Mark Waters
11	65	1783	noir thriller
2000	910	68554	conspiracy theory

Descriptive Statistics

Let’s look how the ratings are distributed!

ratings['rating'].describe()

count 2.000026e+07 mean 3.525529e+00 std 1.051989e+00 min 5.000000e-01 25% 3.000000e+00 50% 3.500000e+00 75% 4.000000e+00 max 5.000000e+00 Name: rating, dtype: float64

ratings.describe()

	userId	movieId	rating
count	2.000026e+07	2.000026e+07	2.000026e+07
mean	6.904587e+04	9.041567e+03	3.525529e+00
std	4.003863e+04	1.978948e+04	1.051989e+00
min	1.000000e+00	1.000000e+00	5.000000e-01
25%	3.439500e+04	9.020000e+02	3.000000e+00
50%	6.914100e+04	2.167000e+03	3.500000e+00
75%	1.036370e+05	4.770000e+03	4.000000e+00
max	1.384930e+05	1.312620e+05	5.000000e+00

ratings['rating'].mean()

3.5255285642993797

ratings.mean()

userId 69045.872583 movieId 9041.567330 rating 3.525529 dtype: float64

ratings['rating'].min()

0.5

ratings['rating'].max()

5.0

ratings['rating'].std()

1.051988919275684

ratings['rating'].mode()

0 4.0 dtype: float64

ratings.corr()

	userId	movieId	rating
userId	1.000000	-0.000850	0.001175
movieId	-0.000850	1.000000	0.002606
rating	0.001175	0.002606	1.000000

filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False … 20000233 False 20000234 False 20000235 False 20000236 False 20000237 False 20000238 False 20000239 False 20000240 False 20000241 False 20000242 False 20000243 False 20000244 False 20000245 False 20000246 False 20000247 False 20000248 False 20000249 False 20000250 False 20000251 False 20000252 False 20000253 False 20000254 False 20000255 False 20000256 False 20000257 False 20000258 False 20000259 False 20000260 False 20000261 False 20000262 False Name: rating, dtype: bool False

filter_2 = ratings['rating'] > 0
filter_2.all()

True

Data Cleaning: Handling Missing Data

movies.shape

(27278, 3)

#is any row NULL ?

movies.isnull().any()

movieId False title False genres False dtype: bool Thats nice ! No NULL values !

ratings.shape

(20000263, 3)

#is any row NULL ?

ratings.isnull().any()

userId False movieId False rating False dtype: bool Thats nice ! No NULL values !

tags.shape

(465564, 3)

#is any row NULL ?

tags.isnull().any()

userId False movieId False tag True dtype: bool We have some tags which are NULL.

tags = tags.dropna()

#Check again: is any row NULL ?

tags.isnull().any()

userId False movieId False tag False dtype: bool

tags.shape

(465548, 3) Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

array([[

ratings.boxplot(column='rating', figsize=(15,20))

tags['tag'].head()

0 Mark Waters 1 dark hero 2 dark hero 3 noir thriller 4 dark hero Name: tag, dtype: object

movies[['title','genres']].head()

	title	genres
0	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	Jumanji (1995)	Adventure\|Children\|Fantasy
2	Grumpier Old Men (1995)	Comedy\|Romance
3	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	Father of the Bride Part II (1995)	Comedy

ratings[-10:]

	userId	movieId	rating
20000253	138493	60816	4.5
20000254	138493	61160	4.0
20000255	138493	65682	4.5
20000256	138493	66762	4.5
20000257	138493	68319	4.5
20000258	138493	68954	4.5
20000259	138493	69526	4.5
20000260	138493	69644	3.0
20000261	138493	70286	5.0
20000262	138493	71619	2.5

tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

Venice Film Festival Winner 2002 1 based on the life of Buford Pusser 1 but no way as good as the other two 1 see 1 Jeffrey Kimball 1 tolerable 1 Fake History - Don’t Believe a Thing 1 Boy 1 urlaub 1 conservative 1 Name: tag, dtype: int64

tag_counts[:10].plot(kind='bar', figsize=(15,10))

is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

	userId	movieId	rating
68	1	2021	4.0
69	1	2100	4.0
70	1	2118	4.0
71	1	2138	4.0
72	1	2140	4.0
73	1	2143	4.0
74	1	2173	4.0
75	1	2174	4.0
76	1	2193	4.0
79	1	2288	4.0
80	1	2291	4.0
81	1	2542	4.0
82	1	2628	4.0
90	1	2762	4.0
92	1	2872	4.0
94	1	2944	4.0
96	1	2959	4.0
97	1	2968	4.0
101	1	3081	4.0
102	1	3153	4.0

is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

	movieId	title	genres
310	313	Swan Princess, The (1994)	Animation\|Children
360	364	Lion King, The (1994)	Adventure\|Animation\|Children\|Drama\|Musical\|IMAX
388	392	Secret Adventures of Tom Thumb, The (1993)	Adventure\|Animation
547	551	Nightmare Before Christmas, The (1993)	Animation\|Children\|Fantasy\|Musical
553	558	Pagemaster, The (1994)	Action\|Adventure\|Animation\|Children\|Fantasy
582	588	Aladdin (1992)	Adventure\|Animation\|Children\|Comedy\|Musical
588	594	Snow White and the Seven Dwarfs (1937)	Animation\|Children\|Drama\|Fantasy\|Musical
589	595	Beauty and the Beast (1991)	Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX
590	596	Pinocchio (1940)	Animation\|Children\|Fantasy\|Musical
604	610	Heavy Metal (1981)	Action\|Adventure\|Animation\|Horror\|Sci-Fi

movies[is_animation].head(15)

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
12	13	Balto (1995)	Adventure\|Animation\|Children
47	48	Pocahontas (1995)	Animation\|Children\|Drama\|Musical\|Romance
236	239	Goofy Movie, A (1995)	Animation\|Children\|Comedy\|Romance
241	244	Gumby: The Movie (1995)	Animation\|Children
310	313	Swan Princess, The (1994)	Animation\|Children
360	364	Lion King, The (1994)	Adventure\|Animation\|Children\|Drama\|Musical\|IMAX
388	392	Secret Adventures of Tom Thumb, The (1993)	Adventure\|Animation
547	551	Nightmare Before Christmas, The (1993)	Animation\|Children\|Fantasy\|Musical
553	558	Pagemaster, The (1994)	Action\|Adventure\|Animation\|Children\|Fantasy
582	588	Aladdin (1992)	Adventure\|Animation\|Children\|Comedy\|Musical
588	594	Snow White and the Seven Dwarfs (1937)	Animation\|Children\|Drama\|Fantasy\|Musical
589	595	Beauty and the Beast (1991)	Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX
590	596	Pinocchio (1940)	Animation\|Children\|Fantasy\|Musical
604	610	Heavy Metal (1981)	Action\|Adventure\|Animation\|Horror\|Sci-Fi

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

	movieId
rating
0.5	239125
1.0	680732
1.5	279252
2.0	1430997
2.5	883398
3.0	4291193
3.5	2200156
4.0	5561926
4.5	1534824
5.0	2898660

average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

	rating
movieId
1	3.921240
2	3.211977
3	3.151040
4	2.861393
5	3.064592

movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

	rating
movieId
1	49695
2	22243
3	12735
4	2756
5	12161

movie_count = ratings[['movieId','rating']].groupby('movieId').count()#选择某一个维度，然后根据维度group by，和sql操作类似
movie_count.tail()

	rating
movieId
131254	1
131256	1
131258	1
131260	1
131262	1

Merge Dataframes

tags.head()

	userId	movieId	tag
0	18	4141	Mark Waters
1	65	208	dark hero
2	65	353	dark hero
3	65	521	noir thriller
4	65	592	dark hero

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

t = movies.merge(tags, on='movieId', how='inner')
t.head()
#?movies.merge 详细说明

	movieId	title	genres	userId	tag
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1644	Watched
1	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	computer animation
2	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	Disney animated feature
3	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	Pixar animation
4	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1741	TÃ©a Leoni does not star in this movie

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html

Combine aggreagation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean().rename(columns={'rating':'avg_rating'})#指定columns field 就能重命名了
del avg_ratings['userId']
avg_ratings.head()

	movieId	avg_rating
0	1	3.921240
1	2	3.211977
2	3	3.151040
3	4	2.861393
4	5	3.064592

avg_ratings = avg_ratings.rename({'ratings':'avg_rating'})
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()

	movieId	title	genres	avg_rating
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	3.921240
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	3.211977
2	3	Grumpier Old Men (1995)	Comedy\|Romance	3.151040
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	2.861393
4	5	Father of the Bride Part II (1995)	Comedy	3.064592

is_highly_rated = box_office['avg_rating'] >= 4.0

box_office[is_highly_rated][-5:]

	movieId	title	genres	avg_rating
26737	131250	No More School (2000)	Comedy	4.0
26738	131252	Forklift Driver Klaus: The First Day on the Jo…	Comedy\|Horror	4.0
26739	131254	Kein Bund für’s Leben (2007)	Comedy	4.0
26740	131256	Feuer, Eis & Dosenbier (2002)	Comedy	4.0
26743	131262	Innocence (2014)	Adventure\|Fantasy\|Horror	4.0

is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

	movieId	title	genres	avg_rating
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	3.921240
2	3	Grumpier Old Men (1995)	Comedy\|Romance	3.151040
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	2.861393
4	5	Father of the Bride Part II (1995)	Comedy	3.064592
6	7	Sabrina (1995)	Comedy\|Romance	3.366484

box_office[is_comedy & is_highly_rated][-5:]

	movieId	title	genres	avg_rating
26736	131248	Brother Bear 2 (2006)	Adventure\|Animation\|Children\|Comedy\|Fantasy	4.0
26737	131250	No More School (2000)	Comedy	4.0
26738	131252	Forklift Driver Klaus: The First Day on the Jo…	Comedy\|Horror	4.0
26739	131254	Kein Bund für’s Leben (2007)	Comedy	4.0
26740	131256	Feuer, Eis & Dosenbier (2002)	Comedy	4.0

Vectorized String Operations

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

Split ‘genres’ into multiple columns

movie_genres = movies['genres'].str.split('|', expand=True)

movie_genres[:10]

	0	1	2	3	4	5	6	7	8	9
0	Adventure	Animation	Children	Comedy	Fantasy	None	None	None	None	None
1	Adventure	Children	Fantasy	None	None	None	None	None	None	None
2	Comedy	Romance	None	None	None	None	None	None	None	None
3	Comedy	Drama	Romance	None	None	None	None	None	None	None
4	Comedy	None	None	None	None	None	None	None	None	None
5	Action	Crime	Thriller	None	None	None	None	None	None	None
6	Comedy	Romance	None	None	None	None	None	None	None	None
7	Adventure	Children	None	None	None	None	None	None	None	None
8	Action	None	None	None	None	None	None	None	None	None
9	Action	Adventure	Thriller	None	None	None	None	None	None	None

Add a new column for comedy genre flag

movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

movie_genres[:10]

	0	1	2	3	4	5	6	7	8	9	isComedy
0	Adventure	Animation	Children	Comedy	Fantasy	None	None	None	None	None	True
1	Adventure	Children	Fantasy	None	None	None	None	None	None	None	False
2	Comedy	Romance	None	None	None	None	None	None	None	None	True
3	Comedy	Drama	Romance	None	None	None	None	None	None	None	True
4	Comedy	None	None	None	None	None	None	None	None	None	True
5	Action	Crime	Thriller	None	None	None	None	None	None	None	False
6	Comedy	Romance	None	None	None	None	None	None	None	None	True
7	Adventure	Children	None	None	None	None	None	None	None	None	False
8	Action	None	None	None	None	None	None	None	None	None	False
9	Action	Adventure	Thriller	None	None	None	None	None	None	None	False

Extract year from title e.g. (1995)

movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)#接受正则表达式

movies.tail()

	movieId	title	genres	year
27273	131254	Kein Bund für’s Leben (2007)	Comedy	2007
27274	131256	Feuer, Eis & Dosenbier (2002)	Comedy	2002
27275	131258	The Pirates (2014)	Adventure	2014
27276	131260	Rentun Ruusu (2001)	(no genres listed)	2001
27277	131262	Innocence (2014)	Adventure\|Fantasy\|Horror	2014

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the *tags.csv* dataset and read the timestamps!

tags = pd.read_csv('./movielens/tags.csv', sep=',')

tags.dtypes

userId int64 movieId int64 tag object timestamp int64 dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags.head(5)

	userId	movieId	tag	timestamp
0	18	4141	Mark Waters	1240597180
1	65	208	dark hero	1368150078
2	65	353	dark hero	1368150079
3	65	521	noir thriller	1368149983
4	65	592	dark hero	1368150078

tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')#解析时间

Data Type datetime64[ns] maps to either


tags['parsed_time'].dtype

dtype(‘

tags.head(2)

	userId	movieId	tag	timestamp	parsed_time
0	18	4141	Mark Waters	1240597180	2009-04-24 18:19:40
1	65	208	dark hero	1368150078	2013-05-10 01:41:18

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

((465564, 5), (12130, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]

	userId	movieId	tag	timestamp	parsed_time
333932	100371	2788	monty python	1135429210	2005-12-24 13:00:10
333927	100371	1732	coen brothers	1135429236	2005-12-24 13:00:36
333924	100371	1206	stanley kubrick	1135429248	2005-12-24 13:00:48
333923	100371	1193	jack nicholson	1135429371	2005-12-24 13:02:51
333939	100371	5004	peter sellers	1135429399	2005-12-24 13:03:19
333922	100371	47	morgan freeman	1135429412	2005-12-24 13:03:32
333921	100371	47	brad pitt	1135429412	2005-12-24 13:03:32
333936	100371	4011	brad pitt	1135429431	2005-12-24 13:03:51
333937	100371	4011	guy ritchie	1135429431	2005-12-24 13:03:51
333920	100371	32	bruce willis	1135429442	2005-12-24 13:04:02

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?

average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

	movieId	rating
26739	131254	4.0
26740	131256	4.0
26741	131258	2.5
26742	131260	3.0
26743	131262	4.0

joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()

	movieId	rating
movieId	1.000000	-0.090369
rating	-0.090369	1.000000

yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()#将asindex设为false会作为列名，否则作为行
yearly_average[:10]

	year	rating
0	1891	3.000000
1	1893	3.375000
2	1894	3.071429
3	1895	3.125000
4	1896	3.183036
5	1898	3.850000
6	1899	3.625000
7	1900	3.166667
8	1901	5.000000
9	1902	3.738189

import matplotlib.pyplot as plt
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
plt.show()

这里写图片描述
Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

注意: 本文为edx上 UCSD 的课程py for data science笔记