pandas Data Processing in Practice, Part 1 (A Quick Walkthrough)

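Processing data with pandas breaks down roughly into the following steps:

read data --> analyze data --> process data --> export data

This first part simply walks through the whole workflow once.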
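As a minimal sketch of these four steps, the example below uses a small in-memory CSV instead of a file on disk. The data and the names `raw_csv`, `top` and `exported` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
raw_csv = io.StringIO(
    "name,score\n"
    "a,3\n"
    "b,1\n"
    "c,2\n"
)

df = pd.read_csv(raw_csv)                       # 1. read
print(df.shape)                                 # 2. analyze: (rows, columns)
top = df.sort_values("score", ascending=False)  # 3. process
exported = top.to_csv(index=False)              # 4. export; returns a string when no path is given
print(exported)
```

Passing a file path to `to_csv` instead would write the same text to disk.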

 

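import pandas as pd

df1 = pd.read_csv('/path/xx.csv')   # read the data with pd.read_csv; the result is a DataFrame

# df1.to_csv('df1.csv', index=False) # write the contents to a file named df1.csv, leaving out the index

The contents of df1 are: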

Sep 2018	Sep 2017	Change	Programming Language	Ratings	Change.1
0	1	1	NaN	Java	17.436%	+4.75%
1	2	2	NaN	C	15.447%	+8.06%
2	3	5	change	Python	7.653%	+4.67%
3	4	3	change	C++	7.394%	+1.83%
4	5	8	change	Visual Basic .NET	5.308%	+3.33%
5	6	4	change	C#	3.295%	-1.48%
6	7	6	change	PHP	2.775%	+0.57%
7	8	7	change	JavaScript	2.131%	+0.11%
8	9	-	change	SQL	2.062%	+2.06%
9	10	18	change	Objective-C	1.509%	+0.00%

df1.columns  # lists the column labels of the loaded data; we can use these labels to select the content we want

Index(['Sep 2018', 'Sep 2017', 'Change', 'Programming Language', 'Ratings','Change.1'],dtype='object')
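To illustrate how those labels are used for selection, here is a small sketch on a two-row frame built by hand. The name `table` is hypothetical; the values are copied from the first two rows shown above.

```python
import pandas as pd

# A tiny hand-built frame; in the article the data comes from read_csv.
table = pd.DataFrame(
    {
        "Programming Language": ["Java", "C"],
        "Ratings": ["17.436%", "15.447%"],
    }
)

labels = list(table.columns)                       # the column labels as a plain list
one = table["Ratings"]                             # a single label gives a Series
two = table[["Programming Language", "Ratings"]]   # a list of labels gives a DataFrame
print(labels)
print(type(one).__name__, type(two).__name__)
```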

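df_new = pd.DataFrame(df1, columns=['Sep 2019','Sep 2018', 'Change', 'Programming Language']) # extracts the columns we want from the original data;
 # a label that does not exist yet ('Sep 2019') is added as a new column, initialized to NaN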

Sep 2019	Sep 2018	Change	Programming Language
0	NaN	1	NaN	Java
1	NaN	2	NaN	C
2	NaN	3	change	Python
3	NaN	4	change	C++
4	NaN	5	change	Visual Basic .NET
5	NaN	6	change	C#
6	NaN	7	change	PHP
7	NaN	8	change	JavaScript
8	NaN	9	change	SQL
9	NaN	10	change	Objective-C
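Another way to get the same kind of result is `reindex(columns=...)`. The sketch below assumes a small hand-built frame (`src` is a hypothetical name) and shows that requested labels missing from the source come back filled with NaN, while unrequested columns are dropped.

```python
import pandas as pd

# Hypothetical three-row stand-in for df1.
src = pd.DataFrame(
    {
        "Sep 2018": [1, 2, 3],
        "Programming Language": ["Java", "C", "Python"],
        "Ratings": ["17.436%", "15.447%", "7.653%"],
    }
)

wanted = ["Sep 2019", "Sep 2018", "Programming Language"]
picked = src.reindex(columns=wanted)  # 'Sep 2019' is not in src, so it is created as NaN
print(picked)
```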

df_new['Sep 2019'] = range(10) # assign values to the newly added column; this is handy too

Sep 2019	Sep 2018	Change	Programming Language
0	0	1	NaN	Java
1	1	2	NaN	C
2	2	3	change	Python
3	3	4	change	C++
4	4	5	change	Visual Basic .NET
5	5	6	change	C#
6	6	7	change	PHP
7	7	8	change	JavaScript
8	8	9	change	SQL
9	9	10	change	Objective-C
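One detail worth knowing about this kind of assignment: a sequence must have exactly one value per row, while a single scalar is repeated for every row. A small sketch on made-up data (`langs` and the new column names are hypothetical):

```python
import pandas as pd

langs = pd.DataFrame({"Programming Language": ["Java", "C", "Python"]})

langs["rank"] = range(1, len(langs) + 1)  # one value per row
langs["year"] = 2019                      # a scalar is broadcast to every row

length_error = None
try:
    langs["bad"] = range(10)              # 10 values for 3 rows
except ValueError as exc:
    length_error = str(exc)

print(langs)
print(length_error)
```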

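Sorting a DataFrame

df2 = df1.sort_values('A') # sort all rows by the values in one column ('A' stands in for a real column label)
df2.sort_index() # sort by index; returns a new DataFrame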
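A short sketch of both calls on made-up data (`scores` is a hypothetical frame). Note that `sort_values` keeps each row's original index label, which is why `sort_index` can restore the original order afterwards.

```python
import pandas as pd

scores = pd.DataFrame({"title": ["x", "y", "z"], "score": [7.1, 9.0, 8.5]})

by_score = scores.sort_values("score", ascending=False)  # highest score first
print(list(by_score.index))                              # original labels travel with the rows

restored = by_score.sort_index()                         # back to label order 0, 1, 2
print(scores.equals(restored))
```

Both methods return a new DataFrame and leave the one they were called on unchanged.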

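How to extract the content we want and sort it by a given feature in a single step

Key code:

We first do it step by step, then in a single step.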

df = pd.read_csv('./movie_metadata.csv') # load the raw data

df.shape # check the size (shape) of the raw data: the output shows 5043 rows and 28 columns,
         # i.e. 5043 samples and 28 features
    (5043, 28)

df.columns # view the column labels
    Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

df.head(5) # look at the first five rows; the output is too wide to show here
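For reference, a tiny sketch of what `shape` and `head` return, using a made-up 8-row frame (`frame` is a hypothetical name):

```python
import pandas as pd

frame = pd.DataFrame({"a": range(8), "b": range(8)})

n_rows, n_cols = frame.shape  # shape is a (rows, columns) tuple
first_five = frame.head()     # head() defaults to the first 5 rows
first_three = frame.head(3)   # pass a number to change that
print(n_rows, n_cols, len(first_five), len(first_three))
```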

# Extract imdb_score, director_name and movie_title from the raw data and sort by imdb_score in descending order, all in one statement
df_new = pd.DataFrame(df, columns=['imdb_score','director_name','movie_title']).sort_values('imdb_score',ascending=False)
imdb_score	director_name	movie_title
2765	9.5	John Blanchard	Towering Inferno
1937	9.3	Frank Darabont	The Shawshank Redemption
3466	9.2	Francis Ford Coppola	The Godfather
4409	9.1	John Stockwell	Kickboxer: Vengeance
2824	9.1	NaN	Dekalog
3207	9.1	NaN	Dekalog
66	9.0	Christopher Nolan	The Dark Knight

5043 rows × 3 columns

Here is the code that does everything in one step:

pd.read_csv('movie_metadata.csv')[['imdb_score','director_name','movie_title']].sort_values('imdb_score',ascending=False).to_csv('imbd.csv')

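This one statement goes from reading --> selecting --> sorting --> exporting. It could be split into separate statements, but writing it as a single chain is good practice for developing a feel for the code.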
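A long chain like this can be easier to read when wrapped in parentheses with one step per line. The sketch below does that on a small in-memory CSV. The rows and the name `sample` are invented for illustration; only the column names are taken from the dataset above.

```python
import io

import pandas as pd

# Hypothetical rows standing in for movie_metadata.csv.
sample = io.StringIO(
    "movie_title,director_name,imdb_score,duration\n"
    "Film A,Director One,7.5,120\n"
    "Film B,Director Two,9.1,95\n"
    "Film C,Director Three,8.2,140\n"
)

result = (
    pd.read_csv(sample)                                # read
    [["imdb_score", "director_name", "movie_title"]]   # select columns
    .sort_values("imdb_score", ascending=False)        # sort
    .to_csv(index=False)                               # export (as a string here)
)
print(result)
```

Each line can be commented out or followed by a temporary `.head()` while exploring, which fits the write-and-inspect style described here.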

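This style is very convenient in Jupyter, where you can inspect the data as you write. For example:

Run pd.read_csv('movie_metadata.csv').columns to view the column labels, or use head() to show the first five rows; after running it once, delete that call and keep extending the chain.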

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

# Now extract the content we want, e.g. the three features ['movie_title','imdb_score','director_name'], and check the result with head()

pd.read_csv('movie_metadata.csv')[['movie_title','imdb_score','director_name']].head() # shows the first five rows by default

	movie_title	imdb_score	director_name
0	Avatar	7.9	James Cameron
1	Pirates of the Caribbean: At World's End	7.1	Gore Verbinski
2	Spectre	6.8	Sam Mendes
3	The Dark Knight Rises	8.5	Christopher Nolan
4	Star Wars: Episode VII - The Force Awakens  ...	7.1	Doug Walker

# Next, process the data, e.g. sort by imdb_score: remove head() and keep extending the chain

# Once the data is sorted by imdb_score, use head(10) to view the first 10 rows

pd.read_csv('movie_metadata.csv')[['imdb_score','movie_title','director_name']].sort_values('imdb_score',ascending=False).head(10)

	imdb_score	movie_title	director_name
2765	9.5	Towering Inferno	John Blanchard
1937	9.3	The Shawshank Redemption	Frank Darabont
3466	9.2	The Godfather	Francis Ford Coppola
4409	9.1	Kickboxer: Vengeance	John Stockwell
2824	9.1	Dekalog	NaN
3207	9.1	Dekalog	NaN
66	9.0	The Dark Knight	Christopher Nolan
2837	9.0	The Godfather: Part II	Francis Ford Coppola
3481	9.0	Fargo	NaN
339	8.9	The Lord of the Rings: The Return of the King	Peter Jackson

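# Once the data is processed (there is of course much more to handle, such as missing values), write it to disk

# Remove head(10) and keep extending the chain to write the data to disk as imbd_ex.csv

pd.read_csv('movie_metadata.csv')[['imdb_score','movie_title','director_name']].sort_values('imdb_score',ascending=False).to_csv('imbd_ex.csv')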

2018/10/02  17:33           212,877 imbd.csv
2018/10/02  18:18           212,877 imbd_ex.csv
2017/11/13  19:09         1,494,688 movie_metadata.csv

There is now a file named imbd_ex.csv. Let's open it and take a look.

	imdb_score	director_name	movie_title
2765	9.5	John Blanchard	Towering Inferno
1937	9.3	Frank Darabont	The Shawshank Redemption
3466	9.2	Francis Ford Coppola	The Godfather
4409	9.1	John Stockwell	Kickboxer: Vengeance
2824	9.1		Dekalog
3207	9.1		Dekalog
66	9	Christopher Nolan	The Dark Knight
2837	9	Francis Ford Coppola	The Godfather: Part II
3481	9		Fargo
339	8.9	Peter Jackson	The Lord of the Rings: The Return of the King
4822	8.9	Sidney Lumet	12 Angry Men

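The file still contains the row labels from the original data. If you don't want them, pass index=False when writing, i.e. to_csv('imbd_ex.csv', index=False), then open the file again: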

imdb_score	director_name	movie_title
9.5	John Blanchard	Towering Inferno
9.3	Frank Darabont	The Shawshank Redemption
9.2	Francis Ford Coppola	The Godfather
9.1	John Stockwell	Kickboxer: Vengeance
9.1		Dekalog
9.1		Dekalog
9	Christopher Nolan	The Dark Knight
9	Francis Ford Coppola	The Godfather: Part II
9		Fargo
8.9	Peter Jackson	The Lord of the Rings: The Return of the King
8.9	Sidney Lumet	12 Angry Men
8.9	Sergio Leone	The Good, the Bad and the Ugly
8.9	Quentin Tarantino	Pulp Fiction
8.9	Steven Spielberg	Schindler's List
8.8	David Fincher	Fight Club
8.8	Robert Zemeckis	Forrest Gump
8.8	Peter Jackson	The Lord of the Rings: The Fellowship of the Ring
8.8	Irvin Kershner	Star Wars: Episode V - The Empire Strikes Back
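The effect of index=False can be checked without touching the disk. In this sketch, `to_csv` is called without a path, in which case it returns the CSV text as a string. The frame `ranked`, its titles and its index labels are made up.

```python
import pandas as pd

# Hypothetical sorted result whose index still holds the original row labels.
ranked = pd.DataFrame(
    {"imdb_score": [9.5, 9.3], "movie_title": ["Film A", "Film B"]},
    index=[2765, 1937],
)

with_index = ranked.to_csv()                # index written as an unnamed first column
without_index = ranked.to_csv(index=False)  # index left out

print(with_index)
print(without_index)
```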

 

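The row labels are now gone from the file. You might wonder: if I read this file back in with pandas, will there be an index again? Yes. Every DataFrame (like each Series it is built from) carries an index, so when the file has no index column, read_csv generates a fresh default one (0, 1, 2, ...) for us to work with: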

pd.read_csv('imbd_ex.csv').head(10)

	imdb_score	movie_title	director_name
0	9.5	Towering Inferno	John Blanchard
1	9.3	The Shawshank Redemption	Frank Darabont
2	9.2	The Godfather	Francis Ford Coppola
3	9.1	Kickboxer: Vengeance	John Stockwell
4	9.1	Dekalog	NaN
5	9.1	Dekalog	NaN
6	9.0	The Dark Knight	Christopher Nolan
7	9.0	The Godfather: Part II	Francis Ford Coppola
8	9.0	Fargo	NaN
9	8.9	The Lord of the Rings: The Return of the King	Peter Jackson
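The round trip can be sketched in memory as well. The example assumes a made-up frame `ranked` and also shows the opposite case: if the index was written to the file, `index_col=0` tells `read_csv` to use that first column as the index instead of generating a new one.

```python
import io

import pandas as pd

# Hypothetical frame with non-default index labels.
ranked = pd.DataFrame({"imdb_score": [9.5, 9.3]}, index=[2765, 1937])

# Written without the index: reading back yields a new default index.
fresh = pd.read_csv(io.StringIO(ranked.to_csv(index=False)))
print(list(fresh.index))

# Written with the index: index_col=0 recovers the saved labels.
recovered = pd.read_csv(io.StringIO(ranked.to_csv()), index_col=0)
print(list(recovered.index))
```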

# More to be added later
