python与数据挖掘上机实验pandas_python-pandas的一些小练习——东北大学大数据班数据挖掘实训Python基础二（2）...

最新推荐文章于 2024-04-08 15:34:49 发布

weixin_39671405

最新推荐文章于 2024-04-08 15:34:49 发布

阅读量377

点赞数

文章标签： python与数据挖掘上机实验pandas

因为本博客是直接导入markdown文件(jupter写的)生成的，所以会导致分不清代码和输出，如果大家需要清楚的代码和输出，就请下载以下html文件

pandasTrain.html

实训练习：练习使用pandas 包中的dataframe和方法。

1将数据读取为dataframe类型，命名为df

import pandas as pd

import numpy as np

df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\data\\flight.csv")

df.head()

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

2011/12/1

14.0

31.0

1431.0

1730.0

-4.0

-9.0

4319

SAV

N11121

104.0

851

2011/12/1

21.0

15.0

2115.0

2229.0

-10.0

-7.0

4342

LRD

N11107

50.0

301

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434

RIC

N11113

146.0

1157

2011/12/1

19.0

24.0

1924.0

2234.0

-1.0

1.0

4525

SAV

N11106

105.0

851

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602

DTW

N11121

139.0

1075

2查看数据的形状[了解数据有多少行、多少列]

df.shape

(174, 14)

3查看数据的最开始的7条数据和尾部的3条数据

df.head(7)

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

2011/12/1

14.0

31.0

1431.0

1730.0

-4.0

-9.0

4319

SAV

N11121

104.0

851

2011/12/1

21.0

15.0

2115.0

2229.0

-10.0

-7.0

4342

LRD

N11107

50.0

301

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434

RIC

N11113

146.0

1157

2011/12/1

19.0

24.0

1924.0

2234.0

-1.0

1.0

4525

SAV

N11106

105.0

851

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602

DTW

N11121

139.0

1075

2011/12/1

20.0

18.0

2018.0

2200.0

-2.0

-9.0

4674

BNA

N11121

82.0

657

2011/12/2

21.0

20.0

2120.0

2230.0

-5.0

-6.0

4342

LRD

N11113

53.0

301

df.tail(3)

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

171

2011/12/30

11.0

14.0

1114.0

1141.0

9.0

6.0

4323

HOB

N11107

73.0

501

172

2011/12/30

18.0

31.0

1831.0

2145.0

16.0

9.0

4634

CLT

N11121

116.0

912

173

2011/12/31

14.0

8.0

1408.0

1705.0

2.0

-1.0

4433

TYS

N11107

98.0

771

4查看数据的各字段的类型

df.dtypes

date object

hour float64

minute float64

dep float64

arr float64

dep_delay float64

arr_delay float64

carrier object

flight int64

dest object

plane object

cancelled int64

time float64

dist int64

dtype: object

5查看数据是否有缺失值

df.isna().any() #列级别的判断，只要该列有为空或者NA的元素，就为True，否则False isna()和isnull()一样

date False

hour True

minute True

dep True

arr True

dep_delay True

arr_delay True

carrier False

flight False

dest False

plane False

cancelled False

time True

dist False

dtype: bool

6如果有缺失值，进行补0操作

df=df.fillna(0)

df.isna().any()

date False

hour False

minute False

dep False

arr False

dep_delay False

arr_delay False

carrier False

flight False

dest False

plane False

cancelled False

time False

dist False

dtype: bool

7从df中按字段取出‘ date’’dist’和‘ flight’形成新的df1

df1 = df.loc[:,['date','dist','flight']]

df1.head()

date

dist

flight

2011/12/1

851

4319

2011/12/1

301

4342

2011/12/1

1157

4434

2011/12/1

851

4525

2011/12/1

1075

4602

8从df中按位置选取行为3和4，列为0,1的数据形成新的df2

df2 = df.iloc[[3,4],[0,1]]

df2

date

hour

2011/12/1

19.0

2011/12/1

7.0

9在df中添加一列‘ low_dest’，内容是‘ dest’这列的字符串的小写形式

df['low_dest'] = df['dest'].str.lower()

df.head()

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

low_dest

2011/12/1

14.0

31.0

1431.0

1730.0

-4.0

-9.0

4319

SAV

N11121

104.0

851

sav

2011/12/1

21.0

15.0

2115.0

2229.0

-10.0

-7.0

4342

LRD

N11107

50.0

301

lrd

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434

RIC

N11113

146.0

1157

ric

2011/12/1

19.0

24.0

1924.0

2234.0

-1.0

1.0

4525

SAV

N11106

105.0

851

sav

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602

DTW

N11121

139.0

1075

dtw

10从df中选取‘ dist’大于1000并且小于1200的数据作为一个新的df3

df3 = df[(df.dist>1000) & (df.dist<1200)]

df3.head()

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

low_dest

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434

RIC

N11113

146.0

1157

ric

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602

DTW

N11121

139.0

1075

dtw

2011/12/2

7.0

13.0

713.0

1058.0

-2.0

-6.0

4501

PIT

N11107

146.0

1117

pit

2011/12/3

19.0

1.0

1901.0

2230.0

-1.0

-14.0

4509

RIC

N11113

134.0

1157

ric

2011/12/4

16.0

8.0

1608.0

2004.0

18.0

40.0

4477

RDU

N11121

123.0

1042

rdu

11从df中选取‘ time’大于100的数据做为新的df4

df4 = df[df.time>100]

df4.head()

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

low_dest

2011/12/1

14.0

31.0

1431.0

1730.0

-4.0

-9.0

4319

SAV

N11121

104.0

851

sav

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434

RIC

N11113

146.0

1157

ric

2011/12/1

19.0

24.0

1924.0

2234.0

-1.0

1.0

4525

SAV

N11106

105.0

851

sav

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602

DTW

N11121

139.0

1075

dtw

2011/12/2

10.0

25.0

1025.0

1352.0

-5.0

-2.0

4366

CLT

N11121

120.0

912

clt

12将df3和df4合并为一个dataframe，命名为df5

df5 = pd.concat([df3,df4],axis = 1) #横向合并 axis决定合并方向 ,axis = 1横向合并 axis = 0竖向合并

df5.head()

date

hour

minute

dep

arr

dep_delay

arr_delay

carrier

flight

dest

...

dep_delay

arr_delay

carrier

flight

dest

plane

cancelled

time

dist

low_dest

NaN

...

-4.0

-9.0

4319.0

SAV

N11121

0.0

104.0

851.0

sav

2011/12/1

10.0

47.0

1047.0

1429.0

20.0

17.0

4434.0

RIC

...

20.0

17.0

4434.0

RIC

N11113

0.0

146.0

1157.0

ric

NaN

...

-1.0

1.0

4525.0

SAV

N11106

0.0

105.0

851.0

sav

2011/12/1

7.0

18.0

718.0

1105.0

-2.0

-8.0

4602.0

DTW

...

-2.0

-8.0

4602.0

DTW

N11121

0.0

139.0

1075.0

dtw

NaN

...

-5.0

-2.0

4366.0

CLT

N11121

0.0

120.0

912.0

clt

5 rows × 30 columns

df3.shape

(43, 15)

df4.shape

(85, 15)

df5.shape #15+15=30

(86, 30)

pd.concat([df3,df4],axis = 0).shape #竖向合并 43+85=128

(128, 15)

13取出df3中的‘ time’,‘dist’ 这两列做为df6，求出df6中每列中最大值与最小值的差值(运用lambda函数的方式进行求解 )

df6 = df3.loc[:,['time','dist']]

df6.head()

df6.apply(lambda x : x.max() - x.min())

14将df根据“ dest”分组，统计各目的地，都有多少条数据，命名为“ dest_count”，然后将“ dest”和“ dest_count”组成新的dataframe命名为df7

df7 = df.groupby('dest',as_index = False)['dest'].agg({'dest_count':np.size})

df7.head()

15根据dep_delay不为0，统计df中有多少次航班起飞延误

df[df.dep_delay!=0].shape[0]

16利用plane字段，统计df中不同航班的飞行次数，以plane和plane_cout格式保存为df8

df8 = df.groupby('plane',as_index = False)['plane'].agg({'dest_count':np.size})

df8.head()

17利用carrier字段，计算df中carrier字段不同属性数据的数据量的比值(如：XE的数据量与MQ的数据量的比值)

df9=df.groupby(['carrier'],as_index=False)['carrier'].size().reset_index(name='carrier_count')

df9.head()

print("XE的数据量与MQ的数据量的比值: ", df9.iat[1,1]/df9.iat[0,1])

18分别将df7和 df8保存到本地文件，文件名分别为df7.csv和df8.csv

df7.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df7.csv")

df8.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df78.csv")

weixin_39671405

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python与数据挖掘上机实验pandas_python-pandas的一些小练习——东北大学大数据班数据挖掘实训Python基础二（2）...

因为本博客是直接导入markdown文件(jupter写的)生成的，所以会导致分不清代码和输出，如果大家需要清楚的代码和输出，就请下载以下html文件pandasTrain.html实训练习：练习使用pandas 包中的dataframe和方法。1将数据读取为dataframe类型，命名为dfimport pandas as pdimport numpy as npdf=pd.read_csv("...
复制链接

扫一扫