数据分析与挖掘练习

1.0 背景

该数据集是澳大利亚某公司无人机送货的记录(2018年8月之前),主要包括以下的列:

  • 'Id' : 记录的ID
  • 'Drone Type' : 无人机的类别分 1类 2类 3类
  • 'Post Type' : 运送的类别 0为普通运送 1为速运
  • 'Package Weight' :包裹的重量
  • 'Origin Region' :出发地的区域代码
  • 'Destination Region' :目的地的区域代码
  • 'Origin Latitude' :出发纬度
  • 'Origin Longitude' :出发经度
  • 'Destination Latitude' :目的地纬度
  • 'Destination Longitude' :目的地经度
  • 'Journey Distance' :运送距离
  • 'Departure Date' :出发日期
  • 'Departure Time' :出发时间
  • 'Travel Time' :飞行时间
  • 'Delivery Time' :到达时间
  • 'Delivery Fare' :运送费用
pd.options.display.max_rows = 10

2.0 载入包和数据

#loading library
import pandas as pd
import re
import matplotlib.pyplot as plt
#import seaborn as sns  !pip intall seaborn
import scipy.stats as st
import numpy as np
import math
from math import *
from datetime import datetime,timedelta

任务1:载入名为‘data.csv’的数据

data = pd.read_csv('data.csv')

DataFrame
Series

type(data)
pandas.core.frame.DataFrame

3.0 数据初步探索

任务2:找出数据有多少行列

data.shape
(37903, 16)

任务3:查看列的统计信息

提示:describe()

data.describe()
Drone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceTravel TimeDelivery Fare
count37893.00000037883.00000037903.00000037893.00000037893.00000037903.00000037903.00000037903.00000037903.00000037903.00000037863.00000037874.000000
mean1.6992850.29870925.66990120.47668420.452722-37.728867145.423058-37.722054145.434035221.954150208.794518126.814976
std0.7798450.45769812.10715011.50111011.5093111.8991836.9239931.8956216.909055116.604355107.61244759.314445
min1.0000000.0000005.0010001.0000001.000000-39.006941-148.337157-39.006941-147.6919020.6640007.42000054.020000
25%1.0000000.00000015.19900011.00000011.000000-38.443034143.965002-38.431293143.951543131.044500125.16500097.440000
50%2.0000000.00000025.44600020.00000020.000000-37.707244145.423386-37.700695145.450794209.796000196.370000120.045000
75%2.0000001.00000035.95350030.00000030.000000-37.094433147.170334-37.080256147.216886302.052000281.250000145.800000
max3.0000001.00000055.99200040.00000040.00000038.986998148.45057638.989473148.450576556.637000545.4600001217.690000

任务4:找出每个列名称

data.columns
Index(['Id', 'Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
       'Destination Region', 'Origin Latitude', 'Origin Longitude',
       'Destination Latitude', 'Destination Longitude', 'Journey Distance',
       'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
       'Delivery Fare'],
      dtype='object')

任务5:找出每个列的属性

是Obeject 还是 float

data.info()
data['Drone Type'] = data['Drone Type'].astype('str')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37903 entries, 0 to 37902
Data columns (total 16 columns):
Id                       37878 non-null object
Drone Type               37893 non-null float64
Post Type                37883 non-null float64
Package Weight           37903 non-null float64
Origin Region            37893 non-null float64
Destination Region       37893 non-null float64
Origin Latitude          37903 non-null float64
Origin Longitude         37903 non-null float64
Destination Latitude     37903 non-null float64
Destination Longitude    37903 non-null float64
Journey Distance         37903 non-null float64
Departure Date           37903 non-null object
Departure Time           37903 non-null object
Travel Time              37863 non-null float64
Delivery Time            37903 non-null object
Delivery Fare            37874 non-null float64
dtypes: float64(12), object(4)
memory usage: 4.6+ MB
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37903 entries, 0 to 37902
Data columns (total 16 columns):
Id                       37878 non-null object
Drone Type               37903 non-null object
Post Type                37883 non-null float64
Package Weight           37903 non-null float64
Origin Region            37893 non-null float64
Destination Region       37893 non-null float64
Origin Latitude          37903 non-null float64
Origin Longitude         37903 non-null float64
Destination Latitude     37903 non-null float64
Destination Longitude    37903 non-null float64
Journey Distance         37903 non-null float64
Departure Date           37903 non-null object
Departure Time           37903 non-null object
Travel Time              37863 non-null float64
Delivery Time            37903 non-null object
Delivery Fare            37874 non-null float64
dtypes: float64(11), object(5)
memory usage: 4.6+ MB

任务6:找出数据的前5行和后5行

data.head()
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
0ID16452821282.00.021.68619.038.0-37.089338144.429529-37.639134142.891391149.2122018-01-1609:38:17140.1911:58:2899.25
1ID1697620764nan0.039.07515.015.0-38.481935146.009567-38.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
2ID15439335032.00.07.24333.028.0-38.754167144.509664-38.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
3ID17565176082.00.013.38310.038.0-37.240526147.568019-37.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
4ID18323258342.00.08.1231.08.0-38.143985143.798292-38.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
5ID18024485762.00.032.8592.028.0-37.421211148.044072-38.159627148.19404883.2502018-05-1516:35:5081.1217:56:57113.88
6ID19402314081.00.020.61629.036.0-37.173949143.140662-37.021605145.197043183.3632018-04-0119:31:12184.2222:35:2585.60
7ID12993039582.00.044.57736.031.0-37.123190145.236196-37.667199143.877650134.5432018-05-0118:39:36127.0520:46:38114.22
8ID17527220281.00.015.36320.030.0-38.850561148.317253-38.024914144.823938318.1322018-05-2714:48:17314.6420:02:5587.39
9ID59952435901.01.036.19018.028.0-38.070189142.950207-37.996817148.026520445.1062018-06-1712:53:02437.5220:10:33142.95
data.tail()
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
37898NaN3.01.027.15339.016.0-38.446310148.292498-36.739777143.604529454.9682018-07-2308:29:19366.0914:35:24188.49
37899ID58625529911.01.040.3639.038.0-38.983710145.320518-37.673908142.879230258.2592018-06-2615:55:37256.7020:12:18122.98
37900ID53391040821.01.035.95513.032.0-38.292301147.562013-36.605285148.293183198.5972018-03-1916:41:10198.9720:00:08118.47
37901ID54687878662.01.029.56633.023.0-38.853243144.508346-37.727691145.662270160.8162018-02-2604:22:30150.586:53:04161.96
37902ID14481267683.00.044.07036.034.0-37.129313145.266426-38.428477143.341632222.6872018-07-0708:01:42182.7111:04:24144.41

任务7:找出所有列的缺失值个数并且按照多到少排列

隐藏任务:可视化缺失值的列

data.isnull().sum().sort_values(ascending=False)
Travel Time              40
Delivery Fare            29
Id                       25
Post Type                20
Destination Region       10
Origin Region            10
Delivery Time             0
Departure Time            0
Departure Date            0
Journey Distance          0
Destination Longitude     0
Destination Latitude      0
Origin Longitude          0
Origin Latitude           0
Package Weight            0
Drone Type                0
dtype: int64
count = {}
for col in data.columns:
    count_null = data[col].isnull().sum()
    count[col] = count_null
for i,j in sorted(count.items(),key = lambda s: s[1], reverse=True):
    print('列名:%s,存在缺失值 %s 个'%(i,j))
列名:Travel Time,存在缺失值 40 个
列名:Delivery Fare,存在缺失值 29 个
列名:Id,存在缺失值 25 个
列名:Post Type,存在缺失值 20 个
列名:Drone Type,存在缺失值 10 个
列名:Origin Region,存在缺失值 10 个
列名:Destination Region,存在缺失值 10 个
列名:Package Weight,存在缺失值 0 个
列名:Origin Latitude,存在缺失值 0 个
列名:Origin Longitude,存在缺失值 0 个
列名:Destination Latitude,存在缺失值 0 个
列名:Destination Longitude,存在缺失值 0 个
列名:Journey Distance,存在缺失值 0 个
列名:Departure Date,存在缺失值 0 个
列名:Departure Time,存在缺失值 0 个
列名:Delivery Time,存在缺失值 0 个

任务8:找出所有至少含有一个缺失值的行,并统计有多少行

data.isnull().any(axis=1)  # 判断至少有一个缺失值
0        False
1         True
2        False
3        False
4        False
         ...  
37898     True
37899    False
37900    False
37901    False
37902    False
Length: 37903, dtype: bool
data.drop(data.iloc[0,2])
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
1ID1697620764NaN0.039.07515.015.0-38.481935146.009567-38.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
2ID15439335032.00.07.24333.028.0-38.754167144.509664-38.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
3ID17565176082.00.013.38310.038.0-37.240526147.568019-37.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
4ID18323258342.00.08.1231.08.0-38.143985143.798292-38.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
5ID18024485762.00.032.8592.028.0-37.421211148.044072-38.159627148.19404883.2502018-05-1516:35:5081.1217:56:57113.88
6ID19402314081.00.020.61629.036.0-37.173949143.140662-37.021605145.197043183.3632018-04-0119:31:12184.2222:35:2585.60
7ID12993039582.00.044.57736.031.0-37.123190145.236196-37.667199143.877650134.5432018-05-0118:39:36127.0520:46:38114.22
8ID17527220281.00.015.36320.030.0-38.850561148.317253-38.024914144.823938318.1322018-05-2714:48:17314.6420:02:5587.39
9ID59952435901.01.036.19018.028.0-38.070189142.950207-37.996817148.026520445.1062018-06-1712:53:02437.5220:10:33142.95
10ID14833580882.00.023.17213.027.0-38.225456147.425515-37.642798147.12410470.0512018-03-1909:59:1069.3011:08:2796.95
11ID16267983952.00.019.75423.026.0-37.625368145.838281-36.789955147.133916147.7912018-02-2817:40:59138.9219:59:54117.48
12ID52775490093.01.012.8074.06.0-36.855984142.929596-36.906838145.696986246.4652018-03-2607:55:48201.4911:17:17173.32
13ID19509288832.00.022.33233.019.0-38.894115144.457143-37.173740144.152105193.3652018-06-2816:05:55179.7319:05:38117.67
14ID51437386482.01.025.88033.05.0-38.872372144.606034-37.553304145.120753153.5802018-05-0908:33:29144.1010:57:34132.98
15ID51328979102.01.038.6917.015.0-38.844622144.093195-38.476630145.849992158.1022018-01-0515:55:00148.1518:23:09147.73
16ID12908898021.00.030.74219.014.0-37.178059144.403991-37.713867146.382965184.7832018-06-0513:03:43185.6016:09:1884.82
17ID52263555351.01.018.05510.018.0-37.141728147.256091-37.983127143.290272362.2272018-05-0407:42:51357.3213:40:10116.35
18ID18989783121.00.05.98623.018.0-37.706960145.718119-37.983127143.182464224.9972018-05-2312:53:12224.5116:37:4282.47
19ID52849086192.01.030.66413.014.0-38.250238147.366610-37.774617146.50356892.3712018-07-1112:52:5789.2914:22:14143.96
20ID15855564063.00.024.9423.039.0-38.322643145.505910-38.453191148.300405244.2512018-03-1422:16:33199.741:36:17167.45
21ID19019627791.00.034.01227.037.0-37.516090146.969053-38.852488147.816987166.2362018-07-1716:26:55167.6519:14:3488.98
22ID55902790601.01.055.2297.06.0-38.876145143.911302-36.875012145.759049275.6332018-05-0112:42:19273.5217:15:50688.24
23ID14737180593.00.040.74133.017.0-38.756634144.375564-38.817727147.071495234.0142018-06-2101:04:15191.664:15:54165.37
24ID55516467343.01.015.41935.021.0-36.922413146.362349-37.240137143.768208233.0682018-01-0507:19:48190.9110:30:42174.52
25ID17721229342.00.010.66329.01.0-37.019910142.798295-38.358124143.943955179.9292018-02-2704:42:03167.707:29:44124.95
26ID19876088521.00.020.6859.035.0-38.970092145.435801-37.068624146.317717225.3492018-05-1515:35:10224.8519:20:0080.66
27ID12493523582.00.010.2728.034.0-38.485641144.522135-38.505484143.311788105.4712018-02-0506:13:29101.027:54:30102.09
28ID16116144501.00.07.37316.038.0-36.600877143.566811-37.804113142.793139150.4832018-06-1814:01:42152.4016:34:0682.11
29ID12623792992.00.031.35813.016.0-38.195911147.436921-36.734698143.764279362.9412018-04-2422:40:54331.594:12:29147.21
30ID54982167772.01.06.38340.029.0-37.692167147.890721-37.273398142.951899438.6982018-05-0910:31:10399.4317:10:35152.50
...................................................
37873NaN1.00.039.0387.022.0-38.664403143.702592-36.670478144.312065228.3612018-02-0218:03:25227.7721:51:1187.87
37874NaN1.00.07.17114.01.0-37.589382146.559422-38.229669143.710836260.1252018-02-0322:11:31258.512:30:01106.47
37875NaN1.00.039.1415.01.0-37.491746145.168879-38.168680143.936761131.9592018-04-1922:54:02134.481:08:3094.83
37876NaN2.00.042.33025.05.0-36.618297147.643165-37.563856145.038804254.0692018-01-2711:06:09234.0915:00:14116.10
37877NaN1.01.043.96525.038.0-36.542082147.863736-37.549355142.980000448.1042018-06-1422:15:50440.425:36:15149.26
37878ID15255650312.00.027.5607.035.0-38.689638144.093996-37.046945146.460389276.8942018-02-2223:43:48254.533:58:19135.68
37879NaN1.00.018.54036.018.0-36.974023145.036046-38.015640142.955101217.2992018-05-2807:14:50217.0610:51:5376.31
37880ID12723924581.00.044.22640.03.0-37.740012147.768225-38.276119145.403536215.8132018-02-2121:28:22215.631:03:59105.89
37881ID19091313992.00.06.41525.027.0-36.585246147.853062-37.700285147.027318144.1352018-01-1608:26:33135.6410:42:11100.45
37882ID54203553453.01.030.4713.015.0-38.345519145.542105-38.712933145.97753755.7722018-07-2200:13:5450.941:04:50183.73
37883ID51647530162.01.06.5535.015.0-37.625311145.097611-38.630132146.040376139.0182018-01-1416:51:56131.0619:02:59145.71
37884ID11221032111.00.037.85038.010.0-37.510578142.826103-37.337327147.315144397.2782018-04-2714:54:35391.2421:25:49102.88
37885NaN1.00.013.54434.038.0-38.528632143.386140-37.561137142.835312118.0282018-04-0911:48:46120.9913:49:4569.88
37886ID19704025792.00.038.6561.032.0-38.171542143.871582-36.756484148.322553423.5852018-03-1801:01:33385.907:27:26142.19
37887ID13883850491.00.023.69910.010.0-37.352794147.476662-37.104700147.41702728.1182018-05-2400:41:0833.991:15:0783.77
37888NaN1.01.034.92336.028.0-37.044972144.920592-38.218799148.050871305.3072018-07-2003:13:54302.238:16:07130.70
37889ID12816537471.00.034.1308.036.0-38.434373144.730220-37.047801145.309517162.5542018-02-0606:27:55164.089:11:5969.44
37890ID53490857721.01.018.28622.014.0-36.720129144.588398-37.695893146.430793196.1542018-02-1517:04:44196.6020:21:19121.11
37891ID59723374821.01.011.53833.02.0-38.836802144.357057-37.549931148.306793374.0242018-05-2602:07:50368.738:16:33141.06
37892NaN3.01.05.41620.033.0-38.959090148.294700-38.930545144.661887314.5142018-04-2720:53:02255.211:08:14185.96
37893ID15396500342.00.034.3558.039.0-38.520278144.408786-38.447195148.416066349.2512018-05-0907:26:49319.3312:46:08121.30
37894NaN2.00.041.23238.039.0-37.657406142.777301-38.622040148.366529500.9012018-07-0208:59:29455.1416:34:37139.79
37895ID17969432111.00.044.34123.024.0-37.777223146.024184-38.913981142.913934299.5522018-02-2005:08:12296.6610:04:51113.70
37896ID54298837492.01.017.79811.040.0-38.045551146.736254-37.633007147.63927391.7112018-05-0310:19:3288.7011:48:14130.51
37897NaN1.00.08.8659.02.0-38.839254145.226776-37.695101148.251214293.3942018-03-1112:18:21290.7017:09:0288.71
37898NaN3.01.027.15339.016.0-38.446310148.292498-36.739777143.604529454.9682018-07-2308:29:19366.0914:35:24188.49
37899ID58625529911.01.040.3639.038.0-38.983710145.320518-37.673908142.879230258.2592018-06-2615:55:37256.7020:12:18122.98
37900ID53391040821.01.035.95513.032.0-38.292301147.562013-36.605285148.293183198.5972018-03-1916:41:10198.9720:00:08118.47
37901ID54687878662.01.029.56633.023.0-38.853243144.508346-37.727691145.662270160.8162018-02-2604:22:30150.586:53:04161.96
37902ID14481267683.00.044.07036.034.0-37.129313145.266426-38.428477143.341632222.6872018-07-0708:01:42182.7111:04:24144.41

37902 rows × 16 columns

# axis=1针对的是行;=0针对的是列
data[data.isnull().any(axis=1)].shape
#(data.isnull().sum(axis=1) >= 1).sum()
data[data.isnull().any(axis=1)].shape
data.isnull().any(axis=1)
0        False
1         True
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
37873     True
37874     True
37875     True
37876     True
37877     True
37878    False
37879     True
37880    False
37881    False
37882    False
37883    False
37884    False
37885     True
37886    False
37887    False
37888     True
37889    False
37890    False
37891    False
37892     True
37893    False
37894     True
37895    False
37896    False
37897     True
37898     True
37899    False
37900    False
37901    False
37902    False
dtype: bool

4.0 数据清洗

任务9: 填补 'Id'列的空值

任务9.1 统计‘id’列有多少个空值

data['Id'].isnull().sum()
25

任务9.2 找出所有‘id’为空的行

data[data['Id'].isnull()]
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
37844NaN1.00.022.49830.015.0-37.885792144.875305-38.680341145.874064124.2522018-01-2813:07:09127.0215:14:1074.69
37845NaN3.00.032.30016.036.0-36.571169143.741010-36.993435144.983048120.2972018-03-2602:34:49101.884:16:41162.20
37846NaN3.00.018.60138.029.0-37.694132142.851548-37.058014142.87369870.8382018-03-0220:33:1162.8321:36:00142.10
37850NaN2.01.028.20318.021.0-38.139260143.345778-37.279348143.60418998.3912018-04-1703:28:4194.685:03:21152.30
37851NaN2.01.045.69628.01.0-38.152835147.793072-38.162103144.043047328.2202018-03-2513:50:57300.5018:51:27167.39
37852NaN1.00.027.14338.032.0-37.613135142.854194-36.713765148.383062500.5002018-05-1208:54:52491.1317:05:5997.44
37854NaN1.00.013.00227.03.0-37.428594147.056992-38.383631145.590528167.0052018-02-2003:35:43168.396:24:0692.12
37857NaN1.00.019.4685.031.0-37.570514145.253281-37.824303143.862472125.7162018-04-2011:48:14128.4313:56:3969.86
37860NaN1.00.010.64721.05.0-37.239046143.524656-37.546884145.204154152.4342018-06-1323:05:39154.291:39:5695.65
37861NaN2.00.040.77522.019.0-36.678761144.345069-37.167099144.25704354.9222018-03-2704:31:4355.755:27:28120.70
37863NaN1.00.035.4109.016.0-39.006941145.406988-36.525200143.517906322.3992018-07-2106:48:39318.7712:07:2588.14
37866NaN1.01.040.15114.034.0-37.644061146.625820-38.518928143.369714301.4472018-04-1407:29:13298.5012:27:43118.89
37869NaN3.01.044.55933.010.0-38.733722144.474460-37.220132147.498884314.3232018-04-1804:57:37255.069:12:40201.89
37873NaN1.00.039.0387.022.0-38.664403143.702592-36.670478144.312065228.3612018-02-0218:03:25227.7721:51:1187.87
37874NaN1.00.07.17114.01.0-37.589382146.559422-38.229669143.710836260.1252018-02-0322:11:31258.512:30:01106.47
37875NaN1.00.039.1415.01.0-37.491746145.168879-38.168680143.936761131.9592018-04-1922:54:02134.481:08:3094.83
37876NaN2.00.042.33025.05.0-36.618297147.643165-37.563856145.038804254.0692018-01-2711:06:09234.0915:00:14116.10
37877NaN1.01.043.96525.038.0-36.542082147.863736-37.549355142.980000448.1042018-06-1422:15:50440.425:36:15149.26
37879NaN1.00.018.54036.018.0-36.974023145.036046-38.015640142.955101217.2992018-05-2807:14:50217.0610:51:5376.31
37885NaN1.00.013.54434.038.0-38.528632143.386140-37.561137142.835312118.0282018-04-0911:48:46120.9913:49:4569.88
37888NaN1.01.034.92336.028.0-37.044972144.920592-38.218799148.050871305.3072018-07-2003:13:54302.238:16:07130.70
37892NaN3.01.05.41620.033.0-38.959090148.294700-38.930545144.661887314.5142018-04-2720:53:02255.211:08:14185.96
37894NaN2.00.041.23238.039.0-37.657406142.777301-38.622040148.366529500.9012018-07-0208:59:29455.1416:34:37139.79
37897NaN1.00.08.8659.02.0-38.839254145.226776-37.695101148.251214293.3942018-03-1112:18:21290.7017:09:0288.71
37898NaN3.01.027.15339.016.0-38.446310148.292498-36.739777143.604529454.9682018-07-2308:29:19366.0914:35:24188.49

9.2.1 删除 除ID列之外其余数据重复的行

data[
    ['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
       'Destination Region', 'Origin Latitude', 'Origin Longitude',
       'Destination Latitude', 'Destination Longitude', 'Journey Distance',
       'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
       'Delivery Fare']
]
Drone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
02.00.021.68619.038.0-37.089338144.429529-37.639134142.891391149.2122018-01-1609:38:17140.1911:58:2899.25
1NaN0.039.07515.015.0-38.481935146.009567-38.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
22.00.07.24333.028.0-38.754167144.509664-38.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
32.00.013.38310.038.0-37.240526147.568019-37.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
42.00.08.1231.08.0-38.143985143.798292-38.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
52.00.032.8592.028.0-37.421211148.044072-38.159627148.19404883.2502018-05-1516:35:5081.1217:56:57113.88
61.00.020.61629.036.0-37.173949143.140662-37.021605145.197043183.3632018-04-0119:31:12184.2222:35:2585.60
72.00.044.57736.031.0-37.123190145.236196-37.667199143.877650134.5432018-05-0118:39:36127.0520:46:38114.22
81.00.015.36320.030.0-38.850561148.317253-38.024914144.823938318.1322018-05-2714:48:17314.6420:02:5587.39
91.01.036.19018.028.0-38.070189142.950207-37.996817148.026520445.1062018-06-1712:53:02437.5220:10:33142.95
102.00.023.17213.027.0-38.225456147.425515-37.642798147.12410470.0512018-03-1909:59:1069.3011:08:2796.95
112.00.019.75423.026.0-37.625368145.838281-36.789955147.133916147.7912018-02-2817:40:59138.9219:59:54117.48
123.01.012.8074.06.0-36.855984142.929596-36.906838145.696986246.4652018-03-2607:55:48201.4911:17:17173.32
132.00.022.33233.019.0-38.894115144.457143-37.173740144.152105193.3652018-06-2816:05:55179.7319:05:38117.67
142.01.025.88033.05.0-38.872372144.606034-37.553304145.120753153.5802018-05-0908:33:29144.1010:57:34132.98
152.01.038.6917.015.0-38.844622144.093195-38.476630145.849992158.1022018-01-0515:55:00148.1518:23:09147.73
161.00.030.74219.014.0-37.178059144.403991-37.713867146.382965184.7832018-06-0513:03:43185.6016:09:1884.82
171.01.018.05510.018.0-37.141728147.256091-37.983127143.290272362.2272018-05-0407:42:51357.3213:40:10116.35
181.00.05.98623.018.0-37.706960145.718119-37.983127143.182464224.9972018-05-2312:53:12224.5116:37:4282.47
192.01.030.66413.014.0-38.250238147.366610-37.774617146.50356892.3712018-07-1112:52:5789.2914:22:14143.96
203.00.024.9423.039.0-38.322643145.505910-38.453191148.300405244.2512018-03-1422:16:33199.741:36:17167.45
211.00.034.01227.037.0-37.516090146.969053-38.852488147.816987166.2362018-07-1716:26:55167.6519:14:3488.98
221.01.055.2297.06.0-38.876145143.911302-36.875012145.759049275.6332018-05-0112:42:19273.5217:15:50688.24
233.00.040.74133.017.0-38.756634144.375564-38.817727147.071495234.0142018-06-2101:04:15191.664:15:54165.37
243.01.015.41935.021.0-36.922413146.362349-37.240137143.768208233.0682018-01-0507:19:48190.9110:30:42174.52
252.00.010.66329.01.0-37.019910142.798295-38.358124143.943955179.9292018-02-2704:42:03167.707:29:44124.95
261.00.020.6859.035.0-38.970092145.435801-37.068624146.317717225.3492018-05-1515:35:10224.8519:20:0080.66
272.00.010.2728.034.0-38.485641144.522135-38.505484143.311788105.4712018-02-0506:13:29101.027:54:30102.09
281.00.07.37316.038.0-36.600877143.566811-37.804113142.793139150.4832018-06-1814:01:42152.4016:34:0682.11
292.00.031.35813.016.0-38.195911147.436921-36.734698143.764279362.9412018-04-2422:40:54331.594:12:29147.21
................................................
378731.00.039.0387.022.0-38.664403143.702592-36.670478144.312065228.3612018-02-0218:03:25227.7721:51:1187.87
378741.00.07.17114.01.0-37.589382146.559422-38.229669143.710836260.1252018-02-0322:11:31258.512:30:01106.47
378751.00.039.1415.01.0-37.491746145.168879-38.168680143.936761131.9592018-04-1922:54:02134.481:08:3094.83
378762.00.042.33025.05.0-36.618297147.643165-37.563856145.038804254.0692018-01-2711:06:09234.0915:00:14116.10
378771.01.043.96525.038.0-36.542082147.863736-37.549355142.980000448.1042018-06-1422:15:50440.425:36:15149.26
378782.00.027.5607.035.0-38.689638144.093996-37.046945146.460389276.8942018-02-2223:43:48254.533:58:19135.68
378791.00.018.54036.018.0-36.974023145.036046-38.015640142.955101217.2992018-05-2807:14:50217.0610:51:5376.31
378801.00.044.22640.03.0-37.740012147.768225-38.276119145.403536215.8132018-02-2121:28:22215.631:03:59105.89
378812.00.06.41525.027.0-36.585246147.853062-37.700285147.027318144.1352018-01-1608:26:33135.6410:42:11100.45
378823.01.030.4713.015.0-38.345519145.542105-38.712933145.97753755.7722018-07-2200:13:5450.941:04:50183.73
378832.01.06.5535.015.0-37.625311145.097611-38.630132146.040376139.0182018-01-1416:51:56131.0619:02:59145.71
378841.00.037.85038.010.0-37.510578142.826103-37.337327147.315144397.2782018-04-2714:54:35391.2421:25:49102.88
378851.00.013.54434.038.0-38.528632143.386140-37.561137142.835312118.0282018-04-0911:48:46120.9913:49:4569.88
378862.00.038.6561.032.0-38.171542143.871582-36.756484148.322553423.5852018-03-1801:01:33385.907:27:26142.19
378871.00.023.69910.010.0-37.352794147.476662-37.104700147.41702728.1182018-05-2400:41:0833.991:15:0783.77
378881.01.034.92336.028.0-37.044972144.920592-38.218799148.050871305.3072018-07-2003:13:54302.238:16:07130.70
378891.00.034.1308.036.0-38.434373144.730220-37.047801145.309517162.5542018-02-0606:27:55164.089:11:5969.44
378901.01.018.28622.014.0-36.720129144.588398-37.695893146.430793196.1542018-02-1517:04:44196.6020:21:19121.11
378911.01.011.53833.02.0-38.836802144.357057-37.549931148.306793374.0242018-05-2602:07:50368.738:16:33141.06
378923.01.05.41620.033.0-38.959090148.294700-38.930545144.661887314.5142018-04-2720:53:02255.211:08:14185.96
378932.00.034.3558.039.0-38.520278144.408786-38.447195148.416066349.2512018-05-0907:26:49319.3312:46:08121.30
378942.00.041.23238.039.0-37.657406142.777301-38.622040148.366529500.9012018-07-0208:59:29455.1416:34:37139.79
378951.00.044.34123.024.0-37.777223146.024184-38.913981142.913934299.5522018-02-2005:08:12296.6610:04:51113.70
378962.01.017.79811.040.0-38.045551146.736254-37.633007147.63927391.7112018-05-0310:19:3288.7011:48:14130.51
378971.00.08.8659.02.0-38.839254145.226776-37.695101148.251214293.3942018-03-1112:18:21290.7017:09:0288.71
378983.01.027.15339.016.0-38.446310148.292498-36.739777143.604529454.9682018-07-2308:29:19366.0914:35:24188.49
378991.01.040.3639.038.0-38.983710145.320518-37.673908142.879230258.2592018-06-2615:55:37256.7020:12:18122.98
379001.01.035.95513.032.0-38.292301147.562013-36.605285148.293183198.5972018-03-1916:41:10198.9720:00:08118.47
379012.01.029.56633.023.0-38.853243144.508346-37.727691145.662270160.8162018-02-2604:22:30150.586:53:04161.96
379023.00.044.07036.034.0-37.129313145.266426-38.428477143.341632222.6872018-07-0708:01:42182.7111:04:24144.41

37903 rows × 15 columns

# drop_duplicates返回一个dataframe,重复的行会标为False
# data_1 = data.drop_duplicates(data[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
#        'Destination Region', 'Origin Latitude', 'Origin Longitude',
#        'Destination Latitude', 'Destination Longitude', 'Journey Distance',
#        'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
#        'Delivery Fare']])
# 2
data_1 = data.drop_duplicates(['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
    'Destination Region', 'Origin Latitude', 'Origin Longitude',
    'Destination Latitude', 'Destination Longitude', 'Journey Distance',
    'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
    'Delivery Fare'])
data_1
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
0ID16452821282.00.021.68619.038.0-37.089338144.429529-37.639134142.891391149.2122018-01-1609:38:17140.1911:58:2899.25
1ID1697620764NaN0.039.07515.015.0-38.481935146.009567-38.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
2ID15439335032.00.07.24333.028.0-38.754167144.509664-38.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
3ID17565176082.00.013.38310.038.0-37.240526147.568019-37.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
4ID18323258342.00.08.1231.08.0-38.143985143.798292-38.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
...................................................
37839ID18794230812.00.017.08120.014.0-38.860358148.174855-37.562224146.443991209.2782018-02-1512:09:15193.9815:23:13121.33
37840ID57058408411.01.035.16429.023.0-37.250331142.837244-37.739285145.963420281.4032018-01-0100:07:35279.104:46:41136.58
37841ID12762392093.00.036.70412.011.0-36.578568145.273744-38.305557146.997297245.2692018-01-2406:48:05200.5410:08:37146.69
37842ID14328685833.00.013.19537.021.0-38.755802147.744770-37.487866143.585293390.5992018-04-0718:57:52315.280:13:08170.52
37889ID12816537471.00.034.1308.036.0-38.434373144.730220-37.047801145.309517162.5542018-02-0606:27:55164.089:11:5969.44

37844 rows × 16 columns

9.2.2 设置ID列相同,但其余数据不重复的ID为np.nan

# 返回一个布尔型的series,表示id是否重复行
data_1['Id'].duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
37839    False
37840    False
37841    False
37842    False
37889     True
Name: Id, Length: 37844, dtype: bool
(data_1['Id'].duplicated()) & ~(data_1[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
       'Destination Region', 'Origin Latitude', 'Origin Longitude',
       'Destination Latitude', 'Destination Longitude', 'Journey Distance',
       'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
       'Delivery Fare']].duplicated())
0        False
1        False
2        False
3        False
4        False
         ...  
37839    False
37840    False
37841    False
37842    False
37889     True
Length: 37844, dtype: bool
data_1.loc[(data_1['Id'].duplicated()) & ~(data_1[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
       'Destination Region', 'Origin Latitude', 'Origin Longitude',
       'Destination Latitude', 'Destination Longitude', 'Journey Distance',
       'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
       'Delivery Fare']].duplicated()),'Id'] = np.nan
data_1['Id'].value_counts()
ID1858018960    1
ID1792956453    1
ID1574575344    1
ID1146475421    1
ID1897230811    1
               ..
ID5349642032    1
ID5451065609    1
ID5131869013    1
ID1550669799    1
ID5260585195    1
Name: Id, Length: 37843, dtype: int64

任务9.3 想出空值的填补的2种方法,并且选一种应用

#方法1 :随机数
import random
# 'Id'+str(random.randint(1000000000,2000000000)) in data_1['Id']
#方法2 :随机数填补

def fill_id(Id):
    Id_out = Id
    while Id_out in data_1['Id'].tolist():
        Id_out = 'ID'+str(random.randint(1000000000,2000000000))
        return Id_out
data_1.loc[(data_1['Post Type']==0)&(data_1['Id'].isnull()),'Id'].apply(fill_id)
37889    ID1117118084
Name: Id, dtype: object
data_1.loc[(data_1['Post Type']==0)&(data_1['Id'].isnull()),'Id']
37889    NaN
Name: Id, dtype: object
data_1['Id']
0        ID1645282128
1        ID1697620764
2        ID1543933503
3        ID1756517608
4        ID1832325834
             ...     
37839    ID1879423081
37840    ID5705840841
37841    ID1276239209
37842    ID1432868583
37889             NaN
Name: Id, Length: 37844, dtype: object

任务9.4 检查‘id’是否还有空值

data_1['Id'].isnull().sum()
1

任务10:找出所有重复的id

data[data.Id.duplicated()].Id
37843    ID1874340610
37845             NaN
37846             NaN
37847    ID5156350605
37848    ID1176413101
             ...     
37898             NaN
37899    ID5862552991
37900    ID5339104082
37901    ID5468787866
37902    ID1448126768
Name: Id, Length: 59, dtype: object
data['Id'].value_counts() >= 2
ID5281864060     True
ID1796943211     True
ID1877344172     True
ID5122284320     True
ID1238297934     True
                ...  
ID5672029782    False
ID1114364309    False
ID5495523518    False
ID1874532678    False
ID5260585195    False
Name: Id, Length: 37843, dtype: bool

任务11:删除重复行

data.duplicated()
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
37873    False
37874    False
37875    False
37876    False
37877    False
37878     True
37879    False
37880     True
37881     True
37882     True
37883     True
37884     True
37885    False
37886     True
37887     True
37888    False
37889    False
37890     True
37891     True
37892    False
37893     True
37894    False
37895     True
37896     True
37897    False
37898    False
37899     True
37900     True
37901     True
37902     True
dtype: bool
data.drop_duplicates()
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
0ID16452821282.00.021.68619.038.0-37.089338144.429529-37.639134142.891391149.2122018-01-1609:38:17140.1911:58:2899.25
1ID1697620764NaN0.039.07515.015.0-38.481935146.009567-38.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
2ID15439335032.00.07.24333.028.0-38.754167144.509664-38.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
3ID17565176082.00.013.38310.038.0-37.240526147.568019-37.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
4ID18323258342.00.08.1231.08.0-38.143985143.798292-38.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
5ID18024485762.00.032.8592.028.0-37.421211148.044072-38.159627148.19404883.2502018-05-1516:35:5081.1217:56:57113.88
6ID19402314081.00.020.61629.036.0-37.173949143.140662-37.021605145.197043183.3632018-04-0119:31:12184.2222:35:2585.60
7ID12993039582.00.044.57736.031.0-37.123190145.236196-37.667199143.877650134.5432018-05-0118:39:36127.0520:46:38114.22
8ID17527220281.00.015.36320.030.0-38.850561148.317253-38.024914144.823938318.1322018-05-2714:48:17314.6420:02:5587.39
9ID59952435901.01.036.19018.028.0-38.070189142.950207-37.996817148.026520445.1062018-06-1712:53:02437.5220:10:33142.95
10ID14833580882.00.023.17213.027.0-38.225456147.425515-37.642798147.12410470.0512018-03-1909:59:1069.3011:08:2796.95
11ID16267983952.00.019.75423.026.0-37.625368145.838281-36.789955147.133916147.7912018-02-2817:40:59138.9219:59:54117.48
12ID52775490093.01.012.8074.06.0-36.855984142.929596-36.906838145.696986246.4652018-03-2607:55:48201.4911:17:17173.32
13ID19509288832.00.022.33233.019.0-38.894115144.457143-37.173740144.152105193.3652018-06-2816:05:55179.7319:05:38117.67
14ID51437386482.01.025.88033.05.0-38.872372144.606034-37.553304145.120753153.5802018-05-0908:33:29144.1010:57:34132.98
15ID51328979102.01.038.6917.015.0-38.844622144.093195-38.476630145.849992158.1022018-01-0515:55:00148.1518:23:09147.73
16ID12908898021.00.030.74219.014.0-37.178059144.403991-37.713867146.382965184.7832018-06-0513:03:43185.6016:09:1884.82
17ID52263555351.01.018.05510.018.0-37.141728147.256091-37.983127143.290272362.2272018-05-0407:42:51357.3213:40:10116.35
18ID18989783121.00.05.98623.018.0-37.706960145.718119-37.983127143.182464224.9972018-05-2312:53:12224.5116:37:4282.47
19ID52849086192.01.030.66413.014.0-38.250238147.366610-37.774617146.50356892.3712018-07-1112:52:5789.2914:22:14143.96
20ID15855564063.00.024.9423.039.0-38.322643145.505910-38.453191148.300405244.2512018-03-1422:16:33199.741:36:17167.45
21ID19019627791.00.034.01227.037.0-37.516090146.969053-38.852488147.816987166.2362018-07-1716:26:55167.6519:14:3488.98
22ID55902790601.01.055.2297.06.0-38.876145143.911302-36.875012145.759049275.6332018-05-0112:42:19273.5217:15:50688.24
23ID14737180593.00.040.74133.017.0-38.756634144.375564-38.817727147.071495234.0142018-06-2101:04:15191.664:15:54165.37
24ID55516467343.01.015.41935.021.0-36.922413146.362349-37.240137143.768208233.0682018-01-0507:19:48190.9110:30:42174.52
25ID17721229342.00.010.66329.01.0-37.019910142.798295-38.358124143.943955179.9292018-02-2704:42:03167.707:29:44124.95
26ID19876088521.00.020.6859.035.0-38.970092145.435801-37.068624146.317717225.3492018-05-1515:35:10224.8519:20:0080.66
27ID12493523582.00.010.2728.034.0-38.485641144.522135-38.505484143.311788105.4712018-02-0506:13:29101.027:54:30102.09
28ID16116144501.00.07.37316.038.0-36.600877143.566811-37.804113142.793139150.4832018-06-1814:01:42152.4016:34:0682.11
29ID12623792992.00.031.35813.016.0-38.195911147.436921-36.734698143.764279362.9412018-04-2422:40:54331.594:12:29147.21
...................................................
37839ID18794230812.00.017.08120.014.0-38.860358148.174855-37.562224146.443991209.2782018-02-1512:09:15193.9815:23:13121.33
37840ID57058408411.01.035.16429.023.0-37.250331142.837244-37.739285145.963420281.4032018-01-0100:07:35279.104:46:41136.58
37841ID12762392093.00.036.70412.011.0-36.578568145.273744-38.305557146.997297245.2692018-01-2406:48:05200.5410:08:37146.69
37842ID14328685833.00.013.19537.021.0-38.755802147.744770-37.487866143.585293390.5992018-04-0718:57:52315.280:13:08170.52
37844NaN1.00.022.49830.015.0-37.885792144.875305-38.680341145.874064124.2522018-01-2813:07:09127.0215:14:1074.69
37845NaN3.00.032.30016.036.0-36.571169143.741010-36.993435144.983048120.2972018-03-2602:34:49101.884:16:41162.20
37846NaN3.00.018.60138.029.0-37.694132142.851548-37.058014142.87369870.8382018-03-0220:33:1162.8321:36:00142.10
37850NaN2.01.028.20318.021.0-38.139260143.345778-37.279348143.60418998.3912018-04-1703:28:4194.685:03:21152.30
37851NaN2.01.045.69628.01.0-38.152835147.793072-38.162103144.043047328.2202018-03-2513:50:57300.5018:51:27167.39
37852NaN1.00.027.14338.032.0-37.613135142.854194-36.713765148.383062500.5002018-05-1208:54:52491.1317:05:5997.44
37854NaN1.00.013.00227.03.0-37.428594147.056992-38.383631145.590528167.0052018-02-2003:35:43168.396:24:0692.12
37857NaN1.00.019.4685.031.0-37.570514145.253281-37.824303143.862472125.7162018-04-2011:48:14128.4313:56:3969.86
37860NaN1.00.010.64721.05.0-37.239046143.524656-37.546884145.204154152.4342018-06-1323:05:39154.291:39:5695.65
37861NaN2.00.040.77522.019.0-36.678761144.345069-37.167099144.25704354.9222018-03-2704:31:4355.755:27:28120.70
37863NaN1.00.035.4109.016.0-39.006941145.406988-36.525200143.517906322.3992018-07-2106:48:39318.7712:07:2588.14
37866NaN1.01.040.15114.034.0-37.644061146.625820-38.518928143.369714301.4472018-04-1407:29:13298.5012:27:43118.89
37869NaN3.01.044.55933.010.0-38.733722144.474460-37.220132147.498884314.3232018-04-1804:57:37255.069:12:40201.89
37873NaN1.00.039.0387.022.0-38.664403143.702592-36.670478144.312065228.3612018-02-0218:03:25227.7721:51:1187.87
37874NaN1.00.07.17114.01.0-37.589382146.559422-38.229669143.710836260.1252018-02-0322:11:31258.512:30:01106.47
37875NaN1.00.039.1415.01.0-37.491746145.168879-38.168680143.936761131.9592018-04-1922:54:02134.481:08:3094.83
37876NaN2.00.042.33025.05.0-36.618297147.643165-37.563856145.038804254.0692018-01-2711:06:09234.0915:00:14116.10
37877NaN1.01.043.96525.038.0-36.542082147.863736-37.549355142.980000448.1042018-06-1422:15:50440.425:36:15149.26
37879NaN1.00.018.54036.018.0-36.974023145.036046-38.015640142.955101217.2992018-05-2807:14:50217.0610:51:5376.31
37885NaN1.00.013.54434.038.0-38.528632143.386140-37.561137142.835312118.0282018-04-0911:48:46120.9913:49:4569.88
37888NaN1.01.034.92336.028.0-37.044972144.920592-38.218799148.050871305.3072018-07-2003:13:54302.238:16:07130.70
37889ID12816537471.00.034.1308.036.0-38.434373144.730220-37.047801145.309517162.5542018-02-0606:27:55164.089:11:5969.44
37892NaN3.01.05.41620.033.0-38.959090148.294700-38.930545144.661887314.5142018-04-2720:53:02255.211:08:14185.96
37894NaN2.00.041.23238.039.0-37.657406142.777301-38.622040148.366529500.9012018-07-0208:59:29455.1416:34:37139.79
37897NaN1.00.08.8659.02.0-38.839254145.226776-37.695101148.251214293.3942018-03-1112:18:21290.7017:09:0288.71
37898NaN3.01.027.15339.016.0-38.446310148.292498-36.739777143.604529454.9682018-07-2308:29:19366.0914:35:24188.49

37869 rows × 16 columns

任务12:填补 'Post Type'的空值

提示:可能与id有关

data_1[data_1['Post Type'].isnull()]
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
2179ID57428607331.0NaN27.12622.034.0-36.679647144.463364-38.469625143.202877228.1812018-02-2814:01:50227.5917:49:25114.27
4212ID17104529072.0NaN37.46622.021.0-36.695380144.474875-37.357030143.508873113.1132018-07-2811:30:33107.8613:18:24103.59
4219ID17870367881.0NaN30.8773.06.0-38.311620145.478355-36.832092145.600747165.0502018-07-1811:58:01166.5014:44:3175.01
4253ID16800280382.0NaN13.86022.019.0-36.837363144.515369-37.089526144.53140228.1062018-06-1704:43:0031.745:14:44120.43
6299ID13777676191.0NaN45.41623.013.0-37.720326146.008038-38.222304147.498125142.1972018-01-1309:36:02144.3812:00:2469.85
...................................................
26778ID15715286761.0NaN15.06233.09.0-38.965718144.650812-38.962812145.29231455.5252018-04-2821:13:2560.5122:13:5589.97
28691ID58834614611.0NaN40.3254.029.0-36.542055143.024959-37.232887142.98051477.0032018-05-0104:18:0981.295:39:26124.92
30861ID19711835643.0NaN42.70019.04.0-37.120747144.287009-36.762751143.078632114.6562018-02-2516:19:2397.4317:56:48144.33
32921ID52915989332.0NaN27.1892.07.0-37.564903148.097962-38.797131143.873724394.2112018-01-0912:39:15359.5918:38:50165.54
36991ID14849860301.0NaN27.51737.036.0-38.938504147.755657-37.060130144.993660320.0032018-01-0720:24:23316.451:40:4999.58

20 rows × 16 columns

#用投递速度来区分
def speed(df):
    speed = df['Journey Distance']/df['Travel Time']
    return round(speed,2)

data_1.loc[data_1['Post Type']==0].apply(speed,axis=1)
0        1.06
1        0.88
2        1.09
3        1.10
4        1.04
         ... 
37837    0.99
37839    1.08
37841    1.22
37842    1.24
37889    0.99
Length: 26529, dtype: float64
#用投递价格来区分
def price(df):
    price = df['Delivery Fare']/df['Package Weight']
    return round(price,2)
data_1.loc[data_1['Post Type']==0,['Delivery Fare','Package Weight']].apply(price,axis=1)
0         4.58
1         3.81
2        19.53
3         9.18
4        13.78
         ...  
37837    13.20
37839     7.10
37841     4.00
37842    12.92
37889     2.03
Length: 26529, dtype: float64
#ID以5还是1开头为区别
data_1.loc[data_1['Post Type'].isnull(),'Post Type'] = data_1.loc[data_1['Post Type'].isnull(),'Id'].apply(lambda s:s[2]=='1')

任务13:修复 'Origin Longitude'与 'Origin Latitude' 列中错误的值

data_1['Origin Longitude'].describe()
#有负值
count    37844.000000
mean       145.423081
std          6.929107
min       -148.337157
25%        143.964265
50%        145.424189
75%        147.171954
max        148.450576
Name: Origin Longitude, dtype: float64
def fix_Longitude_Latitude(flt):
    if flt<0:
        return -flt
    else:
        return flt
data_1['Origin Longitude'].apply(fix_Longitude_Latitude).describe()
count    37844.000000
mean       145.577375
std          1.764044
min        142.769991
25%        143.966143
50%        145.426161
75%        147.172965
max        148.450576
Name: Origin Longitude, dtype: float64
data_1['Origin Longitude'] = data_1['Origin Longitude'].apply(fix_Longitude_Latitude)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
data_1['Origin Latitude'].describe()
count    37844.000000
mean       -37.728738
std          1.900393
min        -39.006941
25%        -38.442905
50%        -37.707244
75%        -37.094433
max         38.986998
Name: Origin Latitude, dtype: float64
data_1['Origin Latitude'] = data_1['Origin Latitude'].apply(fix_Longitude_Latitude)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

任务14:修复 'Destination Latitude' 与 'Destination Longitude'列中错误的值

data_1['Destination Latitude'] = data_1['Destination Latitude'].apply(fix_Longitude_Latitude)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
data_1['Destination Longitude'] = data_1['Destination Longitude'].apply(fix_Longitude_Latitude)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

任务 15:填补'Origin Region'和‘Destination Region’中的空值

data_1[data_1['Origin Region'].isnull()]
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
2080ID16348352351.0031.871NaN22.037.300712143.61085536.794684144.28008381.9042018-06-0903:21:4986.044:47:5195.70
4442ID13216591072.0012.991NaN10.038.180248146.77598537.396662147.31362499.2242018-04-1211:25:5495.4213:01:1998.48
8220ID15222120191.0025.282NaN23.036.756457148.43779237.665363145.818608253.2752018-03-2717:56:31251.8822:08:2394.56
8228ID19462049792.0012.779NaN25.037.391203148.11645236.520786147.557258108.9142018-07-0810:57:12104.1012:41:17102.49
15023ID18912622631.0036.466NaN7.037.537619145.20242738.723869143.685908187.2652018-03-1307:21:08188.0010:29:0872.78
20513ID14909684061.0042.648NaN9.036.654443143.69168938.849982145.479151290.6432018-02-1618:46:02288.0423:34:0499.81
20515ID51483103931.0144.505NaN7.038.973121142.98942238.823172143.99019988.2932018-04-1823:35:2792.221:07:40119.24
28710ID59413503071.0117.232NaN32.037.616289146.05285336.696768148.280560222.5632018-02-2202:17:46222.165:59:55125.46
34853ID52342947502.0129.063NaN7.037.048542142.84394538.852320143.963431223.5422018-07-1413:22:02206.7516:48:47151.47
36904ID17249436021.0031.774NaN31.036.736858143.68841637.548149143.88577091.9932018-02-1214:31:2495.8016:07:1181.57
data_1.loc[data_1['Origin Region'].notnull()]
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
0ID16452821282.0021.68619.038.037.089338144.42952937.639134142.891391149.2122018-01-1609:38:17140.1911:58:2899.25
1ID1697620764NaN039.07515.015.038.481935146.00956738.585528146.19982720.1852018-02-1004:28:1722.844:51:07149.04
2ID15439335032.007.24333.028.038.754167144.50966438.242224147.855342296.9752018-05-0501:38:03272.526:10:34141.48
3ID17565176082.0013.38310.038.037.240526147.56801937.687178142.991188407.3962018-06-1111:43:04371.4017:54:27122.82
4ID18323258342.008.1231.08.038.143985143.79829238.548315144.76922895.9742018-03-1614:50:2592.5116:22:55111.97
...................................................
37839ID18794230812.0017.08120.014.038.860358148.17485537.562224146.443991209.2782018-02-1512:09:15193.9815:23:13121.33
37840ID57058408411.0135.16429.023.037.250331142.83724437.739285145.963420281.4032018-01-0100:07:35279.104:46:41136.58
37841ID12762392093.0036.70412.011.036.578568145.27374438.305557146.997297245.2692018-01-2406:48:05200.5410:08:37146.69
37842ID14328685833.0013.19537.021.038.755802147.74477037.487866143.585293390.5992018-04-0718:57:52315.280:13:08170.52
37889NaN1.0034.1308.036.038.434373144.73022037.047801145.309517162.5542018-02-0606:27:55164.089:11:5969.44

37834 rows × 16 columns

data_2 = data_1.loc[data_1['Origin Region'].notnull()]
def fill_region(df):
    longitude,latitude = df['Origin Longitude'],df['Origin Latitude']
    distance_0 = 100
    region_out = ''

    for index in data_2.index:
        longitude_ = data_2.loc[index,'Origin Longitude']
        latitude_ = data_2.loc[index,'Origin Latitude']
        region = data_2.loc[index,'Origin Region']
        
        distance = (longitude-longitude_)**2+(latitude_-latitude)**2
        if distance<distance_0 :
            region_out = region
            distance_0 = distance
    return region_out
data_1.loc[data_1['Origin Region'].isnull()].apply(fill_region,axis=1)
2080     21.0
4442     11.0
8220     32.0
8228      2.0
15023     5.0
20513    16.0
20515    24.0
28710    23.0
34853    29.0
36904    16.0
dtype: float64
data_1.loc[data_1['Origin Region'].isnull()]
IdDrone TypePost TypePackage WeightOrigin RegionDestination RegionOrigin LatitudeOrigin LongitudeDestination LatitudeDestination LongitudeJourney DistanceDeparture DateDeparture TimeTravel TimeDelivery TimeDelivery Fare
2080ID16348352351.0031.871NaN22.037.300712143.61085536.794684144.28008381.9042018-06-0903:21:4986.044:47:5195.70
4442ID13216591072.0012.991NaN10.038.180248146.77598537.396662147.31362499.2242018-04-1211:25:5495.4213:01:1998.48
8220ID15222120191.0025.282NaN23.036.756457148.43779237.665363145.818608253.2752018-03-2717:56:31251.8822:08:2394.56
8228ID19462049792.0012.779NaN25.037.391203148.11645236.520786147.557258108.9142018-07-0810:57:12104.1012:41:17102.49
15023ID18912622631.0036.466NaN7.037.537619145.20242738.723869143.685908187.2652018-03-1307:21:08188.0010:29:0872.78
20513ID14909684061.0042.648NaN9.036.654443143.69168938.849982145.479151290.6432018-02-1618:46:02288.0423:34:0499.81
20515ID51483103931.0144.505NaN7.038.973121142.98942238.823172143.99019988.2932018-04-1823:35:2792.221:07:40119.24
28710ID59413503071.0117.232NaN32.037.616289146.05285336.696768148.280560222.5632018-02-2202:17:46222.165:59:55125.46
34853ID52342947502.0129.063NaN7.037.048542142.84394538.852320143.963431223.5422018-07-1413:22:02206.7516:48:47151.47
36904ID17249436021.0031.774NaN31.036.736858143.68841637.548149143.88577091.9932018-02-1214:31:2495.8016:07:1181.57

任务16:找出 'Departure Date'中错误的值

data_1['Departure Date'].sort_values(ascending=False)
31975    2018-30-06
4990     2018-28-06
17911    2018-28-06
28350    2018-28-05
35740    2018-28-05
            ...    
12702    2018-01-01
36755    2018-01-01
12684    2018-01-01
36782    2018-01-01
27230    2018-01-01
Name: Departure Date, Length: 37844, dtype: object
def fix_date(dt):
    split_ = dt.split('-')
    year,month,day = split_[0],split_[1],split_[2]
    if month > '08':
        return year+'-'+day+'-'+month
    else:
        return dt
data_1['Departure Date'] = data_1['Departure Date'].apply(fix_date)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
data_1['Departure Date'].sort_values(ascending=False)
17849    2018-07-28
17570    2018-07-28
988      2018-07-28
592      2018-07-28
12772    2018-07-28
            ...    
29092    2018-01-01
5232     2018-01-01
12584    2018-01-01
29041    2018-01-01
5519     2018-01-01
Name: Departure Date, Length: 37844, dtype: object

任务17:输出数据集为‘solution.csv’到当前目录下面

data_1.to_csv('solution.csv')
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值