2-6_Cleaning_Data

清洗数据

清洗和处理数据通常也是非常重要一个环节,这节提提这个内容。

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=25) #显示中文
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)

什么样的数据叫做脏数据/有问题的数据?

我们用NYC 311服务请求数据来一起看看,这个数据量不算小,同时也有一些东西确实可以处理一下。

requests = pd.read_csv('311-service-requests.csv')
requests.head()
Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
02658965110/31/2013 02:08:41 AMNaNNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk1143290-03 169 STREET169 STREET90 AVENUE91 AVENUENaNNaNADDRESSJAMAICANaNPrecinctAssigned10/31/2013 10:08:41 AM10/31/2013 02:35:17 AM12 QUEENSQUEENS1042027.0197389.0UnspecifiedQUEENSUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.708275-73.791604(40.70827532593202, -73.79160395779721)
12659369810/31/2013 02:01:04 AMNaNNYPDNew York City Police DepartmentIllegal ParkingCommercial Overnight ParkingStreet/Sidewalk1137858 AVENUE58 AVENUE58 PLACE59 STREETNaNNaNBLOCKFACEMASPETHNaNPrecinctOpen10/31/2013 10:01:04 AMNaN05 QUEENSQUEENS1009349.0201984.0UnspecifiedQUEENSUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.721041-73.909453(40.721040535628305, -73.90945306791765)
22659413910/31/2013 02:00:24 AM10/31/2013 02:40:32 AMNYPDNew York City Police DepartmentNoise - CommercialLoud Music/PartyClub/Bar/Restaurant100324060 BROADWAYBROADWAYWEST 171 STREETWEST 172 STREETNaNNaNADDRESSNEW YORKNaNPrecinctClosed10/31/2013 10:00:24 AM10/31/2013 02:39:42 AM12 MANHATTANMANHATTAN1001088.0246531.0UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.843330-73.939144(40.84332975466513, -73.93914371913482)
32659572110/31/2013 01:56:23 AM10/31/2013 02:21:48 AMNYPDNew York City Police DepartmentNoise - VehicleCar/Truck HornStreet/Sidewalk10023WEST 72 STREETWEST 72 STREETCOLUMBUS AVENUEAMSTERDAM AVENUENaNNaNBLOCKFACENEW YORKNaNPrecinctClosed10/31/2013 09:56:23 AM10/31/2013 02:21:10 AM07 MANHATTANMANHATTAN989730.0222727.0UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.778009-73.980213(40.7780087446372, -73.98021349023975)
42659093010/31/2013 01:53:44 AMNaNDOHMHDepartment of Health and Mental HygieneRodentCondition Attracting RodentsVacant Lot10027WEST 124 STREETWEST 124 STREETLENOX AVENUEADAM CLAYTON POWELL JR BOULEVARDNaNNaNBLOCKFACENEW YORKNaNNaNPending11/30/2013 01:53:44 AM10/31/2013 01:59:54 AM10 MANHATTANMANHATTAN998815.0233545.0UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.807691-73.947387(40.80769092704951, -73.94738703491433)

6.1 怎么找到脏数据?

其实也没有特别好的办法,还是得先拿点数据出来看看。比如说我们这里观察到邮政编码可能有问题的字段。

需要提到的一点是 .unique() 函数有很巧的用处,我们把所有出现过的邮政编码列出来(之后再看看分布?),也许会有一些想法。

下面我们就把unique()用起来,然后你会发现,确确实实是存在一些问题的,比如:

  • 为什么大部分被解析出数值,而有些被解析出字符串了?
  • 好多缺省值(nan
  • 格式不一样,有些是29616-0759,有些是83
  • 有一些pandas不认的,比如’N/A’或者’NO CLUE’

那我们能做什么呢?

  • 规整’N/A’和’NO CLUE’到缺省值的“队列”里
  • 看看83是什么鬼,然后再决定怎么处理
  • 统一一下,全处理成字符串好啦
requests['Incident Zip'].unique()
array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0,
       11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0,
       11219.0, 10025.0, 10310.0, 11236.0, nan, 10033.0, 11216.0, 10016.0,
       10305.0, 10312.0, 10026.0, 10309.0, 10036.0, 11433.0, 11235.0,
       11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0,
       10459.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0,
       10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10007.0,
       10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0,
       11211.0, 11412.0, 10458.0, 11229.0, 10065.0, 10030.0, 11222.0,
       10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0,
       10022.0, 11232.0, 11040.0, 11226.0, 10281.0, 11102.0, 11208.0,
       10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0,
       11203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0,
       10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0,
       11209.0, 10021.0, 10037.0, 11413.0, 11375.0, 11238.0, 10473.0,
       11103.0, 11354.0, 11361.0, 11106.0, 11385.0, 10463.0, 10467.0,
       11204.0, 11237.0, 11377.0, 11364.0, 11434.0, 11435.0, 11210.0,
       11228.0, 11368.0, 11694.0, 10464.0, 11415.0, 10314.0, 10301.0,
       10018.0, 10038.0, 11105.0, 11230.0, 10468.0, 11104.0, 10471.0,
       11416.0, 10075.0, 11422.0, 11355.0, 10028.0, 10462.0, 10306.0,
       10461.0, 11224.0, 11429.0, 10035.0, 11366.0, 11362.0, 11206.0,
       10460.0, 10304.0, 11360.0, 11411.0, 10455.0, 10475.0, 10069.0,
       10303.0, 10308.0, 10302.0, 11357.0, 10470.0, 11367.0, 11370.0,
       10454.0, 10451.0, 11436.0, 11426.0, 10153.0, 11004.0, 11428.0,
       11427.0, 11001.0, 11363.0, 10004.0, 10474.0, 11430.0, 10000.0,
       10307.0, 11239.0, 10119.0, 10006.0, 10048.0, 11697.0, 11692.0,
       11693.0, 10573.0, 83.0, 11559.0, 10020.0, 77056.0, 11776.0,
       70711.0, 10282.0, 11109.0, 10044.0, '10452', '11233', '10468',
       '10310', '11105', '10462', '10029', '10301', '10457', '10467',
       '10469', '11225', '10035', '10031', '11226', '10454', '11221',
       '10025', '11229', '11235', '11422', '10472', '11208', '11102',
       '10032', '11216', '10473', '10463', '11213', '10040', '10302',
       '11231', '10470', '11204', '11104', '11212', '10466', '11416',
       '11214', '10009', '11692', '11385', '11423', '11201', '10024',
       '11435', '10312', '10030', '11106', '10033', '10303', '11215',
       '11222', '11354', '10016', '10034', '11420', '10304', '10019',
       '11237', '11249', '11230', '11372', '11207', '11378', '11419',
       '11361', '10011', '11357', '10012', '11358', '10003', '10002',
       '11374', '10007', '11234', '10065', '11369', '11434', '11205',
       '11206', '11415', '11236', '11218', '11413', '10458', '11101',
       '10306', '11355', '10023', '11368', '10314', '11421', '10010',
       '10018', '11223', '10455', '11377', '11433', '11375', '10037',
       '11209', '10459', '10128', '10014', '10282', '11373', '10451',
       '11238', '11211', '10038', '11694', '11203', '11691', '11232',
       '10305', '10021', '11228', '10036', '10001', '10017', '11217',
       '11219', '10308', '10465', '11379', '11414', '10460', '11417',
       '11220', '11366', '10027', '11370', '10309', '11412', '11356',
       '10456', '11432', '10022', '10013', '11367', '11040', '10026',
       '10475', '11210', '11364', '11426', '10471', '10119', '11224',
       '11418', '11429', '11365', '10461', '11239', '10039', '00083',
       '11411', '10075', '11004', '11360', '10453', '10028', '11430',
       '10307', '11103', '10004', '10069', '10005', '10474', '11428',
       '11436', '10020', '11001', '11362', '11693', '10464', '11427',
       '10044', '11363', '10006', '10000', '02061', '77092-2016', '10280',
       '11109', '14225', '55164-0737', '19711', '07306', '000000',
       'NO CLUE', '90010', '10281', '11747', '23541', '11776', '11697',
       '11788', '07604', 10112.0, 11788.0, 11563.0, 11580.0, 7087.0,
       11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0,
       '10803', '11716', '11722', '11549-3650', '10162', '92123', '23502',
       '11518', '07020', '08807', '11577', '07114', '11003', '07201',
       '11563', '61702', '10103', '29616-0759', '35209-3114', '11520',
       '11735', '10129', '11005', '41042', '11590', 6901.0, 7208.0,
       11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0,
       10107.0], dtype=object)

6.3 处理缺省值和字符串/浮点混乱

我们可以在pd.read_csv读数据的时候,传一个na_values给它,清理掉一部分的脏数据,我们还可以指明说,我们就要保证邮政编码是字符串型的,不要给我整些数值型出来!!

na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
requests['Incident Zip'].unique()
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209', '10021', '10037', '11413', '11375', '11238', '10473',
       '11103', '11354', '11361', '11106', '11385', '10463', '10467',
       '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301',
       '10018', '10038', '11105', '11230', '10468', '11104', '10471',
       '11416', '10075', '11422', '11355', '10028', '10462', '10306',
       '10461', '11224', '11429', '10035', '11366', '11362', '11206',
       '10460', '10304', '11360', '11411', '10455', '10475', '10069',
       '10303', '10308', '10302', '11357', '10470', '11367', '11370',
       '10454', '10451', '11436', '11426', '10153', '11004', '11428',
       '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692',
       '11693', '10573', '00083', '11559', '10020', '77056', '11776',
       '70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',
       '55164-0737', '19711', '07306', '000000', '90010', '11747',
       '23541', '11788', '07604', '10112', '11563', '11580', '07087',
       '11042', '07093', '11501', '92123', '00000', '11575', '07109',
       '11797', '10803', '11716', '11722', '11549-3650', '10162', '23502',
       '11518', '07020', '08807', '11577', '07114', '11003', '07201',
       '61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
       '10129', '11005', '41042', '11590', '06901', '07208', '11530',
       '13221', '10954', '11111', '10107'], dtype=object)

6.4 那些用“-”连接的邮编是什么鬼?

requests.loc[requests['Incident Zip'].str.contains('-').fillna(False),'Incident Zip']
29136    77092-2016
30939    55164-0737
70539    11549-3650
85821    29616-0759
89304    35209-3114
Name: Incident Zip, dtype: object
rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
len(requests[rows_with_dashes])
5

真心是很烦人啊,其实只有5个,输出来看看是什么

requests[rows_with_dashes]
Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
291362655055110/24/2013 06:16:34 PMNaNDCADepartment of Consumer AffairsConsumer ComplaintFalse AdvertisingNaN770922700 EAST SELTICE WAYEAST SELTICE WAYNaNNaNNaNNaNNaNHOUSTONNaNNaNAssigned11/13/2013 11:15:20 AM10/29/2013 11:16:16 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
309392654883110/24/2013 09:35:10 AMNaNDCADepartment of Consumer AffairsConsumer ComplaintHarassmentNaN55164P.O. BOX 6443764437NaNNaNNaNNaNNaNST. PAULNaNNaNAssigned11/13/2013 02:30:21 PM10/29/2013 02:31:06 PM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
705392648841710/15/2013 03:40:33 PMNaNTLCTaxi and Limousine CommissionTaxi ComplaintDriver ComplaintStreet11549365 HOFSTRA UNIVERSITYHOFSTRA UNIVERSITYNaNNaNNaNNaNNaNHEMSTEADNaNNaNAssigned11/30/2013 01:20:33 PM10/16/2013 01:21:39 PM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNLa Guardia AirportNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
858212646829610/10/2013 12:36:43 PM10/26/2013 01:07:07 AMDCADepartment of Consumer AffairsConsumer ComplaintDebt Not OwedNaN29616PO BOX 25759BOX 25759NaNNaNNaNNaNNaNGREENVILLENaNNaNClosed10/26/2013 09:20:28 AM10/26/2013 01:07:07 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
893042646113710/09/2013 05:23:46 PM10/25/2013 01:06:41 AMDCADepartment of Consumer AffairsConsumer ComplaintHarassmentNaN35209600 BEACON PKWYBEACON PKWYNaNNaNNaNNaNNaNBIRMINGHAMNaNNaNClosed10/25/2013 02:43:42 PM10/25/2013 01:06:41 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

本来就5个,打算直接把这些都设置成缺省值(nan)的:requests['Incident Zip'][rows_with_dashes] = np.nan
后来查了查,发现可能前5位置是真实的邮编,所以干脆截取一下好了。

long_zip_codes = requests['Incident Zip'].str.len() > 5
requests['Incident Zip'][long_zip_codes].unique()
array([], dtype=object)
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

搞定啦!

妈蛋查了下00000,发现根本不是什么美国加拿大的邮编,所以这个是不能这么处理的,还真得重新设为缺省值。

requests[requests['Incident Zip'] == '00000']
Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
zero_zips = requests['Incident Zip'] == '00000'
requests.loc[zero_zips, 'Incident Zip'] = np.nan

完工!!再来看看现在的数据什么样了。

unique_zips = requests['Incident Zip'].unique()
#unique_zips.sort_values()
unique_zips
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209', '10021', '10037', '11413', '11375', '11238', '10473',
       '11103', '11354', '11361', '11106', '11385', '10463', '10467',
       '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301',
       '10018', '10038', '11105', '11230', '10468', '11104', '10471',
       '11416', '10075', '11422', '11355', '10028', '10462', '10306',
       '10461', '11224', '11429', '10035', '11366', '11362', '11206',
       '10460', '10304', '11360', '11411', '10455', '10475', '10069',
       '10303', '10308', '10302', '11357', '10470', '11367', '11370',
       '10454', '10451', '11436', '11426', '10153', '11004', '11428',
       '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692',
       '11693', '10573', '00083', '11559', '10020', '77056', '11776',
       '70711', '10282', '11109', '10044', '02061', '77092', '14225',
       '55164', '19711', '07306', '90010', '11747', '23541', '11788',
       '07604', '10112', '11563', '11580', '07087', '11042', '07093',
       '11501', '92123', '11575', '07109', '11797', '10803', '11716',
       '11722', '11549', '10162', '23502', '11518', '07020', '08807',
       '11577', '07114', '11003', '07201', '61702', '10103', '29616',
       '35209', '11520', '11735', '10129', '11005', '41042', '11590',
       '06901', '07208', '11530', '13221', '10954', '11111', '10107'],
      dtype=object)

看起来干净多了。

但是真的做完了吗?

zips = requests['Incident Zip']
# 用is_close表示0或者1开始的比较正确的邮编
is_close = zips.str.startswith('0') | zips.str.startswith('1')
# 非缺省值但不以0或者1开始的邮编认为是有些困惑的
is_far = ~(is_close) & zips.notnull()
zips[is_far]
12102    77056
13450    70711
29136    77092
30939    55164
44008    90010
47048    23541
57636    92123
71001    92123
71834    23502
80573    61702
85821    29616
89304    35209
94201    41042
Name: Incident Zip, dtype: object

可以排个序,然后对应输出一些东西

requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values('Incident Zip')
Incident ZipDescriptorCity
7183423502HarassmentNORFOLK
4704823541HarassmentNORFOLK
8582129616Debt Not OwedGREENVILLE
8930435209HarassmentBIRMINGHAM
9420141042HarassmentFLORENCE
3093955164HarassmentST. PAUL
8057361702Billing DisputeBLOOMIGTON
1345070711Contract DisputeCLIFTON
1210277056Debt Not OwedHOUSTON
2913677092False AdvertisingHOUSTON
4400890010Billing DisputeLOS ANGELES
5763692123HarassmentSAN DIEGO
7100192123Billing DisputeSAN DIEGO

咳咳,突然觉得,恩,刚才做的一大堆工作,其实只是告诉你,我们可以这样去处理和补齐数据。

但你实际上会发现,好像其实用city直接对应一下就可以补上一些东西啊。

requests['City'].str.upper().value_counts()
BROOKLYN               31662
NEW YORK               22664
BRONX                  18438
STATEN ISLAND           4766
JAMAICA                 2246
FLUSHING                1803
ASTORIA                 1568
RIDGEWOOD               1073
CORONA                   707
OZONE PARK               693
LONG ISLAND CITY         678
FAR ROCKAWAY             652
ELMHURST                 647
WOODSIDE                 609
EAST ELMHURST            562
QUEENS VILLAGE           549
FOREST HILLS             541
JACKSON HEIGHTS          541
SOUTH RICHMOND HILL      521
MASPETH                  473
WOODHAVEN                464
FRESH MEADOWS            435
SPRINGFIELD GARDENS      434
BAYSIDE                  411
SOUTH OZONE PARK         410
RICHMOND HILL            404
REGO PARK                402
MIDDLE VILLAGE           396
SAINT ALBANS             387
WHITESTONE               348
                       ...  
NEW YORK CITY              1
LYNBROOK                   1
SYRACUSE                   1
CLIFTON                    1
VALLEY STREAM              1
HEMSTEAD                   1
EAST ROCKAWAY              1
NEW YOR                    1
NEWARK AIRPORT             1
FARMINGDALE                1
RYEBROOK                   1
ROSELYN                    1
NORTH BERGEN               1
CHEEKTOWAGA                1
NJ                         1
BRIDGE  WATER              1
MINEOLA                    1
BELLEVILLE                 1
JERSEY CITY                1
BOHIEMA                    1
WOODBURY                   1
NANUET                     1
ROSLYN                     1
UNION CITY                 1
GREENVILLE                 1
STAMFORD                   1
ELIZABETH                  1
ST. PAUL                   1
LOS ANGELES                1
HASBROCK HEIGHTS           1
Name: City, Length: 100, dtype: int64

6.5 汇个总

所以汇总一下,我们在邮编这个字段,是这样做数据清洗的:

na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv', 
                       na_values=na_values, 
                       dtype={'Incident Zip': str})
def fix_zip_codes(zips):
    # Truncate everything to length 5 
    zips = zips.str.slice(0, 5)
    
    # Set 00000 zip codes to nan
    zero_zips = zips == '00000'
    zips[zero_zips] = np.nan
    
    return zips
requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
requests['Incident Zip'].unique()
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
       '11417', '10011', '11225', '11218', '10003', '10029', '10466',
       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
       '10305', '10312', '10026', '10309', '10036', '11433', '11235',
       '11213', '11379', '11101', '10014', '11231', '11234', '10457',
       '10459', '10465', '11207', '10002', '10034', '11233', '10453',
       '10456', '10469', '11374', '11221', '11421', '11215', '10007',
       '10019', '11205', '11418', '11369', '11249', '10005', '10009',
       '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212',
       '10022', '11232', '11040', '11226', '10281', '11102', '11208',
       '10001', '10472', '11414', '11223', '10040', '11220', '11373',
       '11203', '11691', '11356', '10017', '10452', '10280', '11217',
       '10031', '11201', '11358', '10128', '11423', '10039', '10010',
       '11209', '10021', '10037', '11413', '11375', '11238', '10473',
       '11103', '11354', '11361', '11106', '11385', '10463', '10467',
       '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301',
       '10018', '10038', '11105', '11230', '10468', '11104', '10471',
       '11416', '10075', '11422', '11355', '10028', '10462', '10306',
       '10461', '11224', '11429', '10035', '11366', '11362', '11206',
       '10460', '10304', '11360', '11411', '10455', '10475', '10069',
       '10303', '10308', '10302', '11357', '10470', '11367', '11370',
       '10454', '10451', '11436', '11426', '10153', '11004', '11428',
       '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692',
       '11693', '10573', '00083', '11559', '10020', '77056', '11776',
       '70711', '10282', '11109', '10044', '02061', '77092', '14225',
       '55164', '19711', '07306', '90010', '11747', '23541', '11788',
       '07604', '10112', '11563', '11580', '07087', '11042', '07093',
       '11501', '92123', '11575', '07109', '11797', '10803', '11716',
       '11722', '11549', '10162', '23502', '11518', '07020', '08807',
       '11577', '07114', '11003', '07201', '61702', '10103', '29616',
       '35209', '11520', '11735', '10129', '11005', '41042', '11590',
       '06901', '07208', '11530', '13221', '10954', '11111', '10107'],
      dtype=object)
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

安替-AnTi

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值