Cleaning Data
Cleaning and preprocessing data is usually an important part of any analysis, and that is what this section covers.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=25)  # use SimHei so Chinese characters display correctly
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)
What kind of data counts as dirty or problematic data?
Let's find out using the NYC 311 service request dataset. It is reasonably large, and it contains a few things that genuinely need cleaning up.
requests = pd.read_csv('311-service-requests.csv')
requests.head()
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027.0 | 197389.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349.0 | 201984.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088.0 | 246531.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730.0 | 222727.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | NaN | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815.0 | 233545.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
6.1 How do we find dirty data?
There is no silver bullet; you still have to pull some data out and look at it. Here, for example, a first look suggests that the zip code column (Incident Zip) may have problems.
One trick worth mentioning is .unique(): listing every zip code that appears (perhaps followed by a look at the distribution) can already spark some ideas.
So let's put unique() to work. You will see that there really are problems, for example:
- Why are most values parsed as numbers while some come out as strings? (The quick type check after the output below makes the mix explicit.)
- There are lots of missing values (nan).
- The formats are inconsistent: some values look like 29616-0759, others like 83.
- There are values pandas does not recognize as missing, such as 'N/A' or 'NO CLUE'.
So what can we do about it?
- Fold 'N/A' and 'NO CLUE' into the missing-value bucket.
- Figure out what 83 is supposed to be, then decide how to handle it.
- Make everything consistent by treating the whole column as strings.
requests['Incident Zip'].unique()
array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0,
11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0,
11219.0, 10025.0, 10310.0, 11236.0, nan, 10033.0, 11216.0, 10016.0,
10305.0, 10312.0, 10026.0, 10309.0, 10036.0, 11433.0, 11235.0,
11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0,
10459.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0,
10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10007.0,
10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0,
11211.0, 11412.0, 10458.0, 11229.0, 10065.0, 10030.0, 11222.0,
10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0,
10022.0, 11232.0, 11040.0, 11226.0, 10281.0, 11102.0, 11208.0,
10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0,
11203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0,
10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0,
11209.0, 10021.0, 10037.0, 11413.0, 11375.0, 11238.0, 10473.0,
11103.0, 11354.0, 11361.0, 11106.0, 11385.0, 10463.0, 10467.0,
11204.0, 11237.0, 11377.0, 11364.0, 11434.0, 11435.0, 11210.0,
11228.0, 11368.0, 11694.0, 10464.0, 11415.0, 10314.0, 10301.0,
10018.0, 10038.0, 11105.0, 11230.0, 10468.0, 11104.0, 10471.0,
11416.0, 10075.0, 11422.0, 11355.0, 10028.0, 10462.0, 10306.0,
10461.0, 11224.0, 11429.0, 10035.0, 11366.0, 11362.0, 11206.0,
10460.0, 10304.0, 11360.0, 11411.0, 10455.0, 10475.0, 10069.0,
10303.0, 10308.0, 10302.0, 11357.0, 10470.0, 11367.0, 11370.0,
10454.0, 10451.0, 11436.0, 11426.0, 10153.0, 11004.0, 11428.0,
11427.0, 11001.0, 11363.0, 10004.0, 10474.0, 11430.0, 10000.0,
10307.0, 11239.0, 10119.0, 10006.0, 10048.0, 11697.0, 11692.0,
11693.0, 10573.0, 83.0, 11559.0, 10020.0, 77056.0, 11776.0,
70711.0, 10282.0, 11109.0, 10044.0, '10452', '11233', '10468',
'10310', '11105', '10462', '10029', '10301', '10457', '10467',
'10469', '11225', '10035', '10031', '11226', '10454', '11221',
'10025', '11229', '11235', '11422', '10472', '11208', '11102',
'10032', '11216', '10473', '10463', '11213', '10040', '10302',
'11231', '10470', '11204', '11104', '11212', '10466', '11416',
'11214', '10009', '11692', '11385', '11423', '11201', '10024',
'11435', '10312', '10030', '11106', '10033', '10303', '11215',
'11222', '11354', '10016', '10034', '11420', '10304', '10019',
'11237', '11249', '11230', '11372', '11207', '11378', '11419',
'11361', '10011', '11357', '10012', '11358', '10003', '10002',
'11374', '10007', '11234', '10065', '11369', '11434', '11205',
'11206', '11415', '11236', '11218', '11413', '10458', '11101',
'10306', '11355', '10023', '11368', '10314', '11421', '10010',
'10018', '11223', '10455', '11377', '11433', '11375', '10037',
'11209', '10459', '10128', '10014', '10282', '11373', '10451',
'11238', '11211', '10038', '11694', '11203', '11691', '11232',
'10305', '10021', '11228', '10036', '10001', '10017', '11217',
'11219', '10308', '10465', '11379', '11414', '10460', '11417',
'11220', '11366', '10027', '11370', '10309', '11412', '11356',
'10456', '11432', '10022', '10013', '11367', '11040', '10026',
'10475', '11210', '11364', '11426', '10471', '10119', '11224',
'11418', '11429', '11365', '10461', '11239', '10039', '00083',
'11411', '10075', '11004', '11360', '10453', '10028', '11430',
'10307', '11103', '10004', '10069', '10005', '10474', '11428',
'11436', '10020', '11001', '11362', '11693', '10464', '11427',
'10044', '11363', '10006', '10000', '02061', '77092-2016', '10280',
'11109', '14225', '55164-0737', '19711', '07306', '000000',
'NO CLUE', '90010', '10281', '11747', '23541', '11776', '11697',
'11788', '07604', 10112.0, 11788.0, 11563.0, 11580.0, 7087.0,
11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0,
'10803', '11716', '11722', '11549-3650', '10162', '92123', '23502',
'11518', '07020', '08807', '11577', '07114', '11003', '07201',
'11563', '61702', '10103', '29616-0759', '35209-3114', '11520',
'11735', '10129', '11005', '41042', '11590', 6901.0, 7208.0,
11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0,
10107.0], dtype=object)
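Before fixing anything, it helps to quantify the float/string mix that the bullet list above pointed out. Here is a minimal sketch (my addition, not part of the original walkthrough) that counts how each raw value was parsed:
# Count the Python type of each parsed value; nan is counted as float here
requests['Incident Zip'].map(type).value_counts()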
6.3 Handling missing values and the string/float confusion
We can pass na_values to pd.read_csv when loading the data, which clears up part of the mess right away. We can also insist that the zip code column stay a string, so pandas does not helpfully turn any of it into numbers.
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
requests['Incident Zip'].unique()
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
'11417', '10011', '11225', '11218', '10003', '10029', '10466',
'11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
'10305', '10312', '10026', '10309', '10036', '11433', '11235',
'11213', '11379', '11101', '10014', '11231', '11234', '10457',
'10459', '10465', '11207', '10002', '10034', '11233', '10453',
'10456', '10469', '11374', '11221', '11421', '11215', '10007',
'10019', '11205', '11418', '11369', '11249', '10005', '10009',
'11211', '11412', '10458', '11229', '10065', '10030', '11222',
'10024', '10013', '11420', '11365', '10012', '11214', '11212',
'10022', '11232', '11040', '11226', '10281', '11102', '11208',
'10001', '10472', '11414', '11223', '10040', '11220', '11373',
'11203', '11691', '11356', '10017', '10452', '10280', '11217',
'10031', '11201', '11358', '10128', '11423', '10039', '10010',
'11209', '10021', '10037', '11413', '11375', '11238', '10473',
'11103', '11354', '11361', '11106', '11385', '10463', '10467',
'11204', '11237', '11377', '11364', '11434', '11435', '11210',
'11228', '11368', '11694', '10464', '11415', '10314', '10301',
'10018', '10038', '11105', '11230', '10468', '11104', '10471',
'11416', '10075', '11422', '11355', '10028', '10462', '10306',
'10461', '11224', '11429', '10035', '11366', '11362', '11206',
'10460', '10304', '11360', '11411', '10455', '10475', '10069',
'10303', '10308', '10302', '11357', '10470', '11367', '11370',
'10454', '10451', '11436', '11426', '10153', '11004', '11428',
'11427', '11001', '11363', '10004', '10474', '11430', '10000',
'10307', '11239', '10119', '10006', '10048', '11697', '11692',
'11693', '10573', '00083', '11559', '10020', '77056', '11776',
'70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',
'55164-0737', '19711', '07306', '000000', '90010', '11747',
'23541', '11788', '07604', '10112', '11563', '11580', '07087',
'11042', '07093', '11501', '92123', '00000', '11575', '07109',
'11797', '10803', '11716', '11722', '11549-3650', '10162', '23502',
'11518', '07020', '08807', '11577', '07114', '11003', '07201',
'61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
'10129', '11005', '41042', '11590', '06901', '07208', '11530',
'13221', '10954', '11111', '10107'], dtype=object)
6.4 What are those zip codes joined with a '-'?
requests.loc[requests['Incident Zip'].str.contains('-').fillna(False),'Incident Zip']
29136 77092-2016
30939 55164-0737
70539 11549-3650
85821 29616-0759
89304 35209-3114
Name: Incident Zip, dtype: object
rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
len(requests[rows_with_dashes])
5
Annoying, sure, but there are actually only 5 of them. Let's print out the full rows and see what they are.
requests[rows_with_dashes]
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29136 | 26550551 | 10/24/2013 06:16:34 PM | NaN | DCA | Department of Consumer Affairs | Consumer Complaint | False Advertising | NaN | 77092 | 2700 EAST SELTICE WAY | EAST SELTICE WAY | NaN | NaN | NaN | NaN | NaN | HOUSTON | NaN | NaN | Assigned | 11/13/2013 11:15:20 AM | 10/29/2013 11:16:16 AM | 0 Unspecified | Unspecified | NaN | NaN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
30939 | 26548831 | 10/24/2013 09:35:10 AM | NaN | DCA | Department of Consumer Affairs | Consumer Complaint | Harassment | NaN | 55164 | P.O. BOX 64437 | 64437 | NaN | NaN | NaN | NaN | NaN | ST. PAUL | NaN | NaN | Assigned | 11/13/2013 02:30:21 PM | 10/29/2013 02:31:06 PM | 0 Unspecified | Unspecified | NaN | NaN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
70539 | 26488417 | 10/15/2013 03:40:33 PM | NaN | TLC | Taxi and Limousine Commission | Taxi Complaint | Driver Complaint | Street | 11549 | 365 HOFSTRA UNIVERSITY | HOFSTRA UNIVERSITY | NaN | NaN | NaN | NaN | NaN | HEMSTEAD | NaN | NaN | Assigned | 11/30/2013 01:20:33 PM | 10/16/2013 01:21:39 PM | 0 Unspecified | Unspecified | NaN | NaN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | La Guardia Airport | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
85821 | 26468296 | 10/10/2013 12:36:43 PM | 10/26/2013 01:07:07 AM | DCA | Department of Consumer Affairs | Consumer Complaint | Debt Not Owed | NaN | 29616 | PO BOX 25759 | BOX 25759 | NaN | NaN | NaN | NaN | NaN | GREENVILLE | NaN | NaN | Closed | 10/26/2013 09:20:28 AM | 10/26/2013 01:07:07 AM | 0 Unspecified | Unspecified | NaN | NaN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89304 | 26461137 | 10/09/2013 05:23:46 PM | 10/25/2013 01:06:41 AM | DCA | Department of Consumer Affairs | Consumer Complaint | Harassment | NaN | 35209 | 600 BEACON PKWY | BEACON PKWY | NaN | NaN | NaN | NaN | NaN | BIRMINGHAM | NaN | NaN | Closed | 10/25/2013 02:43:42 PM | 10/25/2013 01:06:41 AM | 0 Unspecified | Unspecified | NaN | NaN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Since there are only 5, the original plan was to just set them all to missing (nan): requests['Incident Zip'][rows_with_dashes] = np.nan.
After a bit of digging, though, it turns out the first 5 digits are probably the real zip code, so truncating them is the better move.
long_zip_codes = requests['Incident Zip'].str.len() > 5
requests['Incident Zip'][long_zip_codes].unique()
array(['77092-2016', '55164-0737', '000000', '11549-3650', '29616-0759',
       '35209-3114'], dtype=object)
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
That should do it!
Except... checking 00000 reveals it is not a valid US (or Canadian) zip code at all, so we cannot just keep it; it really does need to be set back to missing.
requests[requests['Incident Zip'] == '00000']
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location |
---|
zero_zips = requests['Incident Zip'] == '00000'
requests.loc[zero_zips, 'Incident Zip'] = np.nan
Done!! Let's take another look at what the data looks like now.
unique_zips = requests['Incident Zip'].unique()
# unique() returns a numpy array, so np.sort(unique_zips) would sort it (there is no .sort_values() here)
unique_zips
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
'11417', '10011', '11225', '11218', '10003', '10029', '10466',
'11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
'10305', '10312', '10026', '10309', '10036', '11433', '11235',
'11213', '11379', '11101', '10014', '11231', '11234', '10457',
'10459', '10465', '11207', '10002', '10034', '11233', '10453',
'10456', '10469', '11374', '11221', '11421', '11215', '10007',
'10019', '11205', '11418', '11369', '11249', '10005', '10009',
'11211', '11412', '10458', '11229', '10065', '10030', '11222',
'10024', '10013', '11420', '11365', '10012', '11214', '11212',
'10022', '11232', '11040', '11226', '10281', '11102', '11208',
'10001', '10472', '11414', '11223', '10040', '11220', '11373',
'11203', '11691', '11356', '10017', '10452', '10280', '11217',
'10031', '11201', '11358', '10128', '11423', '10039', '10010',
'11209', '10021', '10037', '11413', '11375', '11238', '10473',
'11103', '11354', '11361', '11106', '11385', '10463', '10467',
'11204', '11237', '11377', '11364', '11434', '11435', '11210',
'11228', '11368', '11694', '10464', '11415', '10314', '10301',
'10018', '10038', '11105', '11230', '10468', '11104', '10471',
'11416', '10075', '11422', '11355', '10028', '10462', '10306',
'10461', '11224', '11429', '10035', '11366', '11362', '11206',
'10460', '10304', '11360', '11411', '10455', '10475', '10069',
'10303', '10308', '10302', '11357', '10470', '11367', '11370',
'10454', '10451', '11436', '11426', '10153', '11004', '11428',
'11427', '11001', '11363', '10004', '10474', '11430', '10000',
'10307', '11239', '10119', '10006', '10048', '11697', '11692',
'11693', '10573', '00083', '11559', '10020', '77056', '11776',
'70711', '10282', '11109', '10044', '02061', '77092', '14225',
'55164', '19711', '07306', '90010', '11747', '23541', '11788',
'07604', '10112', '11563', '11580', '07087', '11042', '07093',
'11501', '92123', '11575', '07109', '11797', '10803', '11716',
'11722', '11549', '10162', '23502', '11518', '07020', '08807',
'11577', '07114', '11003', '07201', '61702', '10103', '29616',
'35209', '11520', '11735', '10129', '11005', '41042', '11590',
'06901', '07208', '11530', '13221', '10954', '11111', '10107'],
dtype=object)
That looks much cleaner.
But are we really done?
zips = requests['Incident Zip']
# is_close marks zip codes starting with 0 or 1, which are plausible for the New York area
is_close = zips.str.startswith('0') | zips.str.startswith('1')
# is_far: non-missing zip codes that do not start with 0 or 1, which look suspicious
is_far = ~(is_close) & zips.notnull()
zips[is_far]
12102 77056
13450 70711
29136 77092
30939 55164
44008 90010
47048 23541
57636 92123
71001 92123
71834 23502
80573 61702
85821 29616
89304 35209
94201 41042
Name: Incident Zip, dtype: object
We can sort them by zip and print the Descriptor and City alongside to see what these rows actually are.
requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values('Incident Zip')
Incident Zip | Descriptor | City | |
---|---|---|---|
71834 | 23502 | Harassment | NORFOLK |
47048 | 23541 | Harassment | NORFOLK |
85821 | 29616 | Debt Not Owed | GREENVILLE |
89304 | 35209 | Harassment | BIRMINGHAM |
94201 | 41042 | Harassment | FLORENCE |
30939 | 55164 | Harassment | ST. PAUL |
80573 | 61702 | Billing Dispute | BLOOMIGTON |
13450 | 70711 | Contract Dispute | CLIFTON |
12102 | 77056 | Debt Not Owed | HOUSTON |
29136 | 77092 | False Advertising | HOUSTON |
44008 | 90010 | Billing Dispute | LOS ANGELES |
57636 | 92123 | Harassment | SAN DIEGO |
71001 | 92123 | Billing Dispute | SAN DIEGO |
Ahem. It suddenly occurs to me that the pile of work we just did mostly serves to show one way of cleaning and patching up data.
In practice you will notice that simply cross-referencing the City column could already fill in quite a bit (see the sketch after the output below).
requests['City'].str.upper().value_counts()
BROOKLYN 31662
NEW YORK 22664
BRONX 18438
STATEN ISLAND 4766
JAMAICA 2246
FLUSHING 1803
ASTORIA 1568
RIDGEWOOD 1073
CORONA 707
OZONE PARK 693
LONG ISLAND CITY 678
FAR ROCKAWAY 652
ELMHURST 647
WOODSIDE 609
EAST ELMHURST 562
QUEENS VILLAGE 549
FOREST HILLS 541
JACKSON HEIGHTS 541
SOUTH RICHMOND HILL 521
MASPETH 473
WOODHAVEN 464
FRESH MEADOWS 435
SPRINGFIELD GARDENS 434
BAYSIDE 411
SOUTH OZONE PARK 410
RICHMOND HILL 404
REGO PARK 402
MIDDLE VILLAGE 396
SAINT ALBANS 387
WHITESTONE 348
...
NEW YORK CITY 1
LYNBROOK 1
SYRACUSE 1
CLIFTON 1
VALLEY STREAM 1
HEMSTEAD 1
EAST ROCKAWAY 1
NEW YOR 1
NEWARK AIRPORT 1
FARMINGDALE 1
RYEBROOK 1
ROSELYN 1
NORTH BERGEN 1
CHEEKTOWAGA 1
NJ 1
BRIDGE WATER 1
MINEOLA 1
BELLEVILLE 1
JERSEY CITY 1
BOHIEMA 1
WOODBURY 1
NANUET 1
ROSLYN 1
UNION CITY 1
GREENVILLE 1
STAMFORD 1
ELIZABETH 1
ST. PAUL 1
LOS ANGELES 1
HASBROCK HEIGHTS 1
Name: City, Length: 100, dtype: int64
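To make that idea concrete, here is a small sketch (my addition, and only a rough heuristic, since large boroughs span many zip codes): build a city-to-most-common-zip lookup from rows where both fields are present, then use it to fill rows where the zip is missing but the City is not.
# Rough sketch: fill missing zips from the City column.
# Caveat: cities like BROOKLYN cover many zip codes, so this is only a best guess.
known = requests.dropna(subset=['Incident Zip', 'City'])
city_to_zip = (known.groupby(known['City'].str.upper())['Incident Zip']
                    .agg(lambda z: z.value_counts().index[0]))
missing_zip = requests['Incident Zip'].isnull() & requests['City'].notnull()
requests.loc[missing_zip, 'Incident Zip'] = (
    requests.loc[missing_zip, 'City'].str.upper().map(city_to_zip))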
6.5 Putting it all together
To summarize, this is how we cleaned the zip code column:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv',
                       na_values=na_values,
                       dtype={'Incident Zip': str})

def fix_zip_codes(zips):
    # Truncate everything to length 5
    zips = zips.str.slice(0, 5)
    # Set 00000 zip codes to nan
    zero_zips = zips == '00000'
    zips[zero_zips] = np.nan
    return zips
requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
requests['Incident Zip'].unique()
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
'11417', '10011', '11225', '11218', '10003', '10029', '10466',
'11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
'10305', '10312', '10026', '10309', '10036', '11433', '11235',
'11213', '11379', '11101', '10014', '11231', '11234', '10457',
'10459', '10465', '11207', '10002', '10034', '11233', '10453',
'10456', '10469', '11374', '11221', '11421', '11215', '10007',
'10019', '11205', '11418', '11369', '11249', '10005', '10009',
'11211', '11412', '10458', '11229', '10065', '10030', '11222',
'10024', '10013', '11420', '11365', '10012', '11214', '11212',
'10022', '11232', '11040', '11226', '10281', '11102', '11208',
'10001', '10472', '11414', '11223', '10040', '11220', '11373',
'11203', '11691', '11356', '10017', '10452', '10280', '11217',
'10031', '11201', '11358', '10128', '11423', '10039', '10010',
'11209', '10021', '10037', '11413', '11375', '11238', '10473',
'11103', '11354', '11361', '11106', '11385', '10463', '10467',
'11204', '11237', '11377', '11364', '11434', '11435', '11210',
'11228', '11368', '11694', '10464', '11415', '10314', '10301',
'10018', '10038', '11105', '11230', '10468', '11104', '10471',
'11416', '10075', '11422', '11355', '10028', '10462', '10306',
'10461', '11224', '11429', '10035', '11366', '11362', '11206',
'10460', '10304', '11360', '11411', '10455', '10475', '10069',
'10303', '10308', '10302', '11357', '10470', '11367', '11370',
'10454', '10451', '11436', '11426', '10153', '11004', '11428',
'11427', '11001', '11363', '10004', '10474', '11430', '10000',
'10307', '11239', '10119', '10006', '10048', '11697', '11692',
'11693', '10573', '00083', '11559', '10020', '77056', '11776',
'70711', '10282', '11109', '10044', '02061', '77092', '14225',
'55164', '19711', '07306', '90010', '11747', '23541', '11788',
'07604', '10112', '11563', '11580', '07087', '11042', '07093',
'11501', '92123', '11575', '07109', '11797', '10803', '11716',
'11722', '11549', '10162', '23502', '11518', '07020', '08807',
'11577', '07114', '11003', '07201', '61702', '10103', '29616',
'35209', '11520', '11735', '10129', '11005', '41042', '11590',
'06901', '07208', '11530', '13221', '10954', '11111', '10107'],
dtype=object)
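One last sanity check, not part of the original walkthrough: after the full pipeline, every remaining non-missing value should be exactly five digits, which is easy to verify. On this dataset the expression below should come back True.
# Every cleaned, non-missing zip should now be exactly five digits
cleaned = requests['Incident Zip'].dropna()
cleaned.str.match(r'^\d{5}$').all()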