5 数据转换
5.1 日期格式数据处理
- pandas中使用to_datetime()方法将文本格式转换为日期格式
- dataframe数据类型如果为datatime64,可以使用dt方法取出年月日等
- 对于时间差数据,可以使用datedelta函数将其转化为指定时间单位的数值
- 时间差数据,可以使用dt方法访问其常用属性
import numpy as np
import pandas as pd
import os
os.chdir(r'D:\code\jupyter\course\代码和数据')
df = pd.read_csv('baby_trade_history.csv', encoding = 'utf-8', dtype = {'user_id':str})
df.head()
user_id auction_id cat_id cat1 property buy_mount day
0 786295544 41098319944 50014866 50022520 21458:86755362;13023209:3593274;10984217:21985... 2 20140919
1 532110457 17916191097 50011993 28 21458:11399317;1628862:3251296;21475:137325;16... 1 20131011
2 249013725 21896936223 50012461 50014815 21458:30992;1628665:92012;1628665:3233938;1628... 1 20131011
3 917056007 12515996043 50018831 50014815 21458:15841995;21956:3494076;27000458:59723383... 2 20141023
4 444069173 20487688075 50013636 50008168 21458:30992;13658074:3323064;1628665:3233941;1... 1 20141103
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 7 columns):
user_id 29971 non-null object
auction_id 29971 non-null int64
cat_id 29971 non-null int64
cat1 29971 non-null int64
property 29827 non-null object
buy_mount 29971 non-null int64
day 29971 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.6+ MB
df['buy_date'] = pd.to_datetime(df['day'], format = '%Y%m%d', errors = 'coerce')#新建一列(buy_date),内容为:把day这一列从整数型转换为时间格式,指定读取的格式为年月日,如果不是这个格式,则自动填为空值,不会报错
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 8 columns):
user_id 29971 non-null object
auction_id 29971 non-null int64
cat_id 29971 non-null int64
cat1 29971 non-null int64
property 29827 non-null object
buy_mount 29971 non-null int64
day 29971 non-null int64
buy_date 29971 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(2)
memory usage: 1.8+ MB
df.head()
user_id auction_id cat_id cat1 property buy_mount day buy_date
0 786295544 41098319944 50014866 50022520 21458:86755362;13023209:3593274;10984217:21985... 2 20140919 2014-09-19
1 532110457 17916191097 50011993 28 21458:11399317;1628862:3251296;21475:137325;16... 1 20131011 2013-10-11
2 249013725 21896936223 50012461 50014815 21458:30992;1628665:92012;1628665:3233938;1628... 1 20131011 2013-10-11
3 917056007 12515996043 50018831 50014815 21458:15841995;21956:3494076;27000458:59723383... 2 20141023 2014-10-23
4 444069173 20487688075 50013636 50008168 21458:30992;13658074:3323064;1628665:3233941;1... 1 20141103 2014-11-03
df['buy_date'].dt.year#提取年份
0 2014
1 2013
2 2013
3 2014
4 2014
5 2014
6 2012
7 2012
8 2012
9 2012
10 2012
11 2012
12 2012
13 2012
14 2013
15 2013
16 2013
17 2013
18 2013
19 2014
20 2014
21 2014
22 2014
23 2014
24 2014
25 2013
26 2014
27 2015
28 2014
29 2014
...
29941 2013
29942 2013
29943 2013
29944 2014
29945 2014
29946 2014
29947 2014
29948 2013
29949 2013
29950 2013
29951 2014
29952 2014
29953 2015
29954 2015
29955 2014
29956 2014
29957 2014
29958 2014
29959 2014
29960 2013
29961 2013
29962 2013
29963 2012
29964 2012
29965 2012
29966 2014
29967 2014
29968 2013
29969 2013
29970 2013
Name: buy_date, Length: 29971, dtype: int64
df['buy_date'].dt.month
0 9
1 10
2 10
3 10
4 11
5 11
6 12
7 12
8 11
9 11
10 11
11 11
12 11
13 11
14 10
15 10
16 11
17 11
18 11
19 2
20 2
21 5
22 5
23 8
24 8
25 2
26 9
27 2
28 10
29 10
..
29941 1
29942 1
29943 1
29944 12
29945 12
29946 10
29947 10
29948 7
29949 7
29950 7
29951 3
29952 3
29953 1
29954 1
29955 8
29956 8
29957 10
29958 11
29959 11
29960 9
29961 9
29962 9
29963 12
29964 12
29965 12
29966 1
29967 1
29968 10
29969 10
29970 10
Name: buy_date, Length: 29971, dtype: int64
df['diff_day'] = pd.datetime.now() - df['buy_date']
df.head()
user_id auction_id cat_id cat1 property buy_mount day buy_date diff_day
0 786295544 41098319944 50014866 50022520 21458:86755362;13023209:3593274;10984217:21985... 2 20140919 2014-09-19 1974 days 14:12:47.176183
1 532110457 17916191097 50011993 28 21458:11399317;1628862:3251296;21475:137325;16... 1 20131011 2013-10-11 2317 days 14:12:47.176183
2 249013725 21896936223 50012461 50014815 21458:30992;1628665:92012;1628665:3233938;1628... 1 20131011 2013-10-11 2317 days 14:12:47.176183
3 917056007 12515996043 50018831 50014815 21458:15841995;21956:3494076;27000458:59723383... 2 20141023 2014-10-23 1940 days 14:12:47.176183
4 444069173 20487688075 50013636 50008168 21458:30992;13658074:3323064;1628665:3233941;1... 1 20141103 2014-11-03 1929 days 14:12:47.176183
df.dtypes
user_id object
auction_id int64
cat_id int64
cat1 int64
property object
buy_mount int64
day int64
buy_date datetime64[ns]
diff_day timedelta64[ns]
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29971 entries, 0 to 29970
Data columns (total 9 columns):
user_id 29971 non-null object
auction_id 29971 non-null int64
cat_id 29971 non-null int64
cat1 29971 non-null int64
property 29827 non-null object
buy_mount 29971 non-null int64
day 29971 non-null int64
buy_date 29971 non-null datetime64[ns]
diff_day 29971 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(5), object(2), timedelta64[ns](1)
memory usage: 2.1+ MB
df['diff_day'].dt.days
0 1974
1 2317
2 2317
3 1940
4 1929
5 1929
6 2620
7 2620
8 2661
9 2661
10 2661
11 2639
12 2639
13 2639
14 2323
15 2323
16 2268
17 2268
18 2268
19 2195
20 2195
21 2114
22 2114
23 2020
24 2020
25 2544
26 1975
27 1839
28 1954
29 1954
...
29941 2594
29942 2594
29943 2594
29944 1871
29945 1871
29946 1947
29947 1947
29948 2391
29949 2391
29950 2391
29951 2155
29952 2155
29953 1860
29954 1860
29955 2009
29956 2009
29957 1936
29958 1928
29959 1928
29960 2346
29961 2346
29962 2346
29963 2626
29964 2626
29965 2626
29966 2227
29967 2227
29968 2320
29969 2320
29970 2320
Name: diff_day, Length: 29971, dtype: int64
df['diff_day'].head()
0 1974 days 14:12:47.176183
1 2317 days 14:12:47.176183
2 2317 days 14:12:47.176183
3 1940 days 14:12:47.176183
4 1929 days 14:12:47.176183
Name: diff_day, dtype: timedelta64[ns]
df['diff_day'].dt.seconds
0 51167
1 51167
2 51167
3 51167
4 51167
5 51167
6 51167
7 51167
8 51167
9 51167
10 51167
11 51167
12 51167
13 51167
14 51167
15 51167
16 51167
17 51167
18 51167
19 51167
20 51167
21 51167
22 51167
23 51167
24 51167
25 51167
26 51167
27 51167
28 51167
29 51167
...
29941 51167
29942 51167
29943 51167
29944 51167
29945 51167
29946 51167
29947 51167
29948 51167
29949 51167
29950 51167
29951 51167
29952 51167
29953 51167
29954 51167
29955 51167
29956 51167
29957 51167
29958 51167
29959 51167
29960 51167
29961 51167
29962 51167
29963 51167
29964 51167
29965 51167
29966 51167
29967 51167
29968 51167
29969 51167
29970 51167
Name: diff_day, Length: 29971, dtype: int64
14*3600+12*60+47
51167
df['diff_day'].dt.microseconds
0 176183
1 176183
2 176183
3 176183
4 176183
5 176183
6 176183
7 176183
8 176183
9 176183
10 176183
11 176183
12 176183
13 176183
14 176183
15 176183
16 176183
17 176183
18 176183
19 176183
20 176183
21 176183
22 176183
23 176183
24 176183
25 176183
26 176183
27 176183
28 176183
29 176183
...
29941 176183
29942 176183
29943 176183
29944 176183
29945 176183
29946 176183
29947 176183
29948 176183
29949 176183
29950 176183
29951 176183
29952 176183
29953 176183
29954 176183
29955 176183
29956 176183
29957 176183
29958 176183
29959 176183
29960 176183
29961 176183
29962 176183
29963 176183
29964 176183
29965 176183
29966 176183
29967 176183
29968 176183
29969 176183
29970 176183
Name: diff_day, Length: 29971, dtype: int64
df['时间差'] = df['diff_day'] / pd.Timedelta('1 D')
df['时间差'].head()
0 1974.592213
1 2317.592213
2 2317.592213
3 1940.592213
4 1929.592213
Name: 时间差, dtype: float64
df['时间差'].round(decimals = 3)
0 1974.592
1 2317.592
2 2317.592
3 1940.592
4 1929.592
5 1929.592
6 2620.592
7 2620.592
8 2661.592
9 2661.592
10 2661.592
11 2639.592
12 2639.592
13 2639.592
14 2323.592
15 2323.592
16 2268.592
17 2268.592
18 2268.592
19 2195.592
20 2195.592
21 2114.592
22 2114.592
23 2020.592
24 2020.592
25 2544.592
26 1975.592
27 1839.592
28 1954.592
29 1954.592
...
29941 2594.592
29942 2594.592
29943 2594.592
29944 1871.592
29945 1871.592
29946 1947.592
29947 1947.592
29948 2391.592
29949 2391.592
29950 2391.592
29951 2155.592
29952 2155.592
29953 1860.592
29954 1860.592
29955 2009.592
29956 2009.592
29957 1936.592
29958 1928.592
29959 1928.592
29960 2346.592
29961 2346.592
29962 2346.592
29963 2626.592
29964 2626.592
29965 2626.592
29966 2227.592
29967 2227.592
29968 2320.592
29969 2320.592
29970 2320.592
Name: 时间差, Length: 29971, dtype: float64
df['diff_day'].astype('timedelta64[D]')#转化为整数天(D)
0 1974.0
1 2317.0
2 2317.0
3 1940.0
4 1929.0
5 1929.0
6 2620.0
7 2620.0
8 2661.0
9 2661.0
10 2661.0
11 2639.0
12 2639.0
13 2639.0
14 2323.0
15 2323.0
16 2268.0
17 2268.0
18 2268.0
19 2195.0
20 2195.0
21 2114.0
22 2114.0
23 2020.0
24 2020.0
25 2544.0
26 1975.0
27 1839.0
28 1954.0
29 1954.0
...
29941 2594.0
29942 2594.0
29943 2594.0
29944 1871.0
29945 1871.0
29946 1947.0
29947 1947.0
29948 2391.0
29949 2391.0
29950 2391.0
29951 2155.0
29952 2155.0
29953 1860.0
29954 1860.0
29955 2009.0
29956 2009.0
29957 1936.0
29958 1928.0
29959 1928.0
29960 2346.0
29961 2346.0
29962 2346.0
29963 2626.0
29964 2626.0
29965 2626.0
29966 2227.0
29967 2227.0
29968 2320.0
29969 2320.0
29970 2320.0
Name: diff_day, Length: 29971, dtype: float64
5.2 高阶函数数据处理
- 在dataframe中使用apply方法,调用自定义函数对数据进行处理
- 函数apply,axis = 0表示对行进行操作,1表示对列进行操作
- 可以使用astype函数对数据进行转换
- 可以使用map函数进行数据转换
df2 = pd.read_csv('sam_tianchi_mum_baby.csv', dtype = str, encoding = 'utf-8')
df2.head()
user_id birthday gender
0 2757 20130311 1
1 415971 20121111 0
2 1372572 20120130 1
3 10339332 20110910 0
4 10642245 20130213 0
def f(x):
if '0' in str(x):
return '女'
elif '1' in str(x):
return '男'
else:
return '未知'
df2['性别'] = df2['gender'].apply(f)
df2.head()
user_id birthday gender 性别
0 2757 20130311 1 男
1 415971 20121111 0 女
2 1372572 20120130 1 男
3 10339332 20110910 0 女
4 10642245 20130213 0 女
df2[df2['gender'] == '2']
user_id birthday gender 性别
46 49167150 20130818 2 未知
47 49983255 20140206 2 未知
51 52529655 20130611 2 未知
58 57711375 20130420 2 未知
106 99665637 20130926 2 未知
132 151742451 20140121 2 未知
163 277629531 20130830 2 未知
235 690267186 20140104 2 未知
283 902549448 20130615 2 未知
284 914074800 20140219 2 未知
292 1060518588 20130815 2 未知
333 17830874 20130927 2 未知
611 1134007073 20131231 2 未知
646 16033918 20131024 2 未知
673 45002392 20131212 2 未知
701 64581913 20130827 2 未知
718 80273467 20150125 2 未知
735 101254996 20130717 2 未知
744 115378612 20140412 2 未知
754 133546192 20130707 2 未知
761 144253378 20131021 2 未知
764 156156175 20130824 2 未知
778 198627397 20130617 2 未知
814 363860560 20140216 2 未知
917 930683689 20131013 2 未知
931 1609674769 20140703 2 未知
del df2['性别']
df2['性别'] = df2['gender'].map({'0':'女','1':'男','2':'未知'})#map映射
df2.head()
user_id birthday gender 性别
0 2757 20130311 1 男
1 415971 20121111 0 女
2 1372572 20120130 1 男
3 10339332 20110910 0 女
4 10642245 20130213 0 女
del df2['性别']
df2['性别'] = df2['gender'].map(f)#也可以用自己定义的函数
df2.head()
user_id birthday gender 性别
0 2757 20130311 1 男
1 415971 20121111 0 女
2 1372572 20120130 1 男
3 10339332 20110910 0 女
4 10642245 20130213 0 女
df2['user_id'].apply(lambda x:str(x).replace(x[1:3],'**'))#对user_id的第2位到第3位做脱敏处理
0 2**7
1 4**971
2 1**2572
3 1**39332
4 1**42245
5 1**23201
6 1**68880
7 1**19465
8 1**50574
9 1**35440
10 1**10892
11 1**05422
12 1**86531
13 1**65490
14 1**31245
15 1**90851
16 2**87991
17 2**70454
18 2**37271
19 2**15917
20 2**87268
21 2**02471
22 2**08537
23 2**27133
24 2**29944
25 2**07593
26 2**29842
27 2**13666
28 3**95206
29 3**35454
...
923 1**3363684
924 1**0284684
925 1**1311286
926 1**7643545
927 1**4274418
928 1**3802043
929 1**3628211
930 1**758**31
931 1**9674769
932 1**5254677
933 1**9930323
934 1**1493753
935 1**3328510
936 1**2276954
937 1**1109878
938 1**0825303
939 1**4594929
940 1**2586633
941 1**0529477
942 1**6568285
943 1**2594497
944 1**0948217
945 1**1509424
946 1**3092345
947 1**3235563
948 2**0957900
949 2**0304899
950 2**4469016
951 2**6831536
952 2**4611367
Name: user_id, Length: 953, dtype: object
df2['birthday'].apply(lambda x: x[4:8])
0 0311
1 1111
2 0130
3 0910
4 0213
5 0830
6 0107
7 0705
8 0708
9 0323
10 0812
11 0429
12 0922
13 1209
14 0115
15 0101
16 0808
17 1017
18 0204
19 0801
20 0526
21 0601
22 0416
23 1029
24 0826
25 1122
26 1124
27 0413
28 0918
29 0319
...
923 1114
924 1229
925 1031
926 0629
927 0803
928 0301
929 0903
930 0321
931 0703
932 0123
933 1004
934 1222
935 0817
936 0821
937 0823
938 0721
939 0826
940 0309
941 0617
942 1123
943 0105
944 0917
945 0531
946 0303
947 0709
948 0430
949 0713
950 0416
951 0519
952 1031
Name: birthday, Length: 953, dtype: object
5.3 字符串数据处理
- pandas中提供了字符串的函数,但只能对字符型变量使用
- 通过str方法访问相关属性
- 可以使用字符串的相关方法进行数据处理
df1 = pd.read_csv('MotorcycleData.csv',encoding = 'gbk')
df1.head()
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles $11,412 McHenry, Illinois, United States 2013.0 16,000 Black Harley-Davidson Unspecified Touring ... NaN FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition $17,200 Fort Recovery, Ohio, United States 2016.0 60 Black Harley-Davidson Vehicle has an existing warranty Touring ... NaN FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used NaN $3,872 Chicago, Illinois, United States 1970.0 25,763 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... NaN FALSE 100 NaN 136 NaN Clear True FALSE 26.0
3 Used CLEAN TITLE READY TO RIDE HOME $6,575 Green Bay, Wisconsin, United States 2009.0 33,142 Red Harley-Davidson NaN Touring ... NaN FALSE 100 NaN 2920 Dealer Clear True FALSE 11.0
4 Used NaN $10,000 West Bend, Wisconsin, United States 2012.0 17,800 Blue Harley-Davidson NO WARRANTY Touring ... NaN FALSE 100 13 271 OWNER Clear True TRUE 0.0
5 rows × 22 columns
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition 7493 non-null object
Condition_Desc 1656 non-null object
Price 7493 non-null object
Location 7491 non-null object
Model_Year 7489 non-null float64
Mileage 7468 non-null object
Exterior_Color 6778 non-null object
Make 7489 non-null object
Warranty 5109 non-null object
Model 7370 non-null object
Sub_Model 2426 non-null object
Type 6011 non-null object
Vehicle_Title 268 non-null object
OBO 7427 non-null object
Feedback_Perc 6611 non-null object
Watch_Count 3517 non-null object
N_Reviews 7487 non-null object
Seller_Status 6868 non-null object
Vehicle_Tile 7439 non-null object
Auction 7476 non-null object
Buy_Now 7256 non-null object
Bid_Count 2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB
df1['Price'].head()
0 $11,412
1 $17,200
2 $3,872
3 $6,575
4 $10,000
Name: Price, dtype: object
df1['价格'] = df1['Price'].str.strip('$')
df1['价格']
0 11,412
1 17,200
2 3,872
3 6,575
4 10,000
5 1,500
6 24,900
7 1,400
8 5,100
9 8,000
10 2,125
11 11,100
12 1,125
13 3,550
14 5,500
15 9,000
16 8,100
17 14,000
18 20,000
19 13,000
20 10,900
21 9,999
22 5,700
23 7,400
24 26,500
25 9,850
26 45,900
27 10,600
28 12,000
29 20,000
...
7463 7,800
7464 4,400
7465 7,900
7466 6,500
7467 7,400
7468 8,800
7469 13,570
7470 7,900
7471 6,200
7472 7,500
7473 5,500
7474 3,400
7475 3,900
7476 8,500
7477 4,900
7478 9,900
7479 2,900
7480 6,500
7481 10,500
7482 7,200
7483 7,500
7484 5,000
7485 5,000
7486 8,900
7487 4,900
7488 3,900
7489 8,900
7490 7,800
7491 7,900
7492 12,970
Name: 价格, Length: 7493, dtype: object
df1['价格'] = df1['价格'].str.replace(',','')#把逗号转化为空字符串
df1['价格']
0 11412
1 17200
2 3872
3 6575
4 10000
5 1500
6 24900
7 1400
8 5100
9 8000
10 2125
11 11100
12 1125
13 3550
14 5500
15 9000
16 8100
17 14000
18 20000
19 13000
20 10900
21 9999
22 5700
23 7400
24 26500
25 9850
26 45900
27 10600
28 12000
29 20000
...
7463 7800
7464 4400
7465 7900
7466 6500
7467 7400
7468 8800
7469 13570
7470 7900
7471 6200
7472 7500
7473 5500
7474 3400
7475 3900
7476 8500
7477 4900
7478 9900
7479 2900
7480 6500
7481 10500
7482 7200
7483 7500
7484 5000
7485 5000
7486 8900
7487 4900
7488 3900
7489 8900
7490 7800
7491 7900
7492 12970
Name: 价格, Length: 7493, dtype: object
df1['价格'] = df1['价格'].astype(int)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 23 columns):
Condition 7493 non-null object
Condition_Desc 1656 non-null object
Price 7493 non-null object
Location 7491 non-null object
Model_Year 7489 non-null float64
Mileage 7468 non-null object
Exterior_Color 6778 non-null object
Make 7489 non-null object
Warranty 5109 non-null object
Model 7370 non-null object
Sub_Model 2426 non-null object
Type 6011 non-null object
Vehicle_Title 268 non-null object
OBO 7427 non-null object
Feedback_Perc 6611 non-null object
Watch_Count 3517 non-null object
N_Reviews 7487 non-null object
Seller_Status 6868 non-null object
Vehicle_Tile 7439 non-null object
Auction 7476 non-null object
Buy_Now 7256 non-null object
Bid_Count 2190 non-null float64
价格 7493 non-null int32
dtypes: float64(2), int32(1), object(20)
memory usage: 1.3+ MB
df1.head()
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count 价格
0 Used mint!!! very low miles $11,412 McHenry, Illinois, United States 2013.0 16,000 Black Harley-Davidson Unspecified Touring ... FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0 11412
1 Used Perfect condition $17,200 Fort Recovery, Ohio, United States 2016.0 60 Black Harley-Davidson Vehicle has an existing warranty Touring ... FALSE 100 17 657 Private Seller Clear True TRUE 0.0 17200
2 Used NaN $3,872 Chicago, Illinois, United States 1970.0 25,763 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... FALSE 100 NaN 136 NaN Clear True FALSE 26.0 3872
3 Used CLEAN TITLE READY TO RIDE HOME $6,575 Green Bay, Wisconsin, United States 2009.0 33,142 Red Harley-Davidson NaN Touring ... FALSE 100 NaN 2920 Dealer Clear True FALSE 11.0 6575
4 Used NaN $10,000 West Bend, Wisconsin, United States 2012.0 17,800 Blue Harley-Davidson NO WARRANTY Touring ... FALSE 100 13 271 OWNER Clear True TRUE 0.0 10000
5 rows × 23 columns
df1['Location'].str.split(',')#字符串的分割。df1['Location'].str.split(',').str[0]提取第一个元素
0 [McHenry, Illinois, United States]
1 [Fort Recovery, Ohio, United States]
2 [Chicago, Illinois, United States]
3 [Green Bay, Wisconsin, United States]
4 [West Bend, Wisconsin, United States]
5 [Watervliet, Michigan, United States]
6 [Sterling, Illinois, United States]
7 [Williamston, Michigan, United States]
8 [Palatine, Illinois, United States]
9 [Chicago, Illinois, United States]
10 [Williamston, Michigan, United States]
11 [Oak Park, Illinois, United States]
12 [Pewaukee, Wisconsin, United States]
13 [Madison, Wisconsin, United States]
14 [Davenport, Iowa, United States]
15 [Plymouth, Indiana, United States]
16 [Goshen, Indiana, United States]
17 [Plainfield, Indiana, United States]
18 [Bellevue, Iowa, United States]
19 [Des Plaines, Illinois, United States]
20 [Ottawa, Illinois, United States]
21 [Orland Park, Illinois, United States]
22 [Cross Plains, Wisconsin, United States]
23 [Watervliet, Michigan, United States]
24 [Kewanee, Illinois, United States]
25 [Chicago, Illinois, United States]
26 [South Haven, Michigan, United States]
27 [Wheeling, Illinois, United States]
28 [Chicago, Illinois, United States]
29 [Chicago, Illinois, United States]
...
7463 [Raymond, New Hampshire, United States]
7464 [Raymond, New Hampshire, United States]
7465 [Raymond, New Hampshire, United States]
7466 [Raymond, New Hampshire, United States]
7467 [Raymond, New Hampshire, United States]
7468 [Raymond, New Hampshire, United States]
7469 [Scott City, Missouri, United States]
7470 [Raymond, New Hampshire, United States]
7471 [Raymond, New Hampshire, United States]
7472 [Raymond, New Hampshire, United States]
7473 [Raymond, New Hampshire, United States]
7474 [Raymond, New Hampshire, United States]
7475 [Raymond, New Hampshire, United States]
7476 [Raymond, New Hampshire, United States]
7477 [Raymond, New Hampshire, United States]
7478 [Raymond, New Hampshire, United States]
7479 [Raymond, New Hampshire, United States]
7480 [Raymond, New Hampshire, United States]
7481 [Raymond, New Hampshire, United States]
7482 [Raymond, New Hampshire, United States]
7483 [Raymond, New Hampshire, United States]
7484 [Raymond, New Hampshire, United States]
7485 [Raymond, New Hampshire, United States]
7486 [Raymond, New Hampshire, United States]
7487 [Raymond, New Hampshire, United States]
7488 [Raymond, New Hampshire, United States]
7489 [Raymond, New Hampshire, United States]
7490 [Raymond, New Hampshire, United States]
7491 [Raymond, New Hampshire, United States]
7492 [Scott City, Missouri, United States]
Name: Location, Length: 7493, dtype: object
df1['Location'].str.len()
0 32.0
1 34.0
2 32.0
3 35.0
4 35.0
5 35.0
6 33.0
7 36.0
8 33.0
9 32.0
10 36.0
11 33.0
12 34.0
13 33.0
14 30.0
15 32.0
16 30.0
17 34.0
18 29.0
19 36.0
20 31.0
21 36.0
22 38.0
23 35.0
24 32.0
25 32.0
26 36.0
27 33.0
28 32.0
29 32.0
...
7463 37.0
7464 37.0
7465 37.0
7466 37.0
7467 37.0
7468 37.0
7469 35.0
7470 37.0
7471 37.0
7472 37.0
7473 37.0
7474 37.0
7475 37.0
7476 37.0
7477 37.0
7478 37.0
7479 37.0
7480 37.0
7481 37.0
7482 37.0
7483 37.0
7484 37.0
7485 37.0
7486 37.0
7487 37.0
7488 37.0
7489 37.0
7490 37.0
7491 37.0
7492 35.0
Name: Location, Length: 7493, dtype: float64
欢迎阅读数据清洗系列文章:python数据清洗工具、方法、过程整理归纳