经过人指点,解决了一个问题
使用数据集air_data.csv
为了程序运行需要,仅给出一部分数据集
MEMBER_NO | FFP_DATE | FIRST_FLIGHT_DATE | GENDER | FFP_TIER | WORK_CITY | WORK_PROVINCE | WORK_COUNTRY | AGE | LOAD_TIME | FLIGHT_COUNT | BP_SUM | EP_SUM_YR_1 | EP_SUM_YR_2 | SUM_YR_1 | SUM_YR_2 | SEG_KM_SUM | WEIGHTED_SEG_KM | LAST_FLIGHT_DATE | AVG_FLIGHT_COUNT | AVG_BP_SUM | BEGIN_TO_FIRST | LAST_TO_END | AVG_INTERVAL | MAX_INTERVAL | ADD_POINTS_SUM_YR_1 | ADD_POINTS_SUM_YR_2 | EXCHANGE_COUNT | avg_discount | P1Y_Flight_Count | L1Y_Flight_Count | P1Y_BP_SUM | L1Y_BP_SUM | EP_SUM | ADD_Point_SUM | Eli_Add_Point_Sum | L1Y_ELi_Add_Points | Points_Sum | L1Y_Points_Sum | Ration_L1Y_Flight_Count | Ration_P1Y_Flight_Count | Ration_P1Y_BPS | Ration_L1Y_BPS | Point_NotFlight |
54993 | 2006/11/02 | 2008/12/24 | 男 | 6 | 0 | 北京 | CN | 31 | 2014/03/31 | 210 | 505308 | 0 | 74460 | 239560 | 234188 | 580717 | 558440.14 | 2014/03/31 | 26.25 | 63163.5 | 2 | 1 | 3.483253589 | 18 | 3352 | 36640 | 34 | 0.961639043 | 103 | 107 | 246197 | 259111 | 74460 | 39992 | 114452 | 111100 | 619760 | 370211 | 0.50952381 | 0.49047619 | 0.487220691 | 0.51277733 | 50 |
28065 | 2007/02/19 | 2007/08/03 | 男 | 6 | 北京 | CN | 42 | 2014/03/31 | 140 | 362480 | 0 | 41288 | 171483 | 167434 | 293678 | 367777.2 | 2014/03/25 | 17.5 | 45310 | 2 | 7 | 5.194244604 | 17 | 0 | 12000 | 29 | 1.25231444 | 68 | 72 | 177358 | 185122 | 41288 | 12000 | 53288 | 53288 | 415768 | 238410 | 0.514285714 | 0.485714286 | 0.489289094 | 0.510708147 | 33 | |
55106 | 2007/02/01 | 2007/08/30 | 男 | 6 | 0 | 北京 | CN | 40 | 2014/03/31 | 135 | 351159 | 0 | 39711 | 163618 | 164982 | 283712 | 355966.5 | 2014/03/21 | 16.875 | 43894.875 | 10 | 11 | 5.298507463 | 18 | 3491 | 12000 | 20 | 1.254675516 | 65 | 70 | 169072 | 182087 | 39711 | 15491 | 55202 | 51711 | 406361 | 233798 | 0.518518519 | 0.481481481 | 0.481467137 | 0.518530015 | 26 |
21189 | 2008/08/22 | 2008/08/23 | 男 | 5 | Los Angeles | CA | US | 64 | 2014/03/31 | 23 | 337314 | 0 | 34890 | 116350 | 125500 | 281336 | 306900.88 | 2013/12/26 | 2.875 | 42164.25 | 21 | 97 | 27.86363636 | 73 | 0 | 0 | 11 | 1.090869565 | 13 | 10 | 186104 | 151210 | 34890 | 0 | 34890 | 34890 | 372204 | 186100 | 0.434782609 | 0.565217391 | 0.551721684 | 0.448275351 | 12 |
#-*- coding: utf-8 -*-
#数据清洗,过滤掉不符合规则的数据
import pandas as pd
datafile= '../data/air_data.csv' #航空原始数据,第一行为属性标签
cleanedfile = '../tmp/data_cleaned.csv' #数据清洗后保存的文件
data = pd.read_csv(datafile,encoding='utf-8') #读取原始数据,指定UTF-8编码(需要用文本编辑器将数据装换为UTF-8编码)
data = data[data['SUM_YR_1'].notnull()*data['SUM_YR_2'].notnull()] #票价非空值才保留
#只保留票价非零的,或者平均折扣率与总飞行公里数同时为0的记录。
index1 = data['SUM_YR_1'] != 0
index2 = data['SUM_YR_2'] != 0
index3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0) #该规则是“与”
data = data[index1 | index2 | index3] #该规则是“或”csv
data.to_csv(cleanedfile,sep="\t", encoding="utf-8")
#data.to_excel(cleanedfile) #导出结果
如果编辑文本后运行,会出现如下错误:
ValueError: No engine for filetype: 'csv'
如果python console中运行,会出现如下错误:
File "<input>", line 0
SyntaxError: encoding declaration in Unicode string
这两个问题都是由于最后一句使用了to_excel导致的,改成to_csv即可
另外,to_csv可以生成csv或者xls文件