The Dataset
legislators.csv
last_name,first_name,birthday,gender,type,state,party
Bassett,Richard,1745-04-02,M,sen,DE,Anti-Administration
Bland,Theodorick,1742-03-21,,rep,VA,
Burke,Aedanus,1743-06-16,,rep,SC,
Carroll,Daniel,1730-07-22,M,rep,MD,
- last_name – the last name of the Congressperson.
- first_name – the first name of the Congressperson.
- birthday – the birthday of the Congressperson.
- gender – the gender of the Congressperson.
- type – whether they were in the Senate (sen), or in the House of Representatives (rep).
- state – the state the Congressperson represents.
- party – the party affiliation of the Congressperson.
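A look at the first few rows shows that the dataset has missing values. Missing values can cause errors during analysis, so this article covers how to handle those errors.
- Converting a list to a set often reveals unexpected values, because the elements of a set are unique. Here the gender column is converted to a set.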
import csv

with open("legislators.csv", 'r') as f:
    legislators = list(csv.reader(f))
# Collect the gender column (index 3), then deduplicate it with set().
gender = []
for item in legislators:
    gender.append(item[3])
gender = set(gender)
print(gender)
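The same idea can be tried without the file on disk. The sketch below is only an illustration: it reads a few of the sample rows shown earlier from an in-memory string (the names sample, rows, and genders are made up for this example) and builds the set with a set comprehension.

```python
import csv
import io

# Inline stand-in for legislators.csv, using rows from the sample above.
sample = io.StringIO(
    "Bassett,Richard,1745-04-02,M,sen,DE,Anti-Administration\n"
    "Bland,Theodorick,1742-03-21,,rep,VA,\n"
    "Carroll,Daniel,1730-07-22,M,rep,MD,\n"
)
rows = list(csv.reader(sample))

# A set comprehension keeps each distinct value of column 3 exactly once.
genders = {row[3] for row in rows}
print(genders)
```

The empty string appears as a member of the set, which is how a missing value shows up in this data.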
Exploring The Dataset
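- Start by looking at individual columns to explore patterns in the data.
party = []
for item in legislators:
    party.append(item[6])
party = set(party)
print(party)
print(legislators)
Both the gender set and the party set printed above contain an empty string (""), which means both columns have missing values.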
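A set shows that an empty string is present but not how often it occurs. As a rough sketch, collections.Counter can count the missing entries per column; the rows below are the four sample rows from the top of this article, and the variable names are chosen only for this example.

```python
from collections import Counter

# The four sample rows shown at the top of the article.
rows = [
    ["Bassett", "Richard", "1745-04-02", "M", "sen", "DE", "Anti-Administration"],
    ["Bland", "Theodorick", "1742-03-21", "", "rep", "VA", ""],
    ["Burke", "Aedanus", "1743-06-16", "", "rep", "SC", ""],
    ["Carroll", "Daniel", "1730-07-22", "M", "rep", "MD", ""],
]

# Count how often each value occurs in the gender and party columns.
gender_counts = Counter(row[3] for row in rows)
party_counts = Counter(row[6] for row in rows)

# The count stored under "" is the number of missing values.
print("missing gender:", gender_counts[""])
print("missing party:", party_counts[""])
```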
Missing Values
Ways to handle missing values:
- Remove any rows that contain missing data.
- Fill the missing fields with a specified value.
- Fill the missing fields with a calculated value.
- Use analysis techniques that work with missing data.
Here the missing party values are filled with a specified value, "No Party", and the missing gender values are filled with M (because most of the gender values seen earlier are M):
for row in legislators:
    if row[6] == "":
        row[6] = "No Party"
for row in legislators:
    if row[3] == "":
        row[3] = "M"
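For comparison, the first strategy in the list (removing rows that contain missing data) might look like the sketch below. It is not the approach used in this article; the rows are the sample rows from the top, and complete_rows is a name invented for the illustration.

```python
# The four sample rows shown at the top of the article.
rows = [
    ["Bassett", "Richard", "1745-04-02", "M", "sen", "DE", "Anti-Administration"],
    ["Bland", "Theodorick", "1742-03-21", "", "rep", "VA", ""],
    ["Burke", "Aedanus", "1743-06-16", "", "rep", "SC", ""],
    ["Carroll", "Daniel", "1730-07-22", "M", "rep", "MD", ""],
]

# Keep only the rows in which every field is non-empty.
complete_rows = [row for row in rows if all(field != "" for field in row)]
print(len(complete_rows), "of", len(rows), "rows are complete")
```

On this sample only one row survives, which suggests why filling values can be preferable to dropping rows when many fields are empty.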
- The birthday column holds values such as 1820-01-02. This format is awkward to analyze, so each value is split on "-" and only the year is kept.
birth_years = []
for row in legislators:
    birthday = row[2]
    parts = birthday.split("-")
    birth_years.append(parts[0])
Try/Except Blocks
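We may want to compute the average of birth_years, but its values are strings, so they first need to be converted with int(). If a value is missing, int('') raises an error. Any unhandled error stops the program, so a try/except block is used to catch the error and let the program keep running.
- When an exception is raised, an instance of the Exception class (or one of its subclasses) is created, and we can inspect that instance:
try:
    int('')
except Exception as exc:
    print(type(exc))
- The instance can also be converted to a string and printed. This will print a message that will help us debug the error.
try:
    int('')
except Exception as exc:
    print(str(exc))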
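The examples above catch the broad Exception class. Since int() raises ValueError for a string it cannot parse, a narrower handler is possible. The sketch below assumes a hypothetical helper named parse_int, which is not part of the original code.

```python
def parse_int(text):
    """Return int(text), or None if text is not a valid integer string."""
    try:
        return int(text)
    except ValueError as exc:
        # Report the exception class name and its message for debugging.
        print(type(exc).__name__, "-", exc)
        return None

result_ok = parse_int("1745")
result_bad = parse_int("")
print(result_ok, result_bad)
```

Catching only ValueError means that unrelated problems, such as a misspelled variable name, still surface as errors instead of being hidden.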
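- When a try/except block is used inside a loop, printing every error message would produce a lot of confusing output, so we simply do nothing with the error. Written as below, however, the code is rejected by Python, because the colon must be followed by a body:
numbers = [1,2,3,4,5,6,7,8,9,10]
for i in numbers:
    try:
        int('')
    except Exception:
- The pass keyword is useful here. It is a statement that does nothing, so it can serve as the body of the except block:
try:
    int('')
except Exception:
    pass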
- Convert the birth years to int:
converted_years = []
for year in birth_years:
    try:
        year = int(year)
    except Exception:
        # Conversion failed: year keeps its original string value.
        pass
    converted_years.append(year)
- Putting it all together: split each birthday, extract the year, convert it to int (using 0 for missing values), and append it to the end of the row. As a result, legislators gains one more column.
for row in legislators:
    birthday = row[2]
    birth_year = birthday.split("-")[0]
    try:
        birth_year = int(birth_year)
    except Exception:
        birth_year = 0
    row.append(birth_year)
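- However, an earlier look at the data showed that the rows are ordered chronologically, so it is more appropriate to replace a year of 0 with the nearest year that comes before it.
last_value = 1
for row in legislators:
    if row[7] == 0:
        row[7] = last_value
    last_value = row[7]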
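The same fill logic can be tried on a plain list of years. The following is a minimal sketch, assuming a hypothetical helper forward_fill and made-up sample values where 0 marks a missing year; it mirrors the loop above and then computes the average that motivated the conversion.

```python
def forward_fill(values, missing=0, initial=1):
    """Replace each missing value with the most recent preceding value."""
    filled = []
    last_value = initial  # used only if the very first value is missing
    for value in values:
        if value == missing:
            value = last_value
        filled.append(value)
        last_value = value
    return filled

# Illustrative years in list order; 0 stands for a missing birth year.
years = [1745, 0, 1743, 0, 0, 1730]
filled = forward_fill(years)
print(filled)

# With every entry now an int, the average can be computed directly.
average_year = sum(filled) / len(filled)
print(average_year)
```

Note that a missing value at the very start of the list falls back to the initial value, just as last_value = 1 does in the loop above.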