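Download of all source code and data files: for reference only.
The Python code in the book Visualize This (Chinese edition: 《鲜活的数据:数据可视化指南》) targets Python 2.x. While reading the book, I converted part of the source code to Python 3.3.
- Environment: Windows 7, Python 3.3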
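(1) get-weather-data.py: scrapes weather information from web pages (a small crawler)
- Notes: the main changes cover the move away from urllib2, the timestamp format, the location the data is collected for, and the position of the data on the page.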
# coding: utf-8
__author__ = 'hillfree'

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open a file called wunder-data.txt (which will be a comma-delimited file)
f = open('wunder-data.txt', 'a')

# Iterate through months and day
for month in range(1, 13):
    for day in range(1, 32):

        # Check if already gone through month
        # (2012 is a leap year, so February has 29 days)
        if month == 2 and day > 29:
            break
        elif month in [4, 6, 9, 11] and day > 30:
            break

        # Open wunderground.com url
        url = "http://www.wunderground.com/history/airport/ZBAA/2012/{0}/{1}/DailyHistory.html".format(month, day)
        page = urlopen(url)

        # Get temperature from page
        soup = BeautifulSoup(page)
        # Get the maximum temperature
        max_temp = soup.findAll(attrs={"class": "nobr"})[3].span.string

        # Build day record with timestamp
        record = "2012{0:02d}{1:02d}, {2}\n".format(month, day, max_temp)
        print(record)
        # Write timestamp and temperature to file
        f.write(record)

# Done getting data! Close file.
f.close()
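The urllib2 change mentioned in the notes comes down to where urlopen lives: Python 2 exposes it in urllib2, Python 3 in urllib.request. The following is a minimal sketch of an import that works on both, together with the zero-padded timestamp format. The helper name build_record is hypothetical, the values are made up, and no page is fetched.

```python
# Python 3 moved urlopen from urllib2 to urllib.request.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2 fallback


def build_record(year, month, day, max_temp):
    """Return one 'YYYYMMDD, temperature' line (hypothetical helper)."""
    # {1:02d} pads single-digit months and days with a leading zero.
    return "{0:04d}{1:02d}{2:02d}, {3}\n".format(year, month, day, max_temp)


# Example values only; urlopen is imported but not called here.
print(build_record(2012, 1, 5, -3).strip())  # 20120105, -3
```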
(2) get-weather-data-full.py: enhanced version of the weather scraper
- Notes: builds on get-weather-data.py by adding a loop over years and a leap-year check.
# coding: utf-8
__author__ = 'hillfree'

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Create/open a file called wunder-data.txt (which will be a comma-delimited file)
f = open('wunder-data.txt', 'a')

# Iterate through year, months and day
# (the ranges below cover only January 2013; widen them to collect more)
for year in range(2013, 2014):

    # Check if leap year
    if year % 400 == 0:
        leap = True
    elif year % 100 == 0:
        leap = False
    elif year % 4 == 0:
        leap = True
    else:
        leap = False

    for month in range(1, 2):
        for day in range(1, 32):

            # Check if already gone through month
            if month == 2 and day > (29 if leap else 28):
                break
            elif month in [4, 6, 9, 11] and day > 30:
                break

            # Open wunderground.com url
            url = "http://www.wunderground.com/history/airport/ZBAA/{0}/{1}/{2}/DailyHistory.html".format(year, month, day)
            page = urlopen(url)

            # Get temperature from page
            soup = BeautifulSoup(page)
            # Get the maximum temperature
            max_temp = soup.findAll(attrs={"class": "nobr"})[3].span.string

            # Build day record with timestamp
            record = "{0:04d}{1:02d}{2:02d}, {3}\n".format(year, month, day, max_temp)
            print(record)
            # Write timestamp and temperature to file
            f.write(record)

# Done getting data! Close file.
f.close()
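The leap-year and month-length checks can also be delegated to the standard calendar module instead of being written by hand. This is only a sketch of that alternative, not the script's own method. The helper names days_in_month and iter_dates are hypothetical.

```python
import calendar


def days_in_month(year, month):
    """Number of days in the given month, leap years included."""
    # monthrange returns (weekday of the first day, number of days).
    return calendar.monthrange(year, month)[1]


def iter_dates(start_year, end_year):
    """Yield every valid (year, month, day) from start_year through end_year."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            for day in range(1, days_in_month(year, month) + 1):
                yield year, month, day


print(calendar.isleap(2012), days_in_month(2012, 2))  # True 29
print(calendar.isleap(2013), days_in_month(2013, 2))  # False 28
print(sum(1 for _ in iter_dates(2012, 2013)))  # 731
```

Because only valid dates are generated, the scraping loop would need no break or continue checks at all.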
(3) add-csv-flag.py: adds a flag column to the CSV contents
- Notes: reads the CSV file generated earlier, tests each row, and adds an is_freezing flag. The result is output with print.
# coding: utf-8
__author__ = 'hillfree'

import csv

# Open the data file in a with block so it is closed when reading finishes
with open('wunder-data.txt', 'r') as f:
    reader = csv.reader(f, delimiter=",")

    for row in reader:
        if int(row[1]) < 0:
            is_freezing = '1'
            print("{0}, {1}, {2}".format(row[0], row[1], is_freezing))  # list the freezing days
        else:
            is_freezing = '0'

        # print("{0}, {1}, {2}".format(row[0], row[1], is_freezing))  # could be written to a file
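To write the flagged rows out rather than print them, csv.writer can produce the three-column output. Below is a minimal sketch that uses in-memory buffers in place of real files. The sample values are made up, the helper name add_freezing_flag is hypothetical, and every temperature is assumed to be an integer.

```python
import csv
import io


def add_freezing_flag(rows):
    """Yield [date, temperature, is_freezing] for each [date, temperature] row."""
    for date, temp in rows:
        temp = temp.strip()  # the scraper writes a space after the comma
        is_freezing = "1" if int(temp) < 0 else "0"
        yield [date, temp, is_freezing]


# Sample rows in the "YYYYMMDD, temperature" layout (made-up values).
sample = io.StringIO("20120101, -3\n20120102, 4\n")
out = io.StringIO()  # stands in for a real output file

writer = csv.writer(out, lineterminator="\n")
writer.writerows(add_freezing_flag(csv.reader(sample)))
print(out.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by open('wunder-data.txt') and a second file opened for writing with newline=''.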
(4) csv-to-xml.py: converts the CSV file to XML
- Notes: in addition to the print output, the script now also writes to the file "wunder-data.xml".
# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: csv-to-xml.py
python: v3.3
description: Convert CSV file to Xml format
"""
__author__ = 'hillfree'

import csv

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")
output = open("wunder-data.xml", 'w')

print('<weather_data>')
output.write('<weather_data>\n')

for row in reader:
    print('<observation>')
    print('<date>' + row[0] + '</date>')
    print('<max_temperature>' + row[1] + '</max_temperature>')
    print('</observation>')

    output.write('<observation>\n')
    output.write('<date>' + row[0] + '</date>\n')
    output.write('<max_temperature>' + row[1] + '</max_temperature>\n')
    output.write('</observation>\n')

print('</weather_data>')
output.write('</weather_data>\n')

# Close the file so everything is flushed to disk
output.close()
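As an alternative to concatenating tag strings, the same XML structure can be built with the standard xml.etree.ElementTree module, which also escapes special characters. The sketch below is not the book's approach. The helper name rows_to_xml is hypothetical and the sample rows are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Build a <weather_data> element from (date, temperature) rows."""
    root = ET.Element("weather_data")
    for date, temp in rows:
        obs = ET.SubElement(root, "observation")
        ET.SubElement(obs, "date").text = date.strip()
        ET.SubElement(obs, "max_temperature").text = temp.strip()
    return root


sample = io.StringIO("20120101, -3\n20120102, 4\n")  # made-up values
root = rows_to_xml(csv.reader(sample))
print(ET.tostring(root, encoding="unicode"))
```

To save the result, ET.ElementTree(root).write("wunder-data.xml", encoding="utf-8", xml_declaration=True) writes the tree to a file.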
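(5) csv-to-json.py: converts the CSV file to JSON
- Notes: the original script reads the data with the csv module and outputs it with print, but it relies on a hard-coded count of 365 records to detect the end of the file, which is too rigid. This version does not use the csv module. It uses plain file operations instead and applies some simple formatting.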
# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: csv-to-json.py
python: v3.3
description: Convert CSV file to json format, also write to file.
             use file module operation instead of csv module
"""
__author__ = 'hillfree'

lines = open('wunder-data.txt', 'r').readlines()
output = open("wunder-data.json", 'w')

output.write('{"observations": [\n')

total = len(lines)  # renamed from max to avoid shadowing the built-in
count = 0
for line in lines:
    count += 1
    row = line.split(',')
    # strip() drops the leading space and trailing newline around the temperature
    record = '\t{\n\t\t"date": "%s", \n\t\t"temperature": %s\n' % (row[0], row[1].strip())

    # The last record closes the array and the object instead of adding a comma
    if count < total:
        record += '\t},\n'
    else:
        record += '\t}]\n}'

    output.write(record)

output.close()
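The end-of-file problem described in the notes goes away if the records are collected in a list and serialized with the standard json module, which places the commas and brackets itself. This is a minimal sketch under the assumption that every temperature is an integer. The helper name rows_to_json is hypothetical and the sample rows are made up.

```python
import csv
import io
import json


def rows_to_json(rows):
    """Serialize (date, temperature) rows as {"observations": [...]}."""
    observations = [
        {"date": date.strip(), "temperature": int(temp)}
        for date, temp in rows
    ]
    return json.dumps({"observations": observations}, indent=4)


sample = io.StringIO("20120101, -3\n20120102, 4\n")  # made-up values
text = rows_to_json(csv.reader(sample))
print(text)
```

No record count is needed, so the same code handles a file with 365 rows, 366 rows, or a single month.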
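(6) xml-to-csv.py: converts an XML file to CSV
- Notes: the original script processes the XML with BeautifulStoneSoup from the BeautifulSoup module. Under Python 3.3, however, the corresponding lxml could not be used (?). I therefore fell back on the xml.dom.minidom module that ships with Python 3.3. Using it for the first time felt awkward, especially the handling of child nodes, and it took many checks to pair each date with its maximum temperature. Suggestions for a better approach are welcome.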
# coding: utf-8

"""
source: <Visualize This> by Nathan Yau
name: xml-to-csv.py
python: v3.3
description: Convert xml file to csv format and print it.
             lxml could not be used under Python 3.3, so this uses the minidom
             module bundled with Python. It feels awkward to use; how could it be improved?
"""
__author__ = 'hillfree'

from xml.dom.minidom import parse, Node

xml_file = parse("wunder-data.xml")

date = ""
max_temperature = ""
for node1 in xml_file.getElementsByTagName("observation"):
    # node2 is a child of <observation>: <date>, <max_temperature>, or whitespace text
    for node2 in node1.childNodes:

        # node3 is the text node that holds the actual value
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "date":
                if node3.nodeValue != "":
                    date = node3.nodeValue
            if node3.nodeType == Node.TEXT_NODE and node2.nodeName == "max_temperature":
                if node3.nodeValue != "":
                    max_temperature = node3.nodeValue.lstrip()

    # One CSV line per observation
    print(date + "," + max_temperature)
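On the question of a less awkward approach: one option is xml.etree.ElementTree, which is also part of the standard library and lets each child element be looked up by name without walking text nodes. The sketch below assumes the XML layout produced by csv-to-xml.py. The helper name xml_to_csv_lines is hypothetical and the sample data is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the layout written by csv-to-xml.py.
SAMPLE = """<weather_data>
<observation>
<date>20120101</date>
<max_temperature> -3</max_temperature>
</observation>
<observation>
<date>20120102</date>
<max_temperature> 4</max_temperature>
</observation>
</weather_data>"""


def xml_to_csv_lines(root):
    """Return one 'date,max_temperature' string per <observation>."""
    lines = []
    for obs in root.iter("observation"):
        # findtext returns the text of the named child, or the default if absent.
        date = obs.findtext("date", default="").strip()
        temp = obs.findtext("max_temperature", default="").strip()
        lines.append(date + "," + temp)
    return lines


root = ET.fromstring(SAMPLE)  # for a file: ET.parse("wunder-data.xml").getroot()
for line in xml_to_csv_lines(root):
    print(line)
```

For the sample above this prints 20120101,-3 and 20120102,4, with no node-type checks required.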