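Overview
Data source: https://www.kaggle.com/fivethirtyeight/2016-election-polls
Downloading the data requires registering and logging in, which is a hassle, so for convenience I exported the table needed for this analysis.
Link: https://pan.baidu.com/s/1IasBj6DcqXvFkJox4Zg2VQ?pwd=7ctn
Extraction code: 7ctn
Call signature of np.loadtxt, used here to read the CSV file: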
loadtxt(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0,
        usecols=None, unpack=False, ndmin=0, encoding='bytes')
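Main parameters:
Parameter | Description |
---|---|
fname | Name of the CSV file to read |
dtype | Data type of the result; defaults to float |
comments | Character(s) marking the start of a comment; defaults to '#' |
delimiter | Field separator; defaults to whitespace |
converters | Functions that convert the values of a column |
skiprows | Number of leading lines to skip; defaults to 0 and must be an int |
usecols | Which columns to read; 0 is the first column |
unpack | If True, the result is transposed so each column can be unpacked into its own variable |
ndmin | Minimum number of dimensions of the returned array |
encoding | Encoding used to decode the file |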
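As a minimal sketch of how delimiter, skiprows, usecols, and unpack work together, the example below reads a small in-memory CSV. The column names and values are made up for illustration and are not taken from presidential_polls.csv.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a real file on disk.
sample = io.StringIO(
    "day,clinton,trump\n"
    "1,42.5,40.0\n"
    "2,43.0,44.5\n"
)

# skiprows=1 drops the header line, usecols picks the 2nd and 3rd columns,
# and unpack=True transposes the result so each column becomes its own array.
clinton, trump = np.loadtxt(
    sample, delimiter=",", skiprows=1, usecols=(1, 2), unpack=True
)

print(clinton)
print(trump)
```

Because dtype is left at its default, both arrays come back as float64. The tutorial instead passes dtype=str, since the date column cannot be parsed as a number.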
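Task
Using what we have learned about NumPy, compute statistics on the 2016 US election poll data: total the poll figures for Clinton and Trump (the adjpoll_clinton and adjpoll_trump columns) for each month from 2015-11 to 2016-08 and print the results.
1: Import modules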
import datetime as dt
import pandas as pd
import numpy as np
import csv
2: Load the data
Load the three columns we need: the date, Clinton's poll figures, and Trump's poll figures.
data = np.loadtxt('presidential_polls.csv',dtype=str,usecols=(7,17,18),delimiter=',')
Inspect the data
print(data)
[['enddate' 'adjpoll_clinton' 'adjpoll_trump']
['10/31/2016' '42.6414' '40.86509']
['10/30/2016' '43.29659' '44.72984']
...
['9/22/2016' '45.9713' '39.97518']
['6/21/2016' '45.2939' '46.66175']
['8/18/2016' '31.62721' '44.65947']]
3: Process the data
Convert the data to a list and drop the header row
data_poll = data.tolist()[1:]
Inspect the data
# Print the rows three per line
i = 0
for x in data_poll:
    i += 1
    print(x, end=' ')
    if i % 3 == 0:
        print()
['10/31/2016', '42.6414', '40.86509'] ['10/30/2016', '43.29659', '44.72984'] ['10/30/2016', '46.29779', '40.72604']
['10/24/2016', '46.35931', '45.30585'] ['10/25/2016', '45.32744', '42.20888'] ['10/25/2016', '44.6508', '42.26663']
['10/31/2016', '46.21834', '43.56017'] ['10/30/2016', '46.89049', '43.50333'] ['10/27/2016', '41.22576', '37.24948']
['10/31/2016', '42.21983', '41.6954'] ['10/31/2016', '44.53217', '43.84845'] ['10/27/2016', '41.81832', '47.92262']
['10/23/2016', '55.68839', '29.50605'] ['10/30/2016', '43.31551', '40.34972'] ['10/26/2016', '45.20793', '42.01937']
['10/26/2016', '43.19458', '45.07725'] ['10/24/2016', '50.18283', '39.33826'] ['10/24/2016', '42.67789', '46.11255']
['10/28/2016', '47.77047', '39.80679'] ['10/25/2016', '45.74354', '41.34735'] ['10/17/2016', '46.84417', '39.99571']
['10/28/2016', '38.51061', '50.7572'] ['10/28/2016', '41.75385', '38.87231'] ['10/26/2016', '45.63602', '41.55637']
['10/28/2016', '41.76', '43.84806'] ['10/26/2016', '45.78602', '45.0337'] ['10/28/2016', '47.77576', '44.78595']
['10/31/2016', '44.50363', '44.1804'] ['10/30/2016', '45.66489', '40.41809'] ['10/25/2016', '38.42823', '49.47709']
['10/24/2016', '43.04313', '38.24964'] ['10/30/2016', '44.7114', '41.14791'] ['10/30/2016', '42.7114', '46.14791']
['10/30/2016', '44.7114', '45.14791'] ['10/30/2016', '45.7114', '40.14791'] ['10/30/2016', '46.38828', '44.13978']
['10/23/2016', '45.74498', '41.64333'] ['10/24/2016', '43.73338', '39.62985'] ['10/24/2016', '45.73579', '46.35058']
['10/30/2016', '46.7114', '41.14791'] ['10/26/2016', '44.08772', '44.58124'] ['10/30/2016', '48.63733', '43.2056']
['10/26/2016', '47.3517', '39.01773'] ['10/18/2016', '47.20443', '41.15833'] ['10/22/2016', '42.63636', '46.87188']
['10/24/2016', '46.95812', '39.36292'] ['10/25/2016', '43.94143', '42.24168'] ['10/30/2016', '43.7114', '46.14791']
['10/30/2016', '45.63095', '40.2258'] ['10/31/2016', '42.73449', '45.10929'] ['10/24/2016', '41.09849', '44.92952']
['10/30/2016', '45.78382', '40.53563'] ['10/29/2016', '39.79761', '50.76878'] ['10/30/2016', '48.5308', '39.71922']
......
['6/12/2016', '46.06344', '38.65057'] ['10/19/2016', '31.53417', '29.49314'] ['11/15/2015', '47.57453', '37.87221']
['2/21/2016', '46.96003', '39.42957'] ['10/13/2016', '38.10209', '53.95455'] ['7/19/2016', '50.60115', '33.0715']
['7/11/2016', '43.29751', '41.88533'] ['8/16/2016', '29.94538', '36.82408'] ['9/22/2016', '30.45553', '47.80848']
['8/10/2016', '42.62525', '42.01089'] ['1/7/2016', '42.07473', '45.06726'] ['8/4/2016', '26.74404', '40.16534']
['7/11/2016', '40.33774', '41.5603'] ['10/13/2016', '37.30964', '54.76821'] ['10/6/2016', '49.13094', '39.41588']
['9/22/2016', '45.9713', '39.97518'] ['6/21/2016', '45.2939', '46.66175'] ['8/18/2016', '31.62721', '44.65947']
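1: Date handling
Convert the dates from mm/dd/yyyy to yyyy-mm format:
Key points:
(1) %y two-digit year (00 - 99)
(2) %Y four-digit year (0001 - 9999)
(3) %m month (01 - 12)
(4) %d day of the month (01 - 31)
1: Use a list comprehension to extract the dates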
date = [i[0] for i in data_poll]
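2: Parse each date string into a datetime, reading the month, day, and year with %m, %d, and %Y
date1 = [dt.datetime.strptime(d, '%m/%d/%Y') for d in date]
3: Store the year and month of each date in yyyy-mm format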
date2 = [i.strftime('%Y-%m') for i in date1]
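The date steps above can be sketched end to end on a few values copied from the printed rows. The variable names raw_dates, parsed, and months are chosen here for illustration only.

```python
import datetime as dt

# Three date strings taken from the sample rows shown earlier.
raw_dates = ["10/31/2016", "6/21/2016", "11/15/2015"]

# strptime turns each mm/dd/yyyy string into a datetime object;
# single-digit months such as "6" are accepted by %m.
parsed = [dt.datetime.strptime(s, "%m/%d/%Y") for s in raw_dates]

# strftime formats each datetime back to a zero-padded yyyy-mm string.
months = [d.strftime("%Y-%m") for d in parsed]
print(months)

# %y, by contrast, keeps only the last two digits of the year.
short_year = parsed[0].strftime("%y")
print(short_year)
```

Zero-padding the month matters later: strings such as 2016-06 and 2016-10 then sort in chronological order.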
2: Poll data handling
1: Process Clinton's poll data: loop over the values once, set empty values to zero, then convert the data to floating point
Clinton_poll = [i[1] for i in data_poll]
for i in range(len(Clinton_poll)):
    if Clinton_poll[i] == '':
        Clinton_poll[i] = '0'
Clinton_poll_arr = np.array(Clinton_poll, dtype=np.float64)
2: Process Trump's poll data in the same way: loop over the values once, set empty values to zero, then convert the data to floating point
Trump_poll = [i[2] for i in data_poll]
for i in range(len(Trump_poll)):
    if Trump_poll[i] == '':
        Trump_poll[i] = '0'
Trump_poll_arr = np.array(Trump_poll, dtype=np.float64)
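The same clean-up can be done without an explicit loop. The sketch below is an alternative to the loop, not the method used in this write-up, and the array raw holds made-up sample strings with one missing value.

```python
import numpy as np

# Hypothetical string column with one empty entry.
raw = np.array(["42.6414", "", "45.2939"])

# np.where replaces every empty string with "0" in a single vectorized step.
filled = np.where(raw == "", "0", raw)

# The cleaned strings can then be converted to floats all at once.
values = filled.astype(np.float64)
print(values)
```

Treating a missing value as zero leaves a monthly sum unchanged, but it would pull down a mean, so this choice should be revisited if averages are needed.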
3: Use a DataFrame to combine the dates, Clinton's poll data, and Trump's poll data into one two-dimensional table
my_data = pd.DataFrame({'Date':date2, 'Clinton': Clinton_poll_arr, 'Trump':Trump_poll_arr},
columns=['Date','Clinton','Trump'])
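4: Sum Clinton's data for each month and print the poll data from 2015-11 to 2016-08
sum_Clinton = my_data['Clinton'].groupby(my_data['Date']).sum()
print('Clinton poll data from 2015-11 to 2016-08:')
# The groups are sorted by date, so the first 10 are 2015-11 through 2016-08
i = 0
for key_value in sum_Clinton.items():
    i += 1
    print(key_value)
    if i == 10:
        break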
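Stopping after ten groups works only when 2015-11 is the earliest month in the file. As a hedged alternative, the months can be selected by label instead; the small DataFrame below uses made-up numbers purely to show the slicing.

```python
import pandas as pd

# Hypothetical data: two polls in 2015-11, one in 2016-08, one in 2016-09.
frame = pd.DataFrame({
    "Date": ["2015-11", "2015-11", "2016-08", "2016-09"],
    "Clinton": [40.0, 42.0, 45.0, 50.0],
})

# groupby sorts the yyyy-mm keys, which is also chronological order.
monthly = frame.groupby("Date")["Clinton"].sum()

# Label slicing with .loc includes both endpoints, so 2016-09 is left out.
selected = monthly.loc["2015-11":"2016-08"]
for month, total in selected.items():
    print(month, total)
```

This states the requested date range directly in the code rather than relying on a counter.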
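5: Sum Trump's data for each month and print the poll data from 2015-11 to 2016-08
sum_Trump = my_data['Trump'].groupby(my_data['Date']).sum()
print('Trump poll data from 2015-11 to 2016-08:')
# The groups are sorted by date, so the first 10 are 2015-11 through 2016-08
i = 0
for key_value in sum_Trump.items():
    i += 1
    print(key_value)
    if i == 10:
        break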
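Since the task asks for NumPy, the monthly totals can also be computed without pandas. The sketch below shows one possible approach using np.unique and np.bincount; the arrays contain made-up values rather than the real poll data.

```python
import numpy as np

# Hypothetical month labels and poll figures, one entry per poll.
months = np.array(["2016-08", "2015-11", "2016-08", "2015-12"])
clinton = np.array([45.0, 40.0, 44.0, 41.5])
trump = np.array([43.0, 41.0, 42.0, 39.5])

# np.unique returns the sorted distinct months plus, for every poll,
# the position of its month within that sorted list.
labels, inverse = np.unique(months, return_inverse=True)

# np.bincount with weights adds up the poll figures that share a position.
clinton_sum = np.bincount(inverse, weights=clinton)
trump_sum = np.bincount(inverse, weights=trump)

for label, c, t in zip(labels, clinton_sum, trump_sum):
    print(label, c, t)
```

The result matches what a groupby sum would give for the same input, with the months in sorted order.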
Complete code
import numpy as np
import pandas as pd
import datetime as dt
import csv

# Load the date, Clinton, and Trump columns as strings
data = np.loadtxt('presidential_polls.csv',dtype=str,usecols=(7,17,18),delimiter=',')
# Drop the header row
data_poll = data.tolist()[1:]

# Convert the dates from mm/dd/yyyy to yyyy-mm
date = [i[0] for i in data_poll]
date1 = [dt.datetime.strptime(d, '%m/%d/%Y') for d in date]
date2 = [i.strftime('%Y-%m') for i in date1]

# Replace empty values with zero, then convert to float
Clinton_poll = [i[1] for i in data_poll]
for i in range(len(Clinton_poll)):
    if Clinton_poll[i] == '':
        Clinton_poll[i] = '0'
Clinton_poll_arr = np.array(Clinton_poll, dtype=np.float64)

Trump_poll = [i[2] for i in data_poll]
for i in range(len(Trump_poll)):
    if Trump_poll[i] == '':
        Trump_poll[i] = '0'
Trump_poll_arr = np.array(Trump_poll, dtype=np.float64)

my_data = pd.DataFrame({'Date':date2, 'Clinton': Clinton_poll_arr, 'Trump':Trump_poll_arr},
                       columns=['Date','Clinton','Trump'])

# Monthly totals; the groups are sorted by date, so the first 10 are 2015-11 through 2016-08
sum_Clinton = my_data['Clinton'].groupby(my_data['Date']).sum()
print('Clinton poll data from 2015-11 to 2016-08:')
i = 0
for key_value in sum_Clinton.items():
    i += 1
    print(key_value)
    if i == 10:
        break
print('------------------------------------------')
sum_Trump = my_data['Trump'].groupby(my_data['Date']).sum()
print('Trump poll data from 2015-11 to 2016-08:')
i = 0
for key_value in sum_Trump.items():
    i += 1
    print(key_value)
    if i == 10:
        break
Clinton poll data from 2015-11 to 2016-08:
('2015-11', 1916.6980600000002)
('2015-12', 4637.256880000004)
('2016-01', 6585.16702)
('2016-02', 7946.2286100000065)
('2016-03', 11156.098239999998)
('2016-04', 11579.426779999998)
('2016-05', 12242.275380000008)
('2016-06', 19771.335760000005)
('2016-07', 23233.11167999999)
('2016-08', 67909.28209999984)
------------------------------------------
Trump poll data from 2015-11 to 2016-08:
('2015-11', 1937.3290100000002)
('2015-12', 4088.921899999999)
('2016-01', 6253.249349999999)
('2016-02', 7672.339800000001)
('2016-03', 9991.593580000008)
('2016-04', 9884.156190000002)
('2016-05', 12069.761289999995)
('2016-06', 18154.906229999993)
('2016-07', 22757.07327000001)
('2016-08', 66428.29714000005)
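Summary
The exercise did not go entirely smoothly. Getting the data from the website in particular took many attempts before it worked. As the poet Lu You wrote:
"Where hills and streams seem to leave no road ahead, past shady willows and bright flowers another village appears."
The rest of the process went fairly smoothly.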