In data analysis we constantly need to move data between disk and our programs; Python, NumPy, and pandas all provide functions for reading and writing data.
pandas parsing functions:
read_csv | Load delimited data from a file, URL, or file-like object. Use comma as default delimiter |
read_table | Load delimited data from a file, URL, or file-like object. Use tab ('\t') as default delimiter |
read_fwf | Read data in fixed-width column format (that is, no delimiters) |
read_clipboard | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages |
read_csv() reads data from disk or a buffer. Its parameters are as follows:
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
By default it opens a CSV file delimited by ',':
In [69]: !type D:\workspaces\datatest\ex1.csv   # on Linux use !cat
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Read the data into a DataFrame object:
In [70]: df = pd.read_csv('D:\workspaces\datatest\ex1.csv')  # read_table works too, but you must specify the delimiter
# df = pd.read_table('D:\workspaces\datatest\ex1.csv',sep=',')
In [71]: df
Out[71]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In practice, not every file has a header row:
In [72]: !type D:\workspaces\datatest\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In that case there are two ways to import the file: 1) let pandas assign default column names, or 2) specify the names yourself.
In [76]: df = pd.read_csv('D:\workspaces\datatest\ex2.csv',header=None)  # remember to pass header=None
In [77]: df
Out[77]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
In [78]: names = ['a','b','c','d','message']
In [79]: df = pd.read_csv('D:\workspaces\datatest\ex2.csv',names=names)
In [80]: df
Out[80]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Make the 'message' column the index of the DataFrame:
In [81]: df = pd.read_csv('D:\workspaces\datatest\ex2.csv',names=names,index_col='message')
In [82]: df
Out[82]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
You can also turn multiple columns into a hierarchical index:
In [83]: !type D:\workspaces\datatest\ex3.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [85]: df = pd.read_csv('D:\workspaces\datatest\ex3.csv',index_col=['key1','key2'])
In [86]: df
Out[86]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
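With the hierarchical index in place you can select whole groups by the outer level; a minimal sketch continuing from the df above:

import pandas as pd

df = pd.read_csv('D:\workspaces\datatest\ex3.csv', index_col=['key1', 'key2'])
df.loc['one']          # every row whose key1 is 'one', indexed by key2
df.loc[('two', 'c')]   # a single row as a Series: value1 13, value2 14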
If you want to skip certain rows of a file, use skiprows:
In [88]: df = pd.read_csv('D:\workspaces\datatest\ex3.csv',skiprows=[1,4])
In [89]: df
Out[89]:
  key1 key2  value1  value2
0  one    b       3       4
1  one    c       5       6
2  two    a       9      10
3  two    b      11      12
4  two    c      13      14
5  two    d      15      16
In [90]: df = pd.read_csv('D:\workspaces\datatest\ex3.csv',skiprows=[1,5])
In [91]: df
Out[91]:
  key1 key2  value1  value2
0  one    b       3       4
1  one    c       5       6
2  one    d       7       8
3  two    b      11      12
4  two    c      13      14
5  two    d      15      16
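In newer pandas versions (0.19+) skiprows also accepts a callable: it is called once per row index, and the row is skipped whenever it returns True. A sketch under that assumption:

import pandas as pd

# keep the header row (index 0) and drop every second data row
df = pd.read_csv('D:\workspaces\datatest\ex3.csv',
                 skiprows=lambda i: i > 0 and i % 2 == 0)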
Missing values: in real data, missing values are represented either by an empty string or by some sentinel value.
In [93]: !type D:\workspaces\datatest\ex1.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
In [94]: result = pd.read_csv('D:\workspaces\datatest\ex1.csv')
In [95]: result
Out[95]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
na_values accepts a sequence of strings to be treated as missing values:
In [96]: result = pd.read_csv('D:\workspaces\datatest\ex1.csv',na_values=['NULL'])
In [97]: result
Out[97]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Use a dict to specify a different set of NA sentinels for each column:
In [98]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
In [99]: result = pd.read_csv('D:\workspaces\datatest\ex1.csv',na_values=sentinels)
In [100]: result
Out[100]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
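Note that pandas always treats its built-in sentinels (such as 'NA' or an empty field) as missing in addition to the ones you supply. If only your own markers should count, keep_default_na=False (listed in the signature above) turns the defaults off; a sketch:

import pandas as pd

result = pd.read_csv('D:\workspaces\datatest\ex1.csv',
                     na_values=['NULL'], keep_default_na=False)
# the literal strings 'NA' and '' now survive as ordinary values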
More read_csv / read_table parameters:
Reading files in pieces: if the file is too large, loading it all at once would hog memory; instead you can read just a part of it, or iterate over it in chunks:
nrows: the number of rows to read;
In [122]: result = pd.read_csv('D:\workspaces\datatest\ex6.csv',nrows=5)
In [123]: result
Out[123]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
chunksize: the number of rows per chunk
In [128]: chunk = pd.read_csv('D:\workspaces\datatest\ex6.csv',chunksize=1000)
In [129]: count = pd.Series([])
In [130]: for pie in chunk:
...: count = count.add(pie['key'].value_counts(),fill_value=0)
In [131]: count   # sort with sort_index() or sort_values() if needed
Out[131]:
0 151.0
1 146.0
2 152.0
3 162.0
4 171.0
5 157.0
6 166.0
7 164.0
8 162.0
9 150.0
A 320.0
B 302.0
C 286.0
D 320.0
E 368.0
dtype: float64
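If you need finer control than fixed-size chunks, passing iterator=True returns the same TextFileReader object, from which you can pull arbitrarily sized pieces with get_chunk; a sketch:

import pandas as pd

reader = pd.read_csv('D:\workspaces\datatest\ex6.csv', iterator=True)
piece = reader.get_chunk(10)   # the next 10 rows as a DataFrame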
Writing data out to a file: to_csv() uses ',' as the delimiter by default; you can also specify your own.
to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
In [159]: data = pd.read_csv('D:\workspaces\datatest\ex1.csv')
In [160]: data
Out[160]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
In [161]: data.to_csv('D:\workspaces\datatest\out.csv')
In [162]: !type D:\workspaces\datatest\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
In [163]: data.to_csv('D:\workspaces\datatest\out.csv',sep='_')
In [164]: !type D:\workspaces\datatest\out.csv
_something_a_b_c_d_message
0_one_1_2_3.0_4_
1_two_5_6__8_world
2_three_9_10_11.0_12_foo
On output, missing values are written as empty strings; to use some other value, pass na_rep:
In [169]: data = pd.read_csv('D:\workspaces\datatest\ex1.csv')
In [170]: data
Out[170]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
In [171]: data.to_csv('D:\workspaces\datatest\out.csv',na_rep='QS')
In [172]: !type D:\workspaces\datatest\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,QS
1,two,5,6,QS,8,world
2,three,9,10,11.0,12,foo
By default both the row and column labels are written to the file; you can turn either off, and you can also write only a subset of the columns:
In [183]: data.to_csv('D:\workspaces\datatest\out.csv',index=False,columns=['a','b','c'])
In [184]: !type D:\workspaces\datatest\out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0
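For quick experiments you can also hand to_csv any file-like object, for example sys.stdout, to print the delimited text instead of writing a file; a sketch:

import sys
import pandas as pd

data = pd.read_csv('D:\workspaces\datatest\ex1.csv')
data.to_csv(sys.stdout, sep='|')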
Series also has to_csv; to read one back, use pandas.read_csv (Series.from_csv is deprecated):
In [209]: dates = pd.date_range('6/18/2018', periods=7)
In [210]: dates
Out[210]:
DatetimeIndex(['2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-23', '2018-06-24'],
              dtype='datetime64[ns]', freq='D')
In [211]: ts = pd.Series(np.arange(7),index=dates)
In [213]: ts.to_csv('D:\workspaces\datatest\out_ts.csv')
In [214]: ts.from_csv('D:\workspaces\datatest\out_ts.csv')
D:\projects\env\Lib\site-packages\pandas\core\series.py:3726: FutureWarning:
from_csv is deprecated. Please use read_csv(...) instead. Note that some of
the default arguments are different, so please refer to the documentation
for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)
Out[214]:
2018-06-18    0
2018-06-19    1
2018-06-20    2
2018-06-21    3
2018-06-22    4
2018-06-23    5
2018-06-24    6
dtype: int64
In [215]: pd.read_csv('D:\workspaces\datatest\out_ts.csv')
Out[215]:
   2018-06-18  0
0  2018-06-19  1
1  2018-06-20  2
2  2018-06-21  3
3  2018-06-22  4
4  2018-06-23  5
5  2018-06-24  6
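To get a proper Series with a DatetimeIndex back, instead of the two-column DataFrame in Out[215], point read_csv at the index column and parse the dates; a sketch using the squeeze parameter from the signature above (it collapses a one-column result into a Series):

import pandas as pd

ts = pd.read_csv('D:\workspaces\datatest\out_ts.csv',
                 header=None, index_col=0, parse_dates=True, squeeze=True)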
Handling delimiters by hand:
pandas' readers don't cover every case; sometimes you still have to process a file manually. For any file delimited by a single character, you can use Python's built-in csv module directly:
reader(...)
    csv_reader = reader(iterable [, dialect='excel']
                        [optional keyword args])
    for row in csv_reader:
        process(row)

    The "iterable" argument can be any object that returns a line
    of input for each iteration, such as a file object or a list.  The
    optional "dialect" parameter is discussed below.  The function
    also accepts optional keyword arguments which override settings
    provided by the dialect.
In [217]: !type D:\workspaces\datatest\ex1.csv
"w","f","c"
"4","5","6"
"4","3","2","1"
In [218]: import csv
In [219]: file = open('D:\workspaces\datatest\ex1.csv')
In [220]: reader = csv.reader(file)
In [221]: for line in reader:
     ...:     print(line)
     ...:
['w', 'f', 'c']
['4', '5', '6']
['4', '3', '2', '1']
In [238]: file = open('D:\workspaces\datatest\ex1.csv')
In [239]: reader = csv.reader(file)
In [240]: lines = list(reader)
In [241]: header,values = lines[0],lines[1:]
In [242]: data = {header:val for header,val in zip(header,zip(*values))}
In [243]: data
Out[243]: {'c': ('6', '2'), 'f': ('5', '3'), 'w': ('4', '4')}
CSV files come in many flavors; to define a new format, just subclass csv.Dialect:
class dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL   # required; csv raises an error if quoting is left undefined

reader = csv.reader(f, dialect=dialect)
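Alternatively, the same format options can be passed to csv.reader directly as keyword arguments, without defining a dialect class; a sketch ('data.csv' is a placeholder path):

import csv

with open('data.csv') as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    for row in reader:
        print(row)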
Options for csv.reader (dialect attributes):
delimiter | One-character string to separate fields. Defaults to ','. |
lineterminator | Line terminator for writing, defaults to '\r\n'. Reader ignores this and recognizes cross-platform line terminators. |
quotechar | Quote character for fields with special characters (like a delimiter). Default is '"'. |
quoting | Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only fields with special characters like the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting). See Python’s documentation for full details. Defaults to QUOTE_MINIMAL. |
skipinitialspace | Ignore whitespace after each delimiter. Default False. |
doublequote | How to handle quoting character inside a field. If True, it is doubled. See online documentation for full detail and behavior. |
escapechar | String to escape the delimiter if quoting is set to csv.QUOTE_NONE. Disabled by default |
Writing delimited files by hand:
with open('data.csv', 'w') as f:
    writer = csv.writer(f, dialect=dialect)
    writer.writerow(('w', 'f', 'c'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
Working with JSON data
JSON (JavaScript Object Notation) is a flexible data format, widely used today for transmitting data over networks.
Python's standard library ships with a module for handling JSON: json.
A simple piece of JSON data:
obj = """
{
    "name": "Wes",
    "places_lived": ["United States", "Spain", "Germany"],
    "pet": null,
    "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
                 {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""
json.loads converts JSON data into Python objects:
In [2]: import json
In [6]: result = json.loads(obj)
In [7]: result
Out[7]:
{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
              {'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}
json.dumps does the opposite, turning a Python object back into JSON:
In [8]: toJson = json.dumps(result)
In [9]: toJson
Out[9]: '{"pet": null, "name": "Wes", "siblings": [{"age": 25, "name": "Scott", "pet": "Zuko"}, {"age": 33, "name": "Katie", "pet": "Cisco"}], "places_lived": ["United States", "Spain", "Germany"]}'
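json.dumps also accepts an indent argument if you want pretty-printed output; a small sketch continuing from the result above:

import json

print(json.dumps(result, indent=2))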
Once JSON data has been converted to Python objects, it can be turned into a DataFrame:
In [14]: df = pd.DataFrame(result['siblings'],columns=['name','age','pet'])
In [15]: df
Out[15]:
    name  age    pet
0  Scott   25   Zuko
1  Katie   33  Cisco
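pandas also ships its own JSON helpers, DataFrame.to_json and pandas.read_json, which skip the manual json.loads step; a sketch ('siblings.json' is a hypothetical file name, and read_json assumes the default column-oriented layout written by to_json):

import pandas as pd

df.to_json('siblings.json')          # df is the DataFrame built above
df2 = pd.read_json('siblings.json')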
The data you process can also be HTML:
Here I wrote a simple crawler that collects some data from https://finance.yahoo.com/quote/AAPL/options?ltr=1. If you're interested, feel free to tweak the code and play with it yourself.
from urllib import request
from bs4 import BeautifulSoup
import pandas as pd


class ParseHtml(object):
    def __init__(self):
        self.title = []
        self.contexts = []

    def _download(self, url):
        if url is None:
            return None
        result = request.urlopen(url)
        if result.getcode() != 200:  # status code 200 means the request succeeded
            return None
        return result.read()

    def _parser(self, context):
        # parse the downloaded content with BeautifulSoup
        soup = BeautifulSoup(context, 'html.parser', from_encoding='utf-8')
        table = soup.find('table', class_='puts table-bordered W(100%) Pos(r) list-options')
        heads = table.find('tr').find_all('th')           # the header cells (all th elements)
        t_body_trs = table.find('tbody').find_all('tr')   # all data rows (every tr except the header)
        for head in heads:
            self.title.append(head.get_text())            # collect the header texts
        for bt in t_body_trs:
            tds = bt.find_all('td')                       # all cells in this row
            temp = []
            for td in tds:
                temp.append(td.get_text())                # collect the text of each cell
            self.contexts.append(temp)
        return self.title, self.contexts

    def gen_data_frame(self):
        html = self._download('https://finance.yahoo.com/quote/AAPL/options?ltr=1')
        t, c = self._parser(html)
        data_frame = pd.DataFrame(c, columns=t)           # build a pd.DataFrame from the extracted content
        return data_frame


if __name__ == '__main__':
    ph = ParseHtml()
    df = ph.gen_data_frame()
    print(df)
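For plain HTML tables, pandas' built-in read_html is often enough to replace a hand-written crawler: it scans a page (or an HTML string) for <table> elements and returns a list of DataFrames. A sketch (requires lxml, or beautifulsoup4 plus html5lib, to be installed):

import pandas as pd

tables = pd.read_html('https://finance.yahoo.com/quote/AAPL/options?ltr=1')
print(tables[0].head())   # the first table found on the page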
pandas provides a to_pickle method for saving data in binary form and a read_pickle method for reading it back into Python.
NumPy likewise provides a save method for binary data and a load method to read it back.
Under the hood, Python's binary storage relies on the built-in pickle serialization.
In [13]: frame = pd.read_csv('D:\workspaces\datatest\ex2.csv',header=None)
In [14]: frame
Out[14]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
In [15]: frame.to_pickle('D:\workspaces\datatest\_frame_pickle')
In [16]: pd.read_pickle('D:\workspaces\datatest\_frame_pickle')
Out[16]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
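The NumPy counterpart mentioned above works the same way; a minimal sketch of np.save / np.load (np.save appends the .npy extension automatically):

import numpy as np

arr = np.arange(10)
np.save('some_array', arr)        # writes some_array.npy
arr2 = np.load('some_array.npy')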
Using the HDF5 format:
HDF5 (Hierarchical Data Format) is a popular, industrial-grade C library for reading binary data on disk efficiently. Each HDF5 file contains a filesystem-like node structure that can store multiple datasets and supports metadata. HDF5 offers on-the-fly compression with a variety of compressors, so data with repeated patterns can be stored efficiently.
Python has two HDF5 interfaces: PyTables and h5py. h5py provides a direct, high-level API to HDF5, while PyTables abstracts away many of HDF5's details to provide flexible data containers, table indexing, querying, and some support for out-of-core computation.
pandas has a minimal, dict-like HDFStore class that stores pandas objects via PyTables:
If you see ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing, install PyTables: pip install tables
In [4]: store = pd.HDFStore('mydata.h5')
In [5]: frame = pd.read_csv('D:\workspaces\datatest\ex2.csv',header=None)
In [6]: store['obj1'] = frame
In [7]: store
Out[7]:
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
In [8]: store['obj1']
Out[8]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
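pandas also provides one-shot wrappers around HDFStore, DataFrame.to_hdf and pandas.read_hdf; a sketch mirroring the 'obj1' key used above:

import pandas as pd

frame.to_hdf('mydata.h5', 'obj1')       # frame from the example above
df = pd.read_hdf('mydata.h5', 'obj1')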
pandas can also handle Microsoft Excel files and data stored in databases; I won't cover those here, and may return to Python database topics in a later post.