文章目录
1.读取文件
1.1 以指定分隔符的方式读取文件
read_table(sep=)
In [3]: pd.read_table('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.csv',sep=',')
Out[3]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1.2 忽略表头
read_csv()读取csv文件是,默认第一行为表头,索引列默认为整数索引
header=None:不要将第一行作为表头
In [4]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',header=None)
Out[4]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1.3 自定义表头
names=[]
In [5]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
...: e'])
Out[5]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
1.4 指定特定列为索引列
index_col = 列的序号或者列名
In [7]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
...: e'],index_col=4)
Out[7]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
In [8]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.csv',names=['a','b','c','d','messag
...: e'],index_col='message')
Out[8]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
index_col可以传入一个列名的列表
In [10]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/csv_mindex.csv',index_col=['key1','key
...: 2'])
Out[10]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
1.5 读取文件时,跳过指定行
skiprows=[]
In [12]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex4.csv',skiprows=[0,2,3])
Out[12]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [18]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\ex4.csv#预览文件内容
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
1.6 自定义缺失值标识
na_values=[]或字典
In [19]: result = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=['NULL'])
In [20]: result
Out[20]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
In [21]: sentinels = {'message':['foo','NA'],'something':['two']}
In [22]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv',na_values=sentinels)
Out[22]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN
1.7 设置pandas的显示的行数
pd.options.display.max_rows = 10#若DataFrame的行数超过10行,则只会显示前五行和后五行,中间用省略号代替
1.8 只读一个文件的前几行
In [23]: pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',nrows=5)
Out[23]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
1.9 分块读文件
In [24]: chunker = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex6.csv',chunksize=1000)
In [25]: chunker
Out[25]: <pandas.io.parsers.TextFileReader at 0x183e5e4eeb8>
In [26]: tot = pd.Series([])
In [27]: for piece in chunker:
...: tot = tot.add(piece['key'].value_counts(),fill_value=0)
...:
In [28]: tot = tot.sort_values(ascending=False)
In [32]: tot.head(5).append(tot.tail(5))
Out[32]:
E 304.0
X 287.0
M 281.0
P 273.0
L 271.0
5 128.0
2 127.0
0 123.0
9 120.0
1 117.0
dtype: float64
2. 写文件
df.to_csv()默认会将行和列的标签都写入进文件
2.1 忽略行标签和列标签
to_csv(index=False,header=False)
In [33]: data = pd.read_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex5.csv')
In [34]: data
Out[34]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
In [35]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index = False,header=False)
In [36]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
2.1 仅写入dataframe的子集
In [37]: data.to_csv('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/out.csv',index=False,columns=['a','b',
...: 'c'])
In [38]: !type F:\\python程序文件\\jupyter\\pydata-book-2nd-edition\\examples\\out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0
Series也有to_csv()方法
3. 读取JSON数据
JSON(JavaScript Object Notation)已经成为web浏览器和其他应用间通过HTTP请求发送数据的标准格式。它是一种比CSV等表格文本形式更为自由的数据形式。字典字符串
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
- 不允许列表末尾出现逗号
- 基本类型是对象(字典)、数组(列表)、字符串、数字、布尔、空值
- 所有键必须是字符串
3.1 解析JSON数据
In [3]: import json
In [4]: obj = """^M
...: {"name": "Wes",^M
...: "places_lived": ["United States", "Spain", "Germany"],^M
...: "pet": null,^M
...: "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},^M
...: {"name": "Katie", "age": 38,^M
...: "pets": ["Sixes", "Stache", "Cisco"]}]^M
...: }^M
...: """
In [5]: result = json.loads(obj)
In [6]: result
Out[6]:
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
3.2 生成JSON数据
In [11]: result
Out[11]:
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
In [12]: json.dumps(result)
Out[12]: '{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
3.3 pd.read_json()
3.4 data_json.to_json()
3.3和3.4不常用,容易出问题
4. 读取html文件
pd.read_html(url),会将html文件中所有的table解析成DataFrame并以列表的形式返回
5. 正则表达式
‘.’不能匹配\n
6. 读写二进格式
- pd.read_pickle()
- pd.to_pickle()
pickle仅被推荐作为短期的存储格式。问题在pickle很难确保格式的长期有效;一个今天被pickle化的对象可能明天因为库的新版本而无法反序列化。
7. 读写HDF5
- HDF5用于存储大量的科学数组数据
- HDF代表分层数据格式
- HDF5文件可以存储多个数据集并且支持元数据(元数据:描述数据的数据)
8.读写Excel文件
- pd.ExcelFile()
- pd.read_excel()
上面两个函数都是基于附加包xlrd和openpyxl来分别读取xls和xlsx文件的
当读取含有多个表的文件,生成ExcelFile更快
#方法一:通过pd.ExcelFile()来读取文件
In [14]: pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')
Out[14]: <pandas.io.excel._base.ExcelFile at 0x2ae34f4b550>
In [15]: xlsx = pd.ExcelFile('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx')
In [16]: pd.read_excel(xlsx)
Out[16]:
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo
#方法二:直接通过pd.read_excel()来读取文件
In [17]: pd.read_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex1.xlsx','Sheet1')#此处的Sheet1必须首字母大写,否则报错
Out[17]:
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo
- pd.ExcelWriter()
- to_excel()
In [20]: data
Out[20]:
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo
#方法一:通过pd.ExcelWriter()来写入文件
In [21]: writer = pd.ExcelWriter('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx')
In [22]: data.to_excel(writer,'Sheet1')
#方法二:通过pd.to_excel()来写入文件
In [23]: data.to_excel('F:/python程序文件/jupyter/pydata-book-2nd-edition/examples/ex2.xlsx','Sheet1')
10. 爬虫
11.与数据库交互
import pymssql
connect = pymssql.connect('localhost', '用户名', '密码', '数据库名称')
if connect:
print("连接成功!")
cursor = connect.cursor()
cursor.execute("create table C_test02(id varchar(20))")
connect.commit()
cursor.close()
connect.close()